8,901 research outputs found

    A probabilistic justification for using tf.idf term weighting in information retrieval

    This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but it is essential in the field of statistical natural language processing. Advances already made in statistical natural language processing are used in this paper to formulate a probabilistic justification for using tf.idf term weighting. The paper shows that the new probabilistic interpretation of tf.idf term weighting might lead to a better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm.
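    As a point of reference for the weighting scheme being justified, here is a minimal sketch of the classic tf.idf formula (term frequency times the log of inverse document frequency) on an invented toy corpus; the paper's own probabilistic derivation and normalisation are not reproduced here.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Compute classic tf.idf weights (tf * log(N/df)) for a toy corpus.

    Illustrates the standard weighting scheme the paper seeks to justify
    probabilistically; the paper's own derivation may normalise differently.
    """
    n_docs = len(documents)
    tokenised = [doc.lower().split() for doc in documents]

    # document frequency: number of documents containing each term
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))

    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = ["information retrieval ranks documents",
        "probabilistic models of retrieval",
        "statistical natural language processing"]
for w in tfidf_weights(docs):
    print(w)
```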

    From Frequency to Meaning: Vector Space Models of Semantics

    Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
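    To make the first of the three matrix classes concrete, the following sketch builds a raw-count term-document matrix over an invented toy corpus and compares documents by cosine similarity of their column vectors; the VSMs surveyed in the paper would typically apply tf.idf or other weighting and smoothing on top of such counts.

```python
import numpy as np

def term_document_matrix(documents):
    """Build a raw-count term-document matrix: rows are terms, columns are documents."""
    vocab = sorted({t for doc in documents for t in doc.lower().split()})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(documents):
        for term in doc.lower().split():
            matrix[index[term], j] += 1
    return vocab, matrix

def cosine(u, v):
    """Cosine similarity between two column vectors of the matrix."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

docs = ["vector space models of semantics",
        "semantics of human language",
        "search engines rank documents"]
vocab, M = term_document_matrix(docs)
# documents sharing terms get higher cosine similarity
print(cosine(M[:, 0], M[:, 1]), cosine(M[:, 0], M[:, 2]))
```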

    Structural Regularities in Text-based Entity Vector Spaces

    Entity retrieval is the task of finding entities such as people or products in response to a query, based solely on the textual documents they are associated with. Recent semantic entity retrieval algorithms represent queries and experts in finite-dimensional vector spaces, where both are constructed from text sequences. We investigate entity vector spaces and the degree to which they capture structural regularities. Such vector spaces are constructed in an unsupervised manner without explicit information about structural aspects. For concreteness, we address these questions for a specific type of entity: experts in the context of expert finding. We discover how clusterings of experts correspond to committees in organizations, the ability of expert representations to encode the co-author graph, and the degree to which they encode academic rank. We compare latent, continuous representations created using methods based on distributional semantics (LSI), topic models (LDA) and neural networks (word2vec, doc2vec, SERT). Vector spaces created using neural methods, such as doc2vec and SERT, systematically perform better at clustering than LSI, LDA and word2vec. When it comes to encoding entity relations, SERT performs best. (Comment: ICTIR 2017, Proceedings of the 3rd ACM International Conference on the Theory of Information Retrieval.)
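    One way to picture the clustering evaluation described above is to cluster pre-trained expert vectors and compare the result with known committee memberships. The sketch below substitutes randomly generated vectors and labels for the paper's SERT/doc2vec representations and organisational data, and uses k-means with the adjusted Rand index as one plausible choice of clustering algorithm and agreement measure; none of these specifics are taken from the paper itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Hypothetical pre-trained entity (expert) vectors; in the paper these would
# come from LSI, LDA, word2vec, doc2vec or SERT representations.
expert_vectors = rng.normal(size=(20, 50))

# Hypothetical ground-truth committee memberships for the same experts.
committee_labels = rng.integers(0, 4, size=20)

# Cluster the embeddings and measure agreement with the committees,
# one way to quantify how much structure the vector space captures.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(expert_vectors)
print("adjusted Rand index:", adjusted_rand_score(committee_labels, clusters))
```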

    MODELS AND ALGORITHMS FOR OPTIMIZING LEGAL INFORMATION RETRIEVAL IN THE CORPORATE NETWORK OF ACADEMIC LIBRARIES

    With the rapid growth of information on the global network, the challenge of finding information quickly and easily within a narrow field of study or specialization is increasing. People constantly look for information in some form throughout their lives, a result of the human striving for innovation and the effort to improve personal and professional competencies. One of the main objectives of libraries is to meet people’s needs for information; in short, this process can be called informational support. The main purpose of this research is to develop models and algorithms that optimize the search for legal information in corporate library networks. Electronic libraries in the field of jurisprudence serve not only to train personnel in that field, but also to increase legal literacy in society, to make citizens aware of their rights and obligations, and to prevent them from becoming victims of various frauds. For organizations, such libraries are the most important repository of knowledge through which employees can keep their legal knowledge up to date and draw up normative-legal documents, contracts and agreements within the framework of legal requirements. Although jurisprudence is one of the most important areas of activity, the provision of scientific information to this field is not sufficiently systematized: different organizations and institutions store their legal literature however they choose, and there is no single mechanism for making it available to users, digitizing it, classifying it, and searching it. Most library users rate the efficiency of a library by the availability of the necessary literature. A survey of law students and professors was conducted to examine library users’ interest in legal electronic literature and its use. More than 50% of respondents use the electronic library daily, 93% look for legal literature, and 50% of participants said it is difficult to find legal literature. All respondents (100%) confirmed the need to create a single corporate network by pooling the electronic resources of higher education institutions that provide legal training.

    Modeling Meaning Associated with Documental Entities: Introducing the Brussels Quantum Approach

    We show that the Brussels operational-realistic approach to quantum physics and quantum cognition offers a fundamental strategy for modeling the meaning associated with collections of documental entities. To do so, we take the World Wide Web as a paradigmatic example and emphasize the importance of distinguishing the Web, made of printed documents, from a more abstract meaning entity, which we call the Quantum Web, or QWeb, where the former is considered to be the collection of traces that can be left by the latter, in specific measurements, similarly to how a non-spatial quantum entity, like an electron, can leave localized traces of impact on a detection screen. The double-slit experiment is extensively used to illustrate the rationale of the modeling, which is guided by how physicists constructed quantum theory to describe the behavior of microscopic entities. We also emphasize that the superposition principle and the associated interference effects are not sufficient to model all experimental probabilistic data, such as those obtained by counting the relative number of documents containing certain words and co-occurrences of words. For this, additional effects, like context effects, must also be taken into consideration. (Comment: 27 pages, 6 figures, LaTeX.)
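    For readers unfamiliar with the double-slit rationale invoked above, the standard two-slit detection probability (written here with generic, equally weighted wave functions ψ_A and ψ_B, not notation taken from the paper) shows the interference term that goes beyond a classical average of the two one-slit distributions:

```latex
% Standard double-slit probability: the average of the one-slit
% distributions p_A, p_B plus an interference (cross) term.
\[
  p(x) \;=\; \tfrac{1}{2}\,\bigl|\psi_A(x) + \psi_B(x)\bigr|^{2}
        \;=\; \tfrac{1}{2}\bigl(p_A(x) + p_B(x)\bigr)
              \;+\; \operatorname{Re}\!\bigl(\psi_A^{*}(x)\,\psi_B(x)\bigr),
  \qquad p_{A,B}(x) = \bigl|\psi_{A,B}(x)\bigr|^{2}.
\]
```

    The abstract's point is that even this interference structure is not enough to fit all word-count and co-occurrence statistics, which is why additional context effects are invoked.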

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines. From a socio-economic perspective we take stock of the impact and legal consequences of these technical advances and point out future directions of research.

    Human-Level Performance on Word Analogy Questions by Latent Relational Analysis

    This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, machine translation, and information retrieval. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason/stone is analogous to the pair carpenter/wood; the relations between mason and stone are highly similar to the relations between carpenter and wood. Past work on semantic similarity measures has mainly been concerned with attributional similarity. For instance, Latent Semantic Analysis (LSA) can measure the degree of similarity between two words, but not between two relations. Recently, the Vector Space Model (VSM) of information retrieval has been adapted to the task of measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus (they are not predefined), (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data (it is also used this way in LSA), and (3) automatically generated synonyms are used to explore reformulations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying noun-modifier relations, LRA achieves similar gains over the VSM, while using a smaller corpus.
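    The VSM baseline that LRA extends represents each word pair by a vector of pattern frequencies and scores analogy candidates by cosine similarity. The sketch below shows those mechanics on invented pattern names and counts; the real approach mines the patterns and frequencies from a large corpus, and LRA additionally derives patterns automatically, applies SVD smoothing, and explores synonym reformulations.

```python
import math

# Hypothetical pattern-frequency vectors for word pairs. In the VSM baseline
# each dimension is the corpus frequency of a joining pattern such as
# "X cuts Y"; the counts below are invented for illustration only.
mason_stone    = {"X cuts Y": 12, "X works with Y": 30, "X builds with Y": 9}
carpenter_wood = {"X cuts Y": 15, "X works with Y": 28, "X builds with Y": 11}
traffic_street = {"X flows on Y": 22, "X works with Y": 1}

def cosine(u, v):
    """Cosine similarity between two sparse pattern-frequency vectors."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# Analogous pairs share relational patterns, so their vectors lie close together.
print(cosine(mason_stone, carpenter_wood))   # high relational similarity
print(cosine(mason_stone, traffic_street))   # low relational similarity
```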

    Information Retrieval Performance Enhancement Using The Average Standard Estimator And The Multi-criteria Decision Weighted Set

    Information retrieval from large document collections is much more challenging than traditional small-collection retrieval. The main difference is the importance of correlations between related concepts in complex data structures, which have been studied by several information retrieval systems. This research began with a comprehensive review and comparison of several techniques of matrix dimensionality estimation and their respective effects on enhancing retrieval performance using singular value decomposition and latent semantic analysis. Two novel techniques are introduced to enhance intrinsic dimensionality estimation: the Multi-criteria Decision Weighted model, which estimates matrix intrinsic dimensionality for large document collections, and the Average Standard Estimator (ASE), which estimates data intrinsic dimensionality based on the singular value decomposition (SVD). ASE estimates the level of significance of the singular values resulting from the decomposition, assuming that variables with deep relations have sufficient correlation and that only relationships with high singular values are significant and should be maintained. Experimental results over all possible dimensions indicated that ASE improved matrix intrinsic dimensionality estimation by accounting for both the magnitude of decrease of the singular values and random noise distracters. Analysis based on selected performance measures indicates that for each document collection there is a region of lower dimensionalities associated with improved retrieval performance; however, the various performance measures clearly disagreed on the model associated with the best performance. The introduction of the multi-weighted model and Analytical Hierarchy Processing (AHP) analysis helped rank dimensionality estimation techniques and facilitated satisfying overall model goals by balancing contradictory constraints and information retrieval priorities. ASE provided the best estimate of MEDLINE intrinsic dimensionality among all tested dimensionality estimation techniques, improving precision and relative relevance by 10.2% and 7.4%, respectively. AHP analysis indicates that ASE and the weighted model ranked best among the other methods, with scores of 30.3% and 20.3% in satisfying overall model goals for MEDLINE and 22.6% and 25.1% for CRANFIELD. The weighted model improved MEDLINE relative relevance by 4.4%, while the scree plot, the weighted model, and ASE provided better estimates of intrinsic dimensionality for the CRANFIELD collection than Kaiser-Guttman and percentage of variance. ASE provided a better estimate of CISI intrinsic dimensionality than all other tested methods, since all methods except ASE tend to underestimate the intrinsic dimensionality of the CISI collection; ASE improved CISI average relative relevance and average search length by 28.4% and 22.0%, respectively. This research provides evidence that a system using a weighted multi-criteria performance evaluation technique achieves better overall performance than a single-criterion ranking model. Thus, the weighted multi-criteria model with dimensionality reduction provides a more efficient implementation of information retrieval than a full-rank model.
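    The abstract does not spell out the ASE criterion, so the following is only a hedged illustration of the general idea: estimate an intrinsic dimensionality from the singular-value spectrum of a term-document matrix and truncate the SVD at that rank. A simple retained-energy threshold stands in for ASE's significance test, and the random matrix stands in for a real collection such as MEDLINE or CRANFIELD.

```python
import numpy as np

def estimate_rank(matrix, energy=0.90):
    """Estimate intrinsic dimensionality as the number of singular values
    needed to retain `energy` of the squared spectral mass.

    A simple stand-in for the significance test ASE applies to the singular
    values; the actual ASE criterion is not reproduced from the abstract.
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, energy) + 1)

def truncate(matrix, k):
    """Rank-k approximation of the kind used by LSA-style retrieval."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

rng = np.random.default_rng(1)
term_doc = rng.poisson(0.3, size=(500, 120)).astype(float)  # toy term-document counts
k = estimate_rank(term_doc)
print("estimated intrinsic dimensionality:", k)
print("reconstruction shape:", truncate(term_doc, k).shape)
```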

    Searching for text documents
