Measuring Semantic Similarity by Latent Relational Analysis
This paper introduces Latent Relational Analysis (LRA), a method for measuring semantic similarity. LRA measures similarity in the semantic relations between two pairs of words. When two pairs have a high degree of relational similarity, they are analogous. For example, the pair cat:meow is analogous to the pair dog:bark. There is evidence from cognitive science that relational similarity is fundamental to many cognitive and linguistic tasks (e.g., analogical reasoning). In the Vector Space Model (VSM) approach to measuring relational similarity, the similarity between two pairs is calculated by the cosine of the angle between the vectors that represent the two pairs. The elements in the vectors are based on the frequencies of manually constructed patterns in a large corpus. LRA extends the VSM approach in three ways: (1) patterns are derived automatically from the corpus, (2) Singular Value Decomposition is used to smooth the frequency data, and (3) synonyms are used to reformulate word pairs. This paper describes the LRA algorithm and experimentally compares LRA to VSM on two tasks, answering college-level multiple-choice word analogy questions and classifying semantic relations in noun-modifier expressions. LRA achieves state-of-the-art results, reaching human-level performance on the analogy questions and significantly exceeding VSM performance on both tasks.
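As a concrete illustration of the VSM step described above, the cosine of the angle between two pattern-frequency vectors can be computed as follows. The pattern counts for cat:meow and dog:bark are invented for illustration; they are not taken from the paper's corpus:

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

# Hypothetical frequencies of four corpus patterns (e.g. "X makes a Y",
# "Y of the X") for each word pair -- illustrative numbers only.
cat_meow = [12, 3, 0, 7]
dog_bark = [10, 4, 1, 6]

print(cosine(cat_meow, dog_bark))  # close to 1: the relations are similar
```

A cosine near 1 indicates that the two pairs co-occur with patterns in similar proportions, i.e. that their relations are similar.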
Human-Level Performance on Word Analogy Questions by Latent Relational Analysis
This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, machine translation, and information retrieval. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason/stone is analogous to the pair carpenter/wood; the relations between mason and stone are highly similar to the relations between carpenter and wood. Past work on semantic similarity measures has mainly been concerned with attributional similarity. For instance, Latent Semantic Analysis (LSA) can measure the degree of similarity between two words, but not between two relations. Recently the Vector Space Model (VSM) of information retrieval has been adapted to the task of measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus (they are not predefined), (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data (it is also used this way in LSA), and (3) automatically generated synonyms are used to explore reformulations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying noun-modifier relations, LRA achieves similar gains over the VSM, while using a smaller corpus.
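The SVD smoothing step (point 2 above) can be sketched with NumPy: a pair-by-pattern frequency matrix is reduced to its k largest singular components and reconstructed, the same truncation used in LSA. The matrix values here are invented for illustration, not real corpus counts:

```python
import numpy as np

# Hypothetical pair-by-pattern frequency matrix: rows are word pairs,
# columns are corpus patterns (illustrative counts only).
X = np.array([
    [4., 0., 2., 1.],
    [3., 1., 2., 0.],
    [0., 5., 0., 3.],
    [1., 4., 0., 2.],
])

# Truncated SVD: keep the k largest singular values and reconstruct.
# The rank-k reconstruction smooths the raw frequencies.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_smooth = U[:, :k] * s[:k] @ Vt[:k, :]

print(X_smooth.round(2))
```

The reconstruction has the same shape as the original matrix but rank at most k, so sparse, noisy counts are replaced by values inferred from the dominant co-occurrence structure.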
Parallel methods for the generation of partitioned inverted files
Purpose
– The generation of inverted indexes is one of the most computationally intensive activities for information retrieval systems: indexing large multi-gigabyte text databases can take many hours or even days to complete. We examine the generation of partitioned inverted files in order to speed up the process of indexing. Two types of index partitions are investigated: TermId and DocId.
Design/methodology/approach
– We use standard parallel-computing measures such as speedup and efficiency to examine the computing results, and also report the space costs of our trial indexing experiments.
Findings
– The results from runs on both partitioning methods are compared and contrasted, concluding that DocId is the more efficient method.
Practical implications
– The practical implications are that the DocId partitioning method would in most circumstances be used for distributing inverted file data in a parallel computer, particularly if indexing speed is the primary consideration.
Originality/value
– The paper is of value to database administrators who manage large-scale text collections, and who need to use parallel computing to implement their text retrieval services.
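The two partitioning schemes the paper investigates can be sketched on a toy term-to-postings index. The assignment rules below (doc_id modulo n, and a character-sum hash for terms) are illustrative simplifications, not the paper's actual distribution functions:

```python
from collections import defaultdict

# Toy inverted index: term -> list of (doc_id, term_frequency) postings.
index = {
    "parallel": [(1, 2), (3, 1), (4, 5)],
    "inverted": [(1, 1), (2, 3)],
    "file":     [(2, 1), (3, 2), (4, 1)],
}

def docid_partition(index, n):
    """DocId partitioning: each of the n nodes holds postings only for
    its own subset of documents (assigned here by doc_id modulo n)."""
    parts = [defaultdict(list) for _ in range(n)]
    for term, postings in index.items():
        for doc_id, tf in postings:
            parts[doc_id % n][term].append((doc_id, tf))
    return parts

def termid_partition(index, n):
    """TermId partitioning: each node holds the complete postings list
    for its own subset of terms (assigned here by a character-sum hash)."""
    parts = [{} for _ in range(n)]
    for term, postings in index.items():
        parts[sum(map(ord, term)) % n][term] = list(postings)
    return parts

doc_parts = docid_partition(index, 2)
term_parts = termid_partition(index, 2)
```

Under DocId each node can index its own documents independently, which is one reason the paper finds it the more efficient method for index generation; under TermId every posting for a given term must be routed to a single node.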
Similarity of Semantic Relations
There are at least two kinds of similarity. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason:stone is analogous to the pair carpenter:wood. This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, and information retrieval. Recently the Vector Space Model (VSM) of information retrieval has been adapted to measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus, (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data, and (3) automatically generated synonyms are used to explore variations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying semantic relations, LRA achieves similar gains over the VSM.
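The multiple-choice analogy task reduces to picking the candidate pair whose relation vector has the highest cosine with the stem pair's vector. A minimal sketch, with invented pattern frequencies (the vectors and candidate pairs are hypothetical, not from the paper's test set):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two frequency vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical pattern-frequency vectors (illustrative numbers only).
stem = np.array([5., 1., 3., 0.])            # e.g. mason:stone
choices = {
    "carpenter:wood": np.array([4., 1., 2., 0.]),
    "teacher:chalk":  np.array([0., 3., 0., 2.]),
    "doctor:nurse":   np.array([1., 2., 0., 3.]),
}

# Answer the question by choosing the most relationally similar pair.
best = max(choices, key=lambda c: cosine(stem, choices[c]))
print(best)  # carpenter:wood
```

LRA's three extensions change how the vectors are built (automatic patterns, SVD smoothing, synonym reformulation), but the final selection step is this same argmax over cosines.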
Building query-based relevance sets without human intervention
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy. Test collections are the standard framework used in the evaluation of an information retrieval system and the comparison between different systems. A text test collection consists of a set of documents, a set of topics, and a set of relevance assessments: a list indicating the relevance of each document to each topic. Traditionally, the relevance assessments are formed manually by human judges. But in large-scale environments, such as the web, examining each retrieved document to determine its relevance is not possible. In the past there have been several studies that aimed to reduce the human effort required in building these assessments, which are referred to as qrels (query-based relevance sets). Some research has also been done to completely automate the process of generating the qrels. In this thesis, we present different methodologies that lead to producing the qrels automatically without any human intervention. The first method is based on keyphrase (KP) extraction from documents presumed relevant; the second method uses Machine Learning classifiers, Naïve Bayes and Support Vector Machines. The experiments were conducted on the TREC-6, TREC-7 and TREC-8 test collections. The use of machine learning classifiers produced qrels resulting in information retrieval system rankings which were better correlated with those produced by TREC human assessments than any of the automatic techniques proposed in the literature. In order to produce a test collection which could discriminate between the best performing systems, an enhancement to the machine learning technique was made that used a small number of real or actual qrels as training sets for the classifiers. These actual relevant documents were selected by Losada et al.'s (2016) pooling technique.
This modification led to an improvement in the overall system rankings and enabled discrimination between the best systems with only a little human effort. We also used the bpref-10 and infAP measures for evaluating the systems and comparing the rankings, since they are more robust when relevance judgments are incomplete. We applied our new techniques to the French and Finnish test collections from CLEF2003 in order to confirm their reproducibility on non-English languages, and we achieved correlations as high as those seen for English.
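Comparing the system rankings produced by automatic qrels against those from human assessments is typically done with a rank correlation coefficient such as Kendall's tau. A small self-contained sketch, assuming tie-free rankings; the system names and rank positions are hypothetical:

```python
from itertools import combinations

def kendall_tau(r1, r2):
    """Kendall rank correlation between two rankings, each given as a
    system -> rank-position mapping. Assumes no tied ranks."""
    systems = list(r1)
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        s = (r1[a] - r1[b]) * (r2[a] - r2[b])
        if s > 0:
            concordant += 1   # the pair is ordered the same way in both
        elif s < 0:
            discordant += 1   # the pair is ordered oppositely
    n = len(systems)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rank positions of four retrieval systems under human
# qrels versus automatically generated qrels.
human = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}
auto  = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 4}

print(kendall_tau(human, auto))  # one swapped pair out of six: 2/3
```

A tau of 1.0 would mean the automatic qrels rank the systems exactly as the human assessments do; higher tau is the success criterion the thesis targets.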