3,021 research outputs found

    Human-Level Performance on Word Analogy Questions by Latent Relational Analysis

    This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, machine translation, and information retrieval. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason/stone is analogous to the pair carpenter/wood; the relations between mason and stone are highly similar to the relations between carpenter and wood. Past work on semantic similarity measures has mainly been concerned with attributional similarity. For instance, Latent Semantic Analysis (LSA) can measure the degree of similarity between two words, but not between two relations. Recently the Vector Space Model (VSM) of information retrieval has been adapted to the task of measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus (they are not predefined), (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data (it is also used this way in LSA), and (3) automatically generated synonyms are used to explore reformulations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying noun-modifier relations, LRA achieves similar gains over the VSM, while using a smaller corpus
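The VSM baseline described in this abstract can be illustrated with a minimal sketch: each word pair is mapped to a vector of pattern frequencies, and two pairs are compared by the cosine of their vectors. The pattern list, the toy corpus, and the function names below are all hypothetical and chosen only for illustration. The sketch omits the three LRA extensions (automatically derived patterns, SVD smoothing, and synonym-based reformulations).

```python
import math

# Hypothetical predefined patterns; {x} and {y} are the two words of a pair.
PATTERNS = ["{x} cuts {y}", "{x} works with {y}", "{x} of {y}", "{y} for {x}"]

# Toy corpus invented for this example (a real system would use a large corpus).
CORPUS = [
    "the mason cuts stone every day",
    "a mason works with stone",
    "the carpenter cuts wood",
    "the carpenter cuts wood to size",
    "a carpenter works with wood and nails",
    "a wheel of cheese",
]


def relation_vector(pair, corpus):
    """Count how often each pattern, filled in with the pair, occurs in the corpus."""
    x, y = pair
    vector = []
    for pattern in PATTERNS:
        phrase = pattern.format(x=x, y=y)
        vector.append(sum(sentence.count(phrase) for sentence in corpus))
    return vector


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relational_similarity(pair_a, pair_b, corpus=CORPUS):
    """Relational similarity of two word pairs under this toy VSM."""
    return cosine(relation_vector(pair_a, corpus), relation_vector(pair_b, corpus))


if __name__ == "__main__":
    print(relational_similarity(("mason", "stone"), ("carpenter", "wood")))
    print(relational_similarity(("mason", "stone"), ("wheel", "cheese")))
```

In this toy setting, mason/stone scores much closer to carpenter/wood than to wheel/cheese, because the first two pairs share the same joining patterns.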

    Similarity of Semantic Relations

    There are at least two kinds of similarity. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason:stone is analogous to the pair carpenter:wood. This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, and information retrieval. Recently the Vector Space Model (VSM) of information retrieval has been adapted to measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus, (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data, and (3) automatically generated synonyms are used to explore variations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying semantic relations, LRA achieves similar gains over the VSM

    Acquiring symbolic design optimization problem reformulation knowledge: On computable relationships between design syntax and semantics

    This thesis presents a computational method for the inductive inference of explicit and implicit semantic design knowledge from the symbolic-mathematical syntax of design formulations using an unsupervised pattern recognition and extraction approach. Existing research shows that AI / machine learning based design computation approaches either require high levels of knowledge engineering or large training databases to acquire problem reformulation knowledge. The method presented in this thesis addresses these methodological limitations. The thesis develops, tests, and evaluates ways in which the method may be employed for design problem reformulation. The method is based on the linear algebra based factorization method Singular Value Decomposition (SVD), dimensionality reduction and similarity measurement through unsupervised clustering. The method calculates linear approximations of the associative patterns of symbol cooccurrences in a design problem representation to infer induced coupling strengths between variables, constraints and system components. Unsupervised clustering of these approximations is used to identify useful reformulations. These two components of the method automate a range of reformulation tasks that have traditionally required different solution algorithms. Example reformulation tasks that it performs include selection of linked design variables, parameters and constraints, design decomposition, modularity and integrative systems analysis, heuristically aiding design “case” identification, topology modeling and layout planning. The relationship between the syntax of design representation and the encoded semantic meaning is an open design theory research question. Based on the results of the method, the thesis presents a set of theoretical postulates on computable relationships between design syntax and semantics. 
The postulates relate the performance of the method with empirical findings and theoretical insights provided by cognitive neuroscience and cognitive science on how the human mind engages in symbol processing and the resulting capacities inherent in symbolic representational systems to encode “meaning”. The performance of the method suggests that semantic “meaning” is a higher order, global phenomenon that lies distributed in the design representation in explicit and implicit ways. A one-to-one local mapping between a design symbol and its meaning, a largely prevalent approach adopted by many AI and learning algorithms, may not be sufficient to capture and represent this meaning. By changing the theoretical standpoint on how a “symbol” is defined in design representations, it was possible to use a simple set of mathematical ideas to perform unsupervised inductive inference of knowledge in a knowledge-lean and training-lean manner, for a knowledge domain that traditionally relies on “giving” the system complex design domain and task knowledge for performing the same set of tasks

    How Many Topics? Stability Analysis for Topic Models

    Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
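The stability idea can be sketched as follows. Topics are represented by their top-term lists, a reference model is compared against models fit on perturbed samples of the data, and a higher average agreement suggests a more suitable number of topics. The abstract does not specify the agreement measure, so this sketch assumes plain Jaccard overlap of top-term sets with brute-force one-to-one topic matching. The function names and example terms are hypothetical.

```python
from itertools import permutations


def jaccard(terms_a, terms_b):
    """Jaccard overlap between two top-term lists, treated as sets."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agreement(topics_a, topics_b):
    """Best average Jaccard overlap over all one-to-one topic matchings.

    Brute force over permutations, so only practical for a small number of topics.
    """
    k = len(topics_a)
    if k == 0 or k != len(topics_b):
        raise ValueError("both models need the same non-zero number of topics")
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(jaccard(topics_a[i], topics_b[j]) for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best


def stability(reference, perturbed_runs):
    """Mean agreement between a reference model and models fit on perturbed data."""
    if not perturbed_runs:
        raise ValueError("need at least one perturbed run")
    return sum(agreement(reference, run) for run in perturbed_runs) / len(perturbed_runs)


if __name__ == "__main__":
    # Invented top terms for a 2-topic model and two runs on resampled data.
    reference = [["goal", "match", "team"], ["vote", "party", "election"]]
    run_1 = [["election", "vote", "party"], ["team", "goal", "league"]]
    run_2 = [["goal", "match", "team"], ["vote", "party", "election"]]
    print(stability(reference, [run_1, run_2]))
```

Under these assumptions, one would compute `stability` for each candidate number of topics and prefer the values where it peaks.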

    On the Performance of Latent Semantic Indexing-based Information Retrieval

    Conventional vector-based Information Retrieval (IR) models, the Vector Space Model (VSM) and the Generalized Vector Space Model (GVSM), represent documents and queries as vectors in a multidimensional space. This high-dimensional data places great demands on computing resources. To overcome this problem, Latent Semantic Indexing (LSI), a variant of the VSM, projects the documents into a lower-dimensional space computed via Singular Value Decomposition. The IR literature states that the LSI model is 30% more effective than classical VSM models. However, statistical significance tests are required to evaluate the reliability of such comparisons, and to the best of our knowledge the significance of the LSI model's performance has not been analyzed so far. This paper addresses that issue. We discuss the tradeoffs of VSM, GVSM, and LSI and empirically evaluate the difference in performance on four test document collections. We then analyze the statistical significance of these performance differences.
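To illustrate the kind of significance testing the abstract calls for, here is a minimal sketch of a paired randomization (sign-flip) test on per-query effectiveness scores for two retrieval systems. The abstract does not name the test the authors use, so this choice is an assumption. The scores in the example are invented placeholders, not results from the paper.

```python
from itertools import product


def paired_randomization_p_value(scores_a, scores_b):
    """Exact two-sided sign-flip test on paired per-query score differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point rounding.
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Hypothetical per-query average precision for two systems on six queries.
    lsi_scores = [0.50, 0.40, 0.60, 0.30, 0.70, 0.45]
    vsm_scores = [0.40, 0.35, 0.50, 0.30, 0.55, 0.40]
    print(paired_randomization_p_value(lsi_scores, vsm_scores))
```

A small p-value would indicate that the observed difference in mean score is unlikely under the null hypothesis that the two systems are interchangeable on each query.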