
    A study on mutual information-based feature selection for text categorization

    Feature selection plays an important role in text categorization (TC). Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in TC. Many existing experiments show that IG is one of the most effective methods; by contrast, MI has been shown to perform relatively poorly. Under one existing MI method, the mutual information of a category c and a term t can be negative, which conflicts with the definition of MI from information theory, where it is always non-negative. We show that the form of MI used in TC is not derived correctly from information theory. Two different MI-based feature selection criteria are both referred to as MI in the TC literature; one of them should correctly be termed "pointwise mutual information" (PMI). In this paper, we clarify the terminological confusion surrounding the notion of "mutual information" in TC and detail an MI method derived correctly from information theory. Experiments on the Reuters-21578 and OHSUMED collections show that the corrected MI method performs similarly to IG and considerably better than PMI.
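    The distinction is easy to make concrete. Below is a minimal sketch, with hypothetical counts, of the two criteria: PMI as a single-cell score over the term-category contingency table, versus MI proper as the expectation of that score over all four cells, which information theory guarantees to be non-negative.

```python
import math

# Hypothetical document counts for a term t and a category c over
# N documents (any small 2x2 table works for the illustration):
#                 t present   t absent
#     in c            a           b
#     not in c        g           h
a, b, g, h = 30, 70, 20, 880
N = a + b + g + h

p_t, p_c, p_tc = (a + g) / N, (a + b) / N, a / N

# Pointwise mutual information: the single-cell score often labelled
# "MI" in the TC literature; it can be negative when t and c co-occur
# less often than independence would predict.
pmi = math.log2(p_tc / (p_t * p_c))

# Mutual information proper: the expectation of the pointwise score
# over all four cells; information theory guarantees it non-negative.
mi = 0.0
for p_xy, p_x, p_y in [
    (a / N, p_t,     p_c),       # t present, in c
    (b / N, 1 - p_t, p_c),       # t absent,  in c
    (g / N, p_t,     1 - p_c),   # t present, not in c
    (h / N, 1 - p_t, 1 - p_c),   # t absent,  not in c
]:
    if p_xy > 0:
        mi += p_xy * math.log2(p_xy / (p_x * p_y))

print(f"PMI(t, c) = {pmi:+.4f}")   # sign depends on the table
print(f"I(T; C)   = {mi:.4f}")     # always >= 0
```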

    Co-occurrence Vectors from Corpora vs. Distance Vectors from Dictionaries

    A comparison was made between vectors derived using ordinary co-occurrence statistics from large text corpora and vectors derived by measuring inter-word distances in dictionary definitions. The precision of word sense disambiguation using co-occurrence vectors from the 1987 Wall Street Journal (20M total words) was higher than that using distance vectors from the Collins English Dictionary (60K head words + 1.6M definition words). However, other experimental results suggest that distance vectors contain semantic information different from that in co-occurrence vectors.
    Comment: 6 pages, appeared in the Proc. of COLING94 (pp. 304-309)
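    As a toy illustration of the co-occurrence side of this comparison (the corpus, window size, and sense inventory below are illustrative stand-ins, not the paper's WSJ setup), context vectors for an ambiguous word can be built from windowed counts and compared by cosine similarity:

```python
from collections import Counter
import math

# Tiny corpus standing in for the paper's 20M-word WSJ data.
corpus = ("the bank raised interest rates again . "
          "deposit money in the bank for interest . "
          "the river bank was muddy after rain . "
          "we walked along the river bank at dusk .").split()
WINDOW = 2

def cooc_vector(positions):
    # Bag of context words within +/- WINDOW tokens of each position.
    vec = Counter()
    for i in positions:
        lo, hi = max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)
        vec.update(corpus[j] for j in range(lo, hi) if j != i)
    return vec

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# Sense profiles from known occurrences of "bank" (indices 1 and 11
# are financial uses, 17 is a river use), then a held-out occurrence.
finance  = cooc_vector([1, 11])
river    = cooc_vector([17])
held_out = cooc_vector([28])   # "... the river bank at dusk"

print("finance:", round(cosine(held_out, finance), 3))
print("river:  ", round(cosine(held_out, river), 3))
# The river profile scores higher, disambiguating the held-out use.
```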

    Word Activation Forces Map Word Networks

    Words associate with each other in intricate clusters [1-3]. Yet the brain capably encodes these complex relations into workable networks [4-7], such that the onset of a word automatically and selectively activates its associates, facilitating language understanding and generation [8-10]. It is believed that the activation strength from one word to another forges and accounts for the latent structures of these word networks. This implies that mapping the word networks from brains to computers [11,12], which is necessary for various purposes [1,2,13-15], may be achieved by modeling the activation strengths. However, although many investigations of word activation effects have been carried out [8-10,16-20], modeling the activation strengths remains an open problem, and the mappings therefore require substantial manual labor [11,12]. Here we show that word activation forces, statistically defined by a formula of the same form as universal gravitation, capture essential information about the word networks, leading to a superior approach to the mappings. The approach compatibly encodes syntactic and semantic information into sparsely coded directed networks and comprehensively highlights the features of individual words. We find that, based on the directed networks, sensible word clusters and hierarchies can be discovered efficiently. These striking results strongly suggest that word activation forces may reveal the encoding of word networks in the brain.
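    The abstract states only that the force has the form of universal gravitation (two "masses" attracting over a squared distance). The sketch below adopts one plausible reading of that form, with conditional co-occurrence rates as the masses and mean token distance as the distance; this choice, along with the toy corpus, is an illustrative assumption, not the paper's published formula.

```python
from collections import defaultdict

def activation_forces(tokens, max_gap=5):
    freq = defaultdict(int)             # f_i: occurrences of word i
    co = defaultdict(lambda: [0, 0.0])  # (i, j) -> [f_ij, summed gaps]
    for i, w in enumerate(tokens):
        freq[w] += 1
        for gap in range(1, max_gap + 1):
            if i + gap < len(tokens):
                pair = co[(w, tokens[i + gap])]
                pair[0] += 1
                pair[1] += gap
    waf = {}
    for (wi, wj), (f_ij, gap_sum) in co.items():
        d = gap_sum / f_ij              # mean token distance d_ij
        # "Masses" f_ij/f_i and f_ij/f_j attract over distance d:
        waf[(wi, wj)] = (f_ij / freq[wi]) * (f_ij / freq[wj]) / d ** 2
    return waf

tokens = ("hot coffee and hot tea ; "
          "coffee needs milk and tea needs milk").split()
# Strongest directed links dominate the induced word network.
for pair, f in sorted(activation_forces(tokens, max_gap=3).items(),
                      key=lambda kv: -kv[1])[:5]:
    print(pair, round(f, 3))
```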

    The logic and linguistic model for automatic extraction of collocation similarity

    The article discusses the process of automatic identification of collocation similarity. Semantic analysis is among the most advanced, as well as the most difficult, NLP tasks. The main problem of semantic processing is determining the polysemy and synonymy of linguistic units, and the task becomes more complicated in the case of word collocations. The paper suggests a logical-linguistic model for automatically determining semantic similarity between collocations in the Ukrainian and English languages. The proposed model formalizes the semantic equivalence of collocations by means of semantic and grammatical characteristics of collocates. The basic idea of this approach is that the morphological, syntactic, and semantic characteristics of lexical units must be taken into account to identify collocation similarity. The basic mathematical means of the model are logical-algebraic equations of the algebra of finite predicates. The model covers verb-noun and noun-adjective collocations in Ukrainian and English, which consist of words belonging to the main parts of speech. The model allows semantically equivalent collocations to be extracted from semi-structured and unstructured texts, and its implementation makes it possible to recognize semantically equivalent collocations automatically. Using the model increases the effectiveness of natural language processing tasks such as information extraction, ontology generation, and sentiment analysis, among others.
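    The authors' finite-predicate formalism is not reproduced in the abstract, so the sketch below only illustrates the underlying idea schematically: two collocations are candidate equivalents when their collocates agree, position by position, on morpho-syntactic and semantic features regardless of surface lemmas. The feature inventory and examples are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Collocate:
    lemma: str
    pos: str   # part of speech, e.g. "VERB", "NOUN"
    sem: str   # toy semantic tag standing in for a richer inventory

def equivalent(c1, c2):
    # Same length, and matching part of speech and semantic tag for
    # each collocate pair, independent of the surface lemmas.
    return (len(c1) == len(c2) and
            all(a.pos == b.pos and a.sem == b.sem
                for a, b in zip(c1, c2)))

make_decision = (Collocate("make", "VERB", "PERFORM"),
                 Collocate("decision", "NOUN", "DECISION"))
take_decision = (Collocate("take", "VERB", "PERFORM"),
                 Collocate("decision", "NOUN", "DECISION"))
take_umbrella = (Collocate("take", "VERB", "ACQUIRE"),
                 Collocate("umbrella", "NOUN", "OBJECT"))

print(equivalent(make_decision, take_decision))  # True: same pattern
print(equivalent(make_decision, take_umbrella))  # False
```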

    An Approach to Proper Name Tagging for German

    This paper presents an incremental method for tagging proper names in German newspaper texts. Tagging is performed by analyzing the syntactic and textual contexts of proper names together with a morphological analysis. The proper names selected by this process supply new contexts, which can in turn be used to find further proper names, and so on. Applied to a small German corpus (50,000 words), the procedure correctly disambiguated 65% of the capitalized words, and it should improve when applied to a very large corpus.
    Comment: 6 pages, LaTeX, 2 uuencoded tar-compressed eps-figures added, EACL-SIGDAT 9
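    A minimal sketch of the incremental idea: names found so far contribute contexts, and contexts license new capitalized words as names, iterated to a fixed point. The seed list, the one-word-of-left-context test, and the toy sentences are illustrative assumptions, not the paper's German grammar rules.

```python
def bootstrap_names(tokens, seed_names, max_rounds=5):
    names = set(seed_names)
    contexts = set()
    for _ in range(max_rounds):
        # Harvest the word immediately preceding each known name.
        for i, w in enumerate(tokens):
            if w in names and i > 0:
                contexts.add(tokens[i - 1])
        # Accept new capitalized words seen in a harvested context.
        new = {w for i, w in enumerate(tokens)
               if i > 0 and w[0].isupper() and w not in names
               and tokens[i - 1] in contexts}
        if not new:
            break
        names |= new
    return names

tokens = ("Herr Schmidt traf Frau Weber in Berlin . "
          "Herr Maier und Frau Schulz kamen später .").split()
print(sorted(bootstrap_names(tokens, {"Schmidt", "Weber"})))
# ['Maier', 'Schmidt', 'Schulz', 'Weber']
```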

    Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler

    Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary or have meanings that differ from those in ordinary language. The first step in creating a specialized dictionary is detecting the characteristic vocabulary of the domain in question. Classical methods for detecting this vocabulary involve gathering a domain corpus, calculating statistics on the terms found there, and then comparing these statistics to those of a background or general-language corpus. Terms found significantly more often in the specialized corpus than in the background corpus are candidates for the characteristic vocabulary of the domain. Here we present two tools, a directed crawler and a distributional semantics package, that can be used together, circumventing the need for a background corpus. Both tools are available on the web.
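    As a sketch of the classical baseline the abstract describes, one can rank terms by how much more frequent they are in the domain corpus than in a background corpus. The smoothed frequency-ratio statistic and toy corpora below are assumptions (log-likelihood or chi-square are common alternatives), and the paper's own contribution replaces the background corpus entirely.

```python
from collections import Counter

def characteristic_terms(domain_tokens, background_tokens, top_n=5):
    dom, bg = Counter(domain_tokens), Counter(background_tokens)
    n_dom, n_bg = sum(dom.values()), sum(bg.values())
    def score(term):
        # Relative frequency in the domain corpus vs. a smoothed
        # relative frequency in the background corpus.
        return (dom[term] / n_dom) / ((bg[term] + 1) / (n_bg + 1))
    return sorted(dom, key=score, reverse=True)[:top_n]

domain = ("the patient showed acute myocardial infarction and the "
          "patient received thrombolytic therapy for the infarction").split()
background = ("the weather was mild and the report said the markets "
              "showed a mild recovery for the quarter").split()
print(characteristic_terms(domain, background))
# Domain terms like "infarction" rank above common function words.
```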