49 research outputs found

    Description of Chinese Intransitive Verbs and Adjuncts Within the LFG Formalism

    Period disambiguation with MaxEnt model

    This paper presents our recent work on period disambiguation, the kernel problem in sentence boundary identification, with the maximum entropy (Maxent) model. A number of experiments are conducted on the PTB-II WSJ corpus to investigate how the context window, the feature space, and lexical information such as abbreviations and sentence-initial words affect learning performance. Such lexical information can be acquired automatically from a training corpus by the learner. Our experimental results show that extending the feature space to integrate these two kinds of lexical information eliminates 93.52% of the remaining errors of the baseline Maxent model, achieving an F-score of 99.8227%.
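
    The abstract does not spell out the feature set, so the following is only a minimal Python sketch of the described setup, using scikit-learn's LogisticRegression as the maximum entropy classifier; the window size, feature names, abbreviation and sentence-initial word lists, and toy data are illustrative assumptions, not the paper's configuration:

        from sklearn.feature_extraction import DictVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        ABBREVS = {"Mr.", "Dr.", "U.S.", "Inc."}    # would be learned from the corpus
        SENT_INITIAL = {"The", "He", "She", "But"}  # likewise corpus-derived

        def features(tokens, i, window=2):
            # Context-window tokens around the period-bearing token tokens[i],
            # plus the two lexical cues the abstract mentions.
            feats = {f"tok[{d}]": tokens[i + d] if 0 <= i + d < len(tokens) else "<PAD>"
                     for d in range(-window, window + 1)}
            feats["is_abbrev"] = tokens[i] in ABBREVS
            nxt = tokens[i + 1] if i + 1 < len(tokens) else "<PAD>"
            feats["next_is_sent_initial"] = nxt in SENT_INITIAL
            return feats

        # Toy examples: (tokens, index of the token containing a period, boundary?).
        data = [
            (["Mr.", "Smith", "arrived", "late."], 0, False),
            (["Mr.", "Smith", "arrived", "late."], 3, True),
            (["She", "works", "at", "Acme", "Inc.", "in", "Ohio."], 4, False),
            (["She", "works", "at", "Acme", "Inc.", "in", "Ohio."], 6, True),
        ]
        X = [features(toks, i) for toks, i, _ in data]
        y = [label for _, _, label in data]
        clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(X, y)
        # "Dr." is a known abbreviation, so this period should not be a boundary.
        print(clf.predict([features(["See", "Dr.", "Lee", "today."], 1)]))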

    An Improved Corpus Comparison Approach to Domain Specific Term Recognition

    PACLIC / The University of the Philippines Visayas Cebu College, Cebu City, Philippines / November 20-22, 200

    Viewpoint Discovery and Understanding in Social Networks

    The Web has evolved into a dominant platform where everyone has the opportunity to express their opinions, to interact with other users, and to debate emerging events happening around the world. On the one hand, this has enabled the presence of different viewpoints and opinions about a (usually controversial) topic such as Brexit; at the same time, it has led to phenomena like media bias, echo chambers, and filter bubbles, where users are exposed to only one point of view on a topic. There is therefore a need for methods that can detect and explain the different viewpoints. In this paper, we propose a graph partitioning method that exploits social interactions to enable the discovery of the different communities (representing different viewpoints) discussing a controversial topic in a social network like Twitter. To explain the discovered viewpoints, we describe a method, called Iterative Rank Difference (IRD), which detects descriptive terms that characterize the different viewpoints and reveals how a specific term is related to a viewpoint (by detecting other related descriptive terms). The results of an experimental evaluation show that our approach outperforms state-of-the-art methods on viewpoint discovery, while a qualitative analysis of the proposed IRD method on three controversial topics shows that IRD provides comprehensive and deep representations of the different viewpoints.
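
    The abstract names Iterative Rank Difference (IRD) but does not give its formula, so the Python sketch below only illustrates the underlying rank-difference idea: a term is descriptive of a viewpoint if it ranks much higher in that community's texts than in the opposing community's. The tokenization, the use of raw frequency ranks, and the toy data are all assumptions:

        from collections import Counter

        def freq_ranks(docs):
            # Map each term to its frequency rank (1 = most frequent).
            counts = Counter(tok for doc in docs for tok in doc.lower().split())
            return {t: r for r, (t, _) in enumerate(counts.most_common(), start=1)}

        def rank_difference(community_a, community_b, top_k=5):
            # Terms whose rank in community A improves most over community B.
            ra, rb = freq_ranks(community_a), freq_ranks(community_b)
            worst = len(rb) + 1  # rank assigned to terms absent from B
            scores = {t: rb.get(t, worst) - r for t, r in ra.items()}
            return sorted(scores, key=scores.get, reverse=True)[:top_k]

        leave = ["take back control of our borders",
                 "control our own laws and our borders"]
        remain = ["stronger and safer in the single market",
                  "the single market protects jobs"]
        print(rank_difference(leave, remain))  # e.g. ['our', 'control', 'borders', ...]

    A real implementation would at least filter stopwords; the iterative refinement that gives IRD its name is not reconstructed here.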

    My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections

    Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary. We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences across multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In comparison to state-of-the-art cross-collection topic modeling, our model achieves up to 13% higher topic coherence, up to 4% lower perplexity, and up to 31% higher document classification accuracy. More importantly, our approach is the first topic model that ensures disjoint general and specific word distributions, resulting in clear-cut topic representations.
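
    The entropy criterion the abstract describes can be pictured as follows: a word spread evenly over all collections (high entropy of its collection distribution) is collection-independent, while a word concentrated in one collection (low entropy) is collection-specific. A minimal Python sketch, in which the normalization and the 0.5 threshold are illustrative assumptions:

        import math

        def collection_entropy(word, collections):
            # Normalized entropy (0..1) of the word's counts across collections.
            counts = [sum(doc.lower().split().count(word) for doc in coll)
                      for coll in collections]
            total = sum(counts)
            if total == 0:
                return 0.0
            probs = [c / total for c in counts if c > 0]
            h = -sum(p * math.log(p) for p in probs)
            return h / math.log(len(collections))

        patents = ["the claimed apparatus comprises a sensor",
                   "an apparatus and method for sensing"]
        papers = ["our approach outperforms the baseline",
                  "the proposed approach generalizes well"]
        for w in ["the", "apparatus", "approach"]:
            h = collection_entropy(w, [patents, papers])
            kind = "independent" if h > 0.5 else "specific"
            print(f"{w}: H={h:.2f} -> collection-{kind}")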

    Unsupervised Lexical Learning as Inductive Inference

    This paper presents a learning-via-compression approach to the unsupervised acquisition of word forms with no a priori knowledge. Following the basic ideas in Solomonoff's theory of inductive inference and Rissanen's MDL framework, learning is formulated as a process of inferring regularities, in the form of string patterns (i.e., words), from a given set of data. A segmentation algorithm is designed to segment each input utterance into the sequence of word candidates that gives an optimal sum of description length gain (DLG). The learning model has a lexical refinement module that exploits this algorithm to derive finer-grained word candidates recursively until no further compression is available. Experimental results on an infant-directed speech corpus show that this approach reaches state-of-the-art performance in terms of precision and recall of both words and word boundaries.
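
    The description length gain of a candidate string can be read as the drop in the empirical (token-entropy) description length of the corpus after every occurrence of the candidate is rewritten as a single new symbol, with one delimited copy of the candidate appended as its definition. A minimal Python sketch of this standard DLG formulation, with a toy utterance as an assumed input:

        import math
        from collections import Counter

        def description_length(tokens):
            # Empirical Shannon description length in bits: -sum c(t) * log2(c(t)/N).
            counts, n = Counter(tokens), len(tokens)
            return -sum(c * math.log2(c / n) for c in counts.values())

        def extract(tokens, candidate):
            # Rewrite every occurrence of `candidate` as one new symbol and
            # append one '#'-delimited copy of the candidate as its definition.
            sym, out, i = "[" + candidate + "]", [], 0
            while i < len(tokens):
                if tokens[i:i + len(candidate)] == list(candidate):
                    out.append(sym)
                    i += len(candidate)
                else:
                    out.append(tokens[i])
                    i += 1
            return out + list(candidate) + ["#"]

        def dlg(tokens, candidate):
            return description_length(tokens) - description_length(extract(tokens, candidate))

        utterance = list("lookatthedoggy" * 3)  # unsegmented character stream
        print(dlg(utterance, "doggy"))  # positive: extracting the true word compresses
        print(dlg(utterance, "ggylo"))  # negative: a random substring does not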

    How Does Lexical Acquisition Begin? A Cognitive Perspective

    Lexical acquisition is a critical stage of language development, during which human infants learn a set of word forms and their associations with meanings, starting from little a priori knowledge about words: they do not even know whether there are words in their mother tongues. How do infants infer individual words from the continuous speech stream to which they are exposed? This paper presents a comprehensive review of contemporary studies on how lexical acquisition begins. It first gives a brief introduction to language development, and then examines the characteristics of the speech input to lexical-learning infants and the speech perceptual abilities they have developed at the very beginning of the learning. Possible strategies of speech segmentation for word discovery and various cues that may facilitate the bootstrapping process involved in the learning, including prosodic, allophonic, phonotactic and distributional cues, are discussed in detail, and a number of questions concerning the cue-based studies are raised: How do infants acquire the cues for discovering words? Are the cues the starting point, or a by-product, of the learning? Is there any more fundamental cognitive mechanism that infants exploit to induce the cues and words?

    A Goodness Measure for Phrase Learning via Compression with the MDL Principle

    This paper reports our ongoing research on unsupervised language learning via compression within the MDL paradigm. It formulates an empirical information-theoretic measure, description length gain, for evaluating the goodness of guessing a sequence of words (or characters) as a phrase (or a word), which can be calculated easily following classic information theory. The paper also presents a best-first learning algorithm based on this measure. Experiments on phrase and lexical learning from POS tag and character sequences, respectively, show promising results.
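
    A best-first learner on top of DLG can be sketched as a loop that repeatedly extracts the candidate n-gram with the highest positive gain until no candidate helps; this mirrors the DLG sketch above and is not necessarily the paper's exact algorithm. The candidate generation (all 2- to 5-grams) and the toy POS tag sequence are assumptions:

        import math
        from collections import Counter

        def dl(tokens):
            # Empirical Shannon description length in bits, as in the DLG sketch above.
            counts, n = Counter(tokens), len(tokens)
            return -sum(c * math.log2(c / n) for c in counts.values())

        def rewrite(tokens, cand):
            # Replace each occurrence of the n-gram `cand` with one new symbol
            # and append a '#'-delimited definition of the n-gram.
            sym, out, i = "[" + " ".join(cand) + "]", [], 0
            while i < len(tokens):
                if tuple(tokens[i:i + len(cand)]) == cand:
                    out.append(sym)
                    i += len(cand)
                else:
                    out.append(tokens[i])
                    i += 1
            return out + list(cand) + ["#"]

        def best_first(tokens, max_n=5):
            learned = []
            while True:
                cands = {tuple(tokens[i:i + n])
                         for n in range(2, max_n + 1)
                         for i in range(len(tokens) - n + 1)}
                gains = {c: dl(tokens) - dl(rewrite(tokens, c)) for c in cands}
                best = max(gains, key=gains.get)
                if gains[best] <= 0:
                    return learned
                learned.append(best)
                tokens = rewrite(tokens, best)

        tags = "DT NN VB DT NN IN DT NN VB".split() * 4
        print(best_first(tags))  # typically learns ('DT', 'NN') first, then stops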