
    Discovery of Linguistic Relations Using Lexical Attraction

    This work has been motivated by two long-term goals: to understand how humans learn language and to build programs that can understand language. Using a representation that makes the relevant features explicit is a prerequisite for successful learning and understanding. Therefore, I chose to represent relations between individual words explicitly in my model. Lexical attraction is defined as the likelihood of such relations. I introduce a new class of probabilistic language models, named lexical attraction models, which can represent long-distance relations between words, and I formalize this new class of models using information theory. Within the framework of lexical attraction, I developed an unsupervised language acquisition program that learns to identify linguistic relations in a given sentence. The only explicitly represented linguistic knowledge in the program is lexical attraction. There is no initial grammar or lexicon built in, and the only input is raw text. Learning and processing are interdigitated. The processor uses the regularities detected by the learner to impose structure on the input. This structure enables the learner to detect higher-level regularities. Using this bootstrapping procedure, the program was trained on 100 million words of Associated Press material and was able to achieve 60% precision and 50% recall in finding relations between content words. Using knowledge of lexical attraction, the program can identify the correct relations in syntactically ambiguous sentences such as "I saw the Statue of Liberty flying over New York." Comment: dissertation, 56 pages.
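    The quantity the model estimates can be illustrated with a toy computation. Below is a minimal sketch, assuming lexical attraction is scored as pointwise mutual information between co-occurring words; pairing adjacent tokens is a simplification of the dissertation's linkage structure, and the example sentence is only illustrative.

```python
# Minimal sketch: scoring word pairs by pointwise mutual information (PMI),
# the information-theoretic quantity behind lexical attraction. Pairing
# adjacent tokens is a simplification of the linkages learned in the thesis.
import math
from collections import Counter

def lexical_attraction_scores(tokens):
    """Return PMI (in bits) for each adjacent word pair in `tokens`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {
        (w1, w2): math.log2((c / n_bi) / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
        for (w1, w2), c in bigrams.items()
    }

tokens = "i saw the statue of liberty flying over new york".split()
for pair, pmi in sorted(lexical_attraction_scores(tokens).items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(pmi, 2))
```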

    Lexically specific knowledge and individual differences in adult native speakers’ processing of the English passive

    This article provides experimental evidence for the role of lexically specific representations in the processing of passive sentences and considerable education-related differences in comprehension of the passive construction. The experiment measured response time and decision accuracy of participants with high and low academic attainment using an online task that compared processing and comprehension of active and passive sentences containing verbs strongly associated with the passive and active constructions, as determined by collostructional analysis. As predicted by usage-based accounts, participants’ performance was influenced by frequency (both groups processed actives faster than passives; the low academic attainment participants also made significantly more errors on passive sentences) and lexical specificity (i.e., processing of passives was slower with verbs strongly associated with the active). Contrary to proposals made by Dąbrowska and Street (2006), the results suggest that all participants have verb-specific as well as verb-general representations, but that the latter are not as entrenched in the participants with low academic attainment, resulting in less reliable performance. The results also show no evidence of a speed–accuracy trade-off, making alternative accounts of the results (e.g., those of two-stage processing models, such as Townsend & Bever, 2001) problematic.
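    The verb-construction associations mentioned above come from collostructional analysis; the sketch below shows, under stated assumptions, how such a collexeme strength is commonly computed from a 2x2 co-occurrence table with Fisher's exact test. The counts are invented for illustration and are not the study's data.

```python
# Hedged sketch of a collexeme-strength calculation in the spirit of
# collostructional analysis: how strongly is a verb attracted to the passive
# construction, given its overall corpus frequency? Counts are invented.
import math
from scipy.stats import fisher_exact

def collostruction_strength(verb_in_cx, verb_elsewhere, others_in_cx, others_elsewhere):
    """Return -log10(p) of a one-sided Fisher's exact test on the 2x2 table."""
    table = [[verb_in_cx, verb_elsewhere],
             [others_in_cx, others_elsewhere]]
    _, p = fisher_exact(table, alternative="greater")
    return -math.log10(p)

# e.g. a verb with 120 passive vs. 880 other tokens, against 9,880 passives
# among 989,120 remaining verb tokens (made-up numbers)
print(round(collostruction_strength(120, 880, 9_880, 989_120), 1))
```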

    Statistical keyword detection in literary corpora

    Understanding the complexity of human language requires an appropriate analysis of the statistical distribution of words in texts. We consider the information retrieval problem of detecting and ranking the relevant words of a text by means of statistical information referring to the "spatial" use of the words. Shannon's entropy of information is used as a tool for automatic keyword extraction. By using The Origin of Species by Charles Darwin as a representative text sample, we show the performance of our detector and compare it with other proposals in the literature. The randomly shuffled text receives special attention as a tool for calibrating the ranking indices. Comment: Published version. 11 pages, 7 figures. SVJour for LaTeX2e.
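    A minimal sketch of the underlying idea follows, assuming the entropy is taken over a word's counts in equal-length parts of the text and calibrated against a random shuffle; the partition size and scoring are illustrative choices, not the paper's exact procedure.

```python
# Minimal sketch of entropy-based keyword detection: count a word's
# occurrences in P equal-length parts of the text and compare the Shannon
# entropy of that distribution with the same word in a shuffled text.
# Relevant words cluster (lower entropy); function words spread evenly.
import math
import random
from collections import Counter

def spatial_entropy(tokens, word, parts=32):
    """Shannon entropy (bits) of `word`'s counts across `parts` equal chunks."""
    chunk = max(1, len(tokens) // parts)
    counts = Counter(i // chunk for i, t in enumerate(tokens) if t == word)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def keyword_score(tokens, word, parts=32, seed=0):
    """Entropy deficit of the real text relative to a random shuffle."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return spatial_entropy(shuffled, word, parts) - spatial_entropy(tokens, word, parts)
```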

    Formulaicity in an agglutinating language: the case of Turkish

    This study examines the extent to which complex inflectional patterns found in Turkish, a language with a rich agglutinating morphology, can be described as formulaic. It is found that many prototypically formulaic phenomena previously attested at the multi-word level in English – frequent co-occurrence of specific elements, fixed ‘bundles’ of elements, and associations between lexis and grammar – also play an important role at the morphological level in Turkish. It is argued that current psycholinguistic models of agglutinative morphology need to be complexified to incorporate such patterns. Conclusions are also drawn for the practice of Turkish as a Foreign Language teaching and for the methodology of Turkish corpus linguistics.
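    The morpheme-level ‘bundles’ the study refers to can be approximated computationally. The sketch below is a rough illustration rather than the paper's method: it counts recurrent morpheme n-grams in pre-segmented Turkish word forms, and the words and segmentations are invented examples.

```python
# Hedged sketch: finding recurrent morpheme "bundles" in segmented Turkish
# word forms by counting within-word morpheme n-grams. The segmentations are
# illustrative; a real study would use a morphological analyser and a corpus.
from collections import Counter

def morpheme_ngrams(segmented_words, n=3):
    """Count n-grams of morphemes within each word (not across word boundaries)."""
    counts = Counter()
    for morphemes in segmented_words:
        for i in range(len(morphemes) - n + 1):
            counts[tuple(morphemes[i:i + n])] += 1
    return counts

words = [
    ["ev", "ler", "imiz", "de"],      # evlerimizde, "in our houses"
    ["araba", "lar", "ımız", "da"],   # arabalarımızda, "in our cars"
    ["okul", "lar", "ımız", "da"],    # okullarımızda, "in our schools"
]
print(morpheme_ngrams(words, n=3).most_common(2))
```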

    Eight Dimensions for the Emotions

    The author proposes a dimensional model of our emotion concepts that is intended to be largely independent of one’s theory of emotions and applicable to the different ways in which emotions are measured. He outlines some conditions for selecting the dimensions based on these motivations and general conceptual grounds. Given these conditions he then advances an 8-dimensional model that is shown to effectively differentiate emotion labels both within and across cultures, as well as more obscure expressive language. The 8 dimensions are: (1) attracted—repulsed, (2) powerful—weak, (3) free—constrained, (4) certain—uncertain, (5) generalized—focused, (6) future directed—past directed, (7) enduring—sudden, (8) socially connected—disconnected.
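    To make the dimensional representation concrete, here is a small sketch that treats an emotion concept as a profile of scores on the eight dimensions and compares two profiles; the numeric values are invented placeholders, not ratings from the article.

```python
# Hedged sketch: an emotion concept as a profile over the article's eight
# dimensions, with a simple distance for differentiating emotion labels.
# The numeric scores are invented placeholders, not values from the article.
import math

DIMENSIONS = [
    "attracted-repulsed", "powerful-weak", "free-constrained",
    "certain-uncertain", "generalized-focused", "future-past",
    "enduring-sudden", "connected-disconnected",
]

def distance(a, b):
    """Euclidean distance between two eight-dimensional emotion profiles."""
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in DIMENSIONS))

joy = dict(zip(DIMENSIONS, [0.9, 0.6, 0.7, 0.6, -0.3, 0.4, 0.5, 0.8]))
fear = dict(zip(DIMENSIONS, [-0.8, -0.7, -0.6, -0.6, 0.2, 0.7, -0.5, -0.4]))
print(round(distance(joy, fear), 2))
```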

    Text Segmentation Using Exponential Models

    This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrate the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and broadcast news transcripts. Comment: 12 pages, LaTeX source and postscript figures for EMNLP-2 paper.
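    The proposed error metric is probabilistic rather than boundary-exact. The sketch below implements a window-based metric in that spirit (the formulation widely known as Pk in later segmentation work), under the assumption that reference and hypothesis are given as per-word segment labels; it is an illustration, not the paper's exact definition.

```python
# Hedged sketch of a window-based, probabilistic segmentation error metric in
# the spirit of the one proposed in the paper (widely known as Pk in later
# work): sample word positions k apart and count how often reference and
# hypothesis disagree about whether the two positions lie in the same segment.
def window_error(reference, hypothesis, k=None):
    """`reference` and `hypothesis` are lists of segment labels, one per word."""
    n = len(reference)
    if k is None:
        # common convention: half the average reference segment length
        k = max(1, int(n / (2 * len(set(reference)))))
    disagreements = sum(
        (reference[i] == reference[i + k]) != (hypothesis[i] == hypothesis[i + k])
        for i in range(n - k)
    )
    return disagreements / (n - k)

reference = [0] * 5 + [1] * 5 + [2] * 5     # three true segments
hypothesis = [0] * 7 + [1] * 3 + [2] * 5    # one misplaced boundary
print(round(window_error(reference, hypothesis), 3))
```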