54 research outputs found

    Workshop Proceedings of the 12th edition of the KONVENS conference

    Get PDF
    The 2014 issue of KONVENS is even more a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies such interaction, cooperation and integrated views can produce. This topic at the crossroads of different research traditions which deal with natural language as a container of knowledge, and with methods to extract and manage knowledge that is linguistically represented is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute’s research topics, and it has received even more attention over the last few years

    Null Element Restoration

    Get PDF
    Understanding the syntactic structure of a sentence is a necessary preliminary to understanding its semantics and therefore for many practical applications. The field of natural language processing has achieved a high degree of accuracy in parsing, at least in English. However, the syntactic structures produced by the most commonly used parsers are less detailed than those structures found in the treebanks the parsers were trained on. In particular, these parsers typically lack the null elements used to indicate wh-movement, control, and other phenomena. This thesis presents a system for inserting these null elements into parse trees in English. It then examines the problem in Arabic, which motivates a second, joint- inference system which has improved performance on English as well. Finally, it examines the application of information derived from the Google Web 1T corpus as a way of reducing certain data sparsity issues related to wh-movement

    Positive words carry less information than negative words

    Get PDF
    We show that the frequency of word use is not only determined by the word length \cite{Zipf1935} and the average information content \cite{Piantadosi2011}, but also by its emotional content. We have analyzed three established lexica of affective word usage in English, German, and Spanish, to verify that these lexica have a neutral, unbiased, emotional content. Taking into account the frequency of word usage, we find that words with a positive emotional content are more frequently used. This lends support to Pollyanna hypothesis \cite{Boucher1969} that there should be a positive bias in human expression. We also find that negative words contain more information than positive words, as the informativeness of a word increases uniformly with its valence decrease. Our findings support earlier conjectures about (i) the relation between word frequency and information content, and (ii) the impact of positive emotions on communication and social links.Comment: 16 pages, 3 figures, 3 table

    Exploiting Wikipedia Semantics for Computing Word Associations

    No full text
    Semantic association computation is the process of automatically quantifying the strength of a semantic connection between two textual units based on various lexical and semantic relations such as hyponymy (car and vehicle) and functional associations (bank and manager). Humans have can infer implicit relationships between two textual units based on their knowledge about the world and their ability to reason about that knowledge. Automatically imitating this behavior is limited by restricted knowledge and poor ability to infer hidden relations. Various factors affect the performance of automated approaches to computing semantic association strength. One critical factor is the selection of a suitable knowledge source for extracting knowledge about the implicit semantic relations. In the past few years, semantic association computation approaches have started to exploit web-originated resources as substitutes for conventional lexical semantic resources such as thesauri, machine readable dictionaries and lexical databases. These conventional knowledge sources suffer from limitations such as coverage issues, high construction and maintenance costs and limited availability. To overcome these issues one solution is to use the wisdom of crowds in the form of collaboratively constructed knowledge sources. An excellent example of such knowledge sources is Wikipedia which stores detailed information not only about the concepts themselves but also about various aspects of the relations among concepts. The overall goal of this thesis is to demonstrate that using Wikipedia for computing word association strength yields better estimates of humans' associations than the approaches based on other structured and unstructured knowledge sources. There are two key challenges to achieve this goal: first, to exploit various semantic association models based on different aspects of Wikipedia in developing new measures of semantic associations; and second, to evaluate these measures compared to human performance in a range of tasks. The focus of the thesis is on exploring two aspects of Wikipedia: as a formal knowledge source, and as an informal text corpus. The first contribution of the work included in the thesis is that it effectively exploited the knowledge source aspect of Wikipedia by developing new measures of semantic associations based on Wikipedia hyperlink structure, informative-content of articles and combinations of both elements. It was found that Wikipedia can be effectively used for computing noun-noun similarity. It was also found that a model based on hybrid combinations of Wikipedia structure and informative-content based features performs better than those based on individual features. It was also found that the structure based measures outperformed the informative content based measures on both semantic similarity and semantic relatedness computation tasks. The second contribution of the research work in the thesis is that it effectively exploited the corpus aspect of Wikipedia by developing a new measure of semantic association based on asymmetric word associations. The thesis introduced the concept of asymmetric associations based measure using the idea of directional context inspired by the free word association task. The underlying assumption was that the association strength can change with the changing context. It was found that the asymmetric association based measure performed better than the symmetric measures on semantic association computation, relatedness based word choice and causality detection tasks. However, asymmetric-associations based measures have no advantage for synonymy-based word choice tasks. It was also found that Wikipedia is not a good knowledge source for capturing verb-relations due to its focus on encyclopedic concepts specially nouns. It is hoped that future research will build on the experiments and discussions presented in this thesis to explore new avenues using Wikipedia for finding deeper and semantically more meaningful associations in a wide range of application areas based on humans' estimates of word associations

    Text mining and natural language processing for the early stages of space mission design

    Get PDF
    Final thesis submitted December 2021 - degree awarded in 2022A considerable amount of data related to space mission design has been accumulated since artificial satellites started to venture into space in the 1950s. This data has today become an overwhelming volume of information, triggering a significant knowledge reuse bottleneck at the early stages of space mission design. Meanwhile, virtual assistants, text mining and Natural Language Processing techniques have become pervasive to our daily life. The work presented in this thesis is one of the first attempts to bridge the gap between the worlds of space systems engineering and text mining. Several novel models are thus developed and implemented here, targeting the structuring of accumulated data through an ontology, but also tasks commonly performed by systems engineers such as requirement management and heritage analysis. A first collection of documents related to space systems is gathered for the training of these methods. Eventually, this work aims to pave the way towards the development of a Design Engineering Assistant (DEA) for the early stages of space mission design. It is also hoped that this work will actively contribute to the integration of text mining and Natural Language Processing methods in the field of space mission design, enhancing current design processes.A considerable amount of data related to space mission design has been accumulated since artificial satellites started to venture into space in the 1950s. This data has today become an overwhelming volume of information, triggering a significant knowledge reuse bottleneck at the early stages of space mission design. Meanwhile, virtual assistants, text mining and Natural Language Processing techniques have become pervasive to our daily life. The work presented in this thesis is one of the first attempts to bridge the gap between the worlds of space systems engineering and text mining. Several novel models are thus developed and implemented here, targeting the structuring of accumulated data through an ontology, but also tasks commonly performed by systems engineers such as requirement management and heritage analysis. A first collection of documents related to space systems is gathered for the training of these methods. Eventually, this work aims to pave the way towards the development of a Design Engineering Assistant (DEA) for the early stages of space mission design. It is also hoped that this work will actively contribute to the integration of text mining and Natural Language Processing methods in the field of space mission design, enhancing current design processes

    Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

    Get PDF

    Towards the extraction of cross-sentence relations through event extraction and entity coreference

    Get PDF
    Cross-sentence relation extraction deals with the extraction of relations beyond the sentence boundary. This thesis focuses on two of the NLP tasks which are of importance to the successful extraction of cross-sentence relation mentions: event extraction and coreference resolution. The first part of the thesis focuses on addressing data sparsity issues in event extraction. We propose a self-training approach for obtaining additional labeled examples for the task. The process starts off with a Bi-LSTM event tagger trained on a small labeled data set which is used to discover new event instances in a large collection of unstructured text. The high confidence model predictions are selected to construct a data set of automatically-labeled training examples. We present several ways in which the resulting data set can be used for re-training the event tagger in conjunction with the initial labeled data. The best configuration achieves statistically significant improvement over the baseline on the ACE 2005 test set (macro-F1), as well as in a 10-fold cross validation (micro- and macro-F1) evaluation. Our error analysis reveals that the augmentation approach is especially beneficial for the classification of the most under-represented event types in the original data set. The second part of the thesis focuses on the problem of coreference resolution. While a certain level of precision can be reached by modeling surface information about entity mentions, their successful resolution often depends on semantic or world knowledge. This thesis investigates an unsupervised source of such knowledge, namely distributed word representations. We present several ways in which word embeddings can be utilized to extract features for a supervised coreference resolver. Our evaluation results and error analysis show that each of these features helps improve over the baseline coreference system’s performance, with a statistically significant improvement (CoNLL F1) achieved when the proposed features are used jointly. Moreover, all features lead to a reduction in the amount of precision errors in resolving references between common nouns, demonstrating that they successfully incorporate semantic information into the process

    Multiword expressions at length and in depth

    Get PDF
    The annual workshop on multiword expressions takes place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point point of reference for future work

    The Processing of Lexical Sequences

    Get PDF
    Psycholinguistics has traditionally been defined as the study of how we process units of language such as letters, words and sentences. But what about other units? This dissertation concerns itself with short lexical sequences called n- grams, longer than words but shorter than most sentences. N-grams can be phrases (such as the 3-gram "the great divide") or just fragments (such as the 4- gram means "nothing to a"). Words are often thought to be the universal, atomic building block of longer lexical sequences, but n-grams are equally capable of carrying meaning and being combined to create any sentence. Are n-grams more than just the sum of their parts (the sum of their words)? How do language users process n-grams when they are asked to read them or produce them? Using evidence that I have gathered, I will address these and other questions with the goal of better understanding n-gram processing
    • …
    corecore