524 research outputs found

    Coreference Resolution Evaluation Based on Descriptive Specificity

    Get PDF
    International audienceThis paper introduces a new evaluation method for the coreference resolution task. Considering that coreference resolution is a matter of linking expressions to discourse referents, we set our evaluation criteron in terms of an evaluation of the denotations assigned to the expressions. This criterion requires that the coreference chains identified in one annotation stand in a one-to-one correspondence with the coreference chains in the other. To determine this correspondence and with a view to keep closer to what human interpretation of the coreference chains would be, we take into account the fact that, in a coreference chain, some expressions are more specific to their referent than others. With this observation in mind, we measure the similarity between the chains in one annotation and the chains in the other, and then compute the optimal similarity between the two annotations. Evaluation then consists in checking whether the denotations assigned to the expressions are correct or not. New measures to analyse errors are also introduced. A comparison with other methods is given at the end of the paper

    An Evaluation of Inter-Annotator Agreement in the Observation of Anaphoric and Referential Relations

    Get PDF
    International audienceWhen proposing a description of the data he observes, the linguist must make sure that his observations may be also regularly made by other persons. In this paper, we introduce a typology of anaphoric and referential relations and an experiment which aims at assessing that this typology is operational. Given three newspaper articles, five students were asked to identify anaphoric and/or referential relations between expressions and referents. This inter-subjectivity test confirms results already obtained: coreference is an operational notion, but the perspicuity of other relations is not obvious

    Verbose, Laconic or Just Right: A Simple Computational Model of Content Appropriateness under Length Constraints

    Get PDF
    Length constraints impose implicit requirements on the type of content that can be included in a text. Here we pro- pose the first model to computationally assess if a text deviates from these requirements. Specifically, our model predicts the appropriate length for texts based on content types present in a snippet of constant length. We consider a range of features to approximate content type, including syntactic phrasing, constituent compression probability, presence of named entities, sentence specificity and intersentence continuity. Weights for these features are learned using a corpus of summaries written by experts and on high quality journalistic writing. During test time, the difference between actual and predicted length allows us to quantify text verbosity. We use data from manual evaluation of summarization systems to assess the verbosity scores produced by our model. We show that the automatic verbosity scores are significantly negatively correlated with manual content quality scores given to the summaries

    From Information Overload to Knowledge Graphs: An Automatic Information Process Model

    Get PDF
    Continuously increasing text data such as news, articles, and scientific papers from the Internet have caused the information overload problem. Collecting valuable information as well as coding the information efficiently from enormous amounts of unstructured textual information becomes a big challenge in the information explosion age. Although many solutions and methods have been developed to reduce information overload, such as the deduction of duplicated information, the adoption of personal information management strategies, and so on, most of the existing methods only partially solve the problem. What’s more, many existing solutions are out of date and not compatible with the rapid development of new modern technology techniques. Thus, an effective and efficient approach with new modern IT (Information Technology) techniques that can collect valuable information and extract high-quality information has become urgent and critical for many researchers in the information overload age. Based on the principles of Design Science Theory, the paper presents a novel approach to tackle information overload issues. The proposed solution is an automated information process model that employs advanced IT techniques such as web scraping, natural language processing, and knowledge graphs. The model can automatically process the full cycle of information flow, from information Search to information Collection, Information Extraction, and Information Visualization, making it a comprehensive and intelligent information process tool. The paper presents the model capability to gather critical information and convert unstructured text data into a structured data model with greater efficiency and effectiveness. In addition, the paper presents multiple use cases to validate the feasibility and practicality of the model. Furthermore, the paper also performed both quantitative and qualitative evaluation processes to assess its effectiveness. The results indicate that the proposed model significantly reduces the information overload and is valuable for both academic and real-world research

    Improved Coreference Resolution Using Cognitive Insights

    Get PDF
    Coreference resolution is the task of extracting referential expressions, or mentions, in text and clustering these by the entity or concept they refer to. The sustained research interest in the task reflects the richness of reference expression usage in natural language and the difficulty in encoding insights from linguistic and cognitive theories effectively. In this thesis, we design and implement LIMERIC, a state-of-the-art coreference resolution engine. LIMERIC naturally incorporates both non-local decoding and entity-level modelling to achieve the highly competitive benchmark performance of 64.22% and 59.99% on the CoNLL-2012 benchmark with a simple model and a baseline feature set. As well as strong performance, a key contribution of this work is a reconceptualisation of the coreference task. We draw an analogy between shift-reduce parsing and coreference resolution to develop an algorithm which naturally mimics cognitive models of human discourse processing. In our feature development work, we leverage insights from cognitive theories to improve our modelling. Each contribution achieves statistically significant improvements and sum to gains of 1.65% and 1.66% on the CoNLL-2012 benchmark, yielding performance values of 65.76% and 61.27%. For each novel feature we propose, we contribute an accompanying analysis so as to better understand how cognitive theories apply to real language data. LIMERIC is at once a platform for exploring cognitive insights into coreference and a viable alternative to current systems. We are excited by the promise of incorporating our and further cognitive insights into more complex frameworks since this has the potential to both improve the performance of computational models, as well as our understanding of the mechanisms underpinning human reference resolution

    Linking named entities to Wikipedia

    Get PDF
    Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case, Wikipedia. The named entity linking (NEL) task requires systems to identify the KB entry, or Wikipedia article, that a mention refers to; or, if the KB does not contain the correct entry, return NIL. Entity linking systems can be complex and we present a framework for analysing their different components, which we use to analyse three seminal systems which are evaluated on a common dataset and we show the importance of precise search for linking. The Text Analysis Conference (TAC) is a major venue for NEL research. We report on our submissions to the entity linking shared task in 2010, 2011 and 2012. The information required to disambiguate entities is often found in the text, close to the mention. We explore apposition, a common way for authors to provide information about entities. We model syntactic and semantic restrictions with a joint model that achieves state-of-the-art apposition extraction performance. We generalise from apposition to examine local descriptions specified close to the mention. We add local description to our state-of-the-art linker by using patterns to extract the descriptions and matching against this restricted context. Not only does this make for a more precise match, we are also able to model failure to match. Local descriptions help disambiguate entities, further improving our state-of-the-art linker. The work in this thesis seeks to link textual entity mentions to knowledge bases. Linking is important for any task where external world knowledge is used and resolving ambiguity is fundamental to advancing research into these problems

    Predicting Text Quality: Metrics for Content, Organization and Reader Interest

    Get PDF
    When people read articles---news, fiction or technical---most of the time if not always, they form perceptions about its quality. Some articles are well-written and others are poorly written. This thesis explores if such judgements can be automated so that they can be incorporated into applications such as information retrieval and automatic summarization. Text quality does not involve a single aspect but is a combination of numerous and diverse criteria including spelling, grammar, organization, informative nature, creative and beautiful language use, and page layout. In the education domain, comprehensive lists of such properties are outlined in the rubrics used for assessing writing. But computational methods for text quality have addressed only a handful of these aspects, mainly related to spelling, grammar and organization. In addition, some text quality aspects could be more relevant for one genre versus another. But previous work have placed little focus on specialized metrics based on the genre of texts. This thesis proposes new insights and techniques to address the above issues. We introduce metrics that score varied dimensions of quality such as content, organization and reader interest. For content, we present two measures: specificity and verbosity level. Specificity measures the amount of detail present in a text while verbosity captures which details are essential to include. We measure organization quality by quantifying the regularity of the intentional structure in the article and also using the specificity levels of adjacent sentences in the text. Our reader interest metrics aim to identify engaging and interesting articles. The development of these measures is backed by the use of articles from three different genres: academic writing, science journalism and automatically generated summaries. Proper presentation of content is critical during summarization because summaries have a word limit. Our specificity and verbosity metrics are developed with this genre as the focus. The argumentation structure of academic writing lends support to the idea of using intentional structure to model organization quality. Science journalism articles convey research findings in an engaging manner and are ideally suited for the development and evaluation of measures related to reader interest
    • 

    corecore