330 research outputs found

    An Infrastructure for acquiring high quality semantic metadata

    Because the metadata underlying semantic web applications is gathered from distributed and heterogeneous data sources, it is important to ensure its quality (i.e., to reduce duplicates, spelling errors, and ambiguities). However, current infrastructures that acquire and integrate semantic data have only marginally addressed the issue of metadata quality. In this paper we present our metadata acquisition infrastructure, ASDI, which pays special attention to ensuring that high quality metadata is derived. Central to the architecture of ASDI is a verification engine that relies on several semantic web tools to check the quality of the derived data. We tested our prototype in the context of building a semantic web portal for our lab, KMi. An experimental evaluation comparing the automatically extracted data against manual annotations indicates that the verification engine enhances the quality of the extracted semantic metadata.
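    The abstract does not detail which checks the ASDI verification engine runs, so the sketch below shows only one plausible, much simpler kind of quality check: flagging near-duplicate entity labels with string similarity before they enter a portal. The function name and threshold are invented for illustration and are not part of ASDI.

```python
# Minimal sketch of a metadata quality check: flag near-duplicate labels
# (likely spelling variants) using stdlib string similarity.
from difflib import SequenceMatcher
from itertools import combinations

SIMILARITY_THRESHOLD = 0.9  # assumed cut-off for flagging two labels as near-duplicates

def verify_labels(labels):
    """Return pairs of labels that are suspiciously similar (likely duplicates or typos)."""
    suspects = []
    for a, b in combinations(sorted(set(labels)), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= SIMILARITY_THRESHOLD:
            suspects.append((a, b, round(ratio, 3)))
    return suspects

# Flags the misspelled duplicate while leaving distinct labels alone.
print(verify_labels(["Knowledge Media Institute", "Knowledge Media Instiute", "Open University"]))
```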

    Efficient Feature Selection in the Presence of Multiple Feature Classes

    We present an information theoretic approach to feature selection when the data possesses feature classes. Feature classes are pervasive in real data. For example, in gene expression data, the genes which serve as features may be divided into classes based on their membership in gene families or pathways. When doing word sense disambiguation or named entity extraction, features fall into classes including adjacent words, their parts of speech, and the topic and venue of the document the word is in. When predictive features occur predominantly in a small number of feature classes, our information theoretic approach significantly improves feature selection. Experiments on real and synthetic data demonstrate substantial improvement in predictive accuracy over the standard L0 penalty-based stepwise and streamwise feature selection methods as well as over Lasso and Elastic Nets, all of which are oblivious to the existence of feature classes.
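    As a rough illustration of the idea that feature classes can guide selection, the sketch below performs greedy selection in which a feature receives a small score bonus when its class has already contributed selected features. This is an illustrative approximation written for this summary, not the paper's information-theoretic criterion or penalty coding; the bonus weight and function names are assumptions.

```python
# Class-aware greedy feature selection sketch: mutual information scores plus a
# bonus for feature classes that have already proven useful.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_with_classes(X, y, feature_class, k, class_bonus=0.05):
    """X: (n, d) array, y: labels, feature_class: class id for each of the d features."""
    mi = mutual_info_classif(X, y, random_state=0)   # relevance of each feature to y
    selected, class_counts = [], {}
    for _ in range(k):
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            # features from classes that already contributed selected features get a bonus
            score = mi[j] + class_bonus * class_counts.get(feature_class[j], 0)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        class_counts[feature_class[best]] = class_counts.get(feature_class[best], 0) + 1
    return selected
```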

    Data-driven Synset Induction and Disambiguation for Wordnet Development

    Automatic methods for wordnet development in languages other than English generally exploit information found in Princeton WordNet (PWN) and translations extracted from parallel corpora. A common approach consists in preserving the structure of PWN and transferring its content into new languages using alignments, possibly combined with information extracted from multilingual semantic resources. Even if the role of PWN remains central in this process, these automatic methods offer an alternative to the manual elaboration of new wordnets. However, their limited coverage has a strong impact on that of the resulting resources. Following this line of research, we apply a cross-lingual word sense disambiguation method to wordnet development. Our approach exploits the output of a data-driven sense induction method that generates sense clusters in new languages, similar to wordnet synsets, by identifying word senses and relations in parallel corpora. We apply our cross-lingual word sense disambiguation method to the task of enriching a French wordnet resource, the WOLF, and show how it can be efficiently used for increasing its coverage. Although our experiments involve the English-French language pair, the proposed methodology is general enough to be applied to the development of wordnet resources in other languages for which parallel corpora are available. Finally, we show how the disambiguation output can serve to reduce the granularity of new wordnets and the degree of polysemy present in PWN.
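    A minimal sketch of how translation-based sense clusters might be mapped onto PWN synsets when enriching a resource such as WOLF: each cluster of English translations for a target-language word votes for the synset with the largest lexical overlap. The data structures, synset identifiers and overlap threshold are illustrative assumptions, not the method actually used in the paper.

```python
# Map a sense cluster (set of English translations) to the PWN synset whose
# lemmas overlap it most; reject the mapping if the overlap is too small.
def best_synset(cluster_translations, pwn_synsets, min_overlap=2):
    """cluster_translations: set of English words; pwn_synsets: {synset_id: set of lemmas}."""
    scored = [(len(cluster_translations & lemmas), sid) for sid, lemmas in pwn_synsets.items()]
    overlap, sid = max(scored)
    return sid if overlap >= min_overlap else None

pwn = {"bank.n.01": {"bank", "depository", "financial_institution"},
       "bank.n.09": {"bank", "riverbank", "riverside"}}
print(best_synset({"bank", "riverside", "shore"}, pwn))  # picks the riverbank synset
```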

    Lingering misinterpretations of garden path sentences arise from competing syntactic representations

    Recent work has suggested that readers' initial and incorrect interpretation of temporarily ambiguous ("garden path") sentences (e.g., Christianson, Hollingworth, Halliwell, & Ferreira, 2001) sometimes lingers even after attempts at reanalysis. These lingering effects have been attributed to incomplete reanalysis. In two eye tracking experiments, we distinguish between two types of incompleteness: the language comprehension system might not build a faithful syntactic structure, or it might not fully erase the structure built during an initial misparse. The first experiment used reflexive binding and the gender mismatch paradigm to show that a complete and faithful structure is built following processing of the garden path. The second experiment used two-sentence texts to examine the extent to which the garden-path meaning from the first sentence interferes with reading of the second. Together, the results indicate that misinterpretation effects are attributable not to failure in building a proper structure, but rather to failure in cleaning up all remnants of earlier attempts to build that syntactic representation.

    Using Games to Create Language Resources: Successes and Limitations of the Approach

    One of the more novel approaches to collaboratively creating language resources in recent years is to use online games to collect and validate data. The most significant challenges collaborative systems face are how to train users with the necessary expertise and how to encourage participation on a scale required to produce high quality data comparable with data produced by “traditional” experts. In this chapter we provide a brief overview of collaborative creation and the different approaches that have been used to create language resources, before analysing games used for this purpose. We discuss some key issues in using a gaming approach, including task design, player motivation and data quality, and compare the costs of each approach in terms of development, distribution and ongoing administration. In conclusion, we summarise the benefits and limitations of using a gaming approach to resource creation and suggest key considerations for evaluating its utility in different research scenarios.

    Automatic morphosyntactic analysis of Light Warlpiri corpus data

    Morphosyntactic analysis assigns a morphosyntactic tag (‘gloss’) to each word in a given text. Manual morphosyntactic glossing requires significant time and effort to implement on a larger scale, such as for a language corpus. Computational methods of automatic analysis can aid in automating this process. In this thesis, I applied a method of automatic morphosyntactic analysis to a set of Light Warlpiri corpus data (O’Shannessy, 2005). The method used the software tool Computerised Language Analysis (MacWhinney, 2000) to apply rules-based word analysis and syntactic disambiguation to the data. My thesis describes how this method was adapted to the morphosyntactic properties of Light Warlpiri, as well as its performance on the corpus data. Overall, the method was successfully adapted to the Light Warlpiri data, with some recurring challenges noted. Finally, the thesis discusses the variables within the workflow that affected the adaptation of the method, with emphasis on practical considerations.
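    The toy example below mimics the two stages described above, lexicon-based word analysis followed by rule-based disambiguation, in plain Python. The lexicon entries, tag labels and the single disambiguation rule are invented placeholders; they do not reflect Light Warlpiri data or the actual file formats of Computerised Language Analysis (CLAN).

```python
# Toy gloss assignment: look up candidate glosses per token, then apply a
# simple context rule to resolve ambiguous entries.
LEXICON = {
    "tokenA": ["PRO:2SG"],           # unambiguous pronoun (placeholder form)
    "tokenB": ["N:NOM"],             # unambiguous noun (placeholder form)
    "tokenC": ["AUX:PRES", "N:NOM"], # ambiguous: auxiliary or noun reading
}

def gloss(tokens):
    glossed = []
    for tok in tokens:
        candidates = LEXICON.get(tok, ["UNK"])
        if len(candidates) > 1 and glossed and glossed[-1].startswith("PRO"):
            # toy disambiguation rule: after a pronoun, prefer the auxiliary reading
            aux = [c for c in candidates if c.startswith("AUX")]
            if aux:
                candidates = aux
        glossed.append(candidates[0])
    return list(zip(tokens, glossed))

print(gloss(["tokenA", "tokenC", "tokenB"]))
# -> [('tokenA', 'PRO:2SG'), ('tokenC', 'AUX:PRES'), ('tokenB', 'N:NOM')]
```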

    An engineering approach to knowledge acquisition by the interactive analysis of dictionary definitions

    It has long been recognised that everyday dictionaries are a potential source of lexical and world knowledge of the type required by many Natural Language Processing (NLP) systems. This research presents a semi-automated approach to the extraction of rich semantic relationships from dictionary definitions. The definitions are taken from the recently published "Cambridge International Dictionary of English" (CIDE). The thesis illustrates how many of the innovative features of CIDE can be exploited during the knowledge acquisition process. The approach introduced in this thesis uses the LOLITA NLP system to extract and represent semantic relationships, along with a human operator to resolve the different forms of ambiguity which exist within dictionary definitions. Such a strategy combines the strengths of both participants in the acquisition process: automated procedures provide consistency in the construction of complex and inter-related semantic relationships, while the human participant can use his or her knowledge to determine the correct interpretation of a definition. This semi-automated strategy eliminates the weakness of many existing approaches because it guarantees feasibility and correctness: feasibility is ensured by exploiting LOLITA's existing NLP capabilities so that humans with minimal linguistic training can resolve the ambiguities within dictionary definitions; and correctness is ensured because incorrectly interpreted definitions can be manually eliminated. The feasibility and correctness of the solution are supported by the results of an evaluation which is presented in detail in the thesis.
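    The schematic sketch below captures the semi-automated loop described above: the NLP system proposes candidate semantic relations for a definition, and a human operator confirms or rejects each one. The relation format and the propose_relations stub are placeholders standing in for LOLITA's far richer analysis, and the example definition is invented.

```python
# Semi-automated acquisition loop: machine proposes, human disposes.
from typing import Callable, List, Tuple

Relation = Tuple[str, str, str]  # (head, relation, tail)

def propose_relations(headword: str, definition: str) -> List[Relation]:
    # Placeholder for the NLP system's output; a definition of "bank" might
    # yield both a correct and a spurious reading.
    return [(headword, "is_a", "financial_institution"),
            (headword, "is_a", "river_side")]

def acquire(headword: str, definition: str,
            confirm: Callable[[Relation], bool]) -> List[Relation]:
    """Keep only the relations the human operator confirms as correct."""
    return [rel for rel in propose_relations(headword, definition) if confirm(rel)]

# Usage: an operator (simulated here) rejects the spurious sense.
kept = acquire("bank", "an organization where people keep their money",
               confirm=lambda rel: rel[2] == "financial_institution")
print(kept)
```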

    Language Models for Text Understanding and Generation
