
    Using distributional similarity to organise biomedical terminology

    We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked-up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy.
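
The pipeline sketched above (parse, collect dependency features per term, compare feature vectors) can be illustrated as follows. The term names, dependency features, and counts are invented for illustration, and cosine similarity stands in for the paper's unspecified "variety of different measures":

```python
from math import sqrt

# Toy dependency-feature counts of the kind a parser such as Pro3Gres
# might produce; terms, features, and counts are invented.
vectors = {
    "interleukin-2":    {("obj", "activate"): 5, ("subj", "bind"): 2, ("mod", "human"): 3},
    "interferon-alpha": {("obj", "activate"): 4, ("subj", "bind"): 1, ("mod", "human"): 2},
    "T cell":           {("subj", "proliferate"): 6, ("mod", "human"): 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(c * v[f] for f, c in u.items() if f in v)
    norm = lambda w: sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v)) if dot else 0.0

# Terms sharing grammatical contexts come out as distributionally close,
# which is what lets similarity predict semantic type.
sim_cytokines = cosine(vectors["interleukin-2"], vectors["interferon-alpha"])
sim_mixed = cosine(vectors["interleukin-2"], vectors["T cell"])
```

On this toy data the two cytokine-like terms score far higher than the cytokine/cell-type pair, mirroring the semantic-proximity prediction evaluated against the GENIA ontology.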

    Polyparadigmatic Analysis of Multicomponent Terminological Word Combinations in the Oil and Gas Industry

    Results of a polyparadigmatic analysis of multicomponent terminological word combinations in the oil and gas industry; the analysis is proposed as the primary means of describing technical terminology.

    Information Science in the web era: a term-based approach to domain mapping.

    We propose a methodology for mapping research in the Information Science (IS) field based on a combined use of symbolic (linguistic) and numeric information. Using the same list of 12 IS journals as earlier studies on this topic (White & McCain 1998; Zhao & Strotmann 2008a&b), we mapped the structure of research in IS for two consecutive periods: 1996-2005 and 2006-2008. We focused on mapping the content of scientific publications from the title and abstract fields of the underlying publications. The labels of clusters were automatically derived from titles and abstracts based on linguistic criteria. The results showed that while Information Retrieval (IR) and Citation studies continued to be the two structuring poles of research in IS, other prominent poles have emerged: webometrics in the first period (1996-2005) evolved into general web studies in the second period, integrating more aspects of IR research. Hence web studies and IR are more interwoven. User studies persist in IS but are now dispersed among the web studies and IR poles. Some recent trends in IR research, such as automatic summarization and the use of language models, were also highlighted by our method. Theoretical research on "information science" continues to occupy a smaller but persistent place. Citation studies, on the other hand, remain a monolithic block, isolated from the two other poles (IR and web studies) save for a tenuous link through user studies. Citation studies have also recently evolved internally to accommodate newcomers like the h-index, Google Scholar, and the open access model. All these results were automatically generated by our method without resorting to manual labeling of specialties or reading the publication titles.
    Our results show that mapping domain knowledge structures at the term level offers a more detailed and intuitive picture of the field as well as capturing emerging trends.

    Terminology extraction: an analysis of linguistic and statistical approaches

    Are linguistic properties and behaviours important for recognizing terms? Are statistical measures effective for extracting terms? Is it possible to capture a sort of termhood with computational linguistic techniques? Or are terms too sensitive to exogenous and pragmatic factors to be confined within computational linguistics? All these questions are still open. This study tries to contribute to the search for an answer, in the belief that one can be found only through careful experimental analysis of real case studies and a study of their correlation with theoretical insights.

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high-precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representation. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, for the AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representational system for AMWEs consisting of a multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic.
    The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.

    The Role of Formulaic Language in the Creation of Grammar

    Research in the field of formulaic language has shown it to be a very diverse phenomenon in both the form it takes and the functions it performs (e.g., Erman and Warren, 2000; Wray, 2002). Sinclair (1991) proposes that language as a system is organized according to two principles: the 'idiom principle', which includes the use of all multi-word prefabricated sequences, and the 'open choice principle', which covers word-for-word operations. Formulaic language is the embodiment of the idiom principle and constitutes the core of linguistic structure. Therefore, it must be subjected to scientific scrutiny from a variety of perspectives: typological, psycholinguistic, socio-pragmatic, and language acquisition. This dissertation reports on the percentage of formulaic sequences (prefabs) in spoken and written Russian; the distribution of prefab types across two spoken and four written genres; their interaction with non-prefabricated language; and the impact that prefabs have on the structure of a particular language type. Russian is typologically and structurally different from English: it has free word order, a rich inflectional system for coding grammatical relations, and a satellite verb system. I hypothesize that these structural differences influence the quantity and nature of the formulaic sequences used in the language, the way prefabricated and non-prefabricated strings alternate, and speakers' preference for one of the aforementioned principles over the other. The method applied in the analysis of Russian prefabs was developed by Erman and Warren (2000) and was originally applied to the analysis of English texts. This dissertation addresses the methodological issue of applying this method to typologically different languages.
    It has been argued (Garcia and Florimon van Putte, 1989) that the fixedness of English word order contributes to the co-occurrence of elements and the formation of formulaic sequences in English. On that view, formulaic language becomes a language-specific tendency pertaining to English, not a universal mechanism for language storage, processing, production, and use. The findings support usage-based approaches driven by forces resulting from frequency of use and discourse and communicative functions, grounded in the fine balance between the economy principle and the power of linguistic creativity. The results of the study are used to draw implications for language processing and language modeling. As we continue to perfect the methods of identification, classification, and analysis of formulaic sequences, we will be in a better position to describe not only the amount but also the nature of formulaic language, its interaction with non-formulas, and the impact this alternation has on linguistic structure as a whole. The current study investigates the nature of formulaic language in a free word order language, applying methods for the identification, classification, and analysis of prefabs, their interaction with each other and with non-formulaic language, and the estimation of the choices speakers make in producing spoken and written language. The results suggest that a free word order language uses at least as many prefabs as a fixed word order language: on average, in a free word order language like Russian, 65% of spoken and 58% of written language is composed of multiword formulaic sequences. The results strengthen the hypothesis that the idiom principle is a mechanism of global linguistic organization and processing. The proportion and distribution of prefabs is less affected by language type than by the spoken/written medium distinction and genre variation.
    In addition, the results show that prefabs are frozen structures not amenable to standard syntactic transformations even in a free word order language. The results support a dual system of language processing, holistic and analytic, present in a free word order language.
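
A coverage figure like the 65%/58% reported above can be computed, in spirit, as the share of running words that fall inside identified prefab spans. The utterance and span indices below are invented, and this is only a minimal sketch of the Erman and Warren style measure, not their actual procedure:

```python
def prefab_coverage(tokens, prefab_spans):
    """Share of running words falling inside identified prefabs,
    given half-open (start, end) token-index spans."""
    covered = set()
    for start, end in prefab_spans:
        covered.update(range(start, end))
    return len(covered) / len(tokens)

# A made-up 10-token utterance with two annotated prefabs:
# "as a matter of fact" (tokens 0-4) and "out of luck" (tokens 7-9).
tokens = "as a matter of fact he was out of luck".split()
spans = [(0, 5), (7, 10)]
coverage = prefab_coverage(tokens, spans)  # 8 of 10 tokens -> 0.8
```

Aggregating this ratio over a corpus, split by medium and genre, yields per-register coverage percentages of the kind the dissertation compares across Russian and English.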

    Ontology Extraction and Semantic Ranking of Unambiguous Requirements

    Abstract: This paper describes a new method for ontology-based standardization of concepts in a domain. In requirements engineering, abstraction of the concepts and entities in a domain is significant, as much software fails due to incorrectly elicited requirements. In this paper, we introduce a framework for requirements engineering that applies semantic ranking and significant-term extraction in a domain. This work aims to identify and present concepts and their relationships as domain-specific ontologies of particular significance. The framework is built to detect and eliminate ambiguities. A semantic graph is constructed using the semantic relatedness between ontology concepts, computed from the highest-value path connecting any pair of terms. Based on the nodes of the graph and their significance scores, both single-word and multi-word terms can be extracted from the domain documents. The result is a reference document of ontologies that will help requirements analysts create an SRS and will be useful in design.
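
The highest-value-path relatedness described above can be sketched with a brute-force search over simple paths. The concept names and edge weights are invented, and aggregating a path's score as the product of its edge weights is an assumption, since the paper does not specify how path values are combined:

```python
from itertools import permutations

# Hypothetical term graph: edge weights are pairwise semantic
# relatedness scores between domain concepts (values invented).
edges = {
    ("user", "account"): 0.8, ("account", "payment"): 0.7,
    ("user", "payment"): 0.3, ("payment", "invoice"): 0.9,
}
weight = lambda a, b: edges.get((a, b), edges.get((b, a), 0.0))
nodes = {n for e in edges for n in e}

def relatedness(src, dst):
    """Highest-value path score: maximise the product of edge
    weights over all simple paths between two terms."""
    best = weight(src, dst)
    inner = nodes - {src, dst}
    for r in range(1, len(inner) + 1):
        for mid in permutations(inner, r):
            path = (src, *mid, dst)
            score = 1.0
            for a, b in zip(path, path[1:]):
                score *= weight(a, b)
            best = max(best, score)
    return best

# The weak direct user-payment edge (0.3) is overridden by the
# stronger indirect path via "account" (0.8 * 0.7 = 0.56).
score = relatedness("user", "payment")
```

On a real ontology graph an all-permutations search would be replaced by a shortest/widest-path algorithm, but the toy version shows why an indirect path can dominate a weak direct edge.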

    Register-Specific Collocational Constructions in English and Spanish: A Usage-Based Approach

    Constructions are usage-based, conventionalised pairings of form and function within a cline of complexity and schematisation. Most research within Construction Grammar has focused on the monolingual description of schematic constructions, mainly in English but to a lesser extent in other languages as well. By contrast, very few constructional analyses have been carried out across languages. In this study we focus on a type of partially substantive construction from the point of view of contrastive analysis and translation, which, to the best of our knowledge, makes this one of the first studies of its kind. The first half of the article lays down the theoretical foundations of the study and introduces Construction Grammar as well as other formalisms used in the literature to provide a construal account of collocations, a pervasive phenomenon in language. The experimental part describes a case study of V NP collocations with disease/enfermedad in comparable corpora in English and Spanish, both in the general domain and in the specialised medical domain. A comparative analysis of these constructions across domains and languages is provided in terms of token-type ratio (constructional restriction rate), lexical function, type of determiner, frequency ranking of the verbal collocate, and domain specificity of collocates, among others. New measures to assess construal bondness are put forward (lexical filledness rate and individual productivity rate), and special attention is paid to register-dependent equivalent semantic-functional counterparts in English and Spanish, as well as mismatches.
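
The token-type ratio used above as a constructional restriction rate can be sketched on the verbal slot of a [V NP] collocation. The collocate list and counts below are invented, not taken from the study's corpora:

```python
from collections import Counter

# Hypothetical verbal collocates observed for the [V disease] slot
# in a small English medical sample (counts invented).
collocates = ["treat", "treat", "cure", "prevent", "treat", "cure",
              "diagnose", "prevent", "treat"]

counts = Counter(collocates)
tokens = sum(counts.values())   # 9 slot fillers observed
types_ = len(counts)            # 4 distinct verbs

# Token-type ratio as a constructional restriction rate: a high value
# means a few verbs dominate the slot (strong lexical restriction).
restriction_rate = tokens / types_  # 9 / 4 = 2.25
```

Comparing this rate for the same construction across the general and medical domains, and across English and Spanish, is the kind of contrast the article's comparative analysis draws.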

    Can human association norms evaluate latent semantic analysis?

    This paper presents a comparison of word association norms created in a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how well those automatically generated lists reflect real semantic relations.
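
The Church and Hanks association measure referenced above is, at its core, pointwise mutual information between co-occurring words. A minimal sketch over an invented toy corpus follows; the original work used large corpora and window-based counts, simplified here to adjacent bigrams:

```python
from math import log2
from collections import Counter

# Tiny invented corpus for illustration only.
corpus = ("strong tea strong coffee strong tea weak tea "
          "strong argument powerful computer").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def pmi(w1, w2):
    """Pointwise mutual information over adjacent pairs:
    log2( P(w1,w2) / (P(w1) * P(w2)) )."""
    p_joint = bigrams[(w1, w2)] / (N - 1)
    p1, p2 = unigrams[w1] / N, unigrams[w2] / N
    return log2(p_joint / (p1 * p2)) if p_joint else float("-inf")

# Words that co-occur more often than chance score positively,
# which is what ranks them highly in an association list.
score = pmi("strong", "tea")
```

Sorting a target word's neighbours by this score yields the kind of automatically generated association list the paper compares against human norms and LSA output.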