16 research outputs found

    Knowledge Sharing from Domain-specific Documents

    Recently, collaborative discussion based on participant-generated documents, e.g., customer questionnaires, aviation reports and medical records, has been required in various fields such as marketing, transport and medical treatment, in order to share knowledge that is crucial to maintaining safety, e.g., avoiding air-traffic accidents and medical malpractice. We introduce several natural language processing techniques for extracting information from such text data and verify their validity using aviation documents as an example. We automatically and statistically extract from the documents related words that hold not only taxonomical relations, such as synonymy, but also thematic (non-taxonomical) relations, including causal and entailment relations. These related words are useful for sharing information among participants. Moreover, we acquire domain-specific terms and phrases from the documents in order to pick out and share the important topics in such reports.
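    The statistical extraction of related words sketched above can be illustrated with a simple co-occurrence statistic. The abstract does not name its exact measure, so pointwise mutual information (PMI) over document-level co-occurrence is assumed here purely for illustration:

```python
import math
from collections import Counter
from itertools import combinations

def related_words(documents, min_count=2):
    """Score word pairs by pointwise mutual information (PMI)
    computed from document-level co-occurrence counts."""
    word_counts = Counter()   # number of documents containing each word
    pair_counts = Counter()   # number of documents containing both words
    n_docs = len(documents)
    for doc in documents:
        words = set(doc.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    scores = {}
    for pair, count in pair_counts.items():
        if count < min_count:
            continue  # too rare to score reliably
        w1, w2 = tuple(pair)
        pmi = math.log2(count * n_docs / (word_counts[w1] * word_counts[w2]))
        scores[tuple(sorted(pair))] = pmi
    return scores

# Toy aviation-style reports (invented examples, not from the paper).
docs = ["engine failure caused the delay",
        "engine failure during taxi",
        "weather caused the delay"]
scores = related_words(docs)
```

    Pairs that co-occur more often than chance predicts (such as "engine" and "failure" here) receive positive scores; capturing thematic relations such as causality would need richer cues than bare co-occurrence.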

    An Improved Corpus Comparison Approach to Domain Specific Term Recognition

    PACLIC / The University of the Philippines Visayas Cebu College Cebu City, Philippines / November 20-22, 200

    A Comparative Study of the Effect of Word Segmentation On Chinese Terminology Extraction

    PACLIC 20 / Wuhan, China / 1-3 November, 200

    Terminology extraction: an analysis of linguistic and statistical approaches

    Are linguistic properties and behaviors important for recognizing terms? Are statistical measures effective for extracting terms? Is it possible to capture a notion of termhood with computational linguistic techniques? Or are terms too sensitive to exogenous and pragmatic factors to be confined within computational linguistics? All of these questions are still open. This study tries to contribute to the search for an answer, in the belief that one can be found only through careful experimental analysis of real case studies and study of their correlation with theoretical insights.

    DEXTER: Automatic Extraction of Domain-Specific Glossaries for Language Teaching

    [EN] Many researchers emphasize the importance of corpora in the design of Language-for-Specific-Purposes courses in higher education. However, identifying the lexical units that belong to a given specific domain is often a complex task for language teachers, where simple introspection or concordance analysis is not really effective. The goal of this paper is to describe DEXTER, an open-access platform for data mining and terminology management, whose aim is not only the search, retrieval, exploration and analysis of texts in domain-specific corpora but also the automatic extraction of specialized words from the domain.
    Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2011-29798-C02-01.
    Periñán Pascual, J.C.; Mestre-Mestre, E.M. (2015). DEXTER: Automatic Extraction of Domain-Specific Glossaries for Language Teaching. Procedia Social and Behavioral Sciences. 198:377-385. https://doi.org/10.1016/j.sbspro.2015.07.457

    The underpinnings of a composite measure for automatic term extraction: The case of SRC

    The corpus-based identification of the lexical units that serve to describe a given specialized domain usually becomes a complex task, where an analysis oriented to the frequency of words and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of non-structured texts. Unlike most research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded in the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the three components of this metric.
    Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grants FFI2011-29798-C02-01 and FFI2014-53788-C3-1-P.
    Periñán Pascual, J.C. (2015). The underpinnings of a composite measure for automatic term extraction: The case of SRC. Terminology. 21(2):151-179. doi:10.1075/term.21.2.02per
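    The abstract names salience, relevance and cohesion as the three components of SRC. As an illustration only, a user-adjustable composite can be sketched as a weighted geometric mean of three component scores; the weighting scheme, function name and component inputs below are assumptions, not the actual SRC definition from the article:

```python
def src_like(salience, relevance, cohesion, weights=(1.0, 1.0, 1.0)):
    """Combine three component scores in [0, 1] using a weighted
    geometric mean with user-adjustable weights (illustrative only;
    the real SRC components are defined in the article itself)."""
    ws, wr, wc = weights
    total = ws + wr + wc
    return (salience ** ws * relevance ** wr * cohesion ** wc) ** (1.0 / total)

# Equal weights by default; a user can emphasize one component instead.
balanced = src_like(0.9, 0.5, 0.7)
salience_only = src_like(0.9, 0.5, 0.7, weights=(1.0, 0.0, 0.0))
```

    The geometric mean keeps the composite user-adjustable while penalizing candidates that fail badly on any single component, which is one common rationale for combining heterogeneous term-scoring evidence.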

    Experimental Evaluation of Ranking and Selection Methods in Term Extraction

    An automatic term extraction system consists of a term-candidate extraction subsystem, a ranking subsystem and a selection subsystem. In this paper, we experimentally evaluate two ranking methods and two selection methods. For ranking, the dichotomy of unithood and termhood is a key notion. We evaluate these two notions experimentally by comparing the Imp-based ranking method, which is based directly on termhood, with the C-value-based method, which is based indirectly on both termhood and unithood. For selection, we compare the simple threshold method with the window method that we propose. We performed the experimental evaluation with several Japanese technical manuals. The results do not show much difference in recall and precision. The small difference between the terms extracted by the two ranking methods depends on their ranking mechanisms per se.
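    The C-value ranking compared above is well documented in the term-extraction literature: a candidate's frequency is discounted by the average frequency of longer candidates that contain it, then weighted by the logarithm of its length. A minimal sketch, assuming candidates arrive as token tuples with corpus frequencies (nested-occurrence bookkeeping is simplified here):

```python
import math

def c_value(candidates):
    """C-value scores for multi-word term candidates.
    candidates: dict mapping a term (tuple of tokens) to its frequency."""
    scores = {}
    for a, freq in candidates.items():
        # Frequencies of longer candidates containing `a` contiguously.
        nests = [f for b, f in candidates.items()
                 if len(b) > len(a)
                 and any(b[i:i + len(a)] == a
                         for i in range(len(b) - len(a) + 1))]
        # Discount occurrences explained by longer terms, then weight by length.
        adjusted = freq - sum(nests) / len(nests) if nests else freq
        scores[a] = math.log2(len(a)) * adjusted
    return scores

candidates = {("floating", "point"): 5,
              ("floating", "point", "arithmetic"): 2}
scores = c_value(candidates)
```

    Note that log2 of the length is zero for single tokens; classic C-value targets multi-word candidates, which is where unithood matters.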

    DEXTER: A workbench for automatic term extraction with specialized corpora

    [EN] Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three features contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard ill-formed term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distributions of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded in the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.
    Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P.
    Periñán-Pascual, C. (2018). DEXTER: A workbench for automatic term extraction with specialized corpora. Natural Language Engineering. 24(2):163-198. https://doi.org/10.1017/S1351324917000365
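    The stopword-detection step described above compares the frequency distribution of the domain corpus against a general corpus. A simplified sketch of that comparison treats a word as a likely stopword when it is at least as prominent in general language as in the domain; the IATE lookup that DEXTER also uses is omitted, and the function name and threshold are illustrative:

```python
from collections import Counter

def likely_stopwords(domain_tokens, general_tokens, ratio=1.0):
    """Flag words whose relative frequency in a general corpus is at
    least `ratio` times their relative frequency in the domain corpus."""
    dom = Counter(domain_tokens)
    gen = Counter(general_tokens)
    n_dom, n_gen = len(domain_tokens), len(general_tokens)
    return {w for w in dom
            if gen[w] / n_gen >= ratio * dom[w] / n_dom}

# Toy corpora: "the" is frequent everywhere, so it is flagged;
# "engine" is domain-prominent, so it survives as a term candidate.
domain = "the engine the turbine failure".split()
general = "the a the of the to".split()
stops = likely_stopwords(domain, general)
```

    Words absent from the general corpus can never be flagged, which matches the intuition that domain-exclusive vocabulary should stay in the candidate pool.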

    A Semi-Supervised Approach to the Construction of Semantic Lexicons

    A growing number of applications require dictionaries of words belonging to semantic classes present in specialized domains. Manually constructed knowledge bases often do not provide sufficient coverage of specialized vocabulary and require substantial effort to build and keep up to date. In this thesis, we propose a semi-supervised approach to the construction of domain-specific semantic lexicons based on the distributional similarity hypothesis. Our method starts with a small set of seed words representing the target class and an unannotated text corpus. It locates instances of the seed words in the text and generates lexical patterns from their contexts; these patterns in turn extract more words and phrases that belong to the semantic category, in an iterative manner. This bootstrapping process can be continued until the output lexicon reaches the desired size. We explore techniques such as learning lexicons for multiple semantic classes at the same time and using feedback from competing lexicons to increase learning precision. Evaluated on the extraction of dish names and subjective adjectives from a corpus of restaurant reviews, our approach demonstrates great flexibility in learning various word classes, as well as performance improvements over state-of-the-art bootstrapping and distributional-similarity techniques for the extraction of semantically similar words. Its shallow lexical patterns also prove superior to syntactic patterns in capturing the semantic class of words.
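    The bootstrapping loop described above (seed words, context patterns, extraction, iterate) can be sketched minimally with one-word context patterns. The thesis uses richer shallow lexical patterns and precision feedback from competing lexicons; both are omitted here, and the example sentences are invented:

```python
def bootstrap_lexicon(seeds, sentences, iterations=2):
    """Grow a lexicon from seed words by mining (previous word,
    next word) context patterns and then collecting every word
    that appears in one of those contexts."""
    lexicon = set(seeds)
    tokenized = [s.lower().split() for s in sentences]
    for _ in range(iterations):
        # Step 1: harvest patterns from contexts of current lexicon entries.
        patterns = set()
        for toks in tokenized:
            for i, word in enumerate(toks):
                if word in lexicon:
                    prev = toks[i - 1] if i > 0 else "<s>"
                    nxt = toks[i + 1] if i + 1 < len(toks) else "</s>"
                    patterns.add((prev, nxt))
        # Step 2: extract any word occurring in a known context.
        for toks in tokenized:
            for i, word in enumerate(toks):
                prev = toks[i - 1] if i > 0 else "<s>"
                nxt = toks[i + 1] if i + 1 < len(toks) else "</s>"
                if (prev, nxt) in patterns:
                    lexicon.add(word)
    return lexicon

sentences = ["I ordered the ramen yesterday",
             "I ordered the sushi yesterday"]
lexicon = bootstrap_lexicon({"ramen"}, sentences)
```

    Without a precision mechanism such as the competing-lexicon feedback the thesis proposes, this greedy loop drifts quickly, which is exactly the problem semantic drift mitigation addresses.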