19 research outputs found

    A Supervised Learning Approach to Acronym Identification

    This paper addresses the task of finding acronym-definition pairs in text. Most previous work on the topic concerns systems built on manually generated rules or regular expressions. In this paper, we present a supervised learning approach to the acronym identification task. Our approach reduces the search space of the supervised learning system by placing weak constraints on the kinds of acronym-definition pairs that can be identified. We obtain results comparable to hand-crafted systems that use stronger constraints. We describe our method for reducing the search space, the features used by our supervised learning system, and our experiments with various learning schemes.

    Biomedical abbreviation recognition and resolution by PROSA-MED

    The number of abbreviations used in biomedical literature increases constantly. Despite the existence of acronym dictionaries, it is not feasible to keep them updated with new coinages. Thus, in the processing of biomedical texts, discovering and disambiguating acronyms and their expanded forms are essential tasks, and this is the objective of the BARR task at the IberEval 2017 Workshop. This paper presents our participation in this task. We propose five systems that deal with the problem in different ways. Three of the systems are atomic approaches, while two are combinations of the atomic systems. One of the systems clearly outperforms the others, both in detecting entities (F-score of 0.749 on the test set) and in identifying relations between short and long forms (F-score of 0.697 on the test set). Peer reviewed. Postprint (published version).

    Texting in Newsgroups - how technology may influence a language

    A comparison study on algorithms of detecting long forms for short forms in biomedical text

    Motivation: With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task, i.e., detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs). The study was designed to answer the following questions: i) how well a system performs in detecting LFs from novel text, ii) what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and iii) how to combine results from various SF knowledge bases. Method: We evaluated the following three publicly available detection systems in detecting LFs for SFs: i) a handcrafted pattern/rule based system by Ao and Takagi, ALICE, ii) a machine learning system by Chang et al., and iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: i) the UMLS (the Unified Medical Language System), and ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases. Results: We found that detection systems agree with each other on most cases, and the existing terminological knowledge bases have a good coverage of synonymous relationship for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases. Availability: The web site is http://gauss.dbb.georgetown.edu/liblab/SFThesaurus.
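
The idea of alignment-based long-form detection can be illustrated roughly as follows. This is a minimal sketch, assuming a right-to-left character match between a short form and the text that precedes it; it is not the code of any of the three evaluated systems, and the function name `find_long_form` is hypothetical.

```python
from typing import Optional


def find_long_form(short_form: str, candidate_text: str) -> Optional[str]:
    """Match short-form characters right to left against candidate_text.

    Returns the shortest trailing span of candidate_text that contains the
    short-form characters in order, or None if no such span exists.
    Assumption: the first short-form character must start a word.
    """
    s = len(short_form) - 1
    pos = len(candidate_text) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            # Skip punctuation inside the short form (e.g. hyphens).
            s -= 1
            continue
        # Move left until the character matches; for the first short-form
        # character, additionally require a word boundary on its left.
        while pos >= 0 and (
            candidate_text[pos].lower() != ch
            or (s == 0 and pos > 0 and candidate_text[pos - 1].isalnum())
        ):
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
        s -= 1
    # Extend the match to the start of the word it falls in.
    start = candidate_text.rfind(" ", 0, pos + 1) + 1
    return candidate_text[start:]


print(find_long_form("UMLS", "the Unified Medical Language System"))
```

A production system would add further checks (for example on the length of the long form relative to the short form); those are omitted here.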

    MBA: a literature mining system for extracting biomedical abbreviations

    Background: The explosive growth of the biomedical literature presents many challenges for biological researchers. One such challenge is the use of a large number of abbreviations. Extracting abbreviations and their definitions accurately is very helpful to biologists and also facilitates biomedical text analysis. Existing approaches fall into four broad categories: rule based, machine learning based, text alignment based and statistically based. State-of-the-art methods either focus exclusively on acronym-type abbreviations or cannot recognize rare abbreviations. We propose a systematic method to extract abbreviations effectively. First, a scoring method is used to classify the abbreviations into acronym-type and non-acronym-type abbreviations, and then their corresponding definitions are identified by two different methods: a text alignment algorithm for the former and a statistical method for the latter. Results: A literature mining system, MBA, was constructed to extract both acronym-type and non-acronym-type abbreviations. An abbreviation-tagged literature corpus, called the Medstract gold standard corpus, was used to evaluate the system. MBA achieved a recall of 88% at a precision of 91% on the Medstract gold-standard evaluation corpus. Conclusion: We present a new literature mining system, MBA, for extracting biomedical abbreviations. Our evaluation demonstrates that the MBA system performs better than the others. It can identify the definitions not only of acronym-type abbreviations, including slightly irregular ones (e.g., <CNS1, cyclophilin seven suppressor>), but also of non-acronym-type abbreviations (e.g., <Fas, CD95>).

    Abbreviation definition identification based on automatic precision estimates

    Background: The rapid growth of biomedical literature presents challenges for automatic text processing, and one of the challenges is abbreviation identification. The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Due to the size of databases such as MEDLINE, only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. In this paper we propose an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition, our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation. Results: On the Medstract corpus our algorithm produced 97% precision and 85% recall, which is higher than previously reported results. We also annotated 1250 randomly selected MEDLINE records as a gold standard. On this set we achieved 96.5% precision and 83.2% recall. This compares favourably with the well-known Schwartz and Hearst algorithm. Conclusion: We developed an algorithm for abbreviation identification that uses a variety of strategies to identify the most probable definition for an abbreviation and also produces an estimated accuracy of the result. This process is purely automatic.
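
The control flow described above, in which strategies are tried in order of their estimated precision, might be sketched as follows. This is only an illustration under stated assumptions: the two strategies (`first_letters`, `subsequence`) and the precision values 0.95 and 0.80 are hypothetical placeholders, not the strategies or pseudo-precision estimates of the paper, and the sketch does not show how pseudo-precision is computed.

```python
from typing import Callable, List, Optional, Tuple

Strategy = Callable[[str, str], Optional[str]]


def first_letters(short_form: str, text: str) -> Optional[str]:
    """Hypothetical strict strategy: one trailing word per short-form letter."""
    words = text.split()
    n = len(short_form)
    if n == 0 or n > len(words):
        return None
    tail = words[-n:]
    if all(w[0].lower() == c.lower() for w, c in zip(tail, short_form)):
        return " ".join(tail)
    return None


def subsequence(short_form: str, text: str) -> Optional[str]:
    """Hypothetical looser strategy: letters appear in order in trailing words."""
    words = text.split()
    if not short_form:
        return None
    for k in range(1, len(words) + 1):
        candidate = " ".join(words[-k:])
        if candidate[0].lower() != short_form[0].lower():
            continue
        remaining = iter(candidate.lower())
        # "ch in remaining" consumes the iterator, enforcing left-to-right order.
        if all(ch in remaining for ch in short_form.lower()):
            return candidate
    return None


# Placeholder precision estimates (made up for illustration only).
STRATEGIES: List[Tuple[Strategy, float]] = [(subsequence, 0.80), (first_letters, 0.95)]


def identify(
    short_form: str, text: str, strategies: List[Tuple[Strategy, float]]
) -> Optional[Tuple[str, float]]:
    """Apply strategies from highest to lowest estimated precision."""
    for strategy, precision in sorted(strategies, key=lambda p: p[1], reverse=True):
        result = strategy(short_form, text)
        if result is not None:
            return result, precision
    return None


print(identify("UMLS", "the Unified Medical Language System", STRATEGIES))
```

Returning the precision estimate alongside the definition lets a caller filter out pairs found only by low-confidence strategies.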

    A Contrastive Analysis of Wakamono Kotoba Abbreviations in Japanese and Indonesian

    This study aims to analyze the similarities and differences in wakamono kotoba abbreviations used in Japanese and Indonesian, using written language data from the Taberogu and Pergikuliner sites and spoken language data from the television shows Ariyoshi Zemi and Bikin Laper. The data were collected through recording and note-taking, together with a technique for sorting out determining elements. The Japanese wakamono kotoba data were then classified according to the theories of abbreviation characteristics of Yonekawa (1998) and She (2021), and the Indonesian data according to those of Kridalaksana (1992) and Yule (1996). The study identified five types of wakamono kotoba abbreviation processes in Japanese: abbreviation at the end of the word, omission at the end of the word, abbreviation of sentences or phrases, abbreviation of three parts in compound words, and abbreviations formed from the initial letters of each word. In Indonesian slang (bahasa gaul), there are likewise five abbreviation processes: initialisms (singkatan), clippings (penggalan), acronyms (akronim), contractions (kontraksi), and letter symbols (lambang huruf). Characteristics shared by wakamono kotoba abbreviations in Japanese and Indonesian include several common types of abbreviation, abbreviation processes borrowed from other languages, changes in meaning that produce new vocabulary, abbreviation at the end of words, abbreviation patterns in compound words and in three-word expressions, the addition of syllables, changes of form, and combination with other languages. The differences concern the objects that undergo abbreviation, the focus of the abbreviated object, the use of written language in spoken language, pure abbreviations from the native language or from foreign languages, the formation of new fragments in the Indonesian data, and the deletion of conjunctions in the Indonesian data versus particles in the Japanese data.

    Coarse-grained Candidate Generation and Fine-grained Re-ranking for Chinese Abbreviation Prediction

    Correctly predicting abbreviations given the full forms is important in many natural language processing systems. In this paper we propose a two-stage method to find the corresponding abbreviation given its full form. We first use contextual information from a large corpus to get abbreviation candidates for each full form and obtain a coarse-grained ranking through a graph random walk. This coarse-grained rank list restricts the search space to the top-ranked candidates. Then we use a similarity-sensitive re-ranking strategy, which can utilize the features of the candidates, to give a fine-grained re-ranking and select the final result. Our method achieves good results and outperforms the state-of-the-art systems. One advantage of our method is that it needs only weak supervision and can get competitive results with less training data. The candidate generation and coarse-grained ranking are totally unsupervised. The re-ranking phase can use a very small amount of training data to get a reasonably good result. © 2014 Association for Computational Linguistics.
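
The two-stage shape of such a pipeline (generate candidates, keep a coarse top-k, then re-rank the shortlist) can be sketched as below. This is a toy illustration, not the paper's method: plain corpus counts stand in for the graph random walk, a hand-written coverage score stands in for the similarity-sensitive re-ranker, the counts are made up, and all function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List


def generate_candidates(full_form: str, max_len: int = 4) -> List[str]:
    """In-order character subsequences that keep the first character.

    Assumption for this sketch: candidates have 2..max_len characters and
    are strictly shorter than the full form.
    """
    out = set()
    for n in range(2, min(max_len, len(full_form) - 1) + 1):
        for idx in combinations(range(1, len(full_form)), n - 1):
            out.add(full_form[0] + "".join(full_form[i] for i in idx))
    return sorted(out)


def coarse_rank(candidates: List[str], counts: Dict[str, int], top_k: int) -> List[str]:
    """Stage 1: keep the top_k candidates by corpus count (ties broken by text)."""
    ranked = sorted(candidates, key=lambda c: (-counts.get(c, 0), c))
    return ranked[:top_k]


def rerank(shortlist: List[str], fine_score: Callable[[str], float]) -> str:
    """Stage 2: pick the shortlist entry with the highest fine-grained score."""
    return max(shortlist, key=fine_score)


def predict(full_form: str, words: List[str], counts: Dict[str, int], top_k: int = 2) -> str:
    shortlist = coarse_rank(generate_candidates(full_form), counts, top_k)
    # Toy feature: how many words of the full form contribute their first character.
    return rerank(shortlist, lambda c: sum(w[0] in c for w in words))


# Made-up counts: the frequent string "北京" is a word in its own right,
# so the coarse stage alone would rank it first.
toy_counts = {"北京": 900, "北大": 50, "北学": 1}
print(predict("北京大学", ["北京", "大学"], toy_counts))
```

The example shows why a second stage helps: the coarse ranking narrows the search space cheaply, and the re-ranker only has to discriminate among a handful of candidates.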