9 research outputs found

    Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm

    Get PDF
    Abstract: Background: Tokenization is an important component of language processing, yet there is no widely accepted tokenization method for English texts, including biomedical texts. Apart from rule-based techniques, tokenization in the biomedical domain has been regarded as a classification task: biomedical classifier-based tokenizers either split or join textual objects through classification to form tokens. The idiosyncratic nature of each biomedical tokenizer’s output complicates adoption and reuse. Furthermore, biomedical tokenizers generally lack guidance on how to apply an existing tokenizer to a new domain (subdomain). We identify and complete a novel tokenizer design pattern and suggest a systematic approach to tokenizer creation. We implement a tokenizer based on our design pattern that combines regular expressions and machine learning. Our machine learning approach differs from the previous split-join classification approaches. We evaluate our approach against three other tokenizers on the task of tokenizing biomedical text. Results: Medpost and our adapted Viterbi tokenizer performed best, with accuracies of 92.9% and 92.4%, respectively. Conclusions: Our evaluation supports our claim that the design pattern and guidelines are a viable approach to tokenizer construction, producing tokenizers that match leading custom-built tokenizers in a particular domain. Our evaluation also demonstrates that ambiguous tokenizations can be disambiguated through POS tagging; POS tag sequences and training data thus have a significant impact on proper text tokenization.
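The adapted Viterbi approach can be pictured as dynamic programming over a token lattice: every substring licensed by the lexicon (or by regular expressions) becomes a candidate edge, and the highest-scoring path through the lattice is the chosen tokenization. A minimal sketch, with toy log-probability scores standing in for the paper's learned model:

```python
# Illustrative sketch of Viterbi decoding over a token lattice.
# The lexicon scores are invented toy values, not the paper's model.

def viterbi_tokenize(text, lexicon):
    """lexicon maps candidate token strings to log-probability scores."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best path score ending at position i
    back = [None] * (n + 1)           # (start, token) backpointer
    best[0] = 0.0
    for i in range(n):
        if best[i] == float("-inf"):
            continue  # position i is unreachable
        for j in range(i + 1, n + 1):
            tok = text[i:j]
            if tok in lexicon:
                score = best[i] + lexicon[tok]
                if score > best[j]:
                    best[j] = score
                    back[j] = (i, tok)
    # Recover the best segmentation by walking backpointers from the end.
    tokens, i = [], n
    while i > 0:
        i, tok = back[i]
        tokens.append(tok)
    return list(reversed(tokens))

lexicon = {"IL": -1.0, "-": -0.5, "2": -1.0, "IL-2": -0.2, "receptor": -0.3}
print(viterbi_tokenize("IL-2receptor", lexicon))  # → ['IL-2', 'receptor']
```

On the biomedical-looking string `IL-2receptor`, the multi-character entry `IL-2` outscores splitting into `IL`, `-`, `2`, so the lattice path keeps the gene name intact.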

    The Paradigm Discovery Problem

    Full text link
    This work treats the paradigm discovery problem (PDP), the task of learning an inflectional morphological system from unannotated sentences. We formalize the PDP and develop evaluation metrics for judging systems. Using currently available resources, we construct datasets for the task. We also devise a heuristic benchmark for the PDP and report empirical results on five diverse languages. Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm. Then, we bootstrap a neural transducer on top of the clustered data to predict words realizing the empty paradigm slots. An error analysis of our system suggests that clustering by cell across different inflection classes is the most pressing challenge for future work. Our code and data are available for public use. Comment: Forthcoming at ACL 2020
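The clustering step can be illustrated with string similarity alone (the actual benchmark also uses word embeddings and a neural transducer). This toy sketch greedily groups surface forms whose similarity to a cluster's first member meets a threshold:

```python
# Toy illustration (not the paper's system) of grouping inflected forms
# into candidate paradigms by surface string similarity.
from difflib import SequenceMatcher

def cluster_paradigms(forms, threshold=0.6):
    clusters = []
    for form in forms:
        for cluster in clusters:
            # Join a cluster if the form is similar enough to its seed form.
            if SequenceMatcher(None, form, cluster[0]).ratio() >= threshold:
                cluster.append(form)
                break
        else:
            clusters.append([form])  # start a new candidate paradigm
    return clusters

print(cluster_paradigms(["walk", "walks", "walked", "run", "running", "runs"]))
# → [['walk', 'walks', 'walked'], ['run', 'running', 'runs']]
```

A real system would cluster by cell as well as by paradigm; the error analysis quoted above suggests the by-cell direction is the harder one.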

    ImmunoLingo: Linguistics-based formalization of the antibody language

    Full text link
    Apparent parallels between natural language and biological sequences have led to a recent surge in the application of deep language models (LMs) to the analysis of antibody and other biological sequences. However, the lack of a rigorous linguistic formalization of biological sequence languages, one that would define basic components such as the lexicon (the discrete units of the language) and the grammar (the rules that link sequence well-formedness, structure, and meaning), has led to largely domain-unspecific applications of LMs that do not take into account the underlying structure of the biological sequences studied. A linguistic formalization, on the other hand, establishes linguistically informed and thus domain-adapted components for LM applications. It would facilitate a better understanding of how differences and similarities between natural language and biological sequences influence the quality of LMs, which is crucial for the design of interpretable models with extractable sequence-function relationship rules, such as those underlying the antibody specificity prediction problem. Deciphering the rules of antibody specificity is crucial to accelerating rational and in silico biotherapeutic drug design. Here, we formalize the properties of the antibody language and thereby establish a foundation not only for the application of linguistic tools in adaptive immune receptor analysis but also for systematic immunolinguistic studies of immune receptor specificity in general. Comment: 19 pages, 3 figures

    Analisis dan Perancangan Aplikasi Chatbot Menggunakan Framework Rasa dan Sistem Informasi Pemeliharaan Aplikasi (Studi Kasus: Chatbot Penerimaan Mahasiswa Baru Politeknik Astra)

    Get PDF
    Chatbots have become a business necessity for services that require real-time, 24-hour interaction. Politeknik Astra faces this need during its new-student admission period: a chatbot can serve as an interactive source of information for prospective students looking up the registration process or general information about Politeknik Astra. The analysis and system design process began with a literature study, on the basis of which the Rasa framework was chosen for chatbot development. Rasa performs well because it comprises Rasa NLU and Rasa Core. Rasa NLU is the base library for building interaction between computer and human, applying two artificial-intelligence methods: natural language processing and machine learning. Rasa NLU is responsible for making the interaction feel natural, so that users experience it as a direct conversation with a person rather than with a computer. Rasa Core likewise contributes to this, by managing the dialogue between the bot (the computer behind the chatbot) and the user. The Rasa framework is also open source, giving it high adaptability when modifications are needed to suit existing business requirements. Development continued with the analysis and design of a supporting information system, built to accommodate CRUD (Create, Read, Update, Delete) operations on the question-response pairs that the chatbot learns before it can interact naturally. This supporting system makes it easy for an admin without an IT background to adjust how the chatbot is configured to respond to questions.
The results of this research are the system requirements, represented in a use case diagram and flowchart; the selection of an NLU pipeline for the chatbot; the system architecture; the database design in the form of a physical data model; and the interface design (mockup) of the supporting system for the Rasa-framework chatbot.
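The division of labor between Rasa NLU (understanding) and Rasa Core (dialogue management) is expressed in Rasa's `config.yml`. The fragment below is an illustrative sketch using standard Rasa components; the abstract does not state which pipeline the authors actually selected:

```yaml
# Illustrative Rasa config.yml sketch (not the pipeline chosen in the paper).
language: id                   # Indonesian
pipeline:                      # Rasa NLU: intent and entity understanding
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier       # joint intent classification + entity extraction
    epochs: 100
policies:                      # Rasa Core: dialogue management
  - name: MemoizationPolicy    # replays dialogues seen in training stories
  - name: TEDPolicy            # generalizes to unseen dialogue turns
```

A supporting CRUD system like the one described would regenerate the NLU training data (intents and responses) that this pipeline is trained on.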

    Improvements in Transition Based Systems for Dependency Parsing

    Get PDF
    This thesis investigates transition based systems for parsing natural language with dependency grammars. Dependency parsing provides a good and simple syntactic representation of the grammatical relations in a sentence, and in recent years this basic task has become a fundamental step in many applications that deal with natural language processing. Transition based systems in particular have strong practical and psycholinguistic motivations. From a practical point of view, they are the only parsing systems fast enough for web-scale applications. From a psycholinguistic point of view, they closely resemble how humans incrementally process language. However, these systems fall behind in accuracy when compared with graph-based parsing, a family of techniques based on a more traditional graph-theoretic / dynamic-programming approach that is more demanding from a computational perspective. Recently, some techniques have been developed to improve the accuracy of transition based systems. The most successful are based on beam search or on combining the output of different parsing algorithms, but all of them have a negative impact on parsing time. In this thesis, I explore an alternative approach to transition based parsing, one that improves accuracy without sacrificing computational efficiency. I focus on greedy transition based systems and show how accuracy can be improved by using a dynamic oracle and a flexible parsing strategy. Dynamic oracles reduce error propagation at parsing time; they may have some impact on training time, but there is no efficiency loss at parsing time. A flexible parsing strategy relaxes constraints on the parsing process, and its time impact on both training and parsing is almost negligible.
Finally, these two techniques work well when combined, and they are orthogonal to previously explored proposals such as beam search or system combination. As far as I know, the experimental results obtained are still state-of-the-art for greedy transition based parsing with dependency grammars.
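A greedy transition based system of the kind discussed here can be sketched with the classic arc-standard transitions (SHIFT, LEFT-ARC, RIGHT-ARC). The toy static oracle below derives transitions from gold head indices and assumes a projective tree; the thesis's dynamic oracle and flexible parsing strategy are refinements on top of such a system:

```python
# Minimal arc-standard transition system driven by a toy static oracle
# (not the thesis's dynamic-oracle parser). Assumes a projective gold tree.

def parse(words, gold_heads):
    """words: 1-indexed tokens; gold_heads[i-1] is the head of word i (0 = root)."""
    stack, buffer, arcs = [0], list(range(1, len(words) + 1)), []

    def has_all_children(w):
        # w may only be attached once all its gold dependents are collected.
        return all(h != w or (h, d) in arcs
                   for d, h in enumerate(gold_heads, start=1))

    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s1, s0 = stack[-2], stack[-1]
            if s1 != 0 and gold_heads[s1 - 1] == s0:               # LEFT-ARC
                arcs.append((s0, s1)); stack.pop(-2); continue
            if gold_heads[s0 - 1] == s1 and has_all_children(s0):  # RIGHT-ARC
                arcs.append((s1, s0)); stack.pop(); continue
        stack.append(buffer.pop(0))                                # SHIFT
    return sorted(arcs)

# Head indices: "She" -> "ate", "ate" -> root, "fish" -> "ate".
print(parse(["She", "ate", "fish"], [2, 0, 2]))  # → [(0, 2), (2, 1), (2, 3)]
```

A dynamic oracle replaces this single gold transition sequence with a cost function over all configurations, so the parser can still be trained sensibly after it has made an error.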

    Statistical morphological disambiguation with application to disambiguation of pronunciations in Turkish

    Get PDF
    The statistical morphological disambiguation of agglutinative languages suffers from data sparseness. In this study, we introduce the notion of distinguishing tag sets (DTS) to overcome the problem. The morphological analyses of words are modeled with DTS and the root major part-of-speech tags. The disambiguator based on this representation performs statistical morphological disambiguation of Turkish with a recall as high as 95.69 percent. In text-to-speech systems and in developing transcriptions for acoustic speech data, the problem arises of disambiguating the pronunciation of a token in context, so that the correct pronunciation can be produced or the transcription uses the correct set of phonemes. We apply the morphological disambiguator to this pronunciation disambiguation problem and achieve 99.54 percent recall with 97.95 percent precision. Most text-to-speech systems perform phrase-level accentuation based on the content word/function word distinction. This approach seems easy and adequate for some right-headed languages such as English but is not suitable for languages such as Turkish. We therefore use a heuristic approach that marks up phrase boundaries based on dependency parsing, as a basis for phrase-level accentuation in Turkish TTS synthesizers.

    Coreference Resolution for Arabic

    Get PDF
    Recently, there has been enormous progress in coreference resolution. These developments were applied to Chinese, English, and other languages with outstanding results. However, languages with a rich morphology or fewer resources, such as Arabic, have not received as much attention. In fact, when this PhD work started there was no neural coreference resolver for Arabic, and we were not aware of any learning-based coreference resolver for Arabic since [Björkelund and Kuhn, 2014]. In addition, as far as we know, whereas much attention had been devoted to the phenomenon of zero anaphora in languages such as Chinese or Japanese, no neural model for Arabic zero-pronoun anaphora had been developed. In this thesis, we report on a series of experiments on Arabic coreference resolution in general and on zero anaphora in particular. We propose a new neural coreference resolver for Arabic, and we present a series of models for identifying and resolving Arabic zero pronouns. Our approach to zero-pronoun identification and resolution is applicable to other languages, and was also evaluated on Chinese, with results surpassing the state of the art at the time. This research also involved producing revised versions of standard datasets for Arabic coreference.

    The syntax and processing of relative clauses in Mandarin Chinese

    Get PDF
    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, 2003. Includes bibliographical references (leaves 125-133). By Franny Pai-Fang Hsiao.
    This thesis investigates relative clauses (henceforth RCs) in Mandarin Chinese as spoken in Taiwan from both syntactic and processing perspectives. I also explore the interaction between these two areas, for example, how evidence from one area lends support to or undermines theories in the other. There are several goals I hope to achieve. First of all, there is a significant gap in the sentence processing literature on Mandarin Chinese, and in particular on RCs in Mandarin Chinese. I aim to bridge this gap by conducting experiments that provide a basic understanding of how Chinese RCs are processed; in doing so, I also provide a more complete picture of RC processing across languages. In this thesis, I report three online reading experiments on Chinese RCs. I show that even though Chinese is an SVO language like English and French, the results for processing subject-extracted versus object-extracted RCs in Mandarin Chinese are very different from results for the same construction in other SVO languages: whereas subject-extracted RCs are less complex in other SVO languages, they are more complex in Mandarin Chinese. These findings help tease apart various processing theories. In particular, I show that even though resource-based theories, canonical/non-canonical word order (frequency) theories, the theory based on accessibility of syntactic positions, and perspective shift theory all account for the facts reported in other SVO languages, results from Chinese are only compatible with resource-based theories and canonical/non-canonical (frequency) theories. Secondly, it has been noted that in many cases resource-based theories and canonical/non-canonical word order (frequency) theories are both compatible with data from sentence processing studies. Resource-based theories attribute the processing difficulty associated with subject-extracted RCs to their higher storage cost, whereas a frequency-based canonical word order theory such as the one proposed in Mitchell et al. 1995 attributes it to the less frequent occurrence of subject-extracted RCs in corpora. As a result, it is very difficult to tease these two theories apart. However, I conducted a Chinese corpus study for this thesis, and I show that there is no correlation between structural frequencies in corpora and behavioral measures such as reading times, contrary to what frequency theories predict. As a matter of fact, subject-extracted RCs occur more frequently in the Chinese corpus. This undermines the validity of frequency theories in explaining the processing data reported in this thesis. Thirdly, Aoun and Li (to appear) argue that there is syntactic and semantic evidence in favor of positing two distinct syntactic derivations for RCs with or without resumptive pronouns. RCs containing gaps involve head-raising of the head NP (i.e. no operator movement), as reconstruction of the head NP back to the RC is available. On the other hand, RCs containing resumptive pronouns involve an empty operator in [Spec, CP] and no head-raising of the head NP (since reconstruction is unavailable) ...

    Critical Tokenization and its Properties

    No full text
    This paper sets out to study critical tokenization, a distinctive type of tokenization following the principle of maximum tokenization. The objective in this paper is to develop its mathematical description and understanding.
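The principle of maximum tokenization is commonly instantiated as forward maximum matching: at each position, take the longest lexicon entry. A small illustrative sketch (not the paper's formal construction):

```python
# Forward maximum matching: an illustrative instance of the principle of
# maximum tokenization, not the paper's critical-tokenization formalism.

def forward_max_match(text, lexicon):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(forward_max_match("databasesystems",
                        {"data", "database", "base", "systems", "system"}))
# → ['database', 'systems']
```

Greedy longest-match can of course commit to a wrong prefix; characterizing which tokenizations are "critical" under maximum tokenization is exactly the kind of property the paper studies.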