18 research outputs found

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing system performance, in particular the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
    A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
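    As a concrete illustration of the entity recognition step described above, the following minimal sketch tags candidate chemical mentions with a dictionary lookup plus a crude suffix heuristic. The lexicon, suffix pattern, and function name are invented for illustration; the CHEMDNER systems surveyed in the review rely on machine-learned taggers and far larger resources.

```python
import re

# Minimal sketch of dictionary-plus-heuristic chemical entity tagging.
# The lexicon and suffix pattern are illustrative only; real systems use
# machine-learned taggers and large resources (e.g. PubChem synonym lists).
LEXICON = {"aspirin", "ibuprofen", "acetylsalicylic acid"}

# Crude morphological cue: common chemical-name suffixes.
SUFFIX_PATTERN = re.compile(r"\b\w+(?:ol|ine|ate|ide|one|ane)\b", re.IGNORECASE)

def tag_chemicals(text):
    """Return (start, end, mention) spans of candidate chemical names."""
    spans = []
    lowered = text.lower()
    for name in LEXICON:
        start = lowered.find(name)
        if start != -1:
            spans.append((start, start + len(name), text[start:start + len(name)]))
    for match in SUFFIX_PATTERN.finditer(text):
        spans.append((match.start(), match.end(), match.group()))
    return spans

print(tag_chemicals("Aspirin and caffeine both inhibit the enzyme."))
```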

    Rapid Resource Transfer for Multilingual Natural Language Processing

    Until recently the focus of the Natural Language Processing (NLP) community has been on a handful of mostly European languages. However, the rapid changes taking place in the economic and political climate of the world precipitate a similar change in the relative importance given to various languages. The importance of rapidly acquiring NLP resources and computational capabilities in new languages is widely accepted. Statistical NLP models have a distinct advantage over rule-based methods in achieving this goal since they require far less manual labor. However, statistical methods require two fundamental resources for training: (1) online corpora and (2) manual annotations. Creating these two resources can be as difficult as porting rule-based methods. This thesis demonstrates the feasibility of acquiring both corpora and annotations by exploiting existing resources for well-studied languages. Basic resources for new languages can be acquired in a rapid and cost-effective manner by utilizing existing resources cross-lingually. Currently, the most viable method of obtaining online corpora is converting existing printed text into electronic form using Optical Character Recognition (OCR). Unfortunately, a language that lacks online corpora most likely lacks OCR as well. We tackle this problem by taking an existing OCR system that was designed for a specific language and applying it to a language with a similar script. We present a generative OCR model that allows us to post-process output from a non-native OCR system to achieve accuracy close to, or better than, a native one. Furthermore, we show that the performance of a native or trained OCR system can be improved by the same method. Next, we demonstrate cross-utilization of annotations on treebanks. We present an algorithm that projects dependency trees across parallel corpora, and we show that a reasonable-quality treebank can be generated by combining projection with a small amount of language-specific post-processing. The projected treebank allows us to train a parser that performs comparably to a parser trained on manually generated data.
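    The edge-copying core of the tree projection idea can be sketched as follows; this toy version assumes one-to-one word alignments, whereas the algorithm in the thesis also handles unaligned and many-to-many tokens with language-specific post-processing. The data structures here are invented for illustration.

```python
# Sketch of direct dependency projection across a word-aligned sentence
# pair. Assumes one-to-one alignments for simplicity.

def project_dependencies(src_heads, alignment):
    """
    src_heads: dict mapping each source token index to its head index
               (the root points to -1).
    alignment: dict mapping source token index -> target token index.
    Returns a partial head map for the target sentence.
    """
    tgt_heads = {}
    for dep, head in src_heads.items():
        if dep in alignment and (head == -1 or head in alignment):
            tgt_dep = alignment[dep]
            tgt_heads[tgt_dep] = -1 if head == -1 else alignment[head]
    return tgt_heads

# English "she reads books" (head map) projected onto a target sentence
# via a toy one-to-one alignment.
src = {0: 1, 1: -1, 2: 1}          # "reads" is the root
align = {0: 0, 1: 1, 2: 2}
print(project_dependencies(src, align))  # {0: 1, 1: -1, 2: 1}
```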

    Phoneme-based statistical transliteration of foreign names for OOV problem.

    Gao Wei. Thesis (M.Phil.), Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 79-82). Abstracts in English and Chinese. Contents:
    Chapter 1. Introduction: What is Transliteration?; Existing Problems; Objectives; Outline.
    Chapter 2. Background: Source-channel Model; Transliteration for English-Chinese (Rule-based Approach; Similarity-based Framework; Direct Semi-Statistical Approach; Source-channel-based Approach); Chapter Summary.
    Chapter 3. Transliteration Baseline: Transliteration Using IBM SMT (Introduction; GIZA++ for Transliteration Modeling; CMU-Cambridge Toolkits for Language Modeling; ReWrite Decoder for Decoding); Limitations of IBM SMT; Experiments Using IBM SMT (Data Preparation; Performance Measurement; Experimental Results); Chapter Summary.
    Chapter 4. Direct Transliteration Modeling: Soundness of the Direct Model (Direct-1); Alignment of Phoneme Chunks; Transliteration Model Training (EM Training for Symbol-mappings; WFST for Phonetic Transition; Issues for Incorrect Syllables); Language Model Training; Search Algorithm; Experimental Results (C.A. Distribution; Top-n Accuracy; Comparisons with the Baseline; Influence of m Candidates); Discussions; Chapter Summary.
    Chapter 5. Improving Direct Transliteration: Improved Direct Model (Direct-2): Enlightenment from Source-Channel; Using Contextual Features; Estimation Based on MaxEnt; Features for Transliteration; Direct-2 Model Training (Procedure and Results; Discussions); Refining the Model Direct-2 (Refinement Solutions; Direct-2R Model Training); Evaluation (Search Algorithm; Direct Transliteration Models vs. Baseline; Direct-2 vs. Direct-2R; Experiments on Direct-2R); Chapter Summary.
    Chapter 6. Conclusions: Thesis Summary; Cross Language Applications; Future Work and Directions.
    Appendix A. IPA-ARPABET Symbol Mapping Table. Bibliography.
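    The source-channel model underlying Chapter 2 scores a candidate Chinese transliteration C for an English name E as P(E|C)·P(C) and picks the argmax over candidates. A toy sketch of that scoring, with invented probability tables standing in for the EM-trained channel model and the n-gram language model:

```python
import math

# Minimal sketch of source-channel (noisy-channel) scoring for
# transliteration: argmax_C P(E|C) * P(C). The toy probability tables
# below are invented for illustration.

def channel_logprob(e_phonemes, c_syllables, table):
    """log P(E|C) under a per-position independence assumption."""
    return sum(math.log(table.get((e, c), 1e-9))
               for e, c in zip(e_phonemes, c_syllables))

def score(e_phonemes, c_syllables, table, lm):
    """log P(E|C) + log P(C)."""
    return channel_logprob(e_phonemes, c_syllables, table) \
        + math.log(lm[tuple(c_syllables)])

# Toy example: two candidate pinyin syllable sequences for one
# English phoneme sequence.
table = {("B", "bu"): 0.7, ("B", "ba"): 0.3,
         ("SH", "shi"): 0.9, ("SH", "xi"): 0.1}
lm = {("bu", "shi"): 0.6, ("ba", "shi"): 0.4}
candidates = [["bu", "shi"], ["ba", "shi"]]
best = max(candidates, key=lambda c: score(["B", "SH"], c, table, lm))
print(best)  # ['bu', 'shi']
```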

    A Kaleidoscope of Digital American Literature

    The word kaleidoscope comes from a Greek phrase meaning to view a beautiful form, and this report makes the leap of faith that all scholarship is beautiful (Ayers 2005b). This review is divided into three major sections. Part I offers a sampling of the types of digital resources currently available or under development in support of American literature and identifies the prevailing concerns of specialists in the field as expressed during interviews conducted between July 2004 and May 2005. Part II consolidates the results of these interviews with an exploration of resources currently available to illustrate, on the one hand, a kaleidoscope of differing attitudes and assessments, and, on the other, an underlying design that gives shape to the parts. Part III examines six categories of digital work in progress: (1) quality-controlled subject gateways, (2) author studies, (3) public domain e-book collections and alternative publishing models, (4) proprietary reference resources and full-text primary source collections, (5) collections by design, and (6) teaching applications. This survey is informed by a selective review of the recent literature, focusing especially on contributions from scholars that have appeared in discipline-based journals.

    Large vocabulary off-line handwritten word recognition

    Considerable progress has been made in handwriting recognition technology over the last few years. Thus far, handwriting recognition systems have been limited to small-scale and very constrained applications, where the number of different words that a system can recognize is the key point for its performance. The capability of dealing with large vocabularies, however, opens up many more applications. In order to translate the gains made by research into large and very-large vocabulary handwriting recognition, it is necessary to further improve the computational efficiency and the accuracy of current recognition strategies and algorithms. In this thesis we focus on efficient and accurate large vocabulary handwriting recognition. The main challenge is to speed up the recognition process and to improve the recognition accuracy. However, these two aspects are in mutual conflict: it is relatively easy to improve recognition speed while trading away some accuracy, but it is much harder to improve the recognition speed while preserving the accuracy. First, several strategies have been investigated for improving the performance of a baseline recognition system in terms of recognition speed to deal with large and very-large vocabularies. Next, we improve the performance in terms of recognition accuracy while preserving all the original characteristics of the baseline recognition system: omniwriter, unconstrained handwriting, and dynamic lexicons. The main contributions of this thesis are novel search strategies and a novel verification approach that allow us to achieve a 120× speedup and a 10% accuracy improvement over a state-of-the-art baseline recognition system for a very-large vocabulary recognition task (80,000 words). The improvements in speed are obtained by the following techniques: lexical tree search, standard and constrained lexicon-driven level building algorithms, a fast two-level decoding algorithm, and a distributed recognition scheme. The recognition accuracy is improved by post-processing the list of candidate N-best-scoring word hypotheses generated by the baseline recognition system; the list also contains the segmentation of each word hypothesis into characters. A verification module based on a neural network classifier generates a score for each segmented character, and in the end the scores from the baseline recognition system and the verification module are combined to optimize performance. A rejection mechanism introduced over this combination significantly improves the word recognition rate, to about 95%, while rejecting 30% of the word hypotheses.
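    The score-combination and rejection step can be sketched as follows; the interpolation weight, scores, and rejection threshold below are invented for illustration, not the values tuned in the thesis.

```python
# Sketch of N-best rescoring with rejection: combine the baseline
# recognizer's score with a character-level verification score for each
# word hypothesis, then reject low-confidence answers.

def rescore(nbest, alpha=0.6, reject_threshold=0.5):
    """
    nbest: list of (word, baseline_score, char_scores), where char_scores
           are per-character verification scores in [0, 1].
    Returns the best word, or None if the best hypothesis is rejected.
    """
    def combined(entry):
        word, base, chars = entry
        verif = sum(chars) / len(chars)      # average character score
        return alpha * base + (1 - alpha) * verif

    best = max(nbest, key=combined)
    return best[0] if combined(best) >= reject_threshold else None

nbest = [
    ("receipt", 0.55, [0.9, 0.8, 0.7, 0.9, 0.6, 0.8, 0.9]),
    ("recent",  0.50, [0.9, 0.8, 0.7, 0.4, 0.5, 0.6]),
]
print(rescore(nbest))  # 'receipt'
```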

    Large Data-to-Text Generation

    This thesis presents a domain-driven approach to sports game summarization, a specific instance of large data-to-text generation (DTG). We first address the data fidelity issue in the Rotowire dataset by supplementing existing input records, demonstrating larger relative improvements than previously proposed purification schemes. As this method further increases the total number of input records, we alternatively formulate the problem as a multimodal one (i.e. visual data-to-text), discussing potential advantages over purely textual approaches and studying its effectiveness for future expansion. We work exclusively with pre-trained end-to-end transformers throughout, allowing us to evaluate the efficacy of sparse attention and multimodal encoder-decoders in DTG and providing appropriate benchmarks for future work. To automatically evaluate the statistical correctness of generated summaries, we also extend prior work on automatic relation extraction and build an updated pipeline that incorporates small amounts of human-annotated data, quickly inflated via data augmentation. By formulating this in a "text-to-text" fashion, we are able to take advantage of LLMs and achieve significantly higher precision and recall than previous methods while tracking three times the number of unique relations. Our updated models are made more consistent and reliable by incorporating human-verified data partitions into the training and evaluation process.
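    The statistical-correctness evaluation boils down to comparing relation triples extracted from a generated summary against the source records. A minimal sketch with invented triples (the thesis obtains them with its LLM-based "text-to-text" extractor):

```python
# Sketch of the evaluation idea: treat each extracted fact as an
# (entity, relation, value) triple and compare the set extracted from a
# generated summary against the gold triples from the source records.
# The triples below are invented for illustration.

def precision_recall(extracted, gold):
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {("LeBron James", "PTS", "32"), ("LeBron James", "AST", "9"),
        ("Lakers", "TEAM-PTS", "110")}
extracted = {("LeBron James", "PTS", "32"), ("Lakers", "TEAM-PTS", "108")}
print(precision_recall(extracted, gold))  # (0.5, 0.333...)
```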

    European Language Grid

    This open access book provides an in-depth description of the EU project European Language Grid (ELG). Its motivation lies in the fact that Europe is a multilingual society with 24 official European Union Member State languages and dozens of additional languages including regional and minority languages. The only meaningful way to enable multilingualism and to benefit from this rich linguistic heritage is through Language Technologies (LT) including Natural Language Processing (NLP), Natural Language Understanding (NLU), Speech Technologies and language-centric Artificial Intelligence (AI) applications. The European Language Grid provides a single umbrella platform for the European LT community, including research and industry, effectively functioning as a virtual home, marketplace, showroom, and deployment centre for all services, tools, resources, products and organisations active in the field. Today the ELG cloud platform already offers access to more than 13,000 language processing tools and language resources. It enables all stakeholders to deposit, upload and deploy their technologies and datasets. The platform also supports the long-term objective of establishing digital language equality in Europe by 2030 – to create a situation in which all European languages enjoy equal technological support. This is the very first book dedicated to Language Technology and NLP platforms. Cloud technology has only recently matured enough to make the development of a platform like ELG feasible on a larger scale. The book comprehensively describes the results of the ELG project. Following an introduction, the content is divided into four main parts: (I) ELG Cloud Platform; (II) ELG Inventory of Technologies and Resources; (III) ELG Community and Initiative; and (IV) ELG Open Calls and Pilot Projects

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018: 10-12 December 2018, Torino

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    JURI SAYS: An Automatic Judgement Prediction System for the European Court of Human Rights

    In this paper we present the web platform JURI SAYS, which automatically predicts decisions of the European Court of Human Rights based on communicated cases, which are published by the court early in the proceedings and are often available many years before the final decision is made. Our system therefore predicts future judgements of the court. The platform is available at jurisays.com and shows the predictions compared to the actual decisions of the court. It is automatically updated every month with predictions for new cases. Additionally, the system highlights the sentences and paragraphs that are most important for the prediction (i.e. violation vs. no violation of human rights).
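    As a rough illustration of the task framing only (not the JURI SAYS system itself, which the paper describes), the sketch below casts violation / no-violation prediction as binary text classification and ranks sentences by the summed learned weights of their terms, a simple stand-in for the highlighting step. The training texts and all names are placeholders.

```python
# Minimal stand-in for judgement prediction plus sentence highlighting.
# Not the JURI SAYS architecture; data and weights are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["the applicant was detained without review",
               "the complaint was resolved domestically"]
train_labels = [1, 0]                        # 1 = violation

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

def predict_and_highlight(case_sentences):
    """Predict a decision and rank sentences by their contribution."""
    doc = vec.transform([" ".join(case_sentences)])
    label = clf.predict(doc)[0]
    # Score each sentence by the sum of its terms' learned weights.
    weights = dict(zip(vec.get_feature_names_out(), clf.coef_[0]))
    scores = [(sum(weights.get(w.strip(".,"), 0.0) for w in s.lower().split()), s)
              for s in case_sentences]
    return label, sorted(scores, reverse=True)

label, ranked = predict_and_highlight(
    ["The applicant was detained.", "No domestic review took place."])
print(label, ranked[0][1])
```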
