
    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys across chemical disciplines. Retrieval of chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in text, which commonly involves extracting the entire list of chemicals mentioned in a document, together with any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges that assess system performance, in particular the CHEMDNER and CHEMDNER-patents tasks of BioCreative IV and V, respectively. Given the growing interest in automatically annotated chemical knowledge bases that integrate chemical information and biological data, we also present cheminformatics approaches for mapping extracted chemical names to chemical structures and annotating them, together with text mining applications for linking chemistry with biological information. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
    A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia) and FEDER (European Union), and by the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
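
    The entity recognition step described above can be illustrated with a deliberately simple, dictionary-based tagger; the lexicon, the example sentence, and the function below are hypothetical and are not drawn from any of the systems covered in the Review.

        import re

        # Hypothetical mini-lexicon of chemical names; real systems use curated
        # resources derived from large chemical databases.
        CHEMICAL_LEXICON = {"aspirin", "acetylsalicylic acid", "ibuprofen", "caffeine"}

        def find_chemical_mentions(text):
            """Return (start, end, surface form) spans for known chemical names.

            Longer lexicon entries are matched first so that multi-word names
            such as 'acetylsalicylic acid' are not split into fragments.
            """
            mentions = []
            for name in sorted(CHEMICAL_LEXICON, key=len, reverse=True):
                for match in re.finditer(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
                    mentions.append((match.start(), match.end(), match.group(0)))
            # Drop spans contained in an already accepted, longer span.
            mentions.sort(key=lambda span: (span[0], -(span[1] - span[0])))
            filtered = []
            for start, end, surface in mentions:
                if not any(s <= start and end <= e for s, e, _ in filtered):
                    filtered.append((start, end, surface))
            return filtered

        if __name__ == "__main__":
            sentence = "Acetylsalicylic acid (aspirin) inhibits COX-1 more strongly than ibuprofen."
            for start, end, surface in find_chemical_mentions(sentence):
                print(f"{start:3d}-{end:3d}  {surface}")

    Production systems described in the Review go well beyond such lookups, typically combining curated dictionaries with machine-learned taggers, and the CHEMDNER community challenges compare these approaches on shared benchmarks.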

    Unsupervised Methods for Learning and Using Semantics of Natural Language

    Teaching the computer to understand language is the major goal in the field of natural language processing. In this thesis we introduce computational methods that aim to extract language structure (e.g. grammar, semantics, or syntax) from text, which provides the computer with the information it needs to understand language. During the last decades, scientific efforts and the increase in computational resources have made it possible to come closer to the goal of understanding language. In order to extract language structure, many approaches train the computer on manually created resources. Most of these so-called supervised methods show high performance when applied to textual data similar to their training data, but they perform worse when operating on textual data that differ from the data they were trained on. Whereas training the computer is essential to obtain reasonable structure from natural language, we want to avoid training the computer on manually created resources. In this thesis we present so-called unsupervised methods, which learn patterns that allow structure to be extracted directly from textual data. These patterns are learned with methods that extract the semantics (meanings) of words and phrases. In comparison to manually built knowledge bases, unsupervised methods are more flexible: they can extract structure from text in different languages or text domains (e.g. finance or medical texts) without requiring manually annotated structure. However, learning structure from text often faces sparsity issues: in language, many words occur only a few times, and if a word is seen only a few times, no precise information can be extracted from the text in which it occurs. Whereas sparsity issues cannot be solved completely, information about most words can be gained by using large amounts of data.
    In the first chapter, we briefly describe how computers can learn to understand language. Afterwards, we present the main contributions, list the publications this thesis is based on, and give an overview of this thesis. Chapter 2 introduces the terminology used in this thesis and gives background on natural language processing. We then characterize the linguistic theory of how humans understand language and show how the underlying linguistic intuition can be operationalized for computers. Based on this operationalization, we introduce a formalism for representing words and their context, which is used in the following chapters to compute similarities between words. In Chapter 3 we give a brief description of methods in the field of computational semantics that are designed to compute similarities between words. All these methods have in common that they extract a contextual representation for a word from text and then use this representation to compute similarities between words. We also present examples of the word similarities computed with these methods.
    Segmenting text into topically related units is performed intuitively by humans and helps to extract connections between words in text. We equip the computer with this ability by introducing a text segmentation algorithm in Chapter 4. This algorithm is based on a statistical topic model, which learns to cluster words into topics solely on the basis of the text. Using the segmentation algorithm, we demonstrate the influence of the parameters provided by the topic model. In addition, our method yields state-of-the-art performance on two datasets.
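
    As a rough illustration of the segmentation idea, the sketch below places a boundary wherever lexical cohesion between adjacent sentence windows drops; it uses plain word overlap instead of the statistical topic model described in the thesis, and all identifiers in it are illustrative.

        from collections import Counter
        from math import sqrt

        def cosine(a, b):
            """Cosine similarity between two bags of words."""
            shared = set(a) & set(b)
            dot = sum(a[w] * b[w] for w in shared)
            norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
            return dot / norm if norm else 0.0

        def segment(sentences, window=2, threshold=0.1):
            """Place a boundary before sentence i when cohesion between the
            preceding and following windows of sentences falls below threshold."""
            bags = [Counter(s.lower().split()) for s in sentences]
            boundaries = []
            for i in range(window, len(bags) - window + 1):
                left = sum(bags[i - window:i], Counter())
                right = sum(bags[i:i + window], Counter())
                if cosine(left, right) < threshold:
                    boundaries.append(i)
            return boundaries

    In the thesis the cohesion signal comes from the topic distributions inferred by the topic model rather than from raw word overlap, and the influence of the model's parameters on the resulting segmentation is evaluated explicitly.
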
    In order to represent the meaning of words, we use context information (e.g. neighboring words), which is used to compute similarities. Whereas Chapter 3 described methods for computing word similarities, Chapter 5 introduces a generic symbolic framework for doing so. As we follow a symbolic approach, we do not represent words as dense numeric vectors but use symbols (e.g. neighboring words or syntactic dependencies) directly. Such a representation is human-readable and is preferred in sensitive applications like the medical domain, where the reasons for decisions need to be provided. The framework can process arbitrarily large data and can compute the most similar words for all words within a text collection, resulting in a distributional thesaurus. We show the influence of the various parameters of our framework and examine the impact of different corpora used for computing similarities. Among the contextual representations we experiment with, syntactic dependencies between words within sentences give the best results.
    However, these syntactic dependencies are predicted by a supervised dependency parser, which is trained on language-dependent, human-annotated resources. To avoid such language-specific preprocessing when computing distributional thesauri, we investigate replacing language-dependent dependency parsers with language-independent unsupervised parsers in Chapter 6. Evaluating the syntactic dependencies from unsupervised and supervised parses against human-annotated resources reveals that the unsupervised methods cannot compete with the supervised ones. In this chapter we use the structure predicted by both types of parsers as context representations for computing word similarities and then evaluate the quality of the resulting similarities, which provides an extrinsic evaluation setup for both unsupervised and supervised dependency parsers. In an evaluation on English text, similarities computed from contextual representations generated with unsupervised parsers do not outperform those computed from representations extracted with supervised parsers. For German, however, we observe the best results when using context retrieved by the unsupervised parser. Furthermore, we demonstrate that our framework can combine different context representations: for both languages we obtain the best performance with a combination of both flavors of syntactic dependencies.
    Most languages are not composed of single-word terms only, but also contain many multi-word terms that form a unit, called multiword expressions. The identification of multiword expressions is particularly important for semantics: the term New York, for example, has a different meaning than its constituent terms New or York. Whereas most research on semantics avoids handling these expressions, we target the extraction of multiword expressions in Chapter 7. Most previously introduced methods rely on part-of-speech tags and apply a ranking function to rank term sequences according to their multiwordness. Here, we introduce a language-independent and knowledge-free ranking method that uses information from distributional thesauri. In evaluations on English and French textual data, our method achieves the best results in comparison to methods from the literature.
    In Chapter 8 we use information from distributional thesauri as features for various applications. First, we introduce a general setting for tackling the out-of-vocabulary problem, i.e. the inferior performance of supervised methods on words that are not contained in the training data. We alleviate this issue by replacing unseen words with the most similar known words extracted from a distributional thesaurus. Using a supervised part-of-speech tagging method, we show substantial improvements in classification performance for out-of-vocabulary words on German and English textual data. The second application is a system for replacing words within a sentence with a word of the same meaning; here, the information from a distributional thesaurus provides the highest-scoring features. In the last application, we introduce an algorithm that detects the different meanings of a word and groups them into coarse-grained categories, called supersenses. Features generated from supersenses and distributional thesauri yield a performance increase when plugged into a supervised system that recognizes named entities (e.g. names, organizations, or locations).
    Further directions for using distributional thesauri are presented in Chapter 9. First, we lay out a method for incorporating background information (e.g. the source of the text collection or sense information) into a distributional thesaurus. Furthermore, we describe an approach to building thesauri for different text domains (e.g. the medical or finance domain) and how they can be combined to achieve high coverage of domain-specific knowledge as well as a broad background for the open domain. In the last section we characterize yet another method, suited to enriching existing knowledge bases. All three directions are possible extensions that induce further structure from textual data.
    The last chapter gives a summary of this work: we demonstrate that, without language-dependent knowledge, a computer can learn to extract useful structure from text by using computational semantics. Due to the unsupervised nature of the introduced methods, we are able to extract new structure from raw textual data. This is especially important for languages for which fewer manually created resources are available, as well as for special domains such as medicine or finance. We have demonstrated that our methods achieve state-of-the-art performance and have shown their impact by applying the extracted structure in three natural language processing tasks. We have also applied the methods to different languages and to large amounts of data. Thus, we have not proposed methods suited to extracting structure for a single language, but methods capable of exploring structure for “language” in general.
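
    To make the central data structure concrete, the following is a minimal, illustrative sketch of a distributional thesaurus built from symbolic context features (here simply neighboring words with their offsets) and of its use for replacing an out-of-vocabulary word with its most similar known word; the feature set, the overlap-based similarity, and all identifiers are simplifying assumptions rather than the thesis implementation, which uses significance-weighted features such as syntactic dependencies and scales to very large corpora.

        from collections import Counter, defaultdict

        def context_features(tokens, position, window=2):
            """Symbolic context features: neighboring words with their relative offset."""
            features = []
            for offset in range(-window, window + 1):
                j = position + offset
                if offset != 0 and 0 <= j < len(tokens):
                    features.append(f"{offset:+d}_{tokens[j]}")
            return features

        def build_thesaurus(sentences, window=2):
            """Map each word to its most similar words by shared context features."""
            word_features = defaultdict(Counter)
            for sentence in sentences:
                tokens = sentence.lower().split()
                for i, word in enumerate(tokens):
                    word_features[word].update(context_features(tokens, i, window))
            thesaurus = {}
            for word, feats in word_features.items():
                scores = Counter()
                for other, other_feats in word_features.items():
                    if other != word:
                        # Similarity = number of shared feature types, a crude proxy
                        # for the significance-based measures used in the thesis.
                        scores[other] = len(set(feats) & set(other_feats))
                thesaurus[word] = [w for w, s in scores.most_common(5) if s > 0]
            return thesaurus

        def replace_oov(word, known_vocabulary, thesaurus):
            """Replace an out-of-vocabulary word with its most similar known word."""
            for candidate in thesaurus.get(word, []):
                if candidate in known_vocabulary:
                    return candidate
            return word

    The all-pairs comparison above is quadratic in the vocabulary size; the framework of Chapter 5 is designed so that this computation remains feasible for arbitrarily large text collections.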

    PowerAqua: Open Question Answering on the Semantic Web

    With the rapid growth of semantic information on the Web, the processes of searching and querying these very large amounts of heterogeneous content have become increasingly challenging. This research tackles the problem of supporting users in querying and exploring information across multiple, heterogeneous Semantic Web (SW) sources. A review of the literature on ontology-based Question Answering reveals the limitations of existing technology. Our approach is based on providing a natural language Question Answering interface for the SW, PowerAqua. The realization of PowerAqua represents a considerable advance with respect to other systems, which restrict their scope to an ontology-specific or homogeneous fraction of the publicly available SW content. To our knowledge, PowerAqua is the only system able to take advantage of the semantic data available on the Web to interpret and answer user queries posed in natural language. In particular, PowerAqua is uniquely able to answer queries by combining and aggregating information that can be distributed across heterogeneous semantic resources. Here, we provide a complete overview of our work on PowerAqua, including: the research challenges it addresses; its architecture; the techniques we have realised to map queries to semantic data, to integrate partial answers drawn from different semantic resources, and to rank alternative answers; and the evaluation studies we have performed to assess the performance of PowerAqua. We believe our experiences can be extrapolated to a variety of end-user applications that wish to open up to large-scale and heterogeneous structured datasets, to be able to exploit effectively what is possibly the greatest wealth of data in the history of Artificial Intelligence.
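
    As a schematic illustration of the integration and ranking step mentioned above, the sketch below merges partial answers returned by several semantic sources and ranks them by how many sources support them and by the sources' confidence; the data layout and scoring are illustrative assumptions, not PowerAqua's actual algorithms.

        from collections import defaultdict

        def merge_and_rank(partial_answers):
            """Merge partial answers drawn from multiple semantic resources and rank them.

            partial_answers maps a source name to a list of (answer, confidence)
            pairs, e.g. the entities each ontology returned for the same query.
            Answers supported by more sources, and with higher total confidence,
            are ranked first.
            """
            support = defaultdict(list)
            for source, answers in partial_answers.items():
                for answer, confidence in answers:
                    support[answer].append((source, confidence))
            ranked = sorted(
                support.items(),
                key=lambda item: (len(item[1]), sum(c for _, c in item[1])),
                reverse=True,
            )
            return [(answer, [source for source, _ in evidence]) for answer, evidence in ranked]

        if __name__ == "__main__":
            candidates = {
                "source_a": [("Lake Garda", 0.9), ("Lake Como", 0.7)],
                "source_b": [("Lake Garda", 0.8)],
            }
            for answer, sources in merge_and_rank(candidates):
                print(answer, sources)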

    Linguistic and conceptual structures in the Beaver (Athapascan) mental lexicon. A study of body part terms and emotion expressions.

    This thesis deals with the structures in the mental lexicon of Beaver (Athabaskan). It pursues two goals: first, to examine body part terms and their use, above all in the expression of emotions; second, to discuss and critique the current theoretical apparatus. Conceptual Metaphor Theory (above all Lakoff) and Langacker's conceptual networks are applied, and modified where the data call for alternative analyses. Wherever possible, metalinguistic statements by the speakers are drawn on as relevant data in order to gain deeper insight into the underlying concepts, since these are hardly accessible directly. The description of the linguistic and conceptual forms is grounded in embodiment and conventionalization in connection with sociocultural, and thus language-specific, aspects. The "results" of these processes are polysemous lexemes whose meanings have been extended through diverse strategies. These are represented here in semantic and conceptual networks in order to trace the connections and conceptual distances between the readings. Several body part terms are presented in detail and analyzed in their manifold uses. Likewise, the emotion expressions that contain body part terms are described with respect to their meanings and uses and analyzed with the help of speaker statements. It becomes clear that conceptual structures are not always fully accessible to the speakers, although certain connections between meanings can be traced by the speakers on the basis of their lexical parts. It also becomes clear that Conceptual Metaphor Theory cannot serve as an explanation for all linguistic forms, since not every figurative construction can be traced back to a conceptual metaphor, and since the concepts used in non-prototypical meanings do not always rest on two independent domains. On the one hand, this work makes an important contribution to the description and understanding of Athabaskan languages; on the other, it feeds the discussion within cognitive linguistics, more precisely in the area of conceptual metaphor and figurative language, with new data and further stimulates it critically.

    Pertanika Journal of Social Sciences & Humanities


    Studies related to the process of program development

    The submitted work consists of a collection of publications arising from research carried out at Rhodes University (1970-1980) and at Heriot-Watt University (1980-1992). The theme of this research is the process of program development, i.e. the process of creating a computer program to solve some particular problem. The papers presented cover a number of different topics which relate to this process, viz. (a) Programming methodology, in particular aspects of structured programming. (b) Properties of programming languages. (c) Formal specification of programming languages. (d) Compiler techniques. (e) Declarative programming languages. (f) Program development aids. (g) Automatic program generation. (h) Databases. (i) Algorithms and applications.