
    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing system performance, in particular the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
    A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
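    As a toy illustration of the dictionary- and pattern-based tagging of chemical mentions that such pipelines typically start from (a minimal sketch with made-up patterns, not one of the actual CHEMDNER systems):

        import re

        # Hypothetical toy patterns; real systems combine large dictionaries,
        # morphology rules, and machine-learned taggers.
        CHEMICAL_PATTERNS = [
            r"\b\d+(?:,\d+)*-[a-z]+(?:ane|ene|yne|ol|one|ine|ide|ate)\b",  # e.g. 2-propanol
            r"\b[A-Z][a-z]?\d*(?:[A-Z][a-z]?\d*)+\b",                      # crude formula, e.g. C6H12O6
        ]

        def tag_chemicals(text):
            """Return (start, end, mention) spans for candidate chemical names."""
            spans = []
            for pattern in CHEMICAL_PATTERNS:
                spans.extend((m.start(), m.end(), m.group())
                             for m in re.finditer(pattern, text))
            return sorted(spans)

        print(tag_chemicals("Oxidation of 2-propanol yields acetone; glucose is C6H12O6."))

    The extracted mentions would then be normalized and mapped to chemical structures, which is where the cheminformatics approaches discussed above take over.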

    Slot Filling

    Slot filling (SF) is the task of automatically extracting facts about particular entities from unstructured text, and populating a knowledge base (KB) with these facts. These structured KBs enable applications such as structured web queries and question answering. SF is typically framed as a query-oriented setting of the related task of relation extraction. Throughout this thesis, we reflect on how SF is a task with many distinct problems. We demonstrate that recall is a major limiter on SF system performance. We contribute an analysis of typical SF recall loss, and find that a substantial amount of loss occurs early in the SF pipeline. We confirm that accurate NER and coreference resolution are required for high-recall SF. We measure upper bounds using a naïve graph-based semi-supervised bootstrapping technique, and find that only 39% of results are reachable using a typical feature space. We expect that this graph-based technique will be directly useful for extraction, and this leads us to frame SF as a label propagation task. We focus on a detailed graph representation of the task which reflects the behaviour and assumptions we want to model based on our analysis, including modifying the label propagation process to model multiple types of label interaction. Analysing the graph, we find that a large number of errors occur in very close proximity to training data, and identify that this is of major concern for propagation. While there are some conflicts caused by a lack of sufficient disambiguating context—we explore adding additional contextual features to address this—many of these conflicts are caused by subtle annotation problems. We find that the lack of a standard for how explicit expressions of relations must be in text makes consistent annotation difficult. Using a strict definition of explicitness results in 20% of correct annotations being removed from a standard dataset. We contribute several annotation-driven analyses of this problem, exploring the definition of slots and the effect of the lack of a concrete definition of explicitness: annotation schemas do not detail how explicit expressions of relations need to be, and there is large scope for disagreement between annotators. Additionally, applications may require relatively strict or relaxed evidence for extractions, but this is not considered in annotation tasks. We demonstrate that annotators frequently disagree on instances, depending on differences in annotator world knowledge and thresholds for making probabilistic inferences. SF is fundamental to enabling many knowledge-based applications, and this work motivates modelling and evaluating SF to better target these tasks.
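    As a rough sketch of this label-propagation framing (a toy graph with hypothetical relation labels and plain neighbour averaging; the thesis modifies the propagation process to model multiple types of label interaction):

        import numpy as np

        # Hypothetical 4-node graph: nodes 0 and 1 are labelled seed instances,
        # nodes 2 and 3 are unlabelled query instances.
        W = np.array([[0, 0, 1, 0],
                      [0, 0, 0, 1],
                      [1, 0, 0, 1],
                      [0, 1, 1, 0]], dtype=float)  # symmetric similarity edges
        Y = np.array([[1, 0],   # node 0: e.g. per:employee_of
                      [0, 1],   # node 1: e.g. no_relation
                      [0, 0],
                      [0, 0]], dtype=float)
        seeds = np.array([True, True, False, False])

        F = Y.copy()
        for _ in range(50):                           # iterate to convergence
            F = W @ F / W.sum(axis=1, keepdims=True)  # average neighbour labels
            F[seeds] = Y[seeds]                       # clamp the seed labels
        print(F.round(2))                             # soft labels for nodes 2 and 3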

    Event extraction from biomedical texts using trimmed dependency graphs

    This thesis explores the automatic extraction of information from biomedical publications. Such techniques are urgently needed because the biosciences are publishing continually increasing numbers of texts. The focus of this work is on events. Information about events is currently manually curated from the literature by biocurators. Biocuration, however, is time-consuming and costly, so automatic methods are needed for information extraction from the literature. This thesis is dedicated to modeling, implementing, and evaluating an advanced event extraction approach based on the analysis of syntactic dependency graphs. This work presents the proposed event extraction approach and its implementation, the JReX (Jena Relation eXtraction) system. This system was used by the University of Jena (JULIE Lab) team in the "BioNLP 2009 Shared Task on Event Extraction" competition and was ranked second among 24 competing teams. Thereafter, JReX was the highest scorer on the worldwide shared U-Compare event extraction server, outperforming the competing systems from the challenge. This success was made possible, among other things, by extensive research on event extraction solutions carried out during this thesis, e.g., exploring the effects of syntactic and semantic processing procedures on solving the event extraction task. The evaluations executed on standard and community-wide accepted competition data were complemented by a real-life evaluation of large-scale biomedical database reconstruction. This work showed that considerable parts of manually curated databases can be automatically re-created with the help of the event extraction approach developed. Successful re-creation was possible for parts of RegulonDB, the world's largest database for E. coli. In summary, the event extraction approach justified, developed, and implemented in this thesis meets the needs of a large community of human curators and thus helps in the acquisition of new knowledge in the biosciences.
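    JReX itself is a Java system, and its trimming rules are richer than shown here; as a minimal illustration of the underlying idea, the sketch below keeps only the shortest dependency path linking an event trigger to a candidate argument and discards the rest of the graph (the sentence and its edges are a made-up toy example):

        from collections import deque

        # Toy dependency graph for "IL-2 expression is regulated by NF-kappaB".
        # Edges are undirected here; real systems keep labels and direction.
        EDGES = {
            "regulated": ["expression", "NF-kappaB", "is"],
            "expression": ["regulated", "IL-2"],
            "IL-2": ["expression"],
            "NF-kappaB": ["regulated"],
            "is": ["regulated"],
        }

        def shortest_dependency_path(graph, trigger, argument):
            """Breadth-first search for the trimmed trigger-argument path."""
            queue, seen = deque([[trigger]]), {trigger}
            while queue:
                path = queue.popleft()
                if path[-1] == argument:
                    return path
                for nxt in graph.get(path[-1], []):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(path + [nxt])
            return None

        print(shortest_dependency_path(EDGES, "regulated", "IL-2"))
        # ['regulated', 'expression', 'IL-2'] -- the auxiliary "is" is trimmed away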

    Enabling modeling framework with surrogate modeling capabilities and complex networks

    Conceptual and physically based environmental simulation models, as products of research efforts, have become complex software over time in order to describe the behaviour of natural phenomena more accurately. Results from these models are considered accurate, but obtaining them often requires operating an entire system of modeling resources with dedicated knowledge, an extensive setup, and sometimes significant computational time. Model complexity limits wide model adoption among consultants because of their more limited technical resources and capabilities. However, models should be ubiquitously usable in both research and consulting environments. This dissertation aims to address and alleviate two aspects of research model complexity: 1) for researchers, the model design complexity with respect to its internal software structure, and 2) for consultants, the model application complexity with respect to data and parameter setup, runtime requirements, and proper model infrastructure setup. The first contribution provides modeling design and implementation support by managing interacting modeling solutions as a “Directed Acyclic Graph”, while the second helps to create surrogate models of complex physical models as a streamlined process. Both contributions are implemented within the OMS/CSIP modeling framework and infrastructure and were applied in various studies. First, a machine learning (ML)-based surrogate model approach is presented to respond to field application requirements for quick but “accurate enough” model results with limited input and limited a priori knowledge of the internal physical processes involved. The surrogate model aims to capture the behaviour of a physical model as an ensemble system of artificial neural networks (ANN). Here, the NeuroEvolution of Augmenting Topologies (NEAT) technique has been leveraged because of its integration of a genetic approach to build and evolve its ANNs during supervised training. Throughout this phase, the thorough design of the services facilitates seamless monitoring of structural mutations of the artificial neural network and its performance with respect to behavioural emulation of the original model response. This results in streamlined surrogate model generation. Furthermore, the stochasticity inherent to the evolutionary genetic algorithm, combined with a specially designed cross-validation approach, allows for straightforward use of the ensemble application. Several slightly different artificial neural networks are trained concurrently. The ensemble system is built upon the selection of the most performant surrogate models and is used collectively to provide uncertainty-quantified results when applied to new data. Second, NET3, a Directed Acyclic Graph (DAG) modeling structure, was developed. NET3 provides appropriate data structures to represent interactions between modeling states as relationships based on network topologies. The inherent structure of the DAG drives the execution of modeling tasks. NET3 implicitly manages parallel computation depending on the network topology. A node of a NET3 modeling structure encapsulates any sort of modeling solution, such as a system of ordinary differential equations, a set of statistical rules, or a system of partial differential equations. Each link connects these modeling solutions by handling their data flow.
As a result, NET3 simplifies 1) the translation of physical and mathematical concepts into model components, and 2) the management of complex interactions of modeling solutions. NET3 also pushes forward the idea of separating concerns between software architecture and the scientific model codebase. It manages aspects that relate to the architectural design of the graph modeling structure and lets research scientists focus on their model’s domain. NET3 improves encapsulation and reusability of scientific/mathematical concepts. It avoids code duplication by allowing the same modeling solution to be adopted in different nodes and finely adapted to specific requirements. In summary, NET3 enables a new level of modeling flexibility by allowing model representations to be changed quickly in order to explore new modeling solutions. The two presented contributions were integrated into the Object Modeling System/Cloud Services Integrated Platform (OMS/CSIP) environmental modeling framework (EMF). EMFs are standard practice in environmental modeling because they provide a software solution for separating the burden of software architectural design management from scientific research. Here, OMS/CSIP has been identified as “advanced” in terms of EMF design. It offers high flexibility, low language invasiveness, a fine and thorough architectural design, and an innovative cloud computing deployment infrastructure. These aspects make the OMS/CSIP infrastructure a suitable platform to host NEAT-based surrogate modeling and the NET3 extensions. Framework-enabled NEAT-based Surrogate modeling (FeNS) results from the full integration of the NEAT-based surrogate modeling approach with the OMS/CSIP platform. Here, the surrogate model approach was developed as CSIP services to help the transition from research models to “field models” by enabling the modeling framework to interact with CSIP services, ML libraries, and a NoSQL database to derive model surrogates for any modeling solution. OMS/CSIP was extended to harvest data from each model run and automatically derive the surrogate model at the modeling framework level. NET3 extends OMS modeling simulations to run as a graph network of interconnected modeling solutions. Furthermore, it enhances the available OMS calibration algorithms to become multi-site calibration procedures. OMS already provided implicit parallel computation of independent components in a modeling solution. NET3 now adds a further layer of implicit parallelism by concurrently running independent modeling solutions. Two studies were carried out to develop and test FeNS, while three applications supported the development and testing of NET3. Surrogate models of the Revised Universal Soil Loss Equation, Version 2 (R2) were generated to scale up from simple test cases with a constrained input space to more generic applications including a larger variety of input parameters. The main goal of the surrogate model was to streamline and simplify access to the R2 model behaviour. We performed sensitivity analysis of R2 to limit the input space to only relevant parameters (e.g., soil properties, climate parameters, field geometries, crop rotation description). The main study area was the State of Iowa, starting from a single county (Clay County) and extending to four counties (Buena Vista, Cherokee, Clay, and Wright). Clustering methodologies were applied to improve surrogate model accuracy and to accelerate the training process by reducing the dataset size.
The overall “goodness-of-fit” against the testing dataset, estimated on the median of the uncertainty-quantified results of the surrogate model ensemble, was always above a Nash-Sutcliffe efficiency (NS) of 0.95, with a root mean squared error (RMSE) between 0.13 and 0.36 and a bias between -0.07 and 0.02. In many cases, the accuracy of the surrogate model with respect to the testing dataset was above 0.98 NS. Surrogate models of the AgroEcoSystem (AgES) model were generated to apply and test the FeNS methodology on a semi-distributed hydrologic model. The main goal of the surrogate model was to streamline and simplify access to the AgES model behaviour. Only relevant lumped parameters at the watershed centroid were used to train the surrogate models and limit the input space to only relevant parameters (e.g., precipitation, groundwater level, LAI, and potential evapotranspiration). The main study area was the South Fork Iowa River (SFIR) watershed in the State of Iowa, across Wright, Franklin, Hamilton, and Hardin counties. The overall “goodness-of-fit” against the testing dataset, estimated on the median of the uncertainty-quantified results of the surrogate model ensemble, was above 0.97 NS, with an RMSE of 2.24 and a bias of -0.0794. With respect to NET3, the first application was the real-time modeling of flood forecasting through the GEOframe system for the Civil Protection of Regione Basilicata, implemented by Dr. Bancheri. To scale the computation and finely tune calibration parameters, the Basilicata river basins were split into subcatchments, each represented by a different NET3 node. The second application was part of Mr. Dalla Torre’s master’s thesis, in which the computational core of the rainfall-runoff model of the Storm Water Management Model (SWMM, by EPA) was componentized. NET3 now allows for reimplementing a concise and lightweight SWMM modeling core with highly parallel model runs. The software architectural design of the rainfall-runoff, routing, and sewer pipe design components targeted separation of concerns, single responsibility, and encapsulation principles. This resulted in a clean and minimized code base. NET3 manages component connections and scalable computation by hosting the rainfall-runoff modeling solution in nodes separate from the routing and sewer pipe design modeling solutions. It also enables each node of the modeling structure to 1) access a shared data structure (SWMMobject) to fetch input data from and push results to, and 2) internally analyze the upstream subtree in order to adjust sewer pipe design parameters. The third test case is the application of a “system of systems” of urban models where each node of the graph modeling structure encapsulates a single-responsibility system of models. Because of the stochasticity involved in each system of models, the entire graph modeling solution was required to run several times to generate independent realizations. Hence, NET3 was enabled to run a “graph of graphs” modeling structure.
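    For reference, the goodness-of-fit measures reported above can be computed as in the following sketch (toy numbers, not data from the studies):

        import numpy as np

        def nash_sutcliffe(obs, sim):
            """NS = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
            return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

        obs = np.array([1.0, 2.0, 3.0, 4.0])       # original-model outputs
        sim = np.array([1.1, 1.9, 3.2, 3.9])       # surrogate-ensemble median
        print(nash_sutcliffe(obs, sim))            # 1.0 would mean perfect emulation
        print(np.sqrt(np.mean((obs - sim) ** 2)))  # RMSE
        print(np.mean(sim - obs))                  # bias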
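    NET3 itself is part of the OMS/Java framework, and its node/link API is not reproduced here; the following sketch only illustrates the underlying execution idea, i.e., running modeling solutions in the order dictated by the DAG topology while independent nodes run concurrently:

        from concurrent.futures import ThreadPoolExecutor
        from graphlib import TopologicalSorter

        # Hypothetical subcatchment network: the outlet node depends on the
        # results of two independent upstream nodes.
        dag = {"outlet": {"upstream_A", "upstream_B"},
               "upstream_A": set(), "upstream_B": set()}

        def run_modeling_solution(node):
            # Stand-in for any modeling solution (ODE system, statistical rules, ...)
            print(f"running modeling solution for {node}")
            return node

        ts = TopologicalSorter(dag)
        ts.prepare()
        with ThreadPoolExecutor() as pool:
            while ts.is_active():
                ready = ts.get_ready()  # independent nodes: run them concurrently
                for finished in pool.map(run_modeling_solution, ready):
                    ts.done(finished)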

    Towards using fluctuations in internal quality metrics to find design intents

    Version control is the backbone of the modern software development workflow. While building more and more complex systems, developers have to understand unfamiliar subsystems of source code. Understanding the logic of unfamiliar code is relatively straightforward. However, understanding its design and its genesis is often only possible through scattered and unreliable commit messages and project documentation -- when they exist. Thus, developers need a reliable and relevant baseline to understand the history of software projects. In this thesis, we take the first steps towards understanding change patterns in commit histories. We study the changes in software metrics through the evolution of projects. Through multiple exploratory studies, we conduct quantitative and qualitative experiments on several datasets extracted from a pool of 13 projects. We mine the changes in software metrics for each commit of the respective projects and manually build oracles to represent ground truth. We identify several categories by analyzing these changes. One pattern in particular, dubbed "tradeoffs", where some metrics may improve at the expense of others, proved to be a promising indicator of design-related changes -- in some cases, also hinting at a conscious design intent from the authors of the changes. Demonstrating the findings of our exploratory studies, we build a general model to identify the application of a well-known set of design principles in new projects. Our overall results suggest that metric fluctuations have the potential to be relevant indicators for valuable macroscopic insights about the design evolution in a project's development history.
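    As a minimal sketch of this "tradeoff" pattern (the metric names are hypothetical, deltas are assumed normalized so that positive means improvement, and the thesis relies on manually built oracles rather than a rule this simple):

        def classify_commit(metric_deltas):
            """Classify a commit from its per-metric quality deltas."""
            improved = any(d > 0 for d in metric_deltas.values())
            degraded = any(d < 0 for d in metric_deltas.values())
            if improved and degraded:
                return "tradeoff"       # candidate design-related change
            if improved:
                return "improvement"
            if degraded:
                return "degradation"
            return "neutral"

        print(classify_commit({"cohesion": +0.12, "coupling": -0.08, "complexity": 0.0}))
        # -> "tradeoff"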

    Text-to-Video: Image Semantics and NLP

    When aiming at automatically translating an arbitrary text into a visual story, the main challenge consists in finding a semantically close visual representation whereby the displayed meaning remains the same as in the given text. Moreover, the appearance of an image itself largely influences how its meaningful information is conveyed to an observer. This thesis demonstrates that investigating both image semantics and the semantic relatedness between visual and textual sources enables us to tackle the challenging semantic gap and to find a semantically close translation from natural language to a corresponding visual representation. In recent years, social networking has attracted great interest, leading to an enormous and still increasing amount of data available online. Photo sharing sites like Flickr allow users to associate textual information with their uploaded imagery. This thesis therefore exploits this huge knowledge source of user-generated data, which provides initial links between images and words, as well as other meaningful data. In order to approach visual semantics, this work presents various methods to analyze the visual structure as well as the appearance of images in terms of meaningful similarities, aesthetic appeal, and emotional effect on an observer. In detail, our GPU-based approach efficiently finds visual similarities between images in large datasets across visual domains and identifies various meanings for ambiguous words by exploring similarity in online search results. Further, we investigate the highly subjective aesthetic appeal of images and make use of deep learning to directly learn aesthetic rankings from a broad diversity of user reactions in online social behavior. To gain even deeper insights into the influence of visual appearance on an observer, we explore how simple image processing is capable of actually changing emotional perception, and we derive a simple but effective image filter. To identify meaningful connections between written text and visual representations, we employ methods from Natural Language Processing (NLP). Extensive textual processing allows us to create semantically relevant illustrations for simple text elements as well as complete storylines. More precisely, we present an approach that resolves dependencies in textual descriptions to arrange 3D models correctly. Further, we develop a method that finds semantically relevant illustrations for texts of different types based on a novel hierarchical querying algorithm. Finally, we present an optimization-based framework that is capable of generating not only semantically relevant but also visually coherent picture stories in different styles.
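    The hierarchical querying algorithm itself is more involved; as a toy sketch of tag-based matching between keywords extracted from a text and user-tagged images (the file names and tag sets below are made up):

        def jaccard(a, b):
            """Overlap between two tag sets."""
            return len(a & b) / len(a | b) if a | b else 0.0

        def rank_images(keywords, image_tags):
            """Rank candidate images by tag overlap with the text keywords."""
            return sorted(image_tags,
                          key=lambda img: jaccard(keywords, image_tags[img]),
                          reverse=True)

        images = {  # hypothetical Flickr-style tag sets
            "img1.jpg": {"beach", "sunset", "sea"},
            "img2.jpg": {"city", "night", "lights"},
        }
        print(rank_images({"sunset", "sea", "walk"}, images))  # img1.jpg ranks first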