1,351 research outputs found

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
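As an illustration of what exploiting microdata in a blog page can look like, here is a minimal sketch using only the Python standard library. It is not the BlogForever implementation; the function name `extract_microdata`, the sample markup, and the simplification of reading only element text (ignoring values carried in attributes such as `content` or `datetime`) are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ItempropCollector(HTMLParser):
    """Collect the text of elements that carry an `itemprop` attribute."""

    def __init__(self):
        super().__init__()
        self.properties = {}  # itemprop name -> list of text fragments
        self._stack = []      # one entry per open element: itemprop name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._stack.append(dict(attrs).get("itemprop"))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the nearest enclosing element with an itemprop.
        for name in reversed(self._stack):
            if name:
                self.properties.setdefault(name, []).append(text)
                break


def extract_microdata(html):
    """Return a dict mapping itemprop names to the text found inside them."""
    collector = ItempropCollector()
    collector.feed(html)
    return collector.properties


# Hypothetical blog post markup using schema.org microdata.
SAMPLE = """
<article itemscope itemtype="https://schema.org/BlogPosting">
  <h1 itemprop="headline">Preserving blogs</h1>
  <span itemprop="author">A. Writer</span>
  <div itemprop="articleBody"><p>First paragraph.</p><br><p>Second paragraph.</p></div>
</article>
"""

if __name__ == "__main__":
    for name, values in extract_microdata(SAMPLE).items():
        print(name, values)
```

When a page carries such markup, the post title, author, and body can be read off directly, without inducing a site-specific wrapper.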

    Extracting novel facts from tables for Knowledge Graph completion

    We propose a new end-to-end method for extending a Knowledge Graph (KG) from tables. Existing techniques tend to interpret tables by focusing on information that is already in the KG, and therefore tend to extract many redundant facts. Our method aims to find more novel facts. We introduce a new technique for table interpretation based on a scalable graphical model using entity similarities. Our method further disambiguates cell values using KG embeddings as an additional ranking method. Other distinctive features are the lack of assumptions about the underlying KG and the ability to fine-tune the precision/recall trade-off of extracted facts. Our experiments show that our approach has a higher recall during the interpretation process than the state of the art, and is more resistant to the bias observed in extracting mostly redundant facts, since it produces more novel extractions.

    Inferring Tabular Analysis Metadata by Infusing Distribution and Knowledge Information

    Many data analysis tasks heavily rely on a deep understanding of tables (multi-dimensional data). Across the tasks, there exist commonly used metadata attributes of table fields / columns. In this paper, we identify four such analysis metadata: measure/dimension dichotomy, common field roles, semantic field type, and default aggregation function. Inferring these metadata faces the challenges of insufficient supervision signals, utilizing existing knowledge, and understanding distribution. To infer these metadata for a raw table, we propose our multi-tasking Metadata model, which fuses field distribution and knowledge graph information into pre-trained tabular models. For model training and evaluation, we collect a large corpus (~582k tables from private spreadsheet and public tabular datasets) of analysis metadata by using diverse smart supervisions from downstream tasks. Our best model has accuracy = 98%, hit rate at top-1 > 67%, accuracy > 80%, and accuracy = 88% for the four analysis metadata inference tasks, respectively. It outperforms a series of baselines that are based on rules, traditional machine learning methods, and pre-trained tabular models. Analysis metadata models are deployed in a popular data analysis product, helping downstream intelligent features such as insights mining, chart / pivot table recommendation, and natural language QA... Comment: 13 pages, 7 figures, 9 tables

    Site-Specific Rules Extraction in Precision Agriculture

    The sustainable increase of food production to meet the needs of a growing world population is a real challenge when we take into account the constant impact of pests and diseases on crops. Because of the significant economic losses that result, the use of chemical treatments is too high, causing environmental pollution and resistance to different treatments. In this context, the agricultural community envisions the application of more site-specific treatments, as well as automatic validation of legal compliance. However, the specification of these treatments is found in regulations expressed in natural language. For this reason, translating regulations into a machine-processable representation is becoming increasingly important in precision agriculture. Currently, the requirements for translating regulations into formal rules are far from being met, and with the rapid development of agricultural science, manual verification of legal compliance is becoming intractable. In this thesis, the goal is to build and evaluate a rule extraction system that effectively distills the relevant information from regulations and transforms natural-language rules into a structured format that machines can process. To this end, we split rule extraction into two steps. The first is to build a domain ontology: a model for describing the disorders that diseases cause in crops and their treatments. The second step is to extract information to populate the ontology. Since we use machine learning techniques, we implemented the MATTER methodology to carry out the annotation of regulations.
Once the corpus was created, we built a rule category classifier that distinguishes between obligations and prohibitions, and a system for extracting constraints in rules, which recognizes relevant information to retain isomorphism with the original regulation. For these components we employed, among other deep learning techniques, convolutional neural networks and Long Short-Term Memory. In addition, we used more traditional algorithms such as support-vector machines and random forests as baselines. As a result, we present the PCT-O ontology, which has been aligned with other ontologies such as NCBI, PubChem, ChEBI and Wikipedia. The model can be used for the identification of disorders, the analysis of conflicts between treatments, and the comparison of legislation across countries. Regarding the extraction systems, we empirically evaluated their behaviour with different metrics, but the F1 metric is used to select the best systems. For the rule category classifier, the best system obtains a macro F1 of 92.77% and a binary F1 of 85.71%. This system uses a bidirectional long short-term memory network with word embeddings as input. For the rule constraint extractor, the best system obtains a micro F1 of 88.3%. This extractor uses as input a combination of character embeddings together with word embeddings and a bidirectional long short-term memory neural network.
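The abstract above reports three F1 variants (macro, binary, and micro). The sketch below shows how they differ using their standard definitions. The labels and predictions are invented toy data, not results from the thesis, and treating "prohibition" as the positive class for the binary score is an assumption made only for illustration.

```python
def counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2PR / (P + R), defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def binary_f1(y_true, y_pred, positive):
    """F1 of the chosen positive class only."""
    return f1_from_counts(*counts(y_true, y_pred, positive))


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(binary_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = fp = fn = 0
    for lab in labels:
        c_tp, c_fp, c_fn = counts(y_true, y_pred, lab)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    return f1_from_counts(tp, fp, fn)


# Toy data (hypothetical): gold and predicted rule categories for six sentences.
Y_TRUE = ["obligation", "obligation", "obligation",
          "prohibition", "prohibition", "obligation"]
Y_PRED = ["obligation", "obligation", "prohibition",
          "prohibition", "obligation", "obligation"]

if __name__ == "__main__":
    print("binary F1:", binary_f1(Y_TRUE, Y_PRED, positive="prohibition"))
    print("macro F1: ", macro_f1(Y_TRUE, Y_PRED))
    print("micro F1: ", micro_f1(Y_TRUE, Y_PRED))
```

Macro F1 weights every class equally regardless of its frequency, whereas binary F1 looks at a single class, which is one reason the two scores for the same classifier can differ noticeably when classes are imbalanced.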

    Structuring the Unstructured: Unlocking pharmacokinetic data from journals with Natural Language Processing

    The development of a new drug is an increasingly expensive and inefficient process. Many drug candidates are discarded due to pharmacokinetic (PK) complications detected at clinical phases. It is critical to accurately estimate the PK parameters of new drugs before being tested in humans since they will determine their efficacy and safety outcomes. Preclinical predictions of PK parameters are largely based on prior knowledge from other compounds, but much of this potentially valuable data is currently locked in the format of scientific papers. With an ever-increasing amount of scientific literature, automated systems are essential to exploit this resource efficiently. Developing text mining systems that can structure PK literature is critical to improving the drug development pipeline. This thesis studied the development and application of text mining resources to accelerate the curation of PK databases. Specifically, the development of novel corpora and suitable natural language processing architectures in the PK domain were addressed. The work presented focused on machine learning approaches that can model the high diversity of PK studies, parameter mentions, numerical measurements, units, and contextual information reported across the literature. Additionally, architectures and training approaches that could efficiently deal with the scarcity of annotated examples were explored. The chapters of this thesis tackle the development of suitable models and corpora to (1) retrieve PK documents, (2) recognise PK parameter mentions, (3) link PK entities to a knowledge base and (4) extract relations between parameter mentions, estimated measurements, units and other contextual information. Finally, the last chapter of this thesis studied the feasibility of the whole extraction pipeline to accelerate tasks in drug development research. 
The results from this thesis demonstrated the potential of text mining approaches to automatically generate PK databases that can aid researchers in the field and ultimately accelerate the drug development pipeline. Additionally, the thesis presented contributions to biomedical natural language processing by developing suitable architectures and corpora for multiple tasks, tackling novel entities and relations within the PK domain.