BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction, and extends the inquiry through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities for extracting the semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
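The RSS-guided extraction idea lends itself to a compact illustration. Below is a minimal Python sketch, assuming the feedparser and BeautifulSoup libraries, of how feed entries can act as weak supervision for locating post containers in the rendered HTML; the helper names and the shortest-match heuristic are illustrative assumptions, not the BlogForever implementation.

```python
# Sketch: use a blog's RSS entries as weak supervision to locate the
# corresponding post containers in the rendered HTML page.
import feedparser
from bs4 import BeautifulSoup

def locate_post_node(html: str, entry_title: str):
    """Return the smallest HTML element whose text contains the feed title."""
    soup = BeautifulSoup(html, "html.parser")
    candidates = [el for el in soup.find_all(True)
                  if entry_title.strip() in el.get_text()]
    # The deepest (shortest-text) match approximates the post's own container.
    return min(candidates, key=lambda el: len(el.get_text()), default=None)

def extract_posts(feed_url: str, fetch_html):
    """Pair each RSS entry with the HTML node holding its content.

    `fetch_html` is a caller-supplied function (hypothetical here) that
    downloads the HTML for a post URL.
    """
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        node = locate_post_node(fetch_html(entry.link), entry.title)
        if node is not None:
            yield entry.title, node.get_text(" ", strip=True)
```

Structured annotations such as microdata or microformats, where present, can be read directly with a dedicated parser instead of relying on heuristics like the above.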
Extracting novel facts from tables for Knowledge Graph completion
We propose a new end-to-end method for extending a Knowledge Graph (KG) from tables. Existing techniques tend to interpret tables by focusing on information that is already in the KG, and therefore tend to extract many redundant facts. Our method aims to find more novel facts. We introduce a new technique for table interpretation based on a scalable graphical model using entity similarities. Our method further disambiguates cell values using KG embeddings as an additional ranking method. Other distinctive features are the lack of assumptions about the underlying KG and the enabling of fine-grained tuning of the precision/recall trade-off of extracted facts. Our experiments show that our approach has a higher recall during the interpretation process than the state-of-the-art, and is more resistant to the bias observed in extracting mostly redundant facts, since it produces more novel extractions.
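As a rough illustration of the embedding-based re-ranking step, the following Python sketch scores candidate KG entities for a cell by cosine similarity to entities already linked in the same row or column; the function and variable names are assumptions, and the paper's graphical model over entity similarities is not reproduced here.

```python
# Sketch: re-rank candidate KG entities for a table cell by embedding
# similarity to the row/column context. The threshold mirrors the tunable
# precision/recall trade-off the method exposes for extracted facts.
import numpy as np

def rerank_candidates(candidates: dict, context_vecs: list,
                      threshold: float = 0.5):
    """Rank candidate entities by cosine similarity to the mean context vector.

    `candidates` maps entity IDs to embedding vectors; `context_vecs` holds
    embeddings of entities already linked in the same row or column.
    """
    context = np.mean(context_vecs, axis=0)
    context /= np.linalg.norm(context)
    scored = []
    for entity, vec in candidates.items():
        sim = float(vec @ context / np.linalg.norm(vec))
        if sim >= threshold:  # raise threshold for precision, lower for recall
            scored.append((entity, sim))
    return sorted(scored, key=lambda p: p[1], reverse=True)
```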
Inferring Tabular Analysis Metadata by Infusing Distribution and Knowledge Information
Many data analysis tasks rely heavily on a deep understanding of tables (multi-dimensional data). Across these tasks there exist commonly used metadata attributes of table fields / columns. In this paper, we identify four such analysis metadata: measure/dimension dichotomy, common field roles, semantic field type, and default aggregation function. Inferring these metadata faces the challenges of insufficient supervision signals and of how to utilize existing knowledge and field distributions. To infer these metadata for a raw table, we propose a multi-tasking Metadata model which fuses field distribution and knowledge graph information into pre-trained tabular models. For model training and evaluation, we collect a large corpus (~582k tables from private spreadsheet and public tabular datasets) of analysis metadata by using diverse smart supervisions from downstream tasks. Our best model achieves accuracy = 98%, hit rate at top-1 > 67%, accuracy > 80%, and accuracy = 88% on the four analysis metadata inference tasks, respectively. It outperforms a series of baselines based on rules, traditional machine learning methods, and pre-trained tabular models. The analysis metadata models are deployed in a popular data analysis product, supporting downstream intelligent features such as insights mining, chart / pivot table recommendation, and natural language QA.
Comment: 13 pages, 7 figures, 9 tables
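To make the field-distribution signal concrete, here is a hedged Python sketch, assuming pandas, of simple per-column statistics that a rule baseline (or a learned model) could use for the measure/dimension dichotomy; the feature set and cut-offs are illustrative assumptions, not the paper's model.

```python
# Sketch: per-field distribution features for measure/dimension inference.
import pandas as pd

def field_distribution_features(col: pd.Series) -> dict:
    """Summarise a table field with simple distribution statistics."""
    numeric = pd.to_numeric(col, errors="coerce")
    return {
        "numeric_ratio": numeric.notna().mean(),      # share of parseable numbers
        "distinct_ratio": col.nunique() / max(len(col), 1),
        "null_ratio": col.isna().mean(),
    }

def rule_baseline_is_measure(col: pd.Series) -> bool:
    """Rule baseline: mostly-numeric, high-cardinality fields look like measures.

    The 0.5 cut-offs are arbitrary assumptions for illustration.
    """
    f = field_distribution_features(col)
    return f["numeric_ratio"] > 0.5 and f["distinct_ratio"] > 0.5
```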
Site-Specific Rules Extraction in Precision Agriculture
Sustainably increasing food production to meet the needs of a growing world population is a genuine challenge given the constant impact of pests and diseases on crops. Because of the significant economic losses they cause, the use of chemical treatments is excessive, leading to environmental pollution and resistance to different treatments. In this context, the agricultural community envisions more site-specific treatment application, together with automatic validation of legal compliance. However, the specification of these treatments is found in regulations expressed in natural language. For this reason, translating regulations into a machine-processable representation is becoming increasingly important in precision agriculture. At present, the requirements for translating regulations into formal rules are far from being met, and with the rapid development of agricultural science, manual verification of legal compliance is becoming intractable.
In this thesis, the goal is to build and evaluate a rule extraction system that effectively distils the relevant information from regulations and transforms natural-language rules into a structured format that machines can process. To this end, we split rule extraction into two steps. The first is to build a domain ontology: a model describing the disorders that diseases produce in crops and their treatments. The second step is to extract information to populate the ontology. Since we use machine learning techniques, we implemented the MATTER methodology to carry out the regulation annotation process. Once the corpus was created, we built a rule category classifier that distinguishes obligations from prohibitions, and a rule constraint extraction system that recognises the information needed to preserve isomorphism with the original regulation. For these components we employed, among other deep learning techniques, convolutional neural networks and Long Short-Term Memory networks. In addition, we used more traditional algorithms such as support-vector machines and random forests as baselines.
As a result, we present the PCT-O ontology, which has been aligned with other resources such as NCBI, PubChem, ChEBI and Wikipedia. The model can be used to identify disorders, analyse conflicts between treatments, and compare legislation across countries. Regarding the extraction systems, we evaluated their behaviour empirically with several metrics, using F1 to select the best systems. For the rule category classifier, the best system achieves a macro F1 of 92.77% and a binary F1 of 85.71%; it uses a bidirectional long short-term memory network with word embeddings as input. For the rule constraint extractor, the best system achieves a micro F1 of 88.3%; it takes as input a combination of character embeddings and word embeddings fed into a bidirectional long short-term memory network.
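As a sketch of the rule category classifier described above, the following PyTorch model runs a bidirectional LSTM over word embeddings and emits obligation/prohibition logits; the dimensions and vocabulary handling are illustrative assumptions rather than the thesis configuration.

```python
# Sketch: BiLSTM rule-category classifier (obligation vs prohibition).
import torch
import torch.nn as nn

class RuleCategoryClassifier(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # obligation vs prohibition

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embed(token_ids)            # (batch, seq, emb_dim)
        _, (h_n, _) = self.bilstm(embedded)         # final hidden states
        sentence = torch.cat([h_n[0], h_n[1]], dim=-1)  # forward ++ backward
        return self.head(sentence)                  # logits per class
```

In practice the embedding layer would be initialised from pre-trained word embeddings, as the abstract indicates.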
Structuring the Unstructured: Unlocking pharmacokinetic data from journals with Natural Language Processing
The development of a new drug is an increasingly expensive and inefficient process. Many drug candidates are discarded due to pharmacokinetic (PK) complications detected at clinical phases. It is critical to accurately estimate the PK parameters of new drugs before they are tested in humans, since these parameters determine efficacy and safety outcomes. Preclinical predictions of PK parameters are largely based on prior knowledge from other compounds, but much of this potentially valuable data is currently locked in the format of scientific papers. With an ever-increasing amount of scientific literature, automated systems are essential to exploit this resource efficiently. Developing text mining systems that can structure PK literature is critical to improving the drug development pipeline.
This thesis studied the development and application of text mining resources to accelerate the curation of PK databases. Specifically, the development of novel corpora and suitable natural language processing architectures in the PK domain were addressed. The work presented focused on machine learning approaches that can model the high diversity of PK studies, parameter mentions, numerical measurements, units, and contextual information reported across the literature. Additionally, architectures and training approaches that could efficiently deal with the scarcity of annotated examples were explored. The chapters of this thesis tackle the development of suitable models and corpora to (1) retrieve PK documents, (2) recognise PK parameter mentions, (3) link PK entities to a knowledge base and (4) extract relations between parameter mentions, estimated measurements, units and other contextual information. Finally, the last chapter of this thesis studied the feasibility of the whole extraction pipeline to accelerate tasks in drug development research.
The results from this thesis demonstrated the potential of text mining approaches to automatically generate PK databases that can aid researchers in the field and ultimately accelerate the drug development pipeline. Additionally, the thesis contributed to biomedical natural language processing by developing suitable architectures and corpora for multiple tasks, tackling novel entities and relations within the PK domain.
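As a purely illustrative sketch of step (2) of the pipeline, the Python snippet below recognises pharmacokinetic parameter mentions and nearby measurements with regular expressions; the parameter lexicon, patterns, and context window are assumptions for demonstration, whereas the thesis relies on learned NER and relation extraction models rather than rules.

```python
# Sketch: rule-based recognition of PK parameter mentions and measurements.
import re

PK_PARAMS = r"(?:clearance|half-life|AUC|Cmax|Tmax|bioavailability|Vd)"
MEASUREMENT = r"(\d+(?:\.\d+)?)\s*(mL/min|h|ng/mL|L/kg|%)"

def find_pk_mentions(sentence: str):
    """Yield (parameter, value, unit) triples found in one sentence."""
    for param in re.finditer(PK_PARAMS, sentence, flags=re.IGNORECASE):
        # Look for a measurement within a short window after the mention.
        window = sentence[param.end():param.end() + 40]
        m = re.search(MEASUREMENT, window)
        if m:
            yield param.group(), float(m.group(1)), m.group(2)

# Example: recovers ("clearance", 12.5, "mL/min") from a typical sentence.
print(list(find_pk_mentions("The mean clearance was 12.5 mL/min in adults.")))
```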