965 research outputs found
Construction of the Literature Graph in Semantic Scholar
We describe a deployed scalable system for organizing published scientific
literature into a heterogeneous graph to facilitate algorithmic manipulation
and discovery. The resulting literature graph consists of more than 280M nodes,
representing papers, authors, entities and various interactions between them
(e.g., authorships, citations, entity mentions). We reduce literature graph
construction into familiar NLP tasks (e.g., entity extraction and linking),
point out research challenges due to differences from standard formulations of
these tasks, and report empirical results for each task. The methods described
in this paper are used to enable semantic features in www.semanticscholar.orgComment: To appear in NAACL 2018 industry trac
BoxEL: Concept and Role Box Embeddings for the Description Logic EL++
Description logic (DL) ontologies extend knowledge graphs (KGs) with
conceptual information and logical background knowledge. In recent years, there
has been growing interest in inductive reasoning techniques for such
ontologies, which promise to complement classical deductive reasoning
algorithms. Similar to KG completion, several existing approaches learn
ontology embeddings in a latent space, while additionally ensuring that they
faithfully capture the logical semantics of the underlying DL. However, they
suffer from several shortcomings, mainly due to a limiting role representation.
We propose BoxEL, which represents both concepts and roles as boxes (i.e.,
axis-aligned hyperrectangles) and demonstrate how it overcomes the limitations
of previous methods. We theoretically prove the soundness of our model and
conduct an extensive experimental evaluation, achieving state-of-the-art
results across a variety of datasets. As part of our evaluation, we introduce a
novel benchmark for subsumption prediction involving both atomic and complex
concepts
A rule-based approach to embedding techniques for text document classification
With the growth of online information and sudden expansion in the number of electronic documents provided on websites and in electronic libraries, there is difficulty in categorizing text documents. Therefore, a rule-based approach is a solution to this problem; the purpose of this study is to classify documents by using a rule-based. This paper deals with the rule-based approach with the embedding technique for a document to vector (doc2vec) files. An experiment was performed on two data sets Reuters-21578 and the 20 Newsgroups to classify the top ten categories of these data sets by using a document to vector rule-based (D2vecRule). Finally, this method provided us a good classification result according to the F-measures and implementation time metrics. In conclusion, it was observed that our algorithm document to vector rule-based (D2vecRule) was good when compared with other algorithms such as JRip, One R, and ZeroR applied to the same Reuters-21578 dataset. Keywords: text classification; rule-based; word embedding; Doc2vec.publishedVersio
Site-Specific Rules Extraction in Precision Agriculture
El incremento sostenible en la producción alimentaria para satisfacer las necesidades de una población mundial en aumento es un verdadero reto cuando tenemos en cuenta el impacto constante de plagas y enfermedades en los cultivos. Debido a las importantes pérdidas económicas que se producen, el uso de tratamientos químicos es demasiado alto; causando contaminación del medio ambiente y resistencia a distintos tratamientos. En este contexto, la comunidad agrícola divisa la aplicación de tratamientos más específicos para cada lugar, así como la validación automática con la conformidad legal. Sin embargo, la especificación de estos tratamientos se encuentra en regulaciones expresadas en lenguaje natural. Por este motivo, traducir regulaciones a una representación procesable por máquinas está tomando cada vez más importancia en la agricultura de precisión.Actualmente, los requisitos para traducir las regulaciones en reglas formales están lejos de ser cumplidos; y con el rápido desarrollo de la ciencia agrícola, la verificación manual de la conformidad legal se torna inabordable.En esta tesis, el objetivo es construir y evaluar un sistema de extracción de reglas para destilar de manera efectiva la información relevante de las regulaciones y transformar las reglas de lenguaje natural a un formato estructurado que pueda ser procesado por máquinas. Para ello, hemos separado la extracción de reglas en dos pasos. El primero es construir una ontología del dominio; un modelo para describir los desórdenes que producen las enfermedades en los cultivos y sus tratamientos. El segundo paso es extraer información para poblar la ontología. Puesto que usamos técnicas de aprendizaje automático, implementamos la metodología MATTER para realizar el proceso de anotación de regulaciones. Una vez creado el corpus, construimos un clasificador de categorías de reglas que discierne entre obligaciones y prohibiciones; y un sistema para la extracción de restricciones en reglas, que reconoce información relevante para retener el isomorfismo con la regulación original. Para estos componentes, empleamos, entre otra técnicas de aprendizaje profundo, redes neuronales convolucionales y “Long Short- Term Memory”. Además, utilizamos como baselines algoritmos más tradicionales como “support-vector machines” y “random forests”.Como resultado, presentamos la ontología PCT-O, que ha sido alineada con otras ontologías como NCBI, PubChem, ChEBI y Wikipedia. El modelo puede ser utilizado para la identificación de desórdenes, el análisis de conflictos entre tratamientos y la comparación entre legislaciones de distintos países. Con respecto a los sistemas de extracción, evaluamos empíricamente el comportamiento con distintas métricas, pero la métrica F1 es utilizada para seleccionar los mejores sistemas. En el caso del clasificador de categorías de reglas, el mejor sistema obtiene un macro F1 de 92,77% y un F1 binario de 85,71%. Este sistema usa una red “bidirectional long short-term memory” con “word embeddings” como entrada. En relación al extractor de restricciones de reglas, el mejor sistema obtiene un micro F1 de 88,3%. Este extractor utiliza como entrada una combinación de “character embeddings” junto a “word embeddings” y una red neuronal “bidirectional long short-term memory”.<br /
Extracting Biomolecular Interactions Using Semantic Parsing of Biomedical Text
We advance the state of the art in biomolecular interaction extraction with
three contributions: (i) We show that deep, Abstract Meaning Representations
(AMR) significantly improve the accuracy of a biomolecular interaction
extraction system when compared to a baseline that relies solely on surface-
and syntax-based features; (ii) In contrast with previous approaches that infer
relations on a sentence-by-sentence basis, we expand our framework to enable
consistent predictions over sets of sentences (documents); (iii) We further
modify and expand a graph kernel learning framework to enable concurrent
exploitation of automatically induced AMR (semantic) and dependency structure
(syntactic) representations. Our experiments show that our approach yields
interaction extraction systems that are more robust in environments where there
is a significant mismatch between training and test conditions.Comment: Appearing in Proceedings of the Thirtieth AAAI Conference on
Artificial Intelligence (AAAI-16
Automatic Text Summarization
Writing text was one of the first ever methods used by humans to represent their knowledge.
Text can be of different types and have different purposes.
Due to the evolution of information systems and the Internet, the amount of textual information available has increased exponentially in a worldwide scale, and many documents tend
to have a percentage of unnecessary information. Due to this event, most readers have difficulty in digesting all the extensive information contained in multiple documents, produced
on a daily basis.
A simple solution to the excessive irrelevant information in texts is to create summaries, in
which we keep the subject’s related parts and remove the unnecessary ones.
In Natural Language Processing, the goal of automatic text summarization is to create systems that process text and keep only the most important data. Since its creation several
approaches have been designed to create better text summaries, which can be divided in two
separate groups: extractive approaches and abstractive approaches.
In the first group, the summarizers decide what text elements should be in the summary. The
criteria by which they are selected is diverse. After they are selected, they are combined into
the summary. In the second group, the text elements are generated from scratch. Abstractive
summarizers are much more complex so they still need a lot of research, in order to represent
good results.
During this thesis, we have investigated the state of the art approaches, implemented our
own versions and tested them in conventional datasets, like the DUC dataset.
Our first approach was a frequencybased approach, since it analyses the frequency in which
the text’s words/sentences appear in the text. Higher frequency words/sentences automatically receive higher scores which are then filtered with a compression rate and combined in
a summary.
Moving on to our second approach, we have improved the original TextRank algorithm by
combining it with word embedding vectors. The goal was to represent the text’s sentences as
nodes from a graph and with the help of word embeddings, determine how similar are pairs
of sentences and rank them by their similarity scores. The highest ranking sentences were
filtered with a compression rate and picked for the summary.
In the third approach, we combined feature analysis with deep learning. By analysing certain
characteristics of the text sentences, one can assign scores that represent the importance of
a given sentence for the summary. With these computed values, we have created a dataset
for training a deep neural network that is capable of deciding if a certain sentence must be
or not in the summary.
An abstractive encoderdecoder summarizer was created with the purpose of generating words
related to the document subject and combining them into a summary. Finally, every single
summarizer was combined into a full system.
Each one of our approaches was evaluated with several evaluation metrics, such as ROUGE.
We used the DUC dataset for this purpose and the results were fairly similar to the ones in
the scientific community. As for our encoderdecode, we got promising results.O texto é um dos utensílios mais importantes de transmissão de ideias entre os seres humanos. Pode ser de vários tipos e o seu conteúdo pode ser mais ou menos fácil de interpretar,
conforme a quantidade de informação relevante sobre o assunto principal.
De forma a facilitar o processamento pelo leitor existe um mecanismo propositadamente criado para reduzir a informação irrelevante num texto, chamado sumarização de texto. Através
da sumarização criamse versões reduzidas do text original e mantémse a informação do assunto principal.
Devido à criação e evolução da Internet e outros meios de comunicação, surgiu um aumento
exponencial de documentos textuais, evento denominado de sobrecarga de informação, que
têm na sua maioria informação desnecessária sobre o assunto que retratam.
De forma a resolver este problema global, surgiu dentro da área científica de Processamento
de Linguagem Natural, a sumarização automática de texto, que permite criar sumários automáticos de qualquer tipo de texto e de qualquer lingua, através de algoritmos computacionais.
Desde a sua criação, inúmeras técnicas de sumarização de texto foram idealizadas, podendo
ser classificadas em dois tipos diferentes: extractivas e abstractivas. Em técnicas extractivas,
são transcritos elementos do texto original, como palavras ou frases inteiras que sejam as
mais ilustrativas do assunto do texto e combinadas num documento. Em técnicas abstractivas, os algoritmos geram elementos novos.
Nesta dissertação pesquisaramse, implementaramse e combinaramse algumas das técnicas com melhores resultados de modo a criar um sistema completo para criar sumários.
Relativamente às técnicas implementadas, as primeiras três são técnicas extractivas enquanto
que a ultima é abstractiva. Desta forma, a primeira incide sobre o cálculo das frequências dos
elementos do texto, atribuindose valores às frases que sejam mais frequentes, que por sua
vez são escolhidas para o sumário através de uma taxa de compressão. Outra das técnicas
incide na representação dos elementos textuais sob a forma de nodos de um grafo, sendo
atribuidos valores de similaridade entre os mesmos e de seguida escolhidas as frases com
maiores valores através de uma taxa de compressão. Uma outra abordagem foi criada de
forma a combinar um mecanismo de análise das caracteristicas do texto com métodos baseados em inteligência artificial. Nela cada frase possui um conjunto de caracteristicas que são
usadas para treinar um modelo de rede neuronal. O modelo avalia e decide quais as frases
que devem pertencer ao sumário e filtra as mesmas através deu uma taxa de compressão.
Um sumarizador abstractivo foi criado para para gerar palavras sobre o assunto do texto e
combinar num sumário. Cada um destes sumarizadores foi combinado num só sistema. Por
fim, cada uma das técnicas pode ser avaliada segundo várias métricas de avaliação, como
por exemplo a ROUGE. Segundo os resultados de avaliação das técnicas, com o conjunto de
dados DUC, os nossos sumarizadores obtiveram resultados relativamente parecidos com os
presentes na comunidade cientifica, com especial atenção para o codificadordescodificador
que em certos casos apresentou resultados promissores
Natural Language Interfaces to Data
Recent advances in NLU and NLP have resulted in renewed interest in natural
language interfaces to data, which provide an easy mechanism for non-technical
users to access and query the data. While early systems evolved from keyword
search and focused on simple factual queries, the complexity of both the input
sentences as well as the generated SQL queries has evolved over time. More
recently, there has also been a lot of focus on using conversational interfaces
for data analytics, empowering a line of non-technical users with quick
insights into the data. There are three main challenges in natural language
querying (NLQ): (1) identifying the entities involved in the user utterance,
(2) connecting the different entities in a meaningful way over the underlying
data source to interpret user intents, and (3) generating a structured query in
the form of SQL or SPARQL.
There are two main approaches for interpreting a user's NLQ. Rule-based
systems make use of semantic indices, ontologies, and KGs to identify the
entities in the query, understand the intended relationships between those
entities, and utilize grammars to generate the target queries. With the
advances in deep learning (DL)-based language models, there have been many
text-to-SQL approaches that try to interpret the query holistically using DL
models. Hybrid approaches that utilize both rule-based techniques as well as DL
models are also emerging by combining the strengths of both approaches.
Conversational interfaces are the next natural step to one-shot NLQ by
exploiting query context between multiple turns of conversation for
disambiguation. In this article, we review the background technologies that are
used in natural language interfaces, and survey the different approaches to
NLQ. We also describe conversational interfaces for data analytics and discuss
several benchmarks used for NLQ research and evaluation.Comment: The full version of this manuscript, as published by Foundations and
Trends in Databases, is available at http://dx.doi.org/10.1561/190000007
- …