57,469 research outputs found
Towards Building a Knowledge Base of Monetary Transactions from a News Collection
We address the problem of extracting structured representations of economic
events from a large corpus of news articles, using a combination of natural
language processing and machine learning techniques. The developed techniques
allow for semi-automatic population of a financial knowledge base, which, in
turn, may be used to support a range of data mining and exploration tasks. The
key challenge we face in this domain is that the same event is often reported
multiple times, with varying correctness of details. We address this challenge
by first collecting all information pertinent to a given event from the entire
corpus, then considering all possible representations of the event, and
finally, using a supervised learning method, to rank these representations by
the associated confidence scores. A main innovative element of our approach is
that it jointly extracts and stores all attributes of the event as a single
representation (quintuple). Using a purpose-built test set we demonstrate that
our supervised learning approach can achieve 25% improvement in F1-score over
baseline methods that consider the earliest, the latest or the most frequent
reporting of the event.Comment: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital
Libraries (JCDL '17), 201
Extraction of Transcript Diversity from Scientific Literature
Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term âalternative splicingâ to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/
A text-mining system for extracting metabolic reactions from full-text articles
Background: Increasingly biological text mining research is focusing on the extraction of complex relationships
relevant to the construction and curation of biological networks and pathways. However, one important category of
pathwayâmetabolic pathwaysâhas been largely neglected.
Here we present a relatively simple method for extracting metabolic reaction information from free text that scores
different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence
and location of stemmed keywords. This method extends an approach that has proved effective in the context of the
extraction of proteinâprotein interactions.
Results: When evaluated on a set of manually-curated metabolic pathways using standard performance criteria, our
method performs surprisingly well. Precision and recall rates are comparable to those previously achieved for the
well-known protein-protein interaction extraction task.
Conclusions: We conclude that automated metabolic pathway construction is more tractable than has often been
assumed, and that (as in the case of proteinâprotein interaction extraction) relatively simple text-mining approaches can prove surprisingly effective. It is hoped that these results will provide an impetus to further research and act as a useful benchmark for judging the performance of more sophisticated methods that are yet to be developed
Automatic extraction of knowledge from web documents
A large amount of digital information available is written as text documents in the form of web pages, reports, papers, emails, etc. Extracting the knowledge of interest from such documents from multiple sources in a timely fashion is therefore crucial. This paper provides an update on the Artequakt system which uses natural language tools to automatically extract knowledge about artists from multiple documents based on a predefined ontology. The ontology represents the type and form of knowledge to extract. This knowledge is then used to generate tailored biographies. The information extraction process of Artequakt is detailed and evaluated in this paper
A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering
in turn recognition, generation, and extraction methods, also pointing to
prominent articles and resources.Comment: Technical Report, Natural Language Processing Group, Department of
Informatics, Athens University of Economics and Business, Greece, 201
Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction
Most existing event extraction (EE) methods merely extract event arguments
within the sentence scope. However, such sentence-level EE methods struggle to
handle soaring amounts of documents from emerging applications, such as
finance, legislation, health, etc., where event arguments always scatter across
different sentences, and even multiple such event mentions frequently co-exist
in the same document. To address these challenges, we propose a novel
end-to-end model, Doc2EDAG, which can generate an entity-based directed acyclic
graph to fulfill the document-level EE (DEE) effectively. Moreover, we
reformalize a DEE task with the no-trigger-words design to ease the
document-level event labeling. To demonstrate the effectiveness of Doc2EDAG, we
build a large-scale real-world dataset consisting of Chinese financial
announcements with the challenges mentioned above. Extensive experiments with
comprehensive analyses illustrate the superiority of Doc2EDAG over
state-of-the-art methods. Data and codes can be found at
https://github.com/dolphin-zs/Doc2EDAG.Comment: Accepted by EMNLP 201
- âŚ