80,227 research outputs found
A Bootstrapping architecture for time expression recognition in unlabelled corpora via syntactic-semantic patterns
In this paper we describe a semi-supervised approach to the extraction of time expression mentions in large unlabelled corpora based on bootstrapping.
Bootstrapping techniques rely on a relatively small amount of initial human-supplied examples (termed “seeds”) of the type of entity or concept to be learned, in order to capture an initial set of patterns or rules from the unlabelled text that extract the supplied data. In turn, the learned patterns are employed to find new potential examples, and the process is repeated to grow the set of patterns and (optionally) the set of examples. In order to prevent the learned pattern set from producing spurious results, it becomes essential
to implement a ranking and selection procedure to filter out “bad” patterns and, depending on the case, new candidate examples. Therefore, the type of patterns employed (knowledge representation) as well as the ranking and selection procedure are paramount to the quality of the results. We present a complete bootstrapping algorithm for recognition of time expressions, with a special emphasis on the type of patterns used (a combination of semantic and morpho- syntantic elements) and the ranking and selection criteria. Bootstrap-
ping techniques have been previously employed with limited success for several NLP problems, both of recognition and classification, but their application to time expression recognition is, to the best of our knowledge, novel. As of this writing, the described architecture is in the final stages of implementation, with experimention and evalution being already underway.Postprint (published version
A Machine learning approach to POS tagging
We have applied inductive learning of statistical decision trees
and relaxation labelling to the Natural Language Processing (NLP)
task of morphosyntactic disambiguation (Part Of Speech Tagging).
The learning process is supervised and obtains a language
model oriented to resolve POS ambiguities. This model consists
of a set of statistical decision trees expressing distribution of
tags and words in some relevant contexts.
The acquired language models are complete enough to be directly
used as sets of POS disambiguation rules, and include more complex
contextual information than simple collections of n-grams usually
used in statistical taggers.
We have implemented a quite simple and fast tagger that has been
tested and evaluated on the Wall Street Journal (WSJ) corpus with
a remarkable accuracy.
However, better results can be obtained by translating the trees
into rules to feed a flexible relaxation labelling based tagger.
In this direction we describe a tagger which is able to use
information of any kind (n-grams, automatically acquired constraints,
linguistically motivated manually written constraints, etc.), and in
particular to incorporate the machine learned decision trees.
Simultaneously, we address the problem of tagging when only
small training material is available, which is crucial in any process
of constructing, from scratch, an annotated corpus. We show that quite
high accuracy can be achieved with our system in this situation.Postprint (published version
Learning Dictionaries for Named Entity Recognition using Minimal Supervision
This paper describes an approach for automatic construction of dictionaries
for Named Entity Recognition (NER) using large amounts of unlabeled data and a
few seed examples. We use Canonical Correlation Analysis (CCA) to obtain lower
dimensional embeddings (representations) for candidate phrases and classify
these phrases using a small number of labeled examples. Our method achieves
16.5% and 11.3% F-1 score improvement over co-training on disease and virus NER
respectively. We also show that by adding candidate phrase embeddings as
features in a sequence tagger gives better performance compared to using word
embeddings.Comment: In 14th Conference of the European Chapter of the Association for
Computational Linguistic, 201
Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art
Information Extraction, data Integration, and uncertain data management are different areas of research that got vast focus in the last two decades. Many researches tackled those areas of research individually. However, information extraction systems should have integrated with data integration methods to make use of the extracted information. Handling uncertainty in extraction and integration process is an important issue to enhance the quality of the data in such integrated systems. This article presents the state of the art of the mentioned areas of research and shows the common grounds and how to integrate information extraction and data integration under uncertainty management cover
- …