Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision
Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems.
In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capabilities by improving key technologies, and we apply the system to an entire hierarchy comprising 100 NE types.
Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution.
We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performance comparable to that of baseline supervised systems in the task of annotating NEs in texts.
Higher-order inference in conditional random fields using submodular functions
Higher-order and dense conditional random fields (CRFs) are expressive graphical
models which have been very successful in low-level computer vision applications
such as semantic segmentation and stereo matching. These models are able to
capture long-range interactions and higher-order image statistics much better
than pairwise CRFs. This expressive power comes at a price, though: inference
problems in these models are computationally very demanding. This is a
particular challenge in computer vision, where fast inference is important and
the problem involves millions of pixels.
In this thesis, we look at how submodular functions can help us design
efficient inference methods for higher-order and dense CRFs. Submodular
functions are special discrete functions that have important properties from
an optimisation perspective, and are closely related to convex functions. We
use submodularity in a two-fold manner: (a) to design an efficient MAP inference
algorithm for a robust higher-order model that generalises the widely-used
truncated convex models, and (b) to glean insights into a recently proposed
variational inference algorithm which give us a principled approach for applying
it efficiently to higher-order and dense CRFs.
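As an illustration of the submodularity property the abstract relies on, here is a minimal sketch that is not taken from the thesis. It brute-force checks the standard diminishing-returns definition: for all A ⊆ B and x ∉ B, f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B). The helper names (`is_submodular`, `cut_value`) and the toy graph are assumptions made for this example, and the check is exponential in the size of the ground set, so it only suits tiny cases.

```python
from itertools import combinations


def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground, tol=1e-9):
    """Brute-force test of diminishing returns for a set function f."""
    ground = frozenset(ground)
    all_sets = list(subsets(ground))
    for a in all_sets:
        for b in all_sets:
            if not a <= b:
                continue  # only compare nested pairs A ⊆ B
            for x in ground - b:
                gain_a = f(a | {x}) - f(a)
                gain_b = f(b | {x}) - f(b)
                if gain_a < gain_b - tol:
                    return False
    return True


# Hypothetical toy graph: undirected edges with non-negative weights.
EDGES = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 0.5}


def cut_value(s):
    """Total weight of edges with exactly one endpoint in s."""
    return sum(w for (u, v), w in EDGES.items() if (u in s) != (v in s))


print(is_submodular(cut_value, {0, 1, 2}))              # graph cut function
print(is_submodular(lambda s: len(s) ** 2, {0, 1, 2}))  # squared cardinality
```

The graph cut function with non-negative weights passes the check, while squared cardinality fails it because its marginal gains grow as the set grows.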
Named Entity Recognition for Bacterial Type IV Secretion Systems
Research on specialized biological systems is often hampered by a lack of consistent terminology, especially across species. In bacterial Type IV secretion systems, genes within one set of orthologs may have over a dozen different names. Classifying research publications based on biological processes, cellular components, molecular functions, and microorganism species should improve the precision and recall of literature searches, allowing researchers to keep up with the exponentially growing literature through resources such as the Pathosystems Resource Integration Center (PATRIC, patricbrc.org). We developed named entity recognition (NER) tools for four entities related to Type IV secretion systems: 1) bacteria names, 2) biological processes, 3) molecular functions, and 4) cellular components. These four entities are important to pathogenesis and virulence research but have received less attention than other entities, e.g., genes and proteins. Based on an annotated corpus, large domain terminological resources, and machine learning techniques, we developed recognizers for these entities. High accuracy rates (>80%) are achieved for bacteria, biological processes, and molecular functions. Contrastive experiments highlighted the effectiveness of alternate recognition strategies; results of term extraction on contrasting document sets demonstrated the utility of these classes for identifying T4SS-related documents.
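For readers unfamiliar with the two search-quality measures mentioned in the abstract, the following minimal sketch computes precision and recall for a single literature search. It uses the standard definitions and is not the paper's evaluation code; the function name and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: four documents returned, three truly relevant.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc3", "doc5"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

In this made-up case two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).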