Kernelized Hashcode Representations for Relation Extraction
Kernel methods have produced state-of-the-art results for a number of NLP
tasks such as relation extraction, but suffer from poor scalability due to the
high cost of computing kernel similarities between natural language structures.
A recently proposed technique, kernelized locality-sensitive hashing (KLSH),
can significantly reduce the computational cost, but is only applicable to
classifiers operating on kNN graphs. Here we propose to use random subspaces of
KLSH codes for efficiently constructing an explicit representation of NLP
structures suitable for general classification methods. Further, we propose an
approach for optimizing the KLSH model for classification problems by
maximizing an approximation of mutual information between the KLSH codes
(feature vectors) and the class labels. We evaluate the proposed approach on
biomedical relation extraction datasets, and observe significant and robust
improvements in accuracy w.r.t. state-of-the-art classifiers, along with
drastic (orders-of-magnitude) speedup compared to conventional kernel methods. Comment: To appear in the proceedings of conference, AAAI-1
Information Extraction in Illicit Domains
Extracting useful entities and attribute values from illicit domains such as
human trafficking is a challenging problem with the potential for widespread
social impact. Such domains employ atypical language models, have `long tails'
and suffer from the problem of concept drift. In this paper, we propose a
lightweight, feature-agnostic Information Extraction (IE) paradigm specifically
designed for such domains. Our approach uses raw, unlabeled text from an
initial corpus, and a few (12-120) seed annotations per domain-specific
attribute, to learn robust IE models for unobserved pages and websites.
Empirically, we demonstrate that our approach can outperform feature-centric
Conditional Random Field baselines by over 18% F-Measure on five annotated
sets of real-world human trafficking datasets in both low-supervision and
high-supervision settings. We also show that our approach is demonstrably
robust to concept drift, and can be efficiently bootstrapped even in a serial
computing environment. Comment: 10 pages, ACM WWW 201
Polarized Broad-Line Emission from Low-Luminosity Active Galactic Nuclei
In order to determine whether unified models of active galactic nuclei apply
to low-luminosity objects, we have undertaken a spectropolarimetric survey of
LINERs and Seyfert nuclei at the Keck Observatory. The 14 objects observed
have a median H-alpha luminosity of 8x10^{39} erg/s, well below the typical
value of ~10^{41} erg/s for Markarian Seyfert nuclei. Polarized broad H-alpha
emission is detected in three LINERs: NGC 315, NGC 1052, and NGC 4261. Each of
these is an elliptical galaxy with a double-sided radio jet, and the
emission-line polarization in each case is oriented roughly perpendicular to
the jet axis, as expected for the obscuring torus model. NGC 4261 and NGC 315
are known to contain dusty circumnuclear disks, which may be the outer
extensions of the obscuring tori. The detection of polarized broad-line
emission suggests that these objects are nearby, low-luminosity analogs of
obscured quasars residing in narrow-line radio galaxies. The nuclear continuum
of the low-luminosity Seyfert 1 galaxy NGC 4395 is polarized at p = 0.67%,
possibly the result of an electron scattering region near the nucleus.
Continuum polarization is detected in other objects, with a median level of p =
0.36% over 5100-6100 A, but in most cases this is likely to be the result of
transmission through foreground dust. The lack of significant broad-line
polarization in most type 1 LINERs is consistent with the hypothesis that we
view the broad-line regions of these objects directly, rather than in scattered
light. Comment: 28 pages, including 3 tables and 16 figures. Uses the emulateapj
latex style file. Accepted for publication in The Astrophysical Journa
Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art
Information extraction, data integration, and uncertain data management are distinct research areas that have received considerable attention over the last two decades. Many studies have tackled these areas individually. However, information extraction systems should be integrated with data integration methods to make use of the extracted information. Handling uncertainty in the extraction and integration processes is important for enhancing the quality of the data in such integrated systems. This article presents the state of the art in these research areas, identifies their common ground, and shows how to integrate information extraction and data integration under uncertainty management cover
Algebraic Comparison of Partial Lists in Bioinformatics
The outcome of a functional genomics pipeline is usually a partial list of
genomic features, ranked by their relevance in modelling biological phenotype
in terms of a classification or regression model. Due to resampling protocols
or just within a meta-analysis comparison, instead of one list it is often the
case that sets of alternative feature lists (possibly of different lengths) are
obtained. Here we introduce a method, based on the algebraic theory of
symmetric groups, for studying the variability between lists ("list stability")
in the case of lists of unequal length. We provide algorithms evaluating
stability for lists embedded in the full feature set or just limited to the
features occurring in the partial lists. The method is demonstrated first on
synthetic data in a gene filtering task and then for finding gene profiles on a
recent prostate cancer dataset