Extracting protein-protein interactions from text using rich feature vectors and feature selection
Because of the intrinsic complexity of natural language, automatically extracting accurate information from text remains a challenge. We have applied rich feature vectors derived from dependency graphs to predict protein-protein interactions using machine learning techniques. We present the first extensive analysis of applying feature selection in this domain, and show that it can produce more cost-effective models. For the first time, our technique was also evaluated on several large-scale cross-dataset experiments, which offer a more realistic view of model performance.
During benchmarking, we encountered several fundamental problems hindering comparability with other methods. We present a set of practical guidelines to set up a meaningful evaluation.
Finally, we have analysed the feature sets from our experiments before and after feature selection, and evaluated the contribution of both lexical and syntactic information to our method. The insight gained will be useful for developing better-performing methods in this domain.
A Labeled Graph Kernel for Relationship Extraction
In this paper, we propose an approach for Relationship Extraction (RE) based
on labeled graph kernels. The kernel we propose is a particularization of a
random walk kernel that exploits two properties previously studied in the RE
literature: (i) the words between the candidate entities or connecting them in
a syntactic representation are particularly likely to carry information
regarding the relationship; and (ii) combining information from distinct
sources in a kernel may help the RE system make better decisions. We performed
experiments on a dataset of protein-protein interactions and the results show
that our approach obtains effectiveness values that are comparable with the
state-of-the-art kernel methods. Moreover, our approach is able to outperform
the state-of-the-art kernels when combined with other kernel methods.
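To make the random walk kernel idea concrete, here is a minimal sketch of a generic labeled random walk kernel. It is not the particular kernel proposed in the paper: the graph encoding (a pair of a node-label dict and an adjacency dict), the function name `random_walk_kernel`, and the `max_steps` and `decay` parameters are all assumptions chosen for illustration. The kernel counts walks whose node labels match in both graphs, down-weighting longer walks.

```python
def random_walk_kernel(graph_a, graph_b, max_steps=3, decay=0.5):
    """Count label-matching walks shared by two directed labeled graphs.

    Each graph is a (labels, edges) pair: labels maps node -> label and
    edges maps node -> list of successor nodes. (Hypothetical encoding.)
    """
    labels_a, edges_a = graph_a
    labels_b, edges_b = graph_b
    # Nodes of the direct product graph: node pairs that carry the same label.
    pairs = [(u, v) for u in labels_a for v in labels_b
             if labels_a[u] == labels_b[v]]
    pair_set = set(pairs)
    # walks[p] = number of common walks of the current length ending at pair p.
    walks = {p: 1 for p in pairs}
    total = float(len(pairs))  # walks of length 0
    weight = 1.0
    for _ in range(max_steps):
        weight *= decay  # longer walks contribute less
        extended = {p: 0 for p in pairs}
        for (u, v), count in walks.items():
            for u2 in edges_a.get(u, ()):
                for v2 in edges_b.get(v, ()):
                    if (u2, v2) in pair_set:
                        extended[(u2, v2)] += count
        walks = extended
        total += weight * sum(walks.values())
    return total


# Toy sentence graphs (illustrative data, not taken from any dataset).
BINDS = ({0: "PROT", 1: "binds", 2: "PROT"}, {0: [1], 1: [2]})
INHIBITS = ({0: "PROT", 1: "inhibits", 2: "PROT"}, {0: [1], 1: [2]})

same = random_walk_kernel(BINDS, BINDS)
different = random_walk_kernel(BINDS, INHIBITS)
print(same, different)
```

In this toy case the two graphs differ only in the connecting word, so they share no walk longer than a single node and the kernel value is lower than when a graph is compared with itself.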
Extroverts Tweet Differently from Introverts in Weibo
As dominant factors driving human actions, personalities can be excellent
indicators for predicting the offline and online behavior of different
individuals. However, because of the great expense and inevitable subjectivity
of questionnaires and surveys, it is challenging for conventional studies to
explore the connection between personality and behavior and to gain insights
across large numbers of individuals. Considering the increasingly important
role of online social media in daily communication, we argue that the
footprints of massive numbers of individuals, like tweets in Weibo, can serve
as a proxy to infer personality and further understand its role in shaping
online human behavior. In this study, a map from self-reports of
personalities to online profiles of 293 active users in Weibo is established to
train a competent machine learning model, which then successfully identifies
over 7,000 users as extroverts or introverts. Systematic comparisons from the
perspectives of tempo-spatial patterns, online activities, emotion expressions
and attitudes to virtual honor surprisingly disclose that extroverts indeed
behave differently from introverts in Weibo. Our findings provide solid
evidence to justify the methodology of employing machine learning to
objectively study the personalities of massive numbers of individuals and shed
light on applications of probing personalities and corresponding behaviors
solely through online profiles.
Comment: Datasets of this study can be freely downloaded through:
https://doi.org/10.6084/m9.figshare.4765150.v
Extracting spatial relations from document for geographic information retrieval
IEEE Geoscience and Remote Sensing Society (IEEE GRSS); East China Norm. Univ., Sch. Resour. Environ. Sci.; Shanghai Urban Dev. Inf. Res. Cent.; The Geographical Society of Shanghai; East China Univ. Sci. Technol., Bus. Sch.
Geographic information retrieval (GIR) is developed to retrieve geographical information from unstructured text (commonly web documents). Previous research has focused on applying traditional information retrieval (IR) techniques to GIR, such as ranking geographic relevance with the vector space model (VSM). In many cases, these keyword-based methods cannot support spatial queries very well. For example, when searching for documents on "debris flow took place in Hunan last year", the documents selected in this way may merely contain the words "debris flow" and "Hunan" rather than refer to a debris flow that actually occurred in Hunan. The lack of spatial relations between thematic activities (debris flow) and geographic entities (Hunan) is the key reason for this problem. In this paper, we present a kernel-based approach and apply it in a support vector machine (SVM) to extract spatial relations from free text for further GIS services and spatial reasoning. First, we analyze the characteristics of spatial relation expressions in natural language and identify two types of spatial relations: topology and direction. Both are used to qualitatively describe the relative positions of spatial objects to each other. Then we explore the use of the dependency tree (a dependency tree represents the grammatical dependencies in a sentence and can be generated by a syntax parser) to identify these spatial relations. We observe that the features required to find a relationship between two spatial named entities in the same sentence are typically captured by the shortest path between the two entities in the dependency tree. Therefore, we construct a shortest path dependency kernel for the SVM to complete the task.
The experimental results show that our dependency tree kernel achieves a significant improvement over the previous method.
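The shortest-path observation in this abstract can be sketched as a breadth-first search over a dependency tree. The code below covers only the path extraction step, not the kernel or the SVM. The edge list is a hand-written, hypothetical parse rather than the output of a real parser, and the function name `shortest_dependency_path` is an assumption.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the token path between two nodes of a dependency tree via BFS."""
    # Treat dependency arcs as undirected so the path can go up and down the tree.
    neighbours = {}
    for head, dependent in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)

    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # the two tokens are not connected


# Hypothetical (head, dependent) arcs for "A debris flow occurred in Hunan last year".
EDGES = [
    ("occurred", "flow"),
    ("flow", "debris"),
    ("flow", "A"),
    ("occurred", "in"),
    ("in", "Hunan"),
    ("occurred", "year"),
    ("year", "last"),
]

path = shortest_dependency_path(EDGES, "flow", "Hunan")
print(path)  # ['flow', 'occurred', 'in', 'Hunan']
```

The path keeps the verb and preposition that link the thematic activity to the place name while dropping modifiers such as "last year", which is the intuition behind using it as the feature source.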
Knowledge Base Population using Semantic Label Propagation
A crucial aspect of a knowledge base population system that extracts new
facts from text corpora is the generation of training data for its relation
extractors. In this paper, we present a method that maximizes the effectiveness
of newly trained relation extractors at a minimal annotation cost. Manual
labeling can be significantly reduced by Distant Supervision, which is a method
to construct training data automatically by aligning a large text corpus with
an existing knowledge base of known facts. For example, all sentences
mentioning both 'Barack Obama' and 'US' may serve as positive training
instances for the relation born_in(subject,object). However, distant
supervision typically results in a highly noisy training set: many training
sentences do not really express the intended relation. We propose to combine
distant supervision with minimal manual supervision in a technique called
feature labeling, to eliminate noise from the large and noisy initial training
set, resulting in a significant increase of precision. We further improve on
this approach by introducing the Semantic Label Propagation method, which uses
the similarity between low-dimensional representations of candidate training
instances, to extend the training set in order to increase recall while
maintaining high precision. Our proposed strategy for generating training data
is studied and evaluated on an established test collection designed for
knowledge base population tasks. The experimental results show that the
Semantic Label Propagation strategy leads to substantial performance gains when
compared to existing approaches, while requiring an almost negligible manual
annotation effort.
Comment: Submitted to Knowledge Based Systems, special issue on Knowledge
Bases for Natural Language Processin
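The distant supervision step described in this abstract can be illustrated with a minimal sketch. The example sentences, the plain substring matching, and the names `distant_supervision` and `filter_by_labeled_features` are assumptions made for illustration. The filter is only a crude stand-in for feature labeling, and Semantic Label Propagation is not shown.

```python
def distant_supervision(sentences, knowledge_base):
    """Label every sentence that mentions both arguments of a known fact."""
    examples = []
    for sentence in sentences:
        for relation, subject, obj in knowledge_base:
            # Naive alignment: plain substring match on both entity names.
            if subject in sentence and obj in sentence:
                examples.append((sentence, relation, subject, obj))
    return examples


def filter_by_labeled_features(examples, trusted_features):
    """Keep examples containing a manually approved pattern for their relation.

    A simplified stand-in for feature labeling, assumed for illustration.
    """
    return [
        example for example in examples
        if any(f in example[0] for f in trusted_features.get(example[1], ()))
    ]


KB = [("born_in", "Barack Obama", "US")]
SENTENCES = [
    "Barack Obama was born in Hawaii, US.",
    "Barack Obama returned to the US after a state visit.",  # noisy match
    "Angela Merkel visited the US.",
]

noisy = distant_supervision(SENTENCES, KB)
clean = filter_by_labeled_features(noisy, {"born_in": ["born in"]})
print(len(noisy), len(clean))
```

The second sentence mentions both entities without expressing born_in, which is the kind of noise the abstract refers to; the small set of approved patterns removes it at the cost of possibly discarding valid but differently worded sentences.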
Detection of Gene Interactions Based on Syntactic Relations
Interactions between proteins and genes are considered essential in
the description of biomolecular phenomena, and networks of interactions
are applied in a systems biology approach. Recently, many studies have
sought to extract information from biomolecular text using natural language
processing technology. Previous studies have asserted that linguistic
information is useful for improving the detection of gene interactions.
In particular, syntactic relations among linguistic information are good
for detecting gene interactions. However, previous systems give reasonably
good precision but poor recall. To improve recall without sacrificing
precision, this paper proposes a three-phase method for detecting gene
interactions based on syntactic relations. In the first phase, we retrieve
syntactic encapsulation categories for each candidate agent and target.
In the second phase, we construct a verb list that indicates the nature of
the interaction between pairs of genes. In the last phase, we determine
direction rules to detect which of two genes is the agent or target. Even
without biomolecular knowledge, our method performs reasonably well using
a small training dataset. While the first phase contributes to improving
recall, the second and third phases contribute to improving precision. In
experiments on the ICML 05 Workshop on Learning Language in Logic (LLL05)
data, our proposed method gave an F-measure of 67.2% for the test data,
significantly outperforming previous methods. We also describe the
contribution of each phase to the performance.