2,998 research outputs found
Document Filtering for Long-tail Entities
Filtering relevant documents with respect to entities is an essential task in
the context of knowledge base construction and maintenance. It entails
processing a time-ordered stream of documents that might be relevant to an
entity in order to select only those that contain vital information.
State-of-the-art approaches to document filtering for popular entities are
entity-dependent: they rely on and are also trained on the specifics of
differentiating features for each specific entity. Moreover, these approaches
tend to use so-called extrinsic information such as Wikipedia page views and
related entities which is typically only available only for popular head
entities. Entity-dependent approaches based on such signals are therefore
ill-suited as filtering methods for long-tail entities. In this paper we
propose a document filtering method for long-tail entities that is
entity-independent and thus also generalizes to unseen or rarely seen entities.
It is based on intrinsic features, i.e., features that are derived from the
documents in which the entities are mentioned. We propose a set of features
that capture informativeness, entity-saliency, and timeliness. In particular,
we introduce features based on entity aspect similarities, relation patterns,
and temporal expressions and combine these with standard features for document
filtering. Experiments following the TREC KBA 2014 setup on a publicly
available dataset show that our model is able to improve the filtering
performance for long-tail entities over several baselines. Results of applying
the model to unseen entities are promising, indicating that the model is able
to learn the general characteristics of a vital document. The overall
performance across all entities---i.e., not just long-tail entities---improves
upon the state-of-the-art without depending on any entity-specific training
data.Comment: CIKM2016, Proceedings of the 25th ACM International Conference on
Information and Knowledge Management. 201
Extracting information from informal communication
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.Includes bibliographical references (leaves 89-93).This thesis focuses on the problem of extracting information from informal communication. Textual informal communication, such as e-mail, bulletin boards and blogs, has become a vast information resource. However, such information is poorly organized and difficult for a computer to understand due to lack of editing and structure. Thus, techniques which work well for formal text, such as newspaper articles, may be considered insufficient on informal text. One focus of ours is to attempt to advance the state-of-the-art for sub-problems of the information extraction task. We make contributions to the problems of named entity extraction, co-reference resolution and context tracking. We channel our efforts toward methods which are particularly applicable to informal communication. We also consider a type of information which is somewhat unique to informal communication: preferences and opinions. Individuals often expression their opinions on products and services in such communication. Others' may read these "reviews" to try to predict their own experiences. However, humans do a poor job of aggregating and generalizing large sets of data. We develop techniques that can perform the job of predicting unobserved opinions.(cont.) We address both the single-user case where information about the items is known, and the multi-user case where we can generalize opinions without external information. Experiments on large-scale rating data sets validate our approach.by Jason D.M. Rennie.Ph.D
- …