11 research outputs found
Toward Gender-Inclusive Coreference Resolution
Correctly resolving textual mentions of people fundamentally entails making
inferences about those people. Such inferences raise the risk of systemic
biases in coreference resolution systems, including biases that can harm binary
and non-binary trans and cis stakeholders. To better understand such biases, we
foreground nuanced conceptualizations of gender from sociology and
sociolinguistics, and develop two new datasets for interrogating bias in crowd
annotations and in existing coreference resolution systems. Through these
studies, conducted on English text, we confirm that without acknowledging and
building systems that recognize the complexity of gender, we build systems that
lead to many potential harms. Comment: 28 pages; ACL version.
Semi-Supervised Learning For Identifying Opinions In Web Content
Thesis (Ph.D.) - Indiana University, Information Science, 2011
Opinions published on the World Wide Web (Web) offer opportunities for detecting personal attitudes regarding topics, products, and services. The opinion detection literature indicates that both a large body of opinions and a wide variety of opinion features are essential for capturing subtle opinion information. Although a large amount of opinion-labeled data is preferable for opinion detection systems, opinion-labeled data is often limited, especially at sub-document levels, and manual annotation is tedious, expensive and error-prone. This shortage of opinion-labeled data is less challenging in some domains (e.g., movie reviews) than in others (e.g., blog posts). While a simple method for improving accuracy in challenging domains is to borrow opinion-labeled data from a non-target data domain, this approach often fails because of the domain transfer problem: opinion detection strategies designed for one data domain generally do not perform well in another domain. However, while it is difficult to obtain opinion-labeled data, unlabeled user-generated opinion data are readily available. Semi-supervised learning (SSL) requires only limited labeled data to automatically label unlabeled data and has achieved promising results in various natural language processing (NLP) tasks, including traditional topic classification; but SSL has been applied in only a few opinion detection studies. This study investigates application of four different SSL algorithms in three types of Web content: edited news articles, semi-structured movie reviews, and the informal and unstructured content of the blogosphere. SSL algorithms are also evaluated for their effectiveness in sparse data situations and domain adaptation. Research findings suggest that, when there is limited labeled data, SSL is a promising approach for opinion detection in Web content.
Although the contributions of SSL varied across data domains, significant improvement was demonstrated for the most challenging data domain--the blogosphere--when a domain transfer-based SSL strategy was implemented.
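The self-training family of SSL algorithms evaluated in the study can be sketched as follows: train on the small labeled set, label the unlabeled pool, and fold the most confident predictions back into training. The Naive Bayes learner and the toy opinion/fact documents below are illustrative assumptions for the sketch, not the classifiers or data used in the thesis.

```python
import math
from collections import Counter

def train_nb(docs):
    """Fit a multinomial Naive Bayes model on (tokens, label) pairs."""
    prior = Counter(label for _, label in docs)
    counts = {label: Counter() for label in prior}
    for tokens, label in docs:
        counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    return prior, counts, vocab

def predict(model, tokens):
    """Return (label, confidence) using add-one smoothing and a softmax."""
    prior, counts, vocab = model
    total = sum(prior.values())
    scores = {}
    for label in prior:
        n = sum(counts[label].values())
        scores[label] = math.log(prior[label] / total) + sum(
            math.log((counts[label][t] + 1) / (n + len(vocab))) for t in tokens)
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    best = max(scores, key=scores.get)
    return best, math.exp(scores[best] - m) / z

def self_train(labeled, unlabeled, rounds=3, per_round=1):
    """Each round, move the most confidently auto-labeled documents
    from the unlabeled pool into the training set, then retrain."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train_nb(labeled)
        ranked = sorted(pool, key=lambda d: -predict(model, d)[1])
        for doc in ranked[:per_round]:
            labeled.append((doc, predict(model, doc)[0]))
            pool.remove(doc)
    return train_nb(labeled)

# Toy illustration: two labeled documents plus an unlabeled pool.
labeled = [(["love", "great", "movie"], "opinion"),
           (["report", "states", "figures"], "fact")]
unlabeled = [["great", "film", "love"], ["official", "report", "figures"]]
model = self_train(labeled, unlabeled)
# predict(model, ["love", "great"]) → ("opinion", ...)
```

The confidence-ranked selection step is what makes or breaks self-training: feeding back low-confidence pseudo-labels is one common cause of the performance deterioration the literature reports.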
Incorporation of constraints to improve machine learning approaches on coreference resolution
Master's thesis (Master of Science)
Bootstrapping Coreference Classifiers with Multiple Machine Learning Algorithms
Successful application of multi-view co-training algorithms relies on the ability to factor the available features into views that are compatible and uncorrelated. This can potentially preclude their use on problems such as coreference resolution that lack an obvious feature split. To bootstrap coreference classifiers, we propose and evaluate a single-view weakly supervised algorithm that relies on two different learning algorithms in lieu of the two different views required by co-training. In addition, we investigate a method for ranking unlabeled instances to be fed back into the bootstrapping loop as labeled data, aiming to alleviate the problem of performance deterioration that is commonly observed in the course of bootstrapping.
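The single-view idea can be sketched as follows: both learners see the same features, and each round each learner nominates its highest-ranked unlabeled instances for the shared training set. The two toy learners (Naive Bayes and a Jaccard nearest-centroid) and the margin-based ranking are illustrative assumptions, not the paper's exact components.

```python
import math
from collections import Counter

def nb_fit(data):
    """Learner 1: multinomial Naive Bayes over (tokens, label) pairs."""
    prior = Counter(label for _, label in data)
    counts = {label: Counter() for label in prior}
    for tokens, label in data:
        counts[label].update(tokens)
    vocab = {t for tokens, _ in data for t in tokens}
    return prior, counts, vocab

def nb_predict(model, tokens):
    prior, counts, vocab = model
    total = sum(prior.values())
    scores = {}
    for label in prior:
        n = sum(counts[label].values())
        scores[label] = math.log(prior[label] / total) + sum(
            math.log((counts[label][t] + 1) / (n + len(vocab))) for t in tokens)
    ranked = sorted(scores, key=scores.get, reverse=True)
    margin = scores[ranked[0]] - scores[ranked[1]] if len(ranked) > 1 else 1.0
    return ranked[0], margin

def centroid_fit(data):
    """Learner 2: one bag-of-words centroid per class."""
    centroids = {}
    for tokens, label in data:
        centroids.setdefault(label, Counter()).update(tokens)
    return centroids

def centroid_predict(centroids, tokens):
    def jaccard(c):
        s, t = set(c), set(tokens)
        return len(s & t) / max(1, len(s | t))
    scores = {label: jaccard(c) for label, c in centroids.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    margin = scores[ranked[0]] - scores[ranked[1]] if len(ranked) > 1 else 1.0
    return ranked[0], margin

def cobootstrap(labeled, pool, rounds=2, per_learner=1):
    """Each round, each learner nominates its highest-margin unlabeled
    instances, which are added to the shared labeled set."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        nb, ce = nb_fit(labeled), centroid_fit(labeled)
        for pred in (lambda d: nb_predict(nb, d),
                     lambda d: centroid_predict(ce, d)):
            ranked = sorted(pool, key=lambda d: pred(d)[1], reverse=True)
            for doc in ranked[:per_learner]:
                labeled.append((doc, pred(doc)[0]))
                pool.remove(doc)
    return nb_fit(labeled), centroid_fit(labeled)

labeled = [(["great", "love", "film"], "pos"),
           (["boring", "awful", "plot"], "neg")]
pool = [["love", "great"], ["awful", "boring"]]
nb_model, ce_model = cobootstrap(labeled, pool)
```

The point of the two-learner setup is that the learners make different errors on the same view, so each can hand the other instances it finds easy, without requiring the compatible, uncorrelated feature split that co-training assumes.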
PERICLES Deliverable 4.3:Content Semantics and Use Context Analysis Techniques
The current deliverable summarises the work conducted within task T4.3 of WP4, focusing on the extraction and the subsequent analysis of semantic information from digital content, which is imperative for its preservability. More specifically, the deliverable defines content semantic information from a visual and textual perspective, explains how this information can be exploited in long-term digital preservation and proposes novel approaches for extracting this information in a scalable manner. Additionally, the deliverable discusses novel techniques for retrieving and analysing the context of use of digital objects. Although this topic has not been extensively studied by existing literature, we believe use context is vital in augmenting the semantic information and maintaining the usability and preservability of the digital objects, as well as their ability to be accurately interpreted as initially intended.
Generic named entity extraction
This thesis proposes and evaluates different ways of performing generic named entity
recognition, that is, the construction of a system capable of recognising names in free
text that is not specific to any particular domain or task.
The starting point is an implementation of a well known baseline system which is based
on maximum entropy models that utilise lexically-oriented features to recognise names
in text. Although this system achieves good levels of performance, both maximum
entropy models and lexically-oriented features have their limitations. Three alternative
ways in which this system can be extended to overcome these limitations are then
studied:
- more linguistically-oriented features are extracted from a generic lexical source,
namely WordNet®, and then added to the pool of features of the maximum entropy
model
- the maximum entropy model is biased towards training samples that are similar to
the piece of text being analysed
- a bootstrapping procedure is introduced to allow maximum entropy models to
collect new, valuable information from unlabelled text
Results in this thesis indicate that the maximum entropy model is a very strong approach
that accomplishes levels of performance that are very hard to improve on. However,
these results also suggest that these extensions of the baseline system could yield improvements,
though some difficulties must be addressed and more research is needed to
reach more definitive conclusions.
This thesis has nonetheless provided important contributions: a novel approach to
estimate the complexity of a named entity extraction task, a method for selecting the
features to be used by the maximum entropy model from a large pool of features and a
novel procedure to bootstrap maximum entropy models.
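As a rough illustration of the kind of baseline the thesis starts from, the sketch below trains a two-class maximum entropy model (equivalently, binary logistic regression) over a few lexically-oriented features for deciding whether a token is part of a name. The feature set, training loop, and toy examples are assumptions made for the sketch, not the thesis's actual system.

```python
import math

def features(token, prev):
    # A few hand-picked lexically-oriented features (illustrative only).
    return {
        "cap": token[:1].isupper(),
        "all_cap": token.isupper(),
        "prev_det": prev.lower() in {"the", "a", "an"},
        "suffix_son": token.endswith("son"),
    }

def train_maxent(data, epochs=200, lr=0.5):
    """Binary logistic regression trained by per-example gradient ascent;
    for two classes this is equivalent to a maximum entropy model."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for feats, y in data:
            z = b + sum(w.get(f, 0.0) for f, on in feats.items() if on)
            p = 1.0 / (1.0 + math.exp(-z))
            g = y - p                      # gradient of the log-likelihood
            b += lr * g
            for f, on in feats.items():
                if on:
                    w[f] = w.get(f, 0.0) + lr * g
    return w, b

def predict(model, feats):
    w, b = model
    z = b + sum(w.get(f, 0.0) for f, on in feats.items() if on)
    return 1.0 / (1.0 + math.exp(-z))    # P(token is part of a name)

# Toy training set: capitalised tokens as names, the rest as non-names.
data = [(features("John", ""), 1),
        (features("Paris", ""), 1),
        (features("walked", "John"), 0),
        (features("the", ""), 0)]
model = train_maxent(data)
```

The extensions studied in the thesis all act on pieces of this pipeline: WordNet-derived features enlarge the feature dictionary, instance weighting changes how much each training example contributes to the gradient, and bootstrapping grows the training set from unlabelled text.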
Intelligent Sensor Networks
In the last decade, wireless or wired sensor networks have attracted much attention. However, most designs target general sensor network issues including protocol stack (routing, MAC, etc.) and security issues. This book focuses on the close integration of sensing, networking, and smart signal processing via machine learning. Based on their world-class research, the authors present the fundamentals of intelligent sensor networks. They cover sensing and sampling, distributed signal processing, and intelligent signal learning. In addition, they present cutting-edge research results from leading experts.