11 research outputs found

    Toward Gender-Inclusive Coreference Resolution

    Correctly resolving textual mentions of people fundamentally entails making inferences about those people. Such inferences raise the risk of systemic biases in coreference resolution systems, including biases that can harm binary and non-binary trans and cis stakeholders. To better understand such biases, we foreground nuanced conceptualizations of gender from sociology and sociolinguistics, and develop two new datasets for interrogating bias in crowd annotations and in existing coreference resolution systems. Through these studies, conducted on English text, we confirm that without acknowledging and building systems that recognize the complexity of gender, we build systems that lead to many potential harms. Comment: 28 pages; ACL version.

    Semi-Supervised Learning For Identifying Opinions In Web Content

    Thesis (Ph.D.) - Indiana University, Information Science, 2011. Opinions published on the World Wide Web (Web) offer opportunities for detecting personal attitudes regarding topics, products, and services. The opinion detection literature indicates that both a large body of opinions and a wide variety of opinion features are essential for capturing subtle opinion information. Although a large amount of opinion-labeled data is preferable for opinion detection systems, opinion-labeled data is often limited, especially at sub-document levels, and manual annotation is tedious, expensive and error-prone. This shortage of opinion-labeled data is less challenging in some domains (e.g., movie reviews) than in others (e.g., blog posts). While a simple method for improving accuracy in challenging domains is to borrow opinion-labeled data from a non-target data domain, this approach often fails because of the domain transfer problem: opinion detection strategies designed for one data domain generally do not perform well in another domain. However, while it is difficult to obtain opinion-labeled data, unlabeled user-generated opinion data are readily available. Semi-supervised learning (SSL) requires only limited labeled data to automatically label unlabeled data and has achieved promising results in various natural language processing (NLP) tasks, including traditional topic classification; but SSL has been applied in only a few opinion detection studies. This study investigates application of four different SSL algorithms in three types of Web content: edited news articles, semi-structured movie reviews, and the informal and unstructured content of the blogosphere. SSL algorithms are also evaluated for their effectiveness in sparse data situations and domain adaptation. Research findings suggest that, when there is limited labeled data, SSL is a promising approach for opinion detection in Web content. Although the contributions of SSL varied across data domains, significant improvement was demonstrated for the most challenging data domain--the blogosphere--when a domain transfer-based SSL strategy was implemented.
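    The SSL idea can be made concrete with a small self-training example: a classifier trained on a few opinion-labeled sentences pseudo-labels unlabeled ones it is confident about and retrains on them. The sketch below is a minimal illustration using scikit-learn's SelfTrainingClassifier; the toy sentences, labels, and confidence threshold are illustrative assumptions, not the setups evaluated in the thesis.

```python
# Minimal self-training sketch for opinion (subjective vs. factual) sentences.
# Toy data and hyperparameters are placeholders, not the thesis's actual setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# A few labeled sentences (1 = opinionated, 0 = factual) plus unlabeled ones,
# which scikit-learn marks with the label -1.
sentences = [
    "This movie was a complete waste of time.",   # opinion
    "The film was released in June 2004.",        # fact
    "I absolutely loved the soundtrack.",         # opinion
    "The director was born in Chicago.",          # fact
    "Honestly, the plot made no sense to me.",    # unlabeled
    "The running time is 142 minutes.",           # unlabeled
]
labels = [1, 0, 1, 0, -1, -1]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

# Self-training: the base classifier pseudo-labels unlabeled sentences whose
# predicted probability exceeds the threshold, then retrains on them.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.6)
model.fit(X, labels)

print(model.predict(vectorizer.transform(["What a dreadful ending."])))
```

    A domain-transfer variant of this idea would draw the initial labeled pool from a non-target domain (e.g., movie reviews) and the unlabeled pool from the target domain (e.g., blog posts), as the thesis describes.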

    Incorporation of constraints to improve machine learning approaches on coreference resolution

    Master's thesis (Master of Science)

    Bootstrapping Coreference Classifiers with Multiple Machine Learning Algorithms

    Successful application of multi-view cotraining algorithms relies on the ability to factor the available features into views that are compatible and uncorrelated. This can potentially preclude their use on problems such as coreference resolution that lack an obvious feature split. To bootstrap coreference classifiers, we propose and evaluate a single-view weakly supervised algorithm that relies on two different learning algorithms in lieu of the two different views required by co-training. In addition, we investigate a method for ranking unlabeled instances to be fed back into the bootstrapping loop as labeled data, aiming to alleviate the problem of performance deterioration that is commonly observed in the course of bootstrapping.
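    As a rough sketch of the single-view idea, the snippet below bootstraps two different learners over the same feature set, with each learner's most confident predictions on unlabeled instances fed into the other's training pool. Naive Bayes and logistic regression stand in for the paper's actual pair of algorithms, and the feature matrices, confidence ranking, and growth rate per round are placeholder choices rather than the published configuration.

```python
# Single-view bootstrapping with two different learning algorithms.
# Assumes dense NumPy feature matrices; learners, per-round growth, and round
# counts are illustrative stand-ins, not the paper's configuration.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def bootstrap(X_labeled, y_labeled, X_unlabeled, rounds=5, per_round=10):
    learners = [GaussianNB(), LogisticRegression(max_iter=1000)]
    # Both learners see the same feature view but keep separate labeled pools.
    pools = [(X_labeled.copy(), y_labeled.copy()) for _ in learners]
    unlabeled = X_unlabeled.copy()

    for _ in range(rounds):
        for i, learner in enumerate(learners):
            if unlabeled.shape[0] == 0:
                return learners
            X_pool, y_pool = pools[i]
            learner.fit(X_pool, y_pool)
            proba = learner.predict_proba(unlabeled)
            # Rank unlabeled instances by confidence and keep the top few,
            # echoing the instance-ranking step investigated in the paper.
            top = np.argsort(-proba.max(axis=1))[:per_round]
            new_y = learner.classes_[proba[top].argmax(axis=1)]
            # Feed the newly labeled instances to the *other* learner's pool.
            other = 1 - i
            pools[other] = (np.vstack([pools[other][0], unlabeled[top]]),
                            np.concatenate([pools[other][1], new_y]))
            unlabeled = np.delete(unlabeled, top, axis=0)
    return learners
```

    At prediction time the two bootstrapped classifiers can be combined, for example by averaging their predicted probabilities.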

    PERICLES Deliverable 4.3: Content Semantics and Use Context Analysis Techniques

    The current deliverable summarises the work conducted within task T4.3 of WP4, focusing on the extraction and the subsequent analysis of semantic information from digital content, which is imperative for its preservability. More specifically, the deliverable defines content semantic information from a visual and textual perspective, explains how this information can be exploited in long-term digital preservation and proposes novel approaches for extracting this information in a scalable manner. Additionally, the deliverable discusses novel techniques for retrieving and analysing the context of use of digital objects. Although this topic has not been extensively studied by existing literature, we believe use context is vital in augmenting the semantic information and maintaining the usability and preservability of the digital objects, as well as their ability to be accurately interpreted as initially intended.

    Generic named entity extraction

    This thesis proposes and evaluates different ways of performing generic named entity recognition, that is, the construction of a system capable of recognising names in free text which is not specific to any particular domain or task. The starting point is an implementation of a well-known baseline system based on maximum entropy models that utilise lexically-oriented features to recognise names in text. Although this system achieves good levels of performance, both maximum entropy models and lexically-oriented features have their limitations. Three alternative ways in which this system can be extended to overcome these limitations are then studied: (1) more linguistically-oriented features are extracted from a generic lexical source, namely WordNet®, and then added to the pool of features of the maximum entropy model; (2) the maximum entropy model is biased towards training samples that are similar to the piece of text being analysed; (3) a bootstrapping procedure is introduced to allow maximum entropy models to collect new, valuable information from unlabelled text. Results in this thesis indicate that the maximum entropy model is a very strong approach that accomplishes levels of performance that are very hard to improve on. However, these results also suggest that these extensions of the baseline system could yield improvements, though some difficulties must be addressed and more research is needed to obtain more assertive conclusions. This thesis has nonetheless provided important contributions: a novel approach to estimate the complexity of a named entity extraction task, a method for selecting the features to be used by the maximum entropy model from a large pool of features, and a novel procedure to bootstrap maximum entropy models.
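    For context on the baseline being extended, the sketch below trains a maximum-entropy-style tagger, realised here as multinomial logistic regression over lexically-oriented token features. The toy sentences, BIO tag set, and feature template are illustrative assumptions, not the thesis's actual configuration or feature pool.

```python
# Maximum-entropy-style named entity tagger over lexically-oriented features,
# realised here as logistic regression. Toy data and features are placeholders.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i):
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Tiny training set: token sequences with BIO labels (hypothetical examples).
train = [
    (["John", "Smith", "visited", "Paris", "."],
     ["B-PER", "I-PER", "O", "B-LOC", "O"]),
    (["Acme", "Corp", "hired", "Mary", "."],
     ["B-ORG", "I-ORG", "O", "B-PER", "O"]),
]

X, y = [], []
for tokens, tags in train:
    for i, tag in enumerate(tags):
        X.append(token_features(tokens, i))
        y.append(tag)

vec = DictVectorizer()
# Multinomial logistic regression is a standard realisation of a maxent model.
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test = ["Alice", "flew", "to", "Berlin", "."]
feats = vec.transform([token_features(test, i) for i in range(len(test))])
print(list(zip(test, clf.predict(feats))))
```

    With so little training data the predictions are only illustrative; the WordNet-derived features, the biasing towards similar training samples, and the bootstrapping procedure described above would be layered on top of a feature template like token_features.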

    Intelligent Sensor Networks

    In the last decade, wireless or wired sensor networks have attracted much attention. However, most designs target general sensor network issues including protocol stack (routing, MAC, etc.) and security issues. This book focuses on the close integration of sensing, networking, and smart signal processing via machine learning. Based on their world-class research, the authors present the fundamentals of intelligent sensor networks. They cover sensing and sampling, distributed signal processing, and intelligent signal learning. In addition, they present cutting-edge research results from leading experts.