2,234 research outputs found
A Bootstrapping architecture for time expression recognition in unlabelled corpora via syntactic-semantic patterns
In this paper we describe a semi-supervised approach to the extraction of time expression mentions in large unlabelled corpora based on bootstrapping.
Bootstrapping techniques rely on a relatively small number of initial human-supplied examples (termed "seeds") of the type of entity or concept to be learned, in order to capture an initial set of patterns or rules from the unlabelled text that extract the supplied data. In turn, the learned patterns are employed to find new potential examples, and the process is repeated to grow the set of patterns and (optionally) the set of examples. In order to prevent the learned pattern set from producing spurious results, it becomes essential to implement a ranking and selection procedure to filter out "bad" patterns and, depending on the case, new candidate examples. Therefore, the type of patterns employed (knowledge representation) as well as the ranking and selection procedure are paramount to the quality of the results. We present a complete bootstrapping algorithm for recognition of time expressions, with a special emphasis on the type of patterns used (a combination of semantic and morpho-syntactic elements) and the ranking and selection criteria. Bootstrapping techniques have been previously employed with limited success for several NLP problems, both of recognition and classification, but their application to time expression recognition is, to the best of our knowledge, novel. As of this writing, the described architecture is in the final stages of implementation, with experimentation and evaluation already underway. Postprint (published version
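The seed, pattern, and selection loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' architecture: the pattern type (a previous-word/next-word context), the precision-style score, the `min_precision` threshold, and the toy corpus are all hypothetical stand-ins for the syntactic-semantic patterns and ranking criteria the paper refers to.

```python
from collections import defaultdict

# Hypothetical toy corpus; a real system would use a large unlabelled corpus.
CORPUS = [
    "we met on monday at noon",
    "we met on friday at noon",
    "we left on sunday at dawn",
    "the monday was long",
    "the dog was loud",
    "the car was red",
    "the soup was cold",
]


def bootstrap(sentences, seeds, iterations=2, min_precision=0.5):
    """Grow a set of examples from seeds using (previous word, next word) patterns.

    A pattern is kept only if it extracts at least one known example and the
    fraction of its extractions that are already known reaches min_precision
    (an assumed, simplistic ranking-and-selection criterion).
    """
    examples = set(seeds)
    patterns = set()
    tokenized = [s.lower().split() for s in sentences]
    for _ in range(iterations):
        # Collect every context in the corpus and the tokens it surrounds.
        extracted = defaultdict(set)
        for tokens in tokenized:
            for i in range(1, len(tokens) - 1):
                extracted[(tokens[i - 1], tokens[i + 1])].add(tokens[i])
        # Select patterns against the current example set.
        selected = []
        for pattern, found in extracted.items():
            known = found & examples
            if known and len(known) / len(found) >= min_precision:
                selected.append(pattern)
        # Accepted patterns contribute their extractions as new examples.
        for pattern in selected:
            patterns.add(pattern)
            examples |= extracted[pattern]
    return examples, patterns


filtered, kept_patterns = bootstrap(CORPUS, {"monday", "friday"})
unfiltered, _ = bootstrap(CORPUS, {"monday", "friday"}, min_precision=0.0)
```

On this toy data the filtered run keeps only the context ("on", "at") and learns "sunday", while the unfiltered run also accepts ("the", "was") and drifts to "dog", "car", and "soup", which is the kind of spurious growth the selection step is meant to prevent.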
Detecting and Explaining Causes From Text For a Time Series Event
Explaining underlying causes or effects about events is a challenging but
valuable task. We define a novel problem of generating explanations of a time
series event by (1) searching cause and effect relationships of the time series
with textual data and (2) constructing a connecting chain between them to
generate an explanation. To detect causal features from text, we propose a
novel method based on the Granger causality of time series between features
extracted from text such as N-grams, topics, sentiments, and their composition.
The generation of the sequence of causal entities requires a commonsense
causative knowledge base with efficient reasoning. To ensure good
interpretability and appropriate lexical usage we combine symbolic and neural
representations, using a neural reasoning algorithm trained on commonsense
causal tuples to predict the next cause step. Our quantitative and human
analyses show empirical evidence that our method successfully extracts
meaningful causality relationships between time series with textual features
and generates an appropriate explanation between them. Comment: Accepted at EMNLP 201
Named Entity Extraction and Disambiguation: The Reinforcement Effect.
Named entity extraction and disambiguation have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and the semantic web. Although these topics are highly dependent, almost no existing works examine this dependency. The aim of this paper is to examine the dependency and show how each affects the other. We conducted experiments with a set of descriptions of holiday homes with the aim of extracting and disambiguating toponyms as a representative example of named entities. We experimented with three approaches for disambiguation in order to infer the country of the holiday home. We examined how the effectiveness of extraction influences the effectiveness of disambiguation, and reciprocally, how filtering out ambiguous names (an activity that depends on the disambiguation process) improves the effectiveness of extraction. Since this, in turn, may improve the effectiveness of disambiguation again, it shows that extraction and disambiguation may reinforce each other.
Doctor of Philosophy in Computer Science
Over the last decade, social media has emerged as a revolutionary platform for informal communication and social interactions among people. Publicly expressing thoughts, opinions, and feelings is one of the key characteristics of social media. In this dissertation, I present research on automatically acquiring knowledge from social media that can be used to recognize people's affective state (i.e., what someone feels at a given time) in text. This research addresses two types of affective knowledge: 1) hashtag indicators of emotion consisting of emotion hashtags and emotion hashtag patterns, and 2) affective understanding of similes (a form of figurative comparison). My research introduces a bootstrapped learning algorithm for learning hashtag indicators of emotions from tweets with respect to five emotion categories: Affection, Anger/Rage, Fear/Anxiety, Joy, and Sadness/Disappointment. With a few seed emotion hashtags per emotion category, the bootstrapping algorithm iteratively learns new hashtags and more generalized hashtag patterns by analyzing emotion in tweets that contain these indicators. Emotion phrases are also harvested from the learned indicators to train additional classifiers that use the surrounding word context of the phrases as features. This is the first work to learn hashtag indicators of emotions. My research also presents a supervised classification method for classifying affective polarity of similes in Twitter. Using lexical, semantic, and sentiment properties of different simile components as features, supervised classifiers are trained to classify a simile into a positive or negative affective polarity class. The property of comparison is also fundamental to the affective understanding of similes.
My research introduces a novel framework for inferring implicit properties that 1) uses syntactic constructions, statistical association, dictionary definitions and word embedding vector similarity to generate and rank candidate properties, 2) re-ranks the top properties using influence from multiple simile components, and 3) aggregates the ranks of each property from different methods to create a final ranked list of properties. The inferred properties are used to derive additional features for the supervised classifiers to further improve affective polarity recognition. Experimental results show substantial improvements in affective understanding of similes over the use of existing sentiment resources
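The final aggregation step mentioned above (combining the ranks each method assigns to a property) can be illustrated with a simple rank-sum scheme. This is only a sketch under assumptions: the abstract does not say which aggregation function is used, and the method names and candidate properties below are hypothetical.

```python
def aggregate_ranks(rankings):
    """Merge several ranked lists into one by summing 1-based rank positions.

    An item missing from a list is given a rank one past that list's end
    (an assumed penalty). Lower totals rank higher; ties break alphabetically.
    """
    items = {item for ranking in rankings for item in ranking}
    totals = {}
    for item in items:
        total = 0
        for ranking in rankings:
            if item in ranking:
                total += ranking.index(item) + 1
            else:
                total += len(ranking) + 1
        totals[item] = total
    return sorted(items, key=lambda item: (totals[item], item))


# Hypothetical candidate properties for a simile such as "as ... as a rock".
syntactic = ["solid", "hard", "heavy"]
association = ["hard", "solid", "stubborn"]
embedding = ["hard", "heavy", "solid"]

final_ranking = aggregate_ranks([syntactic, association, embedding])
```

Here "hard" totals 4, "solid" 6, "heavy" 9, and "stubborn" 11, so the properties come out in that order.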
Slot Filling
Slot filling (SF) is the task of automatically extracting facts about particular entities from unstructured text, and populating a knowledge base (KB) with these facts. These structured KBs enable applications such as structured web queries and question answering. SF is typically framed as a query-oriented setting of the related task of relation extraction. Throughout this thesis, we reflect on how SF is a task with many distinct problems. We demonstrate that recall is a major limiter on SF system performance. We contribute an analysis of typical SF recall loss, and find a substantial amount of loss occurs early in the SF pipeline. We confirm that accurate NER and coreference resolution are required for high-recall SF. We measure upper bounds using a naïve graph-based semi-supervised bootstrapping technique, and find that only 39% of results are reachable using a typical feature space. We expect that this graph-based technique will be directly useful for extraction, and this leads us to frame SF as a label propagation task. We focus on a detailed graph representation of the task which reflects the behaviour and assumptions we want to model based on our analysis, including modifying the label propagation process to model multiple types of label interaction. Analysing the graph, we find that a large number of errors occur in very close proximity to training data, and identify that this is of major concern for propagation. While there are some conflicts caused by a lack of sufficient disambiguating context (we explore adding additional contextual features to address this), many of these conflicts are caused by subtle annotation problems. We find that lack of a standard for how explicit expressions of relations must be in text makes consistent annotation difficult. Using a strict definition of explicitness results in 20% of correct annotations being removed from a standard dataset.
We contribute several annotation-driven analyses of this problem, exploring the definition of slots and the effect of the lack of a concrete definition of explicitness: annotation schemas do not detail how explicit expressions of relations need to be, and there is large scope for disagreement between annotators. Additionally, applications may require relatively strict or relaxed evidence for extractions, but this is not considered in annotation tasks. We demonstrate that annotators frequently disagree on instances, depending on differences in annotator world knowledge and thresholds on making probabilistic inference. SF is fundamental to enabling many knowledge-based applications, and this work motivates modelling and evaluating SF to better target these tasks
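The label propagation framing mentioned in this abstract can be illustrated with a generic textbook variant: labelled nodes are clamped, and every other node repeatedly takes the mean score of its neighbours. This is a sketch under assumptions, not the thesis's modified propagation process; the graph, node names, and scores below are hypothetical.

```python
def propagate_labels(neighbours, seeds, iterations=50):
    """Iterative label propagation over an undirected graph.

    neighbours: dict mapping each node to a list of adjacent nodes
                (assumed symmetric, every neighbour is also a key).
    seeds:      dict mapping labelled nodes to a fixed score in [0, 1].
    Unlabelled nodes start at 0.5 and move to the mean of their neighbours.
    """
    scores = {node: seeds.get(node, 0.5) for node in neighbours}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbours.items():
            if node in seeds:
                updated[node] = seeds[node]  # clamp labelled nodes
            elif adjacent:
                updated[node] = sum(scores[m] for m in adjacent) / len(adjacent)
            else:
                updated[node] = scores[node]  # isolated node keeps its score
        scores = updated
    return scores


# Hypothetical chain of candidate extractions: a - b - c - d,
# where a is a known correct instance and d a known incorrect one.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = propagate_labels(graph, {"a": 1.0, "d": 0.0})
```

On this chain the unlabelled nodes settle near 2/3 for b and 1/3 for c, so a node's score reflects how close it sits to each labelled example. That dependence on proximity is why errors close to the training data, as noted in the abstract, matter so much for propagation.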
Ranking for Scalable Information Extraction
Information extraction systems are complex software tools that discover structured information in natural language text. For instance, an information extraction system trained to extract tuples for an Occurs-in(Natural Disaster, Location) relation may extract the tuple <tsunami, Hawaii> from the sentence: "A tsunami swept the coast of Hawaii." Having information in structured form enables more sophisticated querying and data mining than what is possible over the natural language text. Unfortunately, information extraction is a time-consuming task. For example, a state-of-the-art information extraction system to extract Occurs-in tuples may take up to two hours to process only 1,000 text documents. Since document collections routinely contain millions of documents or more, improving the efficiency and scalability of the information extraction process over these collections is critical. As a significant step towards this goal, this dissertation presents approaches for (i) enabling the deployment of efficient information extraction systems and (ii) scaling the information extraction process to large volumes of text.
To enable the deployment of efficient information extraction systems, we have developed two crucial building blocks for this task. As a first contribution, we have created REEL, a toolkit to easily implement, evaluate, and deploy full-fledged relation extraction systems. REEL, in contrast to existing toolkits, effectively modularizes the key components involved in relation extraction systems and can integrate other long-established text processing and machine learning toolkits. To define a relation extraction system for a new relation and text collection, users only need to specify the desired configuration, which makes REEL a powerful framework for both research and application building. As a second contribution, we have addressed the problem of building representative extraction task-specific document samples from collections, a step often required by approaches for efficient information extraction. Specifically, we devised fully automatic document sampling techniques for information extraction that can produce better-quality document samples than the state-of-the-art sampling strategies; furthermore, our techniques are substantially more efficient than the existing alternative approaches.
To scale the information extraction process to large volumes of text, we have developed approaches that address the efficiency and scalability of the extraction process by focusing the extraction effort on the collections, documents, and sentences worth processing for a given extraction task. For collections, we have studied both (adaptations of) state-of-the-art approaches for estimating the number of documents in a collection that lead to the extraction of tuples as well as information extraction-specific approaches. Using these estimations we can identify the collections worth processing and ignore the rest, for efficiency. For documents, we have developed an adaptive document ranking approach that relies on learning-to-rank techniques to prioritize the documents that are likely to produce tuples for an extraction task of choice. Our approach revises the (learned) ranking decisions periodically as the extraction process progresses and new characteristics of the useful documents are revealed. Finally, for sentences, we have developed an approach based on the sparse group selection problem that identifies sentences (modeled as groups of words) that best characterize the extraction task. Beyond identifying sentences worth processing, our approach aims at selecting sentences that lead to the extraction of unseen, novel tuples. Our approaches are lightweight and efficient, and dramatically improve the efficiency and scalability of the information extraction process. We can often complete the extraction task by focusing on just a very small fraction of the available text, namely, the text that contains relevant information for the extraction task at hand. Our approaches therefore constitute a substantial step towards efficient and scalable information extraction over large volumes of text
The Corpus Expansion Toolkit: finding what we want on the web
This thesis presents the Corpus Expansion Toolkit (CET), a generally applicable toolkit that allows researchers to build domain-specific corpora from the web. The main purpose of the work presented in this thesis and the development of the CET is to provide a solution to discovering desired content on the web from possibly unknown locations or a poorly defined domain. Using an iterative process, the CET is able to solve the problem of discovering domain-specific online content and expand a corpus using only a very small number of example documents or characteristic phrases taken from the target domain. Using a human-in-the-loop strategy and a chain of discrete software components, the CET also allows the concept of a domain to be iteratively defined using the very online resources used to expand the original corpus. The CET combines feature extraction, search, web crawling and machine learning methods to collect, store, filter and perform information extraction on collected documents. Using a small number of example "seed" documents, the CET is able to expand the original corpus by finding more relevant documents from the web and provide a number of tools to support their analysis. This thesis presents a case study-based methodology that introduces the various contributions and components of the CET through the discussion of five case studies covering a wide variety of domains and requirements to which the CET has been applied. These case studies illustrate three main use cases, listed below, where the CET is applicable:
1. Domain known – source known
2. Domain known – source unknown
3. Domain unknown – source unknown
First, use cases where the sites for document collection are known and the topic of research is clearly defined. Second, instances where the topic of research is clearly defined but where to find relevant documents on the web is unknown. Third, the most extreme use case, where the domain is poorly defined or unknown to the researcher and the location of the information is also unknown. This thesis presents a solution that allows researchers to begin with very little information on a specific topic and iteratively build a clear conception of a domain and translate that to a computational system
Large-Scale Pattern-Based Information Extraction from the World Wide Web
Extracting information from text is the task of obtaining structured, machine-processable facts from information that is mentioned in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This work explores the potential of using textual patterns for Information Extraction from the World Wide Web