Sentiment Lexicon Adaptation with Context and Semantics for the Social Web
Sentiment analysis over social streams offers governments and organisations a fast and effective way to monitor the public's feelings towards policies, brands, business, etc. General-purpose sentiment lexicons have been used to compute sentiment from social streams, since they are simple and effective. They calculate the overall sentiment of texts by using a general collection of words with predetermined sentiment orientation and strength. However, a word's sentiment often varies with the context in which it appears, and new words may be encountered that are not covered by the lexicon, particularly in social media environments where content emerges and changes rapidly and constantly. In this paper, we propose a lexicon adaptation approach that uses contextual as well as semantic information extracted from DBpedia to update the words' weighted sentiment orientations and to add new words to the lexicon. We evaluate our approach on three different Twitter datasets, and show that enriching the lexicon with contextual and semantic information improves sentiment computation by 3.4% in average accuracy and by 2.8% in average F1 measure.
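To make the lexicon-based computation concrete, the following minimal Python sketch scores a tweet with a word-level lexicon and nudges each word's weight toward the sentiment of the contexts it appears in. The lexicon, tweets, and update rule are illustrative assumptions, not the authors' actual algorithm or data.

lexicon = {"good": 1.0, "bad": -1.0, "slow": -0.5}  # word -> prior sentiment weight

def score(tokens, lex):
    # Sum the weights of the tokens that appear in the lexicon.
    return sum(lex.get(t, 0.0) for t in tokens)

def adapt(corpus, lex, rate=0.1):
    # Shift each word's weight toward the average sentiment of the
    # tweets (contexts) in which it appears.
    updated = dict(lex)
    for word in lex:
        contexts = [score(tweet, lex) for tweet in corpus if word in tweet]
        if contexts:
            avg = sum(contexts) / len(contexts)
            updated[word] = (1 - rate) * lex[word] + rate * avg
    return updated

tweets = [["service", "good"], ["good", "bad", "idea"], ["slow", "bad", "service"]]
print(score(tweets[0], lexicon))       # 1.0 under the prior lexicon
print(adapt(tweets, lexicon)["good"])  # 0.95: weight pulled toward its observed contexts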
Low Size-Complexity Inductive Logic Programming: The East-West Challenge Considered as a Problem in Cost-Sensitive Classification
The Inductive Logic Programming community has considered
proof-complexity and model-complexity, but, until recently,
size-complexity has received little attention. Recently a
challenge was issued "to the international computing community"
to discover low size-complexity Prolog programs for classifying
trains. The challenge was based on a problem first proposed by
Ryszard Michalski, 20 years ago. We interpreted the challenge
as a problem in cost-sensitive classification and we applied a
recently developed cost-sensitive classifier to the competition.
Our algorithm was relatively successful (we won a prize). This
paper presents our algorithm and analyzes the results of the
competition.
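The decision rule at the core of cost-sensitive classification can be sketched as follows; the cost matrix and class probabilities are invented for illustration and do not reproduce the classifier entered in the competition.

def min_cost_class(probs, cost):
    # probs[i]   = estimated probability that the true class is i
    # cost[i][j] = cost of predicting j when the true class is i
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: expected[j])

# Eastbound (0) vs westbound (1) trains; misclassifying class 0 is costlier.
cost = [[0, 5],
        [1, 0]]
print(min_cost_class([0.3, 0.7], cost))  # -> 0: the expensive error is avoided
                                         #    even though class 1 is more probable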
Answering Subcognitive Turing Test Questions: A Reply to French
Robert French has argued that a disembodied computer is incapable of
passing a Turing Test that includes subcognitive questions. Subcognitive
questions are designed to probe the network of cultural and perceptual
associations that humans naturally develop as we live, embodied and
embedded in the world. In this paper, I show how it is possible for a
disembodied computer to answer subcognitive questions appropriately,
contrary to French's claim. My approach to answering subcognitive
questions is to use statistical information extracted from a very large
collection of text. In particular, I show how it is possible to answer a
sample of subcognitive questions taken from French, by issuing queries to
a search engine that indexes about 350 million Web pages. This simple
algorithm may shed light on the nature of human (sub-) cognition, but the
scope of this paper is limited to demonstrating that French is mistaken: a
disembodied computer can answer subcognitive questions.
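A rough sketch of the hit-count idea: associations are estimated from how often terms co-occur in an indexed collection. The hits() function and the counts behind it are stand-ins, not real search-engine results or the paper's exact scoring formula.

import math

FAKE_COUNTS = {("banana",): 1000, ("banana", "yellow"): 300,
               ("banana", "purple"): 2, ("yellow",): 5000, ("purple",): 4000}

def hits(*terms):
    # Placeholder for a search-engine query; returns an invented count.
    return FAKE_COUNTS.get(tuple(sorted(terms)), 1)

def association(word, candidate, total=1e9):
    # Pointwise-mutual-information-style score from co-occurrence counts.
    return math.log((hits(word, candidate) * total) /
                    (hits(word) * hits(candidate)))

# "Is a banana more strongly associated with yellow or purple?"
best = max(["yellow", "purple"], key=lambda c: association("banana", c))
print(best)  # -> "yellow"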
Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data.
A journal article is often accompanied by a list of keyphrases, composed of about five to fifteen important words and phrases that capture the article's main topics. Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. Good performance on this task has been obtained by approaching it as a supervised learning problem. An input document is treated as a set of candidate phrases that must be classified as either keyphrases or non-keyphrases. To classify a candidate phrase as a keyphrase, the most important features (attributes) appear to be the frequency and location of the candidate phrase in the document. Recent work has demonstrated that it is also useful to know the frequency of the candidate phrase as a manually assigned keyphrase for other documents in the same domain as the given document (e.g., the domain of computer science). Unfortunately, this keyphrase-frequency feature is domain-specific (the learning process must be repeated for each new domain) and training-intensive (good performance requires a relatively large number of training documents in the given domain, with manually assigned keyphrases). The aim of the work described here is to remove these limitations. In this paper, I introduce new features that are conceptually related to keyphrase-frequency and I present experiments that show that the new features result in improved keyphrase extraction, although they are neither domain-specific nor training-intensive. The new features are generated by issuing queries to a Web search engine, based on the candidate phrases in the input document. The feature values are calculated from the number of hits for the queries (the number of matching Web pages). In essence, these new features are derived by mining lexical knowledge from a very large collection of unlabeled data, consisting of approximately 350 million Web pages without manually assigned keyphrases.
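One way a hit count can be turned into a feature for a candidate phrase is sketched below: compare how often the phrase occurs as a unit with how often its words occur independently. The hit_count() placeholder and the cohesion-style score are assumptions for illustration, not necessarily the features introduced in the paper.

import math

def hit_count(query):
    # Placeholder for a web search; the counts are invented.
    toy = {'"machine learning"': 400_000, "machine": 9_000_000,
           "learning": 12_000_000}
    return toy.get(query, 1)

def cohesion_feature(phrase, total=3.5e8):
    # Higher when the words co-occur far more often than chance predicts.
    words = phrase.split()
    joint = hit_count(f'"{phrase}"')
    independent = 1.0
    for w in words:
        independent *= hit_count(w) / total
    return math.log(joint / (total * independent))

print(round(cohesion_feature("machine learning"), 2))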
Extraction of Keyphrases from Text: Evaluation of Four Algorithms
This report presents an empirical evaluation of four algorithms for automatically extracting keywords and keyphrases from documents. The four algorithms are compared using five different collections of documents. For each document, we have a target set of keyphrases, which were generated by hand. The target keyphrases were generated for human readers; they were not tailored for any of the four keyphrase extraction algorithms. Each of the algorithms was evaluated by the degree to which the algorithm's keyphrases matched the manually generated keyphrases. The four algorithms were (1) the AutoSummarize feature in Microsoft's Word 97, (2) an algorithm based on Eric Brill's part-of-speech tagger, (3) the Summarize feature in Verity's Search 97, and (4) NRC's Extractor algorithm. For all five document collections, NRC's Extractor yields the best match with the manually generated keyphrases.
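The evaluation criterion, matching an algorithm's keyphrases against the manually generated ones, can be sketched as a simple normalised exact-match precision; the normalisation and data here are assumptions, not the report's exact matching procedure.

def normalise(phrase):
    # Lowercase and collapse whitespace before comparing phrases.
    return " ".join(phrase.lower().split())

def precision(extracted, manual):
    gold = {normalise(p) for p in manual}
    matched = sum(1 for p in extracted if normalise(p) in gold)
    return matched / len(extracted) if extracted else 0.0

manual = ["keyphrase extraction", "decision trees", "text mining"]
extracted = ["Keyphrase Extraction", "neural networks", "text mining"]
print(precision(extracted, manual))  # 2 of 3 extracted phrases match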
Robust classification with context-sensitive features
This paper addresses the problem of classifying observations when features are context-sensitive, especially when the testing set involves a context that is different from the training set. The paper begins with a precise definition of the problem; general strategies are then presented for enhancing the performance of classification algorithms on this type of problem. These strategies are tested on three domains. The first domain is the diagnosis of gas turbine engines. The problem is to diagnose a faulty engine in one context, such as warm weather, when the fault has previously been seen only in another context, such as cold weather. The second domain is speech recognition. The context is given by the identity of the speaker. The problem is to recognize words spoken by a new speaker, not represented in the training set. The third domain is medical prognosis. The problem is to predict whether a patient with hepatitis will live or die. The context is the age of the patient. For all three domains, exploiting context results in substantially more accurate classification.
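As a rough illustration of one strategy of this kind, the sketch below normalises each feature by statistics gathered within its own context before classification; the procedure and data are illustrative assumptions rather than the paper's exact method.

import statistics

def normalise_by_context(rows):
    # rows: list of (context, [features]).  Returns z-scored features,
    # where mean and standard deviation are computed per context.
    by_ctx = {}
    for ctx, feats in rows:
        by_ctx.setdefault(ctx, []).append(feats)
    stats = {ctx: [(statistics.mean(col), statistics.stdev(col) or 1.0)
                   for col in zip(*vecs)]
             for ctx, vecs in by_ctx.items()}
    return [[(x - m) / s for x, (m, s) in zip(feats, stats[ctx])]
            for ctx, feats in rows]

rows = [("cold", [10.0, 200.0]), ("cold", [12.0, 220.0]),
        ("warm", [30.0, 260.0]), ("warm", [33.0, 300.0])]
print(normalise_by_context(rows))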
Data Engineering for the Analysis of Semiconductor Manufacturing Data
We have analyzed manufacturing data from several different semiconductor
manufacturing plants, using decision tree induction software called
Q-YIELD. The software generates rules for predicting when a given product
should be rejected. The rules are intended to help the process engineers
improve the yield of the product, by helping them to discover the causes
of rejection. Experience with Q-YIELD has taught us the importance of
data engineering -- preprocessing the data to enable or facilitate
decision tree induction. This paper discusses some of the data engineering
problems we have encountered with semiconductor manufacturing data.
The paper deals with two broad classes of problems: engineering the features
in a feature vector representation and engineering the definition of the
target concept (the classes). Manufacturing process data present special
problems for feature engineering, since the data have multiple levels of
granularity (detail, resolution). Engineering the target concept is important,
due to our focus on understanding the past, as opposed to the more common
focus in machine learning on predicting the future.
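A small sketch of the granularity problem: per-wafer detail rows are aggregated into a single lot-level feature vector before induction. The field names, aggregates, and data are invented for illustration and are not taken from Q-YIELD or the plants' data.

def lot_features(wafer_rows):
    # wafer_rows: list of dicts with per-wafer measurements for one lot.
    thickness = [w["oxide_thickness"] for w in wafer_rows]
    defects = [w["defect_count"] for w in wafer_rows]
    return {
        "mean_thickness": sum(thickness) / len(thickness),
        "max_defects": max(defects),
        "n_wafers": len(wafer_rows),
    }

lot = [{"oxide_thickness": 101.2, "defect_count": 3},
       {"oxide_thickness": 99.8, "defect_count": 7}]
print(lot_features(lot))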
Types of cost in inductive concept learning
Inductive concept learning is the task of learning to assign cases to a discrete set of classes. In real-world applications of concept learning, there are many different types of cost involved. The majority of the machine learning literature ignores all types of cost (unless accuracy is interpreted as a type of cost measure). A few papers have investigated the cost of misclassification errors. Very few papers have examined the many other types of cost. In this paper, we attempt to create a taxonomy of the different types of cost that are involved in inductive concept learning. This taxonomy may help to organize the literature on cost-sensitive learning. We hope that it will inspire researchers to investigate all types of cost in inductive concept learning in more depth.
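Two of the cost types can be contrasted in a short sketch that adds misclassification costs to the cost of running tests (measuring features); the cost matrix, test prices, and counts are illustrative assumptions only.

def total_cost(confusion, misclass_cost, tests_run, test_cost):
    # confusion[i][j]     = number of cases of true class i predicted as j
    # misclass_cost[i][j] = cost charged for that kind of error
    # tests_run, test_cost = how often each test was run and its unit price
    errors = sum(confusion[i][j] * misclass_cost[i][j]
                 for i in range(len(confusion))
                 for j in range(len(confusion)))
    measuring = sum(tests_run[t] * test_cost[t] for t in tests_run)
    return errors + measuring

confusion = [[90, 10],
             [5, 95]]
misclass_cost = [[0, 1],
                 [10, 0]]          # one kind of error is ten times worse
tests = {"blood_test": 200}
prices = {"blood_test": 0.5}
print(total_cost(confusion, misclass_cost, tests, prices))  # 10 + 50 + 100 = 160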
Myths and Legends of the Baldwin Effect
This position paper argues that the Baldwin effect is widely
misunderstood by the evolutionary computation community. The
misunderstandings appear to fall into two general categories.
Firstly, it is commonly believed that the Baldwin effect is
concerned with the synergy that results when there is an evolving
population of learning individuals. This is only half of the story.
The full story is more complicated and more interesting. The Baldwin
effect is concerned with the costs and benefits of lifetime
learning by individuals in an evolving population. Several
researchers have focussed exclusively on the benefits, but there
is much to be gained from attention to the costs. This paper explains
the two sides of the story and enumerates ten of the costs and
benefits of lifetime learning by individuals in an evolving population.
Secondly, there is a cluster of misunderstandings about the relationship
between the Baldwin effect and Lamarckian inheritance of acquired
characteristics. The Baldwin effect is not Lamarckian. A Lamarckian
algorithm is not better for most evolutionary computing problems than
a Baldwinian algorithm. Finally, Lamarckian inheritance is not a
better model of memetic (cultural) evolution than the Baldwin effect.
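The Baldwinian/Lamarckian distinction can be sketched on a toy problem: lifetime learning affects fitness (and carries a cost) in both cases, but only the Lamarckian variant writes the learned result back into the genome. Everything below (the fitness function, learning rule, and parameters) is an assumption for illustration, not a model from the paper.

import random

def fitness(x):
    # Toy one-dimensional problem: maximise -(x - 5)^2.
    return -(x - 5.0) ** 2

def learn(x, steps=5, step=0.2):
    # Simple lifetime hill-climbing; each learning step carries a small cost.
    for _ in range(steps):
        trial = x + random.uniform(-step, step)
        if fitness(trial) > fitness(x):
            x = trial
    return x, -0.01 * steps          # learned phenotype, learning cost

def next_generation(pop, lamarckian=False):
    scored = []
    for genome in pop:
        phenotype, cost = learn(genome)
        # Selection acts on the learned phenotype; inheritance differs.
        scored.append((fitness(phenotype) + cost,
                       phenotype if lamarckian else genome))
    scored.sort(reverse=True)
    survivors = [g for _, g in scored[: len(pop) // 2]]
    return [g + random.gauss(0, 0.1) for g in survivors for _ in (0, 1)]

pop = [random.uniform(0, 10) for _ in range(20)]
for _ in range(10):
    pop = next_generation(pop, lamarckian=False)   # Baldwinian run
print(sum(pop) / len(pop))  # genomes drift toward 5 although learning is not inherited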
The identification of context-sensitive features: A formal definition of context for concept learning
A large body of research in machine learning is concerned with supervised learning from examples. The examples are typically represented as vectors in a multi-dimensional feature space (also known as attribute-value descriptions). A teacher partitions a set of training examples into a finite number of classes. The task of the learning algorithm is to induce a concept from the training examples. In this paper, we formally distinguish three types of features: primary, contextual, and irrelevant features. We also formally define what it means for one feature to be context-sensitive to another feature. Context-sensitive features complicate the task of the learner and potentially impair the learner's performance. Our formal definitions make it possible for a learner to automatically identify context-sensitive features. After context-sensitive features have been identified, there are several strategies that the learner can employ for managing the features; however, a discussion of these strategies is outside of the scope of this paper. The formal definitions presented here correct a flaw in previously proposed definitions. We discuss the relationship between our work and a formal definition of relevance.
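A toy illustration of the three feature types: a primary feature that is predictive on its own within one context, a contextual feature that is useless alone but changes how the primary feature relates to the class, and an irrelevant feature. The data and rules below are invented; the paper's formal definitions are not reproduced here.

data = [
    # (size, context, noise, label)
    (1, "A", 7, "+"), (2, "A", 3, "+"), (8, "A", 5, "-"), (9, "A", 1, "-"),
    (1, "B", 2, "-"), (2, "B", 9, "-"), (8, "B", 4, "+"), (9, "B", 6, "+"),
]

def rule_without_context(size):
    # Ignoring the contextual feature: works only in context A.
    return "+" if size < 5 else "-"

def rule_with_context(size, context):
    # The contextual feature flips how "size" relates to the class.
    small = size < 5
    return "+" if (small if context == "A" else not small) else "-"

acc = lambda preds: sum(p == y for p, y in preds) / len(preds)
print(acc([(rule_without_context(s), y) for s, c, _, y in data]))  # 0.5
print(acc([(rule_with_context(s, c), y) for s, c, _, y in data]))  # 1.0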