31 research outputs found

    Analysing domain suitability of a sentiment lexicon by identifying distributionally bipolar words

    Contemporary sentiment analysis approaches rely heavily on lexicon-based methods, mainly because of their simplicity, although more complex techniques achieve the best empirical results. We introduce a method to assess the suitability of generic sentiment lexicons for a given domain by identifying frequent bigrams in which a polar word switches polarity. The bigrams are scored using Lexicographer's Mutual Information, leveraging large automatically obtained corpora. Our score matches human perception of polarity, and our enhanced context-aware method improves classification results. The method enhances the assessment of lexicon-based sentiment detection algorithms and can further be used to quantify ambiguous words.
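
The bigram scoring mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common definition of Lexicographer's Mutual Information as bigram frequency times pointwise mutual information, LMI(a, b) = f(a, b) * log2(N * f(a, b) / (f(a) * f(b))), with N taken here as the number of tokens. The function name `lmi_scores` and the toy corpus are hypothetical. The sketch covers only the scoring step, not the paper's detection of polarity switches.

```python
import math
from collections import Counter

def lmi_scores(tokens):
    """Score adjacent-word bigrams with LMI = f(a,b) * log2(N * f(a,b) / (f(a) * f(b))).

    Assumption: N is the number of tokens in the corpus.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (a, b), f_ab in bigrams.items():
        pmi = math.log2(f_ab * n / (unigrams[a] * unigrams[b]))
        scores[(a, b)] = f_ab * pmi
    return scores

# Toy corpus (hypothetical): "cold" co-occurs with different nouns.
tokens = "cold beer is good but cold service is bad and cold beer is cheap".split()
scores = lmi_scores(tokens)

# Print bigrams from highest to lowest LMI.
for bigram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(bigram, round(score, 3))
```

Because LMI multiplies PMI by the raw bigram count, frequent collocations rank above rare ones with the same PMI, which suits a search for frequent bigrams.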

    First Results of the Full-Scale OSQAR Photon Regeneration Experiment

    Recent intensive theoretical and experimental studies shed light on possible new physics beyond the standard model of particle physics, which can be probed with sub-eV energy experiments. In the second run of the OSQAR photon regeneration experiment, which looks for the conversion of photons to axions (or axion-like particles), two spare superconducting dipole magnets of the Large Hadron Collider (LHC) were used. In this paper we report first results obtained from a light beam propagating in vacuum within the 9 T field of two LHC dipole magnets. No excess of events above the background was detected, and the two-photon couplings of possible new scalar and pseudo-scalar particles could be constrained. Comment: 5 pages, 4 figures, Photon 2011 Conference, Submitted to JO

    Axion Search by Laser-based Experiment OSQAR

    The laser-based experiment OSQAR at CERN searches for axions by two methods. The photon regeneration experiment uses two LHC dipole magnets of length 14.3 m and magnetic field 9.5 T, equipped with an optical barrier at the end of the first magnet; this is the "light shining through a wall" configuration. No excess of events above the background was detected in this arrangement. Nevertheless, this result extends the exclusion region for the axion mass. The second method aims to measure the ultra-fine Vacuum Magnetic Birefringence for the first time. An optical scheme with an electro-optical modulator has been proposed, validated and subsequently improved. The Cotton-Mouton constant for air was determined in this experimental setup.

    Leveraging Lexical-Semantic Knowledge for Text Classification Tasks

    This dissertation is concerned with the applicability of knowledge contained in lexical-semantic resources to text classification tasks. Lexical-semantic resources aim at systematically encoding various types of information about the meaning of words and their relations. Text classification is the task of sorting a set of documents into categories from a predefined set, for example, “spam” and “not spam”. With the increasing amount of digitized text, as well as the increased availability of computing power, techniques to automate text classification have attracted booming interest. The early techniques classified documents using a set of rules manually defined by experts, e.g. computational linguists. The rise of big data led to the increased popularity of the distributional hypothesis, i.e., “the meaning of a word comes from its context”, and to the criticism of lexical-semantic resources as too academic for real-world NLP applications. For a long time, it was assumed that lexical-semantic knowledge would not lead to better classification results, as the meaning of every word can be learned directly from the document itself. In this thesis, we show that this assumption is not valid as a general statement and present several approaches in which lexicon-based knowledge leads to better results. Moreover, we show why these improved results can be expected. One of the first problems in natural language processing is lexical-semantic ambiguity. In text classification tasks, the ambiguity problem has often been neglected. For example, to classify the topic of a document containing the word 'bank', we don’t need to explicitly disambiguate it if we find the word 'river' or 'finance'. However, such an additional word may not always be present. Conveniently, lexical-semantic resources typically enumerate all senses of a word, letting us choose which word sense is the most plausible in our context.
What if we use knowledge-based sense disambiguation methods in addition to the information provided implicitly by the word context in the document? In this thesis, we evaluate the performance of selected resource-based word sense disambiguation algorithms on a range of document classification tasks (Chapter 3). We note that the lexicographic sense distinctions provided by lexical-semantic resources are not always optimal for every text classification task, and propose an alternative technique for disambiguating word meaning in context for sentiment analysis applications. The second problem in text classification, and in natural language processing in general, is synonymy. The words used in training documents represent only a tiny fraction of the total possible vocabulary. If we learn individual words, or senses, as features in the classification model, our system will not be able to interpret paraphrases, where the same meaning is conveyed using different expressions. How much would the classification performance improve if the system could determine that two very different words represent the same meaning? In this thesis, we propose to address the synonymy problem by automatically enriching the training and testing data with conceptual annotations accessible through lexical-semantic resources (Chapter 4). We show that such conceptual information (“supersenses”), in combination with the previous word sense disambiguation step, helps to build more robust classifiers and improves classification performance on multiple tasks (Chapter 5). We further circumvent the sense disambiguation step by training a supersense tagging model directly.
Previous evidence suggests that the sense distinctions of expert lexical-semantic resources are far subtler than what is needed for downstream NLP applications, and by disambiguating the concepts directly on a supersense level (e.g., “is the 'duck' an animal or a food?” rather than choosing between its eight WordNet senses), we can reduce the number of errors. The third problem in text classification is the curse of dimensionality. We want to know not only whether each single word predicts a certain document class, but also which combinations of words predict it and which do not. Our need for training data thus grows exponentially with the number of words monitored. Several techniques for dimensionality reduction have been proposed, most recently representation learning, which produces continuous word representations in a dense vector space, also known as word embeddings. However, these vectors are again produced on an ambiguous word level, and the valuable information about possible distinct senses of the same word is lost in favor of the most frequent one(s). In this thesis, we explore whether, and how, we can use lexical-semantic resources to regain the sense-level notion of semantic relatedness while operating within the deep learning paradigm, and thus still access the high-level conceptual information. We propose and evaluate a method to integrate word and supersense embeddings from large sense-disambiguated resources such as Wikipedia. We examine the impact of different training data on the quality of these embeddings, and demonstrate how to employ them in deep learning text classification experiments. Using convolutional and recurrent neural networks, we achieve a significant performance improvement over word embeddings in a range of downstream classification tasks.
The application of the methods proposed in this thesis is demonstrated in experiments estimating the demographics and personality of a text's author, and labeling a text with its subjective charge and the sentiment conveyed. We therefore also provide empirical insights into which types of features are informative for these document classification problems, and suggest explanations grounded in psychology and sociology. We further discuss the issues that can occur because human experts are prone to diverse biases when classifying data. To summarize, we show that lexical-semantic knowledge can improve text classification by supplying a hierarchy of abstract concepts, which enables better generalization over words, and that these methods are also effective in combination with deep learning techniques.
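
The idea of enriching documents with supersense annotations, as described in this abstract, can be sketched as follows. This is a minimal illustration and not the dissertation's pipeline. The lexicon `TOY_SUPERSENSES` and the function `enrich_with_supersenses` are hypothetical. The labels follow the WordNet lexicographer-file naming (e.g. `noun.animal`), and each word is given a single label with no disambiguation step.

```python
# Hypothetical toy lexicon: word -> one supersense label.
# A real system would first have to choose among several candidate labels
# (e.g. 'duck' as noun.animal vs. noun.food).
TOY_SUPERSENSES = {
    "duck": "noun.animal",
    "goose": "noun.animal",
    "pizza": "noun.food",
    "pasta": "noun.food",
}

def enrich_with_supersenses(tokens, lexicon=TOY_SUPERSENSES):
    """Return the tokens plus one 'SS=<label>' feature per token found in the lexicon."""
    features = list(tokens)
    for tok in tokens:
        label = lexicon.get(tok.lower())
        if label is not None:
            features.append("SS=" + label)
    return features

train_doc = enrich_with_supersenses("the duck swam".split())
test_doc = enrich_with_supersenses("a goose swam".split())

# 'duck' and 'goose' never co-occur, but both documents share the supersense feature.
shared = set(train_doc) & set(test_doc)
print(sorted(shared))
```

The shared `SS=noun.animal` feature is what lets a classifier trained on "duck" generalize to an unseen "goose", which is the synonymy problem the abstract describes.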

    Supersense Embeddings: A Unified Model for Supersense Interpretation, Prediction and Utilization


    Personality Profiling of Fictional Characters using Sense-Level Links between Lexical Resources

    “Always be yourself, unless you can be Batman. Then always be Batman.” – Bill Murray. This study focuses on personality prediction of protagonists in novels based on the Five-Factor Model of personality. We present and publish a novel collaboratively built dataset of fictional character personality and design our task as a text classification problem. We incorporate a range of semantic features, including WordNet and VerbNet sense-level information and word vector representations. We evaluate three machine learning models based on the speech, actions and predicatives of the main characters, and show that especially the lexical-semantic features significantly outperform the baselines. The most predictive features correspond to reported findings in personality psychology.

    Wikipedia Article Feedback

    The corpus lists article IDs of biographies of living and dead people, rated as above average or below average along four categories (trustworthy, objective, well written, complete) based on the ratings from Wikipedia Article Feedback v4 [http://en.wikipedia.org/wiki/Wikipedia:Article_Feedback_Tool] (each of the listed articles was rated at least 10 times).