
    An Analysis of Radicals-based Features in Subjectivity Classification on Simplified Chinese Sentences

    Department of Chinese and Bilingual Studies. Refereed conference paper

    Measuring Semantic Similarity of Documents by Using Named Entity Recognition Methods

    The work presented in this thesis was born from the desire to map documents that share semantic concepts. We address this problem as a named entity recognition task: we identify key concepts in the texts, categorize them, and then apply named entity recognition techniques to recognize those key concepts automatically in other documents. Specifically, we propose a classification method based on the recognition of named entities or key phrases, in which the method detects similarities between key concepts of the texts under analysis and, through Poincaré embeddings, models the relationships between those concepts. Thanks to the ability of Poincaré embeddings to capture relationships between words, we incorporated this capability into our classifier: for each word in a text, we check whether there are words close to it that are also close to the words making up the key phrases used as the gold standard. When potential close words forming a named entity are detected, the classifier applies a series of features to classify it. This methodology performed better than one that considered only the POS structure of the named entities and their n-grams, although the POS structure and the n-grams remained important for improving named entity recognition in our research. By reducing the time needed to recognize similar key phrases between documents, common tasks in large companies can see a notable benefit. An important example is the evaluation of resumes to determine the best professional for a specific position, a task that typically consumes a great deal of time; our contribution considerably reduces the time needed to find the best profiles for a job.
Experiments are reported on job descriptions and real resumes, and the methodology used to represent each of these documents through their key phrases is explained.
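    The proximity test the abstract describes can be sketched with the standard Poincaré-ball distance: a word is a key-phrase candidate if it lies within a hyperbolic-distance threshold of a gold-standard word. The toy embeddings, threshold, and function names below are illustrative assumptions, not the thesis's actual model.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points in the Poincare ball.
    Both vectors must have Euclidean norm < 1."""
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)
    nv = sum(b * b for b in v)
    return math.acosh(1 + 2 * diff / ((1 - nu) * (1 - nv)))

def near_key_phrase(word_vec, gold_vecs, threshold=1.0):
    """Flag a word as a key-phrase candidate if it lies within
    `threshold` hyperbolic distance of any gold-standard word."""
    return any(poincare_distance(word_vec, g) <= threshold for g in gold_vecs)

# Hypothetical toy embeddings; real ones would come from a trained model.
emb = {
    "engineer":  [0.10, 0.20],
    "developer": [0.12, 0.22],
    "banana":    [-0.70, 0.60],
}
gold = [emb["engineer"]]
print(near_key_phrase(emb["developer"], gold))  # True: close in the ball
print(near_key_phrase(emb["banana"], gold))     # False: far away
```

Because distances near the boundary of the ball grow rapidly, hierarchically related words (trained to sit near each other) pass the threshold while unrelated ones do not.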

    Proceedings

    Proceedings of the Workshop CHAT 2011: Creation, Harmonization and Application of Terminology Resources. Editors: Tatiana Gornostay and Andrejs Vasiļjevs. NEALT Proceedings Series, Vol. 12 (2011). © 2011 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia): http://hdl.handle.net/10062/16956

    Semi-Supervised Learning For Identifying Opinions In Web Content

    Thesis (Ph.D.) - Indiana University, Information Science, 2011.
    Opinions published on the World Wide Web (Web) offer opportunities for detecting personal attitudes regarding topics, products, and services. The opinion detection literature indicates that both a large body of opinions and a wide variety of opinion features are essential for capturing subtle opinion information. Although a large amount of opinion-labeled data is preferable for opinion detection systems, opinion-labeled data is often limited, especially at sub-document levels, and manual annotation is tedious, expensive, and error-prone. This shortage of opinion-labeled data is less challenging in some domains (e.g., movie reviews) than in others (e.g., blog posts). While a simple method for improving accuracy in challenging domains is to borrow opinion-labeled data from a non-target data domain, this approach often fails because of the domain transfer problem: opinion detection strategies designed for one data domain generally do not perform well in another. However, while it is difficult to obtain opinion-labeled data, unlabeled user-generated opinion data are readily available. Semi-supervised learning (SSL) requires only limited labeled data to automatically label unlabeled data and has achieved promising results in various natural language processing (NLP) tasks, including traditional topic classification; yet SSL has been applied in only a few opinion detection studies. This study investigates the application of four different SSL algorithms to three types of Web content: edited news articles, semi-structured movie reviews, and the informal and unstructured content of the blogosphere. The SSL algorithms are also evaluated for their effectiveness in sparse data situations and domain adaptation. Research findings suggest that, when labeled data is limited, SSL is a promising approach for opinion detection in Web content.
Although the contributions of SSL varied across data domains, significant improvement was demonstrated for the most challenging data domain, the blogosphere, when a domain transfer-based SSL strategy was implemented.
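    The core SSL idea the abstract relies on, using a small labeled seed to confidently label unlabeled data, can be illustrated with one round of self-training. The cue lexicon, toy documents, scoring rule, and confidence threshold below are assumptions for illustration, not the dissertation's actual classifiers.

```python
# A minimal self-training (semi-supervised) sketch for opinion detection.
OPINION_WORDS = {"love", "hate", "great", "awful", "think", "believe"}

def score(doc):
    """Fraction of tokens that are opinion cues (a stand-in classifier)."""
    tokens = doc.lower().split()
    return sum(t in OPINION_WORDS for t in tokens) / max(len(tokens), 1)

def self_train(labeled, unlabeled, confidence=0.3):
    """One self-training round: move unlabeled docs into the labeled pool
    when the classifier is confident (high cue density -> opinion,
    no cues at all -> fact); keep the rest for a later round."""
    newly_labeled, still_unlabeled = [], []
    for doc in unlabeled:
        s = score(doc)
        if s >= confidence:
            newly_labeled.append((doc, "opinion"))
        elif s == 0.0:
            newly_labeled.append((doc, "fact"))
        else:
            still_unlabeled.append(doc)
    return labeled + newly_labeled, still_unlabeled

labeled = [("i love this movie", "opinion")]
unlabeled = ["i hate the awful ending", "the film runs 120 minutes"]
labeled, unlabeled = self_train(labeled, unlabeled)
print(labeled)
```

In a full system the classifier would be retrained on the grown labeled pool and the round repeated until no confident predictions remain, which is exactly where the domain transfer problem surfaces: confidence estimates learned on one domain mislabel data from another.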

    Multilingual Sentiment Analysis: State of the Art and Independent Comparison of Techniques

    With the advent of the Internet, people actively express their opinions about products, services, events, political parties, etc., in social media, blogs, and website comments. The amount of research work on sentiment analysis is growing explosively. However, the majority of research efforts are devoted to English-language data, while a great share of information is available in other languages. We present a state-of-the-art review of multilingual sentiment analysis. More importantly, we compare our own implementations of existing approaches on common data. The precision observed in our experiments is typically lower than that reported by the original authors, which we attribute to the lack of detail in the original presentation of those approaches. Thus, we compare the existing works by what they really offer to the reader, including whether they allow for accurate implementation and for reliable reproduction of the reported results.

    Proceedings of the Conference on Natural Language Processing 2010

    This book contains state-of-the-art contributions to the 10th conference on Natural Language Processing, KONVENS 2010 (Konferenz zur Verarbeitung natürlicher Sprache), with a focus on semantic processing. KONVENS in general aims at offering a broad perspective on current research and developments within the interdisciplinary field of natural language processing. The central theme draws specific attention to addressing linguistic aspects of meaning, covering deep as well as shallow approaches to semantic processing. The contributions address both knowledge-based and data-driven methods for modelling and acquiring semantic information, and discuss the role of semantic information in applications of language technology. The articles demonstrate the importance of semantic processing, and present novel and creative approaches to natural language processing in general. Some contributions focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, on semantic knowledge acquisition and exploitation with respect to collaboratively built resources, or on harvesting semantic information in virtual games. Others are set within the context of real-world applications, such as Authoring Aids, Text Summarisation and Information Retrieval. The collection highlights the importance of semantic processing for different areas and applications in Natural Language Processing, and provides the reader with an overview of current research in this field.

    An investigation of tense, aspect and other verb group features for English proficiency assessment on different Asian learner corpora

    Recent interest in second language acquisition has resulted in studies of the relationship between linguistic indices and writing proficiency in English. This thesis investigates the influence of basic linguistic notions, introduced early in English grammar, on automatic proficiency evaluation tasks. We discuss the predictive potential of verb features (tense, aspect, voice, type and degree of embedding) and compare them to word-level n-grams (unigrams, bigrams, trigrams) for proficiency assessment. We conducted four experiments using standard language corpora that differ in the authors' cultural backgrounds and essay topic variety. Tense showed little variation across proficiency levels or language of origin, making it a poor predictor for our corpora, but tense and aspect combined showed promise, especially for more natural and varied datasets. Overall, our experiments illustrated that verb features, when examined individually, form a baseline for writing proficiency prediction. Feature combinations, however, perform better for these verb features, which are grammatically not independent. Finally, we investigate how language homogeneity due to corpus design influences the performance of our features. We find that the majority of the essays we examined use present tense, indefinite aspect and passive voice, thus greatly limiting the discriminative power of tense, aspect, and voice features. Linguistic features therefore have to be tested for their interoperability together with their effectiveness on the corpora used. We conclude that all corpus-based research should include an early validation step that investigates feature independence, feature interoperability, and feature value distribution in a reference corpus to anticipate potentially spurious data sparsity effects.
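    The kind of verb-group feature extraction the abstract describes, detecting tense, aspect, and voice from a clause, can be sketched with simple heuristics over POS-tagged tokens. The Penn Treebank tags, auxiliary word lists, and pattern rules below are assumptions for illustration, not the thesis's exact feature set.

```python
def verb_group_features(tagged):
    """tagged: list of (word, Penn-Treebank-tag) pairs for one clause.
    Returns boolean tense/aspect/voice indicators."""
    words = [w.lower() for w, _ in tagged]
    tags = [t for _, t in tagged]
    be_forms = {"am", "is", "are", "was", "were", "be", "been", "being"}
    return {
        # Tense: any past-tense verb form vs. present-tense forms.
        "past": "VBD" in tags,
        "present": any(t in ("VBP", "VBZ") for t in tags),
        # Perfect aspect: have/has/had followed later by a past participle.
        "perfect": any(w in ("have", "has", "had") and "VBN" in tags[i + 1:]
                       for i, w in enumerate(words)),
        # Progressive aspect: a form of "be" followed by a gerund.
        "progressive": any(w in be_forms and "VBG" in tags[i + 1:]
                           for i, w in enumerate(words)),
        # Passive voice: a form of "be" followed by a past participle.
        "passive": any(w in be_forms and "VBN" in tags[i + 1:]
                       for i, w in enumerate(words)),
    }

clause = [("the", "DT"), ("essay", "NN"), ("was", "VBD"),
          ("written", "VBN"), ("quickly", "RB")]
print(verb_group_features(clause))
```

Per-clause indicators like these can then be aggregated into per-essay proportions, which is the level at which the abstract's finding applies: if nearly every essay is present-tense, indefinite-aspect, and passive, these proportions barely vary and lose discriminative power.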