799 research outputs found

    Linguistic Geometries for Unsupervised Dimensionality Reduction

    Text documents are complex, high-dimensional objects. To effectively visualize such data it is important to reduce its dimensionality and visualize the low-dimensional embedding as a 2-D or 3-D scatter plot. In this paper we explore dimensionality reduction methods that draw upon domain knowledge in order to achieve a better low-dimensional embedding and visualization of documents. We consider the use of geometries specified manually by an expert, geometries derived automatically from corpus statistics, and geometries computed from linguistic resources. Comment: 13 pages, 15 figures.

    Event detection, tracking, and visualization in Twitter: a mention-anomaly-based approach

    The ever-growing number of people using Twitter makes it a valuable source of timely information. However, detecting events in Twitter is a difficult task, because tweets that report interesting events are overwhelmed by a large volume of tweets on unrelated topics. Existing methods focus on the textual content of tweets and ignore the social aspect of Twitter. In this paper we propose MABED (mention-anomaly-based event detection), a novel statistical method that relies solely on tweets and leverages the creation frequency of dynamic links (i.e. mentions) that users insert in tweets to detect significant events and estimate the magnitude of their impact on the crowd. MABED also differs from the literature in that it dynamically estimates the period of time during which each event is discussed, rather than assuming a predefined fixed duration for all events. The experiments we conducted on both English and French Twitter data show that the mention-anomaly-based approach leads to more accurate event detection and improved robustness in the presence of noisy Twitter content. Qualitatively speaking, we find that MABED helps with the interpretation of detected events by providing clear textual descriptions and precise temporal descriptions. We also show how MABED can help in understanding users' interests. Furthermore, we describe three visualizations designed to support efficient exploration of the detected events. Comment: 17 pages.
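    The core idea of mention-anomaly scoring can be sketched as follows. This is a toy illustration, not the MABED implementation: it flags time slices whose mention count deviates sharply from the corpus-wide average, with all names and data invented here.

```python
def mention_anomaly(slices, expected_rate=None):
    """Score each time slice by how far its mention count deviates
    from the average rate (a toy stand-in for MABED's anomaly
    measure; positive scores mark candidate events)."""
    counts = [sum(1 for tweet in s if "@" in tweet) for s in slices]
    if expected_rate is None:
        expected_rate = sum(counts) / len(counts)
    # Relative deviation from the expected mention rate.
    return [(c - expected_rate) / max(expected_rate, 1e-9) for c in counts]

slices = [
    ["hello world", "nice day"],                  # quiet slice
    ["@a breaking news", "@b wow", "@c unreal"],  # bursty slice
    ["just lunch"],                               # quiet slice
]
scores = mention_anomaly(slices)  # middle slice scores highest
```

    A real system would also estimate, per event, the interval over which the anomaly persists, which is what distinguishes MABED from fixed-duration detectors.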

    Improving OCR Post Processing with Machine Learning Tools

    Optical Character Recognition (OCR) post-processing involves data cleaning steps for digitized documents, such as a book or a newspaper article. One step in this process is the identification and correction of spelling and grammar errors generated by flaws in the OCR system. This work reports on our efforts to enhance post-processing for large repositories of documents. The main contributions of this work are:
    • Development of tools and methodologies to build both OCR and ground-truth text correspondence for training and testing the techniques proposed in our experiments. In particular, we explain the alignment problem and tackle it with our de novo algorithm, which has shown a high success rate.
    • Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.
    • Application of machine learning tools to generalize past ad hoc approaches to OCR error correction. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text.
    • Use of container technology to address the state of reproducible research in OCR and computer science as a whole. Many past experiments in the field of OCR are not considered reproducible, raising the question of whether the original results were outliers or finessed.
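    The logistic-regression candidate selection mentioned above can be sketched roughly like this. The feature weights and frequency table below are invented for illustration, not trained on the Web 1T corpus as in the original work:

```python
import math

def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def score(candidate, ocr_token, freq, w=(-1.5, 0.8), b=0.0):
    # Logistic score over two features: edit distance to the OCR
    # token and log corpus frequency (hand-set weights; a real
    # system would fit them on aligned OCR/ground-truth pairs).
    x = w[0] * edit_distance(candidate, ocr_token) \
        + w[1] * math.log(freq.get(candidate, 1)) + b
    return 1.0 / (1.0 + math.exp(-x))

freq = {"the": 1_000_000, "them": 50_000, "thc": 10}
best = max(["the", "them", "thc"], key=lambda c: score(c, "thc", freq))
# "the" wins: one edit away and far more frequent than "thc"
```

    The same scoring shape extends naturally to context features, e.g. the Web 1T count of the candidate inside its surrounding n-gram.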

    Evaluation of semantic dependencies in a conceptual co-occurrence network of a medical vocabulary

    The amount of medical knowledge is constantly growing, providing new hope for people with health-related problems. However, a challenge is to develop flexible methods that facilitate managing and interpreting large medical knowledge entities. There is a need to enhance health literacy by developing personalized health support tools, and to assist decision-making with decision support tools. The recent and on-going changes in everyday life on both technological and societal levels (for example, the adoption of smartphones and personal mobile medical tracking devices, social networking, open source and open data initiatives, the fast growth of accumulated medical data, and the need for new self-care solutions for the aging European population) motivate investment in the development of new computerized, personalized methods for knowledge management of medical data for diagnosis and treatment. To enable the creation of new adaptive, personalized health support tools we have carried out an evaluation of semantic dependencies in a conceptual co-occurrence network covering a set of concepts of a medical vocabulary, with experimental results ranging up to 2994 unique nouns, 82814 unique conceptual links and 200000 traversed link steps. Peer reviewed
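    A conceptual co-occurrence network of the kind evaluated here can be sketched minimally: link terms that share a sentence, then traverse link steps with a weighted random walk. This is a toy reconstruction under our own assumptions, not the authors' system, and the example vocabulary is invented:

```python
import random
from collections import defaultdict

def build_cooccurrence(sentences):
    # Link every pair of tokens that co-occur in a sentence,
    # weighting each link by its co-occurrence count. (The paper
    # restricts nodes to nouns; we skip POS tagging for brevity.)
    net = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        words = s.lower().split()
        for i, w in enumerate(words):
            for v in words[i + 1:]:
                if v != w:
                    net[w][v] += 1
                    net[v][w] += 1
    return net

def traverse(net, start, steps, rng):
    # Weighted random walk: stronger links are followed more often,
    # a simple stand-in for the "traversed link steps" above.
    node, visited = start, [start]
    for _ in range(steps):
        nbrs = list(net[node])
        weights = [net[node][n] for n in nbrs]
        node = rng.choices(nbrs, weights=weights)[0]
        visited.append(node)
    return visited

net = build_cooccurrence([
    "diabetes raises glucose",
    "insulin lowers glucose",
])
walk = traverse(net, "glucose", 5, random.Random(0))
```

    Walk statistics over many traversals then give a usage-based measure of how strongly two medical concepts depend on each other.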

    Crowdsourcing a Word-Emotion Association Lexicon

    Even though considerable attention has been given to the polarity of words (positive and negative) and the creation of large polarity lexicons, research in emotion analysis has had to rely on limited and small emotion lexicons. In this paper we show how the combined strength and wisdom of the crowds can be used to generate a large, high-quality, word-emotion and word-polarity association lexicon quickly and inexpensively. We enumerate the challenges of emotion annotation in a crowdsourcing scenario and propose solutions to address them. Most notably, in addition to questions about the emotions associated with terms, we show how the inclusion of a word-choice question can discourage malicious data entry, help identify instances where the annotator may not be familiar with the target term (allowing us to reject such annotations), and help obtain annotations at the sense level (rather than at the word level). We conducted experiments on how to formulate the emotion-annotation questions, and show that asking if a term is associated with an emotion leads to markedly higher inter-annotator agreement than asking if a term evokes an emotion.
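    The word-choice quality check can be sketched as a simple filter: an annotation is kept only if the annotator also picked the correct near-synonym of the target term. This is a hypothetical re-creation of the mechanism described above; the field names and example data are ours:

```python
def filter_annotations(responses, gold_choices):
    """Keep only annotations whose word-choice answer matches the
    gold synonym for the target term (illustrative schema)."""
    kept = []
    for r in responses:
        if r["word_choice"] == gold_choices[r["term"]]:
            kept.append(r)  # annotator likely knows the term
        # otherwise discard: unfamiliar term or malicious entry
    return kept

gold = {"ecstatic": "overjoyed"}
responses = [
    {"term": "ecstatic", "word_choice": "overjoyed", "emotion": "joy"},
    {"term": "ecstatic", "word_choice": "sleepy", "emotion": "anger"},
]
clean = filter_annotations(responses, gold)  # second response dropped
```

    Because the gold synonym targets one particular sense, passing the check also pins the emotion annotation to that sense rather than to the word form.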

    An Emergent Approach to Text Analysis Based on a Connectionist Model and the Web

    In this paper, we present a method to provide proactive assistance in text checking, based on usage relationships between words as structured on the Web. For a given sentence, the method builds a connectionist structure of relationships between word n-grams. This structure is then parameterized by means of an unsupervised and language-agnostic optimization process. Finally, the method provides a representation of the sentence that allows the least prominent usage-based relational patterns to emerge, helping to easily find badly written and unpopular text. The study includes the problem statement and its characterization in the literature, as well as the proposed solving approach and some experimental uses.

    Enabling personalized healthcare by analyzing semantic dependencies in a conceptual co-occurrence network based on a medical vocabulary

    The amount of medical knowledge is constantly growing, providing new hope for people with health-related problems. However, a challenge is to develop flexible methods that facilitate managing and interpreting large medical knowledge entities. There is a need to enhance health literacy by developing personalized health support tools, and to assist decision-making with decision support tools. The recent and on-going changes in everyday life on both technological and societal levels (for example, the adoption of smart phones and personal mobile medical tracking devices, social networking, open source and open data initiatives, the fast growth of accumulated medical data, and the need for new self-care solutions for the aging European population) motivate investment in the development of new computerized, personalized methods for knowledge management of medical data for diagnosis and treatment. To enable the creation of new adaptive, personalized health support tools we have carried out an evaluation of semantic dependencies in a conceptual co-occurrence network covering a set of concepts of a medical vocabulary, with experimental results ranging up to 2994 unique nouns, 82814 unique conceptual links and 200000 traversed link steps. Peer reviewed

    A Real-Time N-Gram Approach to Choosing Synonyms Based on Context

    Synonymy is an important part of all natural language, but not all synonyms are created equal. Just because two words are synonymous, it usually doesn't mean they can always be interchanged. The problem we attempt to address is that of near-synonymy: choosing the right word based purely on its surrounding words. This new computational method, unlike previous methods used on this problem, is capable of making multiple word suggestions, which more accurately models human choice. It covers a large number of words, does not require training, and can run in real time. On previous testing data, when able to make multiple suggestions, it improved by over 17 percentage points on the previous best method, and by 4.5 percentage points on average (with a maximum of 14 percentage points) on the human annotators' near-synonym choices. In addition, this thesis presents new synonym sets and human-annotated test data that more accurately fit this problem.
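    A context-based near-synonym chooser of this general shape can be sketched with plain n-gram counts. The counts below are invented for illustration, not drawn from the system described above:

```python
def choose_synonyms(left, right, synonyms, ngram_counts):
    """Rank near-synonyms by how often the surrounding trigram
    (left, candidate, right) occurs in a corpus; returning the
    whole ranking lets the method make multiple suggestions."""
    def count(word):
        return ngram_counts.get((left, word, right), 0)
    return sorted(synonyms, key=count, reverse=True)

# Toy trigram table standing in for real corpus counts.
counts = {
    ("a", "strong", "coffee"): 120,
    ("a", "potent", "coffee"): 7,
    ("a", "powerful", "coffee"): 3,
}
ranking = choose_synonyms("a", "coffee",
                          ["powerful", "strong", "potent"], counts)
# "strong" ranks first: native speakers say "a strong coffee"
```

    Because the ranking is just a lookup and a sort, no training pass is needed and the query can be answered in real time, matching the design goals stated in the abstract.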

    Semantic modeling of healthcare guidelines to support health literacy and patient engagement

    Developing new methods and solutions for personalized medicine can address many current socio-economic challenges, both locally and globally. The investments made to support health can help people to have an independent, productive and happy life. To motivate the development of new patient support tools, we illustrate the need for better health literacy and patient engagement, some common frameworks for modeling medical knowledge, and some ways to support patients with online health queries and shared decision making. Then we provide some experimental results generated by semantic analysis of healthcare guidelines offered by The Finnish Medical Society Duodecim, containing 85 055 words, from which we created a conceptual network of 57 679 unique conceptual links traversed with 200 000 link steps. We suggest that our approach to semantic modeling of medical knowledge can be modularly applied to develop varied computational solutions for personalized medicine and health informatics. Peer reviewed

    Detection of semantic errors in Arabic texts

    Detecting semantic errors in a text is still a challenging area of investigation. A lot of research has been done on lexical and syntactic errors, while fewer studies have tackled semantic errors, as they are more difficult to treat. Compared to other languages, Arabic appears to pose a special challenge for this problem. Because words are graphically very similar to each other, the risk of semantic errors in Arabic texts is greater. Moreover, there are special cases and unique complexities in this language. This paper deals with the detection of semantic errors in Arabic texts, but the approach we have adopted can also be applied to texts in other languages. It combines four contextual methods (using statistics and linguistic information) in order to decide on the semantic validity of a word in a sentence. We chose to implement our approach on a distributed architecture, namely a Multi-Agent System (MAS). The implemented system achieved a precision rate of about 90% and a recall rate of about 83%.
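    Combining several contextual checkers can be sketched as a simple vote over their individual verdicts. This loosely mirrors the multi-method combination described above; the four checkers below are toy stand-ins, not the paper's statistical and linguistic methods, and the abstract does not specify how the real combination is weighted:

```python
def semantically_valid(word, sentence, methods, threshold=0.5):
    """A word is accepted when at least `threshold` of the
    contextual methods vote that it fits the sentence."""
    votes = [m(word, sentence) for m in methods]
    return sum(votes) / len(votes) >= threshold

# Toy checkers: each returns True if the word looks valid in context.
in_vocab   = lambda w, s: w.isalpha()
cooccurs   = lambda w, s: any(t != w for t in s.split())
not_repeat = lambda w, s: s.split().count(w) <= 1
length_ok  = lambda w, s: len(w) > 1

ok = semantically_valid("bank", "deposit money at the bank",
                        [in_vocab, cooccurs, not_repeat, length_ok])
# all four checkers accept "bank" here, so ok is True
```

    In a multi-agent deployment, each checker would run as its own agent and the vote would be taken over the agents' reported verdicts.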