121,465 research outputs found

    Confidence measures for hybrid HMM/ANN speech recognition.

    In this paper we introduce four acoustic confidence measures which are derived from the output of a hybrid HMM/ANN large vocabulary continuous speech recognition system. These confidence measures, based on local posterior probability estimates computed by an ANN, are evaluated at both phone and word levels, using the North American Business News corpus.

    Query Expansion with Locally-Trained Word Embeddings

    Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus- and query-specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings.

    Case Notes


    Exploratory topic modeling with distributional semantics

    As we continue to collect and store textual data in a multitude of domains, we are regularly confronted with material whose largely unknown thematic structure we want to uncover. With unsupervised, exploratory analysis, no prior knowledge about the content is required and highly open-ended tasks can be supported. In the past few years, probabilistic topic modeling has emerged as a popular approach to this problem. Nevertheless, the representation of the latent topics as aggregations of semi-coherent terms limits their interpretability and level of detail. This paper presents an alternative approach to topic modeling that maps topics as a network for exploration, based on distributional semantics using learned word vectors. From the granular level of terms and their semantic similarity relations, global topic structures emerge as clustered regions and gradients of concepts. Moreover, the paper discusses the visual interactive representation of the topic map, which plays an important role in supporting its exploration. Comment: Conference: The Fourteenth International Symposium on Intelligent Data Analysis (IDA 2015)

    Distantly Labeling Data for Large Scale Cross-Document Coreference

    Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on "distantly-labeling" a data set from which we can train a discriminative cross-document coreference model. In particular, we build a dataset of more than a million people mentions extracted from 3.5 years of New York Times articles, leverage Wikipedia for distant labeling with a generative model (and measure the reliability of such labeling); then we train and evaluate a conditional random field coreference model that has factors on cross-document entities as well as mention-pairs. This coreference model obtains high accuracy in resolving mentions and entities that are not present in the training data, indicating applicability to non-Wikipedia data. Given the large amount of data, our work is also an exercise demonstrating the scalability of our approach. Comment: 16 pages, submitted to ECML 201

    Animacy in early New Zealand English

    The literature suggests that animacy effects in present-day spoken New Zealand English (NZE) differ from animacy effects in other varieties of English. We seek to determine whether such differences have a history in earlier NZE writing. We revisit two grammatical phenomena, progressives and genitives, that are well known to be sensitive to animacy effects, and we study these phenomena in corpora sampling 19th- and early 20th-century written NZE; for reference purposes, we also study parallel samples of 19th- and early 20th-century British English and American English. We indeed find significant regional differences between early New Zealand writing and the other varieties in terms of the effect that animacy has on the frequency and probabilities of grammatical phenomena.

    THE ACCUSED IS ENTERING THE COURTROOM: THE LIVE-TWEETING OF A MURDER TRIAL.

    © 2017 Informa UK Limited, trading as Taylor & Francis Group. The use of social media is now widely accepted within journalism as an outlet for news information. Live tweeting of unfolding events is standard practice. In March 2014, Oscar Pistorius went on trial in the Gauteng High Court for murder. Hundreds of journalists present began live-tweeting coverage, an unprecedented combination of international interest, permission to use technology and access which resulted in massive streams of consciousness reports of events as they unfolded. Based on a corpus of Twitter feeds of twenty-four journalists covering the trial, this study analyses the content and strategies of these feeds in order to present an understanding of how microblogging is used as a live reporting tool. This study shows the development of standardised language and strategies in reporting on Twitter, concluding that journalists adopt a narrow range of approaches, with no significant variation in terms of gender, location, or medium. This is in contrast to earlier studies in the field (Awad, 2006; Hedman, 2015; Kothari, 2010; Lariscy, Avery, Sweetser, & Howes, 2009; Lasorsa, 2012; Lasorsa, Lewis, & Holton, 2011; Sigal, 1999; Vis, 2013). Peer reviewed