Confidence measures for hybrid HMM/ANN speech recognition.
In this paper we introduce four acoustic confidence measures which are derived from the output of a hybrid HMM/ANN large vocabulary continuous speech recognition system. These confidence measures, based on local posterior probability estimates computed by an ANN, are evaluated at both phone and word levels, using the North American Business News corpus.
Query Expansion with Locally-Trained Word Embeddings
Continuous space word embeddings have received a great deal of attention in
the natural language processing and machine learning communities for their
ability to model term similarity and other relationships. We study the use of
term relatedness in the context of query expansion for ad hoc information
retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when
trained globally, underperform corpus and query specific embeddings for
retrieval tasks. These results suggest that other tasks benefiting from global
embeddings may also benefit from local embeddings.
Exploratory topic modeling with distributional semantics
As we continue to collect and store textual data in a multitude of domains,
we are regularly confronted with material whose largely unknown thematic
structure we want to uncover. With unsupervised, exploratory analysis, no prior
knowledge about the content is required and highly open-ended tasks can be
supported. In the past few years, probabilistic topic modeling has emerged as a
popular approach to this problem. Nevertheless, the representation of the
latent topics as aggregations of semi-coherent terms limits their
interpretability and level of detail.
This paper presents an alternative approach to topic modeling that maps
topics as a network for exploration, based on distributional semantics using
learned word vectors. From the granular level of terms and their semantic
similarity relations, global topic structures emerge as clustered regions and
gradients of concepts. Moreover, the paper discusses the visual interactive
representation of the topic map, which plays an important role in supporting
its exploration.
Comment: Conference: The Fourteenth International Symposium on Intelligent
Data Analysis (IDA 2015)
Distantly Labeling Data for Large Scale Cross-Document Coreference
Cross-document coreference, the problem of resolving entity mentions across
multi-document collections, is crucial to automated knowledge base construction
and data mining tasks. However, the scarcity of large labeled data sets has
hindered supervised machine learning research for this task. In this paper we
develop and demonstrate an approach based on "distantly-labeling" a data set
from which we can train a discriminative cross-document coreference model. In
particular we build a dataset of more than a million people mentions extracted
from 3.5 years of New York Times articles, leverage Wikipedia for distant
labeling with a generative model (and measure the reliability of such
labeling); then we train and evaluate a conditional random field coreference
model that has factors on cross-document entities as well as mention-pairs.
This coreference model obtains high accuracy in resolving mentions and entities
that are not present in the training data, indicating applicability to
non-Wikipedia data. Given the large amount of data, our work is also an
exercise demonstrating the scalability of our approach.
Comment: 16 pages, submitted to ECML 201
Animacy in early New Zealand English
The literature suggests that animacy effects in present-day spoken New Zealand English (NZE) differ from animacy effects in other varieties of English. We seek to determine if such differences have a history in earlier NZE writing or not. We revisit two grammatical phenomena, progressives and genitives, that are well known to be sensitive to animacy effects, and we study these phenomena in corpora sampling 19th- and early 20th-century written NZE; for reference purposes, we also study parallel samples of 19th- and early 20th-century British English and American English. We indeed find significant regional differences between early New Zealand writing and the other varieties in terms of the effect that animacy has on the frequency and probabilities of grammatical phenomena.
THE ACCUSED IS ENTERING THE COURTROOM: THE LIVE-TWEETING OF A MURDER TRIAL.
© 2017 Informa UK Limited, trading as Taylor & Francis Group. The use of social media is now widely accepted within journalism as an outlet for news information. Live tweeting of unfolding events is standard practice. In March 2014, Oscar Pistorius went on trial in the Gauteng High Court for murder. Hundreds of journalists present began live-tweeting coverage, an unprecedented combination of international interest, permission to use technology and access which resulted in massive stream-of-consciousness reports of events as they unfolded. Based on a corpus of Twitter feeds of twenty-four journalists covering the trial, this study analyses the content and strategies of these feeds in order to present an understanding of how microblogging is used as a live reporting tool. This study shows the development of standardised language and strategies in reporting on Twitter, concluding that journalists adopt a narrow range of approaches, with no significant variation in terms of gender, location, or medium. This is in contrast to earlier studies in the field (Awad, 2006; Hedman, 2015; Kothari, 2010; Lariscy, Avery, Sweetser, & Howes, 2009; Lasorsa, 2012; Lasorsa, Lewis, & Holton, 2011; Sigal, 1999; Vis, 2013). Peer reviewed