127 research outputs found
Accurate Local Estimation of Geo-Coordinates for Social Media Posts
Associating geo-coordinates with the content of social media posts can
enhance many existing applications and services and enable a host of new ones.
Unfortunately, a majority of social media posts are not tagged with
geo-coordinates. Even when location data is available, it may be inaccurate,
very broad or sometimes fictitious. Contemporary location estimation approaches
based on analyzing the content of these posts can identify only broad areas
such as a city, which limits their usefulness. To address these shortcomings,
this paper proposes a methodology to narrowly estimate the geo-coordinates of
social media posts with high accuracy. The methodology relies solely on the
content of these posts and prior knowledge of the wide geographical region from
where the posts originate. An ensemble of language models, which are smoothed
over non-overlapping sub-regions of a wider region, lies at the heart of the
methodology. Experimental evaluation using a corpus of over half a million
tweets from New York City shows that the approach, on average, estimates
locations of tweets to within just 2.15 km of their actual positions.
Comment: In Proceedings of the 26th International Conference on Software
Engineering and Knowledge Engineering, pp. 642 - 647, 201
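The idea of scoring a post against per-sub-region language models can be sketched as follows. This is a simplified illustration, not the paper's method: it uses a single uniform grid with add-alpha smoothed unigram models rather than an ensemble, and the names `cell_of`, `train`, `estimate`, the grid size `n`, and the bounding box are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict

def cell_of(lat, lon, bounds, n):
    """Map a coordinate to a (row, col) cell of an n x n grid over bounds."""
    lat_min, lat_max, lon_min, lon_max = bounds
    row = min(int((lat - lat_min) / (lat_max - lat_min) * n), n - 1)
    col = min(int((lon - lon_min) / (lon_max - lon_min) * n), n - 1)
    return row, col

def train(posts, bounds, n):
    """Collect word counts per grid cell from (text, lat, lon) triples."""
    counts = defaultdict(Counter)
    for text, lat, lon in posts:
        counts[cell_of(lat, lon, bounds, n)].update(text.lower().split())
    return counts

def estimate(text, counts, bounds, n, alpha=1.0):
    """Return the centre of the cell whose unigram model best explains text."""
    vocab = {w for c in counts.values() for w in c}
    v = len(vocab) + 1  # one extra slot for unseen words
    words = text.lower().split()

    def score(cell):
        c = counts[cell]
        total = sum(c.values())
        # Add-alpha smoothed log-likelihood of the post under this cell's model.
        return sum(math.log((c[w] + alpha) / (total + alpha * v)) for w in words)

    row, col = max(list(counts), key=score)
    lat_min, lat_max, lon_min, lon_max = bounds
    lat = lat_min + (row + 0.5) * (lat_max - lat_min) / n
    lon = lon_min + (col + 0.5) * (lon_max - lon_min) / n
    return lat, lon
```

Returning the cell centre means the error is bounded below by the grid resolution; a finer grid reduces that bound but leaves fewer training posts per cell, which is presumably why smoothing matters in the approach described above.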
Memory-Based Learning: Using Similarity for Smoothing
This paper analyses the relation between the use of similarity in
Memory-Based Learning and the notion of backed-off smoothing in statistical
language modeling. We show that the two approaches are closely related, and we
argue that feature weighting methods in the Memory-Based paradigm can offer the
advantage of automatically specifying a suitable domain-specific hierarchy
between most specific and most general conditioning information without the
need for a large number of parameters. We report two applications of this
approach: PP-attachment and POS-tagging. Our method achieves state-of-the-art
performance in both domains, and allows the easy integration of diverse
information sources, such as rich lexical representations.
Comment: 8 pages, uses aclap.sty, To appear in Proc. ACL/EACL 9
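A minimal sketch of similarity-based classification with feature weighting, assuming information gain as the weighting scheme (the abstract does not name the specific method) and a weighted-overlap nearest-neighbour rule. The function names and the toy PP-attachment instances are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(instances, labels, i):
    """Reduction in label entropy obtained by knowing the value of feature i."""
    by_value = {}
    for x, y in zip(instances, labels):
        by_value.setdefault(x[i], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

def classify(query, instances, labels):
    """Label of the stored instance with the highest weighted feature overlap."""
    weights = [information_gain(instances, labels, i) for i in range(len(query))]

    def similarity(x):
        # Only features whose values match the query contribute their weight.
        return sum(w for w, a, b in zip(weights, x, query) if a == b)

    best = max(range(len(instances)), key=lambda j: similarity(instances[j]))
    return labels[best]

# Toy (verb, noun1, preposition, noun2) instances; "V" = verb attachment, "N" = noun attachment.
instances = [
    ("eat", "pizza", "with", "fork"),
    ("eat", "pizza", "with", "cheese"),
    ("see", "man", "with", "telescope"),
    ("buy", "book", "of", "poems"),
]
labels = ["V", "N", "V", "N"]
print(classify(("eat", "salad", "with", "fork"), instances, labels))
```

A query that mismatches on a low-weight feature still finds a close neighbour through its matching high-weight features, which is the sense in which the weights order more general and more specific conditioning information, much as a back-off sequence does.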
Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies
An automatic word classification system has been designed which processes
word unigram and bigram frequency statistics extracted from a corpus of natural
language utterances. The system implements a binary top-down form of word
clustering which employs an average class mutual information metric. Resulting
classifications are hierarchical, allowing variable class granularity. Words
are represented as structural tags --- unique -bit numbers the most
significant bit-patterns of which incorporate class information. Access to a
structural tag immediately provides access to all classification levels for the
corresponding word. The classification system has successfully revealed some of
the structure of English, from the phonemic to the semantic level. The system
has been compared --- directly and indirectly --- with other recent word
classification systems. Class based interpolated language models have been
constructed to exploit the extra information supplied by the classifications
and some experiments have shown that the new models improve model performance.
Comment: 17 Page Paper. Self-extracting PostScript Fil
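The structural-tag idea, where the most significant bits of a word's tag encode its class at each level of the hierarchy, can be illustrated with a short sketch. The tag width is not recoverable from the text above, so `width` is a free parameter here, and the function names and example tags are assumptions.

```python
def class_at_level(tag, level, width):
    """Class id at a given depth: the `level` most significant bits of the tag."""
    return tag >> (width - level)

def same_class(tag_a, tag_b, level, width):
    """True if two words fall into the same class at the given depth."""
    return class_at_level(tag_a, level, width) == class_at_level(tag_b, level, width)

# Two hypothetical 8-bit tags that share their first four bits.
WIDTH = 8
tag_a = 0b10110010
tag_b = 0b10111100

print(same_class(tag_a, tag_b, 4, WIDTH))  # coarse level: shared prefix 1011
print(same_class(tag_a, tag_b, 6, WIDTH))  # finer level: prefixes 101100 and 101111
```

Because every level is a prefix of the same integer, a single tag lookup yields the word's class at any granularity with one shift, which matches the claim that access to a tag gives access to all classification levels.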
Various robust search methods in a Hungarian speech recognition system
This work focuses on the search aspect of speech recognition. We describe some standard algorithms such as stack decoding, multi-stack decoding, the Viterbi beam search and an A* heuristic, then present improvements on these search methods. Finally, we compare the algorithms and rank them by performance. We show that our improvements can outperform the standard methods.
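Of the standard algorithms listed, the Viterbi beam search is the simplest to sketch. The version below runs over a generic toy hidden Markov model, not the recognizer described above, and it prunes by rank (keeping the `beam_width` best hypotheses per step), which is only one common pruning rule. All names and the toy probabilities are assumptions, and every probability is assumed to be nonzero.

```python
import math

def prune(hyps, beam_width):
    """Keep only the beam_width highest-scoring hypotheses."""
    kept = sorted(hyps.items(), key=lambda kv: kv[1][0], reverse=True)[:beam_width]
    return dict(kept)

def viterbi_beam(observations, states, start_p, trans_p, emit_p, beam_width):
    """Return (log score, state path) of the best hypothesis surviving the beam."""
    # One hypothesis per state: state -> (log score, path ending in that state).
    beam = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    beam = prune(beam, beam_width)
    for obs in observations[1:]:
        candidates = {}
        for prev, (score, path) in beam.items():
            for s in states:
                new_score = score + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
                # Viterbi recombination: keep only the best path into each state.
                if s not in candidates or new_score > candidates[s][0]:
                    candidates[s] = (new_score, path + [s])
        beam = prune(candidates, beam_width)
    return max(beam.values(), key=lambda h: h[0])

# Toy two-state model used purely for illustration.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

score, path = viterbi_beam(["normal", "cold", "dizzy"], states,
                           start_p, trans_p, emit_p, beam_width=2)
print(path, math.exp(score))
```

With `beam_width` equal to the number of states this reduces to exact Viterbi decoding; narrower beams trade the optimality guarantee for speed, which is the trade-off that motivates comparing search strategies.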