127 research outputs found

    Accurate Local Estimation of Geo-Coordinates for Social Media Posts

    Get PDF
    Associating geo-coordinates with the content of social media posts can enhance many existing applications and services and enable a host of new ones. Unfortunately, a majority of social media posts are not tagged with geo-coordinates. Even when location data is available, it may be inaccurate, very broad or sometimes fictitious. Contemporary location estimation approaches based on analyzing the content of these posts can identify only broad areas such as a city, which limits their usefulness. To address these shortcomings, this paper proposes a methodology to narrowly estimate the geo-coordinates of social media posts with high accuracy. The methodology relies solely on the content of these posts and prior knowledge of the wide geographical region from where the posts originate. An ensemble of language models, which are smoothed over non-overlapping sub-regions of a wider region, lie at the heart of the methodology. Experimental evaluation using a corpus of over half a million tweets from New York City shows that the approach, on an average, estimates locations of tweets to within just 2.15km of their actual positions.Comment: In Proceedings of the 26th International Conference on Software Engineering and Knowledge Engineering, pp. 642 - 647, 201

    Memory-Based Learning: Using Similarity for Smoothing

    Full text link
    This paper analyses the relation between the use of similarity in Memory-Based Learning and the notion of backed-off smoothing in statistical language modeling. We show that the two approaches are closely related, and we argue that feature weighting methods in the Memory-Based paradigm can offer the advantage of automatically specifying a suitable domain-specific hierarchy between most specific and most general conditioning information without the need for a large number of parameters. We report two applications of this approach: PP-attachment and POS-tagging. Our method achieves state-of-the-art performance in both domains, and allows the easy integration of diverse information sources, such as rich lexical representations.Comment: 8 pages, uses aclap.sty, To appear in Proc. ACL/EACL 9

    Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies

    Full text link
    An automatic word classification system has been designed which processes word unigram and bigram frequency statistics extracted from a corpus of natural language utterances. The system implements a binary top-down form of word clustering which employs an average class mutual information metric. Resulting classifications are hierarchical, allowing variable class granularity. Words are represented as structural tags --- unique nn-bit numbers the most significant bit-patterns of which incorporate class information. Access to a structural tag immediately provides access to all classification levels for the corresponding word. The classification system has successfully revealed some of the structure of English, from the phonemic to the semantic level. The system has been compared --- directly and indirectly --- with other recent word classification systems. Class based interpolated language models have been constructed to exploit the extra information supplied by the classifications and some experiments have shown that the new models improve model performance.Comment: 17 Page Paper. Self-extracting PostScript Fil

    Various robust search methods in a Hungarian speech recognition system

    Get PDF
    This work focuses on the search aspect of speech recognition. We describe some standard algorithms such as stack decoding, multi-stack decoding, the Viterbi beam search and an A* heuristic, then present improvements on these search methods. Finally we compare the performance of each algorithm, grading them according to their performance. We will show that our improvements can outperform the standard methods
    corecore