3 research outputs found

    Grounding language models in spatiotemporal context

    Get PDF
    Natural language is rich and varied, but also highly structured. The rules of grammar are a primary source of linguistic regularity, but there are many other factors that govern patterns of language use. Language models attempt to capture linguistic regularities, typically by modeling the statistics of word use, thereby folding in some aspects of grammar and style. Spoken language is an important and interesting subset of natural language that is temporally and spatially grounded. While time and space may directly contribute to a speaker’s choice of words, they may also serve as indicators for communicative intent or other contextual and situational factors. To investigate the value of spatial and temporal information, we build a series of language models using a large, naturalistic corpus of spatially and temporally coded speech collected from a home environment. We incorporate this extralinguistic information by building spatiotemporal word classifiers that are mixed with traditional unigram and bigram models. Our evaluation shows that both perplexity and word error rate can be significantly improved by incorporating this information in a simple framework. The underlying principles of this work could be applied in a wide range of scenarios in which temporal or spatial information is available

    Digital Stylometry: Linking Profiles Across Social Networks

    Get PDF
    There is an ever growing number of users with accounts on multiple social media and networking sites. Consequently, there is increasing interest in matching user accounts and profiles across different social networks in order to create aggregate profiles of users. In this paper, we present models for Digital Stylometry, which is a method for matching users through stylometry inspired techniques. We experimented with linguistic, temporal, and combined temporal-linguistic models for matching user accounts, using standard and novel techniques. Using publicly available data, our best model, a combined temporal-linguistic one, was able to correctly match the accounts of 31% of 5,612 distinct users across Twitter and Facebook.Comment: SocInfo'15, Beijing, China. In proceedings of the 7th International Conference on Social Informatics (SocInfo 2015). Beijing, Chin
    corecore