1,018 research outputs found

    Computational Sociolinguistics: A Survey

    Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201

    The corporate blog as an emerging genre of computer-mediated communication: features, constraints, discourse situation

    Digital technology is increasingly impacting how we keep informed, how we communicate professionally and privately, and how we initiate and maintain relationships with others. The function and meaning of new forms of computer-mediated communication (CMC) are not always clear to users at the outset and must be negotiated by communities, institutions and individuals alike. Are chatrooms and virtual environments suitable for business communication? Is email increasingly a channel for work-related, formal communication and thus "for old people", as especially young Internet users flock to Social Networking Sites (SNSs)? Cornelius Puschmann examines the linguistic and rhetorical properties of the weblog, another relatively young genre of CMC, to determine its function in private and professional (business) communication. He approaches the question of what functions blogs realize for authors and readers and argues that corporate blogs, which, like blogs by private individuals, are highly diverse in terms of their form, function and intended audience, essentially mimic key characteristics of private blogs in order to appear open, non-persuasive and personal, all essential qualities for companies that wish to make a positive impression on their constituents.

    Evaluation and Sociolinguistic Analysis of Text Features for Gender and Age Identification

    The paper presents an interdisciplinary study of automatic gender and age identification, informed by sociolinguistic knowledge about the gender- and age-related linguistic choices that social media users make. The authors gathered standard and novel text features used in text mining approaches to author demographics and profiling, and examined their efficacy in gender and age detection tasks on a corpus consisting of social media texts. The most informative features are analyzed according to the nature of each feature, and the information derived from the features' importance scores is discussed.

    Linguistic Variation and Identity Representation in Personal Blogs: A Corpus-Linguistic Approach

    Ph.D., Doctor of Philosophy

    Automating Author Gender Identification from Blogs

    The rapid growth of public blogging on the Internet has opened up a vast trove of information that can be text mined for potential insights. This study explores the potential of automating the identification of blog authors' gender based on differences in lexical expressions. The results of this study were mixed, and further refinement is needed. Master of Science in Information Science

    In search for totemic foods: Exploring discursive foodscapes online in Finnish, English and French

    This interdisciplinary research investigates how chilli and chocolate emerge as totemic foods in online foodie discourse. The corpus is compiled from Social Networking Services (blogs, community websites, recipe sharing sites, and conversation fora) in Finnish, English and French. The theoretical framework is construed with post-Bourdieusian taste and distinction studies on discourse, complemented by a feminist positioning. A netnographically inspired inquiry in an observer’s position enhances the methodology of critical discourse studies. The study introduces a theoretical concept: discursive foodscapes, contributing on two dimensions to extant theorising. It focuses the observation on multivocal online communities and extends foodscape analysis towards non-concrete consumption, on a discursive level. Moreover, the study suggests new practices for taste engineering, relevant in online consumption contexts. Three research questions draw on chilli and chocolate as totemic substances, interpreted in a framework of contemporary tribalism within the paradigmatic viewpoint of Consumer Culture Theory: emergence of chocolate and chilli as totemic foods; taste and distinction performance; and representations of gender and power. They are studied separately, although perceiving the triad as entwined. The discursive foodscape related to each research question reflects findings: it is described with the combination of discursive themes, frames and strategies identified in the empirical analysis. Findings reveal a more diversified vista on chocolate and chilli as discursive foci than extant research mostly claims: they are ascribed with a variety of totemic significations, shifting contextually from highly indulgent to environmentally concerned. Knowledge-intensive foodie discourse emerges as relatively gender-neutral. However, across embodied, experiential elements in consumption the discourse becomes more gender-flagged, and contextual changes are highly significant. 
This variation generates discursively interesting constellations where stylistic categories reflect areas of culinary and discursive competence. Cross-linguistic variation is detected for all research questions, making this a pioneering endeavor in the discourse analysis of online foodie sites across three languages.

    Who’s Blogging Now? Linguistic Features and Authorship Analysis in Sports Blogs

    The field of authorship determination, previously largely falling under the umbrella of literary analysis but recently becoming a large subfield of forensic linguistics, has grown substantially over the last two decades. As its body of research and its record of successful forensic application continue to grow, this growth is paralleled by the demand for its application. However, methods which have undergone rigorous testing to show their reliability and replicability, allowing them to meet the strict Daubert criteria put forth by the US court system, have not truly been established. In this study, I set out to investigate how a list of parameters, many commonly used in the methodologies of previous researchers, would perform when used to test documents of bloggers from a sports blog, Winging It in Motown. Three prolific bloggers were chosen from the site, and a corpus of posts was created for each blogger which was then examined for each of the chosen parameters. One test document for each of the three bloggers which was not included in that blogger’s corpus was then chosen from the blog page, and these documents were examined for each of the parameters via the same methodologies as were used to examine the corpora. Once data for the corpora and all three test documents was obtained, the results were compared for similarity, and an author determination was made for each test document along each parameter. The findings indicated that overall the parameters were quite unsuccessful in determining authorship for these test documents based on the author corpora developed for the study. Only two parameters successfully identified the authors of the test documents at a rate higher than chance, and the possibility exists that other factors may be driving these successful identifications, demanding further research to confirm their validity as parameters for the purpose of authorship work. Dissertation/Thesis. Doctoral Dissertation, English, 201

    Personal information prediction from written texts

    Authorship Attribution (AA) is a field of research that has existed since the 1960s. It consists of predicting the author of a text based on other texts whose authors are known. This is done by extracting features describing the writing style and the content of the text. In this master's thesis, two sub-problems of AA were treated: gender and age classification, using a corpus collected from online blogs. Several textual features were compared using machine learning methods, and deep learning methods were also applied. For the gender classification task, the best results were obtained by a majority vote system over the predictions of several other models. For the age classification task, the best results were obtained using a classifier trained on TF-IDF.
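
    The majority vote mentioned in this abstract can be illustrated with a minimal sketch. The thesis does not say which models were combined or how ties were handled, so everything below is an assumption made for illustration: the function names `majority_vote` and `vote_per_document`, the toy labels, and the rule that a tie goes to the label that appears first.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among one document's predictions.

    Assumption: on a tie, the label that appears first in `predictions` wins.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


def vote_per_document(per_model_predictions):
    """Combine several models' outputs, one list of labels per model.

    All lists are assumed to be aligned so that index i refers to document i.
    """
    return [majority_vote(list(doc_preds))
            for doc_preds in zip(*per_model_predictions)]


# Hypothetical outputs of three gender classifiers on three blog posts.
model_outputs = [
    ["female", "male", "male"],
    ["female", "female", "male"],
    ["male", "female", "male"],
]
combined = vote_per_document(model_outputs)
print(combined)
```

    With an odd number of binary classifiers, as in this toy input, ties cannot occur, so the tie rule only matters for even-sized ensembles or tasks with more than two classes, such as age brackets.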