39 research outputs found

    How character limit affects language usage in tweets

    In November 2017 Twitter doubled the available character space from 140 to 280 characters. This provided an opportunity for researchers to investigate the linguistic effects of length constraints in online communication. We asked whether the character limit change (CLC) affected language usage in Dutch tweets and hypothesized that there would be a reduction in the need for character-conserving writing styles. Pre-CLC tweets were compared with post-CLC tweets. Three separate analyses were performed: (I) general analysis: the number of characters, words, and sentences per tweet, as well as the average word and sentence length; (II) token analysis: the relative frequency of tokens and bigrams; and (III) part-of-speech analysis: the grammatical structure of the sentences in tweets (i.e., adjectives, adverbs, articles, conjunctions, interjections, nouns, prepositions, pronouns, and verbs). Pre-CLC tweets showed relatively more textisms, which are used to abbreviate and conserve character space. Consequently, they represent more informal language usage (e.g., internet slang). In turn, post-CLC tweets contained relatively more articles, conjunctions, and prepositions. The results show that online language producers adapt their texts to overcome limit constraints.
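    The general and token analyses described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `tweet_stats` and `relative_frequencies`, the regex-based tokenizer, and the punctuation-based sentence splitter are all assumptions made for illustration, and a real study of Dutch tweets would likely use a dedicated tokenizer.

```python
import re
from collections import Counter


def tweet_stats(tweet: str) -> dict:
    """Per-tweet counts in the spirit of the 'general analysis' (hypothetical helper)."""
    # Assumption: a word is any run of word characters; sentences end at . ! or ?
    words = re.findall(r"\w+", tweet)
    sentences = [s for s in re.split(r"[.!?]+", tweet) if s.strip()]
    return {
        "chars": len(tweet),
        "words": len(words),
        "sentences": len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
    }


def relative_frequencies(tweets, n=1):
    """Relative frequency of n-grams (n=1 tokens, n=2 bigrams) over a set of tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = re.findall(r"\w+", tweet.lower())
        # Slide a window of size n over the tokens of a single tweet,
        # so n-grams never span two tweets.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}
```

    Running both functions separately on a pre-CLC and a post-CLC sample and comparing the resulting numbers would mirror the comparison the abstract describes; average sentence length is measured here in words per sentence, which is one of several reasonable choices.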

    Towards Transparent Linguistic Analysis of Dutch Newspaper Article Genres using Machine Learning

    Systematic study of genre in newspapers sheds light on the development of journalism discourse. The genre conventions that can be discerned in a newspaper text signal the underlying discursive norms and practices of journalism as a profession. Historical newspapers are increasingly becoming available thanks to digital newspaper archives (in the Netherlands available through Delpher.nl), providing the opportunity for large-scale empirical research. However, the digital archives do not contain fine-grained genre information that is required for this purpose. Therefore, we use machine learning to automatically assign genre labels to newspaper articles. Machine learning facilitates substantial improvements to the outcomes of existing research by providing increased amounts of enriched data. However, the decision-making process of the machine learning pipeline needs to be verified. Our previous findings (Bilgin et al., 2018) show that accuracy scores alone are not enough to assess the performance of these pipelines and that making an informed choice not only empowers optimal study of the historical development of genre, but also increases the trustworthiness of the results. This work shows that employing a transparent approach driven by model interpretability facilitates fair comparison as well as validation of the underlying decision-making criteria of the machine learning pipelines. The criteria are presented in the form of important features, creating insights on interactions between genre-related linguistic features and bag-of-words features.

    Cornetto: A Combinatorial Lexical Semantic Database for Dutch

    One of the goals of the STEVIN programme is the realisation of a digital infrastructure that will reinforce the position of the Dutch language in modern information and communication technology. A semantic database makes it possible to go from words to concepts and, consequently, to develop technologies that access and use knowledge rather than textual representations.

    A meta-analysis of state-of-the-art electoral prediction from Twitter data

    Electoral prediction from Twitter data is an appealing research topic. It seems relatively straightforward and the prevailing view is overly optimistic. This is problematic because while simple approaches are assumed to be good enough, core problems are not addressed. Thus, this paper aims to (1) provide a balanced and critical review of the state of the art; (2) cast light on the presumed predictive power of Twitter data; and (3) depict a roadmap to push forward the field. Hence, a scheme to characterize Twitter prediction methods is proposed. It covers every aspect from data collection to performance evaluation, through data processing and vote inference. Using that scheme, prior research is analyzed and organized to explain the main approaches taken to date but also their weaknesses. This is the first meta-analysis of the whole body of research regarding electoral prediction from Twitter data. It reveals that its presumed predictive power regarding electoral prediction has been rather exaggerated: although social media may provide a glimpse of electoral outcomes, current research does not provide strong evidence to support that it can replace traditional polls. Finally, future lines of research along with a set of requirements they must fulfill are provided. Comment: 19 pages, 3 tables

    Consolidating Heterogeneous Enterprise Data for Named Entity Linking and Web Intelligence

    Linking named entities to structured knowledge sources paves the way for state-of-the-art Web intelligence applications which assign sentiment to the correct entities, identify trends, and reveal relations between organizations, persons and products. For this purpose, this paper introduces Recognyze, a named entity linking component that uses background knowledge obtained from linked data repositories, and outlines the process of transforming heterogeneous data silos within an organization into a linked enterprise data repository which draws upon popular linked open data vocabularies to foster interoperability with public data sets. The presented examples use comprehensive real-world data sets from Orell Füssli Business Information, Switzerland's largest business information provider. The linked data repository created from these data sets comprises more than nine million triples on companies, the companies' contact information, key people, products and brands. We identify the major challenges of tapping into such sources for named entity linking, and describe the data pre-processing techniques required to use and integrate such data sets, with a special focus on disambiguation and ranking algorithms. Finally, we conduct a comprehensive evaluation based on business news from the New Journal of Zurich and AWP Financial News to illustrate how these techniques improve the performance of the Recognyze named entity linking component.

    Visualizing Literary Data

    We look at different aspects of Dutch magazines, both from the fields of literary studies and linguistic studies. We explore the background of authors with respect to birth locations, ages and gender, and also how language use in the magazines evolved over a period of several decades. We have created several interactive visualizations which enable researchers to browse and analyze text data and their metadata. The design of these visualizations was nontrivial, raising questions about how to deal with missing data and documents with multiple authors. The data required for some visualizations that would be useful to researchers could not be generated by the software architecture within a reasonable time span. In a case study, we look at some of the research questions that can be answered by the data visualizations and suggest another data view that could be interesting for literary research. Interesting topics for future research rely heavily on improvements to the search architecture used and on adding extra annotation layers to our text corpora.