
    Towards Interactive Multidimensional Visualisations for Corpus Linguistics

    We propose the novel application of dynamic and interactive visualisation techniques to support the iterative and exploratory investigations typical of the corpus linguistics methodology. Very large scale text analysis is already carried out in corpus-based language analysis using methods such as frequency profiling, keywords, concordancing, collocations and n-grams. At present, however, only basic visualisation methods are used. In this paper, we describe case studies of multiple types of keyword clouds and explorer tools for collocation networks, and we compare network and language distance visualisations for online social networks. These are shown to fit better with the iterative, data-driven corpus methodology and to permit some level of scalability to cope with ever-increasing corpus size and complexity. In addition, they will allow corpus linguistic methods to be used more widely in the digital humanities and social sciences, since the learning curve with visualisations is shallower for non-expert
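
As a rough illustration of three of the methods named in the abstract above (frequency profiling, n-grams and concordancing), the following minimal Python sketch may help. It is not taken from the paper: the tokenizer, the function names (`tokenize`, `frequency_profile`, `ngrams`, `concordance`) and the sample sentence are all assumptions made for this example, and real corpus tools handle tokenisation and scale far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed, naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_profile(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def ngrams(tokens, n=2):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, node, width=3):
    """Return (left context, node, right context) for each occurrence of the node word."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical two-sentence "corpus", used only for illustration.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

profile = frequency_profile(tokens)         # word -> frequency
bigram_counts = Counter(ngrams(tokens, 2))  # bigram -> frequency
kwic = concordance(tokens, "sat", width=2)  # keyword-in-context lines

print(profile.most_common(3))
print(bigram_counts.most_common(2))
for left, node, right in kwic:
    print(f"{left:>10} [{node}] {right}")
```

Counts and context lines of this kind are the raw material that keyword clouds and collocation network views would then display.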


    Each speaker has particular linguistic characteristics conditioned by the circumstances of contact. In forensic linguistics, applied linguistics, and corpus linguistics, a regularity is considered recognizable as a specific modus operandi, which usually corresponds to communities that share the same purpose, with equally specific subjects, and that use particular linguistic strategies. This article develops an approach to a linguistic profile of potential child sex offenders, with the aim of identifying a constant linguistic behavior that can later be recognized in online conversations. The methodology is empirical and based on the natural language analysis proposed in Rayson's corpus linguistics. It is used to analyze linguistic behavior in online textual communication drawn from a preprocessed corpus of five conversations, in which lexical frequency was established to generate seventeen thematic modules. Finally, regularity was found in the first seven interventions, which points to a specific linguistic behavior: a linguistic profile that could lead to the timely recognition of online identities that represent a danger to children.

    Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics

    A growing number of studies report interesting insights gained from existing data resources. Among those are analyses of textual data, which gives reason to consider such methods for linguistics as well. However, the field of corpus linguistics usually works with purposefully collected, representative language samples that aim to answer only a limited set of research questions. This thesis aims to shed some light on the potential of data-driven analysis based on machine learning and predictive modelling for corpus linguistic studies, investigating the possibility of repurposing existing German language corpora for linguistic inquiry by using methodologies developed for data science and computational linguistics. The study focuses on predictive modelling and machine-learning-based data mining and gives a detailed overview and evaluation of currently popular strategies and methods for analysing corpora with computational methods. The thesis first introduces strategies and methods that have already been used on language data, discusses how they can assist corpus linguistic analysis, and refers to available toolkits and software as well as to state-of-the-art research and further references. The introduced methodological toolset is then applied in two differently shaped corpus studies that use readily available corpora for German. The first study explores linguistic correlates of holistic text quality ratings on student essays, while the second deals with age-related language features in computer-mediated communication and interprets age prediction models to answer a set of research questions based on previous research in the field. While both studies give linguistic insights that integrate into the current understanding of the investigated phenomena in the German language, they also systematically test the methodological toolset introduced beforehand, allowing a detailed discussion, at the end of the thesis, of the added value and remaining challenges of machine-learning-based data mining methods in corpus linguistics