27 research outputs found

    Cultural-digital research on literary prominence using Google Books: a pilot study

    The availability of databases of digitised literary materials, such as Google Books, Europeana and historical newspaper databases, has revolutionised many disciplines, e.g., linguistics and history. So far, however, digitised materials have seen relatively little use in the history of books and the history of reading. This article presents tools, methodologies and practices that open up new possibilities for studying book history and the history of reading. These tools make it possible to analyse vast amounts of data quickly and effectively, to present results in helpful visualisations, to keep the line of reasoning easy to follow and, where necessary, to verify the reliability of the research by making the underlying data available for inspection. The examples presented are drawn from the Google Books database using a simple piece of software built on the freely available API of the Google Books Ngram Viewer.
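
    As a hedged illustration of such a piece of software, the sketch below queries the undocumented JSON endpoint that the Ngram Viewer web page itself calls; the endpoint URL, the parameter names and the corpus identifier en-2019 mirror the page's own requests and may change without notice.

        # Minimal sketch: fetch per-year relative frequencies from the
        # (undocumented) JSON endpoint behind the Google Books Ngram Viewer.
        import requests

        NGRAM_URL = "https://books.google.com/ngrams/json"

        def ngram_frequencies(phrases, year_start=1800, year_end=2000,
                              corpus="en-2019", smoothing=3):
            """Return {phrase: per-year relative frequencies}."""
            params = {
                "content": ",".join(phrases),  # comma-separated query terms
                "year_start": year_start,
                "year_end": year_end,
                "corpus": corpus,
                "smoothing": smoothing,
            }
            response = requests.get(NGRAM_URL, params=params, timeout=30)
            response.raise_for_status()
            # Each result carries the ngram and a per-year time series
            # of relative frequencies.
            return {item["ngram"]: item["timeseries"]
                    for item in response.json()}

        if __name__ == "__main__":
            for ngram, series in ngram_frequencies(["Frankenstein", "Dracula"]).items():
                print(ngram, series[:5])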

    A fully data-driven method to identify (correlated) changes in diachronic corpora

    In this paper, a method for measuring synchronic corpus (dis-)similarity put forward by Kilgarriff (2001) is adapted and extended to identify trends and correlated changes in diachronic text data, using the Corpus of Historical American English (Davies 2010a) and the Google Ngram Corpora (Michel et al. 2010a). This paper shows that this fully data-driven method, which extracts word types that have undergone the most pronounced change in frequency in a given period of time, is computationally very cheap and that it allows interpretations of diachronic trends that are both intuitively plausible and motivated from the perspective of information theory. Furthermore, it demonstrates that the method is able to identify correlated linguistic changes and diachronic shifts that can be linked to historical events. Finally, it can help to improve diachronic POS tagging and complement existing NLP approaches. This indicates that the approach can facilitate an improved understanding of diachronic processes in language change.
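
    As a hedged illustration of the underlying idea, and not the authors' implementation, the sketch below ranks word types by their chi-squared contribution when two diachronic corpus slices are compared, in the spirit of Kilgarriff's (2001) measure; the two toy frequency lists are invented.

        # Minimal sketch: rank word types by their chi-squared contribution
        # between two diachronic corpus slices.
        from collections import Counter

        def chi_squared_by_word(freq_a: Counter, freq_b: Counter):
            """Per-word chi-squared contributions between two slices."""
            total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
            scores = {}
            for word in set(freq_a) | set(freq_b):
                o_a, o_b = freq_a[word], freq_b[word]
                combined = o_a + o_b
                # Expected counts if the word were equally likely in both slices.
                e_a = combined * total_a / (total_a + total_b)
                e_b = combined * total_b / (total_a + total_b)
                scores[word] = (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
            return scores

        # Toy data: the words whose frequency shifted most rank first.
        slice_1900s = Counter("the motor car was a novelty".split())
        slice_2000s = Counter("the web browser was a novelty".split())
        ranked = sorted(chi_squared_by_word(slice_1900s, slice_2000s).items(),
                        key=lambda kv: kv[1], reverse=True)
        print(ranked[:5])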

    HathiTrust as a Data Source for Researching Early Nineteenth-Century Library Collections

    An intriguing new opportunity for research into the nineteenth-century history of print culture, libraries, and local communities is performing full-text analyses on the corpus of books held by a specific library or group of libraries. Creating corpora from books that are known to have been owned by a given library at a given point in time is potentially feasible because digitized records of the books in several hundred nineteenth-century library collections are available in the form of scanned book catalogs: books or pamphlets listing all of the books available in a particular library. However, there are two potential problems with using those book catalogs to create corpora. First, it is not clear whether most or all of the books that were in these collections have been digitized. Second, the prospect of identifying the digital representations of the books listed in the catalogs is daunting, given the diversity of cataloging practices at the time. This article reports on progress towards developing an automated method to match entries in early nineteenth-century book catalogs with digitized versions of those books, and also provides estimates of the fractions of the library holdings that have been digitized and made available in the Google Books/HathiTrust corpus.
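
    The matching step lends itself to a small illustration. The sketch below is a hedged stand-in for the article's method: it pairs a catalog entry with the most similar digitized-volume title by fuzzy string similarity, using only the Python standard library; the titles and the 0.75 threshold are invented for illustration.

        # Minimal sketch: match a catalog entry to the closest digitized title.
        from difflib import SequenceMatcher

        def _norm(s: str) -> str:
            """Lowercase and collapse whitespace before comparison."""
            return " ".join(s.lower().split())

        def best_match(catalog_entry: str, digitized_titles: list[str],
                       threshold: float = 0.75):
            """Return (title, score) of the closest title, or None."""
            entry = _norm(catalog_entry)
            scored = [(t, SequenceMatcher(None, entry, _norm(t)).ratio())
                      for t in digitized_titles]
            title, score = max(scored, key=lambda pair: pair[1])
            return (title, score) if score >= threshold else None

        # Hypothetical 1820s-style catalog entry vs. modern records.
        print(best_match("Scott's Waverley, or 'Tis sixty years since",
                         ["Waverley; or, 'Tis sixty years since / by Sir Walter Scott",
                          "Ivanhoe: a romance / by the author of Waverley"]))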

    Challenges of combining structured and unstructured data in corpus development

    Special issue, Challenges of Combining Structured and Unstructured Data in Corpus Development, ed. by Tanja Säily & Jukka Tyrkkö. Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. However, combining unstructured text with images, video and audio as well as structured metadata poses a variety of challenges to corpus compilers. This paper presents an overview of the topic to contextualise this special issue of Research in Corpus Linguistics. The aim of the special issue is to highlight some of the challenges faced and solutions developed in several recent and ongoing corpus projects. Rather than providing overall descriptions of corpora, each contributor discusses specific challenges they faced in the corpus development process, summarised in this paper. We hope that the special issue will benefit future corpus projects by providing solutions to common problems and by paving the way for new best practices for the compilation and development of rich-data corpora. We also hope that this collection of articles will help keep the conversation going on the theoretical and methodological challenges of corpus compilation.
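
    As one hedged illustration of the data-modelling challenge, the sketch below binds unstructured running text to structured metadata and media references in a single corpus record; the field names and values are invented and follow no particular corpus standard.

        # Minimal sketch: one corpus record combining unstructured text,
        # structured metadata and references to media files.
        from dataclasses import dataclass, field

        @dataclass
        class CorpusDocument:
            doc_id: str
            text: str                 # unstructured running text
            metadata: dict[str, str]  # structured descriptors (date, genre...)
            media: list[str] = field(default_factory=list)  # image/audio URIs

        doc = CorpusDocument(
            doc_id="letter_0042",
            text="Dear Madam, I write to thank you...",
            metadata={"date": "1843-05-02", "genre": "private letter"},
            media=["scans/letter_0042_p1.tiff"],
        )
        print(doc.metadata["genre"], len(doc.text))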

    Corpus linguistics as digital scholarship : Big data, rich data and uncharted data

    This introductory chapter begins by considering how the fields of corpus linguistics, digital linguistics and digital humanities overlap, intertwine and feed off each other when it comes to making use of the increasing variety of resources available for linguistic research today. We then move on to discuss the benefits and challenges of three partly overlapping approaches to the use of digital data sources: (1) increasing data size to create “big data”, (2) supplying multi-faceted co(n)textual information and analyses to produce “rich data”, and (3) adapting existing data sets to new uses by drawing on hitherto “uncharted data”. All of them also call for new digital tools and methodologies that, in Tim Hitchcock’s words, “allow us to think small; at the same time as we are generating tools to imagine big.” We conclude the chapter by briefly describing how the contributions in this volume make use of their various data sources to answer new research questions about language use and to revisit old questions in new ways.

    Investigating the extent to which words or phrases with specific attributes can be retrieved from digital text collections

    INTRODUCTION: Digital text collections are increasingly being used, and various tools have been developed to allow researchers to explore them. Enhanced retrieval will be possible if texts are encoded with granular metadata. METHOD: A selection of tools used to explore digital text collections was evaluated to determine to what extent they allow for the retrieval of words or phrases with specific attributes. ANALYSIS: Tools were evaluated according to the metadata available in the data, the search options in the tool, how the results are displayed, and the expertise required to use the tool. RESULTS: Many tools with powerful functions have been developed. However, there are limitations: it is not possible to search according to semantics or in-text bibliographic metadata. Analysis of the tools revealed that there are limited options for combining multiple levels of metadata, and that, without some programming expertise or knowledge of the structure and encoding of the data, researchers typically cannot retrieve words or phrases with specific attributes from digital text collections. CONCLUSION: Granular metadata should be identified, and tools that can utilise these metadata to enable the retrieval of words or phrases with specific attributes in an intuitive manner should be developed.
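
    To illustrate what attribute-based retrieval over granular metadata could look like, the sketch below filters a toy token table by arbitrary annotation attributes; the attribute names (lemma, pos, author_gender) are hypothetical examples, not features of any tool evaluated in the article.

        # Minimal sketch: retrieve tokens whose annotations match every
        # requested attribute, over a toy annotated-token table.
        tokens = [
            {"word": "letters", "lemma": "letter", "pos": "NOUN", "author_gender": "f"},
            {"word": "wrote",   "lemma": "write",  "pos": "VERB", "author_gender": "f"},
            {"word": "letter",  "lemma": "letter", "pos": "NOUN", "author_gender": "m"},
        ]

        def retrieve(tokens, **attributes):
            """Return tokens matching all requested attribute values."""
            return [t for t in tokens
                    if all(t.get(k) == v for k, v in attributes.items())]

        # e.g. all nouns with lemma "letter" in texts by female authors
        print(retrieve(tokens, lemma="letter", pos="NOUN", author_gender="f"))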

    The infosphere and the GDELT Project

    The dizzying expansion of the mass media and information technologies in recent decades has given shape to new cultural configurations. One of the most current expressions of these processes is Big Data, a phenomenon that has transformed many of the economic, social and political processes with which it has interacted. In the social sciences, new objects of study such as social networks and the Web have emerged, along with new data-analysis techniques. Against this background, the present work relates the concept of the infosphere, understood by Franco Berardi as the circular space through which signals carrying cultural intention travel, to the development of the sub-field of the social sciences known as Culturomics 2.0. To make some of the conceptual tensions visible, it carries out a case study of the GDELT Project, which monitors the messages circulating in digital mass media and social networks in order to build an open, real-time platform for investigating the events, dreams, fears and conflicts occurring around the world today.
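
    As a hedged illustration of working with GDELT's open data, the sketch below pulls the most recent 15-minute event export via the public lastupdate.txt manifest described in GDELT's documentation; the column positions printed (event ID first, source URL last) follow the GDELT 2.0 event format but are worth verifying against the current codebook.

        # Minimal sketch: download and peek at GDELT's latest event export.
        import csv, io, zipfile, urllib.request

        MANIFEST = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

        with urllib.request.urlopen(MANIFEST, timeout=30) as resp:
            # Each manifest line: <size> <md5> <url>; the first line is
            # the events export.
            events_url = resp.read().decode().splitlines()[0].split()[-1]

        with urllib.request.urlopen(events_url, timeout=60) as resp:
            archive = zipfile.ZipFile(io.BytesIO(resp.read()))
            name = archive.namelist()[0]
            rows = csv.reader(io.TextIOWrapper(archive.open(name), encoding="utf-8"),
                              delimiter="\t")
            for _, row in zip(range(5), rows):
                # Assumed layout: GLOBALEVENTID first, SOURCEURL last.
                print(row[0], row[-1])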