36 research outputs found

    The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme

    This paper describes the rationale and design of TXM, a text-mining analysis platform compatible with XML-TEI encoded corpora. The design of this platform is based on a synthesis of the best available algorithms in existing textometry software. It also relies on identifying the most relevant open-source technologies for processing textual resources encoded in XML and Unicode, for efficient full-text search on annotated corpora, and for statistical data analysis. The architecture is based on a Java toolbox that combines a full-text search engine component with a statistical computing environment and with an original import environment able to process a large variety of data sources, including XML-TEI, and to apply embedded NLP tools to them. The platform is distributed as an open-source Eclipse project for developers and in the form of two demonstrator applications for end users: a standard application to install on a workstation and an online web application framework.

    Tekstometrijske metode i TXM platforma za analizu i vizuelnu prezentaciju korpusa

    The textometric approach has long been applied as a useful method for corpus analysis in various fields of the humanities and social sciences. Textometry allows the non-linear quantitative and qualitative study of digital corpora, combining lexicometric and statistical research with developed corpus technologies. In this paper, the current version of the srpELTeC corpus is analyzed within the TXM software environment to illustrate the possibilities of the textometric approach and of the visual presentation of the obtained results.

    The TXM Portal Software giving access to Old French Manuscripts Online

    Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf This paper presents the new TXM software platform, which gives online access to images and tagged transcriptions of Old French manuscripts for concordancing and text mining. The platform is able to import medieval sources encoded in XML according to the TEI Guidelines, to link manuscript images to transcriptions, and to encode several diplomatic levels of transcription, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are produced automatically during the import process. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word-property indexes are produced for the CQP search engine to allow efficient word-pattern searches for building different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the Tiger Search engine to allow efficient building of syntactic concordances. The platform has also been tested on classical Latin, ancient Greek, Old Slavonic and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotations).

    TXM : Une plateforme logicielle open-source pour la textométrie - conception et développement

    The research project "Federation of Research and Development in Textometry around the Creation of an Open-Source Software Platform" distributes its platform for the textometric analysis of XML-TEI encoded corpora online. The design of this platform is based on a synthesis of the features of existing textometric software. It relies on a survey of the open-source software technologies that are available and effective for processing digital resources encoded in XML and Unicode, and on a state-of-the-art review of open-source full-text search engines for structured and annotated corpora. The architecture consists of a Java toolkit combining a search engine component (IMS CWB), a statistical computing environment (R) and a module for importing XML-TEI encoded corpora. The platform is distributed as an open-source toolkit for developers and in the form of two applications for end users of textometry: a local application to install on a workstation (Windows or Linux) and an online web application. Still early in its development, the platform currently implements only a few essential features, but its open-source distribution already allows open community development. This should facilitate its evolution and the integration of new models and methods.

    Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives

    The aim of this contribution is to reflect on the process of building the multilingual European Literary Text Collection (ELTeC) that is being created in the framework of the networking project Distant Reading for European Literary History funded by COST (European Cooperation in Science and Technology). To provide some background, we briefly introduce the basic idea of ELTeC with a focus on the overall goals and intended usage scenarios. We then describe the collection composition principles that we have derived from the usage scenarios. In our discussion of the corpus-building process, we focus on collections of novels from four different literary traditions as components of ELTeC: French, Portuguese, Romanian, and Slovenian, selected from the more than twenty collections that are currently in preparation. For each collection, we describe some of the challenges we have encountered and the solutions developed while building ELTeC. In each case, the literary tradition, the history of the language, the current state of digitization of cultural heritage, the resources available locally, and the scholars’ training level with regard to digitization and corpus building have been vastly different. How can we, in this context, hope to build comparable collections of novels that can usefully be integrated into a multilingual resource such as ELTeC and used in Distant Reading research? Based on our individual and collective experience with contributing to ELTeC, we end this contribution with some lessons learned regarding collaborative, multilingual corpus building.

    The CLiGS Textbox: Building and Using Collections of Literary Texts in Romance Languages Encoded in TEI XML

    The CLiGS textbox is published by the Computational Literary Genre Stylistics (CLiGS) group. The textbox is the group’s publication channel for several collections of literary texts. We describe the rationale for the manner in which the collections of literary texts included in the textbox have been compiled, annotated, and published. Furthermore, we suggest several ways in which the text collections can be used for research in literary studies. We aim to document some of the work of the CLiGS group, to showcase the unique TEI XML-based collections of French, Spanish, Spanish-American, and Portuguese novels and French drama we make available, and to encourage reuse of these text collections by others. We argue that agreement on common formats and procedures for text preparation, encoding, and publication fosters the accessibility, analysis, and reuse potential of literary text collections.

    Materiality of TEI Encoding and Decoding: An Analysis of the Western European Union Archives on Armament Policy

    By combining traditional historical enquiry with TEI XML encoding and decoding in a corpus analysis phase, the project aims to address research questions mainly related to the French and British positions on armament design and production and on armament control within the Western European Union (WEU) from 1954 to 1982. The paper focuses on the annotation of speakers (different countries and institutional representatives) and their discourse in a selection of institutional documents such as minutes, notes, studies, and memoranda (the encoding phase), and on the identification of linguistic patterns on armament issues in their discourse, as well as the interpretation of results (the decoding phase). From a larger perspective, the study considers the TEI encoding as adding to the original text a “material” layer that further supports both machine and human interpretation (decoding). In this sense, this study may move closer to the concept of “material hermeneutics,” by understanding code, and digital technology in general, as an instrument we can use in hermeneutic ways to produce knowledge.

    Cross Disciplinary Overtures with Interview Data: Integrating Digital Practices and Tools in the Scholarly Workflow

    There is much talk about the need for multidisciplinary approaches to research and the opportunities that have been created by digital technologies. A good example of this is the CLARIN Portal, which promotes and supports such research by offering a large suite of tools for working with textual and audio-visual data. Yet scholars who work with interview material are largely unaware of this resource and are still predominantly oriented towards familiar traditional research methods. To reach out to these scholars and assess the potential for integrating these new technologies, a multidisciplinary international community of experts set out to test CLARIN-type approaches and tools on different scholars by eliciting and documenting their feedback. This was done through a series of workshops held from 2016 to 2019 and funded by CLARIN and affiliated EU funding. This paper presents the goals, the tools that were tested, and the evaluation of how they were experienced. It concludes by setting out envisioned pathways for a better use of the CLARIN family of approaches and tools in the area of qualitative and oral history data analysis.