36 research outputs found

A tool set for the quick and efficient exploration of large document collections

Author: Erjavec Tomaz
Ignat Camelia
Pouliquen Bruno
Steinberger Ralf
Publication date: 01/01/2005

We are presenting a set of multilingual text analysis tools that can help analysts in any field to explore large document collections quickly in order to determine whether the documents contain information of interest, and to find the relevant text passages. The automatic tool, which currently exists as a fully functional prototype, is expected to be particularly useful when users repeatedly have to sieve through large collections of documents such as those downloaded automatically from the internet. The proposed system takes a whole document collection as input. It first carries out some automatic analysis tasks (named entity recognition, geo-coding, clustering, term extraction), annotates the texts with the generated meta-information and stores the meta-information in a database. The system then generates a zoomable and hyperlinked geographic map enhanced with information on entities and terms found. When the system is used on a regular basis, it builds up a historical database that contains information on which names have been mentioned together with which other names or places, and users can query this database to retrieve information extracted in the past. Comment: 10 pages

arXiv.org e-Print Archive

CiteSeerX

Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives

Author: Erjavec Tomaz
Patras Roxana
Santos Diana
Schöch Christof
Publication date: 01/01/2021

The aim of this contribution is to reflect on the process of building the multilingual European Literary Text Collection (ELTeC) that is being created in the framework of the networking project Distant Reading for European Literary History funded by COST (European Cooperation in Science and Technology). To provide some background, we briefly introduce the basic idea of ELTeC with a focus on the overall goals and intended usage scenarios. We then describe the collection composition principles that we have derived from the usage scenarios. In our discussion of the corpus-building process, we focus on collections of novels from four different literary traditions as components of ELTeC: French, Portuguese, Romanian, and Slovenian, selected from the more than twenty collections that are currently in preparation. For each collection, we describe some of the challenges we have encountered and the solutions developed while building ELTeC. In each case, the literary tradition, the history of the language, the current state of digitization of cultural heritage, the resources available locally, and the scholars’ training level with regard to digitization and corpus building have been vastly different. How can we, in this context, hope to build comparable collections of novels that can usefully be integrated into a multilingual resource such as ELTeC and used in Distant Reading research? Based on our individual and collective experience with contributing to ELTeC, we end this contribution with some lessons learned regarding collaborative, multilingual corpus building.

Repositório Comum

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

NORA - Norwegian Open Research Archives

A Common XML-based Framework for Syntactic Annotations

Author: Erjavec Tomaz
Ide Nancy
Romary Laurent
Publication venue: HAL CCSD
Publication date: 30/11/2001

International conference with proceedings and peer-review committee. International audience. It is widely recognized that the proliferation of annotation schemes runs counter to the need to re-use language resources, and that standards for linguistic annotation are becoming increasingly mandatory. To answer this need, we have developed a framework comprised of an abstract model for a variety of different annotation types (e.g., morpho-syntactic tagging, syntactic annotation, co-reference annotation, etc.), which can be instantiated in different ways depending on the annotator's approach and goals. In this paper we provide an overview of the framework, demonstrate its applicability to syntactic annotation, and show how it can contribute to comparative evaluation of parser output and diverse syntactic annotation schemes.

INRIA a CCSD electronic archive server

HAL-Rennes 1

East meets West: Producing Multilingual Resources in a European Context

Author: Erjavec Tomaz
Lawson Ann
Romary Laurent
Publication venue: HAL CCSD
Publication date: 01/01/1998

International audience. The EU concerted action TELRI has released a two-volume CD-ROM, which contains multilingual language resources, namely corpora, lexica, and tools for language engineering. This CD-ROM provides harmonised resources for unprecedented numbers and kinds of languages, mainly from non-EU countries, for which such resources still tend to be scarce. The first volume of the CD includes the aligned text of Plato’s Republic in twenty-one languages, while the second volume contains extended results of the EU MULTEXT-East project, including the aligned and tagged novel ’1984’ by George Orwell and accompanying lexica in seven languages. The paper presents the CD-ROM, the methods employed in its creation and its prospective uses.

CiteSeerX

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages

Author: Erjavec Tomaz
Ignat Camelia
Pouliquen Bruno
Steinberger Ralf
Tufis Dan
Varga Daniel
Widiger Anna
Publication date: 01/01/2006

We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable to carry out all types of cross-language research, as well as to test and benchmark text analysis software across different languages (for instance for alignment, sentence splitting and term extraction). Comment: A multilingual textual resource with meta-data freely available for download at http://langtech.jrc.it/JRC-Acquis.htm

arXiv.org e-Print Archive

CiteSeerX

Towards an international standard on feature structures representation

Author: Bauman Syd
Bunt Harry
Burnard Lou
Clement Lionel
Declerck Thierry
Erjavec Tomaz
Lee Kiyong
Romary Laurent
Roussanaly Azim
Roux Claude
Villemonte de La Clergerie Éric
Publication venue: HAL CCSD
Publication date: 26/05/2004

International conference with proceedings and peer-review committee. International audience. This paper describes the preliminary results of a joint initiative of the TEI (Text Encoding Initiative) Consortium and the ISO Committee TC 37/SC 4 (Language Resource Management) to provide a standard for the representation and interchange of feature structures.

INRIA a CCSD electronic archive server