A tool set for the quick and efficient exploration of large document collections
We are presenting a set of multilingual text analysis tools that can help
analysts in any field to explore large document collections quickly in order to
determine whether the documents contain information of interest, and to find
the relevant text passages. The automatic tool, which currently exists as a
fully functional prototype, is expected to be particularly useful when users
repeatedly have to sieve through large collections of documents such as those
downloaded automatically from the internet. The proposed system takes a whole
document collection as input. It first carries out some automatic analysis
tasks (named entity recognition, geo-coding, clustering, term extraction),
annotates the texts with the generated meta-information and stores the
meta-information in a database. The system then generates a zoomable and
hyperlinked geographic map enhanced with information on entities and terms
found. When the system is used on a regular basis, it builds up a historical
database that contains information on which names have been mentioned together
with which other names or places, and users can query this database to retrieve
information extracted in the past. Comment: 10 page
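The historical database of co-mentioned names described above can be illustrated with a minimal sketch. It assumes the entity recognition step has already produced one list of names and places per document. The function names and sample entities below are hypothetical, and the in-memory counter stands in for the database; none of this is the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs_entities):
    """Count how often two entity names appear in the same document."""
    counts = Counter()
    for entities in docs_entities:
        # set() ignores repeated mentions; sorted() makes pairs order-independent
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def mentioned_with(counts, name):
    """Return the entities co-mentioned with `name`, most frequent first."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == name:
            related[b] += n
        elif b == name:
            related[a] += n
    return related.most_common()


# Hypothetical output of an entity recognition step, one list per document.
docs = [
    ["Alice Example", "Brussels", "Bob Sample"],
    ["Alice Example", "Brussels"],
    ["Bob Sample", "Ljubljana"],
]
counts = cooccurrence_counts(docs)
related = mentioned_with(counts, "Alice Example")
print(related)
```

Running the sketch again on each new batch of documents and adding the counts together would give the accumulating history that the abstract describes.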
Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives
The aim of this contribution is to reflect on the process of building the multilingual European Literary Text Collection (ELTeC) that is being created in the framework of the networking project Distant Reading for European Literary History funded by COST (European Cooperation in Science and Technology). To provide some background, we briefly introduce the basic idea of ELTeC with a focus on the overall goals and intended usage scenarios. We then describe the collection composition principles that we have derived from the usage scenarios. In our discussion of the corpus-building process, we focus on collections of novels from four different literary traditions as components of ELTeC: French, Portuguese, Romanian, and Slovenian, selected from the more than twenty collections that are currently in preparation. For each collection, we describe some of the challenges we have encountered and the solutions developed while building ELTeC. In each case, the literary tradition, the history of the language, the current state of digitization of cultural heritage, the resources available locally, and the scholars’ training level with regard to digitization and corpus building have been vastly different. How can we, in this context, hope to build comparable collections of novels that can usefully be integrated into a multilingual resource such as ELTeC and used in Distant Reading research? Based on our individual and collective experience with contributing to ELTeC, we end this contribution with some lessons learned regarding collaborative, multilingual corpus building.
A Common XML-based Framework for Syntactic Annotations
It is widely recognized that the proliferation of annotation schemes runs counter to the need to re-use language resources, and that standards for linguistic annotation are becoming increasingly mandatory. To answer this need, we have developed a framework comprising an abstract model for a variety of different annotation types (e.g., morpho-syntactic tagging, syntactic annotation, co-reference annotation, etc.), which can be instantiated in different ways depending on the annotator's approach and goals. In this paper we provide an overview of the framework, demonstrate its applicability to syntactic annotation, and show how it can contribute to comparative evaluation of parser output and diverse syntactic annotation schemes.
East meets West: Producing Multilingual Resources in a European Context
The EU concerted action TELRI has released a two-volume CD-ROM, which contains multilingual language resources, namely corpora, lexica, and tools for language engineering. This CD-ROM provides harmonised resources for unprecedented numbers and kinds of languages, mainly from non-EU countries, for which such resources still tend to be scarce. The first volume of the CD includes the aligned text of Plato's Republic in twenty-one languages, while the second volume contains extended results of the EU MULTEXT-East project, including the aligned and tagged novel '1984' by George Orwell and accompanying lexica in seven languages. The paper presents the CD-ROM, the methods employed in its creation and its prospective uses.
The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages
We present a new, unique and freely available parallel corpus containing
European Union (EU) documents of mostly legal nature. It is available in all 20
official EU languages, with additional documents being available in the languages
of the EU candidate countries. The corpus consists of almost 8,000 documents
per language, with an average size of nearly 9 million words per language.
Pair-wise paragraph alignment information produced by two different aligners
(Vanilla and HunAlign) is available for all 190+ language pair combinations.
Most texts have been manually classified according to the EUROVOC subject
domains so that the collection can also be used to train and test multi-label
classification algorithms and keyword-assignment software. The corpus is
encoded in XML, according to the Text Encoding Initiative Guidelines. Due to
the large number of parallel texts in many languages, the JRC-Acquis is
particularly suitable to carry out all types of cross-language research, as
well as to test and benchmark text analysis software across different languages
(for instance for alignment, sentence splitting and term extraction). Comment: A
multilingual textual resource with meta-data freely available for download at
http://langtech.jrc.it/JRC-Acquis.htm
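The figure of 190 language pairs follows from the 20 official languages, since the number of unordered pairs is 20 × 19 / 2 = 190. The short check below enumerates them. The two-letter codes are an assumption for illustration (the official EU languages after the 2004 enlargement) and are not taken from the corpus documentation. Candidate-country languages would add further pairs, hence "190+".

```python
from itertools import combinations
from math import comb

# Assumed ISO 639-1 codes for the 20 official EU languages (illustrative only).
LANGS = [
    "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "sk", "sl", "sv",
]

# Every unordered pair of distinct languages, e.g. ("cs", "da").
pairs = list(combinations(LANGS, 2))

# The enumeration agrees with the closed form n * (n - 1) / 2.
assert len(pairs) == comb(len(LANGS), 2)
print(len(pairs))  # 190
```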
Towards an international standard on feature structures representation
This paper describes the preliminary results of a joint initiative of the TEI (Text Encoding Initiative) Consortium and the ISO Committee TC 37/SC 4 (Language Resource Management) to provide a standard for the representation and interchange of feature structures.