
15 research outputs found

The ALVIS Format for Linguistically Annotated Documents

Author: Alphonse Erick
Derivière Julien
Hamon Thierry
Nazarenko Adeline
Vauvert Guillaume
Weissenbacher Davy
Publication date: 01/01/2006

The paper describes the ALVIS annotation format, designed for indexing large document collections in topic-specific search engines. The format is illustrated on the biological domain and on MedLine abstracts, since developing a specialized search engine for biologists is one of the ALVIS case studies. The ALVIS approach to linguistic annotation builds on existing work and proposed standards. We chose stand-off annotations rather than inserted mark-up. Annotations are encoded as XML elements that form the linguistic subsection of the document record.
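
To make the stand-off idea in this abstract concrete, here is a minimal Python sketch. It is not the ALVIS schema: the element names (`linguisticAnalysis`, `annotation`, `start`, `end`) and the sample sentence are assumptions chosen for illustration. The point it shows is that annotations live in a separate XML structure and refer to the text by character offsets, so the text itself is never modified.

```python
import xml.etree.ElementTree as ET


def build_standoff(spans):
    """Build an XML element holding stand-off annotations.

    `spans` is a list of (start, end, label) character offsets into a
    source text. The text is stored elsewhere and left untouched.
    Element and attribute names are hypothetical, not the ALVIS format.
    """
    root = ET.Element("linguisticAnalysis")
    for i, (start, end, label) in enumerate(spans):
        ann = ET.SubElement(root, "annotation", id=f"a{i}", type=label)
        ET.SubElement(ann, "start").text = str(start)
        ET.SubElement(ann, "end").text = str(end)
    return root


def resolve(text, annotation):
    """Recover the annotated substring from its stored offsets."""
    start = int(annotation.find("start").text)
    end = int(annotation.find("end").text)
    return text[start:end]


# Illustrative input: a short phrase and two hand-written spans.
text = "Bacillus subtilis sporulation"
spans = [(0, 17, "species"), (18, 29, "term")]

root = build_standoff(spans)
covered = [resolve(text, ann) for ann in root]
xml_string = ET.tostring(root, encoding="unicode")
```

Because the annotations only point into the text, several independent layers (tokens, terms, named entities) can coexist and overlap, which inserted mark-up handles poorly.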

arXiv.org e-Print Archive

CiteSeerX

HAL-Paris 13

A Robust Linguistic Platform for Efficient and Domain specific Web Content Analysis

Author: Aubin Sophie
Derivière Julien
Hamon Thierry
Nazarenko Adeline
Poibeau Thierry
Publication date: 30/05/2007

Semantic access to Web content in specific domains calls for specialized search engines with enhanced semantic querying and indexing capacities, which draw on both information retrieval (IR) and information extraction (IE). A rich linguistic analysis is required, either to identify the relevant semantic units to index and to weight them according to linguistically specific statistical distributions, or as the basis of an information extraction process. Recent developments make Natural Language Processing (NLP) techniques reliable enough to process large collections of documents and to enrich them with semantic annotations. This paper focuses on the design and development of a text processing platform, Ogmios, developed within the ALVIS project. The Ogmios platform exploits existing NLP modules and resources, which may be tuned to specific domains, and produces linguistically annotated documents. We show how the three constraints of genericity, domain semantic awareness and performance can be handled together.

arXiv.org e-Print Archive

CiteSeerX

HAL-Paris 13

Annotation linguistique de documents Web dans une architecture distribuée et adaptable

Author: Derivière Julien
Hamon Thierry
Publication venue: HAL CCSD
Publication date: 25/11/2006

The French Perl Workshop (Journées Francophones de Perl - FPW2006). Oral presentation. As part of the ALVIS project (www.alvis.info/alvis), we designed a platform for the linguistic enrichment of Web documents that relies on existing Natural Language Processing (NLP) tools. The architecture is distributed, to cope with the constraints of processing large volumes of text, and adaptable, so that the linguistic analysis of these texts can be specialized. A collection of 55,329 documents (more than 80 million words) was annotated in 3 days. The platform, developed in Perl and available as modules, can be seen as a modular framework into which new NLP tools can be integrated. In the talk, we will present the platform from the standpoint of both its design and its implementation. We will also give an overview of the performance obtained.

HAL-Paris 13

Deliverable D5.2: Report on theory and software of normalization options for IR (platform conception)

Author: Derivière Julien
Nazarenko Adeline
Publication venue: HAL CCSD
Publication date: 01/01/2006

ALVIS Deliverable Report. This document gives technical details regarding the implementation and usage of the ALVIS platform for English, French, Chinese and Slovene.

HAL-Paris 13

Alvis NLP Platform

Author: Derivière Julien
Hamon Thierry
Publication venue: HAL CCSD
Publication date: 01/01/2006

The Alvis NLP Platform is a scalable architecture that uses existing NLP tools to annotate large collections of web documents.

HAL-Paris 13

Ogmios : une plate-forme d'annotation linguistique

Author: Derivière Julien
Hamon Thierry
Nazarenko Adeline
Publication venue: IRIT Press
Publication date: 05/06/2007

National audience. One of the goals of the ALVIS project is to integrate linguistic information into specialized search engines. In this context, we designed Ogmios, a platform for the linguistic enrichment of Web documents that relies on existing NLP tools. Documents may be in French or in English. The architecture is distributed, to cope with the constraints of processing large volumes of text, and adaptable, so that the linguistic analysis of these texts can be specialized. The platform is developed in Perl and available as CPAN modules. It can be seen as a modular framework into which domain-specific resources as well as new NLP tools can be integrated. We evaluated the performance of the platform on several document collections. With processing distributed over twenty machines, a collection of 55,329 biology documents (106 million words) was annotated in 35 hours, while a collection of 48,422 news stories about search engines (14 million words) was annotated in 3 hours and 15 minutes.
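
As a rough reading of the figures reported in this abstract, the short Python sketch below converts them into words processed per machine per hour. It assumes that both runs used all twenty machines and that the load was spread evenly, which the abstract does not state explicitly; the function name is illustrative.

```python
def words_per_machine_hour(words, hours, machines):
    """Average throughput per machine, assuming an even load (an assumption)."""
    return words / hours / machines


# Figures taken from the abstract: biology collection and news collection.
biology = words_per_machine_hour(106_000_000, 35, machines=20)
news = words_per_machine_hour(14_000_000, 3.25, machines=20)  # 3 h 15 min

print(f"biology: {biology:,.0f} words per machine-hour")
print(f"news:    {news:,.0f} words per machine-hour")
```

Under these assumptions the biology collection comes out at roughly 151,000 words per machine-hour and the news collection at roughly 215,000. The abstract itself does not explain the difference.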

HAL-Paris 13

Ogmios: a scalable NLP platform for annotating large web document collections

Author: Adeline Nazarenko
Julien Derivière
Thierry Hamon
Publication date: 01/07/2007

Search engines like Google or Yahoo offer access to billions of textual web pages. These tools are very popular and seem to be sufficient for a large number of general user queries on the Internet. However, other queries are more complex and require specific knowledge or processing strategies; no truly satisfactory solution exists for these requests.

CiteSeerX

HAL-Paris 13

A Scalable and Distributed NLP Architecture for Web Document Annotation

Author: Derivière Julien
Hamon Thierry
Nazarenko Adeline
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2006

HAL-Paris 13

Développement d'une plate-forme d'enrichissement des documents textuels : l'expérience du projet ALVIS

Author: Derivière Julien
Hamon Thierry
Nazarenko Adeline
Vauvert Guillaume
Publication venue: HAL CCSD
Publication date: 01/01/2005

HAL-Paris 13