35 research outputs found
Similitud entre documentos multilingües de carácter científico-técnico en un entorno Web
This paper presents a system for the multilingual clustering of documents that deal with similar topics. Documents are represented with the vector space model, using linguistic criteria to select keywords and the tf-idf formula to compute their relevance. The document repository is updated through RSS and HTML wrappers. For the multilingual treatment we follow a strategy based on bilingual dictionaries with disambiguation. Because of the scientific-technical nature of the texts, the vectors are translated using technical dictionaries combined with general-purpose dictionaries. The results were evaluated manually in order to estimate the precision of the system. This work is funded by the Department of Industry of the Basque Government (projects Dokusare SA-2005/00272, Dokusare SA-2006/00167).
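The pipeline described in this abstract (tf-idf weighting, dictionary-based translation of the vector, similarity in a shared vector space) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tf-idf variant, the use of cosine similarity, and the even split of a weight across candidate translations (a stand-in for the disambiguation step, which the abstract does not detail) are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def translate_vector(vec, dictionary):
    """Map source terms to target-language terms via a bilingual dictionary.

    Assumption: the weight is split evenly over all candidate translations.
    A real system would disambiguate instead of splitting.
    """
    out = {}
    for term, weight in vec.items():
        targets = dictionary.get(term, [term])  # unknown terms pass through unchanged
        for target in targets:
            out[target] = out.get(target, 0.0) + weight / len(targets)
    return out

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In use, each language's collection would be weighted separately with `tf_idf`, a source-language vector passed through `translate_vector` with a hypothetical dictionary such as `{"red": ["network"]}`, and the result compared against target-language vectors with `cosine`.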
GaIn: un buscador Internet/Intranet avanzado para textos en euskera
This article presents the work carried out to combine the exploitation of Web information with NLP techniques, resulting in an advanced search engine for texts in Basque. The work was developed within the GaIn project, partially funded by the Diputación Foral de Gipuzkoa and by the University-Business programme of the Basque Government (UE-1999-2). It was carried out by the IxA group (ixa.si.ehu.es) of the University of the Basque Country in collaboration with the Jalgi portal (www.jalgi.com) of Plazagune S.L.
The resulting tool is an Internet/Intranet search engine with its own crawler, indexer and search component. Two NLP modules make it an advanced search engine: a language identifier and a robust lemmatizer.
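How a language identifier and a lemmatizer can plug into an indexer is sketched below. This is only a toy illustration, not the GaIn system: the stopword lists, the three-suffix stripping rule standing in for a robust Basque lemmatizer, and all function names are assumptions.

```python
from collections import defaultdict

# Toy resources (assumptions): a few function words per language and three Basque suffixes.
STOPWORDS = {
    "eu": {"eta", "da", "bat", "ez", "du"},
    "es": {"el", "la", "de", "y", "que"},
}
SUFFIXES = ("ak", "ko", "a")

def identify_language(tokens):
    """Pick the language whose function words appear most often; None if no match."""
    scores = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def lemmatize(token):
    """Strip one known suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def build_index(pages):
    """pages: {url: text}. Index only pages identified as Basque, keyed by lemma."""
    index = defaultdict(set)
    for url, text in pages.items():
        tokens = text.lower().split()
        if identify_language(tokens) != "eu":
            continue
        for token in tokens:
            index[lemmatize(token)].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every lemmatized query word."""
    hits = [index.get(lemmatize(word), set()) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()
```

Because both documents and queries are reduced to lemmas, a query for one inflected form (for example `etxeko`) retrieves pages containing other forms of the same word (`etxea`, `etxeak`), which matters for a highly inflected language such as Basque.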
Automatic diachronic distance between diatopic variants of Portuguese and Spanish
The objective of this work is to apply a perplexity-based methodology to automatically calculate the cross-lingual distance between different historical periods of diatopic language variants. The methodology is applied to a purpose-built corpus in original spelling, balanced between fiction and non-fiction, to measure the historical distance between European and Brazilian Portuguese on the one hand, and European and Argentinian Spanish on the other. The results show very close distances, in both original and automatically transcribed spelling, between the diatopic varieties of Portuguese and of Spanish, with slight convergences and divergences from the middle of the 20th century until today. The method is unsupervised and can be applied to other diatopic varieties of languages.
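A perplexity-based distance of this kind can be sketched as follows. The abstract does not specify the language model or how the two directions are combined, so the character-bigram model, the add-one smoothing, the symmetric averaging, and the function names are all assumptions chosen for brevity.

```python
import math
from collections import Counter

def train_char_bigram(text):
    """Count character bigrams and the frequency of each left-hand context."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return bigrams, contexts

def perplexity(model, text, vocab_size):
    """Perplexity of `text` under a bigram model with add-one smoothing.

    `text` must contain at least two characters.
    """
    bigrams, contexts = model
    pairs = list(zip(text, text[1:]))
    log_prob = 0.0
    for a, b in pairs:
        p = (bigrams[(a, b)] + 1) / (contexts[a] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))

def distance(text_a, text_b):
    """Symmetric distance: mean of the two cross-perplexities (an assumption)."""
    vocab_size = len(set(text_a) | set(text_b))
    pp_ab = perplexity(train_char_bigram(text_a), text_b, vocab_size)
    pp_ba = perplexity(train_char_bigram(text_b), text_a, vocab_size)
    return (pp_ab + pp_ba) / 2
```

Lower values indicate that a model trained on one variety predicts the other well. With real corpora, one model would be trained per variety and period, and the distances compared across periods.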
Identificación de cláusulas y chunks para el Euskera, usando Filtrado y Ranking con el Perceptron
This paper presents systems for syntactic chunking and clause identification for Basque, combining rule-based grammars with machine-learning techniques. Specifically, we use Filtering-Ranking with Perceptrons (Carreras, Màrquez and Castro, 2005): a learning model that recognizes partial syntactic structures in sentences and obtains state-of-the-art performance for these tasks in English. The model allows a rich set of features to be incorporated to represent syntactic phrases, making it possible to use information from different sources. We used this property to include more linguistic features in the learning model, and the chunking results improved considerably, making up for the relatively small training corpus available for Basque. For clause identification, our preliminary results are low, probably because of the free word order of Basque and the small corpus currently available. Research partly funded by the Basque Government (Department of Education, University and Research, IT-397-07), the Spanish Ministry of Education and Science (TIN2007-63173) and the ETORTEK-ANHITZ project from the Basque Government (Department of Culture and Industry, IE06-185).
Glass-Transition Dynamics of Mixtures of Linear Poly(Vinyl Methyl Ether) with Single-Chain Polymer Nanoparticles: Evidence of a New Type of Nanocomposite Materials
Single-chain polymer nanoparticles (SCNPs) obtained through chain collapse by intramolecular cross-linking are attracting increasing interest as components of all-polymer nanocomposites, among other applications. We present a dielectric relaxation study on the dynamics of mixtures of poly(vinyl methyl ether) (PVME) and polystyrene (PS)-based SCNPs with various compositions. Analogous dielectric measurements on a miscible blend of PVME with the linear precursor chains of the SCNPs are taken as reference for this study. The two systems behave completely differently: while the blend with the linear precursor presents dynamics very similar to that reported for PVME/PS miscible blends, in the PVME/SCNP mixtures there is an appreciable amount of PVME segments that are barely affected by the presence of SCNPs, which nearly vanishes only for mixtures with high SCNP content. Interestingly, in the frame of a simple two-phase system, our findings point towards the existence of an SCNP-rich phase with a constant PVME fraction, regardless of the overall concentration of the mixture. Moreover, the dynamics of the PVME segments in this SCNP-rich phase display an extreme dynamic heterogeneity, a signature of constraint effects. This research was funded by Eusko Jaurlaritza (project code IT-654-13) and the Ministerio de Economia y Competitividad (project code MAT2015-63704-P, MINECO/FEDER, UE).
Euskarazko hitz anitzeko unitate lexikalen tratamendu konputazionala
Multi-word Lexical Units (MWLU) are of great importance in language in general, and in Natural Language Processing in particular, since they are not governed by the free rules of the system. In this article, we give an overview of the different types of phraseological units, explaining briefly each one's features. Our priority being to process idioms automatically in Basque texts, we concisely analyze several approaches for the inflectional description of MWLUs, and then, we explain the system we have developed for Basque: (i) a general representation for describing MWLUs in the lexical database for Basque (EDBL), (ii) HABIL, a tool capable of detecting and analyzing them based on the features described in the database, and (iii) a constraint grammar for disambiguating ambiguous MWLUs
Traducción automática basada en tectogramática para inglés-español e inglés-euskara
We present the first attempt to build machine translation systems for the English-Spanish and English-Basque language pairs following the tectogrammar approach. Starting from the existing English-Czech system, we describe the language-specific tools added in the analysis and synthesis steps, and the resources for bilingual transfer. Evaluation shows the potential of these systems to adapt to new languages and domains. The research leading to these results has received funding from FP7-ICT-2013-10-610516 (QTLeap project, qtleap.eu).
Wikipedia eta itzulpen automatikoa: "harri batez bizpalau xori"
In this article we present a collaborative project. We gathered a group of volunteers to translate a number of articles from the Spanish Wikipedia into Basque. To ease the volunteers' work, we used the Matxin machine translation system to produce pre-translations, so that the volunteers' task was to review and correct those machine translations, which contain errors. With this work we have, on the one hand, enriched the Basque Wikipedia by adding 50,000 new words. On the other hand, we have built a corpus from the system's automatic translations and their corrected, post-edited versions. We used that corpus to build a statistical post-editor, improving the accuracy of the output of the Matxin machine translation system by 10%.