35 research outputs found

    Similitud entre documentos multilingües de carácter científico-técnico en un entorno Web [Similarity between multilingual scientific-technical documents in a Web environment]

    This paper presents a system for multilingual clustering of documents that deal with similar topics. Documents are represented with the vector space model, using linguistic criteria to select keywords and the tf-idf formula to compute their relevance; the document repository is kept up to date through RSS and HTML wrappers. Multilingual treatment follows a strategy based on bilingual dictionaries with disambiguation. Because of the scientific-technical nature of the texts, the vectors are translated using technical dictionaries combined with general-purpose dictionaries. The results were evaluated manually to estimate the precision of the system. This work is funded by the Department of Industry of the Basque Government (projects Dokusare SA-2005/00272, Dokusare SA-2006/00167).

    GaIn: un buscador Internet/Intranet avanzado para textos en euskera [GaIn: an advanced Internet/Intranet search engine for Basque texts]

    This paper presents the work carried out to combine the exploitation of Web information with NLP techniques, resulting in an advanced search engine for Basque texts. The work was developed within the GaIn project, partially funded by the Diputación Foral de Gipuzkoa and by the University-Business programme of the Basque Government (UE-1999-2). It was carried out by the IxA group (ixa.si.ehu.es) of the University of the Basque Country in collaboration with the Jalgi portal (www.jalgi.com) of Plazagune S.L. The resulting tool is an Internet/Intranet search engine with its own crawler, indexer and search component, plus two NLP modules that make it advanced: a language identifier and a robust lemmatizer.

    Automatic diachronic distance between diatopic variants of Portuguese and Spanish

    The objective of this work is to apply a perplexity-based methodology to automatically calculate the cross-lingual distance between different historical periods of diatopic language variants. The methodology is applied to an ad hoc corpus in original spelling, balanced between fiction and non-fiction, to measure the historical distance between European and Brazilian Portuguese on the one hand, and European and Argentinian Spanish on the other. The results show very close distances, both in original spelling and in automatically transcribed spelling, between the diatopic varieties of Portuguese and of Spanish, with slight convergences and divergences from the middle of the 20th century until today. The method is unsupervised and can be applied to other diatopic varieties of languages.

    Identificación de cláusulas y chunks para el Euskera, usando Filtrado y Ranking con el Perceptron [Clause and chunk identification for Basque using Filtering and Ranking with the Perceptron]

    This paper presents systems for syntactic chunking and clause identification for Basque, combining rule-based grammars with machine-learning techniques. Specifically, we used Filtering-Ranking with Perceptrons (Carreras, Màrquez and Castro, 2005): a learning model that recognizes partial syntactic structures in sentences and obtains state-of-the-art performance for these tasks in English. The model allows a rich set of features to be incorporated to represent syntactic phrases, making it possible to use information from different sources. We used this property to include more linguistic features in the learning model, and the chunking results improved considerably, making up for the relatively small training corpus available for Basque. For clause identification, our preliminary results are low, probably because of the free word order of Basque and the small corpus currently available. Research partly funded by the Basque Government (Department of Education, University and Research, IT-397-07), the Spanish Ministry of Education and Science (TIN2007-63173) and the ETORTEK-ANHITZ project from the Basque Government (Department of Culture and Industry, IE06-185).

    Glass-Transition Dynamics of Mixtures of Linear Poly(Vinyl Methyl Ether) with Single-Chain Polymer Nanoparticles: Evidence of a New Type of Nanocomposite Materials

    Single-chain polymer nanoparticles (SCNPs) obtained through chain collapse by intramolecular cross-linking are attracting increasing interest as components of all-polymer nanocomposites, among other applications. We present a dielectric relaxation study on the dynamics of mixtures of poly(vinyl methyl ether) (PVME) and polystyrene (PS)-based SCNPs with various compositions. Analogous dielectric measurements on a miscible blend of PVME with the linear precursor chains of the SCNPs are taken as reference for this study. The two systems behave completely differently: while the blend with the linear precursor presents dynamics very similar to those reported for PVME/PS miscible blends, in the PVME/SCNP mixtures there is an appreciable amount of PVME segments that are barely affected by the presence of SCNPs, and this amount nearly vanishes only for mixtures with high SCNP content. Interestingly, in the frame of a simple two-phase system, our findings point towards the existence of an SCNP-rich phase with a constant PVME fraction, regardless of the overall concentration of the mixture. Moreover, the dynamics of the PVME segments in this SCNP-rich phase display an extreme dynamic heterogeneity, a signature of constraint effects. This research was funded by Eusko Jaurlaritza, project code IT-654-13, and the Ministerio de Economia y Competitividad, project code MAT2015-63704-P (MINECO/FEDER, UE).

    Euskarazko hitz anitzeko unitate lexikalen tratamendu konputazionala [Computational treatment of multi-word lexical units in Basque]

    Multi-word Lexical Units (MWLUs) are of great importance in language in general, and in Natural Language Processing in particular, since they are not governed by the free rules of the system. In this article, we give an overview of the different types of phraseological units and briefly explain the features of each. Since our priority is to process idioms automatically in Basque texts, we concisely analyze several approaches to the inflectional description of MWLUs and then explain the system we have developed for Basque: (i) a general representation for describing MWLUs in the lexical database for Basque (EDBL), (ii) HABIL, a tool capable of detecting and analyzing them based on the features described in the database, and (iii) a constraint grammar for disambiguating ambiguous MWLUs.

    Traducción automática basada en tectogramática para inglés-español e inglés-euskara [Tectogrammar-based machine translation for English-Spanish and English-Basque]

    We present the first attempt to build machine translation systems for the English-Spanish and English-Basque language pairs following the tectogrammar approach. Starting from the existing English-Czech system, we describe the language-specific tools added in the analysis and synthesis steps, and the resources for bilingual transfer. Evaluation shows the potential of these systems to adapt to new languages and domains. The research leading to these results has received funding from FP7-ICT-2013-10-610516 (QTLeap project, qtleap.eu).

    Wikipedia eta itzulpen automatikoa: "harri batez bizpalau xori" [Wikipedia and machine translation: "a few birds with one stone"]

    In this article we present a collaborative project. We gathered a group of volunteers to translate a number of articles from the Spanish Wikipedia into Basque. To ease their work, we used the Matxin machine translation system to produce pre-translations, so that the volunteers' task was to review and correct those machine translations, which contain errors and mistakes. On the one hand, this work has enriched the Basque Wikipedia by adding 50,000 new words. On the other hand, we have built a corpus from the translations of the automatic system and their corrected, post-edited versions. We used that corpus to create a statistical post-editor, improving the accuracy of the output of the Matxin machine translation system by 10%.