565 research outputs found

    Acquiring Domain-Specific Knowledge for WordNet from a Terminological Database

    In this research we explore a terminological database (Termoteca) in order to expand the Portuguese and Galician wordnets (PULO and Galnet) with new synset variants (word forms for a concept), usage examples for the variants, and synset glosses or definitions. The methodology is based on aligning WordNet concepts (synsets) with the concepts described in Termoteca (terminological records), taking into account the lexical forms in both resources, their morphological category and their knowledge domains. The information provided by the WordNet Domains Hierarchy and the Termoteca domain fields is used to reduce the incidence of polysemy and homography in the results. The results confirm our hypothesis that combining the semantic domain information in both resources makes it possible to minimise lexical ambiguity and to obtain a very acceptable level of precision in terminological information extraction tasks: precision exceeds 89% when the Galnet synset and the Termoteca record share at least one lexical form in two or more different languages.
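
    The alignment criteria named in this abstract (shared lexical forms, matching morphological category, compatible knowledge domains) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Concept` class, the function names, the `min_languages` parameter and the sample data are all hypothetical, and the sketch assumes that the domain labels of the two resources have already been mapped to a common vocabulary.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical, simplified view of a synset or a terminological record."""
    pos: str                                   # morphological category, e.g. "n"
    domains: set[str]                          # domain labels (assumed already harmonised)
    forms: dict[str, set[str]] = field(default_factory=dict)  # language -> lexical forms


def shared_languages(synset: Concept, record: Concept) -> set[str]:
    """Return the languages in which both concepts share at least one lexical form."""
    shared = set()
    for lang, synset_forms in synset.forms.items():
        record_forms = record.forms.get(lang, set())
        # Compare case-insensitively; real data would likely need more normalisation.
        if {f.lower() for f in synset_forms} & {f.lower() for f in record_forms}:
            shared.add(lang)
    return shared


def is_candidate(synset: Concept, record: Concept, min_languages: int = 2) -> bool:
    """Accept a pair only if category and domain agree and enough languages share a form."""
    if synset.pos != record.pos:
        return False
    if not (synset.domains & record.domains):
        return False
    return len(shared_languages(synset, record)) >= min_languages


# Illustrative data only, not taken from Galnet or Termoteca.
synset = Concept("n", {"biology"}, {"gl": {"célula"}, "en": {"cell"}})
record_bio = Concept("n", {"biology"}, {"gl": {"célula"}, "en": {"Cell"}})
record_tel = Concept("n", {"telecommunications"}, {"gl": {"célula"}, "en": {"cell"}})

print(is_candidate(synset, record_bio))
print(is_candidate(synset, record_tel))
```

    Setting `min_languages` to 2 mirrors the condition under which the abstract reports its highest precision. The domain check is what separates the two homographic records in the sample data.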

    Consumer Eroski parallel corpus

    This paper introduces the Consumer Eroski Parallel Corpus, a collection of articles originally written in Spanish and later translated into three languages also spoken in Spain: Basque, Catalan and Galician. The articles have been aligned automatically across the four languages at the sentence level using Moore's bilingual sentence alignment tool (2002). The Spanish section is also annotated morphosyntactically for parts of speech using SVMtool (Giménez and Márquez 2004). The Basque, Catalan and Galician sections may be annotated in a future release with the collaboration of Computational Linguistics Groups in Spain. To my knowledge, the Consumer Eroski Parallel Corpus is the first resource to encompass a substantial body of parallel text from these four languages spoken in Spain. I would like to thank the Eroski Foundation for granting permission to share the corpus in the public domain. Making this resource public will provide additional opportunities to test, train and develop natural language processing tools in the computational linguistics community. It may also help translators as a reference. With the addition of an advanced search interface, currently under development, the corpus may be consulted by Basque and Romance linguists interested in cross-linguistic research.

    The first Mirandese text-to-speech system 

    This paper describes the creation of base NLP resources and tools for an under-resourced minority language spoken in Portugal, Mirandese, in the context of the generation of a text-to-speech system, a collaborative citizenship project between Microsoft, ILTEC, and ALM – Associaçon de la Lhéngua Mirandesa. Development efforts encompassed the compilation of a large textual corpus, definition of a complete phone-set, development of a tokenizer, inflector, TN and GTP modules, and creation of a large phonetic lexicon with syllable segmentation, stress mark-up, and POS. The TTS system will provide an open access web interface freely available to the community, along with the other resources. We took advantage of mature tools, resources, and processes already available for phylogenetically close languages, which allowed us to cut development time and resources to a great extent, a solution that can be viable for other lesser-spoken languages in a similar situation. (National Foreign Language Resource Center)

    Casa de la Lhéngua: A set of language resources and natural language processing tools for Mirandese

    This paper describes the construction of Language Resources and NLP tools for Mirandese, a minority language spoken in North-eastern Portugal, now available on a community-led portal, Casa de la Lhéngua. The resources were developed as part of a collaborative citizenship project led by Microsoft to create the first TTS system for Mirandese. Development efforts encompassed the compilation of a corpus with over 1M tokens and the construction of a GTP system and of syllable-division, inflection and Part-of-Speech (POS) tagger modules, leading to the creation of an inflected lexicon of about 200,000 entries with phonetic transcription, detailed POS tagging, syllable division, and stress mark-up. Alongside these tasks, which were made easier through the adaptation and reuse of existing tools for closely related languages, a casting call for voice talents was held in the speaker community and the first speech database for Mirandese speech synthesis was recorded. These resources were combined to fulfil the requirements of a well-tested statistical parameter synthesis model, leading to an intelligible voice font. These language resources are freely available at Casa de la Lhéngua, with the aim of promoting the development of real-life applications and fostering linguistic research on Mirandese.

    The cultural, ethnic and linguistic classification of populations and neighbourhoods using personal names

    There are growing needs to understand the nature and detailed composition of ethnic groups in today's increasingly multicultural societies. Ethnicity classifications are often hotly contested, but still greater problems arise from the quality and availability of classifications, with knock-on consequences for our ability meaningfully to subdivide populations. Name analysis and classification has been proposed as one efficient method of achieving such subdivisions in the absence of ethnicity data, and may be especially pertinent to public health and demographic applications. However, previous approaches to name analysis have been designed to identify one or a small number of ethnic minorities, and not complete populations. This working paper presents a new methodology to classify the UK population and neighbourhoods into groups of common origin using surnames and forenames. It proposes a new ontology of ethnicity that combines some of its multidimensional facets: language, religion, geographical region, and culture. It uses data collected at very fine temporal and spatial scales, and made available, subject to safeguards, at the level of the individual. Such individuals are classified into 185 independently assigned categories of Cultural Ethnic and Linguistic (CEL) groups, based on the probable origins of names. We include a justification for the need of classifying ethnicity, a proposed CEL taxonomy, a description of how the CEL classification was built and applied, a preliminary external validation, and some examples of current and potential applications.

    Language policies on the ground : parental language management in urban Galician homes

    Recent language policy and planning research reveals how policy-makers endorse the interests of dominant social groups, marginalise minority languages and perpetuate systems of sociolinguistic inequality. In the Castilian-dominated Galician linguistic landscape, this study examines the rise of grassroots-level actors or agents (i.e. parents, family members, and other speakers of minority Galician) who play a significant role in interpreting and implementing language policy on the ground. The primary focus of this study is the impact of top-down language policies inside the home domain: it looks at how the individual linguistic practices and ideologies of Galician parents act as visible and/or invisible language planning measures influencing their children’s language learning. However, these individual linguistic ideologies and language management decisions are difficult to detect because they are implicit, subtle, informal, and often hidden from the public eye, and are therefore frequently overlooked by language policy researchers and policy makers. Drawing on multiple ethnographic research methods, including observations, in-depth fieldwork interviews, focus group discussions and family language audits with thirty-two Galician parents, this study attempts to ascertain whether these parents can restore intergenerational transmission of Galician and whether their grassroots-level interrogation of the dominant discourse could lead to bottom-up language policies.

    How are Spanish academics coping with changes? Responses from a life histories research.

    The Catalan version is available at: http://diposit.ub.edu/dspace/handle/2445/20983. This report is part of the research project The effects of social changes in work and professional life of Spanish academics, partially financed by the Spanish Ministry of Science and Innovation (SEJ2006-01876), which has explored changes in legislation, organisation, research schemes and so on over the last thirty years. The main aim of this project is to deepen our understanding of the impact that the economic, social, cultural, technological and labour changes under way in Spanish universities have on the life and professional identity of teaching and research staff, taking into account the national and European context. This paper gathers part of the results of the project; its primary objective is to contribute to an improved knowledge base on professional knowledge and work experience in higher education institutions in Spain and, as a consequence, to understand how Spanish academics are coping with current changes.