3 research outputs found

    Implementation of a lemmatizer for a low-resource language: the case of Shipibo-Konibo

    Since the Ministerio de Educación made the Shipibo-Konibo alphabet official, there has been a need to produce a large number of educational and official documents for speakers of this language, a task currently carried out only with the help of translators or bilingual speakers. However, computational linguistics offers tools that can ease this work, such as a lemmatizer, which obtains the lemma, or base form, of a word from its inflected form. Lemmatizers are commonly built by one of two methods: morphological rules or dictionaries. Accordingly, the main objective of this project is to develop a lemmatization tool for Shipibo-Konibo based on a word corpus, following the annotation standards used for other languages, and easy to use through a function library and a web service. The final tool was built mainly with the k-nearest-neighbors classification method, which estimates the class of a new case by comparing its features with those of previously classified cases and returning the most frequent class among similar values. The resulting lemmatization tool reached a precision of 0.736, thereby outperforming tools used for other languages.
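    A minimal sketch of the k-nearest-neighbors idea described above, assuming character-suffix features and suffix-rewrite rules as the classification targets; the feature choice, the toy training pairs, and all function names are illustrative assumptions, not the thesis implementation:

```python
# k-NN lemmatizer sketch (illustrative only). Each training word is paired with the
# suffix rewrite rule that turns the inflected form into its lemma; a new word gets
# the rule voted for by its k most similar training words, where similarity counts
# shared character suffixes.
from collections import Counter

def suffix_features(word, max_len=4):
    """Character suffixes of length 1..max_len used as the word's feature set."""
    return {word[-i:] for i in range(1, min(max_len, len(word)) + 1)}

def rewrite_rule(inflected, lemma):
    """Encode the lemma as (number of chars to strip from the end, string to append)."""
    common = 0
    for a, b in zip(inflected, lemma):
        if a != b:
            break
        common += 1
    return (len(inflected) - common, lemma[common:])

def apply_rule(word, rule):
    strip, append = rule
    return (word[:-strip] if strip else word) + append

def knn_lemmatize(word, training_pairs, k=3):
    """Return the lemma given by the most frequent rule among the k nearest words."""
    feats = suffix_features(word)
    neighbors = sorted(
        training_pairs,
        key=lambda pair: len(feats & suffix_features(pair[0])),
        reverse=True,
    )[:k]
    top_rules = [rewrite_rule(infl, lem) for infl, lem in neighbors]
    best_rule, _ = Counter(top_rules).most_common(1)[0]
    return apply_rule(word, best_rule)

# Toy English-like data purely for demonstration; a real corpus of Shipibo-Konibo
# inflected-form/lemma pairs would replace it.
training = [("walked", "walk"), ("talked", "talk"), ("jumped", "jump"),
            ("running", "run"), ("singing", "sing")]
print(knn_lemmatize("kicked", training))  # -> "kick"
```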

    Natural language processing methods for measuring word similarity

    The artificial intelligence application considered in this thesis was harnessed to extract competencies from job descriptions and higher-education curricula written in natural language. Using these extracted competencies, the application can visualize the skills supply of schools and the skills demand of the labor market. However, to understand natural language, a computer must learn to evaluate the relatedness between words. The aim of the thesis is to propose the best methods for open text data mining and for measuring the semantic similarity and relatedness between words. Different words can have similar meanings in natural language, and a computer can learn this relatedness mainly by two methods. The first is to construct an ontology of the studied domain, which models the concepts of the domain as well as the relations between them. The ontology can be viewed as a directed graph: the nodes are the concepts of the domain, and the edges between the nodes describe their relations. The semantic similarity between concepts can then be computed from the distance and the strength of the relations between them. The second way to measure word relatedness is based on statistical language models. Such a model learns the similarity between words from their probability distributions in large corpora: words appearing in similar contexts, i.e., surrounded by similar words, tend to have similar meanings. Words are often represented as continuous distributed word vectors, with each dimension representing some feature of the word. A feature can be semantic, syntactic, or morphological; however, the features are latent and usually not interpretable by a human. If the angle between two word vectors in the feature space is small, the words share the same features and are hence similar. The study was conducted by reviewing the available literature and implementing a web scraper for retrieving open text data from the web. The scraped data was fed into the AI application, which extracted the skills from the data and visualized the results as semantic maps.
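    A minimal sketch of the vector-space measure described above, assuming toy hand-written vectors rather than trained embeddings (the words, dimensions, and values are illustrative): a small angle between two word vectors, i.e. a cosine similarity close to 1, marks the words as related.

```python
# Cosine similarity between word vectors (toy example, not a trained model).
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); values near 1.0 mean the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dimensional word vectors; a real statistical language model would
# learn hundreds of latent dimensions from large corpora.
vectors = {
    "programmer": [0.9, 0.8, 0.1, 0.0],
    "developer":  [0.85, 0.75, 0.2, 0.05],
    "banana":     [0.0, 0.1, 0.9, 0.8],
}

print(cosine_similarity(vectors["programmer"], vectors["developer"]))  # high, ~0.99
print(cosine_similarity(vectors["programmer"], vectors["banana"]))     # low, ~0.12
```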