9 research outputs found

    Russian Lexicographic Landscape: a Tale of 12 Dictionaries

    Full text link
    The paper reports on a quantitative analysis of 12 Russian dictionaries at three levels: 1) headwords: the size and overlap of word lists, coverage of large corpora, and presence of neologisms; 2) synonyms: overlap of synsets in different dictionaries; 3) definitions: distribution of definition lengths and numbers of senses, as well as textual similarity of same-headword definitions in different dictionaries. The total amount of data in the study is 805,900 dictionary entries, 892,900 definitions, and 84,500 synsets. The study reveals multiple connections and mutual influences between dictionaries, uncovers differences between modern electronic and traditional printed resources, and suggests directions for developing new lexical semantic resources and improving existing ones.
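    As an illustrative sketch of the headword-level comparison (not the authors' actual pipeline; the dictionary names and word lists below are hypothetical), pairwise overlap of word lists can be summarized with the Jaccard coefficient:

```python
# Illustrative sketch: pairwise headword overlap between dictionaries.
# Dictionary names and word lists are hypothetical placeholders.
from itertools import combinations

headwords = {
    "dict_A": {"дом", "кот", "бежать", "смартфон"},
    "dict_B": {"дом", "кот", "идти"},
    "dict_C": {"дом", "смартфон", "селфи"},
}

def jaccard(a: set, b: set) -> float:
    """Share of headwords common to both word lists."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

for (n1, w1), (n2, w2) in combinations(headwords.items(), 2):
    print(f"{n1} vs {n2}: overlap = {jaccard(w1, w2):.2f}")
```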

    61SPA. String Similarity Functions: A Comparison Based on the Nature of the Data

    Get PDF
    Duplicate detection refers to the conflict that arises in data when the same real-world entity is represented two or more times, across one or several databases, in records or tuples with the same structure but without a unique identifier and with differing values. Multiple similarity functions have been developed to detect which strings are similar but not identical, that is, which refer to the same entity. Using an evaluation metric called discernibility, this article compares the effectiveness of nine of these similarity functions over text strings (Levenshtein, Affine Gap, Smith-Waterman, Jaro, Jaro-Winkler, Bi-grams, Tri-grams, Monge-Elkan, and SoftTF-IDF) across six problematic scenarios (introduction of spelling errors, use of abbreviations, missing words, introduction of prefixes/suffixes with no semantic value, word reordering, and removal/addition of whitespace). The results show that some similarity functions tend to fail in certain problematic scenarios and that none is superior to the rest in all of them.
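    As a minimal illustration of the kind of comparison described above (not the paper's evaluation code, and without its discernibility metric), the sketch below contrasts a Levenshtein-based similarity with a character-bigram Dice similarity on strings exhibiting typical variations; the example pairs are hypothetical:

```python
# Minimal sketch: two of the similarity functions mentioned above, applied to
# strings with typical problematic variations (typos, abbreviations, reordering).
def levenshtein_similarity(a: str, b: str) -> float:
    """1 - normalized edit distance (insert/delete/substitute, unit cost)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b), 1)

def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga and gb else 0.0

pairs = [("Juan Pérez", "Juan Peres"),                   # spelling error
         ("Avenida Colombia", "Av. Colombia"),           # abbreviation
         ("María López García", "López García María")]   # word reordering
for x, y in pairs:
    print(f"{x!r} vs {y!r}: lev={levenshtein_similarity(x, y):.2f}, "
          f"dice={bigram_dice(x, y):.2f}")
```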

    A SUPPORT TOOL TO IMPROVE COURSE CREDIT TRANSFER IN AN EDUCATION INSTITUTION

    Get PDF
    Course credit transfer processes must verify the compatibility or equivalence between curricular components. In educational institutions, teachers evaluate such decisions manually, without any technological support. To determine whether the courses taken by students at their institutions of origin can be accepted, teachers compare the contents of both courses (the one taken and the one requested). In addition, the semiannual volume of these processes makes the analysis tedious, time-consuming, error-prone, and frequently contested by stakeholders. This work therefore proposes a decision-support tool based on Natural Language Processing (NLP) techniques to help identify the equivalence of courses through the analysis of their contents, supporting teachers during the evaluation of credit transfer requests. To evaluate the performance of the system, we built a dataset containing teacher evaluations from real course equivalence processes; this dataset served as the gold standard (benchmark) for the computational tests. The metrics used to evaluate the proposed technique included the AUROC curve, accuracy, and F-measure.
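    As a hedged sketch of one way such content comparison could work (the abstract does not specify the exact pipeline; the syllabus texts and the 0.6 threshold below are hypothetical), TF-IDF vectors and cosine similarity can score how close two course descriptions are:

```python
# Hedged sketch: scoring course-content similarity with TF-IDF + cosine
# similarity. Syllabus texts and the decision threshold are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

attended = "Data structures: lists, stacks, queues, trees, graphs, sorting."
requested = "Fundamental data structures and algorithms: trees, graphs, sorting and searching."

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform([attended, requested])
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

THRESHOLD = 0.6  # hypothetical cut-off; a real system would tune it on labeled cases
print(f"similarity = {score:.2f} -> {'equivalent' if score >= THRESHOLD else 'not equivalent'}")
```

    In an evaluation against teacher decisions, scores like this could be compared to the gold-standard labels with sklearn.metrics.roc_auc_score, in line with the metrics listed in the abstract.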

    Methodology for analyzing similarity between trademarks using machine learning techniques

    Get PDF
    Trademarks are the distinctive signs, symbols, and words that businesses use to identify their products and services. They are often among a company's most valuable assets, which is why regulations exist for their registration and protection. When a trademark is registered, its holder gains the right to prevent third parties from marketing similar products under identical or similar marks. In trademark registration and protection processes, it is necessary to determine the similarity between two trademarks and detect potential confusion that may mislead consumers. Traditionally, this similarity has been established through a qualitative human assessment, but given the growing number of trademark applications filed each month, there is a clear need to automate this task. This research evaluates techniques from Natural Language Processing (NLP), Computer Vision, and phonology, applied in the context of trademark matching, to obtain a system of models that measures visual, spelling, and phonetic similarity between trademarks. The proposed method is evaluated on a dataset of real oppositions to trademark registration applications filed with the Superintendencia de Industria y Comercio de Colombia (SIC).
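    As a simplified, hypothetical sketch of combining orthographic and phonetic cues (not the project's actual models, which also include computer vision; the normalization rules, weights, and example names are illustrative), a character-level ratio can be blended with a crude phonetic normalization:

```python
# Hypothetical sketch: blending orthographic and crudely phonetic similarity
# for trademark name comparison. Rules and weights are illustrative only.
from difflib import SequenceMatcher

def phonetic_key(name: str) -> str:
    """Very rough Spanish phonetic normalization (illustrative)."""
    s = name.lower()
    for src, dst in [("ll", "y"), ("qu", "k"), ("c", "k"), ("z", "s"),
                     ("v", "b"), ("h", "")]:
        s = s.replace(src, dst)
    return s

def trademark_similarity(a: str, b: str) -> float:
    ortho = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    phono = SequenceMatcher(None, phonetic_key(a), phonetic_key(b)).ratio()
    return 0.5 * ortho + 0.5 * phono  # illustrative equal weighting

print(trademark_similarity("Kola Real", "Cola Real"))
print(trademark_similarity("Vitafresh", "Bitafresh"))
```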

    Automated Quality Assessment of Natural Language Requirements

    Get PDF
    High demands on quality and increasing complexity are major challenges in the development of industrial software in general. The development of automotive software in particular is subject to additional safety, security, and legal demands. In such software projects, the specification of requirements is the first concrete output of the development process and usually the basis for communication between manufacturers and development partners. The quality of this output is therefore decisive for the success of a software development project. In recent years, many efforts in academia and practice have targeted securing and improving the quality of requirement specifications. Early improvement approaches concentrated on assisting developers in formulating their requirements. Other approaches focus on the use of formal methods, but despite several advantages these are not widely applied in practice today. Most software requirements are still informal and specified in natural language. Current and previous research mainly focuses on quality characteristics agreed upon by the software engineering community; they are described in the standard ISO/IEC/IEEE 29148:2011, which defines nine essential characteristics of requirements quality. Several approaches additionally focus on measurable indicators that can be derived from text. More recent publications target the automated analysis of requirements by assessing their quality characteristics and by utilizing methods from natural language processing and techniques from machine learning. This thesis focuses in particular on reliability and accuracy in the assessment of requirements and addresses the relationships between textual indicators and quality characteristics as defined by global standards. In addition, an automated quality assessment of natural language requirements is implemented using machine learning techniques. For this purpose, labeled data is captured through assessment sessions in which experts from the automotive industry manually assess the quality characteristics of natural language requirements as defined in ISO 29148. The research is carried out in cooperation with an international engineering and consulting company, which gives us access to requirements from automotive software development projects for safety and comfort functions. We demonstrate the applicability of our approach to real requirements and present promising results for industry-wide application.
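    As a hedged sketch of how such an automated assessment could be framed (the thesis's actual features, models, and labels are not given here; the example requirements, labels, and model choice are hypothetical), one could train a text classifier on expert-labeled requirements and report the metrics mentioned above:

```python
# Hedged sketch: classifying requirements as unambiguous vs. ambiguous from
# expert labels. Requirements, labels, and the model choice are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

requirements = [
    "The system shall respond to user input within 200 ms.",
    "The system should usually be fast enough.",
    "The airbag shall deploy within 30 ms after crash detection.",
    "Error handling should be appropriate where applicable.",
]
labels = [1, 0, 1, 0]  # 1 = unambiguous, 0 = ambiguous (hypothetical expert judgment)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(requirements, labels)
# With a realistically sized labeled set, quality could be reported via
# cross-validated F1 or ROC AUC (sklearn.model_selection.cross_val_score),
# matching the metrics named in the abstract.
print(model.predict(["The brake light shall turn on within 10 ms."]))
```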

    Holistic integration and automatic warehousing of Open Data

    Get PDF
    Statistical Open Data provide useful information for feeding a decision-making system. Their integration and storage within such systems is achieved through ETL processes. These processes need to be automated in order to make them accessible to non-experts, and they must also cope with the lack of schemas and the structural and semantic heterogeneity that characterize Open Data. To address these issues, we propose a new graph-based ETL approach. For the extraction, we propose automatic detection and annotation activities based on a table model. For the transformation, we propose a linear program that performs holistic matching of structural data coming from several graphs; this model yields an optimal and unique solution. For the loading, we propose a progressive process for defining the multidimensional schema and augmenting the integrated graph. Finally, we present a prototype and the experimental evaluations.
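    As a simplified sketch of the matching step (the thesis formulates a holistic linear program over several graphs; the pairwise assignment, string-based label similarity, and node labels below are hypothetical stand-ins), labels from two source graphs can be aligned with a standard linear assignment solver:

```python
# Simplified sketch: aligning node labels from two source graphs with a
# linear assignment solver. This is a pairwise stand-in for the holistic
# linear program described in the abstract; labels are hypothetical.
import numpy as np
from difflib import SequenceMatcher
from scipy.optimize import linear_sum_assignment

graph_a = ["year", "region", "population", "unemployment rate"]
graph_b = ["année", "région", "population totale", "taux de chômage"]

# Cost = 1 - string similarity between labels (a real system would rather use
# semantic similarity, e.g. embeddings or a multilingual lexicon).
cost = np.array([[1 - SequenceMatcher(None, a, b).ratio() for b in graph_b]
                 for a in graph_a])

rows, cols = linear_sum_assignment(cost)  # minimizes total matching cost
for i, j in zip(rows, cols):
    print(f"{graph_a[i]!r} <-> {graph_b[j]!r} (cost={cost[i, j]:.2f})")
```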