
    Distantly Supervised Morpho-Syntactic Model for Relation Extraction

    The task of Information Extraction (IE) involves automatically converting unstructured textual content into structured data. Most research in this field concentrates on extracting all facts, or a specific set of relationships, from documents. In this paper, we present a method for extracting and categorising an unrestricted set of relationships from text. Our method relies on morpho-syntactic extraction patterns obtained through distant supervision, and creates Syntactic and Semantic Indices to extract and classify candidate graphs. We evaluate our approach on six datasets built on Wikidata and Wikipedia. The evaluation shows that our approach can achieve Precision scores of up to 0.85, though with lower Recall and F1 scores. Our approach allows rule-based Information Extraction systems to be created quickly, and annotated datasets to be built for training machine-learning and deep-learning classifiers.
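The distant-supervision idea can be sketched in a few lines: a known fact from a knowledge base such as Wikidata is aligned with a sentence mentioning both entities, and the intervening context becomes a reusable extraction pattern. A minimal token-level sketch with made-up sentences (the paper's patterns are morpho-syntactic, built over parses rather than raw token spans):

```python
# Minimal sketch of distant supervision for relation extraction.
# All sentences and entities here are made up for illustration; the paper's
# patterns are morpho-syntactic (built over parses), not raw token spans.

def learn_pattern(sentence, subj, obj):
    """Use a known KB fact (subj, obj) to turn the tokens between the two
    entity mentions into a crude extraction pattern."""
    tokens = sentence.split()
    i, j = tokens.index(subj), tokens.index(obj)
    lo, hi = min(i, j), max(i, j)
    return tuple(tokens[lo + 1:hi])

def match(sentence, pattern):
    """Return a (subj, obj) candidate pair if the pattern occurs between
    two tokens of the sentence, else None."""
    tokens = sentence.split()
    n = len(pattern)
    for k in range(1, len(tokens) - n):
        if tuple(tokens[k:k + n]) == pattern:
            return tokens[k - 1], tokens[k + n]
    return None

# A Wikidata-style fact (Paris, capital-of, France) aligned with a sentence
# yields the pattern, which then matches an unseen sentence:
pattern = learn_pattern("Paris is the capital of France", "Paris", "France")
print(pattern)                                             # ('is', 'the', 'capital', 'of')
print(match("Berlin is the capital of Germany", pattern))  # ('Berlin', 'Germany')
```

In practice such patterns are noisy, which is why the paper's Syntactic and Semantic Indices and subsequent classification step matter.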

    Social tagging and blog scraping as an alternative for updating controlled vocabularies: a practical application to a Librarianship and Documentation thesaurus

    The aim of this paper is to compare free-language tags, taken in our case from specialised blogs on information science, against the unstructured controlled language of keyword lists, in order to verify which of the two is the better source of new terminology for the Thesaurus of Librarianship and Documentation. To do this, authors' tags were extracted from 127 blogs on librarianship and information science using web-scraping techniques, and were compared with the descriptor and identifier lists of the ISOC Librarianship and Documentation database (ISOC-BD). The analysis of authors' tags in blogs contributed 186 new terms, while the database lists yielded only 130. It is concluded that free-language tags can be a better and faster route for contributing new terminology to controlled vocabularies than unstructured controlled-language lists.
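The core comparison reduces to a set difference between normalised candidate terms and the controlled vocabulary. A minimal sketch with invented tags and descriptors (the study scraped author tags from 127 real blogs and used the ISOC-BD descriptor and identifier lists):

```python
# Toy version of the tag-vs-thesaurus comparison; the tags and descriptors
# below are invented (the study used author tags from 127 real blogs and
# the ISOC-BD descriptor and identifier lists).

def new_terms(candidate_terms, thesaurus):
    """Normalise both term sets and return the candidates that are missing
    from the controlled vocabulary, i.e. potential new thesaurus entries."""
    known = {t.strip().lower() for t in thesaurus}
    return sorted({t.strip().lower() for t in candidate_terms} - known)

blog_tags = {"open access", "linked data", "metadata", "altmetrics"}
descriptors = {"Metadata", "Open access", "Cataloguing"}
print(new_terms(blog_tags, descriptors))  # ['altmetrics', 'linked data']
```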

    Semantic annotation for recommending educational content

    Advisor: Julio Cesar dos Reis. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação.
    Abstract (translated from the Portuguese): Learning support systems explore diverse multimedia resources to account for students' individual needs as well as different learning styles. However, the growing amount of educational content available in different formats and in fragmented form hinders access to, and comprehension of, the concepts under study. Although the literature has proposed approaches exploring recommendation techniques that allow explicit representation of semantics through artefacts such as ontologies, this line has not been fully explored and still requires substantial research effort. This research aims to design a method for recommending educational content by exploring semantic annotations over textual transcriptions of video lectures. The annotations serve as metadata expressing the meaning of lecture excerpts. The recommendation technique, as the main expected contribution, builds on the available annotations to define ranking strategies for the available content, based on the semantic proximity of concepts combined with machine-learning techniques. The contribution involves developing functional software prototypes for experimental validation on real video-lecture content, highlighting the main advantages and limitations of the approach. The results obtained will enable more suitable recommendations to improve the learning process, offering students the possibility of a more satisfactory experience.
    Abstract: Learning support systems explore several audio-visual resources to consider individual needs and learning styles, aiming to stimulate learning experiences. However, the large amount of online educational content in different formats, and the possibility of making it available in a fragmented way, makes it difficult to access these resources and understand the concepts under study. Although the literature has proposed approaches to explore explicit semantic representation through artifacts such as ontologies in learning support systems, this research line still requires further investigation effort. In this M.Sc. dissertation, we propose a method for recommending educational content by exploring the use of semantic annotations over textual transcriptions from video lectures. Our investigation addresses the difficulties in extracting entities from natural-language texts in video subtitles. Our work studies how to refine concepts in a domain ontology to support semantic annotation of video-lecture subtitles. We report on the design of a video-lecture recommendation system which explores the extracted semantic annotations. Our solution explored videos semantically annotated with an ontology in the Computer Science domain. The results obtained indicate that our recommendation mechanism is suited to filtering relevant video content in different use scenarios.
    Master's degree, Computer Science. Grants 2017/02325-5; 2018/00313-2, FAPES
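The annotation-based ranking idea can be sketched with a toy similarity measure. This is a minimal illustration, using Jaccard overlap between set-valued annotations in place of the dissertation's ontology-based semantic proximity combined with machine learning; all video ids and concept labels are hypothetical.

```python
# Toy annotation-based ranking: Jaccard overlap between concept sets stands
# in for the dissertation's ontology-based semantic proximity combined with
# machine learning. Video ids and concept labels are hypothetical.

def jaccard(a, b):
    """Overlap between two annotation sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_videos(query_concepts, videos):
    """Order video ids by the overlap between their semantic annotations
    and the concepts the learner is currently studying."""
    return sorted(videos, key=lambda v: jaccard(query_concepts, videos[v]),
                  reverse=True)

videos = {
    "lecture1": {"Graph", "BFS", "Queue"},
    "lecture2": {"Sorting", "Recursion"},
    "lecture3": {"Graph", "DFS"},
}
print(rank_videos({"Graph", "DFS", "Stack"}, videos))  # ['lecture3', 'lecture1', 'lecture2']
```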

    Semantic Enrichment of Ontology Mappings

    Schema and ontology matching play an important role in the fields of data integration and the Semantic Web. Given two heterogeneous data sources, metadata matching usually constitutes the first step in the data integration workflow: the analysis and comparison of two input resources such as schemas or ontologies. The result is a list of correspondences between the two schemas or ontologies, often called a mapping or alignment. Many tools and research approaches have been proposed to determine those correspondences automatically. However, most match tools do not provide any information about the relation type that holds between matching concepts, for the simple but important reason that most common match strategies are too simple and heuristic to allow any sophisticated relation-type determination. Knowing the specific type holding between two concepts, e.g., whether they are in an equality, subsumption (is-a) or part-of relation, is very important for advanced data integration tasks such as ontology merging or ontology evolution. It is also very important for mappings in the biological or biomedical domain, where is-a and part-of relations may far exceed the number of equality correspondences. Such more expressive mappings allow much better integration results and have scarcely been the focus of research so far.
    In this doctoral thesis, the focus of interest is the determination of the correspondence types in a given mapping, which is referred to as semantic mapping enrichment. We introduce and present the mapping enrichment tool STROMA, which obtains a pre-calculated schema or ontology mapping and determines a semantic relation type for each correspondence. In contrast to previous approaches, we strongly focus on linguistic laws and linguistic insights. By and large, linguistics is the key to precise matching and to the determination of relation types. We introduce various strategies that make use of these linguistic laws and are able to calculate the semantic type between two matching concepts. The observations and insights gained from this research go far beyond the field of mapping enrichment and can also be applied to schema and ontology matching in general.
    Since generic strategies have certain limits and may not be able to determine the relation type between more complex concepts, such as a laptop and a personal computer, background knowledge also plays an important role in this research. For example, a thesaurus can help to recognise that these two concepts are in an is-a relation. We show how background knowledge can be used effectively in this instance, how it is possible to draw conclusions even if a concept is not contained in it, how the relation types in complex paths can be resolved, and how time complexity can be reduced by a so-called bidirectional search. The developed techniques go far beyond the background-knowledge exploitation of previous approaches, and are now part of the semantic repository SemRep, a flexible and extendable system that combines different lexicographic resources. Furthermore, we show how additional lexicographic resources can be developed automatically by parsing Wikipedia articles. The proposed Wikipedia relation extraction approach yields several million additional relations, which constitute significant additional knowledge for mapping enrichment. The extracted relations were also added to SemRep, which thus became a comprehensive background-knowledge resource. To improve the quality of the repository, different techniques were used to discover and delete irrelevant semantic relations.
    We show in several experiments that STROMA obtains very good results with respect to relation-type detection. In a comparative evaluation, it achieved considerably better results than related applications. This corroborates the overall usefulness and strengths of the implemented strategies, which were developed with particular emphasis on the principles and laws of linguistics.
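One linguistic rule of the kind such strategies can exploit is compound-head matching: the head of an English noun compound is its last word, so a longer compound sharing a head with a shorter label is usually its subtype. A minimal sketch of this single rule, assuming plain string labels (STROMA itself combines several strategies plus background knowledge such as SemRep):

```python
# One linguistic rule in isolation: the head of an English noun compound is
# its last word, so "personal computer" is-a "computer". Simplified sketch;
# STROMA combines many strategies plus background knowledge (e.g. SemRep).

def relation_type(a, b):
    """Guess the semantic relation type between two concept labels."""
    ta, tb = a.lower().split(), b.lower().split()
    if ta == tb:
        return "equal"
    if ta[-1] == tb[-1] and len(ta) > len(tb):
        return "is-a"          # a is the more specific compound
    if ta[-1] == tb[-1] and len(ta) < len(tb):
        return "inverse is-a"  # b is the more specific compound
    return "related"           # this rule alone cannot decide

print(relation_type("personal computer", "computer"))  # is-a
print(relation_type("laptop", "computer"))             # related
```

The second call shows the limit the thesis points out: a pair like laptop/computer shares no surface material, so only background knowledge can resolve it to is-a.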

    Extracting Semantic Concept Relations from Wikipedia

    Background knowledge as provided by repositories such as WordNet is of critical importance for linking or mapping ontologies and related tasks. Since current repositories are quite limited in their scope and currentness, we investigate how to automatically build up improved repositories by extracting semantic relations (e.g., is-a and part-of relations) from Wikipedia articles. Our approach uses a comprehensive set of semantic patterns, finite state machines and NLP techniques to process Wikipedia definitions and to identify semantic relations between concepts. Our approach is able to extract multiple relations from a single Wikipedia article. An evaluation for different domains shows the high quality and effectiveness of the proposed approach.
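The simplest such semantic pattern is the definition sentence "X is a Y". A minimal regex sketch of that single pattern, with invented example sentences (the actual approach uses a comprehensive pattern set, finite state machines and NLP techniques, not a lone regex):

```python
import re

# A single definition pattern ('X is a Y'); the paper's approach uses a
# comprehensive pattern set, finite state machines and NLP techniques.
# Example sentences are invented.

IS_A = re.compile(r"^(?P<concept>.+?) (?:is|are) (?:a|an|the) (?P<cls>[\w -]+)")

def extract_is_a(definition):
    """Return a (concept, 'is-a', class) triple from a definition sentence,
    or None if the pattern does not apply."""
    m = IS_A.match(definition)
    if not m:
        return None
    return (m.group("concept"), "is-a", m.group("cls").strip())

print(extract_is_a("Python is a programming language"))
# ('Python', 'is-a', 'programming language')
print(extract_is_a("No definition here"))  # None
```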