Proceedings of the Workshop Semantic Content Acquisition and Representation (SCAR) 2007
This is the proceedings of the Workshop on Semantic Content Acquisition and Representation, held in conjunction with NODALIDA 2007, on May 24, 2007, in Tartu, Estonia.
A Systematic Study of Knowledge Graph Analysis for Cross-language Plagiarism Detection
This is the author's version of a work that was accepted for publication in Information Processing and Management. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information Processing and Management 52 (2016) 550–570. DOI 10.1016/j.ipm.2015.12.004
Cross-language plagiarism detection aims to detect plagiarised fragments of text among
documents in different languages. In this paper, we perform a systematic examination of
Cross-language Knowledge Graph Analysis, an approach that represents text fragments using
knowledge graphs as a language independent content model. We analyse the contributions
to cross-language plagiarism detection of the different aspects covered by knowledge
graphs: word sense disambiguation, vocabulary expansion, and representation by similarities
with a collection of concepts. In addition, we study both the relevance of concepts and
their relations when detecting plagiarism. Finally, as a key component of the knowledge
graph construction, we present a new weighting scheme of relations between concepts
based on distributed representations of concepts. Experimental results in Spanish–English
and German–English plagiarism detection show state-of-the-art performance and provide
interesting insights on the use of knowledge graphs.
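The weighting scheme sketched above can be illustrated minimally: the weight of a relation between two concepts is taken as the cosine similarity of their distributed representations. The concept names and vectors below are hypothetical placeholders, not the paper's actual embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical concept embeddings; the paper learns distributed
# representations of concepts, which these placeholders only mimic.
embeddings = {
    "bank_institution": [0.9, 0.1, 0.2],
    "money":            [0.8, 0.2, 0.3],
    "river":            [0.1, 0.9, 0.4],
}

def relation_weight(c1, c2, emb):
    """Weight the relation (edge) between two concepts by the
    similarity of their distributed representations."""
    return cosine(emb[c1], emb[c2])

# Semantically related concepts receive a heavier edge than
# unrelated ones, which is the point of the weighting scheme.
w_related = relation_weight("bank_institution", "money", embeddings)
w_unrelated = relation_weight("bank_institution", "river", embeddings)
```

Under this sketch, edges of the knowledge graph connecting closely related concepts dominate the similarity analysis, while weak edges contribute little.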
© 2015 Elsevier Ltd. All rights reserved.This research has been carried out in the framework of the European Commission WIQ-EI IRSES (No. 269180) and DIANA APPLICATIONS - Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) projects. We would like to thank Tomas Mikolov, Martin Potthast, and Luis A. Leiva for their support and comments during this research.Franco-Salvador, M.; Rosso, P.; Montes Gomez, M. (2016). A Systematic Study of Knowledge Graph Analysis for Cross-language Plagiarism Detection. Information Processing and Management. 52(4):550-570. https://doi.org/10.1016/j.ipm.2015.12.004S55057052
A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning
Thesis by compendium.
Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human languages. One of its most challenging aspects involves enabling computers to derive meaning from human natural language. To do so, several meaning or context representations have been proposed with competitive performance. However, these representations still have room for improvement when working in a cross-domain or cross-language scenario.
In this thesis we study the use of knowledge graphs as a cross-domain and cross-language representation of text and its meaning. A knowledge graph is a graph that expands and relates the original concepts belonging to a set of words. We obtain its characteristics using a wide-coverage multilingual semantic network as knowledge base. This provides coverage of hundreds of languages and millions of general and specific human concepts.
As the starting point of our research we employ knowledge graph-based features, along with other traditional ones and meta-learning, for the NLP task of single- and cross-domain polarity classification. The analysis and conclusions of that work provide evidence that knowledge graphs capture meaning in a domain-independent way.
The next part of our research takes advantage of the multilingual semantic network and focuses on cross-language Information Retrieval (IR) tasks. First, we propose a fully knowledge graph-based model of similarity analysis for cross-language plagiarism detection. Next, we improve that model to cover out-of-vocabulary words and verbal tenses and apply it to cross-language document retrieval, categorisation, and plagiarism detection. Finally, we study the use of knowledge graphs for the NLP tasks of community question answering, native language identification, and language variety identification.
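The kind of knowledge graph-based similarity analysis described above can be sketched as a toy example. The concept IDs and weights below are invented placeholders; a real system would obtain language-independent concepts for each fragment from a wide-coverage multilingual semantic network:

```python
# Each text fragment, whatever its language, is expanded to a set of
# language-independent concept IDs with weights (all values below are
# hypothetical placeholders for illustration only).
FRAGMENT_CONCEPTS = {
    "en": {"concept:car": 0.9, "concept:road": 0.6, "concept:speed": 0.4},
    "es": {"concept:car": 0.8, "concept:road": 0.5, "concept:music": 0.3},
}

def graph_similarity(g1, g2):
    """Weighted-overlap similarity of two concept sets: a simplified
    stand-in for comparing full knowledge graphs across languages."""
    shared = set(g1) & set(g2)
    overlap = sum(min(g1[c], g2[c]) for c in shared)
    total = sum(g1.values()) + sum(g2.values()) - overlap
    return overlap / total if total else 0.0

# Fragments in different languages sharing concepts score high.
sim = graph_similarity(FRAGMENT_CONCEPTS["en"], FRAGMENT_CONCEPTS["es"])
```

Because the comparison happens in concept space rather than word space, the same machinery serves cross-language retrieval, categorisation, and plagiarism detection.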
The contributions of this thesis manifest the potential of knowledge graphs as a cross-domain and cross-language representation of text and its meaning for NLP and IR tasks. These contributions have been published in several international conferences and journals.
Franco Salvador, M. (2017). A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/84285
Reusing Ontologies to Enrich Semantically User Content in Web2.0: A Case Study on Folksonomies
Semantic Web and Web2.0 emerged during the past decade promising to achieve new frontiers for the Web. On the one hand, the Semantic Web is an interlinked web of data, supported by ontological semantics and allowing for intelligent applications such as semantic search and integration of heterogeneous content across systems and applications. On the other hand, Web2.0 represents the new technologies and paradigms that revolutionised user engagement in content creation and introduced novel means of social interaction. Bridging the gap between Web2.0 and the Semantic Web has been proposed as a means to better manage and interact with the large amounts of user-contributed content, which is a new challenge for Web2.0. This thesis focuses on a popular paradigm of Web2.0, folksonomies. In particular, we investigate the semantic enrichment of folksonomy tagspaces by reusing ontologies available in the Semantic Web. We identify the need for methods that automatically apply semantic descriptions to user-generated content without requiring user intervention or alteration of the current tagging paradigm. We use an iterative approach in order to identify the characteristics of folksonomies and the attributes of knowledge sources that influence the semantic enrichment of tagspaces. We build on the results of our experimental studies to implement a folksonomy enrichment algorithm that, given an input tagspace, automatically creates a semantic structure describing the meaning and relations of tags. We introduce measures for the evaluation of enriched tagspaces and finally, we propose a search algorithm that exploits the semantic structures to improve folksonomy search.
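The enrichment idea can be illustrated with a toy sketch: free-form tags are mapped, without user intervention, to candidate ontology concepts and their broader terms. The ontology and tag-to-concept mappings below are hypothetical stand-ins, not the thesis's actual knowledge sources:

```python
# A toy reused "ontology": each concept points to a broader concept,
# mimicking SKOS-style broader relations (hypothetical data).
ONTOLOGY = {
    "jaguar_animal": {"broader": "mammal"},
    "jaguar_car":    {"broader": "vehicle"},
    "mammal":        {"broader": "animal"},
}

# Hypothetical mapping from raw folksonomy tags to ontology concepts;
# ambiguous tags map to several candidate concepts.
TAG_TO_CONCEPTS = {
    "jaguar": ["jaguar_animal", "jaguar_car"],
    "cat":    ["mammal"],
}

def enrich_tagspace(tags):
    """Attach candidate ontology concepts and their broader terms to
    each tag, leaving the user's tagging behaviour untouched."""
    enriched = {}
    for tag in tags:
        concepts = TAG_TO_CONCEPTS.get(tag.lower(), [])
        enriched[tag] = [
            {"concept": c, "broader": ONTOLOGY[c]["broader"]}
            for c in concepts
        ]
    return enriched

# Tags without a known concept simply remain unenriched.
result = enrich_tagspace(["jaguar", "cat", "sunset"])
```

The resulting semantic structure is what a search algorithm can then exploit, e.g. matching a query for "mammal" against content tagged only "cat".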
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field.
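The term-document class of VSMs described above can be illustrated with a minimal sketch: each document becomes a term-frequency vector, and documents are compared by cosine similarity. Real systems add tf-idf weighting and dimensionality reduction; the documents here are invented examples:

```python
from collections import Counter
import math

# Toy corpus (hypothetical documents). In a term-document matrix,
# rows are terms, columns are documents, cells are frequencies;
# a Counter per document represents one (sparse) column.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "stock markets fell sharply",
}

vectors = {name: Counter(text.split()) for name, text in docs.items()}

def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# d1 and d2 share vocabulary and score high; d1 and d3 share none.
sim_close = cosine(vectors["d1"], vectors["d2"])
sim_far = cosine(vectors["d1"], vectors["d3"])
```

Word-context and pair-pattern matrices follow the same pattern with different rows and columns: terms in context windows, and word pairs against the patterns that link them.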
Mining Meaning from Wikipedia
Wikipedia is a goldmine of information; not just for its many readers, but
also for the growing community of researchers who recognize it as a resource of
exceptional scale and utility. It represents a vast investment of manual effort
and judgment: a huge, constantly evolving tapestry of concepts and relations
that is being applied to a host of tasks.
This article provides a comprehensive description of this work. It focuses on
research that extracts and makes use of the concepts, relations, facts and
descriptions found in Wikipedia, and organizes the work into four broad
categories: applying Wikipedia to natural language processing; using it to
facilitate information retrieval; using it for information extraction; and as a resource
for ontology building. The article addresses how Wikipedia is being used as is,
how it is being improved and adapted, and how it is being combined with other
structures to create entirely new resources. We identify the research groups
and individuals involved, and how their work has developed in the last few
years. We provide a comprehensive list of the open-source software they have
produced.
Comment: An extensive survey of re-using information in Wikipedia in natural
language processing, information retrieval and extraction, and ontology
building. Accepted for publication in International Journal of Human-Computer
Studies.