A Systematic Study of Knowledge Graph Analysis for Cross-language Plagiarism Detection
This is the author’s version of a work that was accepted for publication in Information Processing and Management. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information Processing and Management 52 (2016) 550–570. DOI 10.1016/j.ipm.2015.12.004
Cross-language plagiarism detection aims to detect plagiarised fragments of text among
documents in different languages. In this paper, we perform a systematic examination of
Cross-language Knowledge Graph Analysis, an approach that represents text fragments using
knowledge graphs as a language independent content model. We analyse the contributions
to cross-language plagiarism detection of the different aspects covered by knowledge
graphs: word sense disambiguation, vocabulary expansion, and representation by similarities
with a collection of concepts. In addition, we study both the relevance of concepts and
their relations when detecting plagiarism. Finally, as a key component of the knowledge
graph construction, we present a new weighting scheme of relations between concepts
based on distributed representations of concepts. Experimental results in Spanish–English
and German–English plagiarism detection show state-of-the-art performance and provide
interesting insights on the use of knowledge graphs.
© 2015 Elsevier Ltd. All rights reserved.
This research has been carried out in the framework of the European Commission WIQ-EI IRSES (No. 269180) and DIANA APPLICATIONS - Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) projects. We would like to thank Tomas Mikolov, Martin Potthast, and Luis A. Leiva for their support and comments during this research.
Franco-Salvador, M.; Rosso, P.; Montes Gomez, M. (2016). A Systematic Study of Knowledge Graph Analysis for Cross-language Plagiarism Detection. Information Processing and Management. 52(4):550-570. https://doi.org/10.1016/j.ipm.2015.12.004
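The abstract above mentions weighting relations between concepts using distributed representations of those concepts. The following is a minimal sketch of one way such a weighting could look, assuming that a relation's weight is the cosine similarity between the two concept vectors. The abstract does not give the actual formula, so this choice, the function names `cosine` and `weight_relations`, and the toy two-dimensional vectors are all illustrative assumptions rather than the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weight_relations(relations, embeddings):
    """Map each (concept_a, concept_b) relation to the cosine of the two concept vectors."""
    return {(a, b): cosine(embeddings[a], embeddings[b]) for a, b in relations}


# Toy, hand-made concept vectors (assumption: real ones would be learned from data).
TOY_EMBEDDINGS = {
    "plagiarism": [1.0, 0.0],
    "copying": [1.0, 0.0],
    "banana": [0.0, 1.0],
}
TOY_RELATIONS = [("plagiarism", "copying"), ("plagiarism", "banana")]

weights = weight_relations(TOY_RELATIONS, TOY_EMBEDDINGS)
```

In this toy setting the relation between closely related concepts receives a high weight and the relation to an unrelated concept receives a weight of zero, which is the intuition behind using concept vectors to score graph edges.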
ETRANS: A English-Thai translator
ETRANS is an experimental English-Thai machine translation (MT) system that translates a simple English sentence into a grammatically correct Thai sentence. The entire system is written in C-Prolog and runs on UNIX systems. ETRANS follows an interlingual MT strategy, with a parser for English and a generator for Thai. The parser creates a semantic representation equivalent to the meaning of the English sentence. The generator then renders that semantic representation in Thai. ETRANS employs frames as a means of representing knowledge, and an augmented transition network (ATN) as the linguistic framework for analyzing and generating sentences.
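The interlingual strategy described above (parse the source sentence into a language-neutral frame, then generate the target sentence from that frame) can be sketched as follows. This is only an illustration in Python, not the C-Prolog system itself: the three-word pattern matcher stands in for the ATN parser, and the frame slots, the function names `parse` and `generate_thai`, and the three-entry lexicon are all assumptions made for the example.

```python
def parse(sentence):
    """Parse a three-word 'agent action object' English sentence into a frame.

    A stand-in for a real parser: it only handles this single pattern.
    """
    words = sentence.lower().rstrip(".").split()
    if len(words) != 3:
        raise ValueError("this toy parser only handles three-word sentences")
    agent, action, obj = words
    return {"action": action, "agent": agent, "object": obj}


# Toy lexicon from frame fillers to Thai words (hypothetical, three entries only).
TOY_THAI_LEXICON = {"i": "ฉัน", "eat": "กิน", "rice": "ข้าว"}


def generate_thai(frame):
    """Generate a Thai sentence from a frame in agent-action-object order.

    Thai is written without spaces between words, so the words are concatenated.
    """
    return "".join(TOY_THAI_LEXICON[frame[slot]] for slot in ("agent", "action", "object"))


frame = parse("I eat rice.")
thai = generate_thai(frame)
```

The point of the design is that the parser and the generator share only the frame, so either side could in principle be replaced without touching the other.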
A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning
Thesis by compendium of publications.
Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human languages. One of its most challenging aspects involves enabling computers to derive meaning from human natural language. To do so, several meaning or context representations have been proposed with competitive performance. However, these representations still have room for improvement when working in a cross-domain or cross-language scenario.
In this thesis we study the use of knowledge graphs as a cross-domain and cross-language representation of text and its meaning. A knowledge graph is a graph that expands and relates the original concepts belonging to a set of words. We obtain its characteristics using a wide-coverage multilingual semantic network as the knowledge base. This provides coverage of hundreds of languages and millions of general and domain-specific concepts.
As the starting point of our research we employ knowledge graph-based features - along with other traditional ones and meta-learning - for the NLP task of single- and cross-domain polarity classification. The analysis and conclusions of that work provide evidence that knowledge graphs capture meaning in a domain-independent way.
The next part of our research takes advantage of the multilingual semantic network and focuses on cross-language Information Retrieval (IR) tasks. First, we propose a fully knowledge graph-based model of similarity analysis for cross-language plagiarism detection. Next, we improve that model to cover out-of-vocabulary words and verbal tenses and apply it to cross-language document retrieval, categorisation, and plagiarism detection. Finally, we study the use of knowledge graphs for the NLP tasks of community question answering, native language identification, and language variety identification.
The contributions of this thesis demonstrate the potential of knowledge graphs as a cross-domain and cross-language representation of text and its meaning for NLP and IR tasks. These contributions have been published in several international conferences and journals.
Franco Salvador, M. (2017). A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/84285
The VERBMOBIL domain model version 1.0
This report describes the domain model used in the German Machine Translation project VERBMOBIL. In order to make the design principles underlying the modeling explicit, we begin with a brief sketch of the VERBMOBIL demonstrator architecture from the perspective of the domain model. We then present some rather general considerations on the nature of domain modeling and its relationship to semantics. We claim that the semantic information contained in the model mainly serves two tasks. On the one hand, it provides the basis for a conceptual transfer from German to English; on the other hand, it provides information needed for disambiguation. We argue that these tasks pose different requirements, and that domain modeling in general is highly task-dependent. A brief overview of domain models or ontologies used in existing NLP systems confirms this position. We finally describe the different parts of the domain model, explain our design decisions, and present examples of how the information contained in the model can actually be used in the VERBMOBIL demonstrator. In doing so, we also point out the main functionality of FLEX, the Description Logic system used for the modeling.
A Call for Executable Linguistics Research
PACLIC / The University of the Philippines Visayas Cebu College Cebu City, Philippines / November 20-22, 200
SCREEN: Learning a Flat Syntactic and Semantic Spoken Language Analysis Using Artificial Neural Networks
In this paper, we describe a so-called screening approach for learning robust
processing of spontaneously spoken language. A screening approach is a flat
analysis which uses shallow sequences of category representations for analyzing
an utterance at various syntactic, semantic and dialog levels. Rather than
using a deeply structured symbolic analysis, we use a flat connectionist
analysis. This screening approach aims at supporting speech and language
processing by using (1) data-driven learning and (2) robustness of
connectionist networks. In order to test this approach, we have developed the
SCREEN system which is based on this new robust, learned and flat analysis.
In this paper, we focus on a detailed description of SCREEN's architecture,
the flat syntactic and semantic analysis, the interaction with a speech
recognizer, and a detailed evaluation analysis of the robustness under the
influence of noisy or incomplete input. The main result of this paper is that
flat representations allow more robust processing of spontaneous spoken
language than deeply structured representations. In particular, we show how the
fault-tolerance and learning capability of connectionist networks can support a
flat analysis for providing more robust spoken-language processing within an
overall hybrid symbolic/connectionist framework.
Comment: 51 pages, Postscript. To be published in Journal of Artificial Intelligence Research 6(1), 199
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality.
Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Researc
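The meaning conflation deficiency mentioned in the abstract above can be illustrated with a toy example. The sketch below assumes, purely for illustration, that a word vector behaves like the average of its sense vectors; real word embeddings are learned from corpora rather than averaged, and all vectors and names here (`bank_finance`, `bank_river`, `money`, `river`) are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy sense vectors for the two senses of "bank" (assumed, two dimensions only).
bank_finance = [1.0, 0.0]
bank_river = [0.0, 1.0]

# Assumption: a single word vector for "bank" mixes both senses equally.
bank_word = [(a + b) / 2 for a, b in zip(bank_finance, bank_river)]

money = [1.0, 0.0]
river = [0.0, 1.0]

# The conflated word vector sits equally close to both contexts ...
conflated_to_money = cosine(bank_word, money)
conflated_to_river = cosine(bank_word, river)

# ... while the sense vectors separate the two meanings cleanly.
finance_to_money = cosine(bank_finance, money)
river_sense_to_money = cosine(bank_river, money)
```

With one vector per word, "bank" cannot be closer to "money" than to "river"; with one vector per sense, the financial sense matches "money" and the river sense does not.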