    Rapport: a fact-based question answering system for Portuguese

    Question answering is one of the longest-standing problems in natural language processing. Although natural language interfaces for computer systems are more common these days, the same cannot yet be said of access to specific textual information. Any full-text search engine can easily retrieve documents containing user-specified or closely related terms; however, it is typically unable to answer user questions with small passages or short answers. The problem with question answering is that text is hard to process, due to its syntactic structure and, to a higher degree, to its semantic contents. At the sentence level, although the syntactic aspects of natural language have well-known rules, the size and complexity of a sentence may make it difficult to analyze its structure. Furthermore, semantic aspects are still arduous to address, with text ambiguity being one of the hardest problems to handle. There is also the need to correctly process the question in order to define its target, and then select and process the answers found in a text. Additionally, the selected text that may yield the answer to a given question must be further processed in order to present just a passage instead of the full text. These issues also take longer to address in languages other than English, such as Portuguese, which have far fewer people working on them. This work focuses on question answering for Portuguese. In other words, our field of interest is the presentation of short answers, passages, and possibly full sentences, but not whole documents, in response to questions formulated in natural language. For that purpose, we have developed a system, RAPPORT, built upon open information extraction techniques for extracting triples, so-called facts, that characterize the information in text files, and then storing and using them to answer user queries posed in natural language. 
These facts, in the form of subject, predicate and object, alongside other metadata, constitute the basis of the answers presented by the system. Facts work both by storing short and direct information found in a text, typically entity-related information, and by containing in themselves the answers to the questions, already in the form of small passages. As for the results, although there is room for improvement, they are tangible proof of the adequacy of our approach and its different modules for storing information and retrieving answers in question answering systems. In the process, in addition to contributing a new approach to question answering for Portuguese, and validating the application of open information extraction to question answering, we have developed a set of tools that has been used in other natural language processing works, such as a lemmatizer, LEMPORT, which was built from scratch and has high accuracy. Many of these tools result from improving those found in the Apache OpenNLP toolkit, by pre-processing their input, post-processing their output, or both, and by training models for use in those tools or others, such as MaltParser. Other tools include interfaces for other resources containing, for example, synonyms, hypernyms, and hyponyms, or the creation of lists of, for instance, relations between verbs and agents, using rules

    Data-driven machine translation for sign languages

    This thesis explores the application of data-driven machine translation (MT) to sign languages (SLs). The provision of an SL MT system can facilitate communication between Deaf and hearing people by translating information into the native and preferred language of the individual. We begin with an introduction to SLs, focussing on Irish Sign Language - the native language of the Deaf in Ireland. We describe their linguistics and mechanics including similarities and differences with spoken languages. Given the lack of a formalised written form of these languages, an outline of annotation formats is discussed as well as the issue of data collection. We summarise previous approaches to SL MT, highlighting the pros and cons of each approach. Initial experiments in the novel area of example-based MT for SLs are discussed and an overview of the problems that arise when automatically translating these manual-visual languages is given. Following this we detail our data-driven approach, examining the MT system used and modifications made for the treatment of SLs and their annotation. Through sets of automatically evaluated experiments in both language directions, we consider the merits of data-driven MT for SLs and outline the mainstream evaluation metrics used. To complete the translation into SLs, we discuss the addition and manual evaluation of a signing avatar for real SL output

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.

    Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2). 29 November 2012, Lisbon, Portugal

    Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), held in Lisbon, Portugal on 29 November 2012

    Proceedings of the Conference on Natural Language Processing 2010

    This book contains state-of-the-art contributions to the 10th conference on Natural Language Processing, KONVENS 2010 (Konferenz zur Verarbeitung natürlicher Sprache), with a focus on semantic processing. KONVENS in general aims at offering a broad perspective on current research and developments within the interdisciplinary field of natural language processing. The central theme draws specific attention to linguistic aspects of meaning, covering deep as well as shallow approaches to semantic processing. The contributions address both knowledge-based and data-driven methods for modelling and acquiring semantic information, and discuss the role of semantic information in applications of language technology. The articles demonstrate the importance of semantic processing, and present novel and creative approaches to natural language processing in general. Some contributions focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, on semantic knowledge acquisition and exploitation with respect to collaboratively built resources, or on harvesting semantic information in virtual games. Others are set within the context of real-world applications, such as Authoring Aids, Text Summarisation and Information Retrieval. The collection highlights the importance of semantic processing for different areas and applications in Natural Language Processing, and provides the reader with an overview of current research in this field

    Quantity and Quality: Not a Zero-Sum Game

    Quantification of existing theories is a great challenge but also a great opportunity for the study of language in the brain. While quantification is necessary for the development of precise theories, it demands new methods and new perspectives. In light of this, four complementary methods were introduced to provide a quantitative and computational account of the extended Argument Dependency Model (eADM) from Bornkessel-Schlesewsky and Schlesewsky. First, a computational model of human language comprehension was introduced on the basis of dependency parsing. This model provided an initial comparison of two potential mechanisms for human language processing: the traditional "subject" strategy, based on grammatical relations, and the "actor" strategy, based on prominence and adopted from the eADM. Initial results showed an advantage for the traditional "subject" model in a restricted context; however, the "actor" model demonstrated behavior in a test run that was more similar to human behavior than that of the "subject" model. Next, a computational-quantitative implementation of the "actor" strategy as weighted feature comparison between memory units was used to compare it to other memory-based models from the literature on the basis of EEG data. The "actor" strategy clearly provided the best model, showing a better global fit as well as a better match in all details. Building upon the success in modeling EEG data, the feasibility of estimating free parameters from empirical data was demonstrated. Both the procedure for doing so and the necessary software were introduced and applied at the level of individual participants. Using empirically estimated parameters, the models from the previous EEG experiment were calculated again and yielded similar results, thus reinforcing the previous work. In a final experiment, the feasibility of analyzing EEG data from a naturalistic auditory stimulus was demonstrated, which conventional wisdom says is not possible. 
The analysis suggested a new perspective on the nature of event-related potentials (ERPs), which does not contradict existing theory yet nonetheless goes against previous intuition. Using this new perspective as a basis, a preliminary attempt at a parsimonious neurocomputational theory of cognitive ERP components was developed

    A Syntactical Reverse Engineering Approach to Fourth Generation Programming Languages Using Formal Methods

    Fourth-generation programming languages (4GLs) feature rapid development with minimum configuration required by developers. However, 4GLs can suffer from limitations such as high maintenance cost and legacy software practices. Reverse engineering an existing large legacy 4GL system into a currently maintainable programming language can be a cheaper and more effective solution than rewriting from scratch. So far, no tools exist for reverse engineering proprietary XML-like and model-driven 4GLs where the full language specification is not in the public domain. This research has developed a novel method of reverse engineering some of the syntax of such 4GLs (with Uniface as an exemplar) derived from a particular system, with a view to providing a reliable method to translate/transpile that system's code and data structures into a modern object-oriented language (such as C#). The method was also applied, although only to a limited extent, to some other 4GLs, Informix and Apex, to show that it was in principle more broadly applicable. A novel method for testing that the syntax had been successfully translated was provided, using abstract syntax trees. The novel method took manually crafted grammar rules, together with Encapsulated Document Object Model based data from the source language, and then used parsers to produce syntactically valid and equivalent code in the target/output language. This proof-of-concept research has provided a methodology plus sample code to automate part of the process. The methodology comprised a set of manual or semi-automated steps. Further automation is left for future research. In principle, the author's method could be extended to allow the reverse engineering recovery of the syntax of systems developed in other proprietary 4GLs. This would reduce time and cost for the ongoing maintenance of such systems by enabling their software engineers to work using modern object-oriented languages, methodologies, tools and techniques

    Parallel evaluation strategies for lazy data structures in Haskell

    Conventional parallel programming is complex and error-prone. To improve programmer productivity, we need to raise the level of abstraction with a higher-level programming model that hides many parallel coordination aspects. Evaluation strategies use non-strictness to separate the coordination and computation aspects of a Glasgow parallel Haskell (GpH) program. This allows the specification of high-level parallel programs, eliminating the low-level complexity of synchronisation and communication associated with parallel programming. This thesis employs a data-structure-driven approach for parallelism derived through generic parallel traversal and evaluation of sub-components of data structures. We focus on evaluation strategies over list, tree and graph data structures, allowing re-use across applications with minimal changes to the sequential algorithm. In particular, we develop novel evaluation strategies for tree data structures, using core functional programming techniques for coordination control, achieving more flexible parallelism. We use non-strictness to control parallelism more flexibly. We apply the notion of fuel as a resource that dictates parallelism generation, in particular the bi-directional flow of fuel in a tree structure, implemented using a circular program definition, as a novel way of controlling parallel evaluation. This is the first use of circular programming in evaluation strategies and is complemented by a lazy function for bounding the size of sub-trees. We extend these control mechanisms to graph structures and demonstrate performance improvements on several parallel graph traversals. We combine circularity for control, which improves the performance of strategies, with circularity for computation using circular data structures. 
In particular, we develop a hybrid traversal strategy for graphs, exploiting breadth-first order for exposing parallelism initially, and then proceeding with a depth-first order to minimise overhead associated with a full parallel breadth-first traversal. The efficiency of the tree strategies is evaluated on a benchmark program, and two non-trivial case studies: a Barnes-Hut algorithm for the n-body problem and sparse matrix multiplication, both using quad-trees. We also evaluate a graph search algorithm implemented using the various traversal strategies. We demonstrate improved performance on a server-class multicore machine with up to 48 cores, with the advanced fuel splitting mechanisms proving to be more flexible in throttling parallelism. To guide the behaviour of the strategies, we develop heuristics-based parameter selection to select their specific control parameters