106 research outputs found

    Islands in the grammar? Standards of evidence

    When we consider how a complex system operates, its observable behavior depends upon both the architectural properties of the system and the principles governing its operation. As a simple example, the behavior of computer chess programs depends upon both the processing speed and resources of the computer and the programmed rules that determine how the computer selects its next move. Despite having very similar search techniques, a computer from the 1990s might make a move that its 1970s forerunner would overlook simply because it had more raw computational power. From the naïve observer’s perspective, however, it is not superficially evident whether a particular move is dispreferred or overlooked because of computational limitations or because of the search strategy and decision algorithm. In the case of computers, evidence for the source of any particular behavior can ultimately be found by inspecting the code and tracking the decision process of the computer. But with the human mind, such options are not yet available. The preference for certain behaviors and the dispreference for others may theoretically follow from cognitive limitations, from task-related principles that preclude certain kinds of cognitive operations, or from some combination of the two. This uncertainty gives rise to the fundamental problem of finding evidence for one explanation over the other. Such a problem arises in the analysis of syntactic island effects – the focus of this work.
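
As a toy illustration of the chess analogy (my example, not the author's): the same depth-limited minimax decision rule, run over the same toy game tree, prefers different moves depending only on how deeply it is allowed to search; identical programmed rules plus different computational resources yield different observable behaviour.

# Toy game tree: each internal node maps move labels to child nodes.
CHILDREN = {
    "R": {"a": "A", "b": "B"},
    "A": {"a1": "A1", "a2": "A2"},
    "B": {"b1": "B1", "b2": "B2"},
}

# Static evaluation of each node from the root player's perspective,
# used whenever the depth limit (the "computational resource") is hit.
EVAL = {"R": 0, "A": 5, "B": 3, "A1": 1, "A2": 9, "B1": 4, "B2": 6}


def minimax(node, depth, maximizing):
    """Depth-limited minimax over the toy tree."""
    if depth == 0 or node not in CHILDREN:
        return EVAL[node]
    values = [minimax(child, depth - 1, not maximizing)
              for child in CHILDREN[node].values()]
    return max(values) if maximizing else min(values)


def best_move(root, depth):
    """Pick the root move with the highest minimax value at a given depth."""
    return max(CHILDREN[root],
               key=lambda m: minimax(CHILDREN[root][m], depth - 1, False))


print(best_move("R", depth=1))  # 'a': the shallow searcher prefers one move
print(best_move("R", depth=2))  # 'b': the deeper searcher prefers another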

    Quality in human post-editing of machine-translated texts: error annotation and linguistic specifications for tackling register errors

    Over the last decade, machine translation has played an important role in the translation market and has become an essential tool for speeding up the translation process and reducing its time and cost. Nevertheless, the quality of the output is not yet fully satisfactory: it varies considerably, depending on numerous factors. It is therefore necessary to combine MT with human intervention, by post-editing the machine-translated texts, in order to reach high-quality translations. This work describes the MT process used at Unbabel, a Portuguese start-up that combines MT with post-editing by online editors. The main objective of the study is to help improve the quality of the translated text by analyzing annotated English-to-Italian translations and defining linguistic specifications to improve the tools used at the start-up to aid human editors and annotators. The guidelines given to annotators to guide the editing process were also analyzed, a task that improved inter-annotator agreement and thus made the annotated data reliable. Accomplishing these goals allowed the most frequent errors in the translated texts to be identified and categorized, namely errors whose resolution is bound to significantly improve the efficacy and quality of the translation. The data collected identified register as the most frequent error category and the one with the greatest impact on translation quality, and for these reasons this category is analyzed in more detail throughout the work. From the analysis of errors in this category, it was possible to define and implement a set of rules in Smartcheck, a tool used at Unbabel to automatically detect errors in the target text produced by the MT system, in order to guarantee higher quality of the translated texts after post-editing.
Over the last decades, machine translation has been an important research area in which researchers have steadily improved results, even achieving positive outcomes. Nowadays, machine translation plays a very important role in the translation market, owing to the ever-growing number of texts to be translated and the short deadlines imposed, as well as the constant pressure to reduce costs. Although machine translation is used ever more frequently, its results are variable and the quality of the translations is not always satisfactory, depending on the paradigms of the chosen machine translation systems, the domain of the text to be translated, and the syntax and lexicon of the source text. More specifically, the machine translation systems developed so far can be divided into knowledge-based systems, data-driven systems and hybrid systems, which combine different paradigms. Recently, the neural paradigm has seen very wide adoption, even calling into question the continued existence of the other paradigms. Since the quality of machine translation output depends on different factors, improving it requires human intervention, through pre-editing or post-editing processes.
This work builds on the activities carried out during a curricular internship at the start-up Unbabel, focusing specifically on the analysis of the machine translation process implemented at Unbabel, with the aim of contributing to improving the quality of the translations obtained, in particular translations from English into Italian. Unbabel is a Portuguese start-up that offers near real-time translation services, combining machine translation with a community of reviewers who post-edit its output. The corpus used in this work consists of English-to-Italian machine translations of customer support e-mails, post-edited by human reviewers. The annotation process aims to identify and categorize errors in machine-translated texts, which, at Unbabel, is done by human annotators. We analyzed the annotation process and the tools that support analyzing and annotating the texts, the system that computes the quality metric, and the guidelines the annotator must follow during revision. This work made it possible to identify and categorize the most frequent errors in the texts of our corpus. Another objective of this work is to analyze instances of the most frequent error types, in order to understand the causes of their frequency and to establish generalizations that allow rules to be written and implemented in the tool used at Unbabel to support the work of human editors and annotators with automatic notifications. In particular, our work focuses on errors in the register category, the most frequent one in the annotated texts considered. More specifically, our study defines a set of rules to improve the coverage of Smartcheck, a tool used at Unbabel to automatically detect errors in translated texts, with respect to phenomena related to the expression of register, so as to guarantee better results after the post-editing process. The work is divided into eight chapters. The first chapter presents the object of study, the methodology used, and the organization of the report. The second chapter presents a theoretical overview of the field of machine translation, highlighting the characteristics and purposes of these systems, along with a brief history of machine translation from the emergence of the field to the present and the different paradigms of machine translation systems. The third chapter presents the host institution of the internship that served as the starting point for this work, the Portuguese start-up Unbabel. It explains the translation process used at the company and its phases, describing in detail the human post-editing and annotation processes, and provides information about the tools used at the company to support the translation process, Smartcheck and Turbo Tagger. The fourth chapter presents the annotation process used at Unbabel, how it works and the guidelines the annotator must follow, and also describes some aspects that could be improved. The fifth chapter addresses the question of inter-annotator agreement, describing its importance for measuring the homogeneity among annotators and, consequently, the reliability of using the annotation data to measure the effectiveness and quality of machine translation systems.
The sixth chapter identifies the most frequent errors by error category and highlights the register category, the most frequent one and the one with clear repercussions on the fluency and quality of the translation, since it represents the client's voice and image. It describes a set of rules that can be implemented in the Smartcheck tool, with a view to reducing the frequency of this error and raising the quality of the target texts, and it verifies that the implemented rules work correctly, presenting illustrative examples of Smartcheck's performance, in its test version, on relevant data. The last chapter presents the conclusions and the future work envisaged on the basis of this project. In conclusion, this work aims to contribute to improving the quality of the texts translated at the host institution of the internship. Concretely, it constitutes a tangible contribution to increasing the precision of the human annotation process and to extending the coverage of the tools that support the human editor and annotator at the start-up Unbabel.
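
The actual Smartcheck rules defined in this work are not reproduced in the abstract; purely as a hypothetical sketch of what a rule-based register check for English-to-Italian support e-mails might look like, the snippet below flags informal Italian second-person forms when the required register is formal. The marker list, function name and interface are illustrative assumptions, not Unbabel's tool.

import re

# Hypothetical rule: in a formal-register text, standalone informal
# second-person forms ("tu" and related clitics/possessives) are suspect.
INFORMAL_MARKERS = re.compile(r"\b(tu|ti|tuo|tua|tuoi|tue)\b", re.IGNORECASE)


def check_register(target_text, required_register="formal"):
    """Return (span, message) warnings for the post-editor to review."""
    warnings = []
    if required_register == "formal":
        for match in INFORMAL_MARKERS.finditer(target_text):
            warnings.append((match.span(),
                             f"informal form '{match.group(0)}' in a formal-register text"))
    return warnings


print(check_register("Grazie per la tua richiesta, ti risponderemo presto."))
# flags 'tua' and 'ti', each with its character span in the target text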

    EMIL: Extracting Meaning from Inconsistent Language

    Formal and computational theories of argumentation have developed means of reasoning with inconsistency, and work in computational linguistics now extracts arguments from large textual corpora. Both developments head in the direction of automated processing and reasoning with inconsistent linguistic knowledge so as to explain and justify arguments in a humanly accessible form. Yet there is a gap between the coarse-grained, semi-structured knowledge-bases of computational theories of argumentation and fine-grained, highly-structured inferences from knowledge-bases derived from natural language. We identify several subproblems which must be addressed in order to bridge the gap. We provide a direct semantics for argumentation. It has attractive properties in terms of expressivity and complexity, enables reasoning by cases, and can be more highly structured. For language processing, we work with an existing controlled natural language (CNL), which interfaces with our computational theory of argumentation; the tool processes natural language input, translates it into a form for automated inference engines, outputs argument extensions, then generates natural language statements. The key novel adaptation incorporates the defeasible expression ‘it is usual that’. This is an important, albeit incremental, step towards incorporating linguistic expressions of defeasibility. Overall, the novel contribution of the paper is an integrated, end-to-end argumentation system which bridges automated defeasible reasoning and a natural language interface. Specific novel contributions are the theory of ‘direct semantics’, motivations for our theory, results with respect to the direct semantics, an implementation, experimental results, the tie between the formalisation and the CNL, the introduction into a CNL of a natural language expression of defeasibility, and an ‘engineering’ approach to fine-grained argument analysis.
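
To make concrete what "outputs argument extensions" involves, here is a minimal sketch that computes the grounded extension of a Dung-style abstract argumentation framework by iterating its characteristic function. Note that this is the standard grounded semantics, not the paper's novel "direct semantics", which the abstract does not spell out; the three-argument example is mine.

def grounded_extension(arguments, attacks):
    """arguments: set of labels; attacks: set of (attacker, target) pairs."""
    def defended(arg, current):
        # arg is defended if each of its attackers is attacked by `current`.
        return all(any((d, attacker) in attacks for d in current)
                   for attacker, target in attacks if target == arg)

    extension = set()
    while True:
        # Characteristic function: everything defended by the current set.
        new = {a for a in arguments if defended(a, extension)}
        if new == extension:
            return extension
        extension = new


args = {"A", "B", "C"}              # A attacks B, B attacks C
atts = {("A", "B"), ("B", "C")}
print(sorted(grounded_extension(args, atts)))  # ['A', 'C']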

    An exploration of minimal and maximal metrical feet

    This thesis presents a principled theory of bounded recursive footing. Building on previous research on metrical stress, and couched within the framework of Prosodic Hierarchy Theory, I argue that the rehabilitation of recursive feet in phonological representations leads to an improvement of our theory of prosody. I investigate the major driving forces that may cause recursion at the foot level and demonstrate that reference to recursive and non-recursive feet in various related and unrelated languages (e.g. Wargamay, Yidiɲ, Chugach, English, Dutch, German, Gilbertese, Seneca, Ryukyuan, Tripura Bangla, Cayuvava) allows us to provide a unified account of a wide range of prosodically-conditioned phenomena which would otherwise remain unexplained. In particular, I demonstrate that the assignment of binary and ternary stress, certain tonal distributions, some puzzling cases of vowel lengthening, consonant fortition, vowel reduction and consonant weakening all clearly benefit from recursion-based analyses. In arguing for the need for recursive feet in phonological representations, I identify new strength relations in prosodic systems. Besides the well-established strength dichotomy between the head of a foot (i.e. the strong branch of a foot) and the dependent of a foot (i.e. its weak branch), I show that languages may distinguish between further metrical prominence positions. These additional required positions do not need to be stipulated as they come for free in a framework that allows recursion at the level of the foot
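
A small structural sketch (my own illustration, assuming left-headed feet) of why recursion at the foot level yields extra prominence positions without extra stipulation: nesting one foot inside another automatically distinguishes the dependent inside the minimal foot from the dependent of the maximal foot, alongside the shared head.

# A foot is ("Ft", head, dependent); a bare string is a syllable.
recursive_foot = ("Ft", ("Ft", "σ1", "σ2"), "σ3")   # ((σ1 σ2) σ3)


def positions(node, path=("Ft-max",)):
    """Yield (syllable, structural position) pairs for left-headed feet."""
    if isinstance(node, str):
        yield node, "/".join(path)
        return
    _, head, dependent = node
    yield from positions(head, path + ("head",))
    yield from positions(dependent, path + ("dependent",))


for syllable, position in positions(recursive_foot):
    print(syllable, position)
# σ1 Ft-max/head/head        (head of both the minimal and the maximal foot)
# σ2 Ft-max/head/dependent   (dependent inside the minimal foot)
# σ3 Ft-max/dependent        (dependent of the maximal foot only)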

    MULTI-MODAL TASK INSTRUCTIONS TO ROBOTS BY NAIVE USERS

    This thesis presents a theoretical framework for the design of user-programmable robots. The objective of the work is to investigate multi-modal unconstrained natural instructions given to robots in order to design a learning robot. A corpus-centred approach is used to design an agent that can reason, learn and interact with a human in a natural unconstrained way. The corpus-centred design approach is formalised and developed in detail. It requires the developer to record a human during interaction and analyse the recordings to find instruction primitives. These are then implemented in a robot. The focus of this work has been on how to combine speech and gesture using rules extracted from the analysis of a corpus. A multi-modal integration algorithm is presented that uses timing and semantics to group, match and unify gesture and language. The algorithm always achieves correct pairings on the corpus and initiates questions to the user in ambiguous cases or when information is missing. The domain of card games has been investigated because it offers a variety of games that are rich in rules and contain sequences. A further focus of the work is the translation of rule-based instructions, whereas most multi-modal interfaces to date have only considered sequential instructions. A combination of frame-based reasoning, a knowledge base organised as an ontology and a problem-solver engine is used to store these rules. Understanding rule instructions, which contain conditional and imaginary situations, requires an agent with complex reasoning capabilities. A test system for the agent implementation is also described. Tests that confirm the implementation by playing back the corpus are presented. Furthermore, deployment test results with the implemented agent and human subjects are presented and discussed. The tests showed that the rate of errors due to sentences not being defined in the grammar does not decrease at an acceptable rate when new grammar is introduced. This was particularly the case for complex verbal rule instructions, which can be expressed in a large variety of ways.
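
The integration algorithm itself is not given in the abstract; the following is a deliberately simplified sketch of the general idea of grouping and matching speech and gesture by timing: each deictic word is paired with the temporally closest pointing gesture inside a fixed window, and a clarification question is initiated when no gesture qualifies. The window size, data format and card-game referent are illustrative assumptions, not the thesis's algorithm.

DEICTICS = {"this", "that", "here", "there"}
WINDOW = 1.0  # seconds a gesture may precede or follow the word it resolves


def integrate(words, gestures):
    """words: [(word, time)]; gestures: [(referent, time)] -> resolutions."""
    resolved = []
    for word, t_word in words:
        if word not in DEICTICS:
            continue
        candidates = [(abs(t_word - t_g), ref) for ref, t_g in gestures
                      if abs(t_word - t_g) <= WINDOW]
        if candidates:
            resolved.append((word, t_word, min(candidates)[1]))
        else:
            resolved.append((word, t_word, "ASK_USER"))  # initiate a question
    return resolved


speech = [("put", 0.0), ("this", 0.4), ("there", 1.9)]
pointing = [("card_7", 0.5)]          # one pointing gesture near "this"
print(integrate(speech, pointing))
# [('this', 0.4, 'card_7'), ('there', 1.9, 'ASK_USER')]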

    Morphotactics in Affix Ordering: Typology and Theory

    This dissertation discusses the empirical distribution and systematicity of morphotactic rules governing the relative order of verbal affixes. In the literature, the exact role of morphology and its interaction with other factors affecting affix order is still under debate. More specifically, syntactic (Baker 1985, 1988) and semantic approaches (Muysken 1986, Rice 2000, Stiebels 2003) to affix order assume that some underlying grammatical structure, the syntactic derivation or the semantic composition, is mapped transparently onto the surface, such that the relative order of affixes on the surface matches the underlying order of the elements. However, phenomena like nontransitive affix order or templatic morphology suggest that morphological rules may overwrite the surface order provided by syntax or semantics. In this dissertation, I examine exactly these phenomena to investigate the empirical scope of such morphological rules. I demonstrate that there are crosslinguistically stable, systematic rules of morphology, which are in direct competition with rules of syntactic or semantic transparency. Concretely, I conclude that there is a morphological rule that requires the realization of causatives in proximity to the verb root. The role and systematicity of morphotactics in affix order is highly relevant for linguistic theory: if seemingly arbitrary rules influence affix order without any restriction, it is impossible to build restrictive theories. Thus, uncovering the crosslinguistic patterns of morphological rules helps to build empirically adequate, restrictive theories of affix order. Furthermore, I demonstrate that the interaction of affix order with phonology suggests a cyclic model of the morpho-phonology interface. More specifically, I assume that phonology has temporarily limited access to morphological structure, thus deriving well-attested cases of phonologically conditioned affix order. To model the competition between rules of morphology on the one hand and rules of syntax and semantics on the other, I suggest a concrete mechanism that translates the underlying semantic composition into a restricted set of constraints. Consequently, the simultaneous interaction between these constraints, which implement transparency requirements, and morphotactic constraints derives the variety of transparency patterns found in combinations of valency markers.
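
As a toy rendering (my own, not the dissertation's formal system) of how a transparency constraint and a morphotactic constraint compete over affix order: for a verb whose semantic composition is CAUS(APPL(root)), the transparency constraint prefers the affix that composes first to sit closest to the root, while a hypothetical CAUS-TO-ROOT constraint prefers the causative in that position; whichever constraint is ranked higher selects the surface order.

CANDIDATES = [("root", "APPL", "CAUS"),   # semantically transparent order
              ("root", "CAUS", "APPL")]   # morphotactically preferred order


def transparency(order):
    # One violation if APPL (which composes first) is not the innermost suffix.
    return 0 if order[1] == "APPL" else 1


def caus_to_root(order):
    # One violation if the causative is not adjacent to the root.
    return 0 if order[1] == "CAUS" else 1


def evaluate(ranking):
    # Compare violation profiles lexicographically, highest-ranked constraint first.
    return min(CANDIDATES, key=lambda c: [con(c) for con in ranking])


print(evaluate([caus_to_root, transparency]))  # ('root', 'CAUS', 'APPL')
print(evaluate([transparency, caus_to_root]))  # ('root', 'APPL', 'CAUS')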

    Order and structure

    Thesis (Ph.D.) -- Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, 1996. Includes bibliographical references (p. [291]-306). By Colin Phillips.

    Rule based learning of word pronunciations from training corpora

    Thesis (M.Eng. and S.B.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (leaves 83-85). This thesis describes a text-to-pronunciation system using transformation-based error-driven learning for speech-recognition purposes. Efforts have been made to make the system language-independent, automatic, robust and able to generate multiple pronunciations. The learner proposes initial pronunciations for the words and finds transformations that bring the pronunciations closer to the correct ones. The pronunciation generator works by applying the transformations to a similar initial pronunciation. A dynamic aligner is used for the necessary alignment of phonemes and graphemes. The pronunciations are scored using a weighted string edit distance. Optimizations were made to make the learner and the rule applier fast. The system achieves 73.9% exact word accuracy with multiple pronunciations, 82.3% word accuracy with one correct pronunciation, and 95.3% phoneme accuracy for English words. For proper names, it achieves 50.5% exact word accuracy, 69.2% word accuracy, and 92.0% phoneme accuracy, which outperforms the compared neural-network approach. By Lajos Molnár.
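
The abstract scores pronunciations with a weighted string edit distance; the sketch below is a standard dynamic-programming implementation over phoneme sequences (the unit weights and the ARPAbet-style example are illustrative, not the thesis's actual parameters).

def weighted_edit_distance(pred, ref, sub=1.0, ins=1.0, dele=1.0):
    """Weighted edit distance between two phoneme sequences."""
    n, m = len(pred), len(ref)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * dele
    for j in range(1, m + 1):
        d[0][j] = j * ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if pred[i - 1] == ref[j - 1] else sub
            d[i][j] = min(d[i - 1][j - 1] + cost,   # substitution or match
                          d[i - 1][j] + dele,       # delete a predicted phoneme
                          d[i][j - 1] + ins)        # insert a missing phoneme
    return d[n][m]


# Proposed vs. reference pronunciation (ARPAbet-style symbols): one substitution.
print(weighted_edit_distance(["T", "AH", "M", "EY", "T", "OW"],
                             ["T", "AH", "M", "AA", "T", "OW"]))  # 1.0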

    Stress in Harmonic Serialism

    This dissertation proposes a model of word stress in a derivational version of Optimality Theory (OT) called Harmonic Serialism (HS; Prince and Smolensky 1993/2004, McCarthy 2000, 2006, 2010a). In this model, the metrical structure of a word is derived through a series of optimizations in which the 'best' metrical foot is chosen according to a ranking of violable constraints. Like OT, HS models cross-linguistic typology under the assumption that every constraint ranking should correspond to an attested language. Chapter 2 provides an argument for modeling stress typology in HS by showing that the serial model correctly rules out stress patterns that display non-local interactions, while a parallel OT model with the same constraints and representations fails to make such a distinction. Chapter 3 discusses two types of primary stress, autonomous and parasitic, and argues that limited parallelism in the assignment of primary stress is warranted by a consideration of attested typology. Stress systems in which the primary stress appears to behave autonomously from secondary stresses require that primary stress assignment be simultaneous with a foot's construction. As a result, a provision to allow primary stress to be reassigned during a derivation is necessary to account for a class of stress systems in which primary stress is parasitic on secondary stresses. Chapter 4 takes up two issues in the definition of constraints on primary stress, including a discussion of how primary stress alignment should be formulated and the identification of vacuous satisfaction as a cause of problematic typological predictions. It is proposed that all primary stress constraints be redefined according to non-vacuous schemata, which eliminate the problematic predictions when implemented within HS. Finally, chapter 5 considers the role of representational assumptions in typological predictions with comparisons between HS and parallel OT. The primary conclusion of this chapter is that constituent representations (i.e., feet) are necessary in HS to account for rhythmic stress patterns in a typologically restrictive way.
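
A highly simplified sketch (my own GEN and constraint set, not the dissertation's) of a Harmonic Serialism derivation: at each step GEN may build at most one binary foot, EVAL chooses the most harmonic candidate under the ranking PARSE-SYL >> ALL-FT-LEFT, and the loop repeats until the winner is identical to its input, yielding iterative left-to-right binary footing.

def syllable_count(unit):
    return len(unit) if isinstance(unit, tuple) else 1


def parse_syl(form):
    # One violation per unfooted syllable.
    return sum(1 for u in form if not isinstance(u, tuple))


def all_ft_left(form):
    # One violation per syllable separating each foot from the left edge.
    total, preceding = 0, 0
    for u in form:
        if isinstance(u, tuple):
            total += preceding
        preceding += syllable_count(u)
    return total


def gen(form):
    # Faithful candidate plus every way of footing one adjacent unfooted pair.
    candidates = [form]
    for i in range(len(form) - 1):
        if not isinstance(form[i], tuple) and not isinstance(form[i + 1], tuple):
            candidates.append(form[:i] + [(form[i], form[i + 1])] + form[i + 2:])
    return candidates


def derive(form, ranking=(parse_syl, all_ft_left)):
    while True:
        winner = min(gen(form), key=lambda c: [con(c) for con in ranking])
        if winner == form:          # no harmonic improvement: convergence
            return form
        form = winner


print(derive(["σ1", "σ2", "σ3", "σ4", "σ5"]))
# [('σ1', 'σ2'), ('σ3', 'σ4'), 'σ5']  : iterative left-to-right binary footing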

    Doctor of Philosophy

    Dissertation. Cognitive linguists argue that certain sets of knowledge of language are innate. However, critics have argued that the theoretical concept of "innateness" should be eliminated since it is ambiguous and insubstantial. In response, I aim to strengthen theories of language acquisition and identify ways to make them more substantial. I review the Poverty of Stimulus argument and separate it into four nonequivalent arguments: Deficiency of Stimulus, Corruption of Stimulus, Variety of Stimulus, and Poverty of Negative Evidence. Each argument uses a disparate set of empirical observations to support different conclusions about the traits that are claimed to be innate. Separating the Poverty of Stimulus arguments will aid in making each one more effective. I offer three sets of considerations that scholars can use to strengthen linguistic theories. The Empirical Consideration urges scholars to address specific sets of empirical observations, thus ensuring that innateness theories are not used to explain dissimilar traits. The Developmental Consideration urges scholars to consider complex developmental processes of acquisition. The Interaction Consideration urges scholars to examine interactions between organisms and their environment during language acquisition. I support recent contributions to the approach of "biologicizing the mind" which encourages interdisciplinary collaboration between psychology and biology. I develop an account of language acquisition in terms of canalization, and use this account to explain empirical observations used in Variety of Stimulus arguments. Finally, I argue that the conception of "innateness" can be understood in terms of canalization when it applies to traits that are canalized. Although the canalization conception of "innateness" is not generalizable, it can explain a certain set of empirical observations about language acquisition.