14 research outputs found

    Linguistic and metalinguistic representations in intelligent multilingual systems [Лингвистические и металингвистические представления в интеллектуальных многоязычных системах]

    This work proposes a functional-semantic approach that integrates linguistic means of different language levels (syntactic, lexical, derivational and inflectional) on the basis of their functional-semantic characteristics and ensures a synergetic combination of statistical methods and logical-linguistic rules. During grammatical parsing, the probability value of each node in the sentence parse is used. Interaction among the functional blocks and subsystems of the intelligent multilingual system, as well as its interaction with the user, is organized by means of control and communication metadata. The authors studied the structures of cognitive transfer within the functional transfer field of primary and secondary predication for the Russian-French language pair, by analogy with the Russian-English language pair. The material for analysis consisted of parallel texts, namely articles from scientific periodicals.

    A GUI For Defining Inductive Logic Programming Tasks For Novice Users

    University of Minnesota M.S. thesis. March 2017. Major: Computer Science. Advisor: Richard Maclin. 1 computer file (PDF); vii, 64 pages. Inductive logic programming (ILP), which involves learning a solution to a problem where data is more naturally viewed as multiple tables with relationships between the tables, is an extremely powerful learning method. But these methods have suffered from the fact that very few are written in languages other than Prolog and that describing such problems is difficult. To describe an inductive logic programming problem, the user needs to designate many tables and relationships and often provide some knowledge about the relationships in order for the techniques to work well. The goal of this thesis is to develop a Java-based Graphical User Interface (GUI) that allows novice users to connect to an existing database and define an ILP problem in an understandable way, perhaps with the assistance of data exploration techniques from the GUI.

    Integrating context and transliteration to mine new word translations from comparable corpora

    Master's thesis. Master of Science.

    Unsupervised neural machine translation between the Portuguese language and the Chinese and Korean languages

    Master's thesis, Informatics, 2023, Universidade de Lisboa, Faculdade de Ciências. The purpose of this dissertation is to present a comparative and reproduction study of Unsupervised Neural Machine Translation techniques for the language pairs Portuguese (PT) → Chinese (ZH) and Portuguese (PT) → Korean (KR), taking advantage of online tools and resources. These language pairs were chosen for two main reasons. The first is the importance of Asian languages, Chinese in particular, on the global stage, together with the influence of the Portuguese language in the world, especially in the southern hemisphere. The second reason is purely academic. Since studies in Natural Language Processing (NLP) on non-Germanic languages are scarce (owing to the hegemony of English), the aim was to study how unsupervised translation techniques behave on little-studied language pairs, in order to test their robustness. Spoken by a quarter of the world's population, Chinese is the "ace" in China's deck of cards. According to the International Chinese Language Education Week, in 2020 an estimated 200 million non-native speakers had already learned Chinese, and in the current year more than 25 million were studying it. Given the influence of the Chinese language, it becomes imperative to develop tools that bridge communication gaps. In this global context, machine translation thus emerges as a communication bridge between various cultures and China. South Korea, also known as one of the four Asian tigers, achieved the extraordinary feat of rising from extreme poverty to become one of the most developed countries in the world within two generations. Although it lacks China's economic hegemony, South Korea exerts considerable influence through its soft power in the entertainment sector, known as hallyu. 
This "wave" of Korean pop culture draws crowds to learning about the culture. To break down the communication barrier between fans of Korean culture and native speakers, machine translation is a strong ally because it allows people to interact instantly without needing to learn a new language. Although Portugal has no cultural ties with Korea, it has a strong connection with the Macao Special Administrative Region (RAEM), where Portuguese is one of the official languages. Machine translation between the two official languages is one of the local government's strategic areas, and a machine translation laboratory has been established at the Instituto Politécnico de Macau with the aim of building a system that can be used in the civil service to assist translators. Two approaches were carried out in this work: (i) Unsupervised Neural Machine Translation and (ii) the pivot approach. Since the dissertation focuses on unsupervised techniques, neither architecture used parallel data between the language pairs in question. Specifically, the first approach used monolingual data. The second introduced a third, pivot language that serves as a bridge between the source and target languages. The unsupervised approach to machine translation arose from the need to build translation systems for language pairs with little or no parallel data. As shown by Koehn and Knowles [2017a], neural machine translation needs large amounts of data in order to outperform Statistical Machine Translation (SMT). For language pairs with few linguistic resources, however, that is not feasible. The unsupervised machine translation architecture, by contrast, requires only monolingual data. The implementation chosen was that of Artetxe et al. [2018d], which consists of an encoder-decoder architecture. 
Because it contains a double-encoder, both directions were considered for this approach: Portuguese ↔ Chinese and Portuguese ↔ Korean. In addition to the reproduction for dissimilar low-resource languages, a replication study of the original article was also carried out using the data of one of the language pairs studied by its authors: English ↔ French. Another alternative when parallel corpora are lacking is the pivot approach. In this approach, the system uses a third language, called the pivot, that links the source language to the target language. This option is considered when parallel data are abundant between the pivot and each of the two languages. The motivation for this method is to exploit the performance that neural networks achieve when fed large volumes of data. With large amounts of parallel corpora between each language in question and the pivot, the performance of the networks compensates for the error propagation introduced by the intermediate language. In our case, English was chosen as the pivot language because of the strong availability of parallel data between the pivot and the other three languages. The system first translates from Portuguese into English and then translates the pivot into Korean or Chinese. Unlike the first approach, only one direction was considered: Portuguese → Chinese and Portuguese → Korean. This approach was implemented with the OpenNMT framework developed by [Klein et al., 2017]. The results were evaluated using the BLEU metric [Papineni et al., 2002b]. This metric made it possible to compare the performance of the two architectures and to determine which method is more effective for dissimilar low-resource language pairs. In the Portuguese → Chinese and Portuguese → Korean directions the pivot approach was superior, obtaining a BLEU score of 13.37 points for Portuguese → Chinese and 17.28 points for Portuguese → Korean. 
With the unsupervised neural machine translation approach, by contrast, the highest score obtained was 0.69 BLEU in the Portuguese → Korean direction and 0.32 BLEU in the Portuguese → Chinese direction (out of 100). The unsupervised translation scores are in line with those obtained by [Guzmán et al., 2019] and [Kim et al., 2020]. The explanation given for these low values lies in the quality of the cross-lingual embeddings. The performance of cross-lingual embeddings tends to degrade when mapping distant language pairs and, since the unsupervised machine translation model is initialized with the cross-lingual embeddings, if these are of low quality the model does not converge to a local optimum, resulting in the values obtained in the dissertation. Of the two methods tested, the pivot approach performs better. As could be ascertained from the current literature and from the results obtained in this dissertation, the unsupervised neural method proposed by Artetxe et al. [2018d] is not robust enough to initialize a translation system supported by monolingual texts in distant languages. It is nevertheless a promising approach, because it would fill one of the major gaps in Machine Translation, namely the lack of good-quality parallel data. More attention would, however, need to be paid to the problem of cross-lingual embeddings when mapping distant languages. This work offers insight into the study of unsupervised techniques for distant language pairs and provides a solution for building machine translation systems for the Portuguese-Chinese and Portuguese-Korean language pairs using monolingual data. This dissertation presents a comparative and reproduction study on Unsupervised Neural Machine Translation techniques for the language pairs Portuguese (PT) → Chinese (ZH) and Portuguese (PT) → Korean (KR). 
We chose these language pairs for two main reasons. The first is the importance of Asian languages in the global panorama and the influence that Portuguese has in the southern hemisphere. The second reason is purely academic. Since there is a lack of studies in the area of Natural Language Processing (NLP) regarding non-Germanic languages, we focused on studying the influence of unsupervised techniques on under-studied languages. In this dissertation, we worked on two approaches: (i) Unsupervised Neural Machine Translation; (ii) the pivot approach. The first approach uses only monolingual corpora. The second uses parallel corpora between the pivot and the non-pivot languages. The unsupervised approach was devised to mitigate the problem of low-resource languages, where training traditional Neural Machine Translation systems is unfeasible because they require large amounts of data to achieve promising results. As such, unsupervised machine translation requires only monolingual corpora. In this dissertation we chose the implementation of Artetxe et al. [2018d] to develop our work. Another alternative to the lack of parallel corpora is the pivot approach. In this approach, the system uses a third language (called the pivot) that connects the source language to the target language. The reasoning behind this is to take advantage of the performance of neural networks when they are fed large amounts of data, which is enough to counterbalance the error propagation introduced by adding a third language. The results were evaluated using the BLEU metric and showed that for both language pairs, Portuguese → Chinese and Portuguese → Korean, the pivot approach performed better, making it a more suitable choice for these dissimilar low-resource language pairs.
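
The pivot approach described in this abstract amounts to chaining two translation systems through an intermediate language. The following is a minimal Python sketch of that composition only. The names `make_pivot_translator` and `lookup` and the toy word tables are hypothetical stand-ins introduced for illustration; they are not the dissertation's OpenNMT models or code.

```python
from typing import Callable, Dict

# A translator is modeled as any function from a sentence to a sentence.
Translator = Callable[[str], str]


def make_pivot_translator(src_to_pivot: Translator,
                          pivot_to_tgt: Translator) -> Translator:
    """Chain two translators through a pivot language (e.g. PT -> EN -> ZH)."""
    def translate(sentence: str) -> str:
        pivot_sentence = src_to_pivot(sentence)   # first hop: source -> pivot
        return pivot_to_tgt(pivot_sentence)       # second hop: pivot -> target
    return translate


def lookup(table: Dict[str, str]) -> Translator:
    """Toy word-by-word translator; unknown words pass through unchanged."""
    def translate(sentence: str) -> str:
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate


# Hypothetical toy dictionaries standing in for trained models.
PT_EN = {"gato": "cat", "água": "water"}
EN_ZH = {"cat": "猫", "water": "水"}

pt_to_zh = make_pivot_translator(lookup(PT_EN), lookup(EN_ZH))
print(pt_to_zh("gato água"))  # prints: 猫 水
```

Any error made in the first hop is passed unchanged to the second, which is the error propagation the abstract mentions.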

    Organizing information through the use of modeling languages: old resources for new needs [Organización de la información mediante el uso de lenguajes de modelado : viejos recursos para nuevas necesidades]

    First, the context in which the need to develop an ontology arises is defined. Then, we discuss the concepts of “model” and “ontology”, and justify our choice of an intensional concept. In the same way, we discuss the differences between ontology design and business process re-engineering. Finally, the IDEF5 method and languages for creating ontologies are explained in more detail, with reference to some examples of associated software.

    Incorporating translation quality-oriented features into log-linear models of machine translation

    The current state-of-the-art approach to Machine Translation (MT) has limitations which could be alleviated by the use of syntax-based models. Although the benefits of syntax use in MT are becoming clear with the ongoing improvements in string-to-tree and tree-to-string systems, tree-to-tree systems such as Data Oriented Translation (DOT) have, until recently, suffered from lack of training resources, and as a consequence are currently immature, lacking key features compared to Phrase-Based Statistical MT (PB-SMT) systems. In this thesis we propose avenues to bridge the gap between our syntax-based DOT model and state-of-the-art PB-SMT systems. Noting that both types of systems score translations using probabilities not necessarily related to the quality of the translations they produce, we introduce a training mechanism which takes translation quality into account by averaging the edit distance between a translation unit and translation units used in oracle translations. This training mechanism could in principle be adapted to a very broad class of MT systems. In particular, we show how, when translating Spanish sentences into English, it leads to improvements in the translation quality of both PB-SMT and DOT. In addition, we show how our method leads to a PB-SMT system which uses significantly fewer resources and translates significantly faster than the original, while maintaining the improvements in translation quality. We then address the issue of the limited feature set in DOT by defining a new DOT model which is able to exploit features of the complete source sentence. We introduce a feature into this new model which conditions each target word on the source context it is associated with, and we also make the first attempt at incorporating a language model (LM) into a DOT system. We investigate different estimation methods for our lexical feature (namely Maximum Entropy and improved Kneser-Ney), reporting on their empirical performance. 
After describing methods which enable us to improve the efficiency of our system, and which allow us to scale to larger training data sizes, we evaluate the performance of our new model on English-to-Spanish translation, obtaining significant translation quality improvements compared to the original DOT system.
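
The quality-oriented score mentioned in this abstract averages an edit distance between a translation unit and the units used in oracle translations. The abstract does not give the exact formulation, so the sketch below is only one plausible reading: it assumes a plain Levenshtein distance over tokens and an unweighted mean. The function names are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def mean_edit_distance(unit: Sequence, oracle_units: List[Sequence]) -> float:
    """Average distance from one translation unit to a set of oracle units."""
    if not oracle_units:
        raise ValueError("at least one oracle unit is required")
    return sum(edit_distance(unit, o) for o in oracle_units) / len(oracle_units)


# Assumed toy data: token lists standing in for translation units.
candidate = "the house".split()
oracles = ["the house".split(), "a house".split()]
print(mean_edit_distance(candidate, oracles))  # prints: 0.5
```

A lower average indicates a unit closer to those that appear in oracle translations. How such a score would enter the model as a feature is not specified in the abstract and is left out here.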

    An Investigation of the Attitudes Held by General Education Teachers Toward Students with Disabilities in a Pilot Inclusive Education Program in Cameroon

    Problem Statement: The literature from Cameroon indicates that the implementation of inclusive education is not only in its embryonic stage but also faces resistance from educators who are still not accepting of the presence of students with disabilities in general education classrooms. This resistance has been attributed to several factors ranging from attachment to customs and traditions that encourage the isolation of persons with disabilities, to the lack of resources and professionals needed for the successful implementation of inclusive education programs. These unfavorable attitudes have been a cause for concern among parents, educators, and especially government leaders who do not want to be left behind the international community in embracing inclusive education. Researchers have found that unsuccessful inclusive programs stem from teachers’ perceptions of the concept of inclusion, their teaching ability, classroom management, and benefits/outcomes of inclusion. As a result, this study sought to examine whether there is a relationship between teachers’ characteristics (such as gender, age, level of education, years of teaching experience, experience teaching in inclusive classrooms, training, and language of instruction) and their attitudes toward inclusive education. Method: A quantitative non-experimental descriptive survey research design was used in this study. Participants included 346 full-time state-licensed general education teachers from seven bilingual secondary schools participating in the SEEPD pilot inclusive education program in the North West Region of Cameroon. A survey instrument, “Opinions Relative to the Integration of Students with Disabilities” (ORI), was used to collect data for determining the attitudes of general education teachers toward inclusion. 
The Statistical Package for Social Sciences Software (SPSS) was used to analyze the data, organize the results, and provide descriptive statistics and multivariate and univariate analyses of variance (MANOVA and ANOVA). Results: Teachers’ attitudes toward inclusive education in Cameroon were negative with respect to how they perceived the concept of inclusion and their ability to teach in inclusive classrooms. They had positive attitudes toward managing students with disabilities in inclusive classrooms, and about the outcomes/benefits of inclusion. Overall, most teachers in the pilot inclusive education program in the North West Region of Cameroon were not accepting of the presence of students with disabilities in general education classrooms. These negative attitudes were manifested in teachers’ self-perceptions of their inability or lack of training in both special and inclusive education. There was no significant difference in attitudes on the basis of the language of instruction. However, differences were found regarding the other demographic variables such as age, gender, experience, and education. Male teachers were more favorable to inclusion than their female colleagues. Additionally, older, more experienced, more qualified, and more educated teachers were more likely to be supportive of inclusive education than younger, less experienced, less qualified, and less educated ones. Conclusion: This study was conducted in general education secondary schools actively engaged in a pilot effort to introduce inclusive classroom practices in seven selected bilingual secondary schools in the North West Region of Cameroon. It is not certain what the level of acceptance of the practice of integrating students with disabilities into the general education classroom would be if the study were carried out in schools not actively involved in the inclusive education initiative. 
Nonetheless, what stands out about the findings of this study is that most teachers showed negative attitudes about the success or outcome of inclusive education and indicated that the training they received in special education and inclusive education was not enough to ensure a successful integration of students with disabilities into general education classrooms. These findings support not only the rationale but also the urgent need for investment by all Cameroonian education stakeholders, especially the leading sponsor of education, the government, in the training of special education professionals and paraprofessionals in the country. These revelations also constitute a call for needed action from instructional leaders and higher education leaders who can make a difference by promoting professional development through seminars and workshops as well as creating targeted special education programs in the various institutions of higher learning in the country.

    Using linguistic knowledge in SMT

    Thesis (Ph.D. in Information Technology)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2010. By Rabih M. Zbib. Cataloged from PDF version of thesis. Includes bibliographical references (p. 153-162). In this thesis, we present methods for using linguistically motivated information to enhance the performance of statistical machine translation (SMT). One of the advantages of the statistical approach to machine translation is that it is largely language-agnostic. Machine learning models are used to automatically learn translation patterns from data. SMT can, however, be improved by using linguistic knowledge to address specific areas of the translation process, where translations would be hard to learn fully automatically. We present methods that use linguistic knowledge at various levels to improve statistical machine translation, focusing on Arabic-English translation as a case study. In the first part, morphological information is used to preprocess the Arabic text for Arabic-to-English and English-to-Arabic translation, which reduces the gap in the complexity of the morphology between Arabic and English. The second method addresses the issue of long-distance reordering in translation to account for the difference in the syntax of the two languages. In the third part, we show how additional local context information on the source side is incorporated, which helps reduce lexical ambiguity. Two methods are proposed for using binary decision trees to control the amount of context information introduced. These methods are successfully applied to the use of diacritized Arabic source in Arabic-to-English translation. The final method combines the outputs of an SMT system and a Rule-based MT (RBMT) system, taking advantage of the flexibility of the statistical approach and the rich linguistic knowledge embedded in the rule-based MT system.

    Recycling texts: human evaluation of example-based machine translation subtitles for DVD

    This project focuses on translation reusability in audiovisual contexts. Specifically, the project seeks to establish (1) whether target language subtitles produced by an EBMT system are considered intelligible and acceptable by viewers of movies on DVD, and (2) whether a relationship exists between the ‘profiles’ of corpora used to train an EBMT system, on the one hand, and viewers’ judgements of the intelligibility and acceptability of the subtitles produced by the system, on the other. The impact of other factors is also investigated, namely whether movie-viewing subjects have knowledge of the soundtrack language, subjects’ linguistic background, and subjects’ prior knowledge of the (Harry Potter) movie clips viewed. Corpus profiling is based on measurements (partly using corpus-analysis tools) of three characteristics of the corpora used to train the EBMT system: the number of source language repetitions they contain; the size of the corpus; and the homogeneity of the corpus (independent variables). As a quality control measure in this prospective profiling phase, we also elicit human judgements (through a combined questionnaire and interview) on the quality of the corpus data and on the reusability in new contexts of the TL subtitles. The intelligibility and acceptability of EBMT-produced subtitles (dependent variables) are, in turn, established through end-user evaluation sessions. In these sessions 44 native German-speaking subjects view short movie clips containing EBMT-generated German subtitles, and following each clip answer questions (again, through a combined questionnaire and interview) relating to the quality characteristics mentioned above. The findings of the study suggest that an increase in corpus size, along with a concomitant increase in the number of source language repetitions and a decrease in corpus homogeneity, improves the readability of the EBMT-generated subtitles. 
It does not, however, have a significant effect on the comprehensibility, style or well-formedness of the EBMT-generated subtitles. Increasing corpus size and SL repetitions also result in a higher number of alternative TL translations in the corpus that are deemed acceptable by evaluators in the corpus profiling phase. The research also finds that subjects are more critical of subtitles when they do not understand the soundtrack language, while subjects’ linguistic background does not have a significant effect on their judgements of the quality of EBMT-generated subtitles. Prior knowledge of the Harry Potter genre, on the other hand, appears to have an effect on how viewing subjects rate the severity of observed errors in the subtitles, and on how they rate the style of subtitles, although this effect is training corpus-dependent. The introduction of repeated subtitles did not reduce the intelligibility or acceptability of the subtitles. Overall, the findings indicate that the subtitles deemed the most acceptable when evaluated in a non-AVT environment (albeit one in which rich contextual information was available) were the same as the subtitles deemed the most acceptable in an AVT environment, although richer data were gathered from the AVT environment.