3,531 research outputs found
Traduction automatique des noms propres de l’anglais et du français vers le vietnamien : analyse des erreurs et quelques solutions
Machine translation (MT) has increasingly become an indispensable tool for decoding themeaning of a text from a source language into a target language in our current information and knowledgeera. In particular, MT of proper names (PN) plays a crucial role in providing the specific and preciseidentification of persons, places, organizations, and artefacts through the languages. Despite a largenumber of studies and significant achievements of named entity recognition in the NLP communityaround the world, there has been almost no research on PNMT for Vietnamese language. Due to the different features of PN writing, transliteration or transcription and translation from a variety of languages including English, French, Russian, Chinese, etc. into Vietnamese, the PNMT from those languages into Vietnamese is still challenging and problematic issue. This study focuses on theproblems of English-Vietnamese and French-Vietnamese PNMT arising from current MT engines. First,it proposes a corpus-based PN classification, then a detailed PNMT error analysis to conclude with somepre-processing solutions in order to improve the MT quality. Through the analysis and classification of PNMT errors from the two English-Vietnamese and French-Vietnamese parallel corpora of texts with PNs, we propose solutions concerning two major issues:(1)corpus annotation for preparing the pre-processing databases, and (2)design of the pre-processingprogram to be used on annotated corpora to reduce the PNMT errors and enhance the quality of MTsystems, including Google, Vietgle, Bing and EVTran. The efficacy of different annotation methods of English and French corpora of PNs and the results of PNMT errors before and after using the pre-processing program on the two annotated corporaare compared and discussed in this study. They prove that the pre-processing solution reducessignificantly PNMT errors and contributes to the improvement of the MT systems’ for Vietnameselanguage.Dans l'ère de l'information et de la connaissance, la traduction automatique (TA) devientprogressivement un outil indispensable pour transposer la signification d'un texte d'une langue source versune langue cible. La TA des noms propres (NP), en particulier, joue un rôle crucial dans ce processus,puisqu'elle permet une identification précise des personnes, des lieux, des organisations et des artefacts à travers les langues. Malgré un grand nombre d'études et des résultats significatifs concernant lareconnaissance d'entités nommées (dont le nom propre fait partie) dans la communauté de TAL dans lemonde, il n'existe presque aucune recherche sur la traduction automatique des noms propres (TANP) pourle vietnamien. En raison des caractéristiques différentes d'écriture de NP, la translittération ou la transcription etla traduction de plusieurs de langues incluant l'anglais, le français, le russe, le chinois, etc. vers levietnamien, le TANP de ces langues vers le vietnamien est stimulant et problématique. Cette étude seconcentre sur les problèmes de TANP d’anglais vers le vietnamien et de français vers le vietnamienrésultant du moteurs courants de la TA et présente les solutions de prétraitement de ces problèmes pouraméliorer la qualité de la TA. A travers l'analyse et la classification d'erreurs de la TANP faites sur deux corpus parallèles detextes avec PN (anglais-vietnamien et français-vietnamien), nous proposons les solutions concernant deuxproblématiques importantes: (1) l'annotation de corpus, afin de préparer des bases de données pour leprétraitement et (2) la création d'un programme pour prétraiter automatiquement les corpus annotés, afinde réduire les erreurs de la TANP et d'améliorer la qualité de traduction des systèmes de TA, tels queGoogle, Vietgle, Bing et EVTran. L'efficacité de différentes méthodes d'annotation des corpus avec des NP ainsi que les tauxd'erreurs de la TANP avant et après l'application du programme de prétraitement sur les deux corpusannotés est comparés et discutés dans cette thèse. Ils prouvent que le prétraitement réduitsignificativement le taux d'erreurs de la TANP et, par la même, contribue à l'amélioration de traductionautomatique vers la langue vietnamienne
Projecting named entity tags from a resource rich language to a resource poor language
Named Entities (NE) are the prominent entities appearing in textual documents.Automatic classification of NE in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to a pre-defined taxonomy such as person, organization, location, date, time, etc.This article focuses on the person (PER), organization (ORG) and location (LOC) entities for a Malay journalistic corpus of terrorism.A projection algorithm, using the Dice Coefficient function and bigram scoring method with domain-specific rules, is suggested to map the NE information from the English corpus to the Malay corpus of terrorism.The English corpus is the translated version of the Malay corpus.Hence, these two corpora are treated as parallel corpora. The method computes the string similarity between the English words and the list of available lexemes in a pre-built lexicon that approximates the best NE mapping.The algorithm has been effectively evaluated using our own terrorism tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure.An evaluation of the selected open source NER tool for English is also presented
Recommended from our members
Cross-generational linguistic variation in the Canberra Vietnamese heritage language community: A corpus-centred investigation
This dissertation investigates cross-generational linguistic differences in the Canberra Vietnamese bilingual community, with a particular focus on Vietnamese as the heritage language. Specifically, it documents the vernacular and considers key aspects of this data from different theoretical perspectives. Its main contribution is an insight into a rarely studied heritage language variety in a contact community that has never been examined.
The dissertation consists of five core chapters, organised into two parts. In the first part (Chapters 2–3), I describe how I documented the vernacular and created the Canberra Vietnamese English Corpus (CanVEC), an original corpus compiled specifically for this study that is also the first to be freely available for research purposes. The corpus consists of over ten hours of spontaneous speech produced by 45 Vietnamese-English bilingual speakers across two generations living in Canberra. In the second part of the study (Chapters 4–6), I put the corpus to use and investigate aspects of the cross-generational differences in Vietnamese as the heritage language in this community.
In particular, I first probe the Vietnamese heritage language via its participation in the code-switching discourse (Chapter 4). In doing so, I focus on the applicability of the Matrix Language Framework (MLF) (Myers-Scotton, 1993, 2002) and its associated Matrix Language (ML) Turnover Hypothesis (Myers-Scotton, 1998) to the code-switching data in CanVEC. Since support for this prominent model has mainly come from language pairs that have different clausal word order or vastly different inventories of inflectional morphology, Vietnamese-English as a pair in which both languages are SVO and essentially isolating offers a tantalising testing ground for its application. Results show that the universal claims of this model do not hold so straight-forwardly. CanVEC data challenges several assumptions of the MLF, with the model ultimately only being able to account for around half of the CanVEC code-switching data. I further demonstrate that even when the ML is putatively identifiable and a cross-generational ML ‘turnover’ is quantitatively observed, the predictions do not reflect the direction of structural influence that we see in CanVEC. The MLF approach therefore sheds only limited light on cross-generational language shift and variation in this community.
Given that null elements emerge as a distinct area of difficulty in Chapter 4, I take this aspect as the focal point for the next part of the investigation (Chapter 5), where I use the variationist approach (Labov, 1972 et seq.) to explore three cases where null and overt realisation alternates in Vietnamese: subjects, objects, and copulas. In doing so, I move away from the bilingual portion of CanVEC to examine the monolingual heritage Vietnamese subset directly. Results show that Vietnamese null subjects vary significantly across generations, while null objects and copulas remain stable in terms of use. As speakers also overwhelmingly prefer overt forms over null forms (∼70:30) across all the three of the variables of interest, I appeal to the generative interface-oriented approach (Sorace & Filiaci, 2006 et seq.) to next examine the distribution of overt subjects, objects, copulas (Chapter 6). These results converge with what was found for null forms: cross-generational effects were observed for pronominal subjects, but not pronominal objects and copulas. This finding also supports the importance of a distinction drawn in previous works between internal (syntax-semantics) and external (syntax-discourse/pragmatics) interface phenomena, with the latter being seemingly more susceptible to change.
Ultimately, this dissertation highlights the empirical and theoretical value of studying rarely considered contact varieties, while deploying an integrated approach that acknowledges the multi-faceted complexity of the contact communities where these varieties are spoken.Cambridge Trust International Scholarshi
Artificial Intelligence Chatbots: A Survey of Classical versus Deep Machine Learning Techniques
Artificial Intelligence (AI) enables machines to be intelligent, most importantly using Machine Learning (ML) in which machines are trained to be able to make better decisions and predictions. In particular, ML-based chatbot systems have been developed to simulate chats with people using Natural Language Processing (NLP) techniques. The adoption of chatbots has increased rapidly in many sectors, including, Education, Health Care, Cultural Heritage, Supporting Systems and Marketing, and Entertainment. Chatbots have the potential to improve human interaction with machines, and NLP helps them understand human language more clearly and thus create proper and intelligent responses. In addition to classical ML techniques, Deep Learning (DL) has attracted many researchers to develop chatbots using more sophisticated and accurate techniques. However, research has paid chatbots have widely been developed for English, there is relatively less research on Arabic, which is mainly due to its complexity and lack of proper corpora compared to English. Though there have been several survey studies that reviewed the state-of-the-art of chatbot systems, these studies (a) did not give a comprehensive overview of how different the techniques used for Arabic chatbots in comparison with English chatbots; and (b) paid little attention to the application of ANN for developing chatbots. Therefore, in this paper, we conduct a literature survey of chatbot studies to highlight differences between (1) classical and deep ML techniques for chatbots; and (2) techniques employed for Arabic chatbots versus those for other languages. To this end, we propose various comparison criteria of the techniques, extract data from collected studies accordingly, and provide insights on the progress of chatbot development for Arabic and what still needs to be done in the future
Recommended from our members
A Comparative Analysis of Web-based Machine Translation Quality: English to French and French to English
This study offers a partial reduplication of a 2006 study by Williams, which focused primarily on the analysis of the quality of translation produced by online software, namely Yahoo!® Babelfish, Freetranslation.com, and Google Translate. Since the data for the study by Williams were collected in 2004 and the data for present study in 2012, this gives a lapse of eight years for a diachronic analysis of the differences in quality of the translations provided by these online services. At the time of the 2006 study by Williams, all three services used a rule-based translation system, but, in October 2007, however, Google Translate switched to a system that is entirely statistical in nature. Thus, the present study is also able to examine the differences in quality between contemporary statistical and rule-based approaches to machine translation
- …