9 research outputs found

    Character-level and syntax-level models for low-resource and multilingual natural language processing

    There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, “cross-lingual bridges” can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the urge to overcome language barriers, motivates us to build models that represent more of the world’s languages. This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax level. Specifically, we propose to (i) use orthographic similarities and transliteration between Named Entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter. In the first publication, we describe our approach for improving the translation of rare words and named entities for the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs since they are simple to extract, effective for bootstrapping the mapping of BWEs, and overcome the failure of unsupervised methods. The fourth paper shows our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity. We exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part of speech tagging task. Part of speech models trained on our labeled sets outperform prior work for low-resource languages like Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).

    Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

    Texts written in Old Literary Finnish represent the first literary works written in Finnish, dating from the 16th century. Several projects in Finland have digitized old publications and made them available for research use. However, applying modern NLP methods to such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches 96.3% accuracy on texts written by Agricola and 87.7% accuracy on other contemporary out-of-domain text. Our method has been made freely available on Zenodo and GitHub. Comment: la 28e Conférence sur le Traitement Automatique des Langues Naturelles (TALN)

    Android-Based Ticket Reservation Service Application for the Go Green Agrotourism Park

    One information technology development now spreading rapidly among the public is the use of mobile devices, or smartphones. Information technology can help many parties run their business processes, particularly at the Go Green Agrotourism Park in Sungai Pinang Village. Go Green is currently popular with the public as a tourist attraction in Sungai Pinang Village. At present, ticket booking at Go Green is still done manually, which requires visitors to come to the ticket counter in person and queue for an entrance ticket. This makes the current system less effective and forces visitors to wait a long time in line. The Ticket Reservation Service Application is a mobile application designed and developed to address this problem. The application provides services for Go Green visitors such as information about Go Green, information on the nearest route to the location, and online ticket booking and payment within the application. It uses LBS (Location Based Service) to help users find the nearest route to the location and a payment gateway to process payment transactions. The Ticket Reservation Service Application is expected to help users find information and enjoy the services available at Go Green. Testing the application with a User Acceptance Test (UAT) yielded a score of 89.2%.

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embedding

    Information and Communication Technologies in Tourism 2021

    This open access book is the proceedings of the International Federation for IT and Travel & Tourism (IFITT)’s 28th Annual International eTourism Conference, which assembles the latest research presented at the ENTER21@yourplace virtual conference January 19–22, 2021. This book advances the current knowledge base of information and communication technologies and tourism in the areas of social media and sharing economy, technology including AI-driven technologies, research related to destination management and innovations, COVID-19 repercussions, and others. Readers will find a wealth of state-of-the-art insights, ideas, and case studies on how information and communication technologies can be applied in travel and tourism as we encounter new opportunities and challenges in an unpredictable world.