18 research outputs found

    Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)

    Get PDF

    Pengembangan tata bahasa baku bahasa Indonesia (TBBI) daring terpadu [Development of an integrated online standard Indonesian grammar (TBBI)]

    Get PDF
    The Language Development and Fostering Agency (Badan Pengembangan dan Pembinaan Bahasa, or Badan Bahasa), under the Ministry of Education and Culture of the Republic of Indonesia, is the government institution charged with language and literary matters in Indonesia and publishes a range of language resources. Two of these are frequently used by learners of Indonesian: the Kamus Besar Bahasa Indonesia (KBBI) and the Tata Bahasa Baku Bahasa Indonesia (TBBI). The latest, fifth edition of the KBBI (Amalia 2016) was released in 2016 in three versions: print, online, and offline (Moeljadi et al. 2017). Since its launch on 28 October 2016, the online edition (KBBI Daring) has been warmly received by the public both in Indonesia and abroad, making the dictionary easy to use for learners of Indonesian and for the general public in the digital era. The same can be done for the TBBI. This paper discusses the initial stage of developing the database and website of an integrated online TBBI (TBBI Daring Terpadu) using INDRA (Indonesian Resource Grammar) (Moeljadi et al. 2015), a computational grammar of Indonesian built with grammar-engineering methods and based on reference grammars of standard Indonesian, chiefly the TBBI (Alwi et al. 2014) and the Indonesian Reference Grammar (Sneddon et al. 2010). TBBI Daring Terpadu will contain the grammar rules of standard Indonesian, integrated with a lexicon and with examples drawn from a syntactically and semantically annotated corpus of standard Indonesian. The authors hope that TBBI Daring Terpadu will become the primary, easily accessible reference for standard Indonesian grammar, for instance for learners of Indonesian as a foreign language (BIPA); that it will enrich KBBI Daring with finer-grained word-class categorization; and that it will advance Indonesian computational linguistics and natural language processing, for example in machine translation and in building grammar and lexicon checkers for standard Indonesian.
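
    The integration the abstract describes, grammar rules linked to lexicon entries and to annotated corpus examples, suggests a simple relational data model. The sketch below is a minimal Python illustration of one way such a schema could look; every class, field, and sample value is a hypothetical placeholder, not the actual TBBI Daring Terpadu design.

        # Minimal sketch (hypothetical names throughout) of a data model that
        # links grammar rules to lexicon entries and annotated corpus examples,
        # as the abstract describes for TBBI Daring Terpadu.
        from dataclasses import dataclass, field

        @dataclass
        class LexiconEntry:
            lemma: str
            word_class: str        # fine-grained POS label, e.g. "verba"

        @dataclass
        class CorpusExample:
            sentence: str
            syntax: str            # bracketed parse from the annotated corpus
            semantics: str         # semantic annotation (e.g. MRS from INDRA)

        @dataclass
        class GrammarRule:
            rule_id: str
            statement: str         # the prescriptive rule as stated in TBBI
            lexicon: list[LexiconEntry] = field(default_factory=list)
            examples: list[CorpusExample] = field(default_factory=list)

        # A rule page on the site would render the rule text together with its
        # linked lexicon entries and corpus sentences.
        rule = GrammarRule("R-001", "Placeholder statement of a grammar rule.")
        rule.lexicon.append(LexiconEntry("makan", "verba"))
        rule.examples.append(CorpusExample(
            "Dia makan nasi.",
            "(S (NP Dia) (VP makan (NP nasi)))",
            "makan(dia, nasi)",
        ))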

    Comparing Classifier use in Chinese and Japanese

    Get PDF

    SALMA: Arabic Sense-Annotated Corpus and WSD Benchmarks

    Full text link
    SALMA, the first Arabic sense-annotated corpus, consists of ~34K tokens, all of which are sense-annotated. The corpus is annotated using two different sense inventories simultaneously (Modern and Ghani). SALMA's novelty lies in how tokens and senses are associated: instead of linking a token to only one intended sense, SALMA links a token to multiple senses and assigns a score to each. A smart web-based annotation tool was developed to support scoring multiple senses against a given word. In addition to sense annotations, we annotated the corpus with six types of named entities. The quality of our annotations was assessed using various metrics (Kappa, Linear Weighted Kappa, Quadratic Weighted Kappa, Mean Average Error, and Root Mean Square Error), which show very high inter-annotator agreement. To establish a Word Sense Disambiguation baseline on the SALMA corpus, we developed an end-to-end Word Sense Disambiguation system based on Target Sense Verification and used it to evaluate three Target Sense Verification models available in the literature. Our best model achieved an accuracy of 84.2% using Modern and 78.7% using Ghani. The full corpus and the annotation tool are open source and publicly available at https://sina.birzeit.edu/salma/
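
    The agreement metrics the abstract lists are standard and straightforward to reproduce. Below is a minimal sketch, assuming two annotators assign ordinal sense scores to the same tokens, of how those five metrics could be computed with scikit-learn and NumPy; the label arrays are invented placeholders, not SALMA data.

        # Inter-annotator agreement metrics named in the abstract, computed
        # over two annotators' sense scores for the same tokens. The arrays
        # are invented placeholders, not actual SALMA annotations.
        import numpy as np
        from sklearn.metrics import (cohen_kappa_score, mean_absolute_error,
                                     mean_squared_error)

        annotator_a = np.array([3, 1, 2, 4, 2, 3, 1, 4])
        annotator_b = np.array([3, 1, 2, 3, 2, 3, 2, 4])

        print("Kappa:          ", cohen_kappa_score(annotator_a, annotator_b))
        print("Linear wtd.:    ", cohen_kappa_score(annotator_a, annotator_b, weights="linear"))
        print("Quadratic wtd.: ", cohen_kappa_score(annotator_a, annotator_b, weights="quadratic"))
        print("MAE:            ", mean_absolute_error(annotator_a, annotator_b))
        print("RMSE:           ", np.sqrt(mean_squared_error(annotator_a, annotator_b)))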

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Get PDF
    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies

    Contextual Understanding of Sequential Data Across Multiple Modalities

    Get PDF
    In recent years, progress in computing and networking has made it possible to collect large volumes of data for a variety of applications in data mining and data analytics using machine learning methods. Data may come from different sources and in different shapes and forms, depending on their inherent nature and the acquisition process. In this dissertation, we focus specifically on sequential data, which have grown exponentially in recent years on platforms such as YouTube, social media, and news agency sites. An important characteristic of sequential data is their inherent causal structure, with latent patterns that can be discovered and learned from samples of the dataset. With this in mind, we target problems in two domains, Computer Vision and Natural Language Processing, that deal with sequential data and share its common characteristics. The first is action recognition from video data, a fundamental problem in computer vision that aims to find generalized patterns in videos in order to recognize or predict human actions. A video contains two important kinds of information, appearance and motion. The two are complementary, so accurate recognition or prediction of actions in video depends significantly on our ability to extract both. Extracting them effectively is non-trivial, however, owing to challenges such as viewpoint changes, camera motion, and scale variation. It is thus crucial to design effective, generalized representations of video data that learn these variations or are invariant to them. We propose models that learn and extract spatio-temporal correlations from video frames using deep networks that overcome these challenges. The second problem we study in the context of sequential data is multi-document text summarization. Sentences are sequences of words that carry context, and the summarization task requires learning and understanding the contextual information in each sentence in order to determine which subset of sentences best represents a given article. With the progress made by deep learning, better representations of words have been achieved, leading in turn to better contextual representations of sentences. We propose summarization methods that combine mathematical optimization, Determinantal Point Processes (DPPs), and deep learning models, and that outperform the state of the art in multi-document text summarization.
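
    The DPP component mentioned at the end admits a compact illustration: a determinantal point process over sentence representations favors subsets that are simultaneously high-quality and diverse, and a common approximation is greedy MAP inference on the DPP kernel. The sketch below, using random stand-in embeddings and quality scores, shows that standard greedy procedure; it is an illustration under those assumptions, not the dissertation's actual model.

        # Greedy MAP inference for a DPP over sentences: repeatedly add the
        # sentence that most increases log det(L_S) of the selected submatrix.
        import numpy as np

        def greedy_dpp(kernel: np.ndarray, k: int) -> list[int]:
            n = kernel.shape[0]
            selected: list[int] = []
            for _ in range(k):
                best, best_gain = None, -np.inf
                for i in range(n):
                    if i in selected:
                        continue
                    idx = selected + [i]
                    sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
                    if sign > 0 and logdet > best_gain:
                        best, best_gain = i, logdet
                if best is None:
                    break
                selected.append(best)
            return selected

        # Stand-ins for sentence embeddings (e.g. from a deep encoder) and
        # per-sentence quality scores; both are random placeholders here.
        rng = np.random.default_rng(0)
        emb = rng.normal(size=(20, 64))
        emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit vectors
        quality = rng.uniform(0.5, 1.5, size=20)

        # Quality-diversity DPP kernel: L[i, j] = q_i * sim(i, j) * q_j,
        # with a small ridge for numerical stability.
        L = quality[:, None] * (emb @ emb.T) * quality[None, :]
        L += 1e-6 * np.eye(20)

        print("Summary sentence indices:", greedy_dpp(L, k=4))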