
    Text Summarization Technique for Punjabi Language Using Neural Networks

    In the contemporary world, consumption of digital content has risen exponentially: newspaper and web articles, status updates, advertisements, and the like have become an integral part of our daily routine. There is therefore a need for automated systems that summarize large text documents to save time and effort. Summarizers for languages such as English have matured through work dating back to the 1950s, but several languages still need special attention, among them Punjabi. Punjabi is morphologically far richer than English and other foreign languages. In this work, we present a three-phase extractive summarization methodology using neural networks that produces a concise summary of a single Punjabi text document. The methodology comprises a pre-processing phase that cleans the text, a processing phase that extracts statistical and linguistic features, and a classification phase. The classification neural network applies a sigmoid activation function and gradient-descent optimization for weighted error reduction to generate the output summary. The proposed system is applied to the monolingual Punjabi text corpus from the Indian Languages Corpora Initiative Phase-II, achieving precision, recall, and F-measure of 90.0%, 89.28%, and 89.65% respectively, which is reasonably good compared to the performance of existing summarizers for other Indian languages. This research is partially funded by the Ministry of Economy, Industry and Competitiveness, Spain (CSO2017-86747-R).
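
    The classification phase described above amounts to a single sigmoid unit scored per sentence and trained by gradient descent. A minimal sketch follows; the feature columns and toy labels are illustrative assumptions, not the paper's actual Punjabi features.

```python
# Sketch of the classification phase: a sigmoid neuron scores each
# sentence, trained with gradient descent on binary cross-entropy.
# Feature values and labels below are toy assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each row: statistical/linguistic features of one sentence
# (e.g. normalized position, TF score, keyword overlap -- assumed).
X = np.array([[0.9, 0.8, 0.7],
              [0.2, 0.1, 0.3],
              [0.8, 0.6, 0.9],
              [0.1, 0.2, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])      # 1 = include in summary

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.5
for _ in range(2000):                   # gradient descent steps
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / len(y)  # weighted error reduction
    b -= lr * np.mean(p - y)

scores = sigmoid(X @ w + b)             # per-sentence summary-worthiness
summary_idx = sorted(np.argsort(-scores)[:2])  # top-k, in document order
print(scores, summary_idx)
```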

    Peringkasan Teks Berita Berbahasa Indonesia Menggunakan Metode Latent Semantic Analysis (LSA) dan Teknik Steinberger&Jezek

    A news document contains a wide variety of information. The more information a document holds, the longer it becomes, and reading it in full takes considerable time. A summary is needed to grasp large amounts of information quickly, and automatic document summarization is a solution for obtaining the gist of a document. This study applies the Latent Semantic Analysis method and the Steinberger & Jezek technique to automatic text summarization. The test data comprise 10 news texts taken from the test set of a previous study. The experiments yield an average recall of 0.7027, precision of 0.6973, and F-measure of 0.6974.
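
    The Steinberger & Jezek technique scores each sentence by the length of its vector in the latent topic space obtained from a truncated SVD of the term-sentence matrix. A minimal sketch, assuming naive whitespace tokenization and raw term counts:

```python
# LSA summarization with Steinberger & Jezek scoring: build a
# term-sentence matrix A, take its SVD, and score sentence j as
# sqrt(sum_i (sigma_i * v_ij)^2) over the top-k latent topics.
import numpy as np

def summarize(sentences, k=2, n_pick=1):
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    A = np.zeros((len(vocab), len(sentences)))   # A[i, j] = count of term i in sentence j
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            A[vocab.index(w), j] += 1
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(sigma))
    scores = np.sqrt(((sigma[:k, None] * Vt[:k, :]) ** 2).sum(axis=0))
    best = sorted(np.argsort(-scores)[:n_pick])  # keep document order
    return [sentences[j] for j in best]

print(summarize([
    "The flood damaged hundreds of homes in the city.",
    "Officials said the flood was the worst in a decade.",
    "Residents were moved to temporary shelters.",
]))
```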

    Topic identification using filtering and rule generation algorithm for textual document

    Information stored digitally in text documents is seldom arranged by topic. Having to read entire documents is time-consuming and discourages searching for information. Most existing topic identification methods rely on the occurrence of terms in the text; however, not all frequently occurring terms are relevant, and the term extraction phase may yield terms with similar meanings, which is known as the synonymy problem. This study introduces filtering and rule generation algorithms to identify topics in textual documents. The proposed filtering algorithm (PFA) extracts the most relevant terms from the text and resolves synonymy among the extracted terms. The rule generation algorithm (TopId) then identifies a topic for each verse based on the extracted terms. PFA processes and filters each sentence using nouns and predefined keywords to produce terms suitable for the topic, and rules are then generated from the extracted terms using a rule-based classifier, as in the sketch below. An experiment was performed on 224 English-translated Quran verses related to female issues. Topics identified by TopId and by a Rough Set technique were compared and later verified by experts. PFA extracted more relevant terms than other filtering techniques, and TopId identified topics closer to the experts' topics, with an accuracy of 70%. The proposed algorithms were able to extract relevant terms without losing important ones and to identify the topic of a verse.
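
    A PFA-style filtering step might look like the following: keep only nouns and predefined keywords per sentence, then map synonyms onto a canonical term. The keyword list and synonym map here are hypothetical stand-ins for the study's actual resources.

```python
# Illustrative PFA-style filtering: retain nouns and predefined
# keywords, then resolve the synonymy problem via a canonical-term map.
# Requires: pip install nltk, plus nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') on first use.
import nltk

KEYWORDS = {"marriage", "inheritance"}             # assumed domain keywords
SYNONYMS = {"spouse": "wife", "mother": "parent"}  # assumed synonym map

def filter_terms(sentence):
    tokens = nltk.word_tokenize(sentence.lower())
    tagged = nltk.pos_tag(tokens)
    # Keep nouns (NN, NNS, ...) and any predefined keyword.
    terms = [w for w, tag in tagged if tag.startswith("NN") or w in KEYWORDS]
    # Map synonyms to one canonical form so rules see a single term.
    return sorted({SYNONYMS.get(t, t) for t in terms})

print(filter_terms("The mother and the spouse share the inheritance."))
```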

    CLOUD BASED MULTI-LANGUAGE INDEXING USING CROSS LINGUAL INFORMATION RETRIEVAL APPROACHES

    The exponential growth of data produced by digital media (video/audio/images), physical simulations, scientific instruments, and web authoring coincides with renewed interest in cloud computing. The options for distributing and parallelizing information in clouds make retrieval and storage complicated, especially for real-time data management. The number of web users accessing data over the Internet is growing day by day, and an enormous amount of information is available online in many languages, accessible to anyone at any time. Information Retrieval (IR) deals with finding useful information in large collections of unstructured, structured, and semi-structured data. In the present situation, the variety of data and language barriers are challenging obstacles to communication and exchange across the world. To overcome such barriers, cross-language information retrieval (CLIR) systems are now in strong demand. Query Expansion (QE) is the process of adding related and important terms to the original query to improve the relevance of the documents retrieved in CLIR. In this work, QE is investigated for Hindi-English and Kannada-English CLIR, in which Hindi and Kannada queries are used to search English documents. After query translation, the retrieved results are ranked using Okapi BM25 to place the most relevant documents at the top, increasing the relevance of the documents retrieved under QE. We propose an architecture for Hindi-English and Kannada-English CLIR using QE.
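
    Okapi BM25 ranks each English document against the translated, expanded query. A self-contained sketch with the usual defaults k1 = 1.5 and b = 0.75 (defaults assumed, not values taken from the paper):

```python
# Okapi BM25 ranking: score each document against the query and sort.
# Pure Python with whitespace tokenization; the +1 inside the log is
# the common smoothing that keeps IDF non-negative.
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    N = len(docs)
    df = Counter(w for t in toks for w in set(t))  # document frequency
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for q in query.lower().split():
            if q not in tf:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return sorted(range(N), key=lambda i: -scores[i])

docs = ["cloud information retrieval", "query expansion for CLIR",
        "weather in the cloud"]
print(bm25_rank("cloud retrieval", docs))  # doc indices, most relevant first
```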

    Indonesian Sentence Boundary Detection using Deep Learning Approaches

    Detecting sentence boundaries is a crucial pre-processing step in natural language processing, since the border between one sentence and the next can be ambiguous: there are multiple separators, sentence patterns vary, and a full stop does not always mark the end of a sentence. This research uses a deep learning approach to split an Indonesian news document into sentences, so no handcrafted features or rules need to be defined. As in Part-of-Speech Tagging and Named Entity Recognition, we cast boundary detection as sequence labeling with two labels: O for a non-boundary token and E for the last token of a sentence. For this we use the Bi-LSTM architecture, which is widely used for sequence labeling, together with pre-trained Indonesian embeddings, and show, as in previous studies, that the approach works for Indonesian text. The model achieves an F1-score of 98.49 percent, a significant improvement over previous studies.
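
    A minimal sketch of such a Bi-LSTM tagger in PyTorch, assuming a randomly initialized embedding layer in place of the pre-trained Indonesian embeddings, with per-token O/E logits:

```python
# Bi-LSTM sequence labeler: each token is classified as O (non-boundary)
# or E (last token of a sentence). Dimensions and inputs are placeholders.
import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=100, hidden=128, n_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)  # per-token O/E logits

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))       # (B, T, 2*hidden)
        return self.out(h)                          # (B, T, n_labels)

model = BoundaryTagger()
tokens = torch.randint(0, 1000, (1, 12))   # one document of 12 token ids
labels = model(tokens).argmax(-1)          # 0 = O, 1 = E per token
print(labels)
```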

    Breaking Language Barriers with a LEAP: Learning Strategies for Polyglot LLMs

    Large language models (LLMs) are at the forefront of transforming numerous domains globally. However, their inclusivity and effectiveness remain limited for non-Latin scripts and low-resource languages. This paper tackles the imperative challenge of enhancing the multilingual performance of LLMs, focusing specifically on generative models. Through systematic investigation and evaluation of diverse languages using popular question-answering (QA) datasets, we present novel techniques that unlock the true potential of LLMs in a polyglot landscape. Our approach encompasses three key strategies that yield remarkable improvements in multilingual proficiency. First, by meticulously optimizing prompts tailored for polyglot LLMs, we unlock their latent capabilities, resulting in substantial performance boosts across languages. Second, we introduce a new hybrid approach that synergizes GPT generation with multilingual embeddings and achieves significant multilingual performance improvement on critical tasks like QA and retrieval. Finally, to further propel the performance of polyglot LLMs, we introduce a novel learning algorithm that dynamically selects the optimal prompt strategy, LLM model, and embeddings per query. This dynamic adaptation maximizes the efficacy of LLMs across languages, outperforming the best static and random strategies. Our results show substantial advancements in multilingual understanding and generation across a diverse range of languages.
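
    The paper's selection algorithm is not spelled out in this abstract; as one plausible reading, per-query selection over (prompt strategy, model, embedding) combinations can be framed as a bandit problem. The epsilon-greedy sketch below is a generic illustration, with evaluate() standing in for whatever reward signal (e.g. QA accuracy) the system observes; all arm names are hypothetical.

```python
# Generic epsilon-greedy bandit over (prompt strategy, model, embedding)
# arms -- an illustration of dynamic per-query selection, NOT the
# paper's algorithm. evaluate() is a placeholder reward function.
import itertools
import random

strategies = ["native-prompt", "translate-to-en"]   # hypothetical names
models = ["llm-a", "llm-b"]
embeddings = ["multilingual-emb", "english-emb"]
arms = list(itertools.product(strategies, models, embeddings))
value = {a: 0.0 for a in arms}   # running mean reward per arm
count = {a: 0 for a in arms}

def evaluate(arm, query):
    return random.random()       # stand-in for observed QA accuracy

def select(eps=0.1):
    if random.random() < eps:
        return random.choice(arms)             # explore
    return max(arms, key=lambda a: value[a])   # exploit best-so-far

for q in range(500):
    arm = select()
    r = evaluate(arm, q)
    count[arm] += 1
    value[arm] += (r - value[arm]) / count[arm]  # incremental mean

print(max(arms, key=lambda a: value[a]))  # best combination found
```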

    A massively parallel corpus: the Bible in 100 languages

    Get PDF
    We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally, we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.
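
    What makes the Bible attractive as a parallel corpus is that every verse carries a stable book:chapter:verse identifier, so translations can be aligned at the verse level. The sketch below pairs two translations on that key; the one-verse-per-line "ID<TAB>text" file format is an assumption for illustration, not the corpus's actual format.

```python
# Verse-level alignment of two Bible translations keyed on the
# book:chapter:verse identifier. Assumed format: "ID\ttext" per line.
def load(path):
    verses = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            vid, text = line.rstrip("\n").split("\t", 1)
            verses[vid] = text
    return verses

def align(path_a, path_b):
    a, b = load(path_a), load(path_b)
    shared = sorted(set(a) & set(b))   # verses present in both versions
    return [(vid, a[vid], b[vid]) for vid in shared]

# pairs = align("eng.tsv", "deu.tsv")  # hypothetical file names
```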