
    Text Summarization Technique for Punjabi Language Using Neural Networks

    In the contemporary world, consumption of digital content has risen exponentially: newspaper and web articles, status updates, advertisements, and the like have become an integral part of our daily routine. There is therefore a need for automated systems that summarize large text documents to save time and effort. Summarizers for languages such as English have matured through work dating back to the 1950s, but several languages still need special attention, among them Punjabi. Punjabi is morphologically far richer than English and other foreign languages. In this work, we present a three-phase extractive summarization methodology using neural networks that produces a concise summary of a single Punjabi text document. The methodology comprises a pre-processing phase that cleans the text, a processing phase that extracts statistical and linguistic features, and a classification phase. The classification neural network applies a sigmoid activation function and gradient-descent optimization for weighted error reduction to generate the output summary. The proposed system is applied to the monolingual Punjabi text corpus from the Indian Languages Corpora Initiative Phase-II, achieving precision, recall, and F-measure of 90.0%, 89.28%, and 89.65% respectively, which is reasonably good compared to the performance of existing summarizers for other Indian languages. This research is partially funded by the Ministry of Economy, Industry and Competitiveness, Spain (CSO2017-86747-R).
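
    The classification phase described above amounts to a single sigmoid unit scored per sentence and trained by gradient descent. A minimal sketch follows; the feature columns and toy labels are illustrative assumptions, not the paper's actual Punjabi features.

```python
# Sketch of the classification phase: a sigmoid neuron scores each
# sentence, trained with gradient descent on binary cross-entropy.
# Feature values and labels below are toy assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each row: statistical/linguistic features of one sentence
# (e.g. normalized position, TF score, keyword overlap -- assumed).
X = np.array([[0.9, 0.8, 0.7],
              [0.2, 0.1, 0.3],
              [0.8, 0.6, 0.9],
              [0.1, 0.2, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])      # 1 = include in summary

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.5
for _ in range(2000):                   # gradient descent steps
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / len(y)  # weighted error reduction
    b -= lr * np.mean(p - y)

scores = sigmoid(X @ w + b)             # per-sentence summary-worthiness
summary_idx = sorted(np.argsort(-scores)[:2])  # top-k, in document order
print(scores, summary_idx)
```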

    Peringkasan Teks Berita Berbahasa Indonesia Menggunakan Metode Latent Semantic Analysis (LSA) dan Teknik Steinberger&Jezek

    A news document contains a wide variety of information. The more information a document holds, the longer it becomes, and reading it in full takes considerable time. A summary is needed to grasp large amounts of information quickly, and automatic document summarization is a solution for obtaining the gist of a document. This study applies the Latent Semantic Analysis method and the Steinberger & Jezek technique to automatic text summarization. The test data comprise 10 news texts taken from the test set of a previous study. The experiments yield an average recall of 0.7027, precision of 0.6973, and F-measure of 0.6974.
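
    The Steinberger & Jezek technique scores each sentence by the length of its vector in the latent topic space obtained from a truncated SVD of the term-sentence matrix. A minimal sketch, assuming naive whitespace tokenization and raw term counts:

```python
# LSA summarization with Steinberger & Jezek scoring: build a
# term-sentence matrix A, take its SVD, and score sentence j as
# sqrt(sum_i (sigma_i * v_ij)^2) over the top-k latent topics.
import numpy as np

def summarize(sentences, k=2, n_pick=1):
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    A = np.zeros((len(vocab), len(sentences)))   # A[i, j] = count of term i in sentence j
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            A[vocab.index(w), j] += 1
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(sigma))
    scores = np.sqrt(((sigma[:k, None] * Vt[:k, :]) ** 2).sum(axis=0))
    best = sorted(np.argsort(-scores)[:n_pick])  # keep document order
    return [sentences[j] for j in best]

print(summarize([
    "The flood damaged hundreds of homes in the city.",
    "Officials said the flood was the worst in a decade.",
    "Residents were moved to temporary shelters.",
]))
```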

    Topic identification using filtering and rule generation algorithm for textual document

    Information stored digitally in text documents is seldom arranged by topic. Having to read entire documents is time-consuming and discourages searching for information. Most existing topic identification methods rely on the occurrence of terms in the text; however, not all frequently occurring terms are relevant, and the term extraction phase may yield terms with similar meanings, which is known as the synonymy problem. This study introduces filtering and rule generation algorithms to identify topics in textual documents. The proposed filtering algorithm (PFA) extracts the most relevant terms from the text and resolves synonymy among the extracted terms. The rule generation algorithm (TopId) then identifies a topic for each verse based on the extracted terms. PFA processes and filters each sentence using nouns and predefined keywords to produce terms suitable for the topic, and rules are then generated from the extracted terms using a rule-based classifier, as in the sketch below. An experiment was performed on 224 English-translated Quran verses related to female issues. Topics identified by TopId and by a Rough Set technique were compared and later verified by experts. PFA extracted more relevant terms than other filtering techniques, and TopId identified topics closer to the experts' topics, with an accuracy of 70%. The proposed algorithms were able to extract relevant terms without losing important ones and to identify the topic of a verse.
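
    A PFA-style filtering step might look like the following: keep only nouns and predefined keywords per sentence, then map synonyms onto a canonical term. The keyword list and synonym map here are hypothetical stand-ins for the study's actual resources.

```python
# Illustrative PFA-style filtering: retain nouns and predefined
# keywords, then resolve the synonymy problem via a canonical-term map.
# Requires: pip install nltk, plus nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') on first use.
import nltk

KEYWORDS = {"marriage", "inheritance"}             # assumed domain keywords
SYNONYMS = {"spouse": "wife", "mother": "parent"}  # assumed synonym map

def filter_terms(sentence):
    tokens = nltk.word_tokenize(sentence.lower())
    tagged = nltk.pos_tag(tokens)
    # Keep nouns (NN, NNS, ...) and any predefined keyword.
    terms = [w for w, tag in tagged if tag.startswith("NN") or w in KEYWORDS]
    # Map synonyms to one canonical form so rules see a single term.
    return sorted({SYNONYMS.get(t, t) for t in terms})

print(filter_terms("The mother and the spouse share the inheritance."))
```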

    CLOUD BASED MULTI-LANGUAGE INDEXING USING CROSS LINGUAL INFORMATION RETRIEVAL APPROACHES

    The exponential growth of data produced by digital media (video/audio/images), physical simulations, scientific instruments, and web authoring coincides with renewed interest in cloud computing. The options for distributing and parallelizing information in clouds make retrieval and storage complicated, especially for real-time data management. The number of web users accessing data over the Internet is growing day by day, and an enormous amount of information is available online in many languages, accessible to anyone at any time. Information Retrieval (IR) deals with finding useful information in large collections of unstructured, structured, and semi-structured data. In the present situation, the variety of data and language barriers are challenging obstacles to communication and exchange across the world. To overcome such barriers, cross-language information retrieval (CLIR) systems are now in strong demand. Query Expansion (QE) is the process of adding related and important terms to the original query to improve the relevance of the documents retrieved in CLIR. In this work, QE is investigated for Hindi-English and Kannada-English CLIR, in which Hindi and Kannada queries are used to search English documents. After query translation, the retrieved results are ranked using Okapi BM25 to place the most relevant documents at the top, increasing the relevance of the documents retrieved under QE. We propose an architecture for Hindi-English and Kannada-English CLIR using QE.
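
    Okapi BM25 ranks each English document against the translated, expanded query. A self-contained sketch with the usual defaults k1 = 1.5 and b = 0.75 (defaults assumed, not values taken from the paper):

```python
# Okapi BM25 ranking: score each document against the query and sort.
# Pure Python with whitespace tokenization; the +1 inside the log is
# the common smoothing that keeps IDF non-negative.
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    N = len(docs)
    df = Counter(w for t in toks for w in set(t))  # document frequency
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for q in query.lower().split():
            if q not in tf:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return sorted(range(N), key=lambda i: -scores[i])

docs = ["cloud information retrieval", "query expansion for CLIR",
        "weather in the cloud"]
print(bm25_rank("cloud retrieval", docs))  # doc indices, most relevant first
```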

    Indonesian Sentence Boundary Detection using Deep Learning Approaches

    Detecting sentence boundaries is a crucial pre-processing step in natural language processing, since the border between one sentence and the next can be ambiguous: there are multiple separators, sentence patterns vary, and a full stop does not always mark the end of a sentence. This research uses a deep learning approach to split an Indonesian news document into sentences, so no handcrafted features or rules need to be defined. As in Part-of-Speech Tagging and Named Entity Recognition, we cast boundary detection as sequence labeling with two labels: O for a non-boundary token and E for the last token of a sentence. For this we use the Bi-LSTM architecture, which is widely used for sequence labeling, together with pre-trained Indonesian embeddings, and show, as in previous studies, that the approach works for Indonesian text. The model achieves an F1-score of 98.49 percent, a significant improvement over previous studies.
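
    A minimal sketch of such a Bi-LSTM tagger in PyTorch, assuming a randomly initialized embedding layer in place of the pre-trained Indonesian embeddings, with per-token O/E logits:

```python
# Bi-LSTM sequence labeler: each token is classified as O (non-boundary)
# or E (last token of a sentence). Dimensions and inputs are placeholders.
import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=100, hidden=128, n_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)  # per-token O/E logits

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))       # (B, T, 2*hidden)
        return self.out(h)                          # (B, T, n_labels)

model = BoundaryTagger()
tokens = torch.randint(0, 1000, (1, 12))   # one document of 12 token ids
labels = model(tokens).argmax(-1)          # 0 = O, 1 = E per token
print(labels)
```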

    Breaking Language Barriers with a LEAP: Learning Strategies for Polyglot LLMs

    Large language models (LLMs) are at the forefront of transforming numerous domains globally. However, their inclusivity and effectiveness remain limited for non-Latin scripts and low-resource languages. This paper tackles the imperative challenge of enhancing the multilingual performance of LLMs, focusing specifically on generative models. Through systematic investigation and evaluation of diverse languages using popular question-answering (QA) datasets, we present novel techniques that unlock the true potential of LLMs in a polyglot landscape. Our approach encompasses three key strategies that yield remarkable improvements in multilingual proficiency. First, by meticulously optimizing prompts tailored for polyglot LLMs, we unlock their latent capabilities, resulting in substantial performance boosts across languages. Second, we introduce a new hybrid approach that synergizes GPT generation with multilingual embeddings and achieves significant multilingual performance improvement on critical tasks like QA and retrieval. Finally, to further propel the performance of polyglot LLMs, we introduce a novel learning algorithm that dynamically selects the optimal prompt strategy, LLM model, and embeddings per query. This dynamic adaptation maximizes the efficacy of LLMs across languages, outperforming the best static and random strategies. Our results show substantial advancements in multilingual understanding and generation across a diverse range of languages.
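
    The paper's selection algorithm is not spelled out in this abstract; as one plausible reading, per-query selection over (prompt strategy, model, embedding) combinations can be framed as a bandit problem. The epsilon-greedy sketch below is a generic illustration, with evaluate() standing in for whatever reward signal (e.g. QA accuracy) the system observes; all arm names are hypothetical.

```python
# Generic epsilon-greedy bandit over (prompt strategy, model, embedding)
# arms -- an illustration of dynamic per-query selection, NOT the
# paper's algorithm. evaluate() is a placeholder reward function.
import itertools
import random

strategies = ["native-prompt", "translate-to-en"]   # hypothetical names
models = ["llm-a", "llm-b"]
embeddings = ["multilingual-emb", "english-emb"]
arms = list(itertools.product(strategies, models, embeddings))
value = {a: 0.0 for a in arms}   # running mean reward per arm
count = {a: 0 for a in arms}

def evaluate(arm, query):
    return random.random()       # stand-in for observed QA accuracy

def select(eps=0.1):
    if random.random() < eps:
        return random.choice(arms)             # explore
    return max(arms, key=lambda a: value[a])   # exploit best-so-far

for q in range(500):
    arm = select()
    r = evaluate(arm, q)
    count[arm] += 1
    value[arm] += (r - value[arm]) / count[arm]  # incremental mean

print(max(arms, key=lambda a: value[a]))  # best combination found
```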

    A massively parallel corpus: the Bible in 100 languages

    Get PDF
    We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally, we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.
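
    What makes the Bible attractive as a parallel corpus is that every verse carries a stable book:chapter:verse identifier, so translations can be aligned at the verse level. The sketch below pairs two translations on that key; the one-verse-per-line "ID<TAB>text" file format is an assumption for illustration, not the corpus's actual format.

```python
# Verse-level alignment of two Bible translations keyed on the
# book:chapter:verse identifier. Assumed format: "ID\ttext" per line.
def load(path):
    verses = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            vid, text = line.rstrip("\n").split("\t", 1)
            verses[vid] = text
    return verses

def align(path_a, path_b):
    a, b = load(path_a), load(path_b)
    shared = sorted(set(a) & set(b))   # verses present in both versions
    return [(vid, a[vid], b[vid]) for vid in shared]

# pairs = align("eng.tsv", "deu.tsv")  # hypothetical file names
```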