16 research outputs found
Text Summarization Technique for Punjabi Language Using Neural Networks
In the contemporary world, consumption of digital content has risen exponentially. Newspaper and web
articles, status updates, advertisements, etc. have become an integral part of our daily routine. There is therefore a need
for automated systems that summarize large text documents to save time and effort. Summarizers for languages such as
English have matured since work began in the 1950s, but several languages, such as Punjabi, still need special attention.
Punjabi is morphologically much richer than English and other foreign languages. In this work, we present a three-phase
extractive summarization methodology using neural networks that produces a concise summary of a single Punjabi text
document. The methodology comprises a pre-processing phase that cleans the text, a processing phase that extracts statistical
and linguistic features, and a classification phase. The classification neural network applies a sigmoid activation function
and gradient-descent optimization to reduce the weighted error and generate the output summary. The proposed
summarization system is applied to a monolingual Punjabi text corpus from the Indian Languages Corpora Initiative, Phase II.
Precision, recall and F-measure of 90.0%, 89.28% and 89.65% respectively are achieved, which is reasonably good
compared with the performance of other existing Indian-language summarizers.

This research is partially funded by the Ministry of Economy, Industry and Competitiveness, Spain (CSO2017-86747-R)
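The classification phase described above can be illustrated with a minimal sketch: a single logistic (sigmoid) unit trained by gradient descent to score sentences from statistical features. The feature values and training data below are hypothetical illustrations, not the paper's actual feature set.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, lr=0.5, epochs=500):
    """Logistic unit trained by plain gradient descent on cross-entropy loss."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y                      # gradient of cross-entropy w.r.t. pre-activation
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Hypothetical feature rows: [sentence-position score, keyword-overlap score]
X = [[0.9, 0.8], [0.1, 0.2], [0.8, 0.7], [0.2, 0.1]]
y = [1, 0, 1, 0]                             # 1 = sentence kept in the summary
w, b = train(X, y)
# Score an unseen sentence; values near 1 mark it as summary-worthy
score = sigmoid(sum(wi * xi for wi, xi in zip(w, [0.85, 0.75])) + b)
```

In the paper's pipeline the features would come from the processing phase (statistical and linguistic scores per sentence); here they are toy numbers chosen only to show the sigmoid-plus-gradient-descent mechanics.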
Automatic Text Summarization for Hindi Using Real Coded Genetic Algorithm
In the present scenario, Automatic Text Summarization (ATS) is in great demand to address the ever-growing volume of text data available online and to discover relevant information faster. In this research, an ATS methodology is proposed for the Hindi language using a Real Coded Genetic Algorithm (RCGA) over a health corpus available in the Kaggle dataset. The methodology comprises five phases: preprocessing, feature extraction, processing, sentence ranking, and summary generation. Rigorous experimentation on varied feature sets is performed, where distinguishing features, namely sentence similarity and named-entity features, are combined with others for computing the evaluation metrics. The top 14 feature combinations are evaluated through the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measure. RCGA computes appropriate feature weights through strings of features, chromosome selection, and the reproduction operators Simulated Binary Crossover and Polynomial Mutation. Different compression rates are tested to extract the highest-scored sentences as the corpus summary. In comparison with existing summarization tools, the ATS extractive method gives a summary reduction of 65%.
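The two reproduction operators named in this abstract are standard in real-coded genetic algorithms and can be sketched directly. The parameters below (distribution indices, bounds, mutation rate, and the example weight vectors) are illustrative assumptions, not values from the paper.

```python
import random

def sbx_crossover(p1, p2, eta=2.0, rng=random):
    """Simulated Binary Crossover: children preserve the parents' per-gene mean."""
    c1, c2 = [], []
    for x1, x2 in zip(p1, p2):
        u = rng.random()
        if u <= 0.5:
            beta = (2 * u) ** (1 / (eta + 1))
        else:
            beta = (1 / (2 * (1 - u))) ** (1 / (eta + 1))
        c1.append(0.5 * ((1 + beta) * x1 + (1 - beta) * x2))
        c2.append(0.5 * ((1 - beta) * x1 + (1 + beta) * x2))
    return c1, c2

def polynomial_mutation(chrom, low=0.0, high=1.0, eta=20.0, pm=0.1, rng=random):
    """Polynomial mutation: small bounded perturbations of each gene."""
    out = []
    for x in chrom:
        if rng.random() < pm:
            u = rng.random()
            if u < 0.5:
                delta = (2 * u) ** (1 / (eta + 1)) - 1
            else:
                delta = 1 - (2 * (1 - u)) ** (1 / (eta + 1))
            x = min(high, max(low, x + delta * (high - low)))
        out.append(x)
    return out

random.seed(1)
# Two hypothetical chromosomes encoding per-feature weights
w1, w2 = [0.2, 0.7, 0.5], [0.6, 0.3, 0.9]
child1, child2 = sbx_crossover(w1, w2)
child1 = polynomial_mutation(child1)
```

In an RCGA summarizer, each chromosome would encode one candidate weight vector over the sentence features, and fitness would be a ROUGE score of the summary produced with those weights; that fitness loop is omitted here.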
INDIAN LANGUAGE TEXT MINING
India is home to many languages, owing to its cultural and geographical diversity. The Constitution of India provides for each state to choose its own official language for state-level official communication. In India, consumption of Indian-language content has grown with the spread of electronic devices and technology, and the amount of textual data in various Indian regional languages available in electronic form is constantly increasing. But not much work has been done on Indian-language text processing, so there is a huge gap between the stored data and the knowledge that could be constructed from it. This transition will not occur automatically; that is where text mining comes into the picture. This research is concerned with the study and analysis of text mining for Indian regional languages. Text mining refers to a knowledge discovery process in which the source data under consideration is text. It is a new and exciting research area that tries to solve the information overload problem by using techniques from information retrieval, information extraction and natural language processing (NLP), and connects them with the algorithms and methods of KDD, data mining, machine learning and statistics. Some applications of text mining are document classification, information retrieval, document clustering, information extraction, and performance evaluation. In this paper we attempt to show the need for text mining for Indian languages.
XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages
Lack of encyclopedic text contributors, especially on Wikipedia, makes
automated text generation for low resource (LR) languages a critical problem.
Existing work on Wikipedia text generation has focused on English only where
English reference articles are summarized to generate English Wikipedia pages.
But, for low-resource languages, the scarcity of reference articles makes
monolingual summarization ineffective in solving this problem. Hence, in this
work, we propose XWikiGen, which is the task of cross-lingual multi-document
summarization of text from multiple reference articles, written in various
languages, to generate Wikipedia-style text. Accordingly, we contribute a
benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five
domains and eight languages. We harness this dataset to train a two-stage
system where the input is a set of citations and a section title and the output
is a section-specific LR summary. The proposed system is based on a novel idea
of neural unsupervised extractive summarization to coarsely identify salient
information followed by a neural abstractive model to generate the
section-specific text. Extensive experiments show that multi-domain training is
better than the multilingual setup on average.
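The first stage of the extract-then-abstract pipeline described above can be approximated with a simple unsupervised heuristic: score each candidate sentence by cosine similarity to the term-frequency centroid of all reference text and keep the top k. This is a hedged sketch of the general idea only; the paper's actual extractor is neural, and the sentences below are invented examples.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector via naive whitespace tokenization."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extract_salient(sentences, k=2):
    """Rank sentences by similarity to the centroid of all input text."""
    centroid = tf_vector(" ".join(sentences))
    ranked = sorted(sentences, key=lambda s: cosine(tf_vector(s), centroid),
                    reverse=True)
    return ranked[:k]

refs = [
    "low resource languages need automated text generation",
    "text generation for low resource languages is hard",
    "bananas are yellow",
]
salient = extract_salient(refs, k=2)
```

The selected salient sentences would then be fed to an abstractive model (a multilingual seq2seq model in the paper) to write the section text; that second stage is not reproduced here.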
XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
Contemporary works on abstractive text summarization have focused primarily
on high-resource languages like English, mostly due to the limited availability
of datasets for low/mid-resource ones. In this work, we present XL-Sum, a
comprehensive and diverse dataset comprising 1 million professionally annotated
article-summary pairs from BBC, extracted using a set of carefully designed
heuristics. The dataset covers 44 languages ranging from low to high-resource,
for many of which no public dataset is currently available. XL-Sum is highly
abstractive, concise, and of high quality, as indicated by human and intrinsic
evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model,
with XL-Sum and experiment on multilingual and low-resource summarization
tasks. XL-Sum yields results competitive with those obtained from similar
monolingual datasets: multilingual training achieves ROUGE-2 scores above 11 on
the 10 languages we benchmark on, with some exceeding 15. Additionally,
training on low-resource languages
individually also provides competitive performance. To the best of our
knowledge, XL-Sum is the largest abstractive summarization dataset in terms of
the number of samples collected from a single source and the number of
languages covered. We are releasing our dataset and models to encourage future
research on multilingual abstractive summarization. The resources can be found
at https://github.com/csebuetnlp/xl-sum

Comment: Findings of the Association for Computational Linguistics, ACL 2021 (camera-ready)
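The ROUGE-2 metric behind the scores reported above counts overlapping word bigrams between a candidate summary and a reference. The few lines below are a simplified re-implementation (F1 over bigram counts, whitespace tokenization, no stemming), not the official ROUGE package.

```python
from collections import Counter

def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge2(candidate, reference):
    """ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    cand = Counter(bigrams(candidate.lower().split()))
    ref = Counter(bigrams(reference.lower().split()))
    overlap = sum(min(cand[g], ref[g]) for g in ref)   # clipped bigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A perfect match scores 1.0 and disjoint texts score 0.0; published ROUGE-2 numbers such as the "above 11" here are these fractions multiplied by 100.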
Scientific Documents clustering based on Text Summarization
In this paper a novel method is proposed for scientific document clustering. The proposed method is a summarization-based hybrid algorithm that begins with a preprocessing phase, in which unimportant words that occur frequently in the text are removed. This reduces the amount of data to be clustered; moreover, frequent terms cause overlap between clusters, which degrades cluster separation. After preprocessing, Term Frequency/Inverse Document Frequency (TF-IDF) is calculated for all words and stems to score them within each document. Text summarization is then performed at the sentence level, and document clustering is finally done according to the calculated TF-IDF scores. The hybrid pipeline, from preprocessing to document clustering, yields a fast and efficient clustering method, evaluated on 400 English texts extracted from scientific databases across 11 different topics. The proposed method is compared with the CSSA, SMTC and Max-Capture methods. The results demonstrate the proficiency of the proposed scheme in terms of computation time and efficiency using the F-measure criterion.
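The TF-IDF scoring step at the heart of this pipeline can be sketched as follows; tokenization here is naive whitespace splitting, and the raw log(N/df) weighting (no smoothing) is an illustrative assumption rather than the paper's exact formula.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> one {term: tf-idf weight} dict per document."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights

docs = [s.lower().split() for s in [
    "scientific document clustering",
    "scientific text summarization",
    "scientific cluster separation",
]]
weights = tfidf(docs)
# A term appearing in every document gets weight 0, matching the paper's point
# that ubiquitous frequent words blur cluster boundaries.
```

In the full method these per-document weight vectors would feed both the sentence-level summarizer and the final clustering step.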
AGI-P: A Gender Identification Framework for Authorship Analysis Using Customized Fine-Tuning of Multilingual Language Model
In this investigation, we propose a solution for the author’s gender identification task called AGI-P. This task has several real-world applications across different fields, such as marketing and advertising, forensic linguistics, sociology, recommendation systems, language processing, historical analysis, education, and language learning. We created a new dataset to evaluate our proposed method. The dataset is balanced in terms of gender using a random sampling method and consists of 1944 samples in total. We use accuracy as an evaluation measure and compare the performance of the proposed solution (AGI-P) against state-of-the-art machine learning classifiers and fine-tuned pre-trained multilingual language models such as DistilBERT, mBERT, XLM-RoBERTa, and Multilingual DEBERTa. In this regard, we also propose a customized fine-tuning strategy that improves the accuracy of the pre-trained language models for the author gender identification task. Our extensive experimental studies reveal that our solution (AGI-P) outperforms the well-known machine learning classifiers and fine-tuned pre-trained multilingual language models with an accuracy level of 92.03%. Moreover, the pre-trained multilingual language models, fine-tuned with the proposed customized strategy, outperform the fine-tuned pre-trained language models using an out-of-the-box fine-tuning strategy. The codebase and corpus can be accessed on our GitHub page at: https://github.com/mumairhassan/AGI-
IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages
Natural Language Generation (NLG) for non-English languages is hampered by
the scarcity of datasets in these languages. In this paper, we present the
IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic
languages. We focus on five diverse tasks, namely, biography generation using
Wikipedia infoboxes, news headline generation, sentence summarization,
paraphrase generation and, question generation. We describe the created
datasets and use them to benchmark the performance of several monolingual and
multilingual baselines that leverage pre-trained sequence-to-sequence models.
Our results exhibit the strong performance of multilingual language-specific
pre-trained models, and the utility of models trained on our dataset for other
related NLG tasks. Our dataset creation methods can be easily applied to
modest-resource languages as they involve simple steps such as scraping news
articles and Wikipedia infoboxes, light cleaning, and pivoting through machine
translation data. To the best of our knowledge, the IndicNLG Benchmark is the
first NLG benchmark for Indic languages and the most diverse multilingual NLG
dataset, with approximately 8M examples across 5 tasks and 11 languages. The
datasets and models are publicly available at
https://ai4bharat.iitm.ac.in/indicnlg-suite

Comment: Accepted at EMNLP 202