5,637 research outputs found
Automatic Text Summarization for Hindi Using Real Coded Genetic Algorithm
In the present scenario, Automatic Text Summarization (ATS) is in great demand to address the ever-growing volume of text data available online to discover relevant information faster. In this research, the ATS methodology is proposed for the Hindi language using Real Coded Genetic Algorithm (RCGA) over the health corpus, available in the Kaggle dataset. The methodology comprises five phases: preprocessing, feature extraction, processing, sentence ranking, and summary generation. Rigorous experimentation on varied feature sets is performed where distinguishing features, namely- sentence similarity and named entity features are combined with others for computing the evaluation metrics. The top 14 feature combinations are evaluated through Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measure. RCGA computes appropriate feature weights through strings of features, chromosomes selection, and reproduction operators: Simulating Binary Crossover and Polynomial Mutation. To extract the highest scored sentences as the corpus summary, different compression rates are tested. In comparison with existing summarization tools, the ATS extractive method gives a summary reduction of 65%
Generating indicative-informative summaries with SumUM
We present and evaluate SumUM, a text summarization system that takes a raw technical text as input and produces an indicative informative summary. The indicative part of the summary identifies the topics of the document, and the informative part elaborates on some of these topics according to the reader's interest. SumUM motivates the topics, describes entities, and defines concepts. It is a first step for exploring the issue of dynamic summarization. This is accomplished through a process of shallow syntactic and semantic analysis, concept identification, and text regeneration. Our method was developed through the study of a corpus of abstracts written by professional abstractors. Relying on human judgment, we have evaluated indicativeness, informativeness, and text acceptability of the automatic summaries. The results thus far indicate good performance when compared with other summarization technologies
Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization
In Automatic Text Summarization, preprocessing is an important phase to
reduce the space of textual representation. Classically, stemming and
lemmatization have been widely used for normalizing words. However, even using
normalization on large texts, the curse of dimensionality can disturb the
performance of summarizers. This paper describes a new method for normalization
of words to further reduce the space of representation. We propose to reduce
each word to its initial letters, as a form of Ultra-stemming. The results show
that Ultra-stemming not only preserve the content of summaries produced by this
representation, but often the performances of the systems can be dramatically
improved. Summaries on trilingual corpora were evaluated automatically with
Fresa. Results confirm an increase in the performance, regardless of summarizer
system used.Comment: 22 pages, 12 figures, 9 table
Automatic Text Summarization Approaches to Speed up Topic Model Learning Process
The number of documents available into Internet moves each day up. For this
reason, processing this amount of information effectively and expressibly
becomes a major concern for companies and scientists. Methods that represent a
textual document by a topic representation are widely used in Information
Retrieval (IR) to process big data such as Wikipedia articles. One of the main
difficulty in using topic model on huge data collection is related to the
material resources (CPU time and memory) required for model estimate. To deal
with this issue, we propose to build topic spaces from summarized documents. In
this paper, we present a study of topic space representation in the context of
big data. The topic space representation behavior is analyzed on different
languages. Experiments show that topic spaces estimated from text summaries are
as relevant as those estimated from the complete documents. The real advantage
of such an approach is the processing time gain: we showed that the processing
time can be drastically reduced using summarized documents (more than 60\% in
general). This study finally points out the differences between thematic
representations of documents depending on the targeted languages such as
English or latin languages.Comment: 16 pages, 4 tables, 8 figure
- …