Text Preprocessing for Speech Synthesis
In this paper we describe our text preprocessing modules for English text-to-speech synthesis. These modules comprise rule-based text normalization, subsuming sentence segmentation and normalization of non-standard words; statistical part-of-speech tagging; and statistical syllabification, grapheme-to-phoneme conversion, and word-stress assignment, relying in part on rule-based morphological analysis.
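The rule-based side of such a front end can be sketched minimally as follows. The abbreviation table and number lexicon below are tiny invented placeholders, not the paper's modules, which are far richer:

```python
import re

# Hypothetical rule tables for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
UNITS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]

def expand_number(token):
    """Spell out small cardinal numbers; pass everything else through."""
    if token.isdigit() and int(token) <= 10:
        return UNITS[int(token)]
    return token

def normalize(text):
    """Segment into sentences, then expand non-standard words per sentence."""
    # Expand known abbreviations first, so the naive sentence splitter
    # is not confused by their trailing dots.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(expand_number(t) for t in s.split()) for s in sentences]
```

For example, `normalize("Dr. Smith bought 3 apples. Great!")` yields `["Doctor Smith bought three apples.", "Great!"]`; a production system would add many more non-standard word classes (dates, currency, ordinals) and disambiguation rules.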
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
Text preprocessing is often the first step in the pipeline of a Natural
Language Processing (NLP) system, with potential impact in its final
performance. Despite its importance, text preprocessing has not received much
attention in the deep learning literature. In this paper we investigate the
impact of simple text preprocessing decisions (particularly tokenizing,
lemmatizing, lowercasing and multiword grouping) on the performance of a
standard neural text classifier. We perform an extensive evaluation on standard
benchmarks from text categorization and sentiment analysis. While our
experiments show that a simple tokenization of input text is generally
adequate, they also highlight significant degrees of variability across
preprocessing techniques. This reveals the importance of paying attention to
this usually-overlooked step in the pipeline, particularly when comparing
different models. Finally, our evaluation provides insights into the best
preprocessing practices for training word embeddings. Comment: Blackbox EMNLP 2018, 7 pages.
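The four preprocessing decisions studied here can be sketched as composable token-level transforms. The lemma table and multiword list below are tiny invented placeholders, not the resources used in the paper:

```python
import re

# Illustrative stand-ins for a real lemmatizer and multiword lexicon.
LEMMAS = {"ran": "run", "cats": "cat", "better": "good"}
MULTIWORDS = {("new", "york"): "new_york"}

def tokenize(text):
    return re.findall(r"\w+", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

def group_multiwords(tokens):
    """Merge known adjacent pairs into a single multiword token."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORDS:
            out.append(MULTIWORDS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def preprocess(text, steps):
    """Tokenize, then apply the chosen preprocessing steps in order."""
    tokens = tokenize(text)
    for step in steps:
        tokens = step(tokens)
    return tokens
```

Comparing a classifier trained on `preprocess(text, [])` against one trained on, say, `preprocess(text, [lowercase, lemmatize, group_multiwords])` is the kind of controlled variation the evaluation performs.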
Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers
With the advent of modern pre-trained Transformers, text preprocessing has started to be neglected and is not specifically addressed in recent NLP literature. However, from both a linguistic and a computer science point of view, we believe that even when using modern Transformers, text preprocessing can significantly impact the performance of a classification model. In this study, we investigate and compare how preprocessing affects the Text Classification (TC) performance of modern and traditional classification models. We report and discuss the preprocessing techniques found in the literature and their most recent variants or applications to TC tasks in different domains. To assess how much preprocessing affects classification performance, we apply the three most-referenced preprocessing techniques (alone or in combination) to four publicly available datasets from different domains. Then, nine machine learning models, including modern Transformers, receive the preprocessed text as input. The results show that an educated choice of text preprocessing strategy should be based on the task as well as on the model considered. Choosing the best preprocessing technique rather than the worst can significantly improve classification accuracy (by up to 25%, as in the case of XLNet on the IMDB dataset). In some cases, with a suitable preprocessing strategy, even a simple Naïve Bayes classifier outperformed the best-performing Transformer (by 2% in accuracy). We found that both Transformers and traditional models can exhibit a substantial impact of preprocessing on TC performance.
Our main findings are: (1) even on modern pre-trained language models, preprocessing can affect performance, depending on the dataset and on the preprocessing technique or combination of techniques used; (2) in some cases, with a proper preprocessing strategy, simple models can outperform Transformers on TC tasks; (3) similar classes of models exhibit similar levels of sensitivity to text preprocessing.
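The evaluation grid described above, every combination of preprocessing steps applied before training each model, can be sketched as follows. The three step functions are toy stand-ins for the techniques actually benchmarked:

```python
from itertools import combinations

# Toy preprocessing steps; the study's three techniques would go here.
def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stop=frozenset({"the", "a", "is"})):
    return [t for t in tokens if t not in stop]

def crude_stem(tokens):
    return [t[:-1] if t.endswith("s") else t for t in tokens]

STEPS = {"lower": lowercase, "stop": remove_stopwords, "stem": crude_stem}

def all_pipelines():
    """Yield every subset of preprocessing steps, including the empty one."""
    names = list(STEPS)
    for r in range(len(names) + 1):
        for combo in combinations(names, r):
            yield combo

def apply_pipeline(tokens, combo):
    for name in combo:
        tokens = STEPS[name](tokens)
    return tokens
```

With three techniques this yields 2³ = 8 pipelines; crossing them with the nine models gives the full grid of accuracy measurements the survey reports.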
Enhanced Integrated Scoring for Cleaning Dirty Texts
An increasing number of approaches for ontology engineering from text are
gearing towards the use of online sources such as company intranet and the
World Wide Web. Despite this rise, little work can be found on
preprocessing and cleaning dirty texts from online sources. This paper presents
an enhancement of an Integrated Scoring for Spelling error correction,
Abbreviation expansion and Case restoration (ISSAC). ISSAC is implemented as
part of a text preprocessing phase in an ontology engineering system. New
evaluations performed on the enhanced ISSAC using 700 chat records reveal an
improved accuracy of 98% as compared to 96.5% and 71% based on the use of only
basic ISSAC and of Aspell, respectively. Comment: More information is available at
http://explorer.csse.uwa.edu.au/reference
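The abstract does not give ISSAC's actual scoring formula, so the sketch below only illustrates the general shape of an integrated cleaner: candidate replacements come from three sources (abbreviation expansion, case restoration, spelling correction) and the best-scoring candidate wins. All tables and weights are invented:

```python
# Invented resources for illustration; a real system would use large lexica.
ABBREV = {"asap": "as soon as possible", "btw": "by the way"}
PROPER = {"alice": "Alice", "bob": "Bob"}
DICTIONARY = {"hello", "world", "meeting"}

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance, one row at a time."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def clean_token(token):
    """Score candidates from all three sources; keep the best, else pass through."""
    cands = []  # (score, replacement); higher score wins
    low = token.lower()
    if low in ABBREV:
        cands.append((3, ABBREV[low]))          # abbreviation expansion
    if low in PROPER:
        cands.append((2, PROPER[low]))          # case restoration
    for word in DICTIONARY:                     # spelling correction
        if 0 < edit_distance(low, word) <= 1:
            cands.append((1, word))
    return max(cands)[1] if cands else token
```

So `clean_token("helo")` returns `"hello"` and `clean_token("asap")` returns its expansion; ISSAC's actual integrated score additionally weighs evidence such as context, which this sketch omits.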
The Effect of Stemming on the Performance of Sentiment Classification of Public Opinion on the New Normal Policy
The large number of Twitter users can be leveraged to gauge public sentiment about the government's policies and handling of Covid-19, one of which is the policy on adapting to new habits, the "new normal". One function of text mining, namely text classification, can be used for this. Before a text classification model is built, the text passes through preprocessing stages, and each stage influences the evaluation results of the classification. The aim of this study is to compare classification performance with and without stemming in the preprocessing stage, and to determine which classification algorithm performs best when stemming is applied. The classification methods used are Naive Bayes and Logistic Regression. The experiments show that with stemming in preprocessing, Naive Bayes and Logistic Regression achieve accuracies of 74.11% and 73.57% respectively, while without stemming they achieve 78.47% and 76.29%. From the model evaluation, omitting stemming yields higher accuracy for each model, by 4.36% and 2.72% respectively, compared to applying stemming. These results indicate that the stemming stage can reduce classification accuracy. The classification algorithm least affected by stemming is Logistic Regression, since its accuracy drop is smaller than that of Naive Bayes.
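The comparison the study performs, the same classifier evaluated with and without a stemming step, can be sketched as below. The suffix-stripper and the keyword "classifier" are toy stand-ins; the study used Naive Bayes and Logistic Regression with a real Indonesian stemmer:

```python
def crude_stem(token):
    """Toy Indonesian-style suffix stripping; not a real stemmer."""
    for suffix in ("nya", "kan", "an"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def classify(text, positive=frozenset({"baik", "bagus", "dukung"})):
    """Trivial keyword classifier standing in for Naive Bayes / LogReg."""
    return "pos" if any(t in positive for t in text.lower().split()) else "neg"

def accuracy(data, use_stemming):
    """Evaluate the classifier over (text, label) pairs, optionally stemming first."""
    correct = 0
    for text, label in data:
        if use_stemming:
            text = " ".join(crude_stem(t) for t in text.lower().split())
        correct += classify(text) == label
    return correct / len(data)
```

Note that this toy data-free sketch can go either way: on a sample where "dukungan" must match the keyword "dukung", stemming raises accuracy, whereas on the study's Twitter dataset stemming lowered it. That direction-dependence on data and model is exactly what the paper measures.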
Learning to Use Normalization Techniques for Preprocessing and Classification of Text Documents
Text classification is one of the most substantial areas in natural language processing. In this task, text documents are divided into various types according to the researcher's purpose. In the text classification process, the basic phase is text preprocessing, in which cleaning and preparing the text data are significant tasks. To accomplish these tasks, normalization techniques play a major role. Different kinds of normalization techniques are available. In this research, we mainly focus on different normalization techniques and the way they are applied in text preprocessing. Normalization techniques reduce the number of words in the text files and change word forms into other forms. This helps to analyze unstructured texts and to transform the text into a standard form, which improves the efficiency and performance of the text classification process. For text classification, it is important to extract the most reliable and relevant words of the text files, because feature extraction drives successful classification. This study includes lowercasing, tokenization, stop-word removal, and lemmatization as normalization techniques. 200 text documents from two different domains, namely formal news articles and informal letters obtained from the Internet in the English language, were evaluated using these normalization techniques. The experimental results show the effectiveness of normalization techniques for the preprocessing and classification of text documents, based on a comparison of the text files before and after normalization. From this comparison, we identified that these normalization techniques help to clean and prepare text data for effective and accurate text document classification.
KEYWORDS: Preprocessing, Normalization, Techniques, Cleaning documents, Text classification
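The four normalization steps named above can be sketched end to end, with the vocabulary reduction the abstract mentions made measurable. The stop-word set and lemma table are tiny invented placeholders:

```python
import re

# Illustrative placeholders for real stop-word lists and lemmatizers.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "to"}
LEMMAS = {"letters": "letter", "articles": "article", "wrote": "write"}

def normalize(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # lowercase + tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]           # lemmatization

def vocabulary_reduction(text):
    """Distinct word counts before and after normalization."""
    return len(set(text.split())), len(set(normalize(text)))
```

For example, "She wrote the Articles and the letters" shrinks from 6 distinct raw tokens to 4 normalized ones; aggregated over a corpus, this reduction is what makes the downstream feature space smaller and cleaner.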
The Influence of Text Preprocessing Methods and Tools on Calculating Text Similarity
Text mining depends to a great extent on various text preprocessing techniques. The preprocessing methods and tools used to prepare texts for further mining can be divided into those which are language-dependent and those which are not. The subject of this research was the analysis of the influence of these methods and tools on further text mining. We first analyzed their influence on the reduction of the vector space model for the multidimensional representation of text documents. We then analyzed their influence on calculating text similarity, which is the focus of this research. We conclude that the implementation of various text preprocessing methods for the Serbian language, used to reduce the vector space model for the multidimensional representation of text documents, achieves the required results. However, the implementation of text preprocessing methods specific to the Serbian language for the purpose of calculating text similarity can lead to great differences in the results.
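The similarity measurement at issue can be sketched with bag-of-words vectors and cosine similarity; a trivial lowercasing step stands in here for the language-specific preprocessing (e.g. Serbian stemming) whose effect the study measures:

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Without preprocessing the surface forms do not match at all;
# with even trivial lowercasing the documents become identical vectors.
raw_sim = cosine_similarity("Mining Text".split(), "text mining".split())
prep_sim = cosine_similarity("mining text".split(), "text mining".split())
```

The jump from 0.0 to 1.0 in this toy case is the extreme version of the effect reported above: preprocessing choices change the vectors and can therefore change similarity scores dramatically.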
Minimal Suffix and Rotation of a Substring in Optimal Time
For a text given in advance, the substring minimal suffix queries ask to
determine the lexicographically minimal non-empty suffix of a substring
specified by the location of its occurrence in the text. We develop a data
structure answering such queries optimally: in constant time after linear-time
preprocessing. This improves upon the results of Babenko et al. (CPM 2014),
whose trade-off solution is characterized by the product of these
time complexities. Next, we extend our queries to support concatenations of
substrings, for which the construction and query time is preserved. We
apply these generalized queries to compute lexicographically minimal and
maximal rotations of a given substring in constant time after linear-time
preprocessing.
Our data structures mainly rely on properties of Lyndon words and Lyndon
factorizations. We combine them with further algorithmic and combinatorial
tools, such as fusion trees and the notion of order isomorphism of strings.
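The minimal-rotation primitive that the paper accelerates to constant-time substring queries can be computed for a single standalone string with Booth's classic linear-time algorithm. This sketch is that baseline computation only, not the paper's Lyndon-based query structure:

```python
def least_rotation(s):
    """Booth's algorithm: index where the lexicographically minimal
    rotation of s starts, in O(n) time and space."""
    s2 = s + s                   # every rotation is a length-|s| window of s+s
    f = [-1] * len(s2)           # failure function, as in KMP
    k = 0                        # start of the least rotation found so far
    for j in range(1, len(s2)):
        sj = s2[j]
        i = f[j - k - 1]
        while i != -1 and sj != s2[k + i + 1]:
            if sj < s2[k + i + 1]:
                k = j - i - 1    # found a lexicographically smaller start
            i = f[i]
        if sj != s2[k + i + 1]:  # here i == -1
            if sj < s2[k]:
                k = j
            f[j - k] = -1
        else:
            f[j - k] = i + 1
    return k
```

For `"baca"` this returns 3, i.e. the minimal rotation `"abac"`. The paper's contribution is answering the same question for any substring of a preprocessed text in O(1), which this one-shot algorithm does not provide.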