
    Text Preprocessing for Speech Synthesis

    In this paper we describe our text preprocessing modules for English text-to-speech synthesis. These modules comprise rule-based text normalization, subsuming sentence segmentation and normalization of non-standard words, statistical part-of-speech tagging, as well as statistical syllabification, grapheme-to-phoneme conversion, and word stress assignment, relying in part on rule-based morphological analysis.
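
    The normalization of non-standard words mentioned above typically expands digits, abbreviations, and symbols into speakable words before the later stages run. The following is a minimal, illustrative sketch of such a rule-based normalizer; the lexicon and rules are invented for the example and are not the modules described in the paper.

```python
import re

# Illustrative abbreviation lexicon; the paper's actual rule set is not shown here.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def spell_number(token: str) -> str:
    """Expand a digit string digit-by-digit (a simple fallback strategy)."""
    return " ".join(ONES[int(d)] for d in token)

def normalize(text: str) -> str:
    """Replace abbreviations and digit strings with speakable words."""
    words = []
    for token in text.split():
        low = token.lower()
        if low in ABBREVIATIONS:
            words.append(ABBREVIATIONS[low])
        elif re.fullmatch(r"\d+", token):
            words.append(spell_number(token))
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Dr. Smith lives at 221 Baker St."))
# -> doctor Smith lives at two two one Baker street
```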

    On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis

    Text preprocessing is often the first step in the pipeline of a Natural Language Processing (NLP) system, with a potential impact on its final performance. Despite its importance, text preprocessing has not received much attention in the deep learning literature. In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a standard neural text classifier. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. While our experiments show that a simple tokenization of input text is generally adequate, they also highlight significant degrees of variability across preprocessing techniques. This reveals the importance of paying attention to this usually overlooked step in the pipeline, particularly when comparing different models. Finally, our evaluation provides insights into the best preprocessing practices for training word embeddings. Comment: Blackbox EMNLP 2018, 7 pages.
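
    The four preprocessing decisions studied here are easy to illustrate in isolation. Below is a toy sketch of tokenization, lowercasing, lemmatization, and multiword grouping; the lemma table and multiword list are invented stand-ins for the resources a real system would use.

```python
# Toy versions of the four preprocessing decisions studied in the paper.
LEMMAS = {"was": "be", "bought": "buy", "cars": "car"}
MULTIWORDS = {("new", "york"): "new_york"}

def tokenize(text):
    """Whitespace tokenization (real systems use proper tokenizers)."""
    return text.split()

def lowercase(tokens):
    return [t.lower() for t in tokens]

def lemmatize(tokens):
    """Lookup-based lemmatization: map inflected forms to their lemma."""
    return [LEMMAS.get(t, t) for t in tokens]

def group_multiwords(tokens):
    """Merge known bigrams into a single token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MULTIWORDS:
            out.append(MULTIWORDS[(tokens[i], tokens[i + 1])])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = lowercase(tokenize("He bought cars in New York"))
print(lemmatize(tokens))         # ['he', 'buy', 'car', 'in', 'new', 'york']
print(group_multiwords(tokens))  # ['he', 'bought', 'cars', 'in', 'new_york']
```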

    Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers

    With the advent of modern pre-trained Transformers, text preprocessing has started to be neglected and is rarely addressed explicitly in recent NLP literature. However, from both a linguistic and a computer science point of view, we believe that even when modern Transformers are used, text preprocessing can significantly affect the performance of a classification model. In this study we investigate and compare how preprocessing affects the Text Classification (TC) performance of modern and traditional classification models. We report and discuss the preprocessing techniques found in the literature and their most recent variants or applications to TC tasks in different domains. To assess how much preprocessing affects classification performance, we apply the three most frequently referenced preprocessing techniques (alone or in combination) to four publicly available datasets from different domains. Nine machine learning models, including modern Transformers, then receive the preprocessed text as input. The results show that an educated choice of text preprocessing strategy should be based on the task as well as on the model considered. Choosing the best preprocessing technique in place of the worst can significantly improve classification accuracy (by up to 25%, as in the case of XLNet on the IMDB dataset). In some cases, with a suitable preprocessing strategy, even a simple Naïve Bayes classifier outperformed the best-performing Transformer (by 2% in accuracy). We also found that both Transformers and traditional models can be strongly affected by preprocessing. Our main findings are: (1) even with modern pre-trained language models, preprocessing can affect performance, depending on the dataset and on the preprocessing technique or combination of techniques used; (2) in some cases, with a proper preprocessing strategy, simple models can outperform Transformers on TC tasks; (3) similar classes of models exhibit similar levels of sensitivity to text preprocessing.
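
    The experimental design, pairing each preprocessing variant with each classifier over the same data, can be sketched as a simple grid. The toy dataset, the two preprocessors, and the two scikit-learn models below are placeholders for the paper's four datasets, three techniques, and nine models.

```python
# Sketch of a preprocessing x model grid; all data and choices are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["a great movie", "a terrible movie", "loved it", "hated it"] * 10
labels = [1, 0, 1, 0] * 10

preprocessors = {
    "raw": lambda s: s,        # no preprocessing
    "lowercase": str.lower,    # one simple technique; add others analogously
}
models = {"nb": MultinomialNB(), "logreg": LogisticRegression(max_iter=1000)}

for p_name, prep in preprocessors.items():
    for m_name, model in models.items():
        # The vectorizer applies the chosen preprocessing before counting terms.
        pipe = make_pipeline(TfidfVectorizer(preprocessor=prep), model)
        score = cross_val_score(pipe, texts, labels, cv=3).mean()
        print(f"{p_name:9s} + {m_name:6s}: accuracy = {score:.3f}")
```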

    Enhanced Integrated Scoring for Cleaning Dirty Texts

    An increasing number of approaches for ontology engineering from text are gearing towards the use of online sources such as company intranets and the World Wide Web. Despite this rise, little work can be found on preprocessing and cleaning dirty texts from online sources. This paper presents an enhancement of Integrated Scoring for Spelling error correction, Abbreviation expansion and Case restoration (ISSAC). ISSAC is implemented as part of the text preprocessing phase in an ontology engineering system. New evaluations of the enhanced ISSAC on 700 chat records reveal an improved accuracy of 98%, compared to 96.5% with basic ISSAC and 71% with Aspell. Comment: More information is available at http://explorer.csse.uwa.edu.au/reference
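
    The abstract does not give ISSAC's scoring formula, so the sketch below only illustrates the general idea of integrated scoring: each candidate replacement for a noisy token receives a single combined score across spelling correction, abbreviation expansion, and case restoration. The component scorers and weights are invented for the example.

```python
# Generic integrated scoring over candidate replacements; not ISSAC's formula.
ABBREV = {"u": "you", "r": "are"}

def edit_score(token, cand):
    """Spelling component: crude length-difference proxy for edit distance."""
    return 1.0 / (1 + abs(len(token) - len(cand)))

def abbrev_score(token, cand):
    """Abbreviation component: reward known expansions."""
    return 1.0 if ABBREV.get(token.lower()) == cand.lower() else 0.0

def case_score(cand, at_sentence_start):
    """Case component: sentence-initial candidates should be capitalized."""
    return 1.0 if cand[:1].isupper() == at_sentence_start else 0.0

def integrated_score(token, cand, at_start, w=(0.4, 0.4, 0.2)):
    """One combined score across the three repair operations."""
    return (w[0] * edit_score(token, cand)
            + w[1] * abbrev_score(token, cand)
            + w[2] * case_score(cand, at_start))

candidates = ["you", "You", "u"]
best = max(candidates, key=lambda c: integrated_score("u", c, at_start=True))
print(best)  # "You": abbreviation expanded and sentence-initial case restored
```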

    The Effect of Stemming on the Performance of Public Sentiment Classification Regarding the New Normal Policy

    The large number of Twitter users can be leveraged to gauge public sentiment about the policies and measures taken by the government in response to Covid-19, one of which is the policy on adapting to new habits, the "new normal". This can be done with one of the functions of text mining, namely text classification. Before a text classification model is built, the text passes through preprocessing stages, and each stage influences the evaluation results of the classification. The aim of this study is to compare classification performance with and without stemming in preprocessing, and to determine which classification algorithm performs best when stemming is applied. The classification methods used are Naive Bayes and Logistic Regression. The experimental results show that with stemming in preprocessing, the Naive Bayes and Logistic Regression models achieve accuracies of 74.11% and 73.57%, respectively, whereas without stemming they achieve 78.47% and 76.29%. The model evaluation shows that omitting stemming in preprocessing yields higher accuracy for each model, by 4.36% and 2.72% respectively, than applying it. These results indicate that the stemming stage can lower classification accuracy. The classification algorithm least affected by stemming in preprocessing is Logistic Regression, since its drop in accuracy is smaller than that of the Naïve Bayes algorithm.
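
    The comparison itself is straightforward to reproduce in outline: train the same model on stemmed and unstemmed text and compare accuracies. The sketch below uses English placeholder data and NLTK's PorterStemmer as a stand-in; the study works on Indonesian tweets, where an Indonesian stemmer would be used instead.

```python
# Stemming on/off comparison; data and stemmer are stand-ins, not the study's.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def stemmed_tokens(text):
    """Tokenize on whitespace, then stem each token."""
    return [stemmer.stem(t) for t in text.split()]

texts = ["masks are helping", "reopening feels rushed"] * 20
labels = [1, 0] * 20

for name, tok in [("no stemming", None), ("stemming", stemmed_tokens)]:
    # tokenizer=None falls back to the vectorizer's default tokenization.
    pipe = make_pipeline(CountVectorizer(tokenizer=tok), MultinomialNB())
    score = cross_val_score(pipe, texts, labels, cv=3).mean()
    print(f"{name}: accuracy = {score:.3f}")
```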

    Learning to Use Normalization Techniques for Preprocessing and Classification of Text Documents

    Text classification is one of the most substantial areas in natural language processing. In this task, text documents are assigned to various categories according to the researcher's purpose. The basic phase of the text classification process is text preprocessing, in which cleaning and preparing the text data are significant tasks. Normalization techniques play a major role in accomplishing these tasks, and different kinds of normalization techniques are available. In this research, we mainly focus on different normalization techniques and the way they are applied in text preprocessing. Normalization techniques reduce the words of the text files and change word forms into other forms. This helps to analyze unstructured texts and to bring the text into a standard form, which improves the efficiency and performance of the text classification process. For text classification, it is important to extract the most reliable and relevant words of the text files, because feature extraction determines the success of classification. This study applies lowercasing, tokenization, stop word removal, and lemmatization as normalization techniques. 200 English-language text documents from two different domains, namely formal news articles and informal letters obtained from the Internet, were evaluated using these normalization techniques. The experimental results show the effectiveness of normalization techniques for the preprocessing and classification of text documents, comparing the text files before and after normalization. Based on this comparison, we identified that these normalization techniques help to clean and prepare text data for effective and accurate text document classification. KEYWORDS: Preprocessing, Normalization, Techniques, Cleaning documents, Text classification
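
    A sketch of the four normalization steps named above (lowercasing, tokenization, stop word removal, lemmatization) using NLTK; the example sentence and the extra punctuation filter are ours, not the study's exact setup.

```python
# Four-step normalization pipeline with NLTK. First run requires:
#   nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def normalize(text):
    tokens = word_tokenize(text.lower())                 # lowercasing + tokenization
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation tokens
    tokens = [t for t in tokens if t not in stop_words]  # stop word removal
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(normalize("The committees were reviewing the documents."))
# e.g. ['committee', 'reviewing', 'document']
```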

    The Influence of Text Preprocessing Methods and Tools on Calculating Text Similarity

    Text mining depends to a great extent on various text preprocessing techniques. The preprocessing methods and tools used to prepare texts for further mining can be divided into those which are language-dependent and those which are not. The subject of this research was the analysis of the influence of these methods and tools on further text mining. We first focused on their influence on the reduction of the vector space model used for the multidimensional representation of text documents. We then analyzed their influence on calculating text similarity, which is the focus of this research. The conclusion we reached is that the implementation of various text preprocessing methods for the Serbian language achieves the required results when used to reduce the vector space model for the multidimensional representation of text documents. However, the implementation of various text preprocessing methods specific to the Serbian language for the purpose of calculating text similarity can lead to great differences in the results.
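
    The similarity experiment can be sketched as cosine similarity over a TF-IDF vector space, computed with and without a preprocessing step. The toy documents are English and the crude truncation "stemmer" is only a stand-in for the Serbian-specific preprocessing the paper evaluates.

```python
# Cosine similarity over TF-IDF vectors, with and without preprocessing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["The judges judged the judgement.",
        "A judge will judge judgements fairly."]

def crude_stem(text):
    """Stand-in for language-specific preprocessing: truncate words to 5 chars."""
    return " ".join(w[:5] for w in text.lower().split())

for name, prep in [("raw", None), ("preprocessed", crude_stem)]:
    vec = TfidfVectorizer(preprocessor=prep)
    tfidf = vec.fit_transform(docs)
    sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    # Conflating inflected forms shrinks the vector space and raises similarity.
    print(f"{name}: cosine similarity = {sim:.2f}")
```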

    Minimal Suffix and Rotation of a Substring in Optimal Time

    For a text given in advance, the substring minimal suffix queries ask to determine the lexicographically minimal non-empty suffix of a substring specified by the location of its occurrence in the text. We develop a data structure answering such queries optimally: in constant time after linear-time preprocessing. This improves upon the results of Babenko et al. (CPM 2014), whose trade-off solution is characterized by a Θ(n log n) product of these time complexities. Next, we extend our queries to support concatenations of O(1) substrings, for which the construction and query time is preserved. We apply these generalized queries to compute lexicographically minimal and maximal rotations of a given substring in constant time after linear-time preprocessing. Our data structures mainly rely on properties of Lyndon words and Lyndon factorizations. We combine them with further algorithmic and combinatorial tools, such as fusion trees and the notion of order isomorphism of strings.
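
    The constant-time substring query structure itself is involved, but the Lyndon-word machinery it builds on can be illustrated on whole strings. Below are Duval's linear-time Lyndon factorization and the classic linear-time minimal-rotation computation derived from it; this is not the paper's data structure, only the underlying tools.

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: factor s into a non-increasing sequence of
    Lyndon words in O(n) time."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def least_rotation_start(s: str) -> int:
    """Index at which the lexicographically minimal rotation of s begins,
    found by running Duval's factorization over s + s in O(n) time."""
    ss, n = s + s, len(s)
    i = ans = 0
    while i < n:
        ans = i
        j, k = i + 1, i
        while j < len(ss) and ss[k] <= ss[j]:
            k = i if ss[k] < ss[j] else k + 1
            j += 1
        while i <= k:
            i += j - k
    return ans

print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
r = least_rotation_start("bba")
print(("bba" * 2)[r:r + 3])            # 'abb'
```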