
    Pengklasifikasian Dokumen Berbahasa Indonesia Dengan Pengindeksan Berbasis LSI

    Abstract: Classification of text documents aims to determine the category of a document based on its similarity to a set of previously labeled documents. However, most existing classification methods operate on keywords, or words considered important, under the assumption that each represents a unique concept. In fact, several words sharing the same meaning or semantics should be represented by a single unique word. In this research, an LSI (Latent Semantic Indexing)-based approach is used with KNN to classify Indonesian-language documents. Terms of the training and test documents are weighted with tf-idf and represented in the term-document matrices A and B, respectively. Matrix A is then decomposed using SVD to obtain the matrices U and V, truncated to rank k. Both U and V are used to reduce B as the representation of the test documents. The best classification quality was obtained by LSI-based KNN without stemming at threshold 2, while the best running time was achieved by LSI-based KNN with stemming at threshold 5. LSI-based KNN significantly outperforms plain KNN in both classification quality and running time. Keywords: KNN, LSI, k-rank, SVD, document classification.
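    As a concrete illustration of the pipeline this abstract describes, here is a minimal Python sketch of LSI-based KNN classification. The toy Indonesian corpus, the rank k = 2, and the single-neighbour setting are assumptions for illustration, not the paper's configuration; folding test documents in with U_k places them in the same latent space as the rows of V_k S_k.

    ```python
    # Minimal sketch of LSI-based KNN document classification.
    # Toy corpus, k = 2, and n_neighbors = 1 are illustrative assumptions.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    train_docs = ["ekonomi pasar saham naik", "sepak bola liga juara",
                  "bank kredit ekonomi bunga"]
    train_labels = ["ekonomi", "olahraga", "ekonomi"]
    test_docs = ["juara liga sepak bola"]

    # tf-idf term-document matrices A (training) and B (test)
    vectorizer = TfidfVectorizer()
    A = vectorizer.fit_transform(train_docs).T.toarray()  # terms x train docs
    B = vectorizer.transform(test_docs).T.toarray()       # terms x test docs

    # SVD of A; Vt corresponds to V^T in the abstract's notation
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    U_k = U[:, :k]

    # A^T U_k equals V_k S_k, so both train and test docs land in one space
    train_lsi = A.T @ U_k
    test_lsi = B.T @ U_k

    knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
    knn.fit(train_lsi, train_labels)
    print(knn.predict(test_lsi))  # expected: ['olahraga']
    ```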

    An information theoretic approach to sentiment polarity classification

    Sentiment classification is the task of classifying documents according to their overall sentiment inclination. It is important and popular in many web applications, such as credibility analysis of news sites on the Web, recommendation systems, and mining online discussion. The vector space model is widely applied to model documents in supervised sentiment classification, where the feature presentation (including feature type and weight function) is crucial for classification accuracy. The traditional feature presentation methods of text categorization do not perform well in sentiment classification, because the expression of sentiment is more subtle. We analyze the relationships of terms with sentiment labels based on information theory, and propose a method that applies an information theoretic approach to the sentiment classification of documents. We first adopt mutual information to quantify the sentiment polarities of terms in a document. The terms are then weighted in the vector space based on both their sentiment scores and their contribution to the document. We perform extensive experiments with SVM on sets of multiple product reviews, and the experimental results show our approach is more effective than the traditional ones.
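    A sketch of the general idea, scoring terms by mutual information with the sentiment label and re-weighting the vector-space representation before an SVM, might look as follows; the toy reviews and the exact weighting (multiplying counts by MI scores) are assumptions rather than the paper's precise formulas.

    ```python
    # Sketch: weight terms by their mutual information with the polarity
    # label, then train an SVM on the re-weighted vectors. Toy data and
    # the count-times-MI weighting are illustrative assumptions.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.svm import LinearSVC

    reviews = ["great battery excellent screen", "terrible battery awful support",
               "excellent support", "awful screen"]
    labels = np.array([1, 0, 1, 0])  # 1 = positive, 0 = negative

    vec = CountVectorizer()
    X = vec.fit_transform(reviews)

    # Mutual information of each term with the sentiment label
    mi = mutual_info_classif(X, labels, discrete_features=True, random_state=0)

    # Scale each term column by its sentiment informativeness
    X_weighted = X.multiply(mi.reshape(1, -1)).tocsr()
    clf = LinearSVC().fit(X_weighted, labels)

    test = vec.transform(["excellent battery"]).multiply(mi.reshape(1, -1)).tocsr()
    print(clf.predict(test))
    ```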

    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. (Comment: accepted for publication in ACM Computing Surveys.)
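    The three problems the survey names (document representation, classifier construction, and classifier evaluation) map naturally onto a minimal supervised pipeline; the following sketch, with a toy corpus as its only assumption, shows one conventional instantiation rather than anything specific to the survey.

    ```python
    # Document representation, classifier construction, and evaluation
    # in one minimal pipeline. Corpus and labels are toy assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    docs = ["stocks fell sharply", "the team won the match",
            "markets rallied today", "a late goal sealed the game"]
    labels = ["finance", "sport", "finance", "sport"]

    # representation (tf-idf) + construction (inductive learner)
    pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())

    # evaluation via cross-validation
    scores = cross_val_score(pipe, docs, labels, cv=2)
    print(scores.mean())
    ```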

    Expression and Reception: An Analytic Method for Assessing Message Production and Consumption in CMC

    This article presents an innovative methodology to study computer-mediated communication (CMC), which allows analysis of the multi-layered effects of online expression and reception. The methodology is demonstrated by combining the following three data sets collected from a widely tested eHealth system, the Comprehensive Health Enhancement Support System (CHESS): (1) a flexible and precise computer-aided content analysis; (2) a record of individual message posting and reading; and (3) longitudinal survey data. Further, this article discusses how the resulting data can be applied to online social network analysis and demonstrates how to construct two distinct types of online social networks, open and targeted communication networks, for different types of content embedded in social networks.
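    As a rough illustration of how posting and reading records can be turned into the two network types mentioned, here is a hypothetical sketch using networkx; the record fields and the open-versus-targeted rule are illustrative assumptions, not the CHESS data model.

    ```python
    # Hypothetical sketch: build directed communication networks from
    # message posting/reading records. Field names and the open-vs-
    # targeted distinction are illustrative assumptions.
    import networkx as nx

    messages = [
        {"id": 1, "poster": "ann", "readers": ["bob", "cam"], "target": None},
        {"id": 2, "poster": "bob", "readers": ["ann"], "target": "ann"},
    ]

    open_net, targeted_net = nx.DiGraph(), nx.DiGraph()
    for m in messages:
        for r in m["readers"]:
            open_net.add_edge(m["poster"], r)  # expression reached a reader
        if m["target"] is not None:
            targeted_net.add_edge(m["poster"], m["target"])  # named recipient

    print(open_net.edges(), targeted_net.edges())
    ```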

    Dimensionality Reduction via Matrix Factorization for Predictive Modeling from Large, Sparse Behavioral Data

    Matrix factorization is a popular technique for engineering features for use in predictive models; it is viewed as a key part of the predictive analytics process and is used in many different domain areas. The purpose of this paper is to investigate matrix-factorization-based dimensionality reduction as a design artifact in predictive analytics. With the rise in availability of large amounts of sparse behavioral data, this investigation comes at a time when traditional techniques must be reevaluated. Our contribution is based on two lines of inquiry: we survey the literature on dimensionality reduction in predictive analytics, and we undertake an experimental evaluation comparing using dimensionality reduction versus not using dimensionality reduction for predictive modeling from large, sparse behavioral data. Our survey of the dimensionality reduction literature reveals that, despite mixed empirical evidence as to the benefit of computing dimensionality reduction, it is frequently applied in predictive modeling research and application without either comparing to a model built using the full feature set or utilizing state-of-the-art predictive modeling techniques for complexity control. This presents a concern, as the survey reveals complexity control as one of the main reasons for employing dimensionality reduction. This lack of comparison is troubling in light of our empirical results. We experimentally evaluate the efficacy of dimensionality reduction in the context of a collection of predictive modeling problems from a large-scale published study. We find that utilizing dimensionality reduction improves predictive performance only under certain, rather narrow, conditions. Specifically, under default regularization (complexity control) settings dimensionality reduction helps for the more difficult predictive problems (where the predictive performance of a model built using the original feature set is relatively lower), but it actually decreases the performance on the easier problems. More surprisingly, employing state-of-the-art methods for selecting regularization parameters actually eliminates any advantage that dimensionality reduction has! Since the value of building accurate predictive models for business analytics applications has been well-established, the resulting guidelines for the application of dimensionality reduction should lead to better research and managerial decisions.
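    The paper's core experimental comparison (the same regularized learner on the full sparse feature set versus on SVD-reduced features, with regularization tuned by cross-validation) can be sketched as follows; the synthetic data, the choice of logistic regression, and k = 50 components are assumptions for illustration.

    ```python
    # Sketch of the comparison: regularized model on full sparse features
    # vs. on SVD-reduced features, regularization tuned by CV. Synthetic
    # data and k = 50 are illustrative assumptions.
    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = sparse_random(500, 2000, density=0.01, random_state=rng, format="csr")
    y = rng.randint(0, 2, 500)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # full sparse feature set, CV-tuned regularization
    full = LogisticRegressionCV(Cs=5, max_iter=1000).fit(X_tr, y_tr)

    # SVD-reduced features, same learner and tuning
    svd = TruncatedSVD(n_components=50, random_state=0).fit(X_tr)
    reduced = LogisticRegressionCV(Cs=5, max_iter=1000).fit(
        svd.transform(X_tr), y_tr)

    print("full:", full.score(X_te, y_te),
          "reduced:", reduced.score(svd.transform(X_te), y_te))
    ```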

    Multilingual Detection of Hate Speech Against Women and Immigrants in Twitter

    Due to the massive rise of users in social media, the presence of verbal abuse, hate speech, and bullying attitudes has increased over the years. Especially on Twitter, users find a way to anonymously harass and offend other individuals or collectives, and not enough work is done to stop them. This project describes the implementation of our system for detecting hate speech against women and immigrants, with the aim of helping reduce hatred in social networks in the future, and our participation in the SemEval-2019 Task 5 challenge. SemEval-2019 Task 5 consists of detecting hate speech against women and immigrants on Twitter, both in English and Spanish. This work proposes a strong baseline for hate speech detection by means of traditional machine learning techniques. Our system is mainly based on the combined use of n-grams, sentiment analysis, and word embeddings. In the challenge, given the text of a tweet, one of the tasks consists of identifying hate speech against women and immigrants. Our system obtained the second highest accuracy on Task A in Spanish among a total of 40 participants, surpassing more complex systems based on neural networks.
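    In the spirit of the described baseline, a sketch combining word n-grams with a crude lexicon-based sentiment feature in front of a linear SVM might look like this; the toy tweets, labels, and mini-lexicon are assumptions, and the word-embedding features are omitted for brevity.

    ```python
    # Toy baseline: word n-grams plus a crude lexicon sentiment count,
    # stacked and fed to a linear SVM. Tweets, labels, and the lexicon
    # are assumptions; embedding features are omitted.
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    tweets = ["go back home", "welcome to our city",
              "nobody wants you here", "glad you are here"]
    labels = [1, 0, 1, 0]  # 1 = hateful (toy annotation)

    ngrams = TfidfVectorizer(ngram_range=(1, 2))
    X_ngrams = ngrams.fit_transform(tweets)

    NEG = {"back", "nobody"}  # hypothetical mini-lexicon
    sent = csr_matrix([[float(sum(w in NEG for w in t.split()))]
                       for t in tweets])

    X = hstack([X_ngrams, sent]).tocsr()
    clf = LinearSVC().fit(X, labels)
    print(clf.score(X, labels))
    ```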

    Automated analysis of Learner's Research Article writing and feedback generation through Machine Learning and Natural Language Processing

    Teaching academic writing in English to native and non-native speakers is a challenging task. Quite a variety of computer-aided instruction tools have arisen in the form of Automated Writing Evaluation (AWE) systems to help students in this regard. This thesis describes my contribution towards the implementation of the Research Writing Tutor (RWT), an AWE tool that aids students with academic research writing by analyzing a learner's text at the discourse level. It offers tailored feedback after analysis based on discipline-aware corpora. At the core of RWT lie two different computational models built using machine learning algorithms to identify the rhetorical structure of a text. RWT extends previous research on a similar AWE tool, the Intelligent Academic Discourse Evaluator (IADE) (Cotos, 2010), designed to analyze articles at the move level of discourse. As a result of the present research, RWT analyzes further at the level of discourse steps, which are the granular communicative functions that constitute a particular move. Based on features extracted from a corpus of expert-annotated research article introductions, the learning algorithm classifies each sentence of a document with a particular rhetorical move and a step. Currently, RWT analyzes the introduction section of a research article, but this work generalizes to handle the other sections of an article, including Methods, Results, and Discussion/Conclusion. This research describes RWT's unique software architecture for analyzing academic writing. This architecture consists of a database schema, a specific choice of classification features, our computational model training procedure, our approach to testing for performance evaluation, and finally the method of applying the models to a learner's writing sample. Experiments were done on the annotated corpus data to study the relation among the features and the rhetorical structure within the documents. Finally, I report the performance measures of our 23 computational models and their capability to identify rhetorical structure in user-submitted writing. The final move classifier was trained using a total of 5828 unigrams and 11630 trigrams and performed at a maximum accuracy of 72.65%. Similarly, the step classifier was trained using a total of 27689 unigrams and 27160 trigrams and performed at a maximum accuracy of 72.01%. The revised architecture presented also led to increased speed of both training (a 9x speedup) and real-time performance (a 2x speedup). These performance rates are sufficient for satisfactory usage of RWT in the classroom. The overall goal of RWT is to empower students to write better by helping them consider writing as a series of rhetorical strategies to convey a functional meaning. This research will enable RWT to be deployed broadly into a wider spectrum of classrooms.
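    A minimal sketch of a move classifier of the kind described, using unigram and trigram features over sentences, could look as follows; the example sentences and move labels are toy assumptions, not the annotated corpus.

    ```python
    # Sketch: classify each sentence with a rhetorical move using
    # unigram + trigram features. Sentences and move labels are toy
    # assumptions standing in for the expert-annotated corpus.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import FeatureUnion, make_pipeline

    sentences = [
        "Prior studies have examined this problem extensively.",
        "However, little attention has been paid to this aspect.",
        "In this paper we propose a new method.",
        "Previous work focused on related settings.",
    ]
    moves = ["establish_territory", "establish_niche", "occupy_niche",
             "establish_territory"]

    # unigrams and trigrams only, mirroring the feature sets reported above
    features = FeatureUnion([
        ("unigrams", CountVectorizer(ngram_range=(1, 1))),
        ("trigrams", CountVectorizer(ngram_range=(3, 3))),
    ])

    clf = make_pipeline(features, LogisticRegression(max_iter=1000))
    clf.fit(sentences, moves)
    print(clf.predict(["We introduce a novel classifier."]))
    ```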

    Filtrado de spam mediante ajuste lineal por cuadrados mínimos

    A growing problem in email communication is the practice of using this medium to send unsolicited mass advertising messages, better known as "spam". Several solutions have been proposed to attack this problem, such as the use of machine learning techniques. In this thesis, we analyze a classification and filtering method based on linear least squares fit (LLSF) (YAN/94) for the task of spam filtering. We analyze several variants of and improvements to the basic algorithm, including a new feature selection formula, new alternatives for representing the messages, and a mathematical method for determining the threshold. Finally, we compare the results with those obtained in previous work, which used the Naïve Bayes algorithm (AND/00b).
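    A minimal sketch of LLSF-style filtering, assuming a toy mailbox and a 0.5 decision threshold: fit the least-squares linear map from term vectors to a spam indicator, then threshold the score for new messages.

    ```python
    # Sketch of LLSF-style spam filtering: least-squares fit from term
    # vectors to a spam indicator, thresholded at 0.5. Toy messages and
    # the threshold value are assumptions.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    mails = ["win money now", "meeting at noon", "free money win",
             "lunch at noon"]
    is_spam = np.array([1.0, 0.0, 1.0, 0.0])

    vec = CountVectorizer()
    X = vec.fit_transform(mails).toarray()

    # weights w minimizing ||X w - y||^2
    w, *_ = np.linalg.lstsq(X, is_spam, rcond=None)

    test = vec.transform(["free money"]).toarray()
    print("spam" if (test @ w)[0] > 0.5 else "ham")
    ```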