
    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. Comment: Accepted for publication in ACM Computing Surveys
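The inductive process the survey describes can be sketched with a minimal multinomial Naive Bayes learner (a pure-Python illustration, not the survey's own code): category characteristics are estimated as word frequencies from preclassified documents, and a new document is assigned to the highest-scoring category.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Learn per-class word counts and document counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    class_counts = Counter()             # label -> number of documents
    for text, label in docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Pick the label maximizing log P(label) + sum of log P(word | label),
    using add-one (Laplace) smoothing over the shared vocabulary."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        counts = word_counts[label]
        total = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training on a handful of labeled documents and calling `classify` on unseen text is enough to see the inductive step: no rules are written by hand, the categories are characterized entirely by the training counts.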

    Probabilistic models for mining imbalanced relational data

    Most data mining and pattern recognition techniques are designed for learning from flat data files with the assumption of equal populations per class. However, most real-world data are stored as rich relational databases that generally have imbalanced class distributions. For such domains, a rich relational technique is required to accurately model the different objects and relationships in the domain, which cannot be easily represented as a set of simple attributes, and at the same time handle the imbalanced class problem. Motivated by the significance of mining imbalanced relational databases, which represent the majority of real-world data, learning techniques for mining imbalanced relational domains are investigated. In this thesis, the employment of probabilistic models in mining relational databases is explored; in particular, Probabilistic Relational Models (PRMs), which were proposed as an extension of attribute-based Bayesian networks. The effectiveness of PRMs in mining real-world databases was explored by learning PRMs from a real-world university relational database. A visual data mining tool is also proposed to aid the interpretation of the outcomes of the learned PRM models. Despite the effectiveness of PRMs in relational learning, their performance as predictive models is significantly hindered by the imbalanced class problem. This is because PRMs share the assumption, common to other learning techniques, of relatively balanced class distributions in the training data. Therefore, this thesis proposes a number of models utilizing the effectiveness of PRMs in relational learning and extending it for mining imbalanced relational domains. The first model introduced in this thesis examines the problem of mining imbalanced relational domains for a single two-class attribute. The model is proposed by enriching PRM learning with the ensemble learning technique.
The premise behind this model is that an ensemble of models attains better performance than a single model, as instances misclassified by one of the models can often be correctly classified by others. Based on this approach, another model is introduced to address the problem of mining multiple imbalanced attributes, in which it is important to predict several attributes rather than a single one. In this model, the ensemble bagging sampling approach is exploited to attain a single model for mining several attributes. Finally, the thesis outlines the problem of imbalanced multi-class classification and introduces a generalized framework to handle this problem for both relational and non-relational domains.
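The ensemble-with-bagging idea described above can be sketched as follows (a hypothetical illustration, not the thesis's PRM-based models): each base model is trained on a class-balanced bootstrap sample, so the minority class is not swamped, and the ensemble's predictions are combined by majority vote.

```python
import random
from collections import Counter

def balanced_bootstrap(data, rng):
    """Draw a bootstrap sample with equal counts per class: sample
    (with replacement) as many items from every class as the smallest
    class contains, countering the imbalanced class distribution."""
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append((x, y))
    n = min(len(items) for items in by_class.values())
    sample = []
    for items in by_class.values():
        sample.extend(rng.choices(items, k=n))
    return sample

def bagged_predict(x, models):
    """Majority vote over the ensemble members' predictions."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Each base learner (a PRM in the thesis, any callable here) sees a balanced view of the data, and the vote lets the ensemble recover from individual misclassifications.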

    Evaluation of the landslide susceptibility and its spatial difference in the whole Qinghai-Tibetan Plateau region by five learning algorithms

    Landslides are considered major natural hazards that cause enormous property damage and fatalities in the Qinghai-Tibetan Plateau (QTP). In this article, we evaluated the landslide susceptibility, and its spatial difference, in the whole Qinghai-Tibetan Plateau region using five state-of-the-art learning algorithms: deep neural network (DNN), logistic regression (LR), Naïve Bayes (NB), random forest (RF), and support vector machine (SVM), unlike previous studies, which covered only local areas of the QTP. A total of 671 landslide events were considered, and thirteen landslide conditioning factors (LCFs) were derived for database generation: annual rainfall, distance to drainage (Ds_d), distance to faults (Ds_f), drainage density (D_d), elevation (Elev), fault density (F_d), lithology, normalized difference vegetation index (NDVI), plan curvature (Pl_c), profile curvature (Pr_c), slope (S°), stream power index (SPI), and topographic wetness index (TWI). Multi-collinearity analysis and mean decrease Gini (MDG) were used to assess the suitability and predictability of these factors. Consequently, five landslide susceptibility prediction (LSP) maps were generated and validated using accuracy, area under the receiver operating characteristic curve, sensitivity, and specificity. The MDG results demonstrated that rainfall, elevation, and lithology were the most significant landslide conditioning factors ruling the occurrence of landslides in the Qinghai-Tibetan Plateau. The LSP maps depicted that the north-northwestern and south-southeastern regions (45% of the total area) were the most susceptible.
Moreover, among the five models, all with a high goodness-of-fit, the RF model was highlighted as the superior one, by which higher accuracy of landslide susceptibility assessment and better management of prone areas in the QTP can be achieved compared to previous results.
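The validation metrics named above (accuracy, sensitivity, specificity) reduce to simple confusion-matrix counts; a minimal sketch, with illustrative function and label names, not the paper's own evaluation code:

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true-positive rate) and specificity
    (true-negative rate) from paired lists of true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),        # all correct / all cases
        "sensitivity": tp / (tp + fn),             # landslides caught
        "specificity": tn / (tn + fp),             # non-landslides caught
    }
```

For susceptibility mapping, sensitivity and specificity matter separately because landslide and non-landslide cells are rarely balanced, so accuracy alone can mask a model that misses most landslides.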

    Higher-order semantic smoothing for text classification

    Poyraz, Mitat (Dogus Author)
    Text classification is the task of automatically sorting a set of documents into classes (or categories) from a predefined set. This task is of great practical importance given the massive volume of online text available through the World Wide Web, Internet news feeds, electronic mail and corporate databases. Existing statistical text classification algorithms can be trained to accurately classify documents, given a sufficient set of labeled training examples. However, in real-world applications, only a small amount of labeled data is available because expert labeling of large amounts of data is expensive. In this case, making an adequate estimation of the model parameters of a classifier is challenging. Underlying this issue is the traditional assumption in machine learning algorithms that instances are independent and identically distributed (IID). Semi-supervised learning (SSL) is the machine learning concept concerned with leveraging explicit as well as implicit link information within data to provide a richer data representation for model parameter estimation. It has been shown that Latent Semantic Indexing (LSI) takes advantage of implicit higher-order (or latent) structure in the association of terms and documents. Higher-order relations in LSI capture "latent semantics". Inspired by this, a novel Bayesian framework for classification named Higher Order Naive Bayes (HONB), which can explicitly make use of these higher-order relations, has been introduced previously. In this thesis, a novel semantic smoothing method named Higher Order Smoothing (HOS) for the Naive Bayes algorithm is presented. HOS is built on a graph-based data representation similar to that of HONB, which allows semantics in higher-order paths to be exploited.
Additionally, we take the concept one step further in HOS and exploit the relationships between instances of different classes in order to improve parameter estimation when dealing with insufficient labeled data. As a result, we have been able to move beyond not only instance boundaries but also class boundaries to exploit the latent information in higher-order paths. The results of experiments on several benchmark datasets demonstrate the value of HOS.
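As background for the smoothing methods the thesis builds on, Jelinek-Mercer smoothing (reviewed in the thesis) interpolates the class-conditional maximum-likelihood estimate with a corpus-wide estimate, so words unseen in a class keep nonzero probability; a minimal sketch with illustrative names, not the HOS method itself:

```python
def jelinek_mercer(word, class_counts, corpus_counts, lam=0.3):
    """P_lam(w | c) = (1 - lam) * P_ml(w | c) + lam * P_ml(w | corpus).
    class_counts and corpus_counts map words to raw frequencies;
    lam in [0, 1] controls how much mass the corpus model contributes."""
    p_class = class_counts.get(word, 0) / sum(class_counts.values())
    p_corpus = corpus_counts.get(word, 0) / sum(corpus_counts.values())
    return (1 - lam) * p_class + lam * p_corpus
```

HOS goes further by smoothing over higher-order paths in a graph of instances, but the same principle applies: borrow probability mass from a broader distribution when the labeled class data are too sparse.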

    Scalable Text Mining with Sparse Generative Models

    The information age has brought a deluge of data. Much of it is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets is conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with an order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.
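The inverted-index idea can be sketched as follows (an illustrative example, not the thesis implementation): per-class term log-weights are inverted into posting lists, so scoring a query visits only the classes that share at least one term with it, rather than all classes.

```python
from collections import defaultdict

def build_index(class_log_weights):
    """Invert per-class term log-weights into posting lists:
    term -> [(class, weight)]. Only nonzero (term, class) pairs are stored."""
    index = defaultdict(list)
    for cls, weights in class_log_weights.items():
        for term, w in weights.items():
            index[term].append((cls, w))
    return index

def sparse_score(query_terms, index, log_priors):
    """Accumulate scores by walking only the posting lists of the query's
    terms; cost follows posting-list lengths, not the number of classes."""
    scores = dict(log_priors)  # start every class from its log prior
    for term in query_terms:
        for cls, w in index.get(term, ()):
            scores[cls] += w
    return max(scores, key=scores.get)
```

With a million classes but short documents, almost all (term, class) weights are zero, so walking posting lists instead of iterating over classes is what gives a generative classifier search-engine-like scalability.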

    Context classification for service robots

    This dissertation presents a solution for environment sensing using sensor fusion techniques and a context/environment classification of the surroundings of a service robot, so that it can change its behavior according to the different reasoning outputs. For example, if a robot knows it is outdoors, in a field environment, the ground may be sandy, in which case it should slow down. Conversely, in indoor environments that situation (sandy ground) is statistically unlikely. This simple assumption denotes the importance of context awareness in automated guided vehicles.
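The behavior-switching idea can be sketched as a mapping from a classified context to a speed limit (the contexts and limits here are purely illustrative, not values from the dissertation):

```python
def speed_policy(context):
    """Map a classified (environment, ground) context to a maximum
    speed in m/s; unknown contexts fall back to a cautious default."""
    limits = {
        ("outdoor", "sandy"): 0.3,   # slow down on loose ground
        ("outdoor", "paved"): 1.0,
        ("indoor", "floor"): 0.8,
    }
    return limits.get(context, 0.5)
```

The classifier's output feeds this lookup each control cycle, so the robot's behavior tracks the inferred context rather than a fixed setting.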