
    Optimized Naïve Bayesian Algorithm for Efficient Performance

    The naïve Bayesian algorithm is a data mining algorithm that depicts relationships between data objects using probabilistic methods. Classification with a Bayesian algorithm is usually done by finding the class that has the highest probability value. Data mining is a popular research area that consists of algorithm development and pattern extraction from databases using different algorithms. Classification is one of the major tasks of data mining, aimed at building a model (classifier) that can be used to predict unknown class labels. There are many algorithms for classification, such as decision tree classifiers, neural networks, rule induction and naïve Bayesian. This paper focuses on the naïve Bayesian algorithm, a classical algorithm for classifying categorical data, which easily converges to local optima. Particle Swarm Optimization (PSO) has gained recognition in many fields of human endeavour and has been applied to enhance efficiency and accuracy in different problem domains. This paper proposes an optimized naïve Bayesian classifier using particle swarm optimization to overcome the problem of premature convergence and to improve the efficiency of the naïve Bayesian algorithm. The classification results from the optimized naïve Bayesian classifier showed better performance when compared with the traditional algorithm. Keywords: Data Mining, Classification, Particle Swarm Optimization, Naïve Bayesian
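    The baseline being optimized can be made concrete with a short sketch. The categorical naïve Bayes classifier below (a minimal sketch; the function names and the add-alpha smoothing scheme are illustrative assumptions, not taken from the paper) is the kind of model whose parameters, e.g. the smoothing constant or per-feature weights, a PSO wrapper would then tune:

```python
import math
from collections import defaultdict

def train_nb(X, y, alpha=1.0):
    """Train a naive Bayes classifier on categorical features,
    using Laplace (add-alpha) smoothing."""
    classes = set(y)
    prior = {c: y.count(c) / len(y) for c in classes}
    # counts[class][feature_index][feature_value] -> frequency
    counts = {c: [defaultdict(int) for _ in X[0]] for c in classes}
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            counts[yi][j][v] += 1
    return prior, counts, alpha

def predict_nb(model, x):
    """Pick the class with the highest (log) posterior probability."""
    prior, counts, alpha = model
    best, best_lp = None, -math.inf
    for c in prior:
        lp = math.log(prior[c])
        nc = sum(counts[c][0].values())  # training examples in class c
        for j, v in enumerate(x):
            seen = len(counts[c][j])     # distinct values seen for feature j
            lp += math.log((counts[c][j].get(v, 0) + alpha) /
                           (nc + alpha * (seen + 1)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

    Working in log space avoids numeric underflow when many features are multiplied together.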

    Scalable Text Mining with Sparse Generative Models

    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models but ignored parallel developments. This framework allows the use of methods developed in different processing tasks, such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets is conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with an order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.
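    The core idea, inference over an inverted index so that scoring cost scales with the number of nonzero term counts rather than with the number of classes, can be sketched as follows. This is a minimal illustration under assumed multinomial naive-Bayes-style weighting; all names and the smoothing scheme are the sketch's own, not the thesis's:

```python
import math
from collections import defaultdict

def build_index(class_term_counts, alpha=1.0):
    """Build an inverted index mapping term -> [(class, log-weight)].
    Only nonzero counts are stored, so the index is sparse."""
    totals = {c: sum(tc.values()) for c, tc in class_term_counts.items()}
    vocab = {t for tc in class_term_counts.values() for t in tc}
    V = len(vocab)
    index = defaultdict(list)
    base = {}  # log-probability of an unseen term under each class
    for c, tc in class_term_counts.items():
        base[c] = math.log(alpha / (totals[c] + alpha * V))
        for t, n in tc.items():
            # correction over the unseen-term baseline, stored sparsely
            w = math.log((n + alpha) / (totals[c] + alpha * V)) - base[c]
            index[t].append((c, w))
    return index, base

def score(index, base, doc_terms):
    """Score a document: start from the smoothed baseline, then add
    sparse corrections only for classes that actually contain a term."""
    scores = {c: len(doc_terms) * b for c, b in base.items()}
    for t in doc_terms:
        for c, w in index.get(t, ()):
            scores[c] += w
    return max(scores, key=scores.get)
```

    The inner loop touches only the postings of the document's terms, which is what gives the search-engine-like scalability the abstract describes.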

    LC: an effective classification based association rule mining algorithm

    Classification using association rules is a research field in data mining that primarily uses association rule discovery techniques in classification benchmarks. It has been confirmed by many research studies in the literature that classification using association rules tends to generate more predictive classification systems than traditional classification data mining techniques such as probabilistic, statistical and decision tree methods. In this thesis, we introduce a novel data mining algorithm based on classification using association rules called “Looking at the Class” (LC), which can be used for mining a range of classification data sets. Unlike known algorithms in the classification-using-association approach, such as the Classification based on Association rule (CBA) system and the Classification based on Predictive Association (CPAR) system, which merge disjoint items in the rule learning step without anticipating class label similarity, the proposed algorithm merges only items with identical class labels. This avoids many unnecessary item combinations during the rule learning step and consequently results in large savings in computational time and memory. Furthermore, the LC algorithm uses a novel prediction procedure that employs multiple rules, rather than a single rule, to make the prediction decision. The proposed algorithm has been evaluated thoroughly on real-world security data sets collected using an automated tool developed at Huddersfield University. The security application considered in this thesis is categorizing websites as legitimate or fake based on their features, which is a typical binary classification problem. Also, experiments on a number of UCI data sets have been conducted; the measures used for evaluation are classification accuracy, memory usage, and others.
The results show that the LC algorithm outperformed traditional classification algorithms such as C4.5, PART and Naïve Bayes, as well as known classification based association algorithms like CBA, with respect to classification accuracy, memory usage, and execution time on most of the data sets considered.
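    The pruning idea, merging items only when they are associated with the same class label, can be illustrated with one pass of rule learning. This is a simplified sketch of the described principle, not the LC algorithm itself; the function names, data and support threshold are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def learn_rules(rows, labels, min_support=2):
    """One pass of LC-style rule learning (simplified sketch):
    count <item, class> pairs, then merge item pairs only when both
    items already point to the same class, skipping disjoint-class
    merges that CBA/CPAR-style learners would generate."""
    item_class = defaultdict(int)
    for row, c in zip(rows, labels):
        for item in row:
            item_class[(item, c)] += 1
    singles = {k: v for k, v in item_class.items() if v >= min_support}
    merged = defaultdict(int)
    for row, c in zip(rows, labels):
        for a, b in combinations(sorted(row), 2):
            # merge only items already associated with this same class c
            if (a, c) in singles and (b, c) in singles:
                merged[((a, b), c)] += 1
    pairs = {k: v for k, v in merged.items() if v >= min_support}
    return singles, pairs
```

    Because candidate pairs with mismatched class labels are never counted, the candidate space shrinks, which is the source of the time and memory savings claimed above.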

    A probabilistic framework for classification and fusion of remotely sensed hyperspectral data

    Reliable and accurate material identification is a crucial component underlying higher-level autonomous tasks within the context of autonomous mining. Such tasks can include exploration, reconnaissance and guidance of machines (e.g. autonomous diggers and haul trucks) to mine sites. This thesis focuses on the problem of classification of materials (rocks and minerals) using high spatial and high spectral resolution (hyperspectral) imagery, collected remotely from mine faces in operational open pit mines. A new method is developed for the classification of hyperspectral data, including field spectra and imagery, using a probabilistic framework and Gaussian Process regression. The developed method uses, for the first time, the Observation Angle Dependent (OAD) covariance function to classify high-dimensional sets of data. The performance of the proposed method is assessed and compared to standard methods used for the classification of hyperspectral data. This is done using a staged experimental framework. First, the proposed method is tested using high-resolution field spectrometer data acquired in the laboratory and in the field. Second, the method is extended to work on hyperspectral imagery acquired in the laboratory and its performance is evaluated. Finally, the method is evaluated on imagery acquired from a mine face under natural illumination, and the use of independent spectral libraries to classify imagery is explored. A probabilistic framework was selected because it best enables the integration of internal and external information from a variety of sensors. To demonstrate the advantages of the proposed GP-OAD method over existing, deterministic methods, a new framework is proposed to fuse hyperspectral images using the classified probabilistic outputs from several different images acquired of the same mine face.
This method maximises the amount of information but reduces the amount of data by condensing all available information into a single map. Thus, the proposed fusion framework removes the need to manually select a single classification among many individual classifications of a mine face as the `best' one, and it increases classification performance by combining more information. The methods proposed in this thesis are steps towards an automated mine face inspection system that can be used within the existing autonomous mining framework to improve productivity and efficiency. Last but not least, the proposed methods will also contribute to increased mine safety.
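    The overall pipeline, one-vs-all Gaussian Process regression with an angle-based covariance over spectra, can be sketched compactly. Note the hedge: the thesis's actual OAD covariance function is defined there; the sketch substitutes a simple angular kernel k = 1 − θ/π (a standard positive-definite kernel) as a stand-in, and all names and data are illustrative:

```python
import math

def angle_kernel(x, z):
    """Angle-based covariance (a stand-in for the OAD kernel, assumed
    form): k = 1 - theta/pi, theta the angle between the two spectra."""
    dot = sum(a * b for a, b in zip(x, z))
    nx = math.sqrt(sum(a * a for a in x))
    nz = math.sqrt(sum(b * b for b in z))
    cos = max(-1.0, min(1.0, dot / (nx * nz)))
    return 1.0 - math.acos(cos) / math.pi

def solve(A, b):
    """Naive Gaussian elimination with partial pivoting (small demos)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c]
                              for c in range(i + 1, n))) / M[i][i]
    return x

def gp_classify(X, y_classes, x_star, noise=0.1):
    """One-vs-all GP regression: the class with the largest predictive
    mean at x_star wins; the means double as probabilistic outputs
    that a fusion stage could combine across images."""
    n = len(X)
    K = [[angle_kernel(X[i], X[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    ks = [angle_kernel(x, x_star) for x in X]
    best, best_mean = None, -math.inf
    for c in set(y_classes):
        t = [1.0 if yc == c else -1.0 for yc in y_classes]
        alpha = solve(K, t)           # alpha = (K + noise*I)^{-1} t
        mean = sum(a * k for a, k in zip(alpha, ks))
        if mean > best_mean:
            best, best_mean = c, mean
    return best
```

    Using an angle between spectra rather than a Euclidean distance makes the comparison insensitive to overall brightness, which is one motivation for angle-based kernels on spectral data.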

    Cost-Sensitive Learning for Recurrence Prediction of Breast Cancer

    Breast cancer is one of the leading causes of cancer death and specifically accounts for 10.4% of all cancer incidences among women. The prediction of breast cancer recurrence has been a challenging research problem for many researchers. Data mining techniques have recently received considerable attention, especially when used for the construction of prognosis models from survival data. However, existing data mining techniques may not be effective in handling censored data. Censored instances are often discarded when applying classification techniques to prognosis. In this paper, we propose a cost-sensitive learning approach that involves the censored data in prognostic assessment for better recurrence prediction capability. The proposed approach employs an outcome inference mechanism to infer the possible probabilistic outcome of each censored instance, and adopts cost-proportionate rejection sampling and a committee machine strategy to take these instances with probabilistic outcomes into account during the classification model learning process. We empirically evaluate the effectiveness of our proposed approach for breast cancer recurrence prediction, and include a censored-data-discarding method (i.e., building the recurrence prediction model using only uncensored data) and the Kaplan-Meier method (a common prognosis method) as performance benchmarks. Overall, our evaluation results suggest that the proposed approach outperforms its benchmark techniques, measured by precision, recall and F1 score.
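    Cost-proportionate rejection sampling combined with a committee can be sketched in a few lines. This is a generic illustration of the technique (each example is kept with probability cost / max-cost, and several resampled "members" vote), not the paper's implementation; here each member is a trivial majority-label stub where the paper would train a full classifier:

```python
import random

def rejection_sample(examples, costs, seed=0):
    """Cost-proportionate rejection sampling: keep each example with
    probability cost / max(costs), so a plain learner trained on the
    accepted sample implicitly minimizes expected cost."""
    rng = random.Random(seed)
    z = max(costs)
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]

def committee(examples, costs, n_members=11):
    """Committee machine: draw several independent samples, train one
    member per sample (a majority-label stub in this sketch), and
    return the majority vote across members."""
    votes = []
    for m in range(n_members):
        sample = rejection_sample(examples, costs, seed=m)
        labels = [y for _, y in sample] or [y for _, y in examples]
        votes.append(max(set(labels), key=labels.count))
    return max(set(votes), key=votes.count)
```

    The committee averages out the randomness that any single rejection-sampled training set introduces.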

    Machine Learning (ML) Methods in Assessing the Intensity of Damage Caused by High-Energy Mining Tremors in Traditional Development of LGOM Mining Area

    The paper presents a comparative analysis of Machine Learning (ML) methods for assessing the risk of mining damage occurring in traditional masonry buildings located in the mining area of the Legnica-Głogów Copper District (LGOM) as a result of intense mining tremors. The database of reports on damage that occurred after the tremors of 20 February 2002, 16 May 2004 and 21 May 2006 formed the basis for the analysis. Based on these data, classification models were created using the Probabilistic Neural Network (PNN) and the Support Vector Machine (SVM) method. The results of previous research studies made it possible to include structural and geometric features of buildings, as well as protective measures against mining tremors, in the model. The probabilistic notation of the model makes it possible to effectively assess the probability of damage in the analysis of large groups of building structures located in the area of paraseismic impacts. The results of the conducted analyses confirm the thesis that the proposed methodology may make it possible to estimate, with appropriate probability, the financial outlays that the mining plant should secure for the repair of the expected damage to the traditional development of the LGOM mining area.
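    A Probabilistic Neural Network is essentially a Parzen-window density estimate per class, normalized into a posterior, which is what makes the damage assessment probabilistic rather than a hard label. A minimal sketch (Gaussian windows; the feature values, bandwidth and names are illustrative, not the paper's):

```python
import math

def pnn_classify(X, y, x_star, sigma=0.5):
    """Probabilistic Neural Network sketch: one Gaussian Parzen-window
    density per class, returned as a normalized posterior so each
    building gets a damage probability, not just a label."""
    dens = {}
    for c in set(y):
        pts = [xi for xi, yi in zip(X, y) if yi == c]
        s = sum(math.exp(-sum((a - b) ** 2 for a, b in zip(xi, x_star))
                         / (2 * sigma ** 2)) for xi in pts)
        dens[c] = s / len(pts)          # class-conditional density estimate
    total = sum(dens.values())
    return {c: d / total for c, d in dens.items()}
```

    Summing posteriors over a large building stock gives the expected damage count, which is the quantity behind the financial-outlay estimate mentioned above.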

    Interpretable multiclass classification by MDL-based rule lists

    Interpretable classifiers have recently witnessed an increase in attention from the data mining community because they are inherently easier to understand and explain than their more complex counterparts. Examples of interpretable classification models include decision trees, rule sets, and rule lists. Learning such models often involves optimizing hyperparameters, which typically requires substantial amounts of data and may result in relatively large models. In this paper, we consider the problem of learning compact yet accurate probabilistic rule lists for multiclass classification. Specifically, we propose a novel formalization based on probabilistic rule lists and the minimum description length (MDL) principle. This results in virtually parameter-free model selection that naturally allows trading off model complexity against goodness of fit, by which overfitting and the need for hyperparameter tuning are effectively avoided. Finally, we introduce the Classy algorithm, which greedily finds rule lists according to the proposed criterion. We empirically demonstrate that Classy selects small probabilistic rule lists that outperform state-of-the-art classifiers when it comes to the combination of predictive performance and interpretability. We show that Classy is insensitive to its only parameter, i.e., the candidate set, and that compression on the training set correlates with classification performance, validating our MDL-based selection criterion.
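    The greedy MDL-based selection can be sketched as follows: total description length = bits to encode the labels under the rule list plus a model cost per rule, and rules are appended only while that total shrinks. This is a simplified sketch of the principle, not the Classy algorithm; the fixed per-rule cost and the smoothing are stand-ins for the paper's actual encoding:

```python
import math

def class_dist(rows, labels, cond, classes):
    """Smoothed class distribution over the rows matched by cond."""
    hit = [y for row, y in zip(rows, labels) if cond <= row]
    return {c: (hit.count(c) + 0.5) / (len(hit) + 0.5 * len(classes))
            for c in classes}

def data_bits(rows, labels, rules, default):
    """Code length of the labels: each row is coded by its first
    matching rule's distribution, or by the default rule."""
    bits = 0.0
    for row, y in zip(rows, labels):
        probs = next((p for cond, p in rules if cond <= row), default)
        bits += -math.log2(probs[y])
    return bits

def greedy_mdl(rows, labels, candidates, rule_cost=1.0):
    """Greedily append the candidate rule that most shrinks the total
    description length; stop when no candidate compresses further."""
    classes = set(labels)
    default = class_dist(rows, labels, frozenset(), classes)
    rules = []
    best = data_bits(rows, labels, rules, default)
    while True:
        trials = [rules + [(c, class_dist(rows, labels, c, classes))]
                  for c in candidates]
        scored = [(data_bits(rows, labels, t, default) + rule_cost * len(t), t)
                  for t in trials]
        total, trial = min(scored, key=lambda s: s[0])
        if total >= best:
            return rules
        best, rules = total, trial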

    Node Classification in Uncertain Graphs

    In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. If the information about link reliability is not used explicitly, the classification accuracy in the underlying network may be affected adversely. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model and automatic parameter selection, and show that the incorporation of uncertainty in the classification process as a first-class citizen is beneficial. We experimentally evaluate the proposed approach using different real data sets, and study the behavior of the algorithms under different conditions. The results demonstrate the effectiveness and efficiency of our approach.
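    The paper's Bayes model and automatic parameter selection are more involved; as a minimal illustration of treating edge uncertainty as a first-class citizen, labeled neighbors can vote for a node's label with each vote weighted by the probability that the connecting edge actually exists (all names and data below are illustrative):

```python
def classify_uncertain(adj, labels, node):
    """Uncertainty-aware neighbor vote: each labeled neighbor's vote is
    weighted by its edge-existence probability, so unreliable links
    contribute proportionally less to the decision."""
    votes = {}
    for nbr, p_edge in adj.get(node, []):
        lab = labels.get(nbr)
        if lab is not None:
            votes[lab] = votes.get(lab, 0.0) + p_edge
    return max(votes, key=votes.get) if votes else None
```

    Ignoring the weights here (treating every edge as certain) would let two low-probability links outvote one high-probability link, which is exactly the failure mode the abstract warns about.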