
    The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining

    Current classification approaches usually do not try to achieve a balance between fitting and generalization when they infer models from training data. Such approaches ignore the possibility of different penalty costs for the false-positive, false-negative, and unclassifiable error types. Thus, their performance may not be optimal or may even be coincidental. This dissertation analyzes the above issues in depth. It also proposes two new approaches, called the Homogeneity-Based Algorithm (HBA) and the Convexity-Based Algorithm (CBA), to address these issues. These new approaches aim at optimally balancing the data fitting and generalization behaviors of models when some traditional classification approaches are used. The approaches first define the total misclassification cost (TC) as a weighted function of the three penalty costs and their corresponding error rates. The approaches then partition the training data into regions. In the HBA, the partitioning is done according to some homogeneous properties derivable from the training data. Meanwhile, the CBA employs some convex properties to derive regions. A traditional classification method is then used in conjunction with the HBA and CBA. Finally, the approaches apply a genetic approach to determine the optimal levels of fitting and generalization; the TC serves as the fitness function in this genetic approach. Real-life datasets from a wide spectrum of domains were used to better understand the effectiveness of the HBA and CBA. The computational results have indicated that both the HBA and CBA might potentially fill a critical gap in the implementation of current or future classification approaches. Furthermore, the results have also shown that when the penalty cost of an error type was changed, the corresponding error rate followed stepwise patterns. The finding of stepwise patterns of classification errors can assist researchers in determining applicable penalties for classification errors. Thus, the dissertation also proposes a binary search approach (BSA) to produce those patterns. Real-life datasets were utilized to demonstrate the BSA.
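    Below is a minimal sketch, in Python, of how a total misclassification cost of this kind can be computed and used as a fitness value. The specific penalty costs, the convention of marking unclassifiable samples with None, and the helper names are illustrative assumptions, not the dissertation's exact formulation.

```python
# Illustrative sketch: TC as a weighted sum of the false-positive,
# false-negative, and unclassifiable error rates. The penalty costs are
# hypothetical placeholders.

def total_misclassification_cost(fp_rate, fn_rate, uc_rate,
                                 c_fp=1.0, c_fn=5.0, c_uc=0.5):
    """Weighted total misclassification cost (TC)."""
    return c_fp * fp_rate + c_fn * fn_rate + c_uc * uc_rate


def error_rates(y_true, y_pred):
    """Estimate the three error rates; a prediction of None marks an
    unclassifiable sample (an assumed convention for this sketch)."""
    n = len(y_true)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)
    uc = sum(1 for p in y_pred if p is None)
    return fp / n, fn / n, uc / n


if __name__ == "__main__":
    y_true = [1, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 1, None, 0]  # one FN, one FP, one unclassifiable
    print("TC =", total_misclassification_cost(*error_rates(y_true, y_pred)))
```

    In the genetic search described above, a value of this kind would score each candidate fitting/generalization configuration, with a lower TC being fitter.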

    SCALABLE APPROXIMATION OF KERNEL FUZZY C-MEANS

    Virtually every sector of business and industry that uses computing, including financial analysis, search engines, and electronic commerce, incorporates Big Data analysis into its business model. Sophisticated clustering algorithms are popular for deducing the nature of data by assigning labels to unlabeled data. We address two main challenges in Big Data. First, by definition, the volume of Big Data is too large to be loaded into a computer's memory (this volume changes based on the computer used or available, but there is always a data set that is too large for any computer). Second, in real-time applications, the velocity of new incoming data prevents historical data from being stored and future data from being accessed. Therefore, we propose our Streaming Kernel Fuzzy c-Means (stKFCM) algorithm, which significantly reduces both computational complexity and space complexity. The proposed stKFCM requires only O(n²) memory, where n is the (predetermined) size of a data subset (or data chunk) at each time step, which makes this algorithm truly scalable (as n can be chosen based on the available memory). Furthermore, only 2n² elements of the full N × N (where N >> n) kernel matrix need to be calculated at each time step, reducing both the time spent computing kernel elements and the complexity of the FCM algorithm. Empirical results show that stKFCM, even with a relatively small n, can provide clustering performance as accurate as kernel fuzzy c-means run on the entire data set, while achieving a significant speedup.
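    The per-chunk computation can be illustrated with plain (non-streaming) kernel fuzzy c-means on a single data subset, which is the piece that stKFCM keeps within O(n²) memory. The sketch below uses an RBF kernel and illustrative parameter choices; it is not the streaming algorithm itself.

```python
# Illustrative sketch: kernel fuzzy c-means on one data chunk of size n,
# so the kernel matrix K is only n x n. Gamma, the fuzzifier m, and the
# iteration count are placeholder choices.
import numpy as np

def rbf_kernel(X, gamma=0.5):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_fcm(K, c=2, m=2.0, iters=50, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)            # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m                               # fuzzified memberships
        norm = W.sum(axis=1, keepdims=True)      # per-cluster normalizer
        # squared distance of each point to each cluster centre in kernel space
        d2 = (np.diag(K)[None, :]
              - 2 * (W @ K) / norm
              + ((W @ K) * W).sum(axis=1, keepdims=True) / norm ** 2)
        d2 = np.maximum(d2, 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)
    return U

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    chunk = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    U = kernel_fcm(rbf_kernel(chunk), c=2)
    print(U.argmax(axis=0))                      # hard labels from fuzzy memberships
```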

    The machine abnormal degree detection method based on SVDD and negative selection mechanism

    As is well known, fault samples are essential for fault diagnosis and anomaly detection, but in most cases they are difficult to obtain. The negative selection mechanism of the immune system, which can distinguish almost all nonself cells or molecules using only the self cells, offers an inspiration for solving the problem of anomaly detection using only normal samples. In this paper, we introduced Support Vector Data Description (SVDD) and the negative selection mechanism to separate the state space of machines into self, non-self, and fault space. To estimate the abnormal level of a machine, a function that calculates the abnormal degree was constructed, and how its sensitivity changes with the abnormal degree was also discussed. Finally, the Fisher Iris and ball bearing fault data sets were used to verify the effectiveness of this method.
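    A minimal sketch of the underlying idea follows: fit a one-class boundary around the normal ("self") samples and turn the distance to that boundary into an abnormal-degree score. scikit-learn's one-class SVM with an RBF kernel stands in for SVDD here (the two are closely related for this kernel), and the sigmoid-style degree mapping is an illustrative choice, not the authors' construction.

```python
# Illustrative sketch: one-class boundary learned from normal samples only,
# with a hypothetical mapping from boundary distance to an abnormal degree
# in [0, 1].
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))   # "self" (normal) samples only
faulty = rng.normal(4.0, 1.0, size=(20, 2))    # unseen abnormal samples

boundary = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(normal)

def abnormal_degree(X):
    # decision_function is positive inside the boundary; squash it so that
    # values near 1 indicate a strongly abnormal state (assumed mapping)
    s = boundary.decision_function(X)
    return 1.0 / (1.0 + np.exp(5.0 * s))

print("mean degree, normal samples:", abnormal_degree(normal).mean())
print("mean degree, faulty samples:", abnormal_degree(faulty).mean())
```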

    Kernel Extended Real-Valued Negative Selection Algorithm (KERNSA)

    Artificial Immune Systems (AISs) are a type of statistical Machine Learning (ML) algorithm, based on the Biological Immune System (BIS), applied to classification problems. Inspired by the increased performance of other ML algorithms when combined with kernel methods, this research explores using kernel methods as the distance measure for a specific AIS algorithm, the Real-valued Negative Selection Algorithm (RNSA). This research also demonstrates that the hard binary decision of the traditional RNSA can be relaxed to a continuous output while maintaining the ability to map back to the original RNSA decision boundary if necessary. The continuous output is used in this research to generate Receiver Operating Characteristic (ROC) curves and calculate the Area Under the Curve (AUC), but it can also serve as a basis for classification confidence or probability. The resulting Kernel Extended Real-valued Negative Selection Algorithm (KERNSA) offers performance improvements over a comparable RNSA implementation. Using the Sigmoid kernel in KERNSA seems particularly well suited, in terms of performance, to four out of the eighteen domains tested.
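    The kernel-induced distance at the heart of this extension can be written as d_K(x, y)² = K(x, x) - 2 K(x, y) + K(y, y). The toy detector set below uses it and returns a continuous anomaly score rather than a hard decision; the RBF kernel, detector count, and self-radius are illustrative assumptions, not the KERNSA settings.

```python
# Illustrative sketch: negative selection with a kernel-induced distance and
# a continuous output. All parameters below are placeholders.
import numpy as np

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_distance(x, y, kernel=rbf):
    return np.sqrt(max(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y), 0.0))

def generate_detectors(self_samples, n_detectors=100, self_radius=0.3, seed=0):
    """Keep random candidate detectors that fall outside the self region."""
    rng = np.random.default_rng(seed)
    detectors = []
    while len(detectors) < n_detectors:
        cand = rng.uniform(-3, 3, size=self_samples.shape[1])
        if all(kernel_distance(cand, s) > self_radius for s in self_samples):
            detectors.append(cand)
    return np.array(detectors)

def anomaly_score(x, detectors):
    """Continuous output: higher when x lies close to some detector,
    i.e. far from the self region."""
    return -min(kernel_distance(x, d) for d in detectors)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    self_samples = rng.normal(0, 0.5, size=(100, 2))
    detectors = generate_detectors(self_samples)
    print("self point score:   ", anomaly_score(np.array([0.0, 0.0]), detectors))
    print("nonself point score:", anomaly_score(np.array([2.5, 2.5]), detectors))
```

    Sweeping a threshold over such a continuous score is what produces the ROC curves and AUC values mentioned above.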

    Inter-query Learning in Content-based Image Retrieval

    Computer Science

    Retrieval of Leaf Area Index (LAI) and Soil Water Content (WC) Using Hyperspectral Remote Sensing under Controlled Glass House Conditions for Spring Barley and Sugar Beet

    Leaf area index (LAI) and water content (WC) in the root zone are two major hydro-meteorological parameters that exert a dominant control on water, energy, and carbon fluxes, and are therefore important for any regional eco-hydrological or climatological study. To investigate the potential for retrieving these parameters from hyperspectral remote sensing, we investigated plant spectral reflectance (400-2,500 nm, ASD FieldSpec3) for two major agricultural crops (sugar beet and spring barley) in the mid-latitudes, treated under different water and nitrogen (N) conditions in a greenhouse experiment over the growing period of 2008. Along with the spectral response, we measured soil water content and LAI during 15 intensive measurement campaigns spread over the growing season and could demonstrate a significant response of plant reflectance characteristics to variations in water content and nutrient conditions. Linear and non-linear dimensionality analysis suggests that the full-band reflectance information is well represented by a set of 28 vegetation spectral indices (SIs), and that most of the variance is explained by three to at most eight variables. Investigation of linear dependencies between LAI, soil WC, and the pre-selected SIs indicates that: (1) linear regression using a single SI is not sufficient to describe the plant/soil variables over the range of experimental conditions; however, some improvement can be seen when the crop species is known beforehand; (2) the improvement is greater when applying multiple linear regression with three explanatory SIs. In addition to the linear investigations, we applied the non-linear CART (Classification and Regression Trees) technique, which ultimately did not show potential for any improvement in the retrieval process.
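    The multiple-linear-regression step can be illustrated with ordinary least squares on three explanatory indices. The synthetic index values, coefficients, and sample count below are placeholders and do not reproduce the study's data or SI selection.

```python
# Illustrative sketch: predict LAI from three spectral indices (SIs) with
# ordinary least squares. All values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 15                                       # e.g. one row per measurement campaign
SI = rng.uniform(0.1, 0.9, size=(n, 3))      # three explanatory spectral indices
lai = 1.0 + SI @ np.array([2.0, -0.5, 1.5]) + rng.normal(0, 0.1, n)  # synthetic LAI

X = np.column_stack([np.ones(n), SI])        # add an intercept column
coef, *_ = np.linalg.lstsq(X, lai, rcond=None)
pred = X @ coef

r2 = 1 - np.sum((lai - pred) ** 2) / np.sum((lai - lai.mean()) ** 2)
print("coefficients:", coef)
print("R^2:", r2)
```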

    External Support Vector Machine Clustering

    The external-Support Vector Machine (SVM) clustering algorithm clusters data vectors with no a priori knowledge of each vector's class. The algorithm works by first running a binary SVM against a data set, with each vector in the set randomly labeled, until the SVM converges. It then relabels data points that are misclassified and lie a large distance from the SVM hyperplane. The SVM is then iteratively rerun, followed by more label swapping, until no more progress can be made. After this process, a high percentage of the previously unknown class labels of the data set will be known. With sub-cluster identification upon iterating the overall algorithm on the identified positive and negative clusters (until the clusters are no longer separable into sub-clusters), this method provides a way to cluster data sets without prior knowledge of the data's clustering characteristics or the number of clusters.
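    A minimal sketch of this iterate-and-relabel loop follows, using scikit-learn's SVC. The linear kernel, the margin threshold used to decide what counts as "a large distance from the hyperplane", and the iteration cap are illustrative choices rather than the paper's settings.

```python
# Illustrative sketch: randomly label the data, fit a binary SVM, flip the
# labels of points the SVM both misclassifies and places far from its
# hyperplane, and repeat until no labels change.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.7, size=(60, 2)),
               rng.normal(+2, 0.7, size=(60, 2))])   # two unlabeled clusters
labels = rng.integers(0, 2, size=len(X))             # random initial labels

for _ in range(50):
    if len(np.unique(labels)) < 2:                   # guard against degenerate labeling
        break
    svm = SVC(kernel="linear", C=1.0).fit(X, labels)
    margin = svm.decision_function(X)                # signed distance-like score
    wrong = svm.predict(X) != labels
    far = np.abs(margin) > 1.0                       # assumed "large distance" cutoff
    flip = wrong & far
    if not flip.any():
        break
    labels[flip] = 1 - labels[flip]

print("recovered cluster sizes:", np.bincount(labels))
```

    The intent is that the loop drifts toward labeling the two blobs consistently; sub-clusters would then be found by rerunning the same loop on each recovered cluster, as the abstract describes.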