88,944 research outputs found

    BUGOPTIMIZE: Bugs dataset Optimization with Majority Vote Cluster-Based Fine-Tuned Feature Selection for Scalable Handling

    Get PDF
    Software bugs are prevalent in the software development lifecycle, posing challenges to developers in ensuring product quality and reliability. Accurate prediction of bug counts can significantly aid in resource allocation and prioritization of bug-fixing efforts. However, the vast number of attributes in bug datasets often requires effective feature selection techniques to enhance prediction accuracy and scalability. Existing feature selection methods, though diverse, suffer from limitations such as suboptimal feature subsets and lack of scalability. This paper proposes BUGOPTIMIZE, a novel algorithm tailored to address these challenges. BUGOPTIMIZE innovatively integrates majority voting cluster-based fine-tuned feature selection to optimize bug datasets for scalable handling and accurate prediction. The algorithm initiates by clustering the dataset using K-means, EM, and Hierarchical clustering algorithms and performs majority voting to assign data points to final clusters. It then employs filter-based, wrapper-based, and embedded feature selection techniques within each cluster to identify common features. Additionally, feature selection is applied to the entire dataset to extract another set of common features. These selected features are combined to form the final best feature set. Experimental results demonstrate the efficacy of BUGOPTIMIZE compared to existing feature selection methods, reducing MAE and RMSE in Linear Regression (MAE: 0.2668 to 0.2609, RMSE: 0.3251 to 0.308) and Random Forest (MAE: 0.1626 to 0.1341, RMSE: 0.2363 to 0.224), highlighting its significant contribution to bug dataset optimization and prediction accuracy in software development while addressing feature selection limitations. By mitigating the disadvantages of current approaches and introducing a comprehensive and scalable solution, BUGOPTIMIZE presents a significant advancement in bug dataset optimization and prediction accuracy in software development environments

    Dynamic feature selection for clustering high dimensional data streams

    Get PDF
    open access articleChange in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, a lot of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric and this is problematic in high-dimensional data where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams; two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked

    Unsupervised Feature Selection with Adaptive Structure Learning

    Full text link
    The problem of feature selection has raised considerable interests in the past decade. Traditional unsupervised methods select the features which can faithfully preserve the intrinsic structures of data, where the intrinsic structures are estimated using all the input features of data. However, the estimated intrinsic structures are unreliable/inaccurate when the redundant and noisy features are not removed. Therefore, we face a dilemma here: one need the true structures of data to identify the informative features, and one need the informative features to accurately estimate the true structures of data. To address this, we propose a unified learning framework which performs structure learning and feature selection simultaneously. The structures are adaptively learned from the results of feature selection, and the informative features are reselected to preserve the refined structures of data. By leveraging the interactions between these two essential tasks, we are able to capture accurate structures and select more informative features. Experimental results on many benchmark data sets demonstrate that the proposed method outperforms many state of the art unsupervised feature selection methods

    A survey on utilization of data mining approaches for dermatological (skin) diseases prediction

    Get PDF
    Due to recent technology advances, large volumes of medical data is obtained. These data contain valuable information. Therefore data mining techniques can be used to extract useful patterns. This paper is intended to introduce data mining and its various techniques and a survey of the available literature on medical data mining. We emphasize mainly on the application of data mining on skin diseases. A categorization has been provided based on the different data mining techniques. The utility of the various data mining methodologies is highlighted. Generally association mining is suitable for extracting rules. It has been used especially in cancer diagnosis. Classification is a robust method in medical mining. In this paper, we have summarized the different uses of classification in dermatology. It is one of the most important methods for diagnosis of erythemato-squamous diseases. There are different methods like Neural Networks, Genetic Algorithms and fuzzy classifiaction in this topic. Clustering is a useful method in medical images mining. The purpose of clustering techniques is to find a structure for the given data by finding similarities between data according to data characteristics. Clustering has some applications in dermatology. Besides introducing different mining methods, we have investigated some challenges which exist in mining skin data
    corecore