1,875 research outputs found

    Reducing Spatial Data Complexity for Classification Models

    Get PDF
    Intelligent data analytics gradually becomes a day-to-day reality of today's businesses. However, despite rapidly increasing storage and computational power current state-of-the-art predictive models still can not handle massive and noisy corporate data warehouses. What is more adaptive and real-time operational environment requires multiple models to be frequently retrained which fiirther hinders their use. Various data reduction techniques ranging from data sampling up to density retention models attempt to address this challenge by capturing a summarised data structure, yet they either do not account for labelled data or degrade the classification performance of the model trained on the condensed dataset. Our response is a proposition of a new general framework for reducing the complexity of labelled data by means of controlled spatial redistribution of class densities in the input space. On the example of Parzen Labelled Data Compressor (PLDC) we demonstrate a simulatory data condensation process directly inspired by the electrostatic field interaction where the data are moved and merged following the attracting and repelling interactions with the other labelled data. The process is controlled by the class density function built on the original data that acts as a class-sensitive potential field ensuring preservation of the original class density distributions, yet allowing data to rearrange and merge joining together their soft class partitions. As a result we achieved a model that reduces the labelled datasets much further than any competitive approaches yet with the maximum retention of the original class densities and hence the classification performance. PLDC leaves the reduced dataset with the soft accumulative class weights allowing for efficient online updates and as shown in a series of experiments if coupled with Parzen Density Classifier (PDC) significantly outperforms competitive data condensation methods in terms of classification performance at the comparable compression levels

    On accuracy of PDF divergence estimators and their applicability to representative data sampling

    Get PDF
    Generalisation error estimation is an important issue in machine learning. Cross-validation traditionally used for this purpose requires building multiple models and repeating the whole procedure many times in order to produce reliable error estimates. It is however possible to accurately estimate the error using only a single model, if the training and test data are chosen appropriately. This paper investigates the possibility of using various probability density function divergence measures for the purpose of representative data sampling. As it turned out, the first difficulty one needs to deal with is estimation of the divergence itself. In contrast to other publications on this subject, the experimental results provided in this study show that in many cases it is not possible unless samples consisting of thousands of instances are used. Exhaustive experiments on the divergence guided representative data sampling have been performed using 26 publicly available benchmark datasets and 70 PDF divergence estimators, and their results have been analysed and discussed

    Building Gene Expression Profile Classifiers with a Simple and Efficient Rejection Option in R

    Get PDF
    Background: The collection of gene expression profiles from DNA microarrays and their analysis with pattern recognition algorithms is a powerful technology applied to several biological problems. Common pattern recognition systems classify samples assigning them to a set of known classes. However, in a clinical diagnostics setup, novel and unknown classes (new pathologies) may appear and one must be able to reject those samples that do not fit the trained model. The problem of implementing a rejection option in a multi-class classifier has not been widely addressed in the statistical literature. Gene expression profiles represent a critical case study since they suffer from the curse of dimensionality problem that negatively reflects on the reliability of both traditional rejection models and also more recent approaches such as one-class classifiers. Results: This paper presents a set of empirical decision rules that can be used to implement a rejection option in a set of multi-class classifiers widely used for the analysis of gene expression profiles. In particular, we focus on the classifiers implemented in the R Language and Environment for Statistical Computing (R for short in the remaining of this paper). The main contribution of the proposed rules is their simplicity, which enables an easy integration with available data analysis environments. Since in the definition of a rejection model tuning of the involved parameters is often a complex and delicate task, in this paper we exploit an evolutionary strategy to automate this process. This allows the final user to maximize the rejection accuracy with minimum manual intervention. Conclusions: This paper shows how the use of simple decision rules can be used to help the use of complex machine learning algorithms in real experimental setups. The proposed approach is almost completely automated and therefore a good candidate for being integrated in data analysis flows in labs where the machine learning expertise required to tune traditional classifiers might not be availabl

    Pixel classification for automated diabetic foot diagnosis

    Get PDF
    Worldwide, more than 180 million people suffer from diabetes mellitus. Approximately 50% of these patients will develop complications to their feet. Neuropathy, combined with poor blood supply and biomechanical changes results in a high risk for foot ulcers, which is a key problem in the diabetic foot; when these wounds become infected, this can ultimately result in lower extremity amputation, which has a serious effect on the quality of life of the patient, and causes a large economic burden on society.\ud \ud This was the motivation for a collaborate project (Vincent50) in which a photographic foot imaging device was developed. The system allows scanning of the foot soles on a daily basis which may lead to early recognition of foot problems. The goal of the present study is to determine whether pixel classification is a useful intermediate step towards automatically assessing the images of the foot soles for signs of diabetic foot disease. If successful, this approach will further relief health care professionals in assessing the foot and enable the placement of more devices in the future. \ud \ud The best agreement between automated recognition and expert diagnosis was achieved with a combination of RGB and derived features, proves that the RGB data is informative with respect to detection of ulcers. However, the automatic detection of pre-signs of ulcers and other anomalies needs more sophistication than pixel classification alone. Firstly, other physical features, such as hyperspectral data, infrared and/or textural features are expected to be more informative. Secondly, we expect to be able to boost the performance by using the context between neighboring pixels. Thirdly, an individualized and normalized classification process might help with the large variability in foot soles between individuals. \u

    A cDNA Microarray Gene Expression Data Classifier for Clinical Diagnostics Based on Graph Theory

    Get PDF
    Despite great advances in discovering cancer molecular profiles, the proper application of microarray technology to routine clinical diagnostics is still a challenge. Current practices in the classification of microarrays' data show two main limitations: the reliability of the training data sets used to build the classifiers, and the classifiers' performances, especially when the sample to be classified does not belong to any of the available classes. In this case, state-of-the-art algorithms usually produce a high rate of false positives that, in real diagnostic applications, are unacceptable. To address this problem, this paper presents a new cDNA microarray data classification algorithm based on graph theory and is able to overcome most of the limitations of known classification methodologies. The classifier works by analyzing gene expression data organized in an innovative data structure based on graphs, where vertices correspond to genes and edges to gene expression relationships. To demonstrate the novelty of the proposed approach, the authors present an experimental performance comparison between the proposed classifier and several state-of-the-art classification algorithm

    A survey of outlier detection methodologies

    Get PDF
    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can identify errors and remove their contaminating effect on the data set and as such to purify the data for processing. The original outlier detection methods were arbitrary but now, principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review
    • ā€¦
    corecore