1,098 research outputs found

    Reducing Spatial Data Complexity for Classification Models

    Intelligent data analytics is gradually becoming a day-to-day reality of today's businesses. However, despite rapidly increasing storage and computational power, current state-of-the-art predictive models still cannot handle massive and noisy corporate data warehouses. What is more, adaptive and real-time operational environments require multiple models to be frequently retrained, which further hinders their use. Various data reduction techniques, ranging from data sampling to density retention models, attempt to address this challenge by capturing a summarised data structure, yet they either do not account for labelled data or degrade the classification performance of the model trained on the condensed dataset. Our response is a new general framework for reducing the complexity of labelled data by means of a controlled spatial redistribution of class densities in the input space. Using the Parzen Labelled Data Compressor (PLDC) as an example, we demonstrate a simulated data condensation process directly inspired by electrostatic field interaction, in which the data are moved and merged following attracting and repelling interactions with the other labelled data. The process is controlled by the class density function built on the original data, which acts as a class-sensitive potential field ensuring preservation of the original class density distributions while still allowing data to rearrange and merge, joining together their soft class partitions. As a result we obtain a model that reduces labelled datasets much further than competitive approaches, yet with maximum retention of the original class densities and hence of the classification performance. PLDC leaves the reduced dataset with soft accumulative class weights, allowing for efficient online updates, and, as shown in a series of experiments, when coupled with the Parzen Density Classifier (PDC) it significantly outperforms competitive data condensation methods in terms of classification performance at comparable compression levels.
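
    The abstract does not give the PLDC update rules, so the sketch below only illustrates the ingredients it names: a Gaussian Parzen-window class density built on the original labelled data, an iterative step that moves each point uphill on its own class's density, and a merge step that accumulates soft class weights when points of the same class come close together. The function names, bandwidth h, step size and merge radius are assumptions for illustration, not the published algorithm.

        # Minimal sketch (not the published PLDC method): condense a labelled
        # dataset by moving points uphill on their own class's Parzen density,
        # built on the original data, and merging same-class points that meet.
        import numpy as np

        def parzen_density_grad(x, class_points, h=0.5):
            """Gaussian Parzen density at x and its (unnormalised) gradient."""
            diff = class_points - x                                  # (n, d)
            w = np.exp(-np.sum(diff**2, axis=1) / (2 * h**2))
            dens = w.sum() / (len(class_points) * (2 * np.pi * h**2) ** (x.size / 2))
            grad = (diff * w[:, None]).sum(axis=0) / (h**2)
            return dens, grad

        def condense(X, y, steps=20, step_size=0.05, merge_radius=0.3, h=0.5):
            X_orig, y_orig = X.copy(), y.copy()      # potential field uses the original data
            Xc, yc = X.copy().astype(float), y.copy()
            weights = np.ones(len(Xc))               # soft accumulative class weights
            for _ in range(steps):
                for i in range(len(Xc)):
                    same = X_orig[y_orig == yc[i]]   # the point's own class density
                    _, g = parzen_density_grad(Xc[i], same, h)
                    Xc[i] += step_size * g / (np.linalg.norm(g) + 1e-12)
                # merge same-class points that drifted close together
                keep = np.ones(len(Xc), dtype=bool)
                for i in range(len(Xc)):
                    if not keep[i]:
                        continue
                    close = keep & (yc == yc[i]) & (np.linalg.norm(Xc - Xc[i], axis=1) < merge_radius)
                    close[i] = False
                    if close.any():
                        weights[i] += weights[close].sum()
                        keep[close] = False
                Xc, yc, weights = Xc[keep], yc[keep], weights[keep]
            return Xc, yc, weights

        # Example: condense a small two-class 2-D dataset.
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
        y = np.array([0] * 100 + [1] * 100)
        Xc, yc, w = condense(X, y)
        print(len(Xc), "points retained, total weight", w.sum())

    The retained points carry the accumulated weights of the points they absorbed, which is the property a Parzen Density Classifier could exploit for online updates.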

    A survey of outlier detection methodologies

    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise from mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we present a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.

    Automatic Network Fingerprinting through Single-Node Motifs

    Complex networks have been characterised by their specific connectivity patterns (network motifs), but their building blocks can also be identified and described by node-motifs, a combination of local network features. One technique to identify single node-motifs has been presented by Costa et al. (L. D. F. Costa, F. A. Rodrigues, C. C. Hilgetag, and M. Kaiser, Europhys. Lett., 87, 1, 2009). Here, we first suggest improvements to the method, including how its parameters can be determined automatically. Such automatic routines make high-throughput studies of many networks feasible. Second, the new routines are validated on different network series. Third, we provide an example of how the method can be used to analyse network time series. In conclusion, we provide a robust method for systematically discovering and classifying characteristic nodes of a network. In contrast to classical motif analysis, our approach can identify individual components (here: nodes) that are specific to a network. Such special nodes, like hubs before them, might be found to play critical roles in real-world networks.
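
    The abstract describes node-motifs only as combinations of local network features and does not give the feature set or parameters of the Costa et al. method. The sketch below, using the networkx library, just illustrates the general idea of building a per-node feature vector from local measurements and flagging nodes whose vectors stand out; the chosen features, the z-score rule and the function names are assumptions for illustration.

        # Sketch of the general idea (not the published method): describe each
        # node by a vector of local features and flag the unusual ones.
        import numpy as np
        import networkx as nx

        def node_feature_matrix(G):
            """One row of local features per node: degree, clustering
            coefficient, average neighbour degree."""
            nodes = list(G.nodes())
            deg = dict(G.degree())
            clust = nx.clustering(G)
            nbr_deg = nx.average_neighbor_degree(G)
            return nodes, np.array([[deg[n], clust[n], nbr_deg[n]] for n in nodes])

        def characteristic_nodes(G, threshold=2.0):
            """Flag nodes whose feature vector lies far from the network mean
            (z-score above the threshold in any feature)."""
            nodes, F = node_feature_matrix(G)
            z = (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-12)
            return [n for n, row in zip(nodes, z) if np.abs(row).max() > threshold]

        # Example: in a scale-free graph the hubs are expected to be flagged.
        G = nx.barabasi_albert_graph(200, 2, seed=1)
        print(characteristic_nodes(G))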

    Machine learning applied to crime prediction

    Machine Learning is a cornerstone of artificial intelligence and big data analysis. It provides powerful algorithms that are capable of recognizing patterns, classifying data and, basically, learning by themselves to perform a specific task. This field has grown enormously in popularity in recent years; however, it still remains unknown to the majority of people, and even to most professionals. This project intends to provide an understandable explanation of what it is, what types exist and what it can be used for, as well as to solve a real data classification problem (namely San Francisco crime classification) using different algorithms, such as K-Nearest Neighbours, Parzen windows and Neural Networks, as an introduction to this field.
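
    The project's actual San Francisco crime pipeline is not reproduced here, so the following is only a generic sketch of one of the algorithms the abstract names, a Parzen-window (kernel density) classifier: fit one Gaussian kernel density estimate per class and predict the class whose prior-weighted density is highest at a query point. The use of scikit-learn, the bandwidth value and the toy data are assumptions for illustration.

        # Generic Parzen-window classifier sketch: one kernel density estimate
        # per class, prediction by the largest prior-weighted density.
        import numpy as np
        from sklearn.neighbors import KernelDensity

        class ParzenClassifier:
            def __init__(self, bandwidth=0.5):
                self.bandwidth = bandwidth

            def fit(self, X, y):
                self.classes_ = np.unique(y)
                self.kdes_ = {c: KernelDensity(bandwidth=self.bandwidth).fit(X[y == c])
                              for c in self.classes_}
                self.log_priors_ = {c: np.log(np.mean(y == c)) for c in self.classes_}
                return self

            def predict(self, X):
                # log p(x | c) + log P(c) for every class, pick the maximum
                scores = np.column_stack([self.kdes_[c].score_samples(X) + self.log_priors_[c]
                                          for c in self.classes_])
                return self.classes_[np.argmax(scores, axis=1)]

        # Tiny usage example with synthetic 2-D data.
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
        y = np.array([0] * 50 + [1] * 50)
        print(ParzenClassifier().fit(X, y).predict(np.array([[0.0, 0.0], [3.0, 3.0]])))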

    Efficient selection of discriminative genes from microarray gene expression data for cancer diagnosis

    A new mutual information (MI)-based feature-selection method to solve the so-called large-p, small-n problem encountered in microarray gene expression data is presented. First, a grid-based feature clustering algorithm is introduced to eliminate redundant features, so that a huge gene set is greatly reduced in a very efficient way. As a result, the computational efficiency of the whole feature-selection process is substantially enhanced. Second, MI is directly estimated using quadratic MI together with Parzen window density estimators. This approach is able to deliver reliable results even when only a small pattern set is available. Also, a new MI-based criterion is proposed to avoid highly redundant selection results in a systematic way. Finally, owing to the direct estimation of MI, appropriate feature subsets can be reasonably determined. © 2005 IEEE
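
    The paper's quadratic-MI formulation and its closed-form Parzen-window evaluation are not given in the abstract, so the sketch below only illustrates the underlying idea with ordinary Shannon MI: estimate the class-conditional and marginal densities of a single feature with Gaussian Parzen windows and integrate numerically. The bandwidth, grid and function names are assumptions for illustration, not the authors' estimator.

        # Sketch: mutual information between one continuous feature and a
        # discrete class label, with densities from Gaussian Parzen windows.
        # (The paper uses quadratic MI with a closed-form Parzen evaluation;
        # this version uses Shannon MI and a numerical integral.)
        import numpy as np

        def parzen_pdf(grid, samples, h):
            """Gaussian Parzen density estimate evaluated on a grid of points."""
            d = grid[:, None] - samples[None, :]
            return np.exp(-d**2 / (2 * h**2)).mean(axis=1) / (h * np.sqrt(2 * np.pi))

        def feature_class_mi(x, y, h=0.3, n_grid=512):
            """I(X; C) = sum_c P(c) * integral p(x|c) log(p(x|c) / p(x)) dx."""
            grid = np.linspace(x.min() - 3 * h, x.max() + 3 * h, n_grid)
            dx = grid[1] - grid[0]
            p_x = parzen_pdf(grid, x, h)
            mi = 0.0
            for c in np.unique(y):
                p_c = np.mean(y == c)
                p_x_c = parzen_pdf(grid, x[y == c], h)
                ratio = np.where((p_x_c > 0) & (p_x > 0), p_x_c / p_x, 1.0)
                mi += p_c * np.sum(p_x_c * np.log(ratio)) * dx
            return mi

        # A feature that separates the classes should score higher than noise.
        rng = np.random.default_rng(0)
        y = np.repeat([0, 1], 100)
        informative = np.where(y == 0, rng.normal(0, 1, 200), rng.normal(2, 1, 200))
        noise = rng.normal(0, 1, 200)
        print(feature_class_mi(informative, y), feature_class_mi(noise, y))

    Ranking features by such a score is the basic building block that the proposed redundancy-aware criterion then refines.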