455 research outputs found

    Machine learning applied to crime prediction

    Machine learning is a cornerstone of artificial intelligence and big data analysis. It provides powerful algorithms that are capable of recognizing patterns, classifying data and, in essence, learning by themselves to perform a specific task. The field has grown enormously in popularity in recent years, yet it remains unfamiliar to the majority of people, and even to most professionals. This project intends to provide an understandable explanation of what it is, what types there are and what it can be used for, and to solve a real data classification problem (namely, classifying San Francisco crimes) using different algorithms, such as K-Nearest Neighbours, Parzen windows and Neural Networks, as an introduction to this field
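
As an illustration of one of the classifiers named in this abstract, the following is a minimal sketch of a Parzen window classifier with a Gaussian kernel. It is not the project's implementation: the function names, the bandwidth `h`, the toy coordinates and the class labels are all assumptions made for the example, and the data is not the San Francisco data set.

```python
import math
from collections import defaultdict


def gaussian_kernel(x, center, h):
    """Isotropic Gaussian kernel of bandwidth h, centered on `center`, evaluated at x."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    norm = (2.0 * math.pi * h * h) ** (d / 2.0)
    return math.exp(-sq_dist / (2.0 * h * h)) / norm


def parzen_classify(x, samples, labels, h=1.0):
    """Return the label whose prior-weighted Parzen density estimate at x is largest."""
    by_class = defaultdict(list)
    for point, label in zip(samples, labels):
        by_class[label].append(point)

    n_total = len(samples)
    scores = {}
    for label, points in by_class.items():
        # Class-conditional density: average of kernels centered on the class samples.
        density = sum(gaussian_kernel(x, p, h) for p in points) / len(points)
        prior = len(points) / n_total
        scores[label] = prior * density
    return max(scores, key=scores.get)


# Toy 2-D points with hypothetical labels (illustrative only).
samples = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
labels = ["theft", "theft", "theft", "assault", "assault", "assault"]

print(parzen_classify((0.1, 0.1), samples, labels, h=0.5))
```

The bandwidth `h` plays a role similar to `k` in K-Nearest Neighbours: small values follow the training points closely, large values smooth the decision boundary.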

    A MACHINE LEARNING APPROACH TO QUERY TIME-SERIES MICROARRAY DATA SETS FOR FUNCTIONALLY RELATED GENES USING HIDDEN MARKOV MODELS

    Microarray technology captures the rate of expression of genes under varying experimental conditions. Genes encode the information necessary to build proteins; proteins used by cellular functions exhibit higher rates of expression for the associated genes. If multiple proteins are required for a particular function then their genes show a pattern of coexpression during time periods when the function is active within a cell. Cellular functions are generally complex and require groups of genes to cooperate; these groups of genes are called functional modules. Modular organization of genetic functions has been evident since 1999. Detecting functionally related genes in a genome and detecting all genes belonging to particular functional modules are current research topics in this field. The number of microarray gene expression datasets available in public repositories increases rapidly, and advances in technology have now made it feasible to routinely perform whole-genome studies where the behavior of every gene in a genome is captured. This promises a wealth of biological and medical information, but making the amount of data accessible to researchers requires intelligent and efficient computational algorithms. Researchers working on specific cellular functions would benefit from this data if it was possible to quickly extract information useful to their area of research. This dissertation develops a machine learning algorithm that allows one or multiple microarray data sets to be queried with a set of known and functionally related input genes in order to detect additional genes participating in the same or closely related functions. The focus is on time-series microarray datasets where gene expression values are obtained from the same experiment over a period of time from a series of sequential measurements. A feature selection algorithm selects relevant time steps where the provided input genes exhibit correlated expression behavior. 
Time steps are the columns in microarray data sets; rows list individual genes. A specific linear Hidden Markov Model (HMM) is then constructed to contain one hidden state for each of the selected experiments and is trained using the expression values of the input genes from the microarray. Given the trained HMM, the probability that a sequence of gene expression values was generated by that particular HMM can be calculated. This allows for the assignment of a probability score for each gene in the microarray. High-scoring genes are included in the result set of genes with functional similarities to the input genes. P-values can be calculated by repeating this algorithm to train multiple individual HMMs using randomly selected genes as input genes and calculating a Parzen Density Function (PDF) from the probability scores of all HMMs for each gene. A feedback loop uses the result generated from one algorithm run as the input set for another iteration of the algorithm. This iterated HMM algorithm allows for the characterization of functional modules from very small input sets and for weak similarity signals. This algorithm also allows for the integration of multiple microarray data sets; two approaches are studied: Meta-Analysis (combination of the results from individual data set runs) and the extension of the linear HMM across multiple individual data sets. Results indicate that Meta-Analysis works best for integration of closely related microarrays and a spanning HMM works best for the integration of multiple heterogeneous datasets. The performance of this approach is demonstrated relative to the published literature on a number of widely used synthetic data sets. Biological application is verified by analyzing biological data sets of the fruit fly D. melanogaster and baker's yeast S. cerevisiae. 
The algorithm developed in this dissertation is better able to detect functionally related genes in common data sets than currently available algorithms in the published literature
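
The scoring step described in this abstract can be sketched under strong simplifying assumptions. The sketch below assumes a strict left-to-right chain with one state per selected time step, deterministic transitions and Gaussian emissions, so the forward probability reduces to a sum of per-step log densities. This is a simplification for illustration, not the dissertation's model; the function names, gene names and expression values are hypothetical.

```python
import math


def fit_linear_chain(input_profiles, min_std=1e-3):
    """Fit one Gaussian (mean, std) per time step from the input genes' profiles."""
    n_steps = len(input_profiles[0])
    params = []
    for t in range(n_steps):
        values = [profile[t] for profile in input_profiles]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        # Floor the standard deviation so a zero-variance step stays usable.
        params.append((mean, max(math.sqrt(var), min_std)))
    return params


def log_likelihood(profile, params):
    """Log-probability of a profile under the chain (sum of Gaussian log densities)."""
    total = 0.0
    for x, (mean, std) in zip(profile, params):
        total += -0.5 * math.log(2.0 * math.pi * std * std)
        total -= (x - mean) ** 2 / (2.0 * std * std)
    return total


def rank_genes(candidates, params):
    """Return candidate gene names sorted from highest to lowest score."""
    scores = {name: log_likelihood(p, params) for name, p in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical expression profiles over three selected time steps.
input_profiles = [[1.0, 2.0, 3.0], [1.2, 2.1, 2.8], [0.9, 1.8, 3.1]]
candidates = {"geneA": [1.1, 2.0, 3.0], "geneB": [3.0, 1.0, 0.5]}

params = fit_linear_chain(input_profiles)
print(rank_genes(candidates, params))
```

Genes whose profiles follow the input genes at the selected time steps score highest; a threshold on the score, or the P-value procedure the abstract describes, would then decide which genes enter the result set.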

    Novel techniques of computational intelligence for analysis of astronomical structures

    Gravitational forces cause the formation and evolution of a variety of cosmological structures. The detailed investigation and study of these structures is a crucial step towards our understanding of the universe. This thesis provides several solutions for the detection and classification of such structures. In the first part of the thesis, we focus on astronomical simulations, and we propose two algorithms to extract stellar structures. Although they follow different strategies (while the first one is a downsampling method, the second one keeps all samples), both techniques help to build more effective probabilistic models. In the second part, we consider observational data, and the goal is to overcome some of the common challenges in observational data such as noisy features and imbalanced classes. For instance, when not enough examples are present in the training set, two different strategies are used: a) nearest neighbor technique and b) outlier detection technique. In summary, both parts of the thesis show the effectiveness of automated algorithms in extracting valuable information from astronomical databases

    Recognition of pen-based music notation with finite-state machines

    This work presents a statistical model to recognize pen-based music compositions using stroke recognition algorithms and finite-state machines. The series of strokes received as input is mapped onto a stochastic representation, which is combined with a formal language that describes musical symbols in terms of stroke primitives. Then, a Probabilistic Finite-State Automaton is obtained, which defines probabilities over the set of musical sequences. This model is eventually crossed with a semantic language to avoid sequences that do not make musical sense. Finally, a decoding strategy is applied in order to output a hypothesis about the musical sequence actually written. Comprehensive experiments with several decoding algorithms, stroke similarity measures and probability density estimators are carried out and evaluated with different metrics of interest. The results demonstrate the merit of the proposed model, which obtains competitive performance in all metrics and scenarios considered. This work was supported by the Spanish Ministerio de Educación, Cultura y Deporte through a FPU Fellowship (Ref. AP2012–0939) and the Spanish Ministerio de Economía y Competitividad through the TIMuL Project (No. TIN2013-48152-C2-1-R, supported by UE FEDER funds)

    A Survey of Adaptive Resonance Theory Neural Network Models for Engineering Applications

    This survey samples from the ever-growing family of adaptive resonance theory (ART) neural network models used to perform the three primary machine learning modalities, namely, unsupervised, supervised and reinforcement learning. It comprises a representative list from classic to modern ART models, thereby painting a general picture of the architectures developed by researchers over the past 30 years. The learning dynamics of these ART models are briefly described, and their distinctive characteristics such as code representation, long-term memory and corresponding geometric interpretation are discussed. Useful engineering properties of ART (speed, configurability, explainability, parallelization and hardware implementation) are examined along with current challenges. Finally, a compilation of online software libraries is provided. It is expected that this overview will be helpful to new and seasoned ART researchers

    Prognostic and health management for engineering systems: a review of the data-driven approach and algorithms

    Prognostics and health management (PHM) has become an important component of many engineering systems and products, where algorithms are used to detect anomalies, diagnose faults and predict remaining useful lifetime (RUL). PHM can provide many advantages to users and maintainers. Although the primary goals are to ensure safety, report the state of health and estimate the RUL of components and systems, there are also financial benefits such as reduced operational and maintenance costs and extended lifetime. This study reviews the current status of the algorithms and methods that underpin existing PHM approaches. The focus is on providing a structured and comprehensive classification of the existing state-of-the-art PHM approaches, data-driven approaches and algorithms

    EDMON - Electronic Disease Surveillance and Monitoring Network: A Personalized Health Model-based Digital Infectious Disease Detection Mechanism using Self-Recorded Data from People with Type 1 Diabetes

    Through time, we as a society have been tested with infectious disease outbreaks of different magnitude, which often pose major public health challenges. To mitigate the challenges, research endeavors have been focused on early detection mechanisms through identifying potential data sources, modes of data collection and transmission, and case and outbreak detection methods. Driven by the ubiquitous nature of smartphones and wearables, the current endeavor is targeted towards individualizing the surveillance effort through a personalized health model, where the case detection is realized by exploiting self-collected physiological data from wearables and smartphones. This dissertation aims to demonstrate the concept of a personalized health model as a case detector for outbreak detection by utilizing self-recorded data from people with type 1 diabetes. The results have shown that infection onset triggers substantial deviations, i.e. prolonged hyperglycemia despite higher insulin injections and lower carbohydrate consumption. Per the findings, key parameters such as blood glucose level, insulin, carbohydrate, and insulin-to-carbohydrate ratio are found to carry high discriminative power. A personalized health model devised based on a one-class classifier and unsupervised method using selected parameters achieved promising detection performance. Experimental results show the superior performance of the one-class classifier; models such as the one-class support vector machine, k-nearest neighbor and k-means achieved better performance. Further, the results also revealed the effect of input parameters, data granularity, and sample sizes on model performance. The presented results have practical significance for understanding the effect of infection episodes amongst people with type 1 diabetes, and the potential of a personalized health model in outbreak detection settings. 
The added benefit of the personalized health model concept introduced in this dissertation lies in its usefulness beyond the surveillance purpose, i.e. to devise decision support tools and learning platforms for the patient to manage infection-induced crises

    Discriminative learning with application to interactive facial image retrieval

    The number of digital images is growing drastically, and advanced tools for searching in large image collections are therefore becoming urgently needed. Content-based image retrieval is advantageous for such a task in terms of automatic feature extraction and indexing without human labor and subjectivity in image annotations. The semantic gap between high-level semantics and low-level visual features can be reduced by the relevance feedback technique. However, most existing interactive content-based image retrieval (ICBIR) systems require a substantial amount of human evaluation labor, which leads to the evaluation fatigue problem that heavily restricts the application of ICBIR. In this thesis a solution based on discriminative learning is presented. It extends an existing ICBIR system, PicSOM, towards practical applications. The enhanced ICBIR system allows users to input partial relevance, which includes not only relevance extent but also relevance reason. A multi-phase retrieval with partial relevance can adapt to the user's searching intention in a coarse-to-fine manner. The retrieval performance can be improved by employing supervised learning as a preprocessing step before unsupervised content-based indexing. In this work, Parzen Discriminant Analysis (PDA) is proposed to extract discriminative components from images. PDA regularizes the Informative Discriminant Analysis (IDA) objective with a greatly accelerated optimization algorithm. Moreover, discriminative Self-Organizing Maps trained with the resulting features can easily handle fuzzy categorizations. The proposed techniques have been applied to interactive facial image retrieval. Both a query example and a benchmark simulation study are presented, which indicate that the first image depicting the target subject can be retrieved in a small number of rounds

    Information Theory Filters for Wavelet Packet Coefficient Selection with Application to Corrosion Type Identification from Acoustic Emission Signals

    The damage caused by corrosion in chemical process installations can lead to unexpected plant shutdowns and the leakage of potentially toxic chemicals into the environment. When subjected to corrosion, structural changes in the material occur, leading to energy releases as acoustic waves. This acoustic activity can in turn be used for corrosion monitoring, and even for predicting the type of corrosion. Here we apply wavelet packet decomposition to extract features from acoustic emission signals. We then use the extracted wavelet packet coefficients for distinguishing between the most important types of corrosion processes in the chemical process industry: uniform corrosion, pitting and stress corrosion cracking. The local discriminant basis selection algorithm can be considered as a standard for the selection of the most discriminative wavelet coefficients. However, it does not take the statistical dependencies between wavelet coefficients into account. We show that, when these dependencies are ignored, a lower accuracy is obtained in predicting the corrosion type. We compare several mutual information filters to take these dependencies into account in order to arrive at a more accurate prediction
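
The mutual information filters mentioned in this abstract build on a relevance term between a feature and the class label. The following is a minimal sketch of that univariate term only, assuming equal-width binning of each coefficient. It deliberately omits the dependency (redundancy) terms between coefficients that the abstract argues are needed, and it does not perform any wavelet packet decomposition. The function names, coefficient values and labels are hypothetical.

```python
import math
from collections import Counter


def discretize(values, n_bins=2):
    """Map continuous values to equal-width bin indices in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant features
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Mutual information in bits between two equally long discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical coefficient values for four signals and their corrosion labels.
labels = ["uniform", "uniform", "pitting", "pitting"]
informative = [0.1, 0.2, 0.9, 1.0]   # separates the two classes
uninformative = [0.5, 0.9, 0.5, 0.9]  # unrelated to the class

mi_informative = mutual_information(discretize(informative), labels)
mi_uninformative = mutual_information(discretize(uninformative), labels)
print(mi_informative, mi_uninformative)
```

A filter would rank coefficients by such a score and keep the top ones; filters that account for dependencies additionally penalize coefficients that carry information already provided by those selected earlier.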