97 research outputs found

    Temporal decision making using unsupervised learning

    With the explosion of ubiquitous continuous sensing, on-line streaming clustering continues to attract attention. The requirements are that the streaming clustering algorithm recognize and adapt clusters as the data evolve, that anomalies be detected, and that new clusters be formed automatically as incoming data dictate. In this dissertation, we develop a streaming clustering algorithm, MU Streaming Clustering (MUSC), that couples a Gaussian mixture model (GMM) with possibilistic clustering to build an adaptive system for analyzing streaming multi-dimensional activity feature vectors. To this end, the Possibilistic C-Means (PCM) and the Automatic Merging Possibilistic Clustering Method (AMPCM) are combined to cluster the initial data points, detect anomalies, and initialize the GMM. MUSC achieves our goals when tested on synthetic and real-life datasets. We also compare MUSC's performance with Sequential k-means (sk-means), the Basic Sequential Clustering Algorithm (BSAS), and Modified BSAS (MBSAS), where MUSC shows superior performance and accuracy. The performance of a streaming clustering algorithm needs to be monitored over time to understand the behavior of the streaming data in terms of newly emerging clusters and the number of outlier data points. Incremental cluster validity indices (iCVIs) are used to monitor the performance of an on-line clustering algorithm. We study the incremental Davies-Bouldin (DB), Xie-Beni (XB), and Dunn internal cluster validity indices in the context of streaming data analysis. We extend the original incremental DB (iDB) to a more general version parameterized by the exponent of the membership weights. Then we illustrate how the iDB can be used to analyze and understand the performance of the MUSC algorithm.
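    The possibilistic step assigns each point a typicality to each cluster rather than a probabilistic membership, which is what lets outliers be flagged: an anomaly is atypical of every cluster. A minimal sketch of the standard PCM typicality formula (the eta scale values and fuzzifier m below are illustrative assumptions, not the dissertation's tuned settings):

```python
import numpy as np

def pcm_typicality(X, centers, eta, m=2.0):
    """PCM typicality t_ij = 1 / (1 + (d_ij^2 / eta_j)^(1/(m-1))).

    Unlike fuzzy memberships, typicalities need not sum to 1 across clusters,
    so a genuine outlier receives a low typicality in *every* cluster.
    """
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[0.0, 0.0], [5.0, 5.0]])   # one on a center, one between clusters
t = pcm_typicality(points, centers, eta=np.array([4.0, 4.0]))
# the first point is fully typical of cluster 0; the second fits neither cluster well
```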
We give examples that illustrate the appearance of a new cluster, the effect of different cluster sizes, the handling of outlier data samples, and the effect of input order on the resulting cluster history. In addition, we investigate the incremental Davies-Bouldin (iDB) cluster validity index in the context of big streaming data analysis and analyze the effect of large numbers of samples on its values. We also develop online versions of two modified generalized Dunn's indices that can be used for dynamic evaluation of evolving cluster structure in streaming data. We argue that this is a good way to monitor the ongoing performance of online clustering algorithms, and we illustrate several types of inferences that can be drawn from such indices. We compare the two new indices to the incremental Xie-Beni and Davies-Bouldin indices, which to our knowledge offer the only comparable approach, with numerical examples on a variety of synthetic and real data sets. We also study the performance of MUSC and the iCVIs in big streaming data applications. We show the advantage of iCVIs in monitoring large streaming datasets and in providing useful information about the data stream in terms of the emergence of new structure, the amount of outlier data, the size of the clusters, and the order of data samples in each cluster. We also propose a way to project streaming data into a lower-dimensional space for cases where the distance measure does not perform as expected in the high-dimensional space. Another example of streaming data is the activity data coming from TigerPlace and other elderly residents' apartments in and around Columbia, MO. TigerPlace is an eldercare facility that promotes aging in place in Columbia, Missouri. Eldercare monitoring using non-wearable sensors is a candidate solution for improving care and reducing costs. Abnormal sensor patterns produced by certain resident behaviors could be linked to early signs of illness.
We propose an unsupervised method for detecting abnormal behavior patterns based on a new context-preserving representation of daily activities. A preliminary analysis of the method was conducted on data collected in TigerPlace. The sensor firings of each day are converted into sequences of daily activities. Then, by building a histogram from the daily sequences of a resident, we generate a single data vector representing that day. Using the proposed method, a day with hundreds of sequences is converted into a single data point that represents the day while preserving the context of the daily routine. We obtained an average Area Under the Curve (AUC) of 0.9 in detecting days on which elder adults need to be assessed, and our approach outperforms other approaches on the same dataset. Using the context-preserving representation, we developed a multi-dimensional alert system to improve the existing single-dimensional alert system in TigerPlace. This representation is also used to develop a framework that utilizes sensor sequence similarity and medical concepts extracted from the EHR to automatically inform the nursing staff when health problems are detected. Our context-preserving representation of daily activities is used to measure the similarity between the sensor sequences of different days. The medical concepts are extracted from the nursing notes using MetaMapLite, an NLP tool included in the Unified Medical Language System (UMLS). The proposed idea is validated on two pilot datasets from twelve TigerPlace residents, with a total of 5810 sensor days, of which 1966 had nursing notes.
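    The core of the representation is collapsing a day's many activity sequences into one fixed-length histogram vector. A minimal sketch of that step; the activity labels and the sequence vocabulary below are hypothetical stand-ins for the sequences the real system derives from sensor firings:

```python
from collections import Counter

def day_vector(day_sequences, vocabulary):
    """Collapse all activity sequences observed in one day into a single
    fixed-length histogram vector over a known vocabulary of sequences."""
    counts = Counter(day_sequences)
    return [counts.get(seq, 0) for seq in vocabulary]

# Hypothetical two-step activity sequences for one resident's day
vocab = [("bed", "bathroom"), ("bathroom", "kitchen"), ("kitchen", "livingroom")]
day = [("bed", "bathroom"), ("bed", "bathroom"), ("kitchen", "livingroom")]
vec = day_vector(day, vocab)  # -> [2, 0, 1]: one point per day, routine preserved
```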

    Speaker specific feature based clustering and its applications in language independent forensic speaker recognition

    Forensic speaker recognition (FSR) is the process of determining whether the source of a questioned voice recording (trace) is a specific individual (suspected speaker). The role of the forensic expert is to testify, if possible using a quantitative measure, to the value of the voice evidence; using this information as an aid in their judgments and decisions is up to the judge and/or the jury. Most existing methods measure inter-utterance similarities directly from spectrum-based characteristics, so the resulting clusters may be related not to speakers but to different acoustic classes. This research addresses this deficiency by projecting language-independent utterances into a reference space equipped to cover the standard voice features underlying the entire utterance set. The resulting projection vectors naturally represent the language-independent, voice-like relationships among all the utterances and are therefore more robust against non-speaker interference. A clustering approach based on peak approximation is then proposed to maximize the similarities between language-independent utterances within all clusters. This method uses the K-medoid, Fuzzy C-Means, Gustafson-Kessel, and Gath-Geva algorithms to determine the cluster to which each utterance should be allocated, overcoming the disadvantage of traditional hierarchical clustering, whose final outcome rarely reaches the optimum recognition efficiency. The recognition efficiencies of the K-medoid, Fuzzy C-Means, Gustafson-Kessel, and Gath-Geva clustering algorithms are 95.2%, 97.3%, 98.5%, and 99.7%, and the EERs are 3.62%, 2.91%, 2.82%, and 2.61%, respectively. The EER improvement of the Gath-Geva-based FSR system compared with Gustafson-Kessel and Fuzzy C-Means is 8.04% and 11.49%, respectively.
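    Of the four algorithms compared, Fuzzy C-Means is the common baseline. A minimal, generic sketch of its alternating update (random initialization, fuzzifier m = 2, Euclidean distance are illustrative defaults, not the settings used in the study):

```python
import numpy as np

def fcm(X, c, m=2.0, iters=50, seed=0):
    """Plain Fuzzy C-Means: alternate the membership and center updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                 # rows sum to 1
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted cluster means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)      # u_ij proportional to d_ij^(-2/(m-1))
    return centers, U
```

Gustafson-Kessel and Gath-Geva refine this scheme with cluster-specific covariance-adapted distances, which is consistent with their higher recognition efficiency reported above.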

    Factors influencing short-term parasitoid establishment and efficacy for the biological control of Halyomorpha halys with the samurai wasp Trissolcus japonicus

    Background: Classical biological control has been identified as the most promising approach to limit the impact of the invasive pest species Halyomorpha halys (Heteroptera: Pentatomidae). This study investigated the parasitism rate at sites where the biocontrol agent Trissolcus japonicus (Hymenoptera: Scelionidae) was released and where its unintentional introduction took place, in the Trentino-South Tyrol region. The effect of land-use composition was studied to understand which factors favor the establishment of hosts and parasitoids, including native and exotic species. Results: The released T. japonicus were detected a year after the start of the program, with significant parasitoid impact and discovery compared to control sites. Trissolcus japonicus was the most abundant H. halys parasitoid, and Trissolcus mitsukurii and Anastatus bifasciatus were also recorded. The efficacy of T. mitsukurii was lower at sites where T. japonicus was successfully established, suggesting a possible competitive interaction. The parasitism level by T. japonicus at the release sites was 12.5% in 2020 and 16.4% in 2021. The combined effect of predation and parasitization increased H. halys mortality up to 50% at the release sites. Landscape composition analysis showed that both H. halys and T. japonicus were more likely to be found at sites at lower altitude and with permanent crops, whereas other hosts and parasitoids preferred different conditions. Conclusion: Trissolcus japonicus showed a promising impact on H. halys at release and adventive sites, with minor non-target effects mediated by landscape heterogeneity. The prevalence of T. japonicus in landscapes with permanent crops could support IPM in the future. © 2023 The Authors. Pest Management Science published by John Wiley & Sons Ltd on behalf of the Society of Chemical Industry.

    Clustering And Control Of Streaming Synchrophasor Datasets

    With the large-scale deployment of phasor measurement units (PMUs) in the United States, a resonating topic has been how to extract useful “information” or “knowledge” from the voluminous data generated by supervisory control and data acquisition (SCADA) systems, PMUs, and advanced metering infrastructure (AMI). With sampling rates from 30 up to 120 samples per second, PMUs provide fine-grained monitoring of the power grid with time-synchronized measurements of voltage, current, frequency, and phase angle. Running the sensors continuously can produce nearly 2,592,000 samples per PMU every day. These large data volumes need to be processed efficiently to extract information for better decision making in a smart grid (SG) environment. My research presents a flexible software framework to process streaming data sets for smart-grid applications. The proposed Integrated Software Suite (ISS) is capable of mining the data using various clustering algorithms for better decision-making purposes. Decisions based on the proposed methods can help the electric grid's system operators reduce blackouts, instabilities, and oscillations in the smart grid. The research primarily focuses on integrating density-based clustering (DBSCAN) and variations of k-means clustering to capture specific types of anomalies or faults. A novel method, multi-tier k-means, was developed to cluster the PMU data; such a grouping scheme enables better decision making by system operators. Different fault conditions, such as voltage, current, phase-angle, or frequency deviations and generation and load trips, are investigated, and a comparative analysis of the application of the three methods is presented. A collection of forecasting techniques has also been applied to PMU datasets. The datasets considered are from the PJM Corporation and describe the energy demand for 13 states and the District of Columbia (DC).
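    The daily sample count quoted above follows directly from the lower end of the PMU reporting rate:

```python
# One PMU at the lower 30 samples/s reporting rate, running all day:
SAMPLES_PER_SECOND = 30
SECONDS_PER_DAY = 60 * 60 * 24          # 86,400 s
samples_per_day = SAMPLES_PER_SECOND * SECONDS_PER_DAY
assert samples_per_day == 2_592_000     # the "nearly 2,592,000 samples" figure above
```

At the upper 120 samples/s rate the same arithmetic gives four times as many samples, which is why efficient stream processing is needed.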
The applications and suitability of forecasting techniques on PMU data using random forests (RF), locally weighted scatterplot smoothing (LOWESS), and the seasonal auto-regressive integrated moving average (SARIMA) have been investigated. The approaches are tested against standardized error indices such as mean absolute percentage error (MAPE), mean squared error (MSE), root mean squared error (RMSE), and normal percentage error (PCE) to compare performance. It is observed that the proposed hybrid combination of RF and SARIMA can be used with good results in day-ahead forecasting of load dispatch.
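    Two of the error indices used in the comparison are standard and easy to state precisely. A small sketch with hypothetical demand values (the data shown are not PJM figures):

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(100.0 * np.mean(np.abs((actual - forecast) / actual)))

def rmse(actual, forecast):
    """Root mean squared error, in the units of the series."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

load = [100.0, 110.0, 120.0]   # hypothetical observed demand
pred = [90.0, 110.0, 130.0]    # hypothetical day-ahead forecast
# mape -> (10/100 + 0 + 10/120)/3 * 100, about 6.11%; rmse -> sqrt(200/3), about 8.17
```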

    Static and dynamic selection of ensemble of classifiers

    This thesis presents several novel solutions to three fundamental problems in the design of classifier ensembles: classifier generation, selection, and fusion. We propose: a new fusion function (Compound Diversity Function, CDF) that accounts for both the individual performance of the classifiers and the pairwise diversity between classifiers; a new fusion function based on pairwise confusion matrices (PFM), better suited to classifier fusion when the number of classes is large; a new method for generating ensembles of Hidden Markov Models (EoHMM) for handwritten character recognition; a novel solution based on the concept of Oracles associated with the validation data (KNORA); and a new approach for selecting representation subspaces based on a diversity measure evaluated between pairs of partitions.
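    The KNORA idea can be sketched in a few lines: for each query, only the classifiers that act as "oracles" on the query's validation-set neighbours are kept for the vote. This is a generic sketch of the KNORA-Eliminate variant, with classifiers modelled as plain callables (the fallback rule and k are illustrative assumptions):

```python
import numpy as np

def knora_eliminate(x, X_val, y_val, pool, k=5):
    """Keep only the classifiers that correctly classify *all* k validation
    neighbours of the query x; fall back to the whole pool if none survive."""
    nn = np.argsort(np.linalg.norm(X_val - x, axis=1))[:k]  # k nearest validation points
    selected = [clf for clf in pool
                if all(clf(X_val[i]) == y_val[i] for i in nn)]
    return selected or list(pool)
```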

    Global and local clustering soft assignment for intrusion detection system: a comparative study

    An Intrusion Detection System (IDS) plays an important role in defending computer networks against malicious objects. The ability of an IDS to detect new, sophisticated attacks, compared with traditional methods such as firewalls, is important for securing the network. Machine learning algorithms, both unsupervised and supervised, can solve the classification problem in IDS. To that end, the KDD Cup 99 dataset is used in the experiments. This dataset contains 5 million instances in 5 different categories: Normal, DOS, U2R, R2L, and Probe. With such a large dataset, the learning process consumes considerable processing time and resources. Clustering is an unsupervised learning method that organizes data by grouping similar features into the same group. In the literature, many researchers have used a global clustering approach, in which all inputs are combined and clustered to construct a single codebook. An alternative is the local clustering approach, in which the input is split into the 5 categories and each is clustered independently to construct 5 different codebooks. The main objective of this research is to compare the classification performance of the global and local clustering approaches. For this purpose, soft assignment is used for indexing the KDD input and an SVM for classification. In the soft assignment approach, the smallest distance values are used for attack description, and an RBF kernel is used in the SVM to classify attacks. The results show that the global clustering approach outperforms the local clustering approach for binary classification, achieving 83.0% on the KDD Cup 99 dataset, whereas the local clustering approach outperforms the global approach on the multi-class classification problem, achieving 60.6%.
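    The global/local contrast is purely about how the codebook-construction step is fed. A minimal sketch, using plain Lloyd k-means as a stand-in codebook constructor (the clustering algorithm, codebook sizes, and toy data are illustrative assumptions):

```python
import numpy as np

def build_codebook(X, c, iters=20, seed=0):
    """Plain Lloyd k-means as a generic codebook constructor."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), c, replace=False)].copy()
    for _ in range(iters):
        nearest = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        for j in range(c):
            if np.any(nearest == j):
                centers[j] = X[nearest == j].mean(axis=0)
    return centers

def global_codebook(X, c):
    """Global approach: cluster all categories together into one codebook."""
    return build_codebook(X, c)

def local_codebooks(X, y, c):
    """Local approach: one codebook per category (Normal, DOS, U2R, R2L, Probe
    in the KDD Cup 99 setting), each clustered independently."""
    return {label: build_codebook(X[y == label], c) for label in np.unique(y)}
```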

    Robust techniques and applications in fuzzy clustering

    This dissertation addresses issues central to fuzzy clustering. The sensitivity to noise and outliers of least-squares-minimization-based clustering techniques, such as Fuzzy c-Means (FCM) and its variants, is addressed. In this work, two novel, robust clustering schemes are presented and analyzed in detail; they approach the problem of robustness from different perspectives. The first scheme scales down the FCM memberships of data points based on the distance of the points from the cluster centers. The scaling applied to outliers reduces their membership in true clusters. This scheme, known as mega-clustering, defines a conceptual mega-cluster, a collective cluster of all data points that nevertheless views outliers and good points differently (as opposed to the concept of Dave's noise cluster). The scheme is presented and validated with experiments, and its similarities with Noise Clustering (NC) are also presented. The other scheme is based on the feasible solution algorithm that implements the Least Trimmed Squares (LTS) estimator. The LTS estimator is known to be resistant to noise and has a high breakdown point. The feasible solution approach also guarantees convergence of the solution set to a global optimum. Experiments show the practicability of the proposed schemes in terms of computational requirements and the attractiveness of their simple frameworks. The issue of validating clustering results has often received less attention than clustering itself. Fuzzy and non-fuzzy cluster validation schemes are reviewed, and a novel methodology for cluster validity using a test of the random position hypothesis is developed. The random position hypothesis is tested against an alternative clustered hypothesis on every cluster produced by the partitioning algorithm. The Hopkins statistic is used as a basis to accept or reject the random position hypothesis, which is also the null hypothesis in this case.
The Hopkins statistic is known to be a fair estimator of randomness in a data set. The concept is borrowed from the clustering tendency domain, and its applicability to validating clusters is shown here. A unique feature selection procedure for use with large, high-dimensional molecular conformational datasets is also developed. The intelligent feature extraction scheme not only reduces the dimensionality of the feature space but also eliminates contentious issues such as those associated with labeling symmetric atoms in the molecule. The feature vector is converted to a proximity matrix and used as input to the relational fuzzy clustering (FRC) algorithm, with very promising results. Results are also validated using several cluster validity measures from the literature. Another application of fuzzy clustering considered here is image segmentation. Image analysis on extremely noisy images is carried out as a precursor to the development of an automated, real-time condition-state monitoring system for underground pipelines. A two-stage FCM with intelligent feature selection is implemented as the segmentation procedure, and results on a test image are presented. A conceptual framework for automated condition-state assessment is also developed.
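    The Hopkins statistic compares nearest-neighbour distances from random probe points against those from the data itself. A minimal generic sketch (the sample-size choice m = n/10 and the bounding-box sampling window are common conventions, assumed here rather than taken from the dissertation):

```python
import numpy as np

def hopkins(X, m=None, seed=0):
    """Hopkins statistic: values near 0.5 suggest spatially random data (the
    null hypothesis); values near 1 suggest clustered structure."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # u: NN distances from m uniform probe points (in the bounding box) to X
    U = rng.uniform(lo, hi, size=(m, d))
    u = np.min(np.linalg.norm(U[:, None] - X[None], axis=2), axis=1)
    # w: NN distances from m sampled data points to the *other* data points
    idx = rng.choice(n, m, replace=False)
    w = np.array([np.min(np.linalg.norm(np.delete(X, i, axis=0) - X[i], axis=1))
                  for i in idx])
    return float(u.sum() / (u.sum() + w.sum()))
```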

    Recurrent neural network based approach for estimating the dynamic evolution of grinding process variables

    170 p. The grinding process is widely used for manufacturing precision components by chip removal, thanks to its good surface finishes and excellent tolerances. Modeling and control of the grinding process are therefore highly important for meeting customers' economic and precision requirements. However, the analytical models developed so far are far from being applicable in industry, which is why several studies have proposed intelligent techniques for modeling the grinding process. These proposals, however, (a) do not generalize to new grinding wheels and (b) do not take wheel wear into account, an effect that is essential for a good model of the grinding process. We therefore propose using recurrent neural networks to estimate grinding process variables in a way that (a) generalizes to new wheels and (b) accounts for wheel wear, i.e., that can estimate process variables as the wheel wears down. Building on this general methodology, virtual sensors have been developed for measuring wheel wear and workpiece roughness, two essential grinding process variables. The general methodology is also applied to estimate, off-machine, the specific grinding energy, which can help select the wheel and the grinding parameters in advance. However, a single network is not enough to cover all existing wheels and grinding conditions, so a methodology is also proposed for generating ad-hoc networks by selecting specific data from the whole database, using the Fuzzy c-Means algorithm. The results obtained improve on those reported to date.
However, these results are not yet good enough to control the process. We therefore propose the use of spiking neural networks. Because they operate on spikes, these networks are inherently capable of handling temporal data, which makes them suitable for estimating values that evolve over time. However, such networks have so far been used only for classification, not for predicting temporal evolutions, owing to the lack of encoding/decoding methods for temporal data. This work therefore proposes a methodology for encoding sequential signals into spike trains and for reconstructing sequential signals from spike trains, which may in the future enable the use of spiking neural networks for predicting sequential and/or temporal signals.
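    The encode/reconstruct round trip can be illustrated with the simplest spike code, Poisson rate coding; this is a generic sketch of the idea, not the thesis's encoding methodology, and the rate and window values are illustrative assumptions:

```python
import numpy as np

def encode_rate(signal, max_rate=1000.0, window=1.0, seed=0):
    """Rate-code a 0..1-normalised sequential signal: each sample becomes a
    Poisson spike count whose rate is proportional to the sample's value."""
    rng = np.random.default_rng(seed)
    return rng.poisson(np.asarray(signal) * max_rate * window)

def decode_rate(spike_counts, max_rate=1000.0, window=1.0):
    """Reconstruct the signal by inverting the rate code (count / max count)."""
    return np.asarray(spike_counts) / (max_rate * window)

signal = np.array([0.1, 0.5, 0.9, 0.4])        # a toy temporal sequence
recovered = decode_rate(encode_rate(signal))   # approximates signal, up to Poisson noise
```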