    Overhead Management Strategies for Internet of Things Devices

    Overhead (time and energy) management is paramount for Internet of Things (IoT) edge devices, given their typically resource-constrained nature. In this thesis we present two contributions for lowering the resource consumption of IoT devices. The first contribution is minimizing the overhead of the Transport Layer Security (TLS) authentication protocol in IoT networks by selecting a lightweight cipher suite configuration. TLS is the de facto authentication protocol for secure communication in IoT applications; however, its processing and energy demands are the two essential parameters that must be taken into account given the resource-constrained nature of IoT devices. For the first contribution, we study these parameters using a testbed in which an IoT board (Cypress CYW43907) communicates with a server over an 802.11 wireless link. Although TLS supports a wide array of cipher suites, we focus on DHE-RSA, ECDHE-RSA, and ECDHE-ECDSA, which are among the most popular ciphers due to their robustness. Our studies show that ciphers using Elliptic Curve Diffie-Hellman Ephemeral (ECDHE) key exchange are considerably more efficient than those using Diffie-Hellman Ephemeral (DHE). Furthermore, with ECDHE key exchange, ECDSA signature verification consumes more time and energy than RSA signature verification. This study helps IoT designers choose an appropriate TLS cipher suite based on application demands, computational capabilities, and available energy resources.

    The second contribution of this thesis is deploying supervised machine learning anomaly detection algorithms on an IoT edge device to reduce data transmission overhead and cloud storage requirements. With continuous monitoring and sensing, millions of IoT sensors all over the world generate tremendous amounts of data every minute. As a result, recent studies have begun to ask whether to send all sensing data directly to the cloud (i.e., direct transmission) or to preprocess the data at the network edge and send only the necessary data to the cloud (i.e., preprocessing at the edge). Anomaly detection is particularly useful as an edge mining technique for reducing transmission overhead when the frequently monitored activities contain only a sparse set of anomalies. We analyze the potential overhead savings of machine-learning-based anomaly detection models on the edge in three different IoT scenarios. Our experimental results show that, by choosing appropriate anomaly detection models, we can effectively reduce the total transmission energy as well as the required cloud storage. We show that Random Forest, Multilayer Perceptron, and Discriminant Analysis models can viably save time and energy on the edge device during data transmission, whereas K-Nearest Neighbors, although reliable in terms of prediction accuracy, demands exorbitant overhead and results in a net time and energy loss on the edge device. In addition to presenting our model results for the different IoT scenarios, we provide guidelines for model selection through an analysis of the tradeoffs involved, such as training overhead, prediction overhead, and classification accuracy.
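    As an illustration of the first contribution's cipher-suite selection, the sketch below pins a TLS client to a single ECDHE-RSA suite. This is a minimal sketch in desktop Python, not the embedded TLS stack used on the Cypress CYW43907 board, and the host name is a placeholder.

```python
import socket
import ssl

# Minimal sketch: restrict the client to one TLS 1.2 cipher suite so the
# handshake exercises ECDHE key exchange with RSA signatures, the combination
# the study found most efficient. HOST is a placeholder.
HOST = "example.com"

context = ssl.create_default_context()
context.maximum_version = ssl.TLSVersion.TLSv1_2  # set_ciphers() targets TLS 1.2 suites
context.set_ciphers("ECDHE-RSA-AES128-GCM-SHA256")

with socket.create_connection((HOST, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        print("Negotiated cipher:", tls.cipher())
```

    For the second contribution, the prediction-overhead gap between Random Forest and K-Nearest Neighbors can be reproduced in miniature with scikit-learn; the data below is a synthetic stand-in for the thesis's IoT sensor traces.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for an IoT sensor stream.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for model in (RandomForestClassifier(n_estimators=50, random_state=0),
              KNeighborsClassifier(n_neighbors=5)):
    model.fit(X, y)
    start = perf_counter()
    model.predict(X)  # per-sample prediction cost drives the edge time/energy budget
    print(f"{type(model).__name__}: {perf_counter() - start:.3f} s")
```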

    Inferring Anomalies from Data using Bayesian Networks

    Existing studies in data mining have largely focused on the design of measures and algorithms to identify outliers in large, high-dimensional categorical and numeric databases. However, little attention has been paid to the interestingness of the reported outliers. One way to ascertain the interestingness and usefulness of a reported outlier is to make use of domain knowledge. In this thesis, we present measures to discover outliers based on background knowledge represented by a Bayesian network. Using the causal relationships between attributes encoded in the Bayesian framework, we demonstrate that meaningful outliers, i.e., outliers which encode important or new information, are those which violate the causal relationships encoded in the model. Depending upon the nature of the data, several approaches are proposed to identify and explain anomalies using Bayesian knowledge. Outliers are often identified as data points which are "rare", "isolated", or "far away from their nearest neighbors". We show that these characteristics may not be an accurate way of describing interesting outliers. Through a critical analysis of several existing outlier detection techniques, we show why there is a mismatch between outliers as entities described by these characteristics and "real" outliers as identified using the Bayesian approach. We show that the Bayesian approaches presented in this thesis have better accuracy in mining genuine outliers while keeping a lower false-positive rate than traditional outlier detection techniques.
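    As a toy illustration of the core idea (a hand-specified two-node network, not the thesis's actual measures), records can be scored by their joint probability under a Bayesian network; points that violate the encoded causal relationship receive a very low likelihood.

```python
import pandas as pd

# Hand-specified Bayesian network: Rain -> WetGrass (illustrative numbers).
p_rain = {0: 0.8, 1: 0.2}                      # P(Rain)
p_wet_given_rain = {(0, 0): 0.9, (0, 1): 0.1,  # P(WetGrass | Rain)
                    (1, 0): 0.05, (1, 1): 0.95}

data = pd.DataFrame({"rain": [0, 1, 0, 1], "wet": [0, 1, 1, 0]})

def joint_prob(row):
    # Joint probability under the network: P(Rain) * P(WetGrass | Rain).
    return p_rain[row.rain] * p_wet_given_rain[(row.rain, row.wet)]

data["likelihood"] = data.apply(joint_prob, axis=1)
# The record (rain=1, wet=0) contradicts the causal link and scores lowest.
print(data[data["likelihood"] < 0.05])
```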

    Oil and Gas flow Anomaly Detection on offshore naturally flowing wells using Deep Neural Networks

    Dissertation presented as partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. The Oil and Gas industry, as never before, faces multiple challenges. It is criticized as dirty and polluting, hence the growing demand for green alternatives. Nevertheless, the world still has to rely heavily on hydrocarbons, since they are the most traditional and stable source of energy, as opposed to the extensively promoted hydro, solar, and wind power. Major operators are challenged to produce oil more efficiently to counteract the newly arising energy sources, with a smaller climate footprint and more closely scrutinized expenditure, while facing high skepticism regarding the industry's future. While most of the tools used by the hydrocarbon E&P industry are expensive and have been in use for many years, it is paramount for the industry's survival and prosperity to apply predictive maintenance technologies that foresee potential failures, making production safer, lowering downtime, increasing productivity, and diminishing maintenance costs. Many efforts have been made to define the most accurate and effective predictive methods; however, data scarcity limits the speed and capacity for further experimentation. While it would be highly beneficial for the industry to invest in Artificial Intelligence, this research aims at exploring, in depth, the subject of Anomaly Detection, using open public data from Petrobras that was prepared by experts. For this research, deep neural networks, namely Recurrent Neural Networks with LSTM and GRU backbones, were implemented for multi-class classification of undesirable events on naturally flowing wells. Further, several hyperparameter optimization tools were explored, mainly focusing on Genetic Algorithms, among the most advanced methods for such tasks. The research concluded with the best-performing algorithm using 2 stacked GRU layers and the hyperparameter vector [1, 47, 40, 14], which stands for timestep 1, 47 hidden units, 40 epochs, and batch size 14, producing an F1 score of 0.97. As the world faces many issues, one of which is the detrimental effect of heavy industry on the environment and, as a result, adverse global climate change, this project is an attempt to contribute to the field of applying Artificial Intelligence in the Oil and Gas industry, with the intention of making it more efficient, transparent, and sustainable.
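    A minimal Keras sketch of the reported best configuration (2 stacked GRU layers, 47 hidden units, timestep 1, 40 epochs, batch size 14) follows; the feature and class counts are assumed placeholders rather than values taken from the Petrobras data.

```python
import tensorflow as tf
from tensorflow.keras import layers

TIMESTEPS, N_FEATURES, N_CLASSES = 1, 8, 9  # assumed placeholder dimensions

model = tf.keras.Sequential([
    layers.GRU(47, return_sequences=True,
               input_shape=(TIMESTEPS, N_FEATURES)),  # 47 hidden units
    layers.GRU(47),                                   # second stacked GRU
    layers.Dense(N_CLASSES, activation="softmax"),    # multi-class output
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Reported training settings: model.fit(X, y, epochs=40, batch_size=14)
model.summary()
```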

    ServeNet: A Deep Neural Network for Web Services Classification

    Automated service classification plays a crucial role in service discovery, selection, and composition. Machine learning has been widely used for service classification in recent years. However, the performance of conventional machine learning methods depends heavily on the quality of manual feature engineering. In this paper, we present a novel deep neural network that automatically abstracts low-level representations of both the service name and the service description into high-level merged features, without feature engineering or input-length limitations, and then predicts the service classification across 50 service categories. To demonstrate the effectiveness of our approach, we conduct a comprehensive experimental study comparing 10 machine learning methods on 10,000 real-world web services. The results show that the proposed deep neural network achieves higher classification accuracy and greater robustness than the other machine learning methods.
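    The two-input idea, embedding the service name and description separately and merging them before classification, can be sketched as follows. This is an illustrative Keras model under assumed vocabulary and sequence-length parameters, not the published ServeNet architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed sizes; only the 50-category output comes from the abstract.
VOCAB, NAME_LEN, DESC_LEN, N_CATEGORIES = 20000, 10, 200, 50

name_in = layers.Input(shape=(NAME_LEN,), name="service_name")
desc_in = layers.Input(shape=(DESC_LEN,), name="service_description")

embed = layers.Embedding(VOCAB, 128)                        # shared embedding
name_feat = layers.GlobalAveragePooling1D()(embed(name_in))
desc_feat = layers.Bidirectional(layers.LSTM(64))(embed(desc_in))

merged = layers.concatenate([name_feat, desc_feat])         # merged features
output = layers.Dense(N_CATEGORIES, activation="softmax")(merged)

model = tf.keras.Model(inputs=[name_in, desc_in], outputs=output)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```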

    Meta-survey on outlier and anomaly detection

    The impact of outliers and anomalies on model estimation and data processing is of paramount importance, as evidenced by the extensive body of research spanning various fields over several decades: thousands of research papers have been published on the subject. As a consequence, numerous reviews, surveys, and textbooks have sought to summarize the existing literature, encompassing a wide range of methods from both the statistical and data mining communities. While these endeavors to organize and summarize the research are invaluable, they face inherent challenges due to the pervasive nature of outliers and anomalies in all data-intensive applications, irrespective of the specific application field or scientific discipline; the collection of papers thus remains voluminous and somewhat heterogeneous. To address the need for knowledge organization in this domain, this paper presents the first systematic meta-survey of general surveys and reviews on outlier and anomaly detection. Employing a classical systematic survey approach, the study collects nearly 500 papers using two specialized scientific search engines. From this comprehensive collection, a subset of 56 papers that claim to be general surveys on outlier detection is selected using a snowball search technique to enhance field coverage. A meticulous quality assessment phase further refines the selection to a subset of 25 high-quality general surveys. Using this curated collection, the paper investigates the evolution of the outlier detection field over a 20-year period, revealing emerging themes and methods. Furthermore, an analysis of the surveys sheds light on the survey writing practices adopted by scholars from different communities who have contributed to this field. Finally, the paper delves into several topics where consensus has emerged from the literature, including taxonomies of outlier types, challenges posed by high-dimensional data, the importance of anomaly scores, the impact of learning conditions, difficulties in benchmarking, and the significance of neural networks. Aspects on which no consensus has emerged are also discussed, particularly the distinction between local and global outliers and the challenges in organizing detection methods into meaningful taxonomies.

    D-AREdevil: a novel approach for discovering disease-associated rare cell populations in mass cytometry data

    Background: Advances in single-cell technologies such as mass cytometry provide increasing resolution of the complexity of cellular samples, allowing researchers to investigate and understand cellular heterogeneity more deeply and possibly to detect and discover previously undetectable rare cell populations. The identification of rare cell populations is of paramount importance for understanding the onset, progression, and pathogenesis of many diseases. However, their identification remains challenging due to the ever-increasing dimensionality and throughput of the data generated. Aim: This study aimed at implementing a straightforward approach that efficiently supports a data analyst in identifying disease-associated rare cell populations in large, complex biological samples within reasonable limits of time and computational infrastructure. Methods: We propose a novel computational framework called D-AREdevil (disease-associated rare cell detection) for cytometry datasets. The main characteristic of our computational framework is the combination of an anomaly detection algorithm (i.e., LOF or FiRE) that provides a continuous score for individual cells with one of the best-performing and fastest unsupervised clustering methods (i.e., FlowSOM). In our approach, the LOF score serves to select a set of candidate cells belonging to one or more subgroups of similar rare cell populations. We then test these subgroups of rare cells for association with a patient group, disease type, clinical outcome, or other characteristic of interest. Results: We report the properties and implementation of D-AREdevil and present an evaluation of its performance and applications on three different testing datasets based on mass cytometry data. We generated data mixed with one or more known rare cell populations at varying frequencies (below 1%) and tested the ability of our approach to identify those cells and bring them to the attention of the data analyst. This is a key step in finding cell subgroups associated with a disease or outcome of interest when their existence is not known in advance and has yet to be discovered. Conclusions: We propose a novel computational framework with demonstrated good sensitivity and precision in detecting target rare cell populations present at very low frequencies (<1%) in the total dataset.
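    The two-step idea, anomaly scoring followed by clustering of the top-scoring cells, can be sketched with scikit-learn as below. FlowSOM has no scikit-learn equivalent, so KMeans stands in for the clustering step, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in: abundant populations plus one rare population (<1%).
rng = np.random.default_rng(0)
cells = np.vstack([rng.normal(0.0, 1.0, size=(5000, 10)),
                   rng.normal(6.0, 0.3, size=(30, 10))])

lof = LocalOutlierFactor(n_neighbors=50)
lof.fit(cells)
scores = -lof.negative_outlier_factor_       # higher score = more anomalous

# Keep the top 1% of cells as rare-population candidates, then subgroup them.
candidates = cells[scores > np.quantile(scores, 0.99)]
labels = KMeans(n_clusters=3, n_init=10).fit_predict(candidates)
print(len(candidates), "candidate cells in", len(set(labels)), "subgroups")
```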

    Neuromorphic Learning Systems for Supervised and Unsupervised Applications

    The advancements in high performance computing (HPC) have enabled the large-scale implementation of neuromorphic learning models and pushed research on computational intelligence into a new era. These bio-inspired models are constructed from unified building blocks, i.e., neurons, and have shown potential for learning complex information. Two major challenges remain in neuromorphic computing. First, sophisticated structuring methods are needed to determine the connectivity of the neurons in order to model various problems accurately. Second, the models need to adapt to non-traditional architectures for improved computation speed and energy efficiency. In this thesis, we address these two problems and apply our techniques to different cognitive applications. The thesis first presents the self-structured confabulation network for anomaly detection. Among machine learning applications, unsupervised detection of anomalous streams is especially challenging because it requires both detection accuracy and real-time performance. Designing a computing framework that harnesses the growing computing power of multicore systems while maintaining high sensitivity and specificity to anomalies is an urgent research need. We present AnRAD (Anomaly Recognition And Detection), a bio-inspired detection framework that performs probabilistic inferences. We leverage the mutual information between features and develop a self-structuring procedure that learns a succinct confabulation network from unlabeled data. This network is capable of fast incremental learning, continuously refining its knowledge base from the data streams. Compared to several existing anomaly detection methods, the proposed approach provides competitive detection accuracy as well as insight into the reasoning behind its decisions. Furthermore, we exploit the massively parallel structure of the AnRAD framework. Our implementations of the recall algorithms on the graphics processing unit (GPU) and the Xeon Phi co-processor both obtain substantial speedups over the sequential implementation on a general-purpose microprocessor (GPP). The implementation enables real-time service to concurrent data streams with diversified contexts and can be applied to large problems with multiple local patterns. Experimental results demonstrate high computing performance and memory efficiency. For vehicle abnormal behavior detection, the framework is able to monitor up to 16,000 vehicles and their interactions in real time with a single commodity co-processor, using less than 0.2 ms per test subject. When adapting our streaming anomaly detection model to mobile devices or unmanned systems, the key challenge is delivering the required performance under stringent power constraints. To address the paradox between performance and power consumption, brain-inspired hardware, such as the IBM Neurosynaptic System, has been developed to enable low-power implementations of neural models. As a follow-up to the AnRAD framework, we propose porting the detection network to the TrueNorth architecture. Implementing inference-based anomaly detection on a neurosynaptic processor is not straightforward due to hardware limitations. A design flow and a supporting component library are developed to flexibly map the learned detection networks to the neurosynaptic cores. Instead of the popular rate code, a burst code is adopted in the design, which represents a numerical value using the phase of a burst of spike trains. This not only reduces hardware complexity but also increases the accuracy of the results. A Corelet library, NeoInfer-TN, is implemented for basic operations in burst code, and two-phase pipelines are constructed from the library components. The design can be configured for different tradeoffs between detection accuracy, hardware resource consumption, throughput, and energy. We evaluate the system using network intrusion detection data streams. The results show a higher detection rate than some conventional approaches and real-time performance, with only 50 mW power consumption; overall, it achieves 10^8 operations per joule. In addition to the modeling and implementation of unsupervised anomaly detection, we also investigate a supervised learning model based on neural networks and deep fragment embedding and apply it to text-image retrieval. The study aims at bridging the gap between images and natural language and improving bidirectional retrieval performance across the two modalities. Unlike existing works that target a single sentence densely describing the image objects, we elevate the topic to associating deep image representations with noisy texts that are only loosely correlated. Based on text-image fragment embedding, our model employs a sequential configuration that connects two embedding stages: the first stage learns the relevancy of the text fragments, and the second stage uses the filtered output from the first to improve the matching results. The model also integrates multiple convolutional neural networks (CNNs) to construct the image fragments, from which rich context information such as human faces can be extracted to increase alignment accuracy. The proposed method is evaluated with both a synthetic dataset and a real-world dataset collected from a picture-news website. The results show up to 50% ranking performance improvement over the comparison models.
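    One plausible reading of the burst code mentioned above, sketched in Python: a value is represented by the phase (start tick) at which a short burst of spikes begins within a fixed time window. The window and burst lengths are illustrative assumptions, not details of the NeoInfer-TN Corelet library.

```python
import numpy as np

NUM_TICKS, BURST_LEN = 16, 4  # assumed window and burst sizes

def burst_encode(value):
    """Encode a value in [0, 1] as the start phase of a spike burst."""
    start = int(round(value * (NUM_TICKS - BURST_LEN)))
    spikes = np.zeros(NUM_TICKS, dtype=int)
    spikes[start:start + BURST_LEN] = 1
    return spikes

def burst_decode(spikes):
    """Recover the value from the first tick at which the burst fires."""
    start = int(np.argmax(spikes))
    return start / (NUM_TICKS - BURST_LEN)

spikes = burst_encode(0.6)
print(spikes, "->", round(burst_decode(spikes), 2))  # quantized to 1/12 steps
```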
    • 

    corecore