365 research outputs found

    Scalable And Efficient Outlier Detection In Large Distributed Data Sets With Mixed-type Attributes

    An important problem that often arises when analyzing data is identifying irregular or abnormal data points, called outliers. The problem broadly arises under two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted from the outliers themselves. Outlier detection in the second scenario is a research field that has attracted significant attention across a broad range of applications. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. The basis for deciding whether a data point is an outlier is often some measure or notion of dissimilarity between the point under consideration and the rest. Traditional outlier detection methods assume numerical or ordinal data and compute pairwise distances between data points. However, the notion of distance or similarity for categorical data is more difficult to define. Moreover, the size of currently available data sets dictates the need for fast and scalable outlier detection methods, precluding pairwise distance computations, and these methods must be applicable to data distributed among different locations. In this work, we propose novel strategies to efficiently deal with large distributed data containing mixed-type attributes. Specifically, we first propose a fast and scalable algorithm for categorical data (AVF) and its parallel version based on MapReduce (MR-AVF). We then extend AVF and introduce a fast outlier detection algorithm for large distributed data with mixed-type attributes (ODMAD). Finally, we modify ODMAD to deal with very high-dimensional categorical data. Experiments with large real-world and synthetic data show that the proposed methods exhibit large performance gains and high scalability compared to the state of the art, while achieving similar detection accuracy.
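    The abstract does not give AVF's scoring rule, but the Attribute Value Frequency idea is commonly described as scoring each record by the average frequency of its attribute values, with the lowest scores marking outlier candidates. Below is a minimal single-machine sketch under that assumption; the function name avf_scores and the toy data are ours, not the authors'.

```python
from collections import Counter

def avf_scores(rows):
    """Score each categorical record by the average frequency of its
    attribute values; the lowest-scoring records are outlier candidates."""
    m = len(rows[0])
    # One frequency table per attribute, built in a single pass.
    freq = [Counter(row[i] for row in rows) for i in range(m)]
    return [sum(freq[i][row[i]] for i in range(m)) / m for row in rows]

data = [
    ("red", "small", "cheap"),
    ("red", "small", "cheap"),
    ("red", "large", "cheap"),
    ("blue", "huge", "pricey"),   # rare values -> low AVF score
]
scores = avf_scores(data)
print(scores, data[scores.index(min(scores))])
```

    Because only per-attribute value counts are needed, a single pass over the data suffices, which is what makes the scheme amenable to a MapReduce parallelization such as MR-AVF.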

    Detection of outliers and outliers clustering on large datasets with distributed computing

    Master's thesis in Informatics, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2012. Outlier detection is a data analysis problem of great importance in diverse scientific fields, with many applications. Lacking a definitive formal definition and going by several other names (deviations, anomalies, exceptions, noise, atypical data), outliers are, succinctly, the samples in a dataset that, for some reason, differ from the rest of the set. It may be of interest either to remove them, as a filtering process to smooth the data, or to collect them into a new dataset holding potentially relevant additional information. Their importance can be seen in the broad range of applications, such as fraud or intrusion detection, specialized pattern recognition, data filtering, scientific data mining, and medical diagnosis. Although an old problem, with roots in Statistics, outlier detection has become more pertinent than ever, and yet harder to deal with. Better and more ubiquitous means of data acquisition, together with constantly increasing storage capacity, have made datasets grow considerably in recent years, along with their number and availability. Larger volumes of data become harder to explore and filter, while data treatment and analysis grow ever more necessary and fundamental. Distributed computing is a computer science paradigm for distributing hard, complex problems across several independent machines connected in a network. A problem is broken down into simpler sub-problems, which are solved simultaneously by the autonomous machines, and the resulting sub-solutions are collected and combined into a final solution. Distributed computing offers a way around the economic and physical limitations of hardware scaling: computational capacity is built up, as needed, by adding machines, not necessarily new or advanced models but any commodity hardware. This work presents several distributed computing algorithms for outlier detection, starting from a distributed version of an existing algorithm, CURIO[9], and introducing a series of optimizations and variants that lead to a new method, Curio3XD, which resolves both issues typical of this problem: the constraints imposed by the size and by the dimensionality of the datasets. The final version, and its variant, is applicable to any volume of data, by scaling the hardware of the distributed system, and to high-dimensional datasets, by replacing the original exponential dependency on the dimensionality with a quadratic dependency on the local density of the data, easily tunable through an algorithm parameter, the precision. Intermediate versions are presented to clarify the process that led to the final method, and as an alternative approach, possibly useful with very sparse datasets. For a distributed computing environment with full support for the distributed system and the underlying hardware infrastructure, Apache Hadoop[23] was chosen as the platform for development, implementation, and testing, owing to its power and flexibility combined with relatively easy use. It is a well-studied and well-documented open-source solution, employed by several major companies and applicable to both clouds and local clusters. The different algorithms and variants were developed within the MapReduce programming model and implemented on the Hadoop framework, which supports that model. MapReduce was conceived to permit the deployment of distributed computing applications in a simple, developer-oriented way, keeping the main focus on the programmatic solution of the problem and leaving the control and maintenance of the underlying distributed network entirely transparent. The developed implementations are included in the appendix. Tests with an adapted real-world dataset showed very good performance for the final versions of these algorithms, with excellent scalability in both the size and the dimensionality of the data, as predicted theoretically. Performance tests on the precision parameter and comparative tests between all developed variants are also presented and discussed.
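    As a rough illustration of the grid-cell counting that the abstract attributes to CURIO, the sketch below quantizes points into cells at a given bit precision and flags points falling in sparsely populated cells; the map and reduce phases are imitated in a single process. The cell encoding, the threshold, and everything Curio3XD adds (neighbor handling, Hadoop job wiring) are assumptions, not the thesis's actual implementation.

```python
from collections import defaultdict

def to_cell(point, precision):
    """Quantize a point to a grid cell id by keeping `precision` bits
    per dimension (the grid-cell idea behind CURIO-style detection)."""
    return tuple(int(v * (1 << precision)) for v in point)

def map_phase(points, precision):
    for p in points:
        yield to_cell(p, precision), 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for cell, n in pairs:
        counts[cell] += n
    return counts

# Points in sparsely populated cells are flagged as outliers.
points = [(0.11, 0.12), (0.12, 0.11), (0.13, 0.10), (0.91, 0.05)]
counts = reduce_phase(map_phase(points, precision=2))
outlier_cells = {c for c, n in counts.items() if n < 2}
print([p for p in points if to_cell(p, 2) in outlier_cells])
```

    The precision parameter controls cell granularity, which is how, per the abstract, the method trades the exponential dependency on dimensionality for a dependency on local data density.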

    Fraud detection for online banking for scalable and distributed data

    Online fraud causes billions of dollars in losses for banks; online banking fraud detection is therefore an important field of study. However, research in fraud detection faces many challenges. One constraint is the unavailability of bank datasets for research, or of the required attribute characteristics within the data. Numeric data usually yields better performance for machine learning algorithms, yet most transaction data also contains categorical or nominal features, and some platforms, such as Apache Spark, only accept numeric data. Techniques such as one-hot encoding (OHE) are therefore needed to transform categorical features into numeric ones, but OHE has its own challenges, including the sparseness of the transformed data and the fact that the distinct values of an attribute are not always known in advance. Efficient feature engineering can improve an algorithm's performance but usually requires detailed domain knowledge to identify the right features. Techniques like Ripple Down Rules (RDR) are suitable for fraud detection because of their low maintenance and incremental learning features; however, achieving high classification accuracy on mixed datasets, especially at scale, is challenging, and evaluating RDR on distributed platforms is difficult because it is not available on them. The thesis proposes the following solutions to these challenges:
    • We developed Highly Correlated Rule Based Uniformly Distribution (HCRUD), a technique to generate highly correlated, rule-based, uniformly distributed synthetic data.
    • We developed One-hot Encoded Extended Compact (OHE-EC), a technique to transform categorical features into numeric features by compacting sparse data even when not all distinct values are known in advance; a sketch of the underlying problem follows below.
    • We developed Feature Engineering and Compact Unified Expressions (FECUE), a technique to improve model efficiency through feature engineering when the domain of the data is not known in advance.
    • We proposed a Unified Expression RDR fraud detection technique (UE-RDR) for Big Data and evaluated it on the Spark platform.
    Empirical tests were executed on a multi-node Hadoop cluster using well-known classifiers on bank data, synthetic bank datasets, and publicly available datasets from the UCI repository. These evaluations demonstrated substantial improvements in classification accuracy, ruleset compactness, and execution speed. Doctor of Philosophy.
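    As a sketch of the encoding problem OHE-EC addresses (not of OHE-EC itself, whose details the abstract does not give), the toy encoder below grows its vocabulary as unseen category values arrive and stores only an index rather than a fixed dense layout; the class and method names are hypothetical.

```python
class IncrementalOneHot:
    """One-hot encoder that grows its vocabulary as new category values
    arrive, so values unknown at training time never break the pipeline."""
    def __init__(self):
        self.index = {}

    def encode(self, value):
        # Sparse representation: just the column index of the hot bit.
        if value not in self.index:
            self.index[value] = len(self.index)  # extend on first sight
        return self.index[value]

    def to_dense(self, value):
        i = self.encode(value)
        vec = [0] * len(self.index)
        vec[i] = 1
        return vec

enc = IncrementalOneHot()
print(enc.to_dense("visa"))        # [1]
print(enc.to_dense("mastercard"))  # [0, 1]
print(enc.to_dense("visa"))        # [1, 0]
```

    Keeping only the hot index sidesteps the sparseness issue the abstract mentions: a transaction with one card type out of thousands need not materialize thousands of zeros.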

    Data Mining Applications in Big Data

    Data mining is the process of extracting hidden, unknown, but potentially useful information from massive data. Big Data has great impact on scientific discoveries and value creation. This paper introduces methods in data mining and technologies in Big Data, and discusses the challenges of data mining in general and of data mining with big data in particular. Some recent technological progress in both areas is also presented.

    New scalable machine learning methods: beyond classification and regression

    Programa Oficial de Doutoramento en Computación, 5009V01. [Abstract] The recent surge in available data has spawned a new and promising age of machine learning. Success cases of machine learning are arriving at an increasing rate, as some algorithms are able to leverage immense amounts of data to produce complicated and highly accurate predictions. Still, many algorithms in the machine learning practitioner's toolbox have been rendered useless in this new scenario due to the complications associated with large-scale learning. Handling large datasets entails logistical problems, limits the computational and spatial complexity of the algorithms used, favours methods with few or no hyperparameters to be configured, and exhibits specific characteristics that complicate learning. This thesis is centered on the scalability of machine learning algorithms, that is, their capacity to maintain their effectiveness as the scale of the data grows, and on how that scalability can be improved. We focus on problems for which the existing solutions struggle when the scale grows. Therefore, we skip classification and regression problems and focus on feature selection, anomaly detection, graph construction, and explainable machine learning. We analyze four different strategies for obtaining scalable algorithms. First, we explore distributed computation, which is used in all of the presented algorithms. Besides this technique, we also examine the use of approximate models to speed up computations, the design of new models that take advantage of a characteristic of the input data to simplify training, and the enhancement of simple models to enable them to manage large-scale learning. We have implemented four new algorithms and six versions of existing ones that tackle the mentioned problems, and for each one we report experimental results that show both their validity in comparison with competing methods and their capacity to scale to large datasets. All the presented algorithms have been made available for download and are being published in journals to enable practitioners and researchers to use them.

    Dimension Debasing towards Minimal Search Space Utilization for Mining Patterns in Big Data

    Data mining algorithms generally produce interesting patterns, which domain experts can use to build business intelligence. However, most existing algorithms cannot properly handle uncertain data; given the characteristics of uncertain data, they face a much larger search space. In this paper we propose a method that reduces the search space while producing patterns from uncertain data. The proposed method is based on the MapReduce programming framework, which works in a distributed environment, and it targets big data, characterized by velocity, volume, and variety. The method also lets users impose constraints so as to produce high-quality patterns that support well-informed decisions. We built a prototype application as a proof of concept, and the empirical results are encouraging for mining uncertain big data in the presence of constraints. DOI: 10.17762/ijritcc2321-8169.15080
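    The abstract does not spell out the mining model, but a common formulation for uncertain data is expected support: an itemset's support is the sum over transactions of the product of its members' existence probabilities, and a minimum-support constraint prunes the search space. The single-process sketch below illustrates that formulation; the function name and parameters are ours, and the paper's actual MapReduce jobs are not shown.

```python
from collections import defaultdict
from itertools import combinations

def mine_uncertain(transactions, min_expected_support, max_len=2):
    """Expected-support pattern mining over uncertain data: each
    transaction maps items to existence probabilities, and an itemset's
    expected support sums the product of its members' probabilities."""
    support = defaultdict(float)
    for tx in transactions:                      # "map" over transactions
        items = sorted(tx)
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                p = 1.0
                for item in itemset:
                    p *= tx[item]
                support[itemset] += p            # "reduce" by summing
    # The support constraint prunes the search space to high-quality patterns.
    return {s: v for s, v in support.items() if v >= min_expected_support}

txs = [{"a": 0.9, "b": 0.8}, {"a": 0.7, "c": 0.5}, {"a": 0.9, "b": 0.6}]
print(mine_uncertain(txs, min_expected_support=1.0))
```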

    Contextual Anomaly Detection Framework for Big Sensor Data

    Performing predictive modelling, such as anomaly detection, on Big Data is a difficult task. The problem is compounded as more and more sources of Big Data are generated by environmental sensors, logging applications, and the Internet of Things. Further, most current techniques for anomaly detection consider only the content of the data source, i.e. the data itself, without concern for its context. As data becomes more complex, it is increasingly important to bias anomaly detection techniques toward the context, whether spatial, temporal, or semantic. The work proposed in this thesis outlines a contextual anomaly detection framework for use in Big sensor Data systems. The framework uses a well-defined content anomaly detection algorithm for real-time point anomaly detection. Additionally, we present a post-processing context-aware anomaly detection algorithm based on sensor profiles, which are groups of contextually similar sensors generated by a multivariate clustering algorithm. The contextual anomaly detection framework is evaluated on two different Big sensor Data sets: one from electrical sensors and one from temperature sensors within a building.
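    To make the sensor-profile idea concrete, here is a minimal sketch in which sensors are grouped into profiles and a reading is flagged when it deviates from its profile's historical statistics. In the thesis the groups come from a multivariate clustering algorithm, whereas here they are supplied by hand, and all names and thresholds are illustrative.

```python
import statistics

def build_profiles(history, groups):
    """Summarize each group of contextually similar sensors by the
    mean and standard deviation of its pooled historical readings."""
    profiles = {}
    for gid, sensors in groups.items():
        pooled = [v for s in sensors for v in history[s]]
        profiles[gid] = (statistics.mean(pooled), statistics.stdev(pooled))
    return profiles

def is_contextual_anomaly(value, sensor, groups, profiles, k=3.0):
    # A reading is anomalous relative to its own context group,
    # not relative to all sensors pooled together.
    gid = next(g for g, ss in groups.items() if sensor in ss)
    mu, sigma = profiles[gid]
    return abs(value - mu) > k * sigma

history = {"t1": [20.1, 20.4, 19.9], "t2": [20.0, 20.2, 20.3],
           "f1": [45.0, 47.2, 44.8]}
groups = {"room_temp": ["t1", "t2"], "furnace": ["f1"]}
profiles = build_profiles(history, groups)
print(is_contextual_anomaly(31.0, "t1", groups, profiles))  # True: hot for a room
print(is_contextual_anomaly(46.0, "f1", groups, profiles))  # False: normal furnace
```

    The same reading can thus be normal for one profile and anomalous for another, which is precisely what a content-only detector cannot express.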

    Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data

    Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015. As Big Data processing problems evolve, many modern applications demonstrate special characteristics: data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets selected according to certain criteria, so integrated support for efficient queries and post-query analysis is required. To address the system-level requirements raised by these characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle the research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems. In the storage layer, we show that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases that supports fully customizable index structures, which can embed the social context information necessary for efficient queries. The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN and make novel use of customized indices in developing sophisticated analysis algorithms. In the streaming analysis module, the high-dimensional representation of social media streams poses special challenges for parallel stream clustering: owing to the sparsity of the high-dimensional data, traditional synchronization methods become expensive and severely impact the scalability of the algorithm. We therefore design a novel strategy that broadcasts the incremental changes of the clusters rather than their whole centroids, yielding scalable parallel stream clustering algorithms. Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies.
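    The dissertation's exact synchronization protocol is not given in the abstract, but the payoff of broadcasting incremental changes is easy to see for sparse data: only the coordinates a batch actually touched need to travel. A small sketch of that idea, with hypothetical helper names:

```python
def sparse_delta(old, new):
    """Return only the coordinates of a centroid that changed, so the
    synchronization cost scales with the update's sparsity rather than
    with the full dimensionality."""
    return {i: v for i, v in new.items() if old.get(i, 0.0) != v}

def apply_delta(centroid, delta):
    centroid.update(delta)
    return centroid

def local_update(centroid, point, lr=0.1):
    """Worker step: nudge a centroid toward a sparse point."""
    new = dict(centroid)
    for i, v in point.items():
        new[i] = new.get(i, 0.0) + lr * (v - new.get(i, 0.0))
    return new

centroid = {0: 1.0, 5: 0.5}               # sparse stand-in for a huge vector
updated = local_update(centroid, {5: 1.5, 9: 2.0})
delta = sparse_delta(centroid, updated)    # broadcast this, not `updated`
print(delta)                               # only the touched coordinates travel
replica = apply_delta(dict(centroid), delta)
assert replica == updated                  # replicas converge from deltas alone
```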

    Big Data Clustering Algorithm and Strategies

    In the current digital era, extensive volumes of data are being generated at an enormous rate. The data are large, complex, and information-rich. To obtain valuable insights from this massive volume and variety of data, efficient and effective tools are needed, and clustering algorithms have emerged as a machine learning tool for accurately analyzing such data. Clustering is an unsupervised learning technique that groups data objects so that objects in the same group are as similar as possible while objects in different groups are dissimilar. Traditional algorithms, however, cannot cope with such huge amounts of data, so efficient clustering algorithms are needed to analyze big data within a reasonable time. In this paper we present a theoretical overview and comparison of various clustering techniques used for analyzing big data.