Outlier detection in performance data of parallel applications
When an adaptive software component is employed to select the best-performing implementation for a communication operation at runtime, the correctness of the decision taken strongly depends on detecting and removing outliers in the data used for the comparison. This automatic decision is greatly complicated by the fact that the types and quantities of outliers depend on the network interconnect and the nodes assigned to the job by the batch scheduler. This paper evaluates four different statistical methods used for handling outliers, namely a standard interquartile range method, a heuristic derived from the trimmed mean value, cluster analysis and a method using robust statistics. Using performance data from the Abstract Data and Communication Library (ADCL) we evaluate the correctness of the decisions made with each statistical approach over three fundamentally different network interconnects, namely a highly reliable InfiniBand network, a Gigabit Ethernet network having a larger variance in the performance, and a hierarchical Gigabit Ethernet network.
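The interquartile range method named in this abstract can be illustrated with a minimal sketch. The function name `iqr_filter`, the fence multiplier `k=1.5`, and the sample timings are assumptions chosen for illustration; the abstract does not state which parameters ADCL uses.

```python
from statistics import quantiles

def iqr_filter(samples, k=1.5):
    """Split samples into (kept, outliers) using interquartile range fences.

    k=1.5 is the conventional multiplier, assumed here for illustration.
    """
    q1, _, q3 = quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    kept = [x for x in samples if lo <= x <= hi]
    outliers = [x for x in samples if x < lo or x > hi]
    return kept, outliers

# Hypothetical execution times (ms) of one implementation of a communication operation.
timings = [10.1, 10.3, 9.9, 10.0, 10.2, 35.7, 10.1, 9.8]
kept, outliers = iqr_filter(timings)
print(outliers)  # the 35.7 ms measurement falls outside the fences
```

Comparing implementations on the mean of `kept` rather than of the raw samples keeps a single delayed measurement from dominating the decision.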
Scalable And Efficient Outlier Detection In Large Distributed Data Sets With Mixed-type Attributes
An important problem that appears often when analyzing data involves identifying irregular or abnormal data points called outliers. This problem broadly arises under two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted from the outliers themselves. Outlier detection in the context of the second scenario is a research field that has attracted significant attention in a broad range of useful applications. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. The basis for deciding whether a data point is an outlier is often some measure or notion of dissimilarity between the data point under consideration and the rest. Traditional outlier detection methods assume numerical or ordinal data, and compute pair-wise distances between data points. However, the notion of distance or similarity for categorical data is more difficult to define. Moreover, the size of currently available data sets dictates the need for fast and scalable outlier detection methods, thus precluding distance computations. Additionally, these methods must be applicable to data which might be distributed among different locations. In this work, we propose novel strategies to efficiently deal with large distributed data containing mixed-type attributes. Specifically, we first propose a fast and scalable algorithm for categorical data (AVF), and its parallel version based on MapReduce (MR-AVF). We extend AVF and introduce a fast outlier detection algorithm for large distributed data with mixed-type attributes (ODMAD). Finally, we modify ODMAD in order to deal with very high-dimensional categorical data. Experiments with large real-world and synthetic data show that the proposed methods exhibit large performance gains and high scalability compared to the state-of-the-art, while achieving similar detection accuracy.
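A frequency-based score for categorical data, in the spirit of AVF, can be sketched as follows. This sketch assumes the score of a record is the mean frequency of its attribute values, so that records built from rare values score lowest. The function names and the toy records are hypothetical, and the abstract itself does not spell out the scoring formula.

```python
from collections import Counter

def avf_scores(rows):
    """Mean frequency of each row's attribute values; low scores suggest outliers."""
    m = len(rows[0])
    # One frequency table per attribute, built in a single counting pass each.
    freq = [Counter(row[j] for row in rows) for j in range(m)]
    return [sum(freq[j][row[j]] for j in range(m)) / m for row in rows]

def avf_outliers(rows, k):
    """Indices of the k rows with the lowest scores."""
    scores = avf_scores(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i])[:k]

# Hypothetical categorical records: (protocol, service, status).
rows = [
    ("tcp", "http", "ok"),
    ("tcp", "http", "ok"),
    ("tcp", "ssh", "ok"),
    ("udp", "dns", "fail"),
    ("tcp", "http", "ok"),
]
print(avf_outliers(rows, 1))  # [3]
```

No pairwise distances are computed. The work is dominated by counting attribute values, which is a per-key aggregation of the kind that a MapReduce formulation can distribute.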
Algorithms for Large Scale Problems in Eigenvalue and Svd Computations and in Big Data Applications
As "big data" has increasing influence on our daily life and research activities, it poses significant challenges to various research areas. Some applications demand a fast solution of large, sparse eigenvalue and singular value problems. In other applications, extracting knowledge from large-scale data requires many techniques such as statistical calculations, data mining, and high performance computing. In this dissertation, we develop efficient and robust iterative methods and software for the computation of eigenvalues and singular values. We also develop practical numerical and data mining techniques to estimate the trace of a function of a large, sparse matrix and to detect in real time blob-filaments in fusion plasma on extremely large parallel computers. In the first work, we propose a hybrid two-stage SVD method for efficiently and accurately computing a few extreme singular triplets, especially the ones corresponding to the smallest singular values. The first stage achieves fast convergence while the second achieves the final accuracy. Furthermore, we develop high-performance preconditioned SVD software based on the proposed method on top of the state-of-the-art eigensolver PRIMME. The method can be used with or without preconditioning, on parallel computers, and is superior to other state-of-the-art SVD methods in both efficiency and robustness. In the second study, we provide insights and develop practical algorithms to accomplish efficient and accurate computation of interior eigenpairs using refined projection techniques in non-Krylov iterative methods. By analyzing different implementations of the refined projection, we propose a new hybrid method to efficiently find interior eigenpairs without compromising accuracy. Our numerical experiments illustrate the efficiency and robustness of the proposed method.
In the third work, we present a novel method to estimate the trace of a matrix inverse that exploits the pattern correlation between the diagonal of the inverse of the matrix and that of some approximate inverse. We leverage various sampling and fitting techniques to fit the diagonal of the approximation to that of the inverse. Our method may serve as a standalone kernel for providing a fast trace estimate or as a variance reduction method for Monte Carlo in some cases. An extensive set of experiments demonstrates the potential of our method. In the fourth study, we provide first results on applying outlier detection techniques to effectively tackle the fusion blob detection problem on extremely large parallel machines. We present a real-time region outlier detection algorithm to efficiently find and track blobs in fusion experiments and simulations. Our experiments demonstrate that we can achieve linear speedup up to 1024 MPI processes and complete blob detection in two or three milliseconds.
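The abstract does not name the Monte Carlo estimator it refers to. A common choice for trace estimation is Hutchinson's estimator, sketched below as an assumption rather than as the dissertation's method. The function name `hutchinson_trace`, the operator `apply_inverse`, and the diagonal test matrix are hypothetical.

```python
import random

def hutchinson_trace(apply_op, n, num_samples=100, seed=0):
    """Estimate tr(M) as the average of z^T M z over random +/-1 vectors z.

    apply_op(v) must return M @ v as a list; M itself is never formed.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        y = apply_op(z)
        total += sum(zi * yi for zi, yi in zip(z, y))
    return total / num_samples

# Toy operator: the inverse of a diagonal matrix, applied without forming it.
diag = [2.0, 4.0, 5.0, 10.0]

def apply_inverse(v):
    return [vi / di for vi, di in zip(v, diag)]

estimate = hutchinson_trace(apply_inverse, n=len(diag))
exact = sum(1.0 / d for d in diag)
print(estimate, exact)
```

With +/-1 probe vectors the estimate is exact for a diagonal operator, because the variance comes only from off-diagonal entries. For a general sparse matrix each sample requires a linear solve, which is why a variance reduction step that lowers the number of samples is valuable.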
Towards Real-Time Detection and Tracking of Spatio-Temporal Features: Blob-Filaments in Fusion Plasma
A novel algorithm and implementation of real-time identification and tracking
of blob-filaments in fusion reactor data is presented. Similar spatio-temporal
features are important in many other applications, for example, ignition
kernels in combustion and tumor cells in a medical image. This work presents an
approach for extracting these features by dividing the overall task into three
steps: local identification of feature cells, grouping feature cells into
extended features, and tracking the movement of features through overlap in
space. Through our extensive work in parallelization, we demonstrate that this
approach can effectively make use of a large number of compute nodes to detect
and track blob-filaments in real time in fusion plasma. On a set of 30GB fusion
simulation data, we observed linear speedup on 1024 processes and completed
blob detection in less than three milliseconds using Edison, a Cray XC30 system
at NERSC.
Comment: 14 pages, 40 figures
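The three steps listed in this abstract can be sketched on a small 2D grid. This serial sketch assumes a fixed intensity threshold for the local identification step, 4-connectivity for grouping, and a shared cell as the overlap criterion. All names, the threshold, and the toy frames are hypothetical, and the paper's actual detection criterion and parallel decomposition are not reproduced here.

```python
from collections import deque

def feature_cells(frame, threshold):
    """Step 1: local identification of cells whose value exceeds the threshold."""
    return {(r, c) for r, row in enumerate(frame)
            for c, v in enumerate(row) if v > threshold}

def group_cells(cells):
    """Step 2: group 4-connected feature cells into extended features (blobs)."""
    remaining = set(cells)
    blobs = []
    while remaining:
        start = remaining.pop()
        blob, queue = {start}, deque([start])
        while queue:
            r, c = queue.popleft()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    blob.add(nb)
                    queue.append(nb)
        blobs.append(blob)
    blobs.sort(key=min)  # deterministic order, by smallest cell coordinate
    return blobs

def track(prev_blobs, next_blobs):
    """Step 3: link blobs in consecutive frames that overlap in space."""
    return [(i, j) for i, p in enumerate(prev_blobs)
            for j, n in enumerate(next_blobs) if p & n]

frame0 = [[0, 0, 0, 0], [0, 5, 6, 0], [0, 0, 0, 0], [0, 0, 0, 7]]
frame1 = [[0, 0, 0, 0], [0, 0, 6, 5], [0, 0, 0, 0], [0, 0, 0, 0]]
blobs0 = group_cells(feature_cells(frame0, threshold=3))
blobs1 = group_cells(feature_cells(frame1, threshold=3))
links = track(blobs0, blobs1)
print(links)  # [(0, 0)]: the first blob moved one cell to the right
```

The second blob in `frame0` has no link, which means it vanished between the two frames.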
In-Network Outlier Detection in Wireless Sensor Networks
To address the problem of unsupervised outlier detection in wireless sensor
networks, we develop an approach that (1) is flexible with respect to the
outlier definition, (2) computes the result in-network to reduce both bandwidth
and energy usage, (3) only uses single hop communication thus permitting very
simple node failure detection and message reliability assurance mechanisms
(e.g., carrier-sense), and (4) seamlessly accommodates dynamic updates to data.
We examine performance using simulation with real sensor data streams. Our
results demonstrate that our approach is accurate and imposes a reasonable
communication load and level of power consumption.
Comment: Extended version of a paper appearing in the Int'l Conference on
Distributed Computing Systems 200
Adapted K-Nearest Neighbors for Detecting Anomalies on Spatio–Temporal Traffic Flow
Outlier detection is an extensive research area, which has been intensively studied in several domains such as biological sciences, medical diagnosis, surveillance, and traffic anomaly detection. This paper explores advances in the outlier detection area by finding anomalies in spatio-temporal urban traffic flow. It proposes a new approach that considers the distribution of the flows in a given time interval. The flow distribution probability (FDP) databases are first constructed from the traffic flows by considering both spatial and temporal information. The outlier detection mechanism is then applied to the incoming flow distribution probabilities: the inliers are stored to enrich the FDP databases, while the outliers are excluded from them. Moreover, a k-nearest neighbor approach to distance-based outlier detection is investigated and adopted for FDP outlier detection. To validate the proposed framework, real data from the Odense traffic flow case are evaluated at ten locations. The results reveal that the proposed framework is able to detect the real distribution of flow outliers. Another experiment has been carried out on Beijing data; the results show that our approach outperforms the baseline algorithms for high-urban traffic flow.
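The screening loop described in this abstract can be sketched with a plain k-nearest-neighbor distance score. The sketch assumes each FDP is a fixed-length probability vector compared by Euclidean distance. The function names, `k`, the threshold, and the toy vectors are hypothetical, and the paper's adapted kNN variant may differ.

```python
import math

def knn_distance(query, database, k):
    """Distance from query to its k-th nearest neighbor in the database."""
    dists = sorted(math.dist(query, ref) for ref in database)
    return dists[k - 1]

def screen(query, database, k=2, threshold=0.2):
    """Return True if query is an outlier; inliers are added to the database."""
    is_outlier = knn_distance(query, database, k) > threshold
    if not is_outlier:
        database.append(query)  # inliers enrich the database
    return is_outlier

# Hypothetical FDPs: share of low, medium, and high flow in one time interval.
fdp_database = [(0.20, 0.50, 0.30), (0.25, 0.50, 0.25), (0.20, 0.55, 0.25)]
typical = screen((0.22, 0.50, 0.28), fdp_database)
anomalous = screen((0.80, 0.10, 0.10), fdp_database)
print(typical, anomalous, len(fdp_database))  # False True 4
```

Only the typical distribution is appended, so anomalous intervals do not contaminate the reference data used to judge later ones.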