1,845 research outputs found

    Outlier Mining Methods Based on Graph Structure Analysis

    Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines that also has practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient only for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph, where the nodes are the elements of the dataset, and the links have associated weights that are the distances between the nodes. Then, the first method assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap non-linear dimensionality reduction algorithm, and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform or perform on par with other popular outlier detection methods. A main advantage of the percolation method is that it is parameter-free and therefore does not require any training; on the other hand, the IsoMap method has two integer parameters, and when they are appropriately selected, the method performs similarly to or better than all the other methods tested. Peer Reviewed. Postprint (published version)
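
The percolation idea in the abstract above can be illustrated with a small sketch. This is one plausible reading, not the authors' implementation: links are removed from longest to shortest, and each node's score is assumed to be the weight of the link whose removal first detaches it from the largest remaining component. The function names `components` and `percolation_scores`, the Euclidean distance, and the tie-breaking between equally large components are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def components(n, edges):
    """Return the connected components of an undirected graph on nodes 0..n-1."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.append(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def percolation_scores(points):
    """Assumed scoring rule: weight at which a node leaves the largest component."""
    n = len(points)
    # Complete graph, links sorted from longest to shortest.
    edges = sorted(
        ((dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)),
        reverse=True,
    )
    scores = [None] * n
    for k, (weight, _, _) in enumerate(edges):
        remaining = [(i, j) for _, i, j in edges[k + 1:]]
        giant = set(max(components(n, remaining), key=len))
        for v in range(n):
            if scores[v] is None and v not in giant:
                scores[v] = weight
    # Nodes that never left the largest component get the shortest link weight.
    last = edges[-1][0] if edges else 0.0
    return [last if s is None else s for s in scores]


# Four points on a unit square plus one distant point (index 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = percolation_scores(points)
```

Under this reading a larger score means the node detaches earlier, so the distant point receives the highest score. The sketch recomputes components after every removal, which is only practical for small datasets.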

    STWalk: Learning Trajectory Representations in Temporal Graphs

    Analyzing the temporal behavior of nodes in time-varying graphs is useful for many applications such as targeted advertising, community evolution and outlier detection. In this paper, we present a novel approach, STWalk, for learning trajectory representations of nodes in temporal graphs. The proposed framework makes use of structural properties of graphs at current and previous time-steps to learn effective node trajectory representations. STWalk performs random walks on a graph at a given time step (called space-walk) as well as on graphs from past time-steps (called time-walk) to capture the spatio-temporal behavior of nodes. We propose two variants of STWalk to learn trajectory representations. In one algorithm, we perform space-walk and time-walk as part of a single step. In the other variant, we perform space-walk and time-walk separately and combine the learned representations to get the final trajectory embedding. Extensive experiments on three real-world temporal graph datasets validate the effectiveness of the learned representations when compared to three baseline methods. We also show the quality of the learned trajectory embeddings for change point detection, and demonstrate that arithmetic operations on these trajectory representations yield interesting and interpretable results. Comment: 10 pages, 5 figures, 2 tables
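
The two kinds of walk named in the abstract above can be sketched as follows. This is a minimal illustration and not the STWalk algorithm itself: the snapshot format (one adjacency dictionary per time step), the `window` parameter, and the function names `space_walk` and `time_walk` are assumptions, and the step that turns walks into embeddings is omitted.

```python
import random


def space_walk(graph, start, length, rng):
    """Uniform random walk of at most `length` nodes within one snapshot."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1], [])
        if not neighbours:
            break  # dead end: stop early
        walk.append(rng.choice(neighbours))
    return walk


def time_walk(snapshots, node, t, window, length, rng):
    """Walks from `node` in up to `window` snapshots preceding time step t."""
    walks = []
    for past in range(t - 1, max(t - window, 0) - 1, -1):
        if node in snapshots[past]:  # the node may not exist yet in older graphs
            walks.append(space_walk(snapshots[past], node, length, rng))
    return walks


# Three hypothetical snapshots of a small graph, oldest first.
snapshots = [
    {"a": ["b"], "b": ["a"]},
    {"a": ["b", "c"], "b": ["a"], "c": ["a"]},
    {"a": ["c"], "c": ["a", "d"], "d": ["c"]},
]
rng = random.Random(0)
current = space_walk(snapshots[2], "a", 5, rng)
history = time_walk(snapshots, "a", 2, 2, 4, rng)
new_node_history = time_walk(snapshots, "d", 2, 2, 4, rng)
```

Here `current` holds a walk in the latest snapshot and `history` holds one walk per earlier snapshot in which the node exists. A node that first appears in the latest snapshot, such as `"d"`, has no history.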

    Spatial Data Quality in the IoT Era: Management and Exploitation

    Within the rapidly expanding Internet of Things (IoT), growing amounts of spatially referenced data are being generated. Due to the dynamic, decentralized, and heterogeneous nature of the IoT, spatial IoT data (SID) quality has attracted considerable attention in academia and industry. Inventing and using technologies for managing spatial data quality and for exploiting low-quality spatial data are key challenges in the IoT. In this tutorial, we highlight the SID consumption requirements in applications and offer an overview of spatial data quality in the IoT setting. In addition, we review pertinent technologies for quality management and low-quality data exploitation, and we identify trends and future directions for quality-aware SID management and utilization. The tutorial aims not only to help researchers and practitioners better comprehend SID quality challenges and solutions, but also to offer insights that may enable innovative research and applications.

    Towards Real-Time Detection and Tracking of Spatio-Temporal Features: Blob-Filaments in Fusion Plasma

    A novel algorithm and implementation of real-time identification and tracking of blob-filaments in fusion reactor data is presented. Similar spatio-temporal features are important in many other applications, for example, ignition kernels in combustion and tumor cells in a medical image. This work presents an approach for extracting these features by dividing the overall task into three steps: local identification of feature cells, grouping feature cells into extended features, and tracking the movement of features through their overlap in space. Through our extensive work in parallelization, we demonstrate that this approach can effectively make use of a large number of compute nodes to detect and track blob-filaments in real time in fusion plasma. On a set of 30 GB of fusion simulation data, we observed linear speedup on 1024 processes and completed blob detection in less than three milliseconds using Edison, a Cray XC30 system at NERSC. Comment: 14 pages, 40 figures
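
The three steps listed in the abstract above can be sketched serially on a small grid. This is an illustration under stated assumptions, not the paper's parallel implementation: a plain threshold stands in for the local identification criterion, grouping uses 4-connected flood fill, and the names `find_blobs` and `track` are hypothetical.

```python
def find_blobs(frame, threshold):
    """Steps 1 and 2: mark cells above `threshold`, then group 4-connected cells."""
    rows, cols = len(frame), len(frame[0])
    seen, blobs = set(), []
    for r in range(rows):
        for c in range(cols):
            if frame[r][c] <= threshold or (r, c) in seen:
                continue
            seen.add((r, c))
            stack, blob = [(r, c)], set()
            while stack:
                y, x = stack.pop()
                blob.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and (ny, nx) not in seen and frame[ny][nx] > threshold:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            blobs.append(blob)
    return blobs


def track(prev_blobs, next_blobs):
    """Step 3: link blobs in consecutive frames that share at least one cell."""
    return [
        (i, j)
        for i, a in enumerate(prev_blobs)
        for j, b in enumerate(next_blobs)
        if a & b
    ]


# Two hypothetical frames: one blob drifts right, a single-cell blob vanishes.
frame0 = [[0, 5, 5, 0, 0],
          [0, 5, 0, 0, 0],
          [0, 0, 0, 0, 6]]
frame1 = [[0, 0, 5, 5, 0],
          [0, 0, 5, 0, 0],
          [0, 0, 0, 0, 0]]
blobs0 = find_blobs(frame0, threshold=1)
blobs1 = find_blobs(frame1, threshold=1)
links = track(blobs0, blobs1)
```

Each pair in `links` gives the index of a blob in the earlier frame and the index of the blob it overlaps in the later frame. Blobs without a link have either appeared or disappeared between frames.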

    USING SPATIAL METHODS TO BETTER UNDERSTAND FOOD INSECURITY AND SNAP UNDER-PARTICIPATION IN TEXAS

    The overall objective of this research is to use spatial methods to better understand food insecurity and SNAP under-participation in Texas. Paper 1 assesses whether a sample of community-dwelling Medicare and Medicaid beneficiaries, who screen positive for food insecurity at healthcare locations in Harris County, exhibit a spatial pattern in terms of where they live. In other words, it tests whether or not there are statistically significant neighborhood hot spots or cold spots of food insecurity against a null hypothesis of complete spatial randomness. This approach is novel because it uses address-level data on patients who report being food insecure to test for statistically significant neighborhood hot spots or cold spots, instead of relying on extant factors like neighborhood poverty rates or the presence of grocery stores. Using address-level food insecurity screening data is often difficult because few organizations screen for food insecurity, and even fewer are willing to share their data due to privacy concerns. Paper 2 utilizes geographical information systems (GIS) to map census tract-level clusters and outliers of households that are eligible but not enrolled (EBNE) in the SNAP program. The implications of this analysis are vast. Knowing the locations of neighborhood-level clusters and outliers of SNAP EBNE households can inform interventions to address the “SNAP GAP” more effectively. Additionally, this method of identifying neighborhood-level clusters and outliers of SNAP EBNE households can be applied to other safety net programs including Medicaid, the Children’s Health Insurance Program (CHIP), Healthy Texas Women, and the Women, Infants, and Children (WIC) Program.