2,129 research outputs found

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. Outliers may be instances of error or may indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and to discover interesting and useful knowledge about unusual events within numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on the data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework.

    Graph Sample and Hold: A Framework for Big-Graph Analytics

    Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs and social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes used to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH). To begin, the proposed framework samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state. We then show how to produce unbiased estimators for various graph properties from the sample. Given that the graph analysis algorithms will run on a sample instead of the whole population, the runtime complexity of these algorithms is kept under control. Moreover, given that the estimators of graph properties are unbiased, the approximation error is kept under control. Finally, we show the performance of the proposed framework (gSH) on various types of graphs, such as social graphs, among others.
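
The single-pass idea described in this abstract can be illustrated with a minimal sketch. The abstract does not give the sampling rule, so the two probabilities `p` and `q`, the adjacency-based "hold" rule, and the function names below are all assumptions made for illustration, not the authors' specification. An arriving edge is kept with probability `q` if it touches an already sampled node and with probability `p` otherwise; weighting each kept edge by the inverse of the probability it was kept with gives an unbiased estimate of the total edge count.

```python
import random


def sample_and_hold(edge_stream, p, q, seed=0):
    """One pass over the stream; returns sampled edges with their selection probability.

    Assumed rule (not from the abstract): use q when the edge touches a node
    that already appears in the sample, otherwise use p. Requires p, q in (0, 1].
    """
    rng = random.Random(seed)
    sampled = []        # (u, v, probability used when the edge was kept)
    held_nodes = set()  # small state: nodes incident to sampled edges
    for u, v in edge_stream:
        prob = q if (u in held_nodes or v in held_nodes) else p
        if rng.random() < prob:
            sampled.append((u, v, prob))
            held_nodes.update((u, v))
    return sampled


def estimate_edge_count(sampled):
    """Inverse-probability (Horvitz-Thompson style) estimate of the number of edges."""
    return sum(1.0 / prob for _, _, prob in sampled)


if __name__ == "__main__":
    edges = [(i, i + 1) for i in range(50)]  # toy path graph with 50 edges
    sample = sample_and_hold(edges, p=0.3, q=0.6, seed=1)
    print(len(sample), estimate_edge_count(sample))
```

A single run can land well away from the true count; the estimate is only unbiased on average over repeated runs, which is why the state stays small while the error remains controlled.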

    Mapping crime: Understanding Hotspots


    Hybrid group anomaly detection for sequence data: application to trajectory data analytics

    Many research areas depend on group anomaly detection. The use of group anomaly detection can maintain and provide security and privacy to the data involved. This research attempts to address a deficiency of the existing literature on outlier detection; to that end, a novel hybrid framework to detect group anomalies in sequence data is proposed in this paper. It proposes two approaches for efficiently solving this problem: i) a hybrid data mining-based algorithm, which consists of three main phases: first, a clustering algorithm is applied to derive the micro-clusters; second, the kNN algorithm is applied to each micro-cluster to calculate the candidates of the group's outliers; third, a pattern mining framework is applied to the candidates of the group's outliers as a pruning strategy, to generate the groups of outliers; and ii) a GPU-based approach, which benefits from the massive parallelism of GPU computing to reduce the runtime of the hybrid data mining-based algorithm. Extensive experiments on different sequence databases were conducted to show the advantages of our proposed model. Results clearly show the efficiency of the GPU approach when directly compared to a sequential approach, reaching a speedup of 451. In addition, both approaches outperform the baseline methods for group detection.
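
As a rough illustration of the second phase only, the sketch below scores each point in one micro-cluster by the distance to its k-th nearest neighbor and treats the highest-scoring points as outlier candidates. The abstract does not say how kNN is turned into a score, so this scoring rule, the Euclidean distance, and the names `knn_scores` and `top_candidates` are assumptions; the clustering and pattern-mining phases are not shown.

```python
import math


def knn_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor (larger = more isolated)."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point in the same micro-cluster.
        dists = sorted(math.dist(p, other) for j, other in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_candidates(points, k, n_candidates):
    """Return indices of the n_candidates points with the largest kNN score."""
    scores = knn_scores(points, k)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return order[:n_candidates]


if __name__ == "__main__":
    cluster = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]  # toy micro-cluster
    print(top_candidates(cluster, k=2, n_candidates=1))
```

The brute-force distance computation is quadratic in the cluster size, which is the kind of cost that applying kNN per micro-cluster, and the GPU variant, would be expected to reduce.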

    A Spatiotemporal analysis to identify Naturally Occurring Retirement Communities in Nebraska

    This study aims to identify the geographic locations of “naturally occurring retirement communities (NORCs)” and to determine whether there were spatiotemporal patterns of NORCs in Nebraska from 2000 to 2010 and to 2015. As the American population continues to age, older people generally prefer to live in their own homes in their later years instead of moving into assisted living. This preference has increased the elderly population that is “aging in place”. Nevertheless, there have been few spatiotemporal analyses of the distribution patterns of elderly households in terms of NORCs for the state of Nebraska. In this study, the entire area within the state’s boundaries was subdivided into block groups and the spatial statistics of demographic patterns were analyzed over time. For this study, U.S. Census data from 2000, 2010, and 2015 were aggregated by block groups, which include the total number of households and the proportion of households (owners/renters) in Nebraska. Three analyses were conducted on the data. First, the geovisualization method with ArcGIS 10.4 was used to visually investigate the distribution and changes of NORCs from 2000 to 2010 and to 2015. Second, Global Moran’s I was used to quantify the spatial relationship of NORCs in Nebraska. Third, various methods of spatial statistics were used to identify clusters between NORCs and other block groups: Local Moran’s I and G-statistics. Over the past 15 years, the proportion of elderly households in Nebraska has steadily increased, and the rate of increase has risen sharply over the most recent five years, as of 2015. As a result, the number of NORCs has also increased, and 47 of the total NORCs (57.3%) were classified as the aging-in-place type of NORC. In addition, block groups with similar proportions of households have clustered spatially together or formed hot spots.
This study contributes to understanding the concept of NORCs relative to the residents “aging in place” and policy makers. Local government should take appropriate steps to prepare for the super-aging society by rearranging and integrating existing resources as much as possible. By taking full advantage of the results of this study, the government should develop community-based policies to support the older residents aging in place. Because of the population density and proximity of older residents in NORCs, economies of scale make it possible to rethink how to organize and deliver services, giving the opportunity to make our communities better for those retired seniors. Advisor: Yunwoo Na
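
For readers unfamiliar with the second analysis, Global Moran's I is the standard statistic I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where w_ij are spatial weights and W is their sum. The sketch below computes it in plain Python. The binary neighbor weights and the toy chain of four areas are assumptions for illustration; the study itself used ArcGIS on census block groups, and its weighting scheme is not stated in the abstract.

```python
def global_morans_i(values, weights):
    """Global Moran's I for values x_i and a square spatial weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    total_weight = sum(sum(row) for row in weights)
    # Cross-products of deviations for every weighted pair of areas.
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


def chain_weights(n):
    """Binary weights for n areas in a row, where each area neighbors the next one."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1.0
        w[i + 1][i] = 1.0
    return w


if __name__ == "__main__":
    w = chain_weights(4)
    print(global_morans_i([1, 2, 3, 4], w))  # similar values adjacent: positive I
    print(global_morans_i([1, 0, 1, 0], w))  # alternating values: negative I
```

Positive values indicate that similar values sit next to each other, which is the pattern the abstract describes as spatial clustering; values near zero indicate no spatial structure.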

    Reflecting Human Knowledge of Place and Route-Choice Behavior Using Big Data

    Exploring human knowledge of geographical space and related behavior not only helps in understanding human-environment interactions and dynamic geographic processes, but also advances Geographic Information Systems (GIS) toward a human-centric paradigm to make daily life more efficient. Today’s relatively easy acquisition of various big data provides an unprecedented opportunity for geographers to answer research questions that previously could not be adequately addressed. However, new challenges also arise regarding data quality and bias, as well as changes in methodology for dealing with big data, which differ from traditional data types. Representing people’s perception of place and studying drivers’ route-choice behavior are two of the many applications of big data in answering research questions about human knowledge and behavior in the fields of GIS and transportation. Incorporating three papers, this dissertation focuses on these two different applications to achieve the following objectives: 1) examine the degree to which a geographic place’s spatial extent can be estimated from human-generated geotagged photos; 2) address the challenge of geotagged photos’ uneven spatial distribution in place estimation and explore an approach that can better derive a place’s spatial extent; 3) develop a method that can properly estimate the spatial extent of a place that has multiple disjoint regions while considering geotagged photos’ uneven distribution; 4) explore useful spatiotemporal patterns of taxi drivers’ route-choice behavior in a dynamic urban environment.
This dissertation makes three major contributions to the systematic theory of big data applications: 1) proposes an effective approach to handling the uneven spatial distribution problem of geotagged photos as a type of volunteered geographic data by modeling their representativeness; 2) develops methods that can properly derive the vague spatial extent of a place with or without disjoint regions; and 3) explores taxi drivers’ route-choice patterns in different situations that can inform future transportation decisions and policy-making processes.