A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
Graph Sample and Hold: A Framework for Big-Graph Analytics
Sampling is a standard approach in big-graph analytics; the goal is to
efficiently estimate the graph properties by consulting a sample of the whole
population. A perfect sample is assumed to mirror every property of the whole
population. Unfortunately, such a perfect sample is hard to collect in complex
populations such as graphs (e.g. web graphs, social networks, etc.), where an
underlying network connects the units of the population. Therefore, a good
sample will be representative in the sense that graph properties of interest
can be estimated with a known degree of accuracy. While previous work focused
particularly on sampling schemes used to estimate certain graph properties
(e.g. triangle count), much less is known for the case when we need to estimate
various graph properties with the same sampling scheme. In this paper, we
propose a generic stream sampling framework for big-graph analytics, called
Graph Sample and Hold (gSH). To begin, the proposed framework samples from
massive graphs sequentially in a single pass, one edge at a time, while
maintaining a small state. We then show how to produce unbiased estimators for
various graph properties from the sample. Given that the graph analysis
algorithms will run on a sample instead of the whole population, the runtime
complexity of these algorithms is kept under control. Moreover, given that the
estimators of graph properties are unbiased, the approximation error is kept
under control. Finally, we show the performance of the proposed framework (gSH)
on various types of graphs, such as social graphs, among others.
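The idea of estimating graph properties from a single-pass edge sample can be illustrated with a much simpler scheme than gSH. The sketch below is not the gSH framework itself: it assumes plain uniform sampling, where each edge is kept independently with a fixed probability `p`, and it rescales the sample counts by the inclusion probability to get unbiased estimates. The function names are hypothetical, and the stream is assumed to contain each undirected edge once, with no self-loops.

```python
import random


def sample_edges(edge_stream, p, seed=0):
    """Single pass over the stream; keep each edge independently with probability p."""
    rng = random.Random(seed)
    return [edge for edge in edge_stream if rng.random() < p]


def count_triangles(edges):
    """Exact triangle count of an undirected simple graph given as an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # Each triangle is seen once per edge it contains, hence the division by 3.
    total = sum(len(adjacency[u] & adjacency[v]) for u, v in edges)
    return total // 3


def estimate(edge_stream, p, seed=0):
    """Unbiased estimates under independent edge sampling (an assumption).

    An edge survives with probability p and a triangle survives with
    probability p**3, so dividing by those probabilities removes the bias.
    """
    sample = sample_edges(edge_stream, p, seed)
    return {
        "edges": len(sample) / p,
        "triangles": count_triangles(sample) / p ** 3,
    }
```

With `p = 1.0` every edge is kept and the estimates equal the exact counts; smaller values of `p` trade accuracy for a smaller state.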
Hybrid group anomaly detection for sequence data: application to trajectory data analytics
Many research areas depend on group anomaly detection, which can help maintain the security and privacy of the data involved. This research addresses a gap in the existing outlier detection literature by proposing a novel hybrid framework for detecting group anomalies in sequence data. Two approaches are proposed to solve this problem efficiently: i) a hybrid data mining-based algorithm consisting of three main phases: first, a clustering algorithm is applied to derive the micro-clusters; second, the kNN algorithm is applied to each micro-cluster to compute the candidate group outliers; third, a pattern mining framework is applied to these candidates as a pruning strategy to generate the groups of outliers; and ii) a GPU-based approach, which exploits massively parallel GPU computing to reduce the runtime of the hybrid data mining-based algorithm. Extensive experiments on different sequence databases were conducted to show the advantages of the proposed model. The results clearly show the efficiency of the GPU approach compared directly with the sequential approach, reaching a speedup of 451. In addition, both approaches outperform the baseline methods for group detection.
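The second phase described above (kNN applied inside each micro-cluster) can be sketched as follows. This is an illustrative simplification, not the authors' algorithm: it assumes the micro-clusters are already given, treats items as numeric points rather than sequences, scores each point by its mean Euclidean distance to its `k` nearest neighbours in the same micro-cluster, and uses a hypothetical fixed `threshold` to pick candidates. The pattern mining pruning phase is not shown.

```python
import math


def knn_scores(points, k):
    """Mean distance from each point to its k nearest neighbours in the same cluster."""
    scores = []
    for i, point in enumerate(points):
        distances = sorted(
            math.dist(point, other) for j, other in enumerate(points) if j != i
        )
        nearest = distances[:k]
        # A singleton cluster has no neighbours; give it a neutral score of 0.
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return scores


def candidate_outliers(micro_clusters, k, threshold):
    """Collect points whose kNN score exceeds the (assumed) threshold."""
    candidates = []
    for cluster in micro_clusters:
        for point, score in zip(cluster, knn_scores(cluster, k)):
            if score > threshold:
                candidates.append(point)
    return candidates
```

Scoring each micro-cluster separately keeps the neighbour search local, which is also what makes the phase easy to parallelise across clusters.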
A Spatiotemporal analysis to identify Naturally Occurring Retirement Communities in Nebraska
This study aims to identify the geographic locations of "naturally occurring retirement communities (NORCs)" and whether there were spatiotemporal patterns of naturally occurring retirement communities in Nebraska for the time periods of 2000 to 2010, and to 2015. As the American population continues to age, older people generally prefer to live in their own homes for later years of life, instead of moving into assisted living. These demands have resulted in the increase of elderly populations who are "aging in place". Nevertheless, there have been few spatiotemporal analyses about the distribution patterns of elderly households in terms of NORCs for the state of Nebraska. In this study, the entire area within the state's boundaries was subdivided into block groups and the spatial statistics of demographic patterns were analyzed over time.
For this study, U.S. Census data from 2000, 2010, and 2015 were aggregated by block groups which include the total number of households and proportion of households (owners/renters) in Nebraska. Three analyses were conducted on the data. First, the geovisualization method with ArcGIS 10.4 was used to visually investigate the distribution and changes of NORCs from 2000 to 2010, and to 2015. Second, Global Moran's I was used to quantify the spatial relationship of NORCs in Nebraska. Third, various methods of spatial statistics were used to identify clusters between NORCs and other block groups: Local Moran's and G-statistics. Over the past 15 years, the proportion of elderly households in Nebraska has steadily increased, and the rate of increase has risen sharply over the recent five years, as of 2015. As a result, the number of NORCs has also increased, and 47 of the total NORCs (57.3%) were classified as the aging in place type of NORCs. In addition, block groups with similar proportion of households have clustered spatially together or formed hot-spots.
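For readers unfamiliar with Global Moran's I, the statistic can be computed directly from its standard definition, I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where W is the sum of all spatial weights. The sketch below is a minimal illustration and not the study's ArcGIS workflow: the function name is hypothetical, the weights are assumed to be supplied as a full n-by-n matrix, and no significance test is included.

```python
def morans_i(values, weights):
    """Global Moran's I for one variable and an n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    deviations = [value - mean for value in values]
    weight_total = sum(sum(row) for row in weights)
    # Cross-products of deviations for every pair of units, weighted by proximity.
    numerator = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    denominator = sum(d * d for d in deviations)
    return (n / weight_total) * (numerator / denominator)


# Four units on a line, each adjacent only to its immediate neighbours.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 2, 3, 4], chain_weights)
alternating = morans_i([1, 0, 1, 0], chain_weights)
```

Values near +1 indicate that similar values sit next to each other (as in `clustered`, which is positive), while negative values indicate a checkerboard-like pattern (as in `alternating`).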
This study contributes to an understanding of NORCs that is relevant both to residents "aging in place" and to policy makers. Local government should take appropriate steps to prepare for the super-aging society by rearranging and integrating existing resources as much as possible. By taking full advantage of the results of this study, the government should develop community-based policies to support older residents aging in place. Because of the population density and proximity of older residents in NORCs, economies of scale make it possible to rethink how services are organized and delivered, offering an opportunity to make communities better for retired seniors.
Advisor: Yunwoo Na
Reflecting Human Knowledge of Place and Route-Choice Behavior Using Big Data
Exploring human knowledge of geographical space and related behavior not only helps in understanding human-environment interactions and dynamic geographic processes, but also advances Geographic Information Systems (GIS) toward a human-centric paradigm to make daily life more efficient. Today's relatively easy acquisition of various big data provides an unprecedented opportunity for geographers to answer research questions that previously could not be adequately addressed. However, new challenges also arise regarding data quality and bias as well as change in methodology for dealing with big data that are different from traditional data types.
Representing people's perception of place and studying driver's route-choice behavior are two of the many applications of big data in answering research questions about human knowledge and behavior in the fields of GIS and transportation. Incorporating three papers, this dissertation focuses on these two different applications to achieve the following objectives: 1) examine the degree to which a geographic place's spatial extent can be estimated from human-generated geotagged photos; 2) address the challenge of geotagged photos' uneven spatial distribution in place estimation and explore an approach that can better derive a place's spatial extent; 3) develop a method that can properly estimate the spatial extent of a place that has multiple disjoint regions while considering geotagged photos' uneven distribution; 4) explore useful spatiotemporal patterns of taxi drivers' route-choice behavior in a dynamic urban environment.
This dissertation makes three major contributions to big data applications' systematic theory: 1) proposes an effective approach to handling the uneven spatial distribution problem of geotagged photos as a type of volunteered geographic data by modeling their representativeness; 2) develops methods that can properly derive the vague spatial extent of a place with or without disjoint regions; and 3) explores taxi drivers' route-choice patterns in different situations that can inform future transportation decisions and policy-making processes.