12,079 research outputs found
DPCA: Dimensionality Reduction for Discriminative Analytics of Multiple Large-Scale Datasets
Principal component analysis (PCA) has well-documented merits for data
extraction and dimensionality reduction. PCA deals with a single dataset at a
time, and it is challenged when it comes to analyzing multiple datasets. Yet in
certain setups, one wishes to extract the most significant information of one
dataset relative to other datasets. Specifically, the interest may be in
identifying, that is, extracting, features that are specific to a single target
dataset but not to the others. This paper develops a novel approach for such
so-termed discriminative data analysis, and establishes its optimality in the
least-squares (LS) sense under suitable data modeling assumptions. The
criterion reveals linear combinations of variables by maximizing the ratio of
the variance of the target data to that of the remaining datasets. The novel approach
solves a generalized eigenvalue problem by performing SVD just once. Numerical
tests using synthetic and real datasets showcase the merits of the proposed
approach relative to its competing alternatives.
Comment: 5 pages, 2 figures
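The ratio-of-variances criterion described above amounts to a generalized eigenvalue problem, which can be sketched directly from sample covariances. This is a minimal illustration under our own modeling choices; the paper's single-SVD solver is not reproduced, and the function name and ridge term are ours.

```python
import numpy as np
from scipy.linalg import eigh

def discriminative_directions(X_target, X_background, k=1):
    """Directions maximizing target variance relative to background variance.

    Solves the generalized eigenvalue problem C_t v = lambda * C_b v,
    a sketch of the ratio-of-variances criterion in the abstract.
    """
    Ct = np.cov(X_target, rowvar=False)
    Cb = np.cov(X_background, rowvar=False)
    # Small ridge keeps the background covariance positive definite.
    Cb += 1e-6 * np.eye(Cb.shape[0])
    vals, vecs = eigh(Ct, Cb)            # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]          # top-k generalized eigenvectors

# Toy example: the target data carries extra variance along the first axis.
rng = np.random.default_rng(0)
bg = rng.normal(size=(500, 3))
tg = rng.normal(size=(500, 3))
tg[:, 0] *= 5.0
v = discriminative_directions(tg, bg, k=1)[:, 0]
print(np.argmax(np.abs(v)))  # dominant coordinate of the top direction
```

Because the background covariance is close to the identity here, the top generalized eigenvector aligns with the axis along which only the target data varies strongly.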
Application Oriented Analysis of Large Scale Datasets
Diverse application areas, such as social networks, epidemiology, and software engineering, consist of systems of objects and their relationships. Such systems are generally modeled as graphs. Graphs consist of vertices that represent the objects, and edges that represent the relationships between them. These systems are data intensive, and it is important to analyze the data correctly to obtain meaningful information. Combinatorial metrics can provide useful insights for analyzing these systems. In this thesis, we use graph-based metrics, such as betweenness centrality, clustering coefficient, and articulation points, for analyzing instances of large change in evolving networks (Software Engineering) and identifying points of similarity (Gene Expression Data). Computations of combinatorial properties are expensive, and most real-world networks are not static. As the network evolves, these properties have to be recomputed. In the last part of the thesis, we develop a fast algorithm that avoids redundant recomputation of communities in dynamic networks.
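The graph metrics the thesis relies on are standard and available off the shelf; a small illustration using NetworkX (the library choice and the toy graph are ours, not the thesis's):

```python
import networkx as nx

# A small graph with an obvious cut vertex: two triangles joined at node 2.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 2)])
bc = nx.betweenness_centrality(G)        # shortest-path betweenness
cc = nx.clustering(G)                    # local clustering coefficient per node
aps = list(nx.articulation_points(G))    # vertices whose removal disconnects G
print(max(bc, key=bc.get), aps)
```

Node 2 bridges the two triangles, so it has both the highest betweenness centrality and is the graph's only articulation point, which is exactly the kind of structural signal the thesis exploits.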
R*-Grove: Balanced Spatial Partitioning for Large-scale Datasets
The rapid growth of big spatial data has urged the research community to develop
several big spatial data systems. Regardless of their architecture, one of the
fundamental requirements of all these systems is to spatially partition the
data efficiently across machines. The core challenges of big spatial
partitioning are building partitions of high spatial quality while simultaneously
taking advantage of distributed processing models by providing load-balanced
partitions. Previous work on big spatial partitioning reuses existing
index search trees as-is, e.g., the R-tree family, STR, Kd-tree, and Quad-tree,
by building a temporary tree for a sample of the input and using its leaf nodes
as partition boundaries. However, we show in this paper that none of those
techniques has addressed the mentioned challenges completely. This paper
proposes a novel partitioning method, termed R*-Grove, which can partition very
large spatial datasets into high quality partitions with excellent load balance
and block utilization. This appealing property allows R*-Grove to outperform
existing techniques in spatial query processing. R*-Grove can be easily
integrated into any big data platform, such as Apache Spark or Apache Hadoop.
Our experiments show that R*-Grove outperforms the existing partitioning
techniques for big spatial data systems. With all the proposed work publicly
available as open source, we envision that R*-Grove will be adopted by the
community to better serve big spatial data research.
Comment: 29 pages, to be published in Frontiers in Big Data
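For context, the sample-based baseline the abstract critiques can be sketched in a few lines: build STR-style leaf cells over a sample and use their bounding boxes as partition boundaries. This is a simplified illustration of that baseline, not of R*-Grove itself, and all names are ours.

```python
import math
import numpy as np

def str_partitions(points, capacity):
    """Sort-Tile-Recursive style partitioning of 2-D sample points.

    Sort by x, slice into vertical strips, then sort each strip by y and
    cut it into leaf-sized cells whose bounding boxes become partitions.
    """
    n = len(points)
    leaves = math.ceil(n / capacity)
    strips = math.ceil(math.sqrt(leaves))
    per_strip = strips * capacity
    pts = points[np.argsort(points[:, 0])]
    boxes = []
    for i in range(0, n, per_strip):
        strip = pts[i:i + per_strip]
        strip = strip[np.argsort(strip[:, 1])]
        for j in range(0, len(strip), capacity):
            cell = strip[j:j + capacity]
            boxes.append((cell.min(axis=0), cell.max(axis=0)))
    return boxes

rng = np.random.default_rng(1)
sample = rng.uniform(size=(1000, 2))
boxes = str_partitions(sample, capacity=100)
print(len(boxes))
```

Note the weakness the paper points out: the last strip and last cell of each strip can be underfull, so load balance and box quality degrade on skewed data, which is what R*-Grove is designed to avoid.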
Parallel Framework for Dimensionality Reduction of Large-Scale Datasets
Dimensionality reduction refers to a set of mathematical techniques used to reduce the complexity of the original high-dimensional data while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying spectral dimensionality reduction techniques and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate the applicability of our framework, we perform dimensionality reduction of 75,000 images representing morphology evolution during the manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.
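The shared components the abstract alludes to (affinity matrix, graph Laplacian, eigendecomposition) can be shown in a serial Laplacian-eigenmaps sketch. This is illustrative only; the paper's contribution is parallelizing exactly these steps, and the function below is our own.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_eigenmaps(X, n_components=2, sigma=1.0):
    """Minimal spectral dimensionality reduction (Laplacian eigenmaps).

    Pipeline: pairwise affinities -> graph Laplacian -> bottom
    non-trivial eigenvectors as the low-dimensional embedding.
    """
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))  # affinities
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W                            # graph Laplacian
    vals, vecs = eigh(L)                                      # ascending order
    return vecs[:, 1:1 + n_components]   # skip the constant eigenvector

X = np.random.default_rng(2).normal(size=(50, 10))
Y = laplacian_eigenmaps(X, n_components=2)
print(Y.shape)
```

The dense N-by-N affinity matrix and full eigendecomposition here are precisely the pieces that stop scaling at the sizes the paper targets, which motivates the parallel framework.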
Towards matching user mobility traces in large-scale datasets
The problem of unicity and reidentifiability of records in large-scale databases has been studied in different contexts and approaches, with a focus on preserving privacy or matching records from different data sources. With an increasing number of service providers nowadays routinely collecting location traces of their users on unprecedented scales, there is a pronounced interest in the possibility of matching records and datasets based on spatial trajectories. Extending previous work on reidentifiability of spatial data and trajectory matching, we present the first large-scale analysis of user matchability in real mobility datasets on realistic scales, i.e., between two datasets that consist of several million people's mobility traces, coming from a mobile network operator and transportation smart card usage. We extract the relevant statistical properties which influence the matching process and analyze their impact on the matchability of users. We show that for individuals with typical activity in the transportation system (those making 3-4 trips per day on average), a matching algorithm based on the co-occurrence of their activities is expected to achieve a 16.8% success rate after only a one-week-long observation of their mobility traces, and over 55% after four weeks. We show that the main determinant of matchability is the expected number of co-occurring records in the two datasets. Finally, we discuss different scenarios in terms of data collection frequency and give estimates of matchability over time. We show that, with higher-frequency data collection becoming more common, we can expect much higher success rates in even shorter intervals.
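A co-occurrence based matcher of the kind the abstract evaluates can be sketched as follows. This is a toy illustration with a hypothetical record format (user, time_bin, cell_id); the paper's actual algorithm and scoring are not reproduced.

```python
from collections import Counter

def match_users(dataset_a, dataset_b):
    """Match users across two datasets by co-occurring (time, cell) records.

    Each record is (user_id, time_bin, cell_id); a user in A is matched to
    the user in B who shares the most spatio-temporal records with them.
    """
    by_key = {}
    for user, t, cell in dataset_b:
        by_key.setdefault((t, cell), []).append(user)
    matches = {}
    for user, t, cell in dataset_a:
        for cand in by_key.get((t, cell), []):
            matches.setdefault(user, Counter())[cand] += 1
    return {u: c.most_common(1)[0][0] for u, c in matches.items()}

# Toy records: user "a1" co-occurs twice with "b1", once with "b2".
A = [("a1", 0, "x"), ("a1", 1, "y"), ("a1", 2, "z")]
B = [("b1", 0, "x"), ("b1", 1, "y"), ("b2", 2, "z")]
print(match_users(A, B))
```

The sketch also makes the abstract's main finding intuitive: the more co-occurring records two datasets share per user, the larger the count gap between the true counterpart and spurious candidates, and hence the higher the matching success rate.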