Large-scale online feature selection for ultra-high dimensional sparse data
Funded by the National Research Foundation (NRF) Singapore under the International Research Centre @ Singapore Funding Initiative.
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees.
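The bucketed indexing just described can be sketched compactly. The following is a minimal illustration, not the FLASH implementation: `minhash` is a stand-in for the densified one-pass minwise hashing the paper relies on, and all sizes (number of tables, reservoir size) are placeholder parameters.

```python
import random
from collections import defaultdict

class LSHReservoirIndex:
    """Sketch of an LSH index whose buckets are fixed-size reservoirs.

    Simplified: `minhash` stands in for densified one-pass minwise
    hashing; FLASH's actual hashing and parallel layout differ.
    """

    def __init__(self, num_tables=32, reservoir_size=64, seed=0):
        self.num_tables = num_tables
        self.reservoir_size = reservoir_size
        self.rng = random.Random(seed)
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.counts = [defaultdict(int) for _ in range(num_tables)]

    def minhash(self, sparse_ids, table):
        # Placeholder: min over salted hashes of the nonzero feature ids
        # (consistent within one process run).
        return min(hash((table, f)) for f in sparse_ids)

    def insert(self, item_id, sparse_ids):
        for t in range(self.num_tables):
            key = self.minhash(sparse_ids, t)
            self.counts[t][key] += 1
            bucket = self.tables[t][key]
            n = self.counts[t][key]
            if len(bucket) < self.reservoir_size:
                bucket.append(item_id)
            else:
                # Reservoir sampling: each of the n items hashed to this
                # bucket survives with equal probability reservoir_size / n.
                j = self.rng.randrange(n)
                if j < self.reservoir_size:
                    bucket[j] = item_id

    def query(self, sparse_ids, k=10):
        # Count-based estimation: rank candidates by how many tables
        # hash them into the same bucket as the query.
        votes = defaultdict(int)
        for t in range(self.num_tables):
            for item in self.tables[t].get(self.minhash(sparse_ids, t), ()):
                votes[item] += 1
        return sorted(votes, key=votes.get, reverse=True)[:k]
```

Querying returns candidates ranked purely by bucket-collision counts, which is the sense in which no explicit similarity computations are needed.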
We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URLs, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail at the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute-force computation would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
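A back-of-the-envelope check of the brute-force figure, assuming the standard webspam trigram release with roughly 350,000 instances (so the stated 1.3 billion nonzeros average out to about 3,700 per vector):

```python
# Rough flop count for a brute-force k-NN graph on webspam.
# Assumes ~350,000 instances sharing the stated 1.3 billion nonzeros.
n = 350_000
total_nnz = 1.3e9
avg_nnz = total_nnz / n               # ~3,714 nonzeros per vector
pairs = n * (n - 1) / 2               # ~6.1e10 similarity evaluations
flops_per_pair = 2 * avg_nnz          # multiply + add per coordinate, an upper bound
total_flops = pairs * flops_per_pair  # ~4.6e14 flops
print(f"{total_flops / 10 / 1e12:.0f} Tflop/s for a 10-second run")
# -> ~45 Tflop/s, comfortably above the 20-teraflop lower bound quoted above.
```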
Calibrating nonconvex penalized regression in ultra-high dimension
We investigate high-dimensional nonconvex penalized regression, where the
number of covariates may grow at an exponential rate. Although recent
asymptotic theory established that there exists a local minimum possessing the
oracle property under general conditions, it is still largely an open problem
how to identify the oracle estimator among potentially multiple local minima.
There are two main obstacles: (1) due to the presence of multiple minima, the
solution path is nonunique and is not guaranteed to contain the oracle
estimator; (2) even if a solution path is known to contain the oracle
estimator, the optimal tuning parameter depends on many unknown factors and is
hard to estimate. To address these two challenging issues, we first prove that
an easy-to-calculate calibrated CCCP algorithm produces a consistent solution
path which contains the oracle estimator with probability approaching one.
Furthermore, we propose a high-dimensional BIC criterion and show that it can
be applied to the solution path to select the optimal tuning parameter which
asymptotically identifies the oracle estimator. The theory for a general class
of nonconvex penalties in the ultra-high dimensional setup is established when
the random errors follow a sub-Gaussian distribution. Monte Carlo studies
confirm that the calibrated CCCP algorithm combined with the proposed
high-dimensional BIC has desirable performance in identifying the underlying
sparsity pattern for high-dimensional data analysis.
Comment: Published at http://dx.doi.org/10.1214/13-AOS1159 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
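To make the two ingredients concrete, here is a minimal sketch assuming SCAD as the nonconvex penalty. Each CCCP-style step majorizes the concave part of the penalty by its tangent, which reduces the subproblem to a weighted lasso; the BIC-type criterion then scores a fitted model by log(RSS/n) plus a dimension penalty. The calibrated initialization, the exact constant in the criterion, and the tuning-parameter path are specified in the paper; `c_n` below is left as a free parameter.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scad_deriv(beta, lam, a=3.7):
    """Derivative of the SCAD penalty (one standard nonconvex penalty)."""
    b = np.abs(beta)
    return np.where(b <= lam, lam, np.maximum(a * lam - b, 0.0) / (a - 1.0))

def cccp_scad(X, y, lam, n_iter=20):
    """CCCP-style sketch: majorize the concave part of SCAD by its tangent,
    so each step is a weighted lasso.  Illustrative only; the paper's
    calibrated initialization and exact updates differ."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        # Tangent weights at the current iterate; rescale columns so a
        # single-penalty lasso solver can be reused.
        w = np.maximum(scad_deriv(beta, lam) / lam, 1e-10)
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000).fit(X / w, y)
        beta = fit.coef_ / w
    return beta

def hbic(X, y, beta, c_n):
    """High-dimensional BIC of the flavor described above:
    log(RSS/n) + |support| * c_n * log(p) / n, with c_n diverging slowly."""
    n, p = X.shape
    rss = np.sum((y - X @ beta) ** 2)
    return np.log(rss / n) + np.count_nonzero(beta) * c_n * np.log(p) / n
```

In use, one would run `cccp_scad` over a grid of `lam` values and keep the fit minimizing `hbic`, mirroring the tuning-parameter selection described above.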
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On one hand, Big Data hold great promises for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottleneck, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and how these features impose paradigm changes on statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in the high-confidence set and point out that exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity, which can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
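Spurious correlation, one of the challenges listed above, is easy to reproduce: when the dimension far exceeds the sample size, some variable correlates strongly with the response by pure chance. A minimal simulation (the dimensions are arbitrary):

```python
import numpy as np

# Spurious correlation demo: y is independent of every column of X,
# yet the best-correlated column looks strongly "predictive".
rng = np.random.default_rng(0)
n, p = 60, 6_400                      # small sample, high dimension
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)            # independent of X by construction

# Correlation of y with each of the p independent columns.
Xc = (X - X.mean(0)) / X.std(0)
yc = (y - y.mean()) / y.std()
corrs = Xc.T @ yc / n
print(f"max |corr| over {p} independent variables: {np.abs(corrs).max():.2f}")
# Typically prints ~0.5: pure noise masquerading as signal.
```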
Bayesian Conditional Tensor Factorizations for High-Dimensional Classification
In many application areas, data are collected on a categorical response and
high-dimensional categorical predictors, with the goals being to build a
parsimonious model for classification while doing inferences on the important
predictors. In settings such as genomics, there can be complex interactions
among the predictors. By using a carefully-structured Tucker factorization, we
define a model that can characterize any conditional probability, while
facilitating variable selection and modeling of higher-order interactions.
Following a Bayesian approach, we propose a Markov chain Monte Carlo algorithm
for posterior computation accommodating uncertainty in the predictors to be
included. Under near sparsity assumptions, the posterior distribution for the
conditional probability is shown to achieve close to the parametric rate of
contraction even in ultra high-dimensional settings. The methods are
illustrated using simulation examples and biomedical applications.
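The Tucker-style factorization of a conditional probability can be made concrete in a few lines. In the sketch below (notation and sizes are illustrative, not the paper's), each categorical predictor is soft-clustered into a few latent classes and a core array supplies class probabilities given the latent classes; marginalizing over the latent classes yields a valid conditional probability for every predictor combination.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tucker-style conditional probability tensor, small illustrative sizes:
# two categorical predictors with d0, d1 levels, a response with C classes,
# and k0, k1 latent classes per predictor (all sizes chosen arbitrarily).
d, k, C = (4, 5), (2, 3), 3

# Soft-clustering maps: pi[j][x_j, h_j] = P(latent class h_j | x_j);
# each row is a probability vector.
pi = [rng.dirichlet(np.ones(k[j]), size=d[j]) for j in range(2)]

# Core array: lam[h0, h1, c] = P(y = c | latent classes h0, h1),
# a probability vector over c for every (h0, h1).
lam = rng.dirichlet(np.ones(C), size=k)      # shape (k0, k1, C)

# P(y = c | x0, x1) = sum_{h0,h1} lam[h0, h1, c] * pi0[x0, h0] * pi1[x1, h1]
P = np.einsum('ah,bg,hgc->abc', pi[0], pi[1], lam)

assert np.allclose(P.sum(axis=-1), 1.0)      # valid conditional probabilities
print(P[2, 4])                               # P(y | x0 = 2, x1 = 4)
```

Sparsity in this parameterization corresponds to most predictors having a single latent class (k_j = 1), in which case they drop out of the conditional probability, which is how the factorization supports variable selection.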