194,697 research outputs found
Feature Selection for Big Visual Data: Overview and Challenges
International Conference Image Analysis and Recognition (ICIAR 2018, Póvoa de Varzim, Portugal
Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large scale inference. In
large scale data applications like genomics, connectomics, and eco-informatics
the dataset is often variable-rich but sample-starved: a regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much of recent work has focused on understanding the
computational complexity of proposed methods for "Big Data." Sample complexity
however has received relatively less attention, especially in the setting when
the sample size is fixed, and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into several categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche but only the latter regime applies to exa-scale data
dimension. We illustrate this high dimensional framework for the problem of
correlation mining, where it is the matrix of pairwise and partial correlations
among the variables that are of interest. We demonstrate various regimes of
correlation mining based on the unifying perspective of high dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks
Adaptation and learning over networks for nonlinear system modeling
In this chapter, we analyze nonlinear filtering problems in distributed
environments, e.g., sensor networks or peer-to-peer protocols. In these
scenarios, the agents in the environment receive measurements in a streaming
fashion, and they are required to estimate a common (nonlinear) model by
alternating local computations and communications with their neighbors. We
focus on the important distinction between single-task problems, where the
underlying model is common to all agents, and multitask problems, where each
agent might converge to a different model due to, e.g., spatial dependencies or
other factors. Currently, most of the literature on distributed learning in the
nonlinear case has focused on the single-task case, which may be a strong
limitation in real-world scenarios. After introducing the problem and reviewing
the existing approaches, we describe a simple kernel-based algorithm tailored
for the multitask case. We evaluate the proposal on a simulated benchmark task,
and we conclude by detailing currently open problems and lines of research.Comment: To be published as a chapter in `Adaptive Learning Methods for
Nonlinear System Modeling', Elsevier Publishing, Eds. D. Comminiello and J.C.
Principe (2018
Negatively Correlated Search
Evolutionary Algorithms (EAs) have been shown to be powerful tools for
complex optimization problems, which are ubiquitous in both communication and
big data analytics. This paper presents a new EA, namely Negatively Correlated
Search (NCS), which maintains multiple individual search processes in parallel
and models the search behaviors of individual search processes as probability
distributions. NCS explicitly promotes negatively correlated search behaviors
by encouraging differences among the probability distributions (search
behaviors). By this means, individual search processes share information and
cooperate with each other to search diverse regions of a search space, which
makes NCS a promising method for non-convex optimization. The cooperation
scheme of NCS could also be regarded as a novel diversity preservation scheme
that, different from other existing schemes, directly promotes diversity at the
level of search behaviors rather than merely trying to maintain diversity among
candidate solutions. Empirical studies showed that NCS is competitive to
well-established search methods in the sense that NCS achieved the best overall
performance on 20 multimodal (non-convex) continuous optimization problems. The
advantages of NCS over state-of-the-art approaches are also demonstrated with a
case study on the synthesis of unequally spaced linear antenna arrays
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies for optimized
solution to a specific real world problem, big data system are not an exception
to any such rule. As far as the storage aspect of any big data system is
concerned, the primary facet in this regard is a storage infrastructure and
NoSQL seems to be the right technology that fulfills its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents
feature and use case analysis and comparison of the four main data models
namely document oriented, key value, graph and wide column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings forth second facet of big data storage, big data file
formats, into picture. The second half of the research paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage and its challenges and future prospects have
also been discussed
- …