18,032 research outputs found
Information Splitting for Big Data Analytics
Many statistical models require estimation of unknown (co)variance
parameters. The estimates are usually obtained by maximizing a
log-likelihood that involves log-determinant terms. In principle, one requires
the \emph{observed information}---the negative Hessian matrix, that is, the
negative second derivative of the log-likelihood---to obtain an accurate
maximum likelihood estimator via the Newton method. When one uses the
\emph{Fisher information}, the expected value of the observed information, a
simpler algorithm than the Newton method is obtained: the Fisher scoring
algorithm. With the
advance in high-throughput technologies in the biological sciences,
recommendation systems and social networks, the sizes of data sets---and the
corresponding statistical models---have suddenly increased by several orders of
magnitude. Neither the observed information nor the Fisher information is easy
to obtain for these big data sets. This paper introduces an information
splitting technique to simplify the computation. By taking the mean of the
observed information and the Fisher information, a simpler approximate
Hessian matrix for the log-likelihood can be obtained. This approximate
Hessian matrix can significantly reduce computation and makes the linear
mixed model applicable to big data sets. Such splitting and the resulting
simpler formulas depend heavily on matrix algebra transformations, and are
applicable to large-scale breeding models and genome-wide association
analysis.
Comment: arXiv admin note: text overlap with arXiv:1605.0764
An efficient industrial big-data engine
Current trends in industrial systems opt for the use of different big-data engines as a means to process huge amounts of data that cannot be processed with an ordinary infrastructure. The number of issues an industrial infrastructure has to face is large and includes challenges such as the definition of efficient architecture setups for different applications, and the definition of specific models for industrial analytics. In this context, the article explores the development of a medium-size big-data engine (i.e. an implementation) able to improve performance in map-reduce computing by splitting the analytics into different segments that may be processed by the engine in parallel using a hierarchical model. This reduces the end-to-end computation time: all segments are processed in parallel, and each segment's results are then merged with information from the other segments. As empirical results reveal, this type of setup increases the performance of current clusters, improving I/O operations remarkably.
Work partially supported by “Distributed Java Infrastructure for Real-Time Big-data” (CAS14/00118), eMadrid (S2013/ICE-2715), HERMES-SMARTDRIVER (TIN2013-46801-C4-2-R), and AUDACity (TIN2016-77158-C4-1-R)
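A rough illustration of the segment-splitting idea (a toy stand-in, not the article's engine: threads stand in for cluster nodes, and a word count stands in for the analytic). Each segment is mapped independently, and the partial results are merged pairwise, level by level, in the spirit of the hierarchical model.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_segment(lines):
    # "map" phase: each segment is counted independently.
    c = Counter()
    for line in lines:
        c.update(line.split())
    return c

def hierarchical_merge(partials):
    # Merge partial results pairwise, level by level, rather than
    # in a single sequential reduce pass.
    while len(partials) > 1:
        pairs = [partials[i:i + 2] for i in range(0, len(partials), 2)]
        partials = [sum(p, Counter()) for p in pairs]
    return partials[0]

data = ["big data engines process big data"] * 8
segments = [data[i::4] for i in range(4)]  # split the input into 4 segments

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_segment, segments))

result = hierarchical_merge(partials)
```

The merged result is identical to counting the whole input sequentially; the gain on a real cluster comes from each node reading and processing only its own segment.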
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies for an
optimized solution to a specific real-world problem, big data systems are no
exception. As far as the storage aspect of any big data system is
concerned, the primary facet is the storage infrastructure, and
NoSQL seems to be the right technology to fulfill its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents
feature and use case analysis and comparison of the four main data models,
namely document-oriented, key-value, graph, and wide-column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings the second facet of big data storage, big data file
formats, into the picture. The second half of the paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage; their challenges and future prospects are
also discussed.
Social Fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling
Spambot detection in online social networks is a long-lasting challenge
involving the study and design of detection techniques capable of efficiently
identifying ever-evolving spammers. Recently, a new wave of social spambots has
emerged, with advanced human-like characteristics that allow them to go
undetected even by current state-of-the-art algorithms. In this paper, we show
that efficient spambot detection can be achieved via an in-depth analysis of
their collective behaviors, exploiting the digital DNA technique for modeling
the behaviors of social network users. Inspired by its biological counterpart,
in the digital DNA representation the behavioral lifetime of a digital account
is encoded in a sequence of characters. Then, we define a similarity measure
for such digital DNA sequences. We build upon digital DNA and the similarity
between groups of users to characterize both genuine accounts and spambots.
Leveraging such characterization, we design the Social Fingerprinting
technique, which is able to discriminate among spambots and genuine accounts in
both a supervised and an unsupervised fashion. We finally evaluate the
effectiveness of Social Fingerprinting and we compare it with three
state-of-the-art detection algorithms. Among the peculiarities of our approach
is the possibility to apply off-the-shelf DNA analysis techniques to study
online users' behaviors and to rely efficiently on a limited number of
lightweight account characteristics.
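A toy illustration of the digital DNA idea (the action-to-character alphabet and the normalization below are our assumptions, not the paper's exact definitions): encode each account's timeline as a string of action codes, then score pairs by the length of their longest common substring. Heavily automated accounts share long behavioral substrings; organic human activity does not.

```python
def longest_common_substring(a, b):
    # Classic DP: table[i][j] = length of the common substring
    # ending at a[i-1] and b[j-1].
    best = 0
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                best = max(best, table[i][j])
    return best

def similarity(a, b):
    # Normalized length of the longest shared behavioral substring.
    return longest_common_substring(a, b) / max(len(a), len(b))

# Hypothetical alphabet: A = tweet, C = reply, T = retweet.
bots = ["ATATATATAT", "TATATATATA"]  # near-identical automation
human = "ACTTACATCG"                 # irregular, organic activity
```

With these toy sequences, the two bot-like strings score 0.9 while the bot/human pair scores only 0.2, which is the separation the Social Fingerprinting technique exploits at group scale.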
An efficient time optimized scheme for progressive analytics in big data
Big data analytics is the key research subject for future data-driven decision-making applications. Due to the large amount of data, progressive analytics can provide an efficient way of querying big data clusters. Each cluster contains only a piece of the examined data. Continuous queries over these data sources require intelligent mechanisms to produce the final outcome (the query response) in the minimum time and with the maximum performance. A Query Controller (QC) is responsible for managing continuous/sequential queries and returning the final outcome to users or applications. In this paper, we propose a mechanism that can be adopted by the QC. The proposed mechanism is capable of managing partial results retrieved by a number of processors, each responsible for one cluster. Each processor executes a query over a specific cluster of data. Our mechanism adopts two sequential decision-making models for handling the incoming partial results: the first is a finite-horizon, time-optimized model and the second an infinite-horizon, optimally scheduled model. We provide mathematical formulations for solving the discussed problem and present simulation results. Through a large number of experiments, we reveal the advantages of the proposed models and give numerical results comparing them with a deterministic model. These results indicate that the proposed models can efficiently reduce the time required to return the final outcome to the user/application while keeping the quality of the aggregated result high.
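A toy sketch of the controller's stopping decision (our simplification: a plain running average with an early-stopping threshold stands in for the paper's finite-horizon and infinite-horizon models). The QC aggregates partial results as they arrive and answers early once further partials stop changing the estimate appreciably, or once the horizon is exhausted.

```python
import random

def query_controller(partial_results, horizon, epsilon=0.01):
    """Aggregate partial results sequentially; stop early when the
    running estimate changes by less than epsilon, or at the horizon.
    (Toy stand-in for the paper's sequential decision-making models.)"""
    total, count, previous = 0.0, 0, None
    for step, value in enumerate(partial_results):
        total += value
        count += 1
        estimate = total / count
        if step + 1 >= horizon:
            break  # finite horizon exhausted: must answer now
        if previous is not None and abs(estimate - previous) < epsilon:
            break  # marginal change too small: answer early
        previous = estimate
    return estimate, count

rng = random.Random(42)
# Hypothetical per-cluster partial results around a true value of 10.
partial_results = [10.0 + rng.gauss(0, 0.5) for _ in range(100)]
estimate, used = query_controller(partial_results, horizon=100)
```

The threshold epsilon trades response time against quality: a larger epsilon returns the outcome sooner from fewer clusters, at the cost of a noisier aggregate.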
Wiz: a web-based tool for interactive visualization of big data
In an age of information, visualizing and discerning meaning from data is as important as its collection. Interactive data visualization addresses both fronts by allowing researchers to explore data beyond what static images can offer. Here, we present Wiz, a web-based application for handling and visualizing large amounts of data. Wiz requires no programming or downloadable software, and it allows scientists and non-scientists alike to unravel the complexity of data by splitting its relationships through 5D visual analytics and by performing multivariate data analyses, such as principal component and linear discriminant analysis, all in vivid, publication-ready figures. With the explosion of high-throughput practices for materials discovery, information-streaming capabilities, and the emphasis on industrial digitalization and artificial intelligence, we expect Wiz to serve as an invaluable tool with a broad impact in our world of big data.