Big Data Dimensional Analysis
The ability to collect and analyze large amounts of data is a growing challenge
within the scientific community. The widening gap between data and users calls
for innovative tools that address the challenges posed by big data volume,
velocity, and variety. One of the main challenges associated with big data
variety is automatically understanding the underlying structures and patterns
of the data. Such an understanding is a prerequisite to applying advanced
analytics to the data. Further, big data sets often contain anomalies and
errors that are difficult to know a priori. Current approaches to understanding
data structure are drawn from traditional database ontology design. These
approaches are effective, but often require too much human involvement to cope
with the volume, velocity, and variety of data encountered by big data systems.
Dimensional Data Analysis (DDA) is a proposed technique that allows big data
analysts to quickly understand the overall structure of a big dataset and
identify anomalies. DDA exploits structures that exist in a wide class of data
to quickly determine the nature of the data and its statistical anomalies, and
it leverages existing schemas that are employed in big data databases today.
This paper presents DDA, applies it to a number of data sets, and measures its
performance. The overhead of DDA is low, and it can be applied to existing big
data systems without greatly impacting their computing requirements.
Mapping Big Data into Knowledge Space with Cognitive Cyber-Infrastructure
Big data research has attracted great attention in science, technology,
industry and society. It is developing alongside the evolving scientific
paradigm, the fourth industrial revolution, and transformational innovation in
technologies. However, its nature and fundamental challenges have not yet been
clearly identified, and a methodology of its own has not yet been established. This paper explores
and answers the following questions: What is big data? What are the basic
methods for representing, managing and analyzing big data? What is the
relationship between big data and knowledge? Can we find a mapping from big
data into knowledge space? What kind of infrastructure is required to support
not only big data management and analysis but also knowledge discovery, sharing
and management? What is the relationship between big data and science paradigm?
What is the nature and fundamental challenge of big data computing? A
multi-dimensional perspective is presented toward a methodology of big data
computing.
Small sample sizes: A big data problem in high-dimensional data analysis
Acknowledgements: The authors are grateful to the Editor, Associate Editor and three anonymous referees for their helpful suggestions, which greatly improved the manuscript. Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: the research is supported by the German Science Foundation award numbers DFG KO 4680/3-2 and PA 2409/3-2.
Analyzing big time series data in solar engineering using features and PCA
In solar engineering, we encounter big time series data such as satellite-derived irradiance data and string-level measurements from a utility-scale photovoltaic (PV) system. While storing and hosting big data are certainly possible with today’s data storage technology, it is challenging to visualize and analyze the data effectively and efficiently. In this work, we consider a data analytics algorithm to mitigate some of these challenges. The algorithm computes a set of generic and/or application-specific features to characterize each time series, and subsequently uses principal component analysis (PCA) to project these features onto a two-dimensional space. Because each time series is represented by its features, it can be treated as a single data point in the feature space, which makes many operations more tractable. Three applications are discussed within the overall framework, namely (1) PV system type identification, (2) monitoring network design, and (3) anomalous string detection. The proposed framework can be easily translated to many other solar engineering applications.
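As a rough sketch of the feature-then-PCA pipeline just described (the particular features, the synthetic data, and the use of scikit-learn are assumptions for illustration, not the paper's implementation):

```python
# Illustrative sketch: characterize each time series by a few generic features,
# then project the feature vectors onto 2D with PCA. The features and toy data
# below are assumptions for this example, not the paper's feature set.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def generic_features(series: np.ndarray) -> np.ndarray:
    """A few simple descriptors of one time series."""
    lag1 = np.corrcoef(series[:-1], series[1:])[0, 1]  # lag-1 autocorrelation
    return np.array([series.mean(), series.std(), series.max() - series.min(), lag1])

# Toy data: 50 synthetic "irradiance-like" series of length 288 (5-min samples).
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 288)
X = np.stack([np.clip(np.sin(np.pi * t) + 0.1 * rng.standard_normal(288), 0, None)
              for _ in range(50)])

features = np.stack([generic_features(s) for s in X])
features = StandardScaler().fit_transform(features)   # put features on a common scale
coords = PCA(n_components=2).fit_transform(features)  # each series becomes one 2D point

print(coords.shape)  # (50, 2): ready for scatter-plot visualization or clustering
```

Once each series is reduced to a 2D point, tasks such as spotting anomalous PV strings reduce to looking for outlying points in the projected space.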
Modeling and replicating statistical topology, and evidence for CMB non-homogeneity
Under the banner of `Big Data', the detection and classification of structure
in extremely large, high-dimensional data sets is one of the central
statistical challenges of our time. Among the most intriguing approaches to
this challenge is `TDA', or `Topological Data Analysis', one of whose primary
aims is to provide non-metric, but topologically informative, pre-analyses of
data sets that make later, more quantitative analyses feasible. While TDA rests
on strong mathematical foundations from Topology, in applications it has faced
challenges due to an inability to handle issues of statistical reliability and
robustness and, most importantly, an inability to make scientific claims with
verifiable levels of statistical confidence. We propose a methodology for the
parametric representation, estimation, and replication of persistence diagrams,
the main diagnostic tool of TDA. The power of the methodology lies in the fact
that even when only one persistence diagram is available for analysis -- the
typical case for big data applications -- replications can be generated to
allow for conventional statistical hypothesis testing. The methodology is
conceptually simple and computationally practical, and provides a broadly
effective statistical procedure for TDA analyses based on persistence diagrams.
We demonstrate the basic ideas on a toy example, and the power of the approach
in a novel and revealing analysis of CMB non-homogeneity.
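For context, the persistence diagram that the methodology represents and replicates can be computed with standard TDA tooling. The sketch below (using the ripser package and a toy point cloud, both assumptions of this illustration) shows only that diagnostic step, not the parametric replication procedure proposed in the paper.

```python
# Illustrative sketch: compute a persistence diagram for a toy point cloud.
# Library choice (ripser) and the data are assumptions for the example; the
# parametric representation/replication step of the paper is not shown here.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(1)

# Toy data: noisy samples from a circle, whose H1 diagram should contain one
# long-lived loop.
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.05 * rng.standard_normal((200, 2))

dgms = ripser(X, maxdim=1)["dgms"]   # dgms[0] = H0 (components), dgms[1] = H1 (loops)
births, deaths = dgms[1][:, 0], dgms[1][:, 1]
lifetimes = deaths - births
print("most persistent H1 feature lifetime:", lifetimes.max())
```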