Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On the one hand, Big Data hold great promise for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottlenecks, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are
distinctive and require new computational and statistical paradigms. This
article gives an overview of the salient features of Big Data and how these
features drive paradigm changes in statistical and computational methods as
well as computing architectures. We also provide various new perspectives on
Big Data analysis and computation. In particular, we emphasize the viability
of the sparsest solution in a high-confidence set and point out that the
exogeneity assumptions made by most statistical methods for Big Data cannot be
validated due to incidental endogeneity; their violation can lead to wrong
statistical inferences and, consequently, wrong scientific conclusions.
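The spurious-correlation challenge named above can be made concrete with a short sketch (mine, not from the article): when the number of features p grows while the sample size n stays fixed, the strongest sample correlation found between a response and features that are pure noise grows as well, so purely coincidental "signals" start to look real.

```python
import numpy as np

# A response y and up to 100,000 features that are ALL independent of y.
rng = np.random.default_rng(0)
n = 50
y = rng.standard_normal(n)
X = rng.standard_normal((n, 100000))
yc = y - y.mean()

def max_spurious_corr(p):
    """Largest |sample correlation| between y and the first p noise features."""
    Xc = X[:, :p] - X[:, :p].mean(axis=0)
    corrs = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return float(np.abs(corrs).max())

# The maximum spurious correlation can only grow as more noise
# features are scanned, even though none of them relates to y.
for p in (10, 1000, 100000):
    print(p, round(max_spurious_corr(p), 2))
```

Because the p-feature scan is a prefix of the larger scans, the reported maximum is non-decreasing in p by construction, which makes the dimensionality effect easy to see on a single random draw.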
Semantic HMC for Big Data Analysis
Analyzing Big Data can help corporations to improve their efficiency. In
this work we present a new vision for deriving value from Big Data using a
Semantic Hierarchical Multi-label Classification, called Semantic HMC, based on
an unsupervised ontology-learning process. We also propose a Semantic HMC
process that uses scalable machine-learning techniques and rule-based reasoning.
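The defining constraint in hierarchical multi-label classification is that a predicted label implies all of its ancestors in the ontology. A minimal sketch of that rule-based consistency step (with a hypothetical label hierarchy; the paper's ontology and rules are not shown here):

```python
# Hypothetical label hierarchy: child -> parent, with "root" as the top.
PARENT = {
    "sports/football": "sports",
    "sports/tennis": "sports",
    "sports": "root",
    "finance/stocks": "finance",
    "finance": "root",
}

def ancestors(label):
    """All non-root ancestors of a label, walking up the hierarchy."""
    out = []
    while label in PARENT and PARENT[label] != "root":
        label = PARENT[label]
        out.append(label)
    return out

def make_consistent(predicted):
    """Close a set of predicted labels under the ancestor relation,
    so every prediction also asserts its parents (the HMC constraint)."""
    closed = set(predicted)
    for lab in predicted:
        closed.update(ancestors(lab))
    return closed

print(sorted(make_consistent({"sports/football", "finance/stocks"})))
```

In a full Semantic HMC pipeline this closure would run after the per-label classifiers, as the rule-based reasoning stage the abstract mentions.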
Big Data Analysis for PV Applications
With increasing photovoltaic (PV) installations, large amounts of time series data from utility-scale PV systems, such as meteorological data and string-level measurements, are collected [1, 2]. Due to fluctuations in irradiance and temperature, PV data are highly stochastic. The data also exhibit spatio-temporal differences with potential time-lagged correlations, because wind direction affects cloud movement [3]. Coupled with differences among PV system types in power output and wiring configuration, as well as localised PV effects like partial shading and module mismatch, lengthy time series from solar systems are highly multi-dimensional and challenging to process. In addition, these raw datasets can rarely be used directly due to the possibly high noise and irrelevant information embedded in them. Moreover, it is challenging to operate directly on the raw datasets, especially when it comes to visualizing and analyzing these data. On this point, the Pareto principle, better known as the 80/20 rule, commonly applies: researchers and solar engineers often spend most of their time collecting, cleaning, filtering, reducing and formatting the data.
In this work, a data analytics algorithm is applied to mitigate some of these complexities and make sense of the large time series data in PV systems. Each time series is treated as an individual entity that can be characterized by a set of generic or application-specific features. This reduces the dimension of the data, i.e., from hundreds of samples in a time series to a few descriptive features. It is also easier to visualize big time series data in the feature space than with traditional time series visualization methods, such as the spaghetti plot and horizon plot, which are informative but not very scalable. The time series data are processed to extract features, cluster them, and identify correspondences between specific measurements and the geographical locations of the PV systems. This characterisation of the time series data can be used for several PV applications, namely, (1) PV fault identification, (2) PV network design and (3) PV type pre-design for PV installation in locations with different geographical attributes.
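The dimension-reduction step described above (a whole time series collapsed to a few descriptive features) can be sketched as follows; the two synthetic series, the specific features chosen, and all values are illustrative assumptions, not the paper's data or feature set:

```python
import numpy as np

def extract_features(series):
    """Reduce one time series to a few generic descriptive features:
    mean level, variability, and the largest sample-to-sample ramp."""
    s = np.asarray(series, dtype=float)
    return np.array([s.mean(), s.std(), np.abs(np.diff(s)).max()])

# Two synthetic "string-level" daily profiles: a clear day (smooth
# irradiance curve) and a cloudy day (stochastic ramps from clouds).
t = np.linspace(0, np.pi, 200)
clear = np.sin(t)
rng = np.random.default_rng(1)
cloudy = np.sin(t) * (0.6 + 0.4 * rng.random(200))

f_clear = extract_features(clear)
f_cloudy = extract_features(cloudy)

# Each 200-sample series is now a 3-dimensional point; the ramp
# feature alone separates the smooth and stochastic regimes.
print(f_clear, f_cloudy)
```

Once every series is a low-dimensional feature vector, standard clustering and scatter-plot visualization in the feature space become tractable, which is the scalability gain the abstract claims over spaghetti and horizon plots.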
Big Data Analysis
The value of big data is predicated on the ability to detect trends and patterns and, more generally, to make sense of large volumes of data that are often a heterogeneous mix of formats, structures, and semantics. Big data analysis is the component of the big data value chain that focuses on transforming raw acquired data into a coherent, usable resource suitable for analysis. Using a range of interviews with key stakeholders in small and large companies and academia, this chapter outlines key insights, the state of the art, emerging trends, future requirements, and sectorial case studies for data analysis.
Integrating R and Hadoop for Big Data Analysis
Analyzing and working with big data can be very difficult using classical
means like relational database management systems or desktop software packages
for statistics and visualization. Instead, big data requires large clusters
with hundreds or even thousands of computing nodes. Official statistics is
increasingly considering big data for deriving new statistics, because big data
sources could produce more relevant and timely statistics than traditional
sources. One of the software tools successfully and widely used for the
storage and processing of big data sets on clusters of commodity hardware is
Hadoop. The Hadoop framework contains libraries, a distributed file system
(HDFS), and a resource-management platform, and it implements a version of the
MapReduce programming model for large-scale data processing. In this paper we
investigate the possibilities of integrating Hadoop with R, a popular software
environment for statistical computing and data visualization. We present three
ways of integrating them: R with Streaming, Rhipe, and RHadoop, and we
emphasize the advantages and disadvantages of each solution.
Comment: Romanian Statistical Review no. 2 / 201
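The MapReduce model that Hadoop Streaming exposes to R scripts (the paper's mappers and reducers are written in R; Python is used here only for illustration) can be sketched in miniature: a mapper emits key-value pairs, a shuffle sorts them by key, and a reducer aggregates each key's group, shown below on a word-count example.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (word, 1) pairs -- the role of the streaming mapper script."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Sum counts per key -- the role of the streaming reducer script.
    Hadoop sorts mapper output by key before the reduce phase, so the
    sort here mimics the shuffle step."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)),
                              key=itemgetter(0)):
        yield key, sum(v for _, v in group)

docs = ["big data needs big clusters", "hadoop processes big data"]
counts = dict(reducer(mapper(docs)))
print(counts)
```

Under Hadoop Streaming the same two roles are played by separate executables reading stdin and writing stdout on different cluster nodes; Rhipe and RHadoop wrap this pattern in R-native APIs instead of raw stream handling.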