97 research outputs found
Big Data Dimensional Analysis
The ability to collect and analyze large amounts of data is a growing problem
within the scientific community. The growing gap between data and users calls
for innovative tools that address the challenges faced by big data volume,
velocity and variety. One of the main challenges associated with big data
variety is automatically understanding the underlying structures and patterns
of the data. Such an understanding is required as a pre-requisite to the
application of advanced analytics to the data. Further, big data sets often
contain anomalies and errors that are difficult to know a priori. Current
approaches to understanding data structure are drawn from the traditional
database ontology design. These approaches are effective, but often require too
much human involvement to be effective for the volume, velocity and variety of
data encountered by big data systems. Dimensional Data Analysis (DDA) is a
proposed technique that allows big data analysts to quickly understand the
overall structure of a big dataset, determine anomalies. DDA exploits
structures that exist in a wide class of data to quickly determine the nature
of the data and its statical anomalies. DDA leverages existing schemas that are
employed in big data databases today. This paper presents DDA, applies it to a
number of data sets, and measures its performance. The overhead of DDA is low
and can be applied to existing big data systems without greatly impacting their
computing requirements.Comment: From IEEE HPEC 201
Improving Big Data Visual Analytics with Interactive Virtual Reality
For decades, the growth and volume of digital data collection has made it
challenging to digest large volumes of information and extract underlying
structure. Coined 'Big Data', massive amounts of information has quite often
been gathered inconsistently (e.g from many sources, of various forms, at
different rates, etc.). These factors impede the practices of not only
processing data, but also analyzing and displaying it in an efficient manner to
the user. Many efforts have been completed in the data mining and visual
analytics community to create effective ways to further improve analysis and
achieve the knowledge desired for better understanding. Our approach for
improved big data visual analytics is two-fold, focusing on both visualization
and interaction. Given geo-tagged information, we are exploring the benefits of
visualizing datasets in the original geospatial domain by utilizing a virtual
reality platform. After running proven analytics on the data, we intend to
represent the information in a more realistic 3D setting, where analysts can
achieve an enhanced situational awareness and rely on familiar perceptions to
draw in-depth conclusions on the dataset. In addition, developing a
human-computer interface that responds to natural user actions and inputs
creates a more intuitive environment. Tasks can be performed to manipulate the
dataset and allow users to dive deeper upon request, adhering to desired
demands and intentions. Due to the volume and popularity of social media, we
developed a 3D tool visualizing Twitter on MIT's campus for analysis. Utilizing
emerging technologies of today to create a fully immersive tool that promotes
visualization and interaction can help ease the process of understanding and
representing big data.Comment: 6 pages, 8 figures, 2015 IEEE High Performance Extreme Computing
Conference (HPEC '15); corrected typo
Graphulo Implementation of Server-Side Sparse Matrix Multiply in the Accumulo Database
The Apache Accumulo database excels at distributed storage and indexing and
is ideally suited for storing graph data. Many big data analytics compute on
graph data and persist their results back to the database. These graph
calculations are often best performed inside the database server. The GraphBLAS
standard provides a compact and efficient basis for a wide range of graph
applications through a small number of sparse matrix operations. In this
article, we implement GraphBLAS sparse matrix multiplication server-side by
leveraging Accumulo's native, high-performance iterators. We compare the
mathematics and performance of inner and outer product implementations, and
show how an outer product implementation achieves optimal performance near
Accumulo's peak write rate. We offer our work as a core component to the
Graphulo library that will deliver matrix math primitives for graph analytics
within Accumulo.Comment: To be presented at IEEE HPEC 2015: http://www.ieee-hpec.org
- …