DualTable: A Hybrid Storage Model for Update Optimization in Hive
Hive is the most mature and prevalent data warehouse tool providing a SQL-like
interface in the Hadoop ecosystem. It is successfully used in many Internet
companies and shows its value for big data processing in traditional
industries. However, enterprise big data processing systems, such as those in Smart Grid
applications, usually require complicated business logic and involve many data
manipulation operations like updates and deletes. Hive cannot offer sufficient
support for these while preserving high query performance. Hive using the
Hadoop Distributed File System (HDFS) for storage cannot implement data
manipulation efficiently and Hive on HBase suffers from poor query performance
even though it can support faster data manipulation. There is a project based on
Hive issue Hive-5317 to support update operations, but it has not been finished
in Hive's latest version. Since this ACID-compliant extension adopts the same data
storage format on HDFS, the update performance problem is not solved.
In this paper, we propose a hybrid storage model called DualTable, which
combines the efficient streaming reads of HDFS and the random write capability
of HBase. Hive on DualTable provides better data manipulation support and
preserves query performance at the same time. Experiments on a TPC-H data set
and on a real smart grid data set show that Hive on DualTable is up to 10 times
faster than Hive when executing update and delete operations. Comment: accepted by industry session of ICDE201
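The hybrid idea described in this abstract can be illustrated with a toy merge-on-read table. This is only a minimal sketch under stated assumptions, not the DualTable implementation: the class `HybridTable`, its methods, and the `(key, value)` row schema are hypothetical, an in-memory list stands in for the sequentially scanned HDFS data, and a dictionary stands in for the randomly writable HBase store.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Row = Tuple[int, str]  # hypothetical (key, value) schema, for illustration only


class HybridTable:
    """Toy merge-on-read table: immutable base data plus a mutable delta."""

    def __init__(self, base_rows: List[Row]) -> None:
        # Stands in for bulk data that is only scanned sequentially (the HDFS role).
        self._base: List[Row] = list(base_rows)
        # Stands in for a random-access key-value store (the HBase role).
        # A value of None is a tombstone marking a deleted key.
        self._delta: Dict[int, Optional[str]] = {}

    def update(self, key: int, value: str) -> None:
        # Record the new value in the delta; the base data is not rewritten.
        self._delta[key] = value

    def delete(self, key: int) -> None:
        # Record a tombstone in the delta; the base data is not rewritten.
        self._delta[key] = None

    def scan(self) -> Iterator[Row]:
        # Stream the base rows in order and overlay the delta on the fly.
        for key, value in self._base:
            if key in self._delta:
                new_value = self._delta[key]
                if new_value is not None:  # skip rows hidden by a tombstone
                    yield key, new_value
            else:
                yield key, value

    def compact(self) -> None:
        # Fold the delta into a fresh base so later scans carry no overlay cost.
        self._base = list(self.scan())
        self._delta.clear()


table = HybridTable([(1, "a"), (2, "b"), (3, "c")])
table.update(2, "B")
table.delete(3)
merged = list(table.scan())
```

In this sketch, updates and deletes touch only the small delta, while reads remain a single sequential pass over the base data; an update to a key that is absent from the base is ignored by `scan`, since the sketch models updates rather than inserts.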
A High-Performance Data Accessing and Processing System for Campus Real-time Power Usage
With the flourishing of Internet of Things (IoT) technology, ubiquitous power data can be linked to the Internet and analyzed to meet real-time monitoring requirements. Over time, the accumulated power data can reach the terabyte level. To build a real-time power monitoring platform on such data, an efficient and novel implementation technique has been developed and forms the core of this thesis. The proposed power-monitoring platform integrates multiple software subsystems in layers and is composed, from bottom to top, of Ubuntu (as operating system), Hadoop (as storage subsystem), Hive (as data warehouse), and Spark MLlib (as data analytics). The power data is provided by smart meters installed inside the factories of an enterprise. Data collection and storage are handled by the Hadoop subsystem, and data ingestion into the Hive data warehouse is conducted by the Spark unit. For system verification, HiveQL and Impala SQL were tested for query-response efficiency under single-record queries, and experiments on full-table query performance were conducted on the same software modules. The key contributions of this research are twofold: the details of building an efficient real-time power-monitoring platform, and the relevant query-response efficiency for reference
A Scalable Machine Learning Online Service for Big Data Real-Time Analysis
Proceedings of: IEEE Symposium Series on Computational Intelligence (SSCI 2014). Orlando, FL, USA, December 09-12, 2014. This work describes a proposal for developing and testing a scalable machine learning architecture able to provide real-time predictions or analytics as a service over domain-independent big data, working on top of the Hadoop ecosystem and exposed through a RESTful API. Systems implementing this architecture could provide companies with on-demand tools facilitating the tasks of storing, analyzing, understanding and reacting to their data, either in batch or stream fashion; they could become a valuable asset for improving business performance and a key market differentiator in this fast-paced environment. In order to validate the proposed architecture, two systems are developed, each one providing classical machine-learning services in a different domain: the first involves a recommender system for web advertising, while the second consists of a prediction system which learns from gamers' behavior and tries to predict future events such as purchases or churning. An evaluation is carried out on these systems, and results show that both services are able to provide fast responses even when a number of concurrent requests are made; in the particular case of the second system, results show that the computed predictions significantly outperform those obtained by random guessing. This research work is part of Memento Data Analysis project, co-funded by the Spanish Ministry of Industry, Energy and Tourism with identifier TSI-020601-2012-99. Publicad
Proposal of Vital Data Analysis Platform using Wearable Sensor
In this paper, we propose a vital data analysis platform which resolves
existing problems in utilizing vital data for real-time actions. Recently, IoT
technologies have progressed, but in the healthcare area, real-time actions
based on analyzed vital data have not yet been considered sufficiently. The causes
are difficulties in the proper use of stream / micro-batch analysis methods and
network cost. To resolve these problems, we propose our vital data analysis
platform. Our platform collects electrocardiograph and
acceleration vital data using an example wearable vital sensor and analyzes them to
extract posture, fatigue and relaxation on smartphones or in the cloud. Our platform
can show analyzed dangerous posture or fatigue level changes. We implemented the
platform and are now preparing a field test. Comment: 6 pages, 2 figures, 5th IIAE International Conference on Industrial
Application Engineering 2017 (ICIAE2017), pp.138-143, Mar. 201
Power extraction circuits for piezoelectric energy harvesters and time series data in water supply systems
This thesis investigates two fundamental technological challenges that prevent water utilities from deploying infrastructure monitoring apparatus with high spatial and temporal resolution: providing sufficient power for sensor nodes by increasing the power output from a vibration-driven energy harvester based on piezoelectric transduction, and the processing and storage of large volumes of data resulting from the increased level of pressure and flow rate monitoring.
Piezoelectric energy harvesting from flow-induced vibrations within a water main represents a potential source of power to supply a sensor node capable of taking high-frequency measurements. A main factor limiting the amount of power from a piezoelectric device is the damping force that can be achieved. Electronic interface circuits can modify this damping in order to increase the power output to a reasonable level. A unified analytical framework was developed to compare circuits able to do this in terms of their power output. A new circuit is presented that out-performs existing circuits by a factor of 2, which is verified experimentally.
The second problem concerns the management of large data sets arising from resolving challenges with the provision of power to sensor devices. The ability to process large data volumes is limited by the throughput of storage devices. For scientists to execute queries in a timely manner, query execution must be performant. The large volume of data that must be gathered to extract information from historic trends mandates a scalable approach. A scalable, durable storage and query execution framework is presented that is able to significantly improve the execution time of user-defined queries.
A prototype database was implemented and validated on a cluster of commodity servers using live data gathered from a London pumping station and transmission mains. Benchmark results and reliability tests are included that demonstrate a significant improvement in performance over a traditional database architecture for a range of frequently-used operations, with many queries returning results near-instantaneously
- …