616 research outputs found

    DualTable: A Hybrid Storage Model for Update Optimization in Hive

    Full text link
    Hive is the most mature and prevalent data warehouse tool providing SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems as in Smart Grid applications usually require complicated business logics and involve many data manipulation operations like updates and deletes. Hive cannot offer sufficient support for these while preserving high query performance. Hive using the Hadoop Distributed File System (HDFS) for storage cannot implement data manipulation efficiently and Hive on HBase suffers from poor query performance even though it can support faster data manipulation.There is a project based on Hive issue Hive-5317 to support update operations, but it has not been finished in Hive's latest version. Since this ACID compliant extension adopts same data storage format on HDFS, the update performance problem is not solved. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS and the random write capability of HBase. Hive on DualTable provides better data manipulation support and preserves query performance at the same time. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations.Comment: accepted by industry session of ICDE201

    A High-Performance Data Accessing and Processing System for Campus Real-time Power Usage

    Get PDF
    With the flourishing of Internet of Things (IoT) technology, ubiquitous power data can be linked to the Internet and be analyzed for real-time monitoring requirements. Numerous power data would be accumulated to even Tera-byte level as the time goes. To approach a real-time power monitoring platform on them, an efficient and novel implementation techniques has been developed and formed to be the kernel material of this thesis. Based on the integration of multiple software subsystems in a layered manner, the proposed power-monitoring platform has been established and is composed of Ubuntu (as operating system), Hadoop (as storage subsystem), Hive (as data warehouse), and the Spark MLlib (as data analytics) from bottom to top. The generic power-data source is provided by the so-called smart meters equipped inside factories located in an enterprise practically. The data collection and storage are handled by the Hadoop subsystem and the data ingestion to Hive data warehouse is conducted by the Spark unit. On the aspect of system verification, under single-record query, these software modules: HiveQL and Impala SQL had been tested in terms of query-response efficiency. And for the performance exploration on the full-table query function. The relevant experiments have been conducted on the same software modules as well. The kernel contributions of this research work can be highlighted by two parts: the details of building an efficient real-time power-monitoring platform, and the relevant query-response efficiency for reference

    A Scalable Machine Learning Online Service for Big Data Real-Time Analysis

    Get PDF
    Proceedings of: IEEE Symposium Series on Computational Intelligence (SSCI 2014). Orlando, FL, USA, December 09-12, 2014.This work describes a proposal for developing and testing a scalable machine learning architecture able to provide real-time predictions or analytics as a service over domain-independent big data, working on top of the Hadoop ecosystem and providing real-time analytics as a service through a RESTful API. Systems implementing this architecture could provide companies with on-demand tools facilitating the tasks of storing, analyzing, understanding and reacting to their data, either in batch or stream fashion; and could turn into a valuable asset for improving the business performance and be a key market differentiator in this fast pace environment. In order to validate the proposed architecture, two systems are developed, each one providing classical machine-learning services in different domains: the first one involves a recommender system for web advertising, while the second consists in a prediction system which learns from gamers' behavior and tries to predict future events such as purchases or churning. An evaluation is carried out on these systems, and results show how both services are able to provide fast responses even when a number of concurrent requests are made, and in the particular case of the second system, results clearly prove that computed predictions significantly outperform those obtained if random guess was used.This research work is part of Memento Data Analysis project, co-funded by the Spanish Ministry of Industry, Energy and Tourism with identifier TSI-020601-2012-99.Publicad

    Proposal of Vital Data Analysis Platform using Wearable Sensor

    Full text link
    In this paper, we propose a vital data analysis platform which resolves existing problems to utilize vital data for real-time actions. Recently, IoT technologies have been progressed but in the healthcare area, real-time actions based on analyzed vital data are not considered sufficiently yet. The causes are proper use of analyzing methods of stream / micro batch processing and network cost. To resolve existing problems, we propose our vital data analysis platform. Our platform collects vital data of Electrocardiograph and acceleration using an example of wearable vital sensor and analyzes them to extract posture, fatigue and relaxation in smart phones or cloud. Our platform can show analyzed dangerous posture or fatigue level change. We implemented the platform. And we are now preparing a field test.Comment: 6 pages, 2 figures, 5th IIAE International Conference on Industrial Application Engineering 2017 (ICIAE2017), pp.138-143, Mar. 201

    Power extraction circuits for piezoelectric energy harvesters and time series data in water supply systems

    Get PDF
    This thesis investigates two fundamental technological challenges that prevent water utilities from deploying infrastructure monitoring apparatus with high spatial and temporal resolution: providing sufficient power for sensor nodes by increasing the power output from a vibration-driven energy harvester based on piezoelectric transduction, and the processing and storage of large volumes of data resulting from the increased level of pressure and flow rate monitoring. Piezoelectric energy harvesting from flow-induced vibrations within a water main represents a potential source of power to supply a sensor node capable of taking high- frequency measurements. A main factor limiting the amount of power from a piezoelectric device is the damping force that can be achieved. Electronic interface circuits can modify this damping in order to increase the power output to a reasonable level. A unified analytical framework was developed to compare circuits able to do this in terms of their power output. A new circuit is presented that out-performs existing circuits by a factor of 2, which is verified experimentally. The second problem concerns the management of large data sets arising from resolving challenges with the provision of power to sensor devices. The ability to process large data volumes is limited by the throughput of storage devices. For scientists to execute queries in a timely manner, query execution must be performant. The large volume of data that must be gathered to extract information from historic trends mandates a scalable approach. A scalable, durable storage and query execution framework is presented that is able to significantly improve the execution time of user-defined queries. A prototype database was implemented and validated on a cluster of commodity servers using live data gathered from a London pumping station and transmission mains. Benchmark results and reliability tests are included that demonstrate a significant improvement in performance over a traditional database architecture for a range of frequently-used operations, with many queries returning results near-instantaneously
    • …
    corecore