On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research
Scientific research requires access, analysis, and sharing of data that is
distributed across various heterogeneous data sources at the scale of the
Internet. An eager ETL process constructs an integrated data repository as its
first step, integrating and loading data in its entirety from the data sources.
The bootstrapping of this process is not efficient for scientific research that
requires access to data from very large and typically numerous distributed data
sources. A lazy ETL process loads only the metadata, but still eagerly. Lazy
ETL is faster in bootstrapping. However, queries on the integrated data
repository of eager ETL perform faster, due to the availability of the entire
data beforehand.
In this paper, we propose a novel ETL approach for scientific data
integration, as a hybrid of eager and lazy ETL approaches, and applied both to
data as well as metadata. This way, Hybrid ETL supports incremental integration
and loading of metadata and data from the data sources. We incorporate a
human-in-the-loop approach, to enhance the hybrid ETL, with selective data
integration driven by the user queries and sharing of integrated data between
users. We implement our hybrid ETL approach in a prototype platform, Obidos,
and evaluate it in the context of data sharing for medical research. Obidos
outperforms both the eager ETL and lazy ETL approaches, for scientific research
data integration and sharing, through its selective loading of data and
metadata, while storing the integrated data in a scalable integrated data
repository.
Comment: Pre-print submitted to the DMAH Special Issue of the Springer DAPD
Journa
Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
Apache Calcite is a foundational software framework that provides query
processing, optimization, and query language support to many popular
open-source data processing systems such as Apache Hive, Apache Storm, Apache
Flink, Druid, and MapD. Calcite's architecture consists of a modular and
extensible query optimizer with hundreds of built-in optimization rules, a
query processor capable of processing a variety of query languages, an adapter
architecture designed for extensibility, and support for heterogeneous data
models and stores (relational, semi-structured, streaming, and geospatial).
This flexible, embeddable, and extensible architecture is what makes Calcite an
attractive choice for adoption in big-data frameworks. It is an active project
that continues to introduce support for new types of data sources, query
languages, and approaches to query processing and optimization.
Comment: SIGMOD'1
End-to-end informed VM selection in compute clouds
The selection of resources, particularly VMs, in current public IaaS clouds is usually done in a blind fashion, as cloud users do not have much information about resource consumption by co-tenant third-party tasks. In particular, communication patterns can play a significant part in cloud application performance and responsiveness, especially in the case of novel latency-sensitive applications, increasingly common in today’s clouds. Thus, herein we propose an end-to-end approach to the VM allocation problem using policies based solely on round-trip time measurements between VMs. These become part of a user-level ‘Recommender Service’ that receives VM allocation requests with certain network-related demands and matches them to a suitable subset of VMs available to the user within the cloud. We propose and implement end-to-end algorithms for VM selection that cover desirable profiles of communications between VMs in distributed applications in a cloud setting, such as profiles with prevailing pair-wise, hub-and-spokes, or clustered communication patterns between constituent VMs. We quantify the expected benefits from deploying our Recommender Service by comparing our informed VM allocation approaches to conventional, random allocation methods, based on real measurements of latencies between Amazon EC2 instances. We also show that our approach is completely independent from cloud architecture details, is adaptable to different types of applications and workloads, and is lightweight and transparent to cloud providers. This work is supported in part by the National Science
Foundation under grant CNS-0963974
Building the HIVe: disrupting biomedical HIV and AIDS research with gay men, other men who have sex with men (MSM) and transgenders
Networked and digital technologies now mediate the sexual behaviors of many gay men, other men who have sex with men and transgenders, challenging the effectiveness of biomedical HIV/AIDS research and prevention practices. Driven by the normative positivist philosophy of science, these approaches, while paramount to fighting the epidemic, have neglected to rethink their ontological and epistemological assumptions when confronting the social drivers of HIV. Building the HIVe responds by forefronting community-based and community-led sociological HIV/AIDS research and prevention that addresses digitally mediated and driven sexual behaviors. The HIVe disrupts biomedical approaches by building an accessible and dynamic social science research community engaged in reflexive performativity to improve the health and human rights of marginalized communities disproportionately at risk of HIV/AIDS.
D-SPACE4Cloud: A Design Tool for Big Data Applications
The last years have seen a steep rise in data generation worldwide, with the
development and widespread adoption of several software projects targeting the
Big Data paradigm. Many companies currently engage in Big Data analytics as
part of their core business activities; nonetheless, there are no tools and
techniques to support the design of the underlying hardware configuration
backing such systems. In particular, this report focuses on Cloud-deployed
clusters, which represent a cost-effective alternative to on-premises
installations. We propose a novel tool implementing a battery of optimization
and prediction techniques integrated so as to efficiently assess several
alternative resource configurations, in order to determine the minimum cost
cluster deployment satisfying QoS constraints. Further, the experimental
campaign conducted on real systems shows the validity and relevance of the
proposed method.
Living IoT: A Flying Wireless Platform on Live Insects
Sensor networks with devices capable of moving could enable applications
ranging from precision irrigation to environmental sensing. Using mechanical
drones to move sensors, however, severely limits operation time since flight
time is limited by the energy density of current battery technology. We explore
an alternative, biology-based solution: integrate sensing, computing and
communication functionalities onto live flying insects to create a mobile IoT
platform.
Such an approach takes advantage of these tiny, highly efficient biological
insects which are ubiquitous in many outdoor ecosystems, to essentially provide
mobility for free. Doing so, however, requires addressing key technical
challenges of power, size, weight and self-localization in order for the
insects to perform location-dependent sensing operations as they carry our IoT
payload through the environment. We develop and deploy on bumblebees a
platform that includes backscatter communication, low-power
self-localization hardware, sensors, and a power source. We show that our
platform is capable of sensing, backscattering data at 1 kbps when the insects
are back at the hive, and localizing itself up to distances of 80 m from the
access points, all within a total weight budget of 102 mg.
Comment: Co-primary authors: Vikram Iyer, Rajalakshmi Nandakumar, Anran Wang,
In Proceedings of Mobicom. ACM, New York, NY, USA, 15 pages, 201