
    14-08 Big Data Analytics to Aid Developing Livable Communities

    In transportation, the ubiquitous deployment of low-cost sensors, combined with powerful computer hardware and high-speed networks, makes big data available. USDOT defines big data research in transportation as a set of advanced techniques applied to the capture, management, and analysis of very large and diverse volumes of data. Data in transportation are usually well organized into tables and are characterized by relatively low dimensionality yet huge numbers of records. Big data research in transportation therefore faces unique challenges in processing huge volumes of data records and data streams effectively. The purpose of this study is to investigate the problems caused by large data volumes and data streams and to develop applications for data analysis in transportation. To process large numbers of records efficiently, we propose aggregating the data at multiple resolutions and exploring the data at the resolution that balances accuracy against speed. Techniques and algorithms for statistical analysis and data visualization have been developed for efficient data analytics using multiresolution data aggregation. The results provide a first step toward a rigorous framework for general analytical processing of big data in transportation.
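
    To make the multiresolution idea concrete, the sketch below pre-aggregates a stream of sensor records at several temporal resolutions and answers a range query from the finest level that fits a row budget. It is a minimal illustration in Python; the column names, resolution ladder, and budget are assumptions rather than details taken from the study.

```python
# Hypothetical sketch of multiresolution aggregation for traffic sensor records.
# Column names ("timestamp", "speed"), the resolution ladder, and the row budget
# are assumptions for illustration, not details from the study itself.
import pandas as pd

RESOLUTIONS = ["5min", "1H", "1D"]  # finest -> coarsest

def build_pyramid(records: pd.DataFrame) -> dict:
    """Pre-aggregate raw sensor records at several temporal resolutions."""
    records = records.set_index("timestamp").sort_index()
    return {res: records["speed"].resample(res).mean() for res in RESOLUTIONS}

def mean_speed(pyramid: dict, start, end, budget_rows: int = 10_000) -> float:
    """Answer a range query from the finest resolution whose row count fits the
    budget, trading accuracy for speed as the multiresolution idea suggests."""
    for res in RESOLUTIONS:                       # try finer (more accurate) levels first
        level = pyramid[res].loc[start:end]
        if len(level) <= budget_rows:
            return float(level.mean())
    return float(pyramid[RESOLUTIONS[-1]].loc[start:end].mean())  # coarsest fallback
```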

    Towards efficient localization of dynamic replicas for Geo-Distributed data stores

    Large-scale scientific experiments increasingly rely on geo-distributed clouds to serve relevant data to scientists worldwide with minimal latency. State-of-the-art caching systems often require the client to access the data through a caching proxy, or to contact a metadata server to locate the closest available copy of the desired data. Also, such caching systems are inconsistent with the design of distributed hash-table databases such as Dynamo, which focus on allowing clients to locate data independently. We argue there is a gap between existing state-of-the-art solutions and the needs of geographically distributed applications, which require fast access to popular objects while not degrading access latency for the rest of the data. In this paper, we introduce a probabilistic algorithm allowing the user to locate the closest copy of the data efficiently and independently with minimal overhead, allowing low-latency access to non-cached data. Also, we propose a network-efficient technique to identify the most popular data objects in the cluster and trigger their replication close to the clients. Experiments with a real-world data set show that these principles allow clients to locate the closest available copy of data with a small memory footprint and low error rate, thus improving read latency for non-cached data and allowing hot data to be read locally.
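
    The abstract describes the algorithm only as probabilistic, with a small memory footprint and low error rate. One plausible reading, sketched below, has each site gossip a compact Bloom filter of the objects it caches so that a client can probe sites in latency order without contacting a metadata server; the Bloom-filter design and the site/latency structures are assumptions for illustration, not the paper's actual scheme.

```python
# Hedged sketch: a client locating the closest replica by checking per-site
# Bloom filters in latency order. False positives are possible but rare, which
# matches the "low error rate, small memory footprint" framing of the abstract.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 16, hashes: int = 4):
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def closest_replica(obj_id: str, site_filters: dict, latency_ms: dict) -> str:
    """Return the lowest-latency site whose filter (probably) holds obj_id,
    falling back to the globally closest site for non-cached data."""
    for site in sorted(site_filters, key=latency_ms.get):
        if obj_id in site_filters[site]:       # may be a false positive
            return site
    return min(latency_ms, key=latency_ms.get)
```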

    Scalable and Reliable Middlebox Deployment

    Middleboxes are pervasive in modern computer networks, providing functionality beyond mere packet forwarding. Load balancers, intrusion detection systems, and network address translators are typical examples of middleboxes. Despite their benefits, middleboxes come with several challenges with respect to their scalability and reliability. The goal of this thesis is to devise middlebox deployment solutions that are cost-effective, scalable, and fault tolerant. The thesis makes three main contributions: first, distributed service function chaining, in which multiple instances of a middlebox are deployed on different physical servers to optimize resource usage; second, Constellation, a geo-distributed middlebox framework that enables a middlebox application to operate with high performance across wide area networks; and third, a fault-tolerant service function chaining system.
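
    As a rough illustration of service function chaining in general (not Constellation's actual design or API), the sketch below threads packets through an ordered list of middlebox functions, any of which may rewrite or drop a packet; all names and addresses are hypothetical.

```python
# Minimal, hypothetical service function chain: NAT -> IDS -> load balancer.
# The middlebox logic here is intentionally toy-sized.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Packet:
    src: str
    dst: str
    payload: bytes
    meta: dict = field(default_factory=dict)

Middlebox = Callable[[Packet], Optional[Packet]]   # returning None means "drop"

def nat(pkt: Packet) -> Optional[Packet]:
    pkt.src = "203.0.113.1"                         # rewrite private source address
    return pkt

def ids(pkt: Packet) -> Optional[Packet]:
    return None if b"attack" in pkt.payload else pkt  # toy signature check

def load_balance(pkt: Packet) -> Optional[Packet]:
    backends = ["10.0.0.2", "10.0.0.3"]
    pkt.dst = backends[hash(pkt.src) % len(backends)]  # keep a flow on one backend
    return pkt

def run_chain(pkt: Packet, chain: list) -> Optional[Packet]:
    for box in chain:
        pkt = box(pkt)
        if pkt is None:                             # a middlebox dropped the packet
            return None
    return pkt

result = run_chain(Packet("192.168.1.5", "198.51.100.7", b"hello"), [nat, ids, load_balance])
```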

    Distributed Caching for Processing Raw Arrays

    As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format, without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. First, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, an optimal assignment of cells to nodes that collocates dependent cells is determined in order to minimize the overall data transfer. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority of the proposed framework over existing techniques, by as much as two orders of magnitude, in terms of cache overhead and workload execution time.
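
    The paper's actual plan computation is not reproduced here, but the following hedged sketch shows the flavor of a workload-aware placement heuristic: chunks that historical queries touch together are greedily placed on the same node to limit data transfer. The affinity score, capacity model, and all names are assumptions, not the framework's algorithm.

```python
# Hedged sketch of a greedy, workload-aware placement heuristic for array
# chunks. Each historical query is modeled as the set of chunk ids it touches.
from collections import defaultdict
from itertools import combinations

def place_chunks(queries, nodes, capacity):
    """queries: list of sets of chunk ids; returns {chunk_id: node}."""
    affinity = defaultdict(int)   # how often two chunks appear in the same query
    freq = defaultdict(int)       # how often each chunk is queried
    for q in queries:
        for c in q:
            freq[c] += 1
        for a, b in combinations(sorted(q), 2):
            affinity[(a, b)] += 1

    load = {n: 0 for n in nodes}
    placement = {}
    # Place hot chunks first, next to the chunks they co-occur with.
    for chunk in sorted(freq, key=freq.get, reverse=True):
        def score(node):
            co = sum(affinity.get(tuple(sorted((chunk, other))), 0)
                     for other, n in placement.items() if n == node)
            return (co, -load[node])              # prefer affinity, then light load
        candidates = [n for n in nodes if load[n] < capacity]
        best = max(candidates, key=score) if candidates else min(load, key=load.get)
        placement[chunk] = best
        load[best] += 1
    return placement
```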

    Distributed Caching for Complex Querying of Raw Arrays

    As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format, without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. First, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, an optimal assignment of cells to nodes that collocates dependent cells is determined in order to minimize the overall data transfer. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority of the proposed framework over existing techniques, by as much as two orders of magnitude, in terms of cache overhead and workload execution time.

    Continuously Providing Approximate Results under Limited Resources: Load Shedding and Spilling in XML Streams

    Because of high data volumes and unpredictable arrival rates, stream processing systems may not always be able to keep up with the input data streams, resulting in buffer overflow and uncontrolled loss of data. To continuously supply online results, two alternative solutions to this problem of unpredictable failures in overloaded systems can be identified. The first technique, called load shedding, drops some fraction of the data from the input stream to reduce the memory and CPU requirements of the workload. However, dropping portions of the input data reduces the accuracy of the output, since some data is lost. To eventually produce complete results, the second technique, called data spilling, temporarily pushes some fraction of the data to persistent storage when the processing speed cannot keep up with the arrival rate. The processing of the disk-resident data is then postponed until system resources become available. This dissertation explores these load-reduction technologies in the context of XML stream systems. Load shedding in the specific context of XML streams poses several unique opportunities and challenges. Since XML data is hierarchical, subelements extracted from different positions of the XML tree structure may vary in their importance. Further, dropping different subelements may yield different savings in storage and computation. Hence, unlike prior work in the literature that drops data completely or not at all, in this dissertation we introduce the notion of structure-oriented load shedding, in which selected XML subelements are shed from the possibly complex XML objects in the XML stream. First, we develop a preference model that enables users to specify the relative importance of preserving different subelements within the XML result structure. This transforms shedding into the problem of rewriting the user query into shed queries that return approximate answers whose utility is measured by the user preference model. Our optimizer finds the shed queries that maximize output utility, driven by our structure-based preference model, under the limitation of available computation resources. The experimental results demonstrate that our proposed XML-specific shedding solution consistently achieves higher-utility results than existing relational shedding techniques. Second, we introduce structure-based spilling, a spilling technique customized for XML streams that considers spilling partial substructures of possibly complex XML elements. Several new challenges caused by structure-based spilling are addressed. When a path is spilled, multiple other paths may be affected. We categorize the types of side effects that spilling can have on the query. We also study how to execute the reduced query so that it produces the correct runtime output. Three optimization strategies are developed to select the reduced query that maximizes output quality. We also examine the clean-up stage, which guarantees that an entire result set is eventually generated by producing supplementary results to complement the partial results output earlier. The experimental study demonstrates that our proposed solutions consistently achieve higher-quality results than state-of-the-art techniques. Third, we design an integrated framework that combines both shedding and spilling policies into one comprehensive methodology.
    Decisions on whether to shed or spill data may be affected by application needs and data arrival patterns. For some input data, it may be worthwhile to flush it to disk if a delayed output of its results is still important, while other data is best dropped from the system directly, given that delayed delivery of its results would no longer be meaningful to the application. We therefore need technologies capable of deploying both shedding and spilling within one integrated strategy, able to deliver the most appropriate decision for each specific circumstance. We propose a novel, flexible framework for structure-based shed and spill approaches, applicable in any XML stream system. We define a solution space that represents all shed and spill candidates. An age-based quality model is proposed for evaluating the output quality of different reduced-query and supplementary-query pairs. We also propose a family of four optimization strategies: OptF, OptSmart, HiX, and Fex. OptF and OptSmart are both guaranteed to identify an optimal pair of reduced and supplementary queries, with OptSmart exhibiting significantly less overhead than OptF. HiX and Fex use heuristic-based approaches that are much more efficient than OptF and OptSmart.
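
    As a toy illustration of the structure-oriented shedding trade-off (not the dissertation's actual optimizer), the sketch below keeps the subset of subelement paths that maximizes user-specified utility under an integer resource budget, treating the choice as a 0/1 knapsack; the example paths, utilities, and costs are invented.

```python
# Hedged illustration: pick which XML subelement paths to keep under a resource
# budget, shedding the rest. Utilities come from a user preference model; costs
# approximate processing overhead. A classic 0/1-knapsack DP over integer costs.

def choose_paths(paths, budget):
    """paths: list of (path, utility, cost); returns the set of paths to keep."""
    best = [(0.0, frozenset())] * (budget + 1)     # best (utility, kept paths) per budget
    for path, utility, cost in paths:
        for b in range(budget, cost - 1, -1):      # iterate budget downward (0/1 knapsack)
            cand_util = best[b - cost][0] + utility
            if cand_util > best[b][0]:
                best[b] = (cand_util, best[b - cost][1] | {path})
    return best[budget][1]                         # paths to keep; everything else is shed

kept = choose_paths(
    [("/order/items", 0.9, 5), ("/order/comments", 0.2, 4), ("/order/customer", 0.7, 3)],
    budget=8,
)  # -> keeps /order/items and /order/customer; /order/comments is shed
```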

    Augmenting the performance of image similarity search through crowdsourcing

    Crowdsourcing is defined as “outsourcing a task that is traditionally performed by an employee to a large group of people in the form of an open call” (Howe 2006). Many platforms have been designed to support different types of crowdsourcing, and studies have shown that results produced by crowds on crowdsourcing platforms are generally accurate and reliable. Crowdsourcing can provide a fast and efficient way to use the power of human computation to solve problems that are difficult for machines. Of the several microtask crowdsourcing platforms available, we chose to perform our study using Amazon Mechanical Turk. In the context of our research, we studied the effect of user interface design, and its corresponding cognitive load, on the performance of crowd-produced results. Our results highlighted the importance of a well-designed user interface for crowdsourcing performance. Using crowdsourcing platforms such as Amazon Mechanical Turk, we can enlist humans to solve problems that are difficult for computers, such as image similarity search. However, in tasks like image similarity search, it is more efficient to design a hybrid human–machine system. In the context of our research, we studied the effect of involving the crowd on the performance of an image similarity search system and proposed a hybrid human–machine image similarity search system. Our proposed system uses machine power to perform the heavy computation of searching for similar images within the image dataset, and uses crowdsourcing to refine the results. We designed our content-based image retrieval (CBIR) system using SIFT, SURF, SURF128, and ORB feature detectors/descriptors and compared the performance of the system with each. Our experiments confirmed that crowdsourcing can dramatically improve the performance of the CBIR system.
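
    The machine half of such a hybrid pipeline can be sketched with OpenCV's ORB detector, as below: dataset images are ranked by descriptor-match quality, and the top candidates would then be handed to crowd workers (e.g. as Mechanical Turk HITs) for refinement. The scoring function, file paths, and candidate count are illustrative assumptions, not the system described in the thesis.

```python
# Hedged sketch of the machine stage of a hybrid human-machine image search:
# rank candidate images by ORB feature-match quality, then send the top-k to
# the crowd for refinement (the crowd step is outside this sketch).
import cv2

def orb_score(query_path: str, candidate_path: str) -> float:
    """Higher score = more and closer ORB descriptor matches between two images."""
    orb = cv2.ORB_create()
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    img1 = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(candidate_path, cv2.IMREAD_GRAYSCALE)
    _, des1 = orb.detectAndCompute(img1, None)
    _, des2 = orb.detectAndCompute(img2, None)
    if des1 is None or des2 is None:              # no features found in an image
        return 0.0
    matches = bf.match(des1, des2)
    return sum(1.0 / (1.0 + m.distance) for m in matches)

def machine_candidates(query: str, dataset: list, top_k: int = 20) -> list:
    """Top-k dataset images by ORB score; these would be posted as crowd tasks."""
    return sorted(dataset, key=lambda p: orb_score(query, p), reverse=True)[:top_k]
```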

    Query Optimization in Dynamic Environments

    Most modern applications deal with very large amounts of data. Handling such huge amounts of data is in itself a challenge, and the challenge is complicated further by the fact that, in many cases, the data is constantly changing and evolving. For instance, relational databases that handle the data of day-to-day transactional applications often have tables with very high data change rates. It is not uncommon to have temporary or volatile tables that are created from scratch and completely dropped over the course of a single query workload. This dissertation focuses on optimizing structured queries over dynamic and constantly changing data sets. Our work addresses this issue and some of the challenges related to it. We address the problem of database statistics becoming stale and inaccurate due to constantly changing data. We introduce ways to automatically analyze the existing statistics and to recommend and collect the statistics necessary to optimize a single query or a query workload. We introduce a mechanism to automate the recommendation and collection of statistical views for a given query workload, and we compare two methods of using these statistical views in selectivity estimation. We evaluate our methods and techniques in experimental studies using prototypes built into commercial database systems.
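
    A minimal sketch of the kind of staleness check such a tool might automate is shown below: each table's row churn since the last statistics collection is compared against a threshold, and tables exceeding it are recommended for re-collection. The catalog fields and the 20% threshold are assumptions, not the mechanism of the commercial systems used in the study.

```python
# Hedged sketch: flag tables whose statistics are likely stale by comparing row
# modifications since the last collection against the row count at that time.
from dataclasses import dataclass

@dataclass
class TableStats:
    name: str
    rows_at_last_collect: int
    rows_modified_since: int      # inserts + updates + deletes since collection

def recommend_statistics(tables, churn_threshold: float = 0.2):
    """Return names of tables whose statistics should be re-collected."""
    stale = []
    for t in tables:
        baseline = max(t.rows_at_last_collect, 1)   # avoid division by zero
        if t.rows_modified_since / baseline >= churn_threshold:
            stale.append(t.name)
    return stale

# Example: a volatile table rebuilt since the last collection is flagged.
print(recommend_statistics([TableStats("ORDERS", 1_000_000, 50_000),
                            TableStats("TEMP_BATCH", 10_000, 10_000)]))
```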

    The LOFAR Transients Pipeline

    Current and future astronomical survey facilities provide a remarkably rich opportunity for transient astronomy, combining unprecedented fields of view with high sensitivity and the ability to access previously unexplored wavelength regimes. This is particularly true of LOFAR, a recently commissioned, low-frequency radio interferometer based in the Netherlands and with stations across Europe. The identification of and response to transients is one of LOFAR's key science goals. However, the large data volumes which LOFAR produces, combined with the scientific requirement for rapid response, make automation essential. To support this, we have developed the LOFAR Transients Pipeline, or TraP. The TraP ingests multi-frequency image data from LOFAR or other instruments and searches it for transients and variables, providing automatic alerts of significant detections and populating a lightcurve database for further analysis by astronomers. Here, we discuss the scientific goals of the TraP and how it has been designed to meet them. We describe its implementation, including both the algorithms adopted to maximize performance and the development methodology used to ensure it is robust and reliable, particularly in the presence of artefacts typical of radio astronomy imaging. Finally, we report on a series of tests of the pipeline carried out using simulated LOFAR observations with a known population of transients. Comment: 30 pages, 11 figures; accepted for publication in Astronomy & Computing; code at https://github.com/transientskp/tk
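
    One common variability statistic a transients pipeline can apply to a light curve is the reduced chi-square of the flux measurements against a constant-source model; the hedged sketch below computes it and flags sources above a threshold. The threshold and interfaces are assumptions and do not reproduce TraP's actual detection logic or database schema.

```python
# Hedged sketch: flag a variable source from its light curve using the reduced
# chi-square against the weighted-mean (constant) flux model.
import numpy as np

def reduced_chi_square(flux: np.ndarray, flux_err: np.ndarray) -> float:
    """Chi-square per degree of freedom relative to the weighted-mean flux."""
    weights = 1.0 / flux_err**2
    mean = np.sum(weights * flux) / np.sum(weights)
    chi2 = np.sum(((flux - mean) / flux_err) ** 2)
    return float(chi2 / (len(flux) - 1))

def is_variable(flux, flux_err, threshold: float = 3.0) -> bool:
    """True if the light curve deviates from a constant source beyond the threshold."""
    return reduced_chi_square(np.asarray(flux), np.asarray(flux_err)) > threshold
```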