Progressive data stream mining and transaction classification for workload-aware incremental database repartitioning
Minimising the impact of distributed transactions (DTs) in a shared-nothing distributed database is extremely challenging for transactional workloads. Given the dynamic nature of workloads and the rapid growth in data volume, the underlying database requires incremental repartitioning to maintain an acceptable level of DTs and data load balance with minimal physical data migration. In a workload-aware repartitioning scheme, the transactional workload is modelled as a graph or hypergraph; performing k-way min-cut clustering that guarantees minimum edge cuts, and mapping the resulting workload clusters to logical database partitions, can then significantly reduce the impact of DTs. However, without exploiting the inherent workload characteristics, the overall processing and computing times for large-scale workload networks grow polynomially. In this paper, a workload-aware incremental database repartitioning technique is proposed that effectively exploits proactive transaction classification and workload stream mining. Workload batches are modelled as graphs, hypergraphs, and compressed hypergraphs, then repartitioned to produce a fresh tuple-to-partition data migration plan for every incremental cycle. Experimental studies in a simulated TPC-C environment demonstrate that the proposed model can be effectively adopted to manage rapid data growth and dynamic workloads, progressively reducing the overall processing time required to operate over the workload networks.
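To make the pipeline concrete, the following is a minimal Python sketch, under stated assumptions, of the workload-graph model the abstract describes: transactions are sets of tuple ids, edge weights count co-accesses, and a recursive Kernighan-Lin bisection (via networkx) stands in for a METIS-style k-way min-cut partitioner. All names and the DT-ratio metric are illustrative, not the paper's implementation.

    from itertools import combinations
    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    def workload_graph(transactions):
        """Nodes are tuple ids; edge weight counts how often two tuples co-occur."""
        g = nx.Graph()
        for txn in transactions:
            g.add_nodes_from(txn)
            for u, v in combinations(sorted(txn), 2):
                w = g.get_edge_data(u, v, default={"weight": 0})["weight"]
                g.add_edge(u, v, weight=w + 1)
        return g

    def kway_partition(g, k):
        """Recursively bisect until k parts (k a power of two in this sketch)."""
        parts = [set(g.nodes)]
        while len(parts) < k:
            part = parts.pop(0)
            a, b = kernighan_lin_bisection(g.subgraph(part), weight="weight")
            parts.extend([a, b])
        return {node: i for i, part in enumerate(parts) for node in part}

    def dt_ratio(transactions, placement):
        """Fraction of transactions spanning more than one partition (the DTs)."""
        spans = [len({placement[t] for t in txn}) for txn in transactions]
        return sum(s > 1 for s in spans) / len(spans)

Comparing dt_ratio under a naive hash placement against the min-cut placement on a sample batch gives a quick sense of the edge-cut reduction the paper targets.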
Building Efficient and Cost-Effective Cloud-Based Big Data Management Systems
In today's big data world, data is being produced in massive volumes, at great velocity, and from a variety of sources such as mobile devices, sensors, a plethora of small devices hooked to the internet (the Internet of Things), social networks, communication networks, and many others. Interactive querying and large-scale analytics are increasingly used to derive value from this big data. A large portion of this data is stored and processed in the Cloud due to the advantages the Cloud provides, such as scalability, elasticity, availability, low cost of ownership, and overall economies of scale. There is thus a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage, and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments.
In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully. I have designed, built, and evaluated SWORD, an end-to-end scalable online transaction processing system that utilizes workload-aware data placement and replication to minimize the number of distributed transactions, and that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data and during query execution at runtime.
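One ingredient of such placement, replication of frequently co-read data, can be sketched as follows. This is a hypothetical illustration of the kind of technique described, not SWORD's actual algorithm; it assumes transactions are sets of tuple ids with a known base placement, and makes multi-partition read-only transactions local by greedily adding read replicas on the dominant partition.

    from collections import Counter

    def greedy_replicate(read_only_txns, placement):
        """placement: tuple id -> home partition. Returns tuple id -> set of
        partitions holding a copy, after adding read replicas so that each
        read-only transaction can be served from a single partition."""
        replicas = {t: {p} for t, p in placement.items()}
        for txn in read_only_txns:
            parts = Counter(p for t in txn for p in replicas[t])
            home = parts.most_common(1)[0][0]   # dominant partition for this txn
            for t in txn:
                replicas[t].add(home)           # no-op if already resident
        return replicas

A real system must also bound the storage overhead of replication and keep replicas consistent under updates, which this sketch deliberately ignores.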
In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has traditionally been used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing size (progressive samples) for exploratory querying. This provides data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples, providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources, thereby enabling a substantial reduction in the cost incurred during such analytics.
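As a rough sketch of the progressive-sampling idea (not NOW!'s actual progress semantics or API): taking nested samples as prefixes of one seeded shuffle yields repeatable results, and a running aggregate reuses work across samples instead of recomputing each one from scratch. The step fractions below are arbitrary.

    import random

    def progressive_avg(values, steps=(0.01, 0.1, 0.5, 1.0), seed=42):
        """Yield (progress, estimate) pairs over nested, repeatable samples."""
        order = list(values)
        random.Random(seed).shuffle(order)      # deterministic sample order
        total, n = 0.0, 0
        for frac in steps:
            target = int(len(order) * frac)
            for v in order[n:target]:           # scan only the new delta
                total += v
            n = target
            yield frac, total / max(n, 1)       # early result plus its progress

    for frac, est in progressive_avg(range(1_000_000)):
        print(f"{frac:>5.0%} sample -> avg ~ {est:,.1f}")

Because the samples are prefixes of one fixed permutation, re-running the analysis reproduces the same early answers, which is the kind of repeatable semantics the abstract emphasizes.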
Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud. The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, and so on. These tasks are not well served by existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly accessing the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task, and loading them onto distributed memory, leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over large-scale graphs, while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud.
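A toy version of the neighborhood-centric model, assuming networkx and 1-hop ego networks as the declared subgraphs of interest; NSCALE's real API, subgraph extraction, and distributed execution engine are of course far more involved than this single-machine sketch.

    import networkx as nx

    def ego_apply(g, radius, fn):
        """Yield (vertex, fn(subgraph)) for every radius-hop ego network,
        letting the user program see a whole neighborhood, not one vertex."""
        for v in g.nodes:
            sub = nx.ego_graph(g, v, radius=radius)
            yield v, fn(sub)

    g = nx.karate_club_graph()
    # Example neighborhood-centric task: edge density of each ego network.
    density = dict(ego_apply(g, 1, nx.density))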
The results of our extensive experimental evaluation of these prototypes on several real-world datasets and applications validate the effectiveness of our techniques, which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.
Outlier Detection in Big Data
The dissertation focuses on scaling outlier detection to work on both huge static and dynamic streaming datasets. Outliers are patterns in the data that do not conform to expected behavior. Outlier detection techniques are broadly applied in applications ranging from credit fraud prevention and network intrusion detection to stock investment tactical planning. For such mission-critical applications, a timely response is often of paramount importance. Yet the processing of outlier detection requests is algorithmically complex and resource-intensive. In this dissertation we investigate the challenges of detecting outliers in big data -- in particular those caused by the high velocity of streaming data, the big volume of static data, and the large cardinality of the input parameter space for tuning outlier mining algorithms. Effective optimization techniques are proposed to assure the responsiveness of outlier detection in big data.
First, we propose a novel optimization framework called LEAP to continuously detect outliers over data streams. The continuous discovery of outliers is critical for a large range of online applications that monitor high-volume, continuously evolving streaming data. LEAP encompasses two general optimization principles that utilize the rarity of outliers and the temporal priority relationships among stream data points. Leveraging these two principles, LEAP not only continuously delivers outliers with respect to a set of popular outlier models, but also provides near real-time support for processing powerful outlier analytics workloads composed of large numbers of outlier mining requests with various parameter settings.
Second, we develop a distributed approach to efficiently detect outliers over massive-scale static datasets. In this big data era, as the volume of data advances to new levels, the power of distributed compute clusters must be employed to detect outliers in a short turnaround time. Our approach optimizes the key factors determining the efficiency of distributed data analytics, namely communication costs and load balancing. In particular, we prove that the traditional frequency-based load balancing assumption is not effective. We thus design a novel cost-driven data partitioning strategy that achieves load balancing. Furthermore, we abandon the traditional one-detection-algorithm-for-all-compute-nodes approach and instead propose a novel multi-tactic methodology that adaptively selects the most appropriate algorithm for each node based on the characteristics of the data partition assigned to it.
Third, traditional outlier detection systems process each individual outlier detection request, instantiated with a particular parameter setting, one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to home in on the most appropriate parameter setting or on the desired results. We thus design an interactive outlier exploration paradigm that is not only able to answer traditional outlier detection requests in near real-time, but also offers innovative outlier analytics tools to help analysts quickly extract, interpret, and understand the outliers of interest. Our experimental studies, including performance evaluations and user studies conducted on real-world datasets covering stock, sensor, moving object, and geolocation data, confirm both the effectiveness and efficiency of the proposed approaches.
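For intuition, here is a hedged sketch of a distance-threshold outlier detector over a sliding window that applies the two LEAP-style principles described above: minimal probing (stop as soon as k neighbors are found, since outliers are rare) and temporal priority (probe the newest points first, since their evidence expires last). It flags points only on arrival and uses 1-d values for simplicity; LEAP's actual framework is considerably more general.

    from collections import deque

    def window_outliers(stream, w=1000, k=5, r=1.0):
        """Yield points with fewer than k neighbors within distance r among
        the window of the last w points."""
        window = deque(maxlen=w)
        for x in stream:
            window.append(x)
            neighbors = 0
            for y in list(window)[-2::-1]:      # temporal priority: newest first
                if abs(x - y) <= r:
                    neighbors += 1
                    if neighbors >= k:          # minimal probing: stop early
                        break
            if neighbors < k:
                yield x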
Runtime Prediction for Scale-Out Data Analytics
Many analytics applications generate mixed workloads, i.e., workloads composed of analytical tasks with different processing characteristics, including data pre-processing, SQL, and iterative machine learning algorithms. Examples of such mixed workloads can be found in web data analysis, social media analysis, and graph analytics, where they are executed repetitively on large input datasets (e.g., "Find the average user time spent on the top 10 most popular web pages on the UK domain web graph."). Scale-out processing engines satisfy the needs of these applications by distributing the data and the processing task efficiently among multiple workers that are first reserved and then used to execute the task in parallel on a cluster of machines. Finding the resource allocation that can complete the workload execution within a given time constraint, and optimizing cluster resource allocations among multiple analytical workloads, motivates the need for estimating the runtime of a workload before its actual execution. Predicting the runtime of analytical workloads is a challenging problem, as runtime depends on a large number of factors that are hard to model prior to execution. These factors can be summarized as the workload characteristics (i.e., data statistics and processing costs), the execution configuration (i.e., deployment, resource allocation, and software settings), and the cost model that captures the interplay among all of the above parameters. While conventional cost models proposed in the context of query optimization can assess the relative order among alternative SQL query plans, they are not aimed at estimating absolute runtime. Additionally, conventional models are ill-equipped to estimate the runtime of iterative analytics that are executed repetitively until convergence, and that of user-defined data pre-processing operators which are not "owned" by the underlying data management system.
This thesis demonstrates that runtime for data analytics can be predicted accurately by breaking the analytical tasks into multiple processing phases, collecting key input features during a reference execution on a sample of the dataset, and then using the features to build per-phase cost models. We develop prediction models for three categories of data analytics produced by social media applications: iterative machine learning, data pre-processing, and reporting SQL. The prediction framework for iterative analytics, PREDIcT, addresses the challenging problem of estimating the number of iterations and the per-iteration runtime for a class of iterative machine learning algorithms that are run repetitively until convergence. The hybrid prediction models we develop for data pre-processing tasks and for reporting SQL combine the benefits of analytical modeling with those of machine-learning-based models. Through a training methodology and a pruning algorithm, we reduce the cost of running training queries to a minimum while maintaining a good level of accuracy for the models.
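A minimal sketch of the per-phase modelling idea, assuming scikit-learn, two synthetic features (input rows and worker count), and linear models per phase; the thesis's hybrid analytical/ML models, training methodology, and pruning are not reproduced here.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def fit_phase_models(training_runs):
        """training_runs: phase -> (feature matrix X, observed runtimes y),
        collected from reference executions on data samples."""
        return {phase: LinearRegression().fit(X, y)
                for phase, (X, y) in training_runs.items()}

    def predict_runtime(models, features):
        """Total runtime estimate: sum of the per-phase model predictions."""
        return sum(float(models[p].predict(
                       np.asarray(features[p], dtype=float).reshape(1, -1))[0])
                   for p in models)

    rng = np.random.default_rng(0)
    X = rng.uniform(1, 100, size=(20, 2))           # (input rows, workers)
    runs = {"scan":      (X, 0.5 * X[:, 0] / X[:, 1]),   # synthetic runtimes
            "aggregate": (X, 0.1 * X[:, 0])}
    models = fit_phase_models(runs)
    print(predict_runtime(models, {"scan": [80, 8], "aggregate": [80, 8]}))

Decomposing the workload into phases is what makes this tractable: each phase has a simpler, more learnable cost structure than the end-to-end job.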
Compilation and Code Optimization for Data Analytics
The trade-offs between the use of modern high-level and low-level programming languages in constructing complex software artifacts are well known. High-level languages allow for greater programmer productivity: abstraction and genericity allow for the same functionality to be implemented with significantly less code compared to low-level languages. Modularity, object-orientation, functional programming, and powerful type systems allow programmers not only to create clean abstractions and protect them from leaking, but also to define code units that are reusable and easily composable, and software architectures that are adaptable and extensible. The abstraction, succinctness, and modularity of high-level code help to avoid software bugs and facilitate debugging and maintenance.
The use of high-level languages comes at a performance cost: increased indirection due to abstraction, virtualization, and interpretation, and superfluous work, particularly in the form of temporary memory allocation and deallocation to support objects and encapsulation.
As a result of this, the cost of high-level languages for performance-critical systems may seem prohibitive.
The vision of abstraction without regret argues that it is possible to use high-level languages for building performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter. In this thesis, we realize this vision for building different types of data analytics systems. Our means of achieving this is by employing compilation. The goal is to compile away expensive language features -- to compile high-level code down to efficient low-level code.
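A toy Python illustration of the principle (not the thesis's compilers): the same expression tree is first evaluated by a recursive interpreter, then specialized by generating and compiling flat code, which removes the per-node interpretive indirection on every subsequent call.

    Var = lambda n: ("var", n)
    Add = lambda a, b: ("add", a, b)
    Mul = lambda a, b: ("mul", a, b)

    def interp(e, env):
        """Recursive interpreter: pays dispatch cost at every node, every call."""
        tag = e[0]
        if tag == "var":
            return env[e[1]]
        l, r = interp(e[1], env), interp(e[2], env)
        return l + r if tag == "add" else l * r

    def emit(e):
        """Stage the tree into flat source code."""
        if e[0] == "var":
            return e[1]
        op = "+" if e[0] == "add" else "*"
        return f"({emit(e[1])} {op} {emit(e[2])})"

    def compile_expr(e, names):
        """One-time code generation; the result has no tree walk at runtime."""
        ns = {}
        exec(f"def f({', '.join(names)}): return {emit(e)}", ns)
        return ns["f"]

    expr = Mul(Add(Var("x"), Var("y")), Var("x"))
    f = compile_expr(expr, ["x", "y"])
    assert f(3, 4) == interp(expr, {"x": 3, "y": 4}) == 21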
Managing Smartphone Testbeds with SmartLab
The explosive growth in the number of smartphones with ever-growing sensing and computing capabilities has brought a paradigm shift to many traditional domains of the computing field. Re-programming smartphones and instrumenting them for application testing and data gathering at scale is currently a tedious and time-consuming process that poses significant logistical challenges. In this paper, we make three major contributions. First, we propose a comprehensive architecture, coined SmartLab, for managing a cluster of both real and virtual smartphones that are either wired to a private cloud or connected over a wireless link. Second, we propose and describe a number of Android management optimizations (e.g., command pipelining, screen capturing, file management), which can be useful to the community for building similar functionality into their systems. Third, we conduct extensive experiments and microbenchmarks to support our design choices, providing qualitative evidence on the expected performance of each module comprising our architecture. This paper also overviews our experiences of using SmartLab in a research-oriented setting, as well as ongoing and future development efforts.
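For a flavor of what concurrent device management looks like at the lowest level, here is a sketch that fans a single adb shell command out to all attached devices in parallel. It uses only standard adb invocations (adb devices, adb -s SERIAL shell) and is not SmartLab's code, which layers far richer management (screen capturing, file management, virtual devices) on top.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def list_devices():
        """Parse serial numbers of attached devices from 'adb devices'."""
        out = subprocess.run(["adb", "devices"],
                             capture_output=True, text=True).stdout
        return [line.split()[0] for line in out.splitlines()[1:]
                if "\tdevice" in line]

    def shell_all(cmd, serials):
        """Run one shell command on every device concurrently."""
        def run(serial):
            r = subprocess.run(["adb", "-s", serial, "shell", cmd],
                               capture_output=True, text=True, timeout=30)
            return serial, r.stdout.strip()
        with ThreadPoolExecutor(max_workers=len(serials) or 1) as pool:
            return dict(pool.map(run, serials))

    # e.g. shell_all("getprop ro.build.version.release", list_devices())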
Digitalisation and Business Model Innovation: Exploring the Microfoundations of Dynamic Consistency
The Industry 4.0 paradigm (I4.0), the digitalisation of manufacturing firms, denotes the exploitation of real-time data originating from a ubiquitous interconnection of objects, machines, and humans (via the internet) across the entire value network. I4.0 not only serves as a catalyst to improve value-adding activities or to design new product and service solutions but also, more fundamentally, enables manufacturing firms to innovate their established business models (BMs). Against this rapid socio-technological shift, manufacturers face the challenge of holistically innovating their BMs. This requires the individualisation of the value proposition alongside the flexibilisation of their value-creating and value-capturing activities, as well as a continuous adaptation and alignment of these activities with the firm's organisational systems and its resource and competence base. Adopting the view of business model innovation (BMI) as a system of interdependent activities, the continuous alignment of activities across the BMI is called dynamic consistency. However, it is not clear what mechanisms underpin the notion of dynamic consistency. This thesis operationalises the microfoundations of dynamic consistency in an I4.0-driven BMI by empirically investigating six European manufacturing firms. Following the design themes of BMI, it argues that the notion of dynamic consistency comprises three main aspects: (1) a value focus on data and software; (2) a flexi-directional interlinkage to facilitate the exchange of information and materials; and (3) agile working ensembles governing changes to the activity system. Moreover, it proposes open-mindedness and integrity of behaviour as a cognitive foundation that facilitates changes to the activity system. Taken together, these microfoundations provide reasoning for manufacturing firms to transform their traditional make-and-sell BM into a sense-and-act BM, yielding higher profits and profitability. The results demonstrate that the notion of BMI as an activity system must be complemented by the cognitive perspective of BMI to sufficiently operationalise the concept of dynamic consistency. This thesis is anticipated to be a starting point for further studies on achieving consistency during I4.0-driven BMI to generate superior and sustained value appropriation for manufacturing firms.
Ford Britain Trust, Queens' College