Designing a Modern Software Engineering Training Program with Cloud Computing
The software engineering industry is trending towards cloud computing. For our project, we assessed the various tools and practices used in modern software development. The main goals of this project were to create a reference model for developing cloud-based applications, to program a functional cloud-based prototype, and to develop an accompanying training manual. These materials will be incorporated into the software engineering courses at WPI, namely CS 3733 and CS 509.
Performance modelling and optimization for video-analytic algorithms in a cloud-like environment using machine learning
CCTV cameras produce a large amount of video surveillance data per day, and analysing it requires significant computing resources that often need to be scalable. The emergence of the Hadoop distributed processing framework has had a significant impact on various data-intensive applications, as distributed processing increases the processing capability of the applications it serves. Hadoop is an open-source implementation of the MapReduce programming model (a minimal sketch of this programming model follows below). It automates the creation of tasks for each function, distributes data, parallelizes execution and handles machine failures, relieving users of the complexity of managing the underlying processing so that they can focus on building their applications. In a practical deployment, the challenge of a Hadoop-based architecture is that it requires several scalable machines for effective processing, which in turn adds hardware investment cost to the infrastructure. Although a cloud infrastructure offers scalable and elastic utilization of resources, where users can scale the number of Virtual Machines (VMs) up or down as required, a user such as a CCTV system operator intending to use a public cloud would want to know what cloud resources (i.e. how many VMs) need to be deployed so that the processing can be done in the fastest (or within a known time constraint) and most cost-effective manner. Often such resources will also have to satisfy practical, procedural and legal requirements. The capability to model a distributed processing architecture in which the resource requirements can be effectively and optimally predicted would therefore be a useful tool. In the literature there is no clear and comprehensive modelling framework that provides proactive resource allocation mechanisms to satisfy a user's target requirements, especially for a processing-intensive application such as video analytics.
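As a minimal illustration of the MapReduce programming model mentioned above (a plain word-count job in the style of Hadoop Streaming, not one of the video-analytic jobs studied in the thesis), the sketch below shows the division of labour: the user supplies only a map function and a reduce function over key/value pairs, while Hadoop handles task creation, data distribution, parallel execution and failure recovery.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming word-count sketch. Hadoop creates the tasks,
# distributes the input splits, runs mappers and reducers in parallel and
# retries failed tasks; the user code only maps and reduces key/value pairs.
import sys
from itertools import groupby

def mapper(stream):
    # map: emit (word, 1) for every word in the assigned input split
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")

def reducer(stream):
    # reduce: Hadoop delivers pairs sorted by key, so counts can be grouped and summed
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # invoked by Hadoop Streaming as e.g. `wordcount.py map` / `wordcount.py reduce`
    mapper(sys.stdin) if sys.argv[1:] == ["map"] else reducer(sys.stdin)
```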
In this thesis, with the hope of closing the above research gap, novel research is first initiated by developing an understanding of the current legal practices and requirements of implementing a video surveillance system within a distributed processing and data storage environment, since the legal validity of data gathered or processed within such a system is vital for a distributed system's applicability in such domains. Subsequently, the thesis presents a comprehensive framework for the performance modelling and optimization of resource allocation when deploying a scalable distributed video analytic application in a Hadoop-based framework running on a virtualized cluster of machines.
The proposed modelling framework investigates the use of several machine learning algorithms, such as decision trees (M5P, RepTree), Linear Regression, the Multi-Layer Perceptron (MLP) and the Bagging ensemble classifier, to model and predict the execution time of video analytic jobs based on infrastructure-level as well as job-level parameters. Further, in order to allocate resources under constraints and obtain optimal performance in terms of job execution time, we propose a Genetic Algorithm (GA) based optimization technique.
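As a rough illustration of this kind of pipeline (not the thesis's specific learners, features or GA operators), the sketch below trains a regression model on a hypothetical benchmark log and then uses a simple genetic algorithm to search for the VM configuration with the lowest predicted execution time; the file name, column names, parameter ranges and the use of scikit-learn's RandomForestRegressor are all illustrative assumptions.

```python
# Sketch: learn execution time from job/infrastructure parameters, then use a
# simple genetic algorithm to pick a VM configuration that minimises the
# predicted time. File name, columns and ranges are illustrative assumptions.
import random
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

history = pd.read_csv("job_history.csv")          # hypothetical benchmark log
features = ["num_vms", "vcpus_per_vm", "ram_gb_per_vm", "input_size_gb"]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(history[features], history["exec_time_s"])

def predict_time(num_vms, vcpus, ram_gb, input_size_gb):
    x = pd.DataFrame([[num_vms, vcpus, ram_gb, input_size_gb]], columns=features)
    return float(model.predict(x)[0])

def ga_minimise(input_size_gb, max_vms=32, generations=50, pop_size=30):
    """Evolve (num_vms, vcpus, ram_gb) to minimise predicted execution time."""
    rand_gene = lambda: (random.randint(1, max_vms), random.choice([2, 4, 8]),
                         random.choice([4, 8, 16]))
    pop = [rand_gene() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: predict_time(*g, input_size_gb))
        parents = pop[: pop_size // 2]                                # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = tuple(random.choice(pair) for pair in zip(a, b))  # crossover
            if random.random() < 0.2:                                 # mutation
                child = rand_gene()
            children.append(child)
        pop = parents + children
    best = min(pop, key=lambda g: predict_time(*g, input_size_gb))
    return best, predict_time(*best, input_size_gb)

best_cfg, best_time = ga_minimise(input_size_gb=500)
```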
Experimental results are provided to demonstrate the proposed framework's capability to successfully predict the job execution time of a given video analytic task based on infrastructure and input-data related parameters, and its ability to determine the minimum job execution time given constraints on these parameters. Given the above, the thesis contributes to the state of the art in the design, implementation, performance analysis and optimisation of distributed video analytics.
Big Data and Large-scale Data Analytics: Efficiency of Sustainable Scalability and Security of Centralized Clouds and Edge Deployment Architectures
One of the significant shifts in next-generation computing technologies will certainly be in the development of Big Data (BD) deployment architectures. Apache Hadoop, the BD landmark, has evolved into a widely deployed BD operating system. Its new features include a federation structure and many associated frameworks, which give Hadoop 3.x the maturity to serve different markets. This dissertation addresses two leading issues involved in exploiting the BD and large-scale data analytics realm using the Hadoop platform: (i) scalability, which directly affects system performance and overall throughput, addressed using portable Docker containers; and (ii) security, which spreads the adoption of data protection practices among practitioners, addressed using access controls. An Enhanced MapReduce Environment (EME), an OPportunistic and Elastic Resource Allocation (OPERA) scheduler, a BD Federation Access Broker (BDFAB), and a Secure Intelligent Transportation System (SITS) with a multi-tier architecture for streaming data to the cloud are the main contributions of this thesis study.
Designing, Building, and Modeling Maneuverable Applications within Shared Computing Resources
Extending the military principle of maneuver into the war-fighting domain of cyberspace, academic and military researchers have produced many theoretical and strategic works, though few have focused on researching actual applications and systems that apply this principle. We present our research in designing, building and modeling maneuverable applications in order to gain the system advantages of resource provisioning, application optimization, and cybersecurity improvement. We have coined the phrase “Maneuverable Applications”, defined as distributed and parallel applications that take advantage of the modification, relocation, addition or removal of computing resources, giving the perception of movement. Our work with maneuverable applications has been within shared computing resources, such as the Clemson University Palmetto cluster, where multiple users share access and time to a collection of inter-networked computers and servers. In this dissertation, we describe our implementation and analytic modeling of environments and systems to maneuver computational nodes, network capabilities, and security enhancements for overcoming challenges to a cyberspace platform. Specifically, we describe our work to create a system to provision a big data computational resource within academic environments. We also present a computing testbed built to allow researchers to study network optimizations of data centers. We discuss our Petri Net model of an adaptable system, which increases its cybersecurity posture in the face of varying levels of threat from malicious actors. Lastly, we present work and investigation into integrating these technologies into a prototype resource manager for maneuverable applications and validating our model using this implementation.
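For readers unfamiliar with the formalism, the sketch below shows the generic Petri net mechanics (places holding tokens, transitions firing when their input places are marked) that such adaptability models build on; the places, transitions and toy hardening scenario here are invented illustrations, not the dissertation's actual model.

```python
# Generic Petri net sketch: places hold tokens; a transition fires when every
# input place holds a token, consuming those tokens and marking its outputs.
class PetriNet:
    def __init__(self, marking):
        self.marking = dict(marking)          # place -> token count
        self.transitions = {}                 # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) > 0 for p in inputs)

    def fire(self, name):
        inputs, outputs = self.transitions[name]
        if not self.enabled(name):
            raise ValueError(f"transition {name} is not enabled")
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1

# Toy example (invented): a node hardens its posture once a threat is observed.
net = PetriNet({"normal": 1, "threat_detected": 0, "hardened": 0})
net.add_transition("observe_threat", ["normal"], ["normal", "threat_detected"])
net.add_transition("harden", ["normal", "threat_detected"], ["hardened"])
net.fire("observe_threat")
net.fire("harden")
print(net.marking)   # {'normal': 0, 'threat_detected': 0, 'hardened': 1}
```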
Evaluation of Docker Containers for Scientific Workloads in the Cloud
The HPC community is actively researching and evaluating tools to support execution of scientific applications in cloud-based environments. Among the various technologies, containers have recently gained importance as they have significantly better performance compared to full-scale virtualization, support microservices and DevOps, and work seamlessly with workflow and orchestration tools. Docker is currently the leader in containerization technology because it offers low overhead, flexibility, portability of applications, and reproducibility. Singularity is another container solution that is of interest, as it is designed specifically for scientific applications. It is important to conduct performance and feature analysis of container technologies to understand their applicability for each application and target execution environment. This paper presents (1) a performance evaluation of Docker and Singularity on bare-metal nodes in the Chameleon cloud, (2) a mechanism by which Docker containers can be mapped to InfiniBand hardware with RDMA communication (a minimal sketch of this device mapping follows below), and (3) an analysis of mapping elements of parallel workloads to containers for optimal resource management with container-ready orchestration tools. Our experiments are targeted toward application developers so that they can make informed decisions on choosing the container technologies and approaches that are suitable for their HPC workloads on cloud infrastructure. Our performance analysis shows that scientific workloads in both Docker and Singularity based containers can achieve near-native performance. Singularity is designed specifically for HPC workloads; however, Docker still has advantages over Singularity for use in clouds, as it provides overlay networking and an intuitive way to run MPI applications with one container per rank for fine-grained resource allocation.
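The following is a rough sketch of one common way the host's InfiniBand devices can be handed to a Docker container so an RDMA-capable workload can run inside it; the image name and benchmark path are invented placeholders, and the exact flags and mapping used in the paper's setup may differ.

```python
# Sketch: expose the host's InfiniBand verbs devices to a Docker container for
# RDMA. Image name and benchmark binary are hypothetical placeholders.
import subprocess

cmd = [
    "docker", "run", "--rm",
    "--net=host",                    # share host networking with the container
    "--device=/dev/infiniband",      # expose IB verbs devices for RDMA
    "--ulimit", "memlock=-1:-1",     # allow pinned (locked) memory for RDMA buffers
    "example/hpc-benchmark:latest",  # hypothetical container image
    "/opt/benchmarks/osu_bw",        # hypothetical RDMA bandwidth benchmark
]
subprocess.run(cmd, check=True)
```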
Algorithms for advance bandwidth reservation in media production networks
Media production generally requires many geographically distributed actors (e.g., production houses, broadcasters, advertisers) to exchange huge amounts of raw video and audio data. Traditional distribution techniques, such as dedicated point-to-point optical links, are highly inefficient in terms of installation time and cost. To improve efficiency, shared media production networks that connect all involved actors over a large geographical area are currently being deployed. The traffic in such networks is often predictable, as the timing and bandwidth requirements of data transfers are generally known hours or even days in advance. As such, the use of advance bandwidth reservation (AR) can greatly increase resource utilization and cost efficiency. In this paper, we propose an Integer Linear Programming (ILP) formulation of the bandwidth scheduling problem that takes into account the specific characteristics of media production networks. Two novel optimization algorithms based on this model are thoroughly evaluated and compared by means of in-depth simulation results.
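To make the flavour of such an advance-reservation ILP concrete, the toy sketch below (using the PuLP modelling library) admits requests with known time windows and bandwidth demands onto candidate paths without exceeding per-link capacity in any timeslot; the topology, capacities, requests and objective are invented illustrative data, not the formulation from the paper.

```python
# Toy advance-reservation ILP: pick at most one candidate path per request so
# that no link exceeds its capacity in any timeslot; maximise admitted demand.
import pulp

capacity = {("A", "B"): 10, ("B", "C"): 10, ("A", "C"): 5}   # Gbps per link (invented)
requests = {                                                  # id: (bw, start, end, candidate paths)
    "r1": (6, 0, 3, [[("A", "B"), ("B", "C")], [("A", "C")]]),
    "r2": (5, 1, 4, [[("A", "C")], [("A", "B"), ("B", "C")]]),
    "r3": (4, 2, 5, [[("A", "B"), ("B", "C")]]),
}
slots = range(6)

prob = pulp.LpProblem("advance_reservation", pulp.LpMaximize)
x = {(r, p): pulp.LpVariable(f"x_{r}_{p}", cat="Binary")
     for r, (_, _, _, paths) in requests.items() for p in range(len(paths))}

# Each request is admitted on at most one candidate path.
for r, (_, _, _, paths) in requests.items():
    prob += pulp.lpSum(x[r, p] for p in range(len(paths))) <= 1

# Link capacity must hold in every timeslot of every active reservation.
for link, cap in capacity.items():
    for t in slots:
        prob += pulp.lpSum(
            bw * x[r, p]
            for r, (bw, start, end, paths) in requests.items()
            for p, path in enumerate(paths)
            if link in path and start <= t <= end
        ) <= cap

# Objective: admit as much demanded bandwidth as possible.
prob += pulp.lpSum(bw * x[r, p]
                   for r, (bw, _, _, paths) in requests.items()
                   for p in range(len(paths)))
prob.solve()
for (r, p), var in x.items():
    if var.value() == 1:
        print(f"{r} admitted on path {requests[r][3][p]}")
```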
Large Scale Hierarchical K-Means Based Image Retrieval With MapReduce
Image retrieval remains one of the most heavily researched areas in Computer Vision. Image retrieval methods have been used in autonomous vehicle localization research, object recognition applications, and commercially in projects such as Google Glass. Current methods for image retrieval become problematic when implemented on image datasets that can easily reach billions of images. In order to process these growing datasets, we distribute the necessary computation for image retrieval among a cluster of machines using Apache Hadoop. While there are many techniques for image retrieval, we focus on systems that use Hierarchical K-Means Trees. Successful image retrieval systems based on Hierarchical K-Means Trees have been built using the tree as a Visual Vocabulary to build an Inverted File Index and implementing a Bag of Words retrieval approach, or by building the tree as a Full Representation of every image in the database and implementing a K-Nearest Neighbor voting scheme for retrieval. Both approaches involve different levels of approximation, and each has strengths and weaknesses that must be weighed in accordance with the needs of the application. Both approaches are implemented with MapReduce for the first time and compared in terms of image retrieval precision, index creation run-time, and image retrieval throughput. Experiments that include up to 2 million images running on 20 virtual machines are shown.
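As a rough single-machine sketch of the Visual Vocabulary / Inverted File Index idea (without the MapReduce distribution from the paper), the code below recursively clusters local descriptors into a small hierarchical k-means tree, treats each leaf as a visual word, and builds an inverted index from words to images; the branching factor, depth and random descriptors are illustrative assumptions. Retrieval would then score database images that share many visual words with the query, typically with TF-IDF weighting.

```python
# Vocabulary-tree sketch: recursive k-means over descriptors, leaves as visual
# words, and an inverted index mapping each word to the images containing it.
from collections import defaultdict
import numpy as np
from sklearn.cluster import KMeans

def build_tree(descriptors, branch=4, depth=3):
    """Recursively cluster descriptors into a hierarchical k-means tree."""
    if depth == 0 or len(descriptors) < branch:
        return None                                   # leaf node
    km = KMeans(n_clusters=branch, n_init=4, random_state=0).fit(descriptors)
    children = [build_tree(descriptors[km.labels_ == c], branch, depth - 1)
                for c in range(branch)]
    return {"km": km, "children": children}

def word_id(tree, descriptor):
    """Descend the tree to a leaf; the path taken is the visual word."""
    path, node = [], tree
    while node is not None:
        c = int(node["km"].predict(descriptor.reshape(1, -1))[0])
        path.append(c)
        node = node["children"][c]
    return tuple(path)

rng = np.random.default_rng(0)
db = {img: rng.normal(size=(200, 64)) for img in ["img0", "img1", "img2"]}  # fake SIFT-like descriptors
tree = build_tree(np.vstack(list(db.values())))

inverted_index = defaultdict(set)          # visual word -> images containing it
for img, descs in db.items():
    for d in descs:
        inverted_index[word_id(tree, d)].add(img)
```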
Storage and Analysis of Big Data Tools for Sessionized Data
The Oracle database currently used to mine data at PEGGY is approaching end-of-life and an infrastructure overhaul is required. It has also been identified that a critical business requirement is the need to load and store very large historical data sets. These data sets contain raw electronic consumer events and interactions from a website, such as page views, clicks, downloads, return visits, length of time spent on pages, and how visitors got to the site (i.e. where they originated).
This project will be focused on finding a tool to analyze and measure sessionized data, a unit of measurement in web analytics that captures either a user's actions within a particular time period, or the process of segmenting each user's activity into sessions, each representing a single visit to the site. This sessionized data can be used as the input for a variety of data mining tasks such as clustering, association rule mining, and sequence mining (Ansari, 2011). This sessionized data must be delivered in a reorganized and readable format quickly enough to make informed go-to-market decisions as they relate to current and emerging industry trends. It is also pertinent to understand any development work required and the burden on resources.
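As a rough illustration of what sessionizing raw events involves (independent of any particular tool evaluated in this project), the pandas sketch below splits each user's clickstream into sessions using a 30-minute inactivity timeout; the file and column names are illustrative assumptions.

```python
# Minimal sessionization sketch: a new session starts whenever a user has been
# inactive for more than 30 minutes. File and column names are hypothetical.
import pandas as pd

events = pd.read_csv("clickstream.csv", parse_dates=["timestamp"])  # hypothetical export
events = events.sort_values(["user_id", "timestamp"])

# A gap longer than 30 minutes between consecutive events marks a new session.
gap = events.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=30)
events["session_id"] = gap.groupby(events["user_id"]).cumsum()      # per-user session counter

sessions = events.groupby(["user_id", "session_id"]).agg(
    start=("timestamp", "min"),
    end=("timestamp", "max"),
    page_views=("page_url", "count"),
)
sessions["duration"] = sessions["end"] - sessions["start"]
```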
Legacy on-premise data warehouse solutions are becoming more expensive, less efficient, less dynamic, and less scalable when compared to current cloud Infrastructure as a Service (IaaS) offerings that provide real-time, on-demand, pay-as-you-go solutions. Therefore, this study will examine the total cost of ownership (TCO) by considering, researching, and analyzing the following factors against a system-wide upgrade of the current on-premise Oracle Real Application Cluster (RAC) system: high performance, i.e. real-time (or as close to it as possible) query speed against sessionized data; SQL compliance; a cloud-based or at least hybrid (on-premise paired with cloud) deployment; security, with encryption preferred; and cost structure, i.e. a cost-effective pay-as-you-go pricing model and the resources required for the migration and operations.
The technologies analyzed against the current Oracle database are Amazon Redshift, Google BigQuery, Hadoop, and Hadoop + Hive.
The cost of building an on-premise data warehouse is substantial. The project will determine the performance capabilities and affordability of Amazon Redshift, when compared to other emerging highly ranked solutions, for running e-commerce standard analytics queries on terabytes of sessionized data. Rather than redesigning, upgrading, or over-purchasing infrastructure at a high cost for an on-premise data warehouse, this project considers data warehousing solutions through cloud-based Infrastructure as a Service (IaaS) offerings. The proposed objective of this project is to determine the most cost-effective high performer among Amazon Redshift, Apache Hadoop, and Google BigQuery when running e-commerce standard analytics queries on terabytes of sessionized data.