Cloud Storage Performance and Security Analysis with Hadoop and GridFTP
Even though cloud servers have been around for a few years, most web hosts today have not yet converted to the cloud. If the purpose of a cloud server is distributing and storing files on the internet, FTP servers predate the cloud, and an FTP server is sufficient to distribute content on the internet. Is it therefore worth shifting from an FTP server to a cloud server? Cloud storage providers advertise high durability and availability for their users, and the ability to scale up storage space easily can save users a great deal of money. However, do they provide higher performance and better security features? Hadoop is a very popular platform for cloud computing. It is free software under the Apache License, written in Java, and supports large-scale data processing in a distributed environment. Characteristics of Hadoop include partitioning of data, computing across thousands of hosts, and executing application computations in parallel. The Hadoop Distributed File System (HDFS) allows rapid data transfer at scales of up to thousands of terabytes and is capable of operating even in the case of node failure. GridFTP supports high-speed data transfer over wide-area networks; it is based on FTP and features multiple data channels for parallel transfers. This report describes the technology behind HDFS and the enhancement of Hadoop's security features with Kerberos. Based on the data transfer performance and security features of HDFS and a GridFTP server, we can decide whether the GridFTP server should be replaced with HDFS. According to our experimental results, we conclude that the GridFTP server provides better throughput than HDFS, and that Kerberos has minimal impact on HDFS performance. We propose a solution in which users first authenticate with HDFS and then transfer the file from the HDFS server to the client using GridFTP.
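For illustration only (this sketch is not taken from the report), the following Java snippet shows the standard way a Hadoop client authenticates against a Kerberos-secured HDFS cluster and then reads a file using the stock Hadoop API; the principal, keytab path, and file path are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberizedHdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Tell the Hadoop client to use Kerberos authentication.
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);
            // Principal and keytab path are placeholders for illustration.
            UserGroupInformation.loginUserFromKeytab(
                    "user@EXAMPLE.COM", "/etc/security/keytabs/user.keytab");

            // Once authenticated, read a file from HDFS as usual.
            FileSystem fs = FileSystem.get(conf);
            try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                }
                System.out.flush();
            }
            fs.close();
        }
    }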
Only Aggressive Elephants are Fast Elephants
Yellow elephants are slow. A major reason is that they consume their inputs
entirely before responding to an elephant rider's orders. Some clever riders
have trained their yellow elephants to only consume parts of the inputs before
responding. However, the teaching time to make an elephant do that is high. So
high that the teaching lessons often do not pay off. We take a different
approach. We make elephants aggressive; only this will make them very fast. We
propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and
Hadoop MapReduce that dramatically improves runtimes of several classes of
MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create
different clustered indexes on each data block replica. An interesting feature
of HAIL is that we typically create a win-win situation: we improve both data
upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of
data upload, HAIL improves over HDFS by up to 60% with the default replication
factor of three. In terms of query execution, we demonstrate that HAIL runs up
to 68x faster than Hadoop. In our experiments, we use six clusters including
physical and EC2 clusters of up to 100 nodes. A series of scalability
experiments also demonstrates the superiority of HAIL. Comment: VLDB201
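As a purely conceptual illustration (not HAIL's actual code or API), the sketch below shows the core idea of clustering the same block of records by a different key on each replica, so that a job filtering on either attribute can be routed to a suitably sorted copy; it assumes Java 16+ for the record syntax:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Toy record with two attributes that could serve as clustered-index keys.
    record LineItem(int orderId, double price) {}

    public class PerReplicaIndexSketch {
        public static void main(String[] args) {
            List<LineItem> block = new ArrayList<>(List.of(
                    new LineItem(42, 9.99),
                    new LineItem(7, 129.00),
                    new LineItem(19, 3.50)));

            // Replica 1: records clustered (sorted) by orderId.
            List<LineItem> replica1 = new ArrayList<>(block);
            replica1.sort(Comparator.comparingInt(LineItem::orderId));

            // Replica 2: the same records clustered by price instead.
            List<LineItem> replica2 = new ArrayList<>(block);
            replica2.sort(Comparator.comparingDouble(LineItem::price));

            // A job filtering on orderId would read replica 1,
            // a job filtering on price would read replica 2.
            System.out.println(replica1);
            System.out.println(replica2);
        }
    }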
Data locality in Hadoop
Current market tendencies show the need to store and process rapidly
growing amounts of data, which implies a demand for distributed storage
and data processing systems. Apache Hadoop is an open-source framework
for managing such computing clusters in an effective, fault-tolerant
way.
When dealing with large volumes of data, Hadoop and its storage system HDFS
(Hadoop Distributed File System) face the challenge of keeping efficiency high
and computation times reasonable. A typical Hadoop deployment therefore moves
computation to the data rather than shipping data across the cluster, since
moving large quantities of data through the network could significantly delay
data processing tasks. When scheduling a task, Hadoop favours local data access
and chooses blocks from the nearest nodes; any remaining blocks are moved only
when they are needed by the given task.
To support Hadoop’s data-locality preferences, this thesis proposes adding
new functionality to its distributed file system (HDFS) that enables moving
data blocks on request. Shipping data in advance makes it possible to forcibly
redistribute data between nodes in order to adapt it easily to the given
processing tasks. The new functionality enables instructed movement of data
blocks within the cluster: data can be shifted either by a user running the
appropriate HDFS shell command or programmatically by another module, such as
a suitable scheduler.
To develop this functionality, a detailed analysis of the Apache Hadoop
source code and its components (specifically HDFS) was conducted. This research
resulted in a deep understanding of the internal architecture, which made it
possible to compare the candidate approaches to the desired solution and to
develop the chosen one.
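For context, the following sketch (not taken from the thesis) uses the standard Hadoop Java API to ask the NameNode where a file's blocks currently reside, which is the locality information a scheduler or a block-movement module would build on; the file path is a placeholder, and the thesis's own block-movement command is not shown:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The path is a placeholder; any HDFS file works.
            Path file = new Path("/data/input.csv");
            FileStatus status = fs.getFileStatus(file);

            // Ask the NameNode which DataNodes hold each block of the file.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }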
Topology of the Galaxy Distribution in the Hubble Deep Fields
We have studied topology of the distribution of the high redshift galaxies
identified in the Hubble Deep Field (HDF) North and South. The two-dimensional
genus is measured from the projected distributions of the HDF galaxies at
over a range of angular scales. We have also divided the samples into
three redshift slices with roughly equal numbers of galaxies using photometric
redshifts to see possible evolutionary effects on the topology.
The genus curve of the HDF North clearly indicates clustering of galaxies
over the Poisson distribution while the clustering is somewhat weaker in the
HDF South. This clustering is mainly due to the nearer galaxies in the samples.
We have also found that the genus curve of galaxies in the HDF is consistent
with the Gaussian random phase distribution with no significant redshift
dependence. Comment: 14 pages, 4 figures, submitted to Ap
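For reference, the Gaussian random-phase prediction against which such genus curves are compared has, in two dimensions, the well-known form below, where nu is the density threshold in units of the standard deviation and the amplitude A depends on the power spectrum of the smoothed field:

    g_{2D}(\nu) = A \, \nu \, e^{-\nu^{2}/2}

A measured genus curve close to this shape indicates a distribution consistent with Gaussian random-phase initial conditions.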
Direct Measurements of the Stellar Continua and Balmer/4000 Angstrom Breaks of Red z>2 Galaxies: Redshifts and Improved Constraints on Stellar Populations
We use near-infrared (NIR) spectroscopy obtained with GNIRS on Gemini,
NIRSPEC on KECK, and ISAAC on the VLT to study the rest-frame optical continua
of three `Distant Red Galaxies' (having Js - Ks > 2.3) at z>2. All three galaxy
spectra show the Balmer/4000 Angstrom break in the rest-frame optical. The
spectra allow us to determine spectroscopic redshifts from the continuum with
an estimated accuracy dz/(1+z) ~ 0.001-0.04. These redshifts agree well with
the emission line redshifts for the 2 galaxies with Halpha emission. This
technique is particularly important for galaxies that are faint in the
rest-frame UV, as they are underrepresented in high redshift samples selected
in optical surveys and are too faint for optical spectroscopy. Furthermore, we
use the break, continuum shape, and equivalent width of Halpha together with
evolutionary synthesis models to constrain the age, star formation timescale,
dust content, stellar mass and star formation rate of the galaxies. Inclusion
of the NIR spectra in the stellar population fits greatly reduces the range of
possible solutions for stellar population properties. We find that the stellar
populations differ greatly among the three galaxies, ranging from a young dusty
starburst with a small break and strong emission lines to an evolved galaxy
with a strong break and no detected line emission. The dusty starburst galaxy
has an age of 0.3 Gyr and a stellar mass of 1*10^11 Msun. The spectra of the
two most evolved galaxies imply ages of 1.3-1.4 Gyr and stellar masses of
4*10^11 Msun. The large range of properties seen in these galaxies strengthens
our previous much more uncertain results from broadband photometry. Larger
samples are required to determine the relative frequency of dusty starbursts
and (nearly) passively evolving galaxies at z~2.5. Comment: Accepted for publication in the Astrophysical Journal. 12 pages, 6
figures
Gaining insight from large data volumes with ease
Efficient handling of large data-volumes becomes a necessity in today's
world. It is driven by the desire to get more insight from the data and to gain
a better understanding of user trends, which can be transformed into economic
incentives (profits, cost reduction, and various optimizations of data
workflows and pipelines). In this paper, we discuss how modern technologies are
transforming well established patterns in HEP communities. The new data insight
can be achieved by embracing Big Data tools for a variety of use-cases, from
analytics and monitoring to training Machine Learning models on a terabyte
scale. We provide concrete examples within the context of the CMS experiment,
where Big Data tools already play or could play a significant role in daily
operations.
Design Architecture-Based on Web Server and Application Cluster in Cloud Environment
The cloud has been a computational and storage solution for many data-centric
organizations. The problem those organizations face today is searching cloud
data in an efficient manner. A framework is required to distribute the work of
searching and fetching across thousands of computers. The data in HDFS is
scattered and takes a long time to retrieve. The main idea is to design a web
server in the map phase using the Jetty web server, which provides a fast and
efficient way of searching data in the MapReduce paradigm. For real-time
processing on Hadoop, a search mechanism is implemented in HDFS by creating a
multi-level index, with multi-level index keys, in the web server. The web
server is used to handle traffic throughput, and web clustering technology can
improve application performance. To keep the workload down, the load balancer
should automatically be able to distribute load to newly added nodes in the
server cluster.
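As a hedged illustration of the embedded-web-server part of this design (assuming Jetty 9.x; the index lookup is only stubbed out, and the handler logic is hypothetical rather than taken from the paper):

    import java.io.IOException;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.eclipse.jetty.server.Request;
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.handler.AbstractHandler;

    public class EmbeddedSearchServer {
        public static void main(String[] args) throws Exception {
            Server server = new Server(8080); // listen on port 8080
            server.setHandler(new AbstractHandler() {
                @Override
                public void handle(String target, Request baseRequest,
                                   HttpServletRequest request,
                                   HttpServletResponse response) throws IOException {
                    // Hypothetical lookup: in the paper's design the handler
                    // would consult a multi-level index over HDFS data; here we
                    // simply echo the requested key to show the embedded-server
                    // pattern inside a map task.
                    String key = request.getParameter("key");
                    response.setContentType("text/plain");
                    response.getWriter().println("lookup key: " + key);
                    baseRequest.setHandled(true);
                }
            });
            server.start();
            server.join();
        }
    }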
