13,663 research outputs found

    Cloud Storage Performance and Security Analysis with Hadoop and GridFTP

    Get PDF
    Although cloud servers have been around for a few years, most web hosts today have not yet moved to the cloud. If the purpose of a cloud server is to distribute and store files on the internet, FTP servers predate the cloud, and an FTP server is sufficient for distributing content. Is it therefore worth shifting from an FTP server to a cloud server? Cloud storage providers promise high durability and availability, and the ability to scale up storage easily can save users a great deal of money. But do they also provide higher performance and better security features? Hadoop is a very popular platform for cloud computing. It is free software under the Apache License, written in Java, and supports large-scale data processing in a distributed environment. Its characteristics include partitioning of data, computing across thousands of hosts, and executing application computations in parallel. The Hadoop Distributed File System (HDFS) allows rapid transfer of data volumes up to thousands of terabytes and keeps operating even when nodes fail. GridFTP supports high-speed data transfer over wide-area networks; it is based on FTP and uses multiple data channels for parallel transfers. This report describes the technology behind HDFS and the enhancement of Hadoop's security features with Kerberos. Based on the data transfer performance and security features of HDFS and the GridFTP server, we can decide whether the GridFTP server should be replaced with HDFS. According to our experimental results, we conclude that the GridFTP server provides better throughput than HDFS, and that Kerberos has minimal impact on HDFS performance. We propose a solution in which users first authenticate with HDFS and then transfer the file from the HDFS server to the client using GridFTP.
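
    A minimal sketch of the kind of Kerberos-authenticated HDFS read whose throughput such a comparison measures, using the standard Hadoop Java client. This is not the report's benchmark code; the principal, keytab, and file path are hypothetical.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.security.UserGroupInformation;

        public class KerberizedHdfsRead {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Tell the Hadoop client to use Kerberos instead of simple auth.
                conf.set("hadoop.security.authentication", "kerberos");
                UserGroupInformation.setConfiguration(conf);
                // Hypothetical principal and keytab; these are site-specific in practice.
                UserGroupInformation.loginUserFromKeytab(
                        "alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab");

                FileSystem fs = FileSystem.get(conf);
                Path src = new Path("/benchmark/sample.bin");  // hypothetical test file
                long start = System.nanoTime();
                long bytes = 0;
                byte[] buf = new byte[4 * 1024 * 1024];
                try (FSDataInputStream in = fs.open(src)) {
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        bytes += n;
                    }
                }
                double secs = (System.nanoTime() - start) / 1e9;
                System.out.printf("read %d bytes in %.2f s (%.1f MB/s)%n",
                        bytes, secs, bytes / secs / 1e6);
            }
        }

    Timing the same transfer against a GridFTP endpoint (for example with globus-url-copy) and comparing the measured MB/s is the shape of the comparison the abstract describes.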

    Only Aggressive Elephants are Fast Elephants

    Full text link
    Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast. We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.
    Comment: VLDB201
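
    The key idea, a different clustered sort order for each of a block's replicas, can be pictured with a short sketch. This is purely conceptual: HAIL builds these indexes inside its modified HDFS upload pipeline, and the record fields and per-replica sort keys below are invented for illustration.

        import java.util.ArrayList;
        import java.util.Comparator;
        import java.util.List;

        public class PerReplicaClustering {
            // Toy record standing in for one row stored in an HDFS block.
            record Row(int userId, long timestamp, double amount) {}

            public static void main(String[] args) {
                List<Row> block = List.of(
                        new Row(7, 1_000L, 9.5),
                        new Row(3, 2_500L, 1.0),
                        new Row(5, 1_800L, 4.2));

                // Invented choice: replica 1 clustered on userId, replica 2 on
                // timestamp, replica 3 on amount (default replication factor 3).
                List<Comparator<Row>> replicaOrders = List.of(
                        Comparator.comparingInt(Row::userId),
                        Comparator.comparingLong(Row::timestamp),
                        Comparator.comparingDouble(Row::amount));

                for (int r = 0; r < replicaOrders.size(); r++) {
                    List<Row> replica = new ArrayList<>(block);
                    replica.sort(replicaOrders.get(r)); // clustered order for this replica
                    System.out.println("replica " + (r + 1) + ": " + replica);
                }
            }
        }

    A MapReduce job with a selection predicate on, say, timestamp can then be routed to the replica whose clustered order matches that predicate, which is where the query speedups come from.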

    Data locality in Hadoop

    Get PDF
    Current market tendencies show the need to store and process rapidly growing amounts of data, which implies a demand for distributed storage and data processing systems. Apache Hadoop is an open-source framework for managing such computing clusters in an effective, fault-tolerant way. When dealing with large volumes of data, Hadoop and its storage system HDFS (Hadoop Distributed File System) face the challenge of keeping efficiency high while completing computations in a reasonable time. A typical Hadoop deployment therefore moves computation to the data rather than shipping data across the cluster; moving large quantities of data through the network would significantly delay processing tasks. While a task is running, Hadoop favours local data access and chooses blocks from the nearest nodes, and the necessary blocks are moved only when they are needed by the given task. To support Hadoop's data locality preferences, this thesis proposes adding a new capability to its distributed file system (HDFS) that enables moving data blocks on request. Shipping data in advance makes it possible to forcibly redistribute data between nodes and thus adapt the layout to the given processing tasks. The new functionality enables instructed movement of data blocks within the cluster: data can be shifted either by a user running the corresponding HDFS shell command or programmatically by another module, such as an appropriate scheduler. To develop this functionality, a detailed analysis of the Apache Hadoop source code and its components (specifically HDFS) was conducted. This research yielded a deep understanding of the internal architecture, which made it possible to compare candidate approaches and to develop the chosen one.
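
    The block-movement command proposed in the thesis is not part of stock Hadoop, but the locality information a user or scheduler would act on is already exposed through the standard HDFS client API. A minimal sketch (the file path is hypothetical):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class BlockLocality {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                Path file = new Path("/data/input.csv");  // hypothetical input file
                FileStatus status = fs.getFileStatus(file);
                // One BlockLocation per HDFS block, listing the hosts holding a replica.
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation b : blocks) {
                    System.out.printf("offset=%d length=%d hosts=%s%n",
                            b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
                }
            }
        }

    The proposed extension would add the inverse operation: instructing HDFS to place a block's replicas on chosen nodes before the processing task runs.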

    Topology of the Galaxy Distribution in the Hubble Deep Fields

    Full text link
    We have studied the topology of the distribution of the high-redshift galaxies identified in the Hubble Deep Field (HDF) North and South. The two-dimensional genus is measured from the projected distributions of the HDF galaxies at angular scales from 3.8'' to 6.1''. We have also divided the samples into three redshift slices with roughly equal numbers of galaxies, using photometric redshifts, to look for possible evolutionary effects on the topology. The genus curve of the HDF North clearly indicates clustering of galaxies in excess of a Poisson distribution, while the clustering is somewhat weaker in the HDF South. This clustering is mainly due to the nearer galaxies in the samples. We also find that the genus curve of galaxies in the HDF is consistent with the Gaussian random-phase distribution, with no significant redshift dependence.
    Comment: 14 pages, 4 figures, submitted to Ap
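
    For reference, the benchmark shape against which such measurements are compared can be written down explicitly. For a two-dimensional Gaussian random-phase field, the genus per unit area as a function of the threshold level \nu (the density threshold in units of the standard deviation of the smoothed field) is, up to an amplitude A fixed by the power spectrum of the smoothed field,

        g_{2D}(\nu) = A \, \nu \, e^{-\nu^{2}/2}

    Agreement of the measured genus curve with this \nu e^{-\nu^2/2} shape is what is meant above by consistency with the Gaussian random-phase distribution; the amplitude A depends on the power spectrum of the smoothed field rather than on its phases.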

    Direct Measurements of the Stellar Continua and Balmer/4000 Angstrom Breaks of Red z>2 Galaxies: Redshifts and Improved Constraints on Stellar Populations

    Full text link
    We use near-infrared (NIR) spectroscopy obtained with GNIRS on Gemini, NIRSPEC on KECK, and ISAAC on the VLT to study the rest-frame optical continua of three `Distant Red Galaxies' (having Js - Ks > 2.3) at z>2. All three galaxy spectra show the Balmer/4000 Angstrom break in the rest-frame optical. The spectra allow us to determine spectroscopic redshifts from the continuum with an estimated accuracy dz/(1+z) ~ 0.001-0.04. These redshifts agree well with the emission line redshifts for the 2 galaxies with Halpha emission. This technique is particularly important for galaxies that are faint in the rest-frame UV, as they are underrepresented in high redshift samples selected in optical surveys and are too faint for optical spectroscopy. Furthermore, we use the break, continuum shape, and equivalent width of Halpha together with evolutionary synthesis models to constrain the age, star formation timescale, dust content, stellar mass and star formation rate of the galaxies. Inclusion of the NIR spectra in the stellar population fits greatly reduces the range of possible solutions for stellar population properties. We find that the stellar populations differ greatly among the three galaxies, ranging from a young dusty starburst with a small break and strong emission lines to an evolved galaxy with a strong break and no detected line emission. The dusty starburst galaxy has an age of 0.3 Gyr and a stellar mass of 1*10^11 Msun. The spectra of the two most evolved galaxies imply ages of 1.3-1.4 Gyr and stellar masses of 4*10^11 Msun. The large range of properties seen in these galaxies strengthens our previous much more uncertain results from broadband photometry. Larger samples are required to determine the relative frequency of dusty starbursts and (nearly) passively evolving galaxies at z~2.5.
    Comment: Accepted for publication in the Astrophysical Journal. 12 pages, 6 figure
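
    As background, one widely used quantitative definition of the 4000 Angstrom break strength (the narrow-band D_n(4000) index of Balogh et al. 1999; the paper may adopt a different convention) is the ratio of the flux density F_nu integrated over two bands bracketing the break, with wavelengths in Angstroms:

        D_n(4000) = \frac{\int_{4000}^{4100} F_{\nu}\, d\lambda}{\int_{3850}^{3950} F_{\nu}\, d\lambda}

    Older, more evolved stellar populations show a stronger break, which is why the break strength, together with the continuum shape and the Halpha equivalent width, constrains the ages and star formation timescales quoted above.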

    Gaining insight from large data volumes with ease

    Get PDF
    Efficient handling of large data volumes has become a necessity in today's world. It is driven by the desire to gain more insight from data and a better understanding of user trends, which can be transformed into economic benefits (profits, cost reduction, and various optimizations of data workflows and pipelines). In this paper, we discuss how modern technologies are transforming well-established patterns in HEP communities. New insight into the data can be achieved by embracing Big Data tools for a variety of use cases, from analytics and monitoring to training Machine Learning models at terabyte scale. We provide concrete examples within the context of the CMS experiment, where Big Data tools already play, or will play, a significant role in daily operations.

    Design Architecture-Based on Web Server and Application Cluster in Cloud Environment

    Full text link
    The cloud has become a computational and storage solution for many data-centric organizations. The problem those organizations now face with the cloud is searching data efficiently: a framework is required to distribute the work of searching and fetching across thousands of computers, and data in HDFS is scattered and takes a long time to retrieve. The main idea is to place a web server in the map phase, using the Jetty web server, to provide a fast and efficient way of searching data in the MapReduce paradigm. For real-time processing on Hadoop, a search mechanism is implemented in HDFS by building a multi-level index, with multi-level index keys, in the web server. The web server handles the traffic throughput, and web clustering technology improves application performance. To keep the per-node workload down, the load balancer should automatically distribute load to newly added nodes in the server cluster.
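
    A conceptual sketch of such a multi-level index: a top-level partition key narrows the search, and a second-level record key resolves to a block/offset pointer into HDFS. The two-level layout, class names, and key formats are illustrative assumptions, not the paper's implementation.

        import java.util.NavigableMap;
        import java.util.TreeMap;

        public class MultiLevelIndex {
            // Pointer into HDFS: which block holds the record and at what byte offset.
            record BlockPointer(String blockId, long offset) {}

            // Top level: partition key -> sorted second-level index of record keys.
            private final NavigableMap<String, NavigableMap<String, BlockPointer>> topLevel =
                    new TreeMap<>();

            public void put(String partitionKey, String recordKey, BlockPointer ptr) {
                topLevel.computeIfAbsent(partitionKey, k -> new TreeMap<>()).put(recordKey, ptr);
            }

            public BlockPointer lookup(String partitionKey, String recordKey) {
                NavigableMap<String, BlockPointer> second = topLevel.get(partitionKey);
                return second == null ? null : second.get(recordKey);
            }

            public static void main(String[] args) {
                MultiLevelIndex idx = new MultiLevelIndex();
                idx.put("2024-01", "user-42", new BlockPointer("blk_1073741825", 4096L));
                System.out.println(idx.lookup("2024-01", "user-42"));
            }
        }

    Serving this lookup from a web server embedded in the map phase lets a query jump directly to the relevant block instead of scanning the whole dataset.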