A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
Data Grids have been adopted as the platform for scientific communities that
need to share, access, transport, process and manage large data collections
distributed worldwide. They combine high-end computing technologies with
high-performance networking and wide-area storage management techniques. In
this paper, we discuss the key concepts behind Data Grids and compare them with
other data sharing and distribution paradigms such as content delivery
networks, peer-to-peer networks and distributed databases. We then provide
comprehensive taxonomies that cover various aspects of architecture, data
transportation, data replication and resource allocation and scheduling.
Finally, we map the proposed taxonomy to various Data Grid systems not only to
validate the taxonomy but also to identify areas for future exploration.
Through this taxonomy, we aim to categorise existing systems to better
understand their goals and their methodology. This would help evaluate their
applicability for solving similar problems. This taxonomy also provides a "gap
analysis" of this area through which researchers can potentially identify new
issues for investigation. Finally, we hope that the proposed taxonomy and
mapping also help to provide an easy way for new practitioners to understand
this complex area of research. Comment: 46 pages, 16 figures, Technical Report
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies into an optimized
solution to a specific real-world problem, big data systems are no exception.
As far as the storage aspect of any big data system is
concerned, the primary facet in this regard is a storage infrastructure and
NoSQL seems to be the right technology that fulfills its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents
a feature and use case analysis and comparison of the four main data models,
namely document-oriented, key-value, graph and wide-column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings the second facet of big data storage, big data file
formats, into the picture. The second half of the paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage, and their challenges and future prospects are
also discussed.
Lustre, Hadoop, Accumulo
Data processing systems impose multiple views on data as it is processed by
the system. These views include spreadsheets, databases, matrices, and graphs.
There are a wide variety of technologies that can be used to store and process
data through these different steps. The Lustre parallel file system, the Hadoop
distributed file system, and the Accumulo database are all designed to address
the largest and the most challenging data storage problems. There have been
many ad-hoc comparisons of these technologies. This paper describes the
foundational principles of each technology, provides simple models for
assessing their capabilities, and compares the various technologies on a
hypothetical common cluster. These comparisons indicate that Lustre provides 2x
more storage capacity, is less likely to lose data during 3 simultaneous drive
failures, and provides higher bandwidth on general purpose workloads. Hadoop
can provide 4x greater read bandwidth on special purpose workloads. Accumulo
provides 10,000x lower latency on random lookups than either Lustre or Hadoop
but Accumulo's bulk bandwidth is 10x less. Significant recent work has been
done to enable mix-and-match solutions that allow Lustre, Hadoop, and Accumulo
to be combined in different ways. Comment: 6 pages; accepted to IEEE High Performance Extreme Computing
conference, Waltham, MA, 201
A scalable application server on Beowulf clusters : a thesis presented in partial fulfilment of the requirement for the degree of Master of Information Science at Albany, Auckland, Massey University, New Zealand
Application performance and scalability of a large distributed multi-tiered application are core requirements for most of today's critical business applications. I have investigated the scalability of a J2EE application server using the standard ECperf benchmark application on the Massey Beowulf clusters, namely the Sisters and the Helix. My testing environment consists of open source software: the integrated JBoss-Tomcat as the application server and the web server, along with PostgreSQL as the database. My testing programs were run on the clustered application server, which provides replication of the Enterprise Java Bean (EJB) objects. I have completed various centralized and distributed tests using the JBoss Cluster. I concluded that clustering of the application server and web server will effectively increase the performance of the application running on them, given sufficient system resources. The application performance will scale to the point where a bottleneck occurs in the testing system; the bottleneck could be any resource included in the testing environment: the hardware, software, network or the application that is running. Performance tuning for a large-scale J2EE application is a complicated issue, which is related to the resources available. However, by carefully identifying the performance bottleneck in the system across hardware, software, network, operating system and application configuration, I can improve the performance of the J2EE applications running in a Beowulf cluster. Software bottlenecks can be solved by changing the default settings; hardware bottlenecks, on the other hand, are harder to resolve unless more investment is made to purchase higher speed and capacity hardware.
Querying Large Physics Data Sets Over an Information Grid
Optimising use of the Web (WWW) for LHC data analysis is a complex problem
and illustrates the challenges arising from the integration of and computation
across massive amounts of information distributed worldwide. Finding the right
piece of information can, at times, be extremely time-consuming, if not
impossible. So-called Grids have been proposed to facilitate LHC computing and
many groups have embarked on studies of data replication, data migration and
networking philosophies. Other aspects such as the role of 'middleware' for
Grids are emerging as requiring research. This paper positions the need for
appropriate middleware that enables users to resolve physics queries across
massive data sets. It identifies the role of meta-data for query resolution and
the importance of Information Grids for high-energy physics analysis rather
than just Computational or Data Grids. This paper identifies software that is
being implemented at CERN to enable the querying of very large collaborating
HEP data-sets, initially being employed for the construction of CMS detectors. Comment: 4 pages, 3 figures
Comparative genomic analysis of Acinetobacter spp. plasmids originating from clinical settings and environmental habitats
Bacteria belonging to the genus Acinetobacter have become of clinical importance over the last decade due to the development of a multi-resistant phenotype and their ability to survive under multiple environmental conditions. The development of these traits among Acinetobacter strains occurs frequently as a result of plasmid-mediated horizontal gene transfer. In this work, plasmids from nosocomial and environmental Acinetobacter spp. collections were separately sequenced and characterized. Assembly of the sequenced data resulted in 19 complete replicons in the nosocomial collection and 77 plasmid contigs in the environmental collection. Comparative genomic analysis showed that many of them had conserved backbones. Plasmid coding sequences corresponding to plasmid specific functions were bioinformatically and functionally analyzed. Replication initiation protein analysis revealed the predominance of the Rep_3 superfamily. The phylogenetic tree constructed from all Acinetobacter Rep_3 superfamily plasmids showed 16 intermingled clades originating from nosocomial and environmental habitats. Phylogenetic analysis of relaxase proteins revealed the presence of a new sub-clade named MOBQAci, composed exclusively of Acinetobacter relaxases. Functional analysis of proteins belonging to this group showed that they behaved differently when mobilized using helper plasmids belonging to different incompatibility groups.
Affiliation: Salto, Ileana Paula. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Biotecnología y Biología Molecular. Universidad Nacional de La Plata. Facultad de Ciencias Exactas. Instituto de Biotecnología y Biología Molecular; Argentina
Affiliation: Torres Tejerizo, Gonzalo Arturo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Biotecnología y Biología Molecular. Universidad Nacional de La Plata. Facultad de Ciencias Exactas. Instituto de Biotecnología y Biología Molecular; Argentina. Universitat Bielefeld. Center For Biotechnology; Germany
Affiliation: Wibberg, Daniel. Universitat Bielefeld. Center For Biotechnology; Germany
Affiliation: Pühler, Alfred. Universitat Bielefeld. Center For Biotechnology; Germany
Affiliation: Schlüter, Andreas. Universitat Bielefeld. Center For Biotechnology; Germany
Affiliation: Pistorio, Mariano. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Biotecnología y Biología Molecular. Universidad Nacional de La Plata. Facultad de Ciencias Exactas. Instituto de Biotecnología y Biología Molecular; Argentina