    Provisioning of data locality for HEP analysis workflows

    The rapidly increasing amount of data produced by current experiments in high energy particle physics challenges both end users and providers of computing resources. The high data rates and the complexity of analyses require huge datasets to be processed in short turnaround cycles. Data storage and computing farms are usually deployed by different providers, which leads to data delocalization and a strong dependence on interconnect transfer rates. The CMS collaboration at KIT has developed a prototype that enables data locality for HEP analysis processing via two concepts. A coordinated, distributed caching approach that mitigates the bottleneck of data transfers by combining local high-performance devices with large background storage was tested. Throughput was further optimized by identifying and pre-placing critical data within user workflows. A high-performance setup using these caching solutions enables fast processing of throughput-dependent analysis workflows.
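
    The caching concept lends itself to a short illustration. The Python sketch below shows the basic cache-aware access pattern described above: a job asks for a dataset, is served from a local high-performance device on a hit, and otherwise stages the file once from the background storage. This is a minimal sketch only; the paths and the cached_path function are hypothetical and are not part of the KIT prototype.

        import os
        import shutil

        CACHE_DIR = "/fast-ssd/cache"            # hypothetical local high-performance device
        REMOTE_STORE = "/mnt/background-store"   # hypothetical large background storage

        def cached_path(dataset):
            """Return a local path for the dataset, staging it into the cache on a miss."""
            local_copy = os.path.join(CACHE_DIR, dataset.lstrip("/"))
            if not os.path.exists(local_copy):   # cache miss: pay the transfer cost once
                os.makedirs(os.path.dirname(local_copy), exist_ok=True)
                shutil.copy(os.path.join(REMOTE_STORE, dataset.lstrip("/")), local_copy)
            return local_copy                    # later jobs read at local-device speed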

    Optimizing Collective Communication for Scalable Scientific Computing and Deep Learning

    In the realm of distributed computing, collective operations involve coordinated communication and synchronization among multiple processing units, enabling efficient data exchange and collaboration. Scientific applications, such as simulations, computational fluid dynamics, and scalable deep learning, require complex computations that can be parallelized across multiple nodes in a distributed system. These applications often involve data-dependent communication patterns, where collective operations are critical for achieving high performance in data exchange. Optimizing collective operations for scientific applications and deep learning involves improving the algorithms, communication patterns, and data distribution strategies to minimize communication overhead and maximize computational efficiency. Within the context of this dissertation, the specific focus is on optimizing the alltoall operation in 3D Fast Fourier Transform (FFT) applications and the allreduce operation in parallel deep learning, particularly on High-Performance Computing (HPC) systems. Advanced communication algorithms and methods are explored and implemented to improve communication efficiency, consequently enhancing the overall performance of 3D FFT applications. Furthermore, this dissertation investigates the identification of performance bottlenecks during collective communication with Horovod on distributed systems. These bottlenecks are addressed by proposing an optimized parallel communication pattern tailored to alleviate these limitations during the training phase of distributed deep learning. The objective is to achieve faster convergence and improve the overall training efficiency. Moreover, this dissertation proposes fault tolerance and elastic scaling features for distributed deep learning by leveraging User-Level Failure Mitigation (ULFM) from the Message Passing Interface (MPI). By incorporating ULFM MPI, the dissertation aims to enhance the elastic capabilities of distributed deep learning systems. This approach enables graceful and lightweight handling of failures while facilitating seamless scaling in dynamic computing environments.
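
    As a concrete illustration of the two collectives targeted here, the mpi4py sketch below shows an alltoall, the redistribution (transpose) step between the 1D passes of a distributed 3D FFT, and an allreduce, the gradient-averaging step in data-parallel training. This is a minimal sketch, not the optimized algorithms developed in the dissertation; buffer sizes and variable names are illustrative.

        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        size = comm.Get_size()

        # Alltoall: each rank sends an equal-sized block to every other rank --
        # the data redistribution (transpose) between the 1D passes of a 3D FFT.
        send = np.arange(size * 4, dtype=np.float64)   # 4 elements per destination rank
        recv = np.empty_like(send)
        comm.Alltoall(send, recv)

        # Allreduce: sum local "gradients" across all ranks, then divide by the
        # world size -- the synchronization step after each mini-batch.
        grad = np.random.rand(1024)
        summed = np.empty_like(grad)
        comm.Allreduce(grad, summed, op=MPI.SUM)
        avg = summed / size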

    Distributed Computing in a Pandemic

    The current COVID-19 global pandemic caused by the SARS-CoV-2 betacoronavirus has resulted in over a million deaths and is having a grave socio-economic impact; hence there is an urgency to find solutions to key research challenges. Much of this COVID-19 research depends on distributed computing. In this article, I review distributed architectures -- various types of clusters, grids and clouds -- that can be leveraged to perform these tasks at scale and at high throughput, with a high degree of parallelism, and that can also be used for collaborative work. High-performance computing (HPC) clusters will be used to carry out much of this work. Several big-data processing tasks used in reducing the spread of SARS-CoV-2 require high-throughput approaches and a variety of tools, which Hadoop and Spark offer even on commodity hardware. Extremely large-scale COVID-19 research has also utilised some of the world's fastest supercomputers, such as IBM's SUMMIT -- for ensemble-docking high-throughput screening against SARS-CoV-2 targets for drug repurposing, and for high-throughput gene analysis -- and Sentinel, an XPE-Cray based system used to explore natural products. Grid computing has facilitated the formation of the world's first Exascale grid computer. This has accelerated COVID-19 research in molecular dynamics simulations of SARS-CoV-2 spike protein interactions through massively parallel computation performed with over 1 million volunteer computing devices on the Folding@home platform. Both grids and clouds can also be used for international collaboration by enabling access to important datasets and providing services that allow researchers to focus on research rather than on time-consuming data-management tasks.
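
    To make the high-throughput style concrete, the PySpark sketch below counts k-mers across a large set of sequence reads. The task and the input path are illustrative, not drawn from any specific COVID-19 study, but the map/reduce pattern is the kind of commodity-hardware processing that Hadoop and Spark enable.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("kmer-count").getOrCreate()
        sc = spark.sparkContext

        K = 8  # k-mer length (illustrative)
        reads = sc.textFile("hdfs:///data/reads.txt")   # hypothetical input: one read per line
        counts = (reads
                  .flatMap(lambda s: (s[i:i + K] for i in range(len(s) - K + 1)))
                  .map(lambda kmer: (kmer, 1))
                  .reduceByKey(lambda a, b: a + b))     # runs in parallel across commodity nodes
        counts.saveAsTextFile("hdfs:///out/kmer-counts")
        spark.stop()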

    Social Networking Adapted for Distributed Scientific Collaboration

    Sci-Share is a social networking site with novel, specially designed feature sets that enable simultaneous remote collaboration and sharing of large datasets among scientists. The site will include not only the standard features found on popular consumer-oriented social networking sites such as Facebook and Myspace, but also a number of powerful tools that extend its functionality to a science collaboration site. A Virtual Observatory is a promising technology for making data accessible from various missions and instruments through a Web browser. Sci-Share augments services provided by Virtual Observatories by enabling distributed collaboration and sharing of downloaded and/or processed data among scientists. This will, in turn, increase science returns from NASA missions. Sci-Share also enables better utilization of NASA's high-performance computing resources by providing an easy and central mechanism to access and share large files in users' space or saved on mass storage. The most common means of remote scientific collaboration today remains the trio of e-mail for electronic communication, FTP for file sharing, and personalized Web sites for dissemination of papers and research results. Each of these tools has well-known limitations. Sci-Share transforms the social networking paradigm into a scientific collaboration environment by offering powerful tools for cooperative discourse and digital content sharing. Sci-Share differentiates itself by serving as an online repository for users' digital content with the following unique features: a) Sharing of any file type, any size, from anywhere; b) Creation of projects and groups for controlled sharing; c) Module for sharing files on HPC (High Performance Computing) sites; d) Universal accessibility of staged files as embedded links on other sites (e.g. Facebook) and tools (e.g. e-mail); e) Drag-and-drop transfer of large files, replacing awkward e-mail attachments (and file size limitations); f) Enterprise-level data and messaging encryption; and g) Easy-to-use, intuitive workflow.

    The integration of grid and peer-to-peer to support scientific collaboration

    There have been a number of e-Science projects that address the issues of collaboration within and between scientific communities. Most effort to date has focused on building the Grid infrastructure to enable the sharing of huge volumes of computational and data resources. The ‘portal’ approach has been used by some to bring the power of grid computing to the desktops of individual researchers. However, collaborative activities within a scientific community are not confined solely to the sharing of data or computationally intensive resources. There are other forms of sharing that can be better supported by other architectures. In order to provide more holistic support to a scientific community, this paper proposes a hybrid architecture that integrates Grid and peer-to-peer technologies using a Service Oriented Architecture. This platform will then be used for a semantic architecture that captures the characteristics of the data, functional, and process requirements for a range of collaborative activities. A combustion chemistry research community is used as a case study.

    Sharing a conceptual model of grid resources and services

    Grid technologies aim to enable coordinated resource-sharing and problem-solving capabilities over local and wide area networks, spanning locations, organizations, machine architectures, and software boundaries. The heterogeneity of the resources involved and the need for interoperability among different grid middlewares require the sharing of a common information model. Abstractions of different flavors of resources and services, and conceptual schemas of domain-specific entities, require a collaborative effort in order to enable coherent cooperation among information services. With this paper, we present the result of our experience in modelling grid resources and services, carried out within the Grid Laboratory Uniform Environment (GLUE) effort, a collaboration between US and EU High Energy Physics projects working towards grid interoperability. The first implementation-neutral agreements on services such as batch computing and storage management, and on resources such as the cluster, sub-cluster, and host hierarchy and the storage library, are presented. Design guidelines and operational results are presented, together with open issues and future evolutions.
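
    The agreed hierarchy can be sketched as a simple object model. The Python dataclasses below mirror the cluster, sub-cluster, and host relationship described above; the attribute names are illustrative and are not the normative GLUE attribute set.

        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class Host:                      # an individual worker node
            name: str
            architecture: str            # machine architecture, e.g. "x86_64"
            memory_mb: int

        @dataclass
        class SubCluster:                # a homogeneous group of hosts
            id: str
            hosts: List[Host] = field(default_factory=list)

        @dataclass
        class Cluster:                   # top of the GLUE computing-resource hierarchy
            id: str
            sub_clusters: List[SubCluster] = field(default_factory=list)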