GridFTP: Protocol Extensions to FTP for the Grid
High-Performance Solvers for Dense Hermitian Eigenproblems
We introduce a new collection of solvers - subsequently called EleMRRR - for
large-scale dense Hermitian eigenproblems. EleMRRR solves various types of
problems: generalized, standard, and tridiagonal eigenproblems. Among these,
the last is of particular importance as it is a solver in its own right, as
well as the computational kernel for the first two; we present a fast and
scalable tridiagonal solver based on the Algorithm of Multiple Relatively
Robust Representations - referred to as PMRRR. Like the other EleMRRR solvers,
PMRRR is part of the freely available Elemental library, and is designed to
fully support both message-passing (MPI) and multithreading parallelism (SMP).
As a result, the solvers can equally be used in pure MPI or in hybrid MPI-SMP
fashion. We conducted a thorough performance study of EleMRRR and ScaLAPACK's
solvers on two supercomputers. Such a study, performed with up to 8,192 cores,
provides precise guidelines to assemble the fastest solver within the ScaLAPACK
framework; it also indicates that EleMRRR outperforms even the fastest solvers
built from ScaLAPACK's components.
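The tridiagonal kernel described above can be illustrated with a minimal numpy sketch (this is not PMRRR; production MRRR solvers work directly on the diagonal and off-diagonal arrays without forming the dense matrix, and numpy's `eigvalsh` calls a dense LAPACK driver). The 1-D Laplacian stencil gives a symmetric tridiagonal matrix with a known analytic spectrum, which makes the result easy to check:

```python
import numpy as np

# Illustrative sketch (not PMRRR): the symmetric tridiagonal standard
# eigenproblem T x = lambda x for the 1-D Laplacian stencil [-1, 2, -1].
n = 100
T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

w = np.linalg.eigvalsh(T)   # ascending eigenvalues via a dense LAPACK driver

# Analytic spectrum of this matrix: 2 - 2*cos(k*pi/(n+1)), k = 1..n.
k = np.arange(1, n + 1)
exact = np.sort(2.0 - 2.0 * np.cos(k * np.pi / (n + 1)))
assert np.allclose(w, exact)
```

Because tridiagonal eigenproblems also serve as the computational kernel for the standard and generalized cases (after reduction), speeding up this step accelerates the whole solver stack.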
Recommended from our members
Executing matrix multiply on a process oriented data flow machine
The Process-Oriented Dataflow System (PODS) is an execution model that combines the von Neumann and dataflow models of computation to gain the benefits of each. Central to PODS is the concept of array distribution and its effects on the partitioning and mapping of processes.

In PODS, arrays are partitioned by simply assigning consecutive elements to each processing element (PE) equally. Since PODS uses single assignment, there is only one producer of each element. This producing PE owns that element and performs the necessary computations to assign it. Using this approach, the filling loop is distributed across the PEs. This simple partitioning and mapping scheme provides excellent results for executing scientific code on MIMD machines. In this way, PODS allows MIMD machines to exploit vector and data parallelism easily while still providing the flexibility of MIMD over SIMD for multi-user systems.

In this paper, the classic matrix multiply algorithm, with 1024 data points, is executed on a PODS simulator and the results are presented and discussed. Matrix multiply is a good example because it has several interesting properties: there are multiple code-blocks; a new array must be dynamically allocated and distributed; there is a loop-carried dependency in the innermost loop; the two input arrays have different access patterns; and the sizes of the input arrays are not known at compile time. Matrix multiply also forms the basis for many important scientific algorithms such as LU decomposition, convolution, and the Fast Fourier Transform.

The results show that PODS is comparable to both Iannucci's Hybrid Architecture and MIT's TTDA in terms of overhead and instruction power. They also show that PODS easily distributes the workload evenly across the PEs. The key result is that PODS can scale matrix multiply in a near-linear fashion until there is little or no work to be performed for each PE.
Then overhead and message passing become a major component of the execution time. With larger problems (e.g., ≥16k data points) this limit would be reached at around 256 PEs.
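The consecutive-element partitioning described above can be sketched in a few lines. This is a hypothetical illustration of the scheme (the function names `block_partition` and `owner` are my own, not from the paper): consecutive elements are split into equal-sized blocks, one per PE, and the owning PE is the sole producer of its elements under single assignment:

```python
# Hypothetical sketch of PODS-style array distribution: consecutive
# elements are assigned to PEs in equal-sized blocks; the owning PE
# is the sole producer (single assignment) of its elements.
def block_partition(n_elems, n_pes):
    """Return a list of (start, end) index ranges, one per PE."""
    base, extra = divmod(n_elems, n_pes)
    ranges, start = [], 0
    for pe in range(n_pes):
        size = base + (1 if pe < extra else 0)  # spread the remainder
        ranges.append((start, start + size))
        start += size
    return ranges

def owner(i, n_elems, n_pes):
    """PE that owns (produces) element i."""
    for pe, (lo, hi) in enumerate(block_partition(n_elems, n_pes)):
        if lo <= i < hi:
            return pe

# 1024 elements over 4 PEs -> 256 consecutive elements per PE.
print(block_partition(1024, 4))  # [(0, 256), (256, 512), (512, 768), (768, 1024)]
print(owner(300, 1024, 4))       # 1
```

Because each element has exactly one producer, the filling loop distributes across PEs with no write conflicts, which is what lets the scheme balance the workload evenly.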
Principles in Patterns (PiP): Evaluation of Impact on Business Processes
The innovation and development work conducted under the auspices of the Principles in Patterns (PiP) project is intended to explore and develop new technology-supported approaches to curriculum design, approval and review. An integral component of this innovation is the use of business process analysis and process change techniques - and their instantiation within the C-CAP system (Class and Course Approval Pilot) - in order to improve the efficacy of curriculum approval processes. Improvements to approval process responsiveness and overall process efficacy can assist institutions in better reviewing or updating curriculum designs to enhance pedagogy. Such improvements also assume a greater significance in a globalised HE environment, in which institutions must adapt or create curricula quickly in order to better reflect rapidly changing academic contexts, as well as better responding to the demands of employment marketplaces and the expectations of professional bodies. This is increasingly an issue for disciplines within the sciences and engineering, where new skills or knowledge need to be rapidly embedded in curricula as a response to emerging technological or environmental developments. All of the aforementioned must also be achieved while simultaneously maintaining high standards of academic quality, thus adding a further layer of complexity to the way in which HE institutions engage in "responsive curriculum design" and approval. This strand of the PiP evaluation therefore entails an analysis of the business process techniques used by PiP, their efficacy, and the impact of process changes on the curriculum approval process, as instantiated by C-CAP. More generally the evaluation is a contribution towards a wider understanding of technology-supported process improvement initiatives within curriculum approval and their potential to render such processes more transparent, efficient and effective. 
Partly owing to limitations in the data required to facilitate comparative analyses, this evaluation adopts a mixed approach, making use of qualitative and quantitative methods as well as theoretical techniques. These approaches combined enable a comparative evaluation of the curriculum approval process under the "new state" (i.e. using C-CAP) and under the "previous state". This report summarises the methodology used to enable comparative evaluation and presents an analysis and discussion of the results. As the report will explain, the impact of C-CAP and its ability to support improvements in process and document management has resulted in the resolution of numerous process failings. C-CAP has also demonstrated potential for improvements in approval process cycle time, process reliability, process visibility, process automation, process parallelism and a reduction in transition delays within the approval process, thus contributing to considerable process efficiencies; although it is acknowledged that enhancements and redesign may be required to take advantage of C-CAP's potential. Other aspects pertaining to C-CAP's impact on process change, improvements to document management and the curation of curriculum designs will also be discussed.
Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients
We present a robust and scalable preconditioner for the solution of
large-scale linear systems that arise from the discretization of elliptic PDEs
amenable to rank compression. The preconditioner is based on hierarchical
low-rank approximations and the cyclic reduction method. The setup and
application phases of the preconditioner achieve log-linear complexity in
memory footprint and number of operations, and numerical experiments exhibit
good weak and strong scalability at large processor counts in a distributed
memory environment. Numerical experiments with linear systems that feature
symmetry and nonsymmetry, definiteness and indefiniteness, constant and
variable coefficients demonstrate the preconditioner's applicability and
robustness. Furthermore, it is possible to control the number of iterations via
the accuracy threshold of the hierarchical matrix approximations and their
arithmetic operations, and the tuning of the admissibility condition parameter.
Together, these parameters allow for optimization of the memory requirements
and performance of the preconditioner. Comment: 24 pages, Elsevier Journal of Computational and Applied Mathematics,
Dec 201
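The rank-compression idea underlying hierarchical low-rank preconditioners, and the role of the accuracy threshold, can be illustrated with a truncated SVD of a single well-separated kernel block. This is an illustrative sketch, not the paper's preconditioner (which uses hierarchical formats and cyclic reduction); the function name `truncate_svd` and the test kernel are my own:

```python
import numpy as np

# Illustrative sketch: compress one off-diagonal block to low rank,
# truncating the SVD at a relative accuracy threshold `eps`.
def truncate_svd(A, eps):
    """Return factors U, V with ||A - U @ V||_2 <= eps * ||A||_2."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # keep singular values above the relative threshold
    r = int(np.sum(s > eps * s[0])) if s[0] > 0 else 0
    r = max(r, 1)
    return U[:, :r] * s[:r], Vt[:r, :]

# A smooth kernel between two well-separated point sets is numerically
# low rank: source points in [0, 1], targets shifted by 2.
x = np.linspace(0.0, 1.0, 200)
A = 1.0 / (1.0 + np.abs(x[:, None] - (x[None, :] + 2.0)))

U, V = truncate_svd(A, eps=1e-8)
print(U.shape[1])  # numerical rank, far smaller than 200
```

Tightening `eps` raises the kept rank (more memory, more work per application) but makes the approximation, and hence the preconditioner, more accurate; this is the memory/iteration-count trade-off the abstract describes.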
Combined shared and distributed memory ab-initio computations of molecular-hydrogen systems in the correlated state: process pool solution and two-level parallelism
An efficient computational scheme devised for investigations of ground state
properties of electronically correlated systems is presented. As an
example, a molecular-hydrogen chain is considered, with the long-range
electron-electron interactions taken into account. The implemented procedure
covers: (i) single-particle Wannier wave-function basis construction in the
correlated state, (ii) microscopic parameters calculation, and (iii) ground
state energy optimization. The optimization loop is based on a highly effective
process-pool solution (a specific root-workers approach). The hierarchical,
two-level parallelism was applied: both shared (by use of Open
Multi-Processing) and distributed (by use of Message Passing Interface) memory
models were utilized. We discuss in detail how such an approach results in a
substantial increase of the calculation speed for the fully parallelized
solution. Comment: 14 pages, 10 figures, 1 table
Parallel Algorithm for Frequent Itemset Mining on Intel Many-core Systems
Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional databases. Apriori is a
classical frequent itemset mining algorithm, which employs iterative passes
over the database, combined with the generation of candidate itemsets based on
frequent itemsets found at the previous iteration, and the pruning of clearly infrequent
itemsets. The Dynamic Itemset Counting (DIC) algorithm is a variation of
Apriori, which tries to reduce the number of passes made over a transactional
database while keeping the number of itemsets counted in a pass relatively low.
In this paper, we address the problem of accelerating DIC on the Intel Xeon Phi
many-core system for the case when the transactional database fits in main
memory. Intel Xeon Phi provides a large number of small compute cores with
vector processing units. The paper presents a parallel implementation of DIC
based on OpenMP technology and thread-level parallelism. We exploit the
bit-based internal layout for transactions and itemsets. This technique reduces
the memory space for storing the transactional database, simplifies the support
count via logical bitwise operations, and allows for vectorization of such a
step. Experimental evaluation on the platforms of the Intel Xeon CPU and the
Intel Xeon Phi coprocessor with large synthetic and real databases showed good
performance and scalability of the proposed algorithm. Comment: Accepted for publication in Journal of Computing and Information
Technology (http://cit.fer.hr)
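The bit-based layout described above can be sketched as follows. This is an illustrative analogue, not the paper's implementation: each item maps to a bitset over transactions (one bit per transaction), and the support of an itemset is the popcount of the AND of its items' bitsets; Python integers stand in for the fixed-width bit vectors a vectorized Xeon Phi kernel would use.

```python
# Sketch of bitwise support counting for itemset mining.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def item_bitsets(transactions):
    """Map each item to a bitset with bit t set iff transaction t contains it."""
    bits = {}
    for t_id, t in enumerate(transactions):
        for item in t:
            bits[item] = bits.get(item, 0) | (1 << t_id)
    return bits

def support(itemset, bits):
    """Transactions containing every item: bitwise AND, then popcount."""
    acc = ~0
    for item in itemset:
        acc &= bits.get(item, 0)
    acc &= (1 << len(transactions)) - 1   # mask to the transaction count
    return bin(acc).count("1")

bits = item_bitsets(transactions)
print(support({"bread", "milk"}, bits))   # 2 (transactions 0 and 2)
```

On real hardware the AND-and-popcount over contiguous machine words is exactly the step that vectorizes well, which is why the bit layout both shrinks the database and speeds up support counting.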