Cray X1 Evaluation Status Report
On August 15, 2002, the Department of Energy (DOE) selected the Center for Computational Sciences (CCS) at Oak Ridge National Laboratory (ORNL) to deploy a new scalable vector supercomputer architecture for solving important scientific problems in climate, fusion, biology, nanoscale materials and astrophysics. "This program is one of the first steps in an initiative designed to provide U.S. scientists with the computational power that is essential to 21st century scientific leadership," said Dr. Raymond L. Orbach, director of the department's Office of Science.
The Cray X1 is an attempt to incorporate the best aspects of previous Cray vector systems and massively-parallel-processing (MPP) systems into one design. Like the Cray T90, the X1 has high memory bandwidth, which is key to realizing a high percentage of theoretical peak performance. Like the Cray T3E, the X1 has a high-bandwidth, low-latency, scalable interconnect, and scalable system software. And, like the Cray SV1, the X1 leverages commodity CMOS technology and incorporates non-traditional vector concepts, like vector caches and multi-streaming processors.
In FY03, CCS procured a 256-processor Cray X1 to evaluate the processors, memory subsystem, scalability of the architecture, and software environment, and to predict the expected sustained performance on key DOE application codes. The results of the micro-benchmarks and kernel benchmarks show the architecture of the Cray X1 to be exceptionally fast for most operations. The best results are shown on large problems, where it is not possible to fit the entire problem into the cache of the processors. These large problems are exactly the types of problems that are important for the DOE and ultra-scale simulation.
The distributed ASCI supercomputer project
The Distributed ASCI Supercomputer (DAS) is a homogeneous wide-area distributed system consisting of four cluster computers at different locations. DAS has been used for research on communication software, parallel languages and programming systems, schedulers, parallel applications, and distributed applications. The paper gives a preview of the most interesting research results obtained so far in the DAS project.
Database System Acceleration on FPGAs
Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems.
The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases.
Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm, which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.
I Database Systems & FPGAs
1 INTRODUCTION
1.1 Databases & the Importance of Performance
1.2 Accelerators & FPGAs
1.3 Requirements
1.4 Outline & Summary of Contributions
2 BACKGROUND ON DATABASE SYSTEMS
2.1 Databases
2.1.1 Storage Model
2.1.2 Storage Medium
2.2 Database Operators
2.2.1 Projection
2.2.2 Filter
2.2.3 Sort
2.2.4 Aggregation
2.2.5 Join
2.2.6 Operator Classification
2.3 Database Queries
2.4 Impact of Acceleration
3 BACKGROUND ON FPGAS
3.1 FPGA
3.1.1 Logic Element
3.1.2 Block RAM (BRAM)
3.1.3 Digital Signal Processor (DSP)
3.1.4 IO Element
3.1.5 Programmable Interconnect
3.2 FPGA Design Flow
3.2.1 Specifications
3.2.2 RTL Description
3.2.3 Verification
3.2.4 Synthesis, Mapping, Placement, and Routing
3.2.5 Timing Analysis
3.2.6 Bitstream Generation and FPGA Programming
3.3 Implementation Quality Metrics
3.4 FPGA Cards
3.5 Benefits of Using FPGAs
3.6 Challenges of Using FPGAs
4 RELATED WORK
4.1 Summary of Related Work
4.2 Platform Type
4.2.1 Accelerator Card
4.2.2 Coprocessor
4.2.3 Smart Storage
4.2.4 Network Processor
4.3 Implementation
4.3.1 Loop-based Implementation
4.3.2 Sort-based Implementation
4.3.3 Hash-based Implementation
4.3.4 Mixed Implementation
4.4 A Note on Quantitative Performance Comparisons
II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK)
5 OBJECTIVES AND ARCHITECTURE OVERVIEW
5.1 From Requirements to Objectives
5.2 Architecture Overview
5.3 Outline of Part II
6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS
6.1 Programming FPGAs
6.2 Related Work
6.3 Architecture
6.3.1 Global Architecture
6.3.2 Sorter Architecture
6.3.3 Merger Architecture
6.3.4 Scalability and Resource Adaptability
6.4 Experiments
6.4.1 OpenCL Sort-Merge Implementation
6.4.2 RTL Sorters
6.4.3 RTL Mergers
6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation
6.5 Summary & Discussion
7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS
7.1 The Case for Resource Efficiency
7.2 Related Work
7.3 Architecture
7.3.1 Sorters
7.3.2 Sort-Network
7.3.3 X:Y Mergers
7.3.4 Merge-Network
7.3.5 Join Materialiser (JoinMat)
7.4 Experiments
7.4.1 Experimental Setup
7.4.2 Implementation Description & Tuning
7.4.3 Sort Benchmarks
7.4.4 Aggregation Benchmarks
7.4.5 Join Benchmarks
7.5 Summary
8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS
8.1 The Scope of Database System Accelerators
8.2 Related Work
8.3 Key-Reduce Radix Sort (KeRRaS)
8.3.1 Time Complexity
8.3.2 Space Complexity (Memory Utilization)
8.3.3 Discussion and Optimizations
8.4 Architecture
8.4.1 MSM
8.4.2 MSMK: Extending MSM with KeRRaS
8.4.3 Payload, Aggregation and Join Processing
8.4.4 Limitations
8.5 Experiments
8.5.1 Experimental Setup
8.5.2 Datasets
8.5.3 MSMK vs. MSM
8.5.4 Payload-Less Benchmarks
8.5.5 Payload-Based Benchmarks
8.5.6 Flexibility
8.6 Summary
9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS
9.1 Early Aggregation
9.2 Background & Related Work
9.2.1 Sort-Based Early Aggregation
9.2.2 Cache-Based Early Aggregation
9.3 Simulations
9.3.1 Datasets
9.3.2 Metrics
9.3.3 Sort-Based Versus Cache-Based Early Aggregation
9.3.4 Comparison of Set-Associative Caches
9.3.5 Comparison of Cache Structures
9.3.6 Comparison of Replacement Policies
9.3.7 Cache Selection Methodology
9.4 Cache System Architecture
9.4.1 Window Aggregator
9.4.2 Compressor & Hasher
9.4.3 Collision Detector
9.4.4 Collision Resolver
9.4.5 Cache
9.5 Experiments
9.5.1 Experimental Setup
9.5.2 Resource Utilization and Parameter Tuning
9.5.3 Datasets
9.5.4 Benchmarks on Synthetic Data
9.5.5 Benchmarks on Real Data
9.6 Summary
10 THE FULL PICTURE
10.1 System Architecture
10.2 Benchmarks
10.3 Meeting the Objectives
III Conclusion
11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH
11.1 Summary
11.2 Future Work
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
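The abstract of this thesis states that multi-way merging reduces the number of merge passes in the sort-merge algorithm. The following is a minimal software sketch of that idea only, not the FPGA architecture described in the thesis. The names `merge_passes`, `multiway_merge_sort`, `run_length`, and `fan_in` are hypothetical and chosen for illustration; the sketch assumes fixed-size initial sorted runs and a merger that combines up to `fan_in` runs per pass.

```python
import heapq
import math
from typing import List


def merge_passes(num_runs: int, fan_in: int) -> int:
    """Count the passes needed to reduce num_runs sorted runs to a single run.

    Each pass merges groups of up to fan_in runs, so the run count
    shrinks by roughly a factor of fan_in per pass.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    passes = 0
    while num_runs > 1:
        num_runs = math.ceil(num_runs / fan_in)
        passes += 1
    return passes


def multiway_merge_sort(values: List[int], run_length: int, fan_in: int) -> List[int]:
    """Sort by building fixed-size sorted runs, then merging fan_in runs at a time."""
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    # Sort phase: cut the input into runs of run_length and sort each one.
    runs = [sorted(values[i:i + run_length])
            for i in range(0, len(values), run_length)]
    # Merge phase: every loop iteration is one full pass over the data.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []


if __name__ == "__main__":
    # With 1024 initial runs, a wider merger needs fewer passes.
    for k in (2, 4, 32):
        print(k, merge_passes(1024, k))
    print(multiway_merge_sort([5, 3, 8, 1, 9, 2, 7], run_length=2, fan_in=3))
```

Under these assumptions, every pass reads and writes the whole data set once, so fewer passes means less traffic through the intermediate buffers. This is the general reason a larger merge fan-in can help any operator built on the same sort-merge pipeline.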
A dynamic prediction and monitoring framework for distributed applications
This research builds on an application performance prediction and characterisation environment (known as PACE), whose aim is to characterise the performance-critical elements of both an application and its target execution environment and deduce from this model a predicted behaviour of the application prior to its execution.
Underlying the research presented in this thesis are a number of themes: the tasks involved in the performance characterisation of applications and how these might be semi-automated; the level of abstraction at which these characterisations are performed in order to maintain sufficient predictive accuracy; the automated refinement of these characterisations from runtime performance data; and the extension of both the target programming languages and the class of application at which these techniques are aimed.
In this thesis a number of novel extensions to PACE are described. These include: a new transaction-based performance characterisation language that provides a flexible framework for describing broader classes of application; a performance monitoring framework (based on an extension to the Open Group's Application Response Measurement (ARM) standard) for the runtime monitoring of an application's data-dependent components and the automated refinement of performance models; and an adaptation of this performance characterisation for the prediction of Java applications. These contributions are demonstrated through their application to a number of scientific kernels. This thesis also documents how these predictive results can be used in a real-time distributed runtime management environment, and how these techniques can be applied to non-scientific codes, in particular to an IBM request-driven distributed web services demonstrator.
Flow-Induced Vibrations of In-Line Cylinder Arrangements at Low Reynolds Numbers
SUMMARY
Wake-induced vibration (WIV) is a type of fluid-structure interaction that can occur when two or more elastically mounted bodies are arranged one behind the other in a cross flow. In this configuration, the downstream body is subjected not only to its own vortex shedding but also to that generated by the upstream cylinder. As a result, the downstream body can oscillate strongly, with maximum amplitudes reaching A/D = 10 (Paidoussis et al. (2011)). WIV is still poorly understood. Even a world leader in classification for offshore engineering does not know how to treat interference between several risers undergoing WIV (Det Norske Veritas (2009)). Most studies simply consider a tandem pair of cylinders. Few studies have been carried out with more than two elastically mounted bodies.
In 2009, Étienne et al. considered three cylinders arranged in line in a uniform flow. For a Reynolds number of 200 and a reduced velocity of 8, they showed by numerical simulation that the cylinders could undergo strong oscillations. In 2013, Oviedo-Tolentino et al. experimentally studied the oscillations of ten cylinders placed one behind the other for a mass-damping factor of m*ζ = 0.13. They confirmed that the third cylinder, that is, the one placed behind the first two, can undergo transverse oscillations even larger than those of the second cylinder. These large oscillations can not only cause excessive material fatigue but also lead to collisions between the cylinders. WIV can therefore pose serious problems in the design of many engineering systems.
In light of these recent studies, it is necessary to examine more closely the behavior of several elastically mounted bodies placed one behind the other in a cross flow. Apart from the strong oscillations observed, many aspects of the WIV of several in-line cylinders remain very poorly understood: the frequency responses and the maximum amplitudes produced, the influence of the Reynolds number of the flow, the effects of low mass ratios or low mass-damping factors, and so on.
This thesis aims to explore numerically the wake-induced oscillation responses of three circular cylinders arranged in line, with a low mass ratio and zero damping, at low Reynolds numbers. To reach this research objective, the work proceeds in three steps.
ABSTRACT
Wake-induced vibration (WIV) is a type of fluid-structure interaction (FSI) that may occur when two or more elastically mounted bodies are arranged one after the other in a cross flow. Here, the downstream body is not only affected by the vortices generated behind the body itself, but is also subjected to the influence of the wake developed behind the upstream body. Under these two disturbances, the downstream body can develop severe oscillations with a maximum amplitude as large as A/D = 10 (Paidoussis et al. (2011)).
Knowledge of WIV is still so limited that even the recommended practice for riser interference from a world-class leader in offshore engineering classification does not yet consistently incorporate the consideration of WIV (Det Norske Veritas (2009)). Most investigations consider the configuration with a tandem cylinder pair placed in a uniform flow. Very little is known when there are more than two elastically mounted structural bodies.
In a brief investigation, Étienne et al. (2009) numerically showed that three freely oscillating cylinders arranged in-line, in a uniform flow at a Reynolds number of Re = 200 and at a fixed reduced velocity of Ur = 8, can develop significant vibrations. A recent original experiment by Oviedo-Tolentino et al. (2013), who studied the oscillation response of ten collinear cylinders with a medium-large mass-damping factor (m*ζ = 0.13) placed in a uniform flow, confirmed that the cylinders behind the second one can develop transverse oscillations that are actually larger than those of the second cylinder. These more severe oscillations can not only cause material fatigue, but can also potentially lead to collisions among the cylinders. These conditions pose great challenges for engineering design.
Based on these recent findings, it is therefore important to take a closer look at the behavior of multiple elastically mounted bodies arranged in-line in a cross flow. Apart from the more significant oscillations observed, many important aspects of the WIV of multiple in-line cylinders remain essentially unknown, including the effects of a low mass ratio and a low mass-damping factor, the maximum oscillation amplitude, the frequency responses, and the effect of Reynolds number.
This thesis aims to numerically explore the wake-induced vibration responses of three circular cylinders with low mass ratio and zero damping arranged in-line at low Reynolds number in order to advance the fundamental engineering knowledge regarding multiple elastically mounted in-line bodies placed in a cross flow. To reach this research goal, we have identified three specific objectives.