Implementing Access to Data Distributed on Many Processors

Author: James Mark

A reference architecture is defined for an object-oriented implementation of domains, arrays, and distributions written in the programming language Chapel. This technology primarily addresses domains that contain arrays that have regular index sets with the low-level implementation details being beyond the scope of this discussion. What is defined is a complete set of object-oriented operators that allows one to perform data distributions for domain arrays involving regular arithmetic index sets. What is unique is that these operators allow for the arbitrary regions of the arrays to be fragmented and distributed across multiple processors with a single point of access giving the programmer the illusion that all the elements are collocated on a single processor. Today's massively parallel High Productivity Computing Systems (HPCS) are characterized by a modular structure, with a large number of processing and memory units connected by a high-speed network. Locality of access as well as load balancing are primary concerns in these systems that are typically used for high-performance scientific computation. Data distributions address these issues by providing a range of methods for spreading large data sets across the components of a system. Over the past two decades, many languages, systems, tools, and libraries have been developed for the support of distributions. Since the performance of data parallel applications is directly influenced by the distribution strategy, users often resort to low-level programming models that allow fine-tuning of the distribution aspects affecting performance, but, at the same time, are tedious and error-prone. This technology presents a reusable design of a data-distribution framework for data parallel high-performance applications. Distributions are a means to express locality in systems composed of large numbers of processor and memory components connected by a network. 
Since distributions have a great effect on the performance of applications, it is important that the distribution strategy is flexible, so its behavior can change depending on the needs of the application. At the same time, high productivity concerns require that the user be shielded from error-prone, tedious details such as communication and synchronization
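The "single point of access" idea in this abstract can be illustrated with a minimal Python sketch of a block distribution. It is not the Chapel reference architecture: the names `block_owner` and `DistributedArray` are hypothetical, the "processors" are plain in-memory lists, and a one-dimensional block layout is assumed.

```python
def block_owner(i: int, n: int, p: int) -> tuple[int, int]:
    """Map global index i (0 <= i < n) to (processor rank, local offset).

    Assumes a 1-D block distribution of n elements over p processors.
    """
    if not 0 <= i < n:
        raise IndexError(i)
    base, extra = divmod(n, p)
    # The first `extra` processors hold base + 1 elements, the rest hold base.
    cutoff = extra * (base + 1)
    if i < cutoff:
        return i // (base + 1), i % (base + 1)
    return extra + (i - cutoff) // base, (i - cutoff) % base


class DistributedArray:
    """Hypothetical wrapper: fragments live on separate 'processors' (lists),
    but callers index the array as if it were stored in one place."""

    def __init__(self, n: int, p: int, fill: float = 0.0):
        self.n, self.p = n, p
        base, extra = divmod(n, p)
        self.fragments = [
            [fill] * (base + (1 if rank < extra else 0)) for rank in range(p)
        ]

    def __getitem__(self, i: int):
        rank, offset = block_owner(i, self.n, self.p)
        return self.fragments[rank][offset]

    def __setitem__(self, i: int, value) -> None:
        rank, offset = block_owner(i, self.n, self.p)
        self.fragments[rank][offset] = value
```

In a real system each fragment access would involve communication with the owning processor. Hiding that step behind ordinary indexing is what shields the user from the communication and synchronization details mentioned above.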

NASA Technical Reports Server

FPGA-Based Acceleration of the Self-Organizing Map (SOM) Algorithm using High-Level Synthesis

Author: Oninda Mohammad Abdul Moin
Publication venue: 'University of Windsor Leddy Library'
Publication date: 17/11/2019

One of the fastest growing and most demanding areas of computer science is Machine Learning (ML). The Self-Organizing Map (SOM), an unsupervised ML method, is a popular data-mining algorithm widely used in Artificial Neural Networks (ANNs) for mapping high-dimensional data into low-dimensional feature maps. Being computationally intensive, SOM requires considerable computation time and power when dealing with large datasets. Many computationally intensive algorithms can be accelerated using Field-Programmable Gate Arrays (FPGAs), but the traditional Hardware Description Language (HDL) based design methodology requires extensive hardware knowledge and longer development time. Open Computing Language (OpenCL) is a standard framework for writing parallel programs that execute on heterogeneous computing systems. The Intel FPGA Software Development Kit for OpenCL (IFSO) is a High-Level Synthesis (HLS) tool that provides a more efficient alternative to HDL-based design. This research presents an optimized OpenCL implementation of the SOM algorithm on Stratix V and Arria 10 FPGAs using IFSO. Compared to recent SOM implementations on Central Processing Units (CPUs) and Graphics Processing Units (GPUs), our OpenCL implementation on FPGAs provides superior speed and power consumption results. Stratix V achieves a speedup of 1.41x - 16.55x compared to AMD and Intel CPUs and 2.18x compared to an Nvidia GPU, whereas Arria 10 achieves a speedup of 1.63x - 19.15x compared to AMD and Intel CPUs and 2.52x compared to an Nvidia GPU. In terms of power consumption, Stratix V is 35.53x and 42.53x, and Arria 10 is 15.82x and 15.93x, more power efficient than the CPU and GPU respectively.
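For readers unfamiliar with SOM, the following is a minimal pure-Python sketch of one training step in the standard textbook formulation: find the best matching unit, then pull neighbouring neurons toward the input. It is not the thesis's OpenCL kernel. The function names, the Gaussian neighbourhood, and the parameters `lr` and `sigma` are assumptions made for illustration.

```python
import math


def best_matching_unit(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_d = None, math.inf
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))  # squared Euclidean
            if d < best_d:
                best, best_d = (r, c), d
    return best


def som_step(weights, x, lr, sigma):
    """Apply one SOM update in place and return the BMU coordinates.

    weights is a 2-D grid (list of rows) of weight vectors (lists of floats).
    """
    br, bc = best_matching_unit(weights, x)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            grid_d2 = (r - br) ** 2 + (c - bc) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
            for k in range(len(w)):
                w[k] += lr * h * (x[k] - w[k])
    return br, bc
```

The distance computation over every neuron and every input dimension is the part that dominates run time on large maps, which is why the algorithm is a natural candidate for the kind of hardware acceleration the abstract describes.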

Scholarship at UWindsor

Portable parallel stochastic optimization for the design of aeropropulsion components

Author: Rhodes G. S.
Sues Robert H.

This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to begin the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as a review of portable, parallel programming environments. The second effort was to implement the MSO methodology for a problem using the portable parallel programming environment Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate that the MSO methodology can be applied effectively to large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel iPSC/860. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications to which MSO can be applied, including NASA's High-Speed Civil Transport and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
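The speedup and efficiency figures quoted in this abstract are related by the usual definition, efficiency = speedup / number of processors. The short Python sketch below works through that arithmetic. The function names are illustrative, and the abstract's "almost 19 times" is treated as an upper bound of 19.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Efficiency is the fraction of ideal linear speedup actually achieved."""
    return speedup / processors


def implied_speedup(efficiency: float, processors: int) -> float:
    """Invert the definition: speedup = efficiency * processors."""
    return efficiency * processors


# Network of workstations: a speedup of almost 19 on 20 machines
# corresponds to an efficiency of just under 95 percent.
workstation_eff = parallel_efficiency(19.0, 20)

# Intel iPSC/860: the reported efficiencies imply these speedups.
speedup_31 = implied_speedup(0.75, 31)   # about 23x on 31 processors
speedup_50 = implied_speedup(0.60, 50)   # about 30x on 50 processors
```

Read this way, the drop from 75 to 60 percent efficiency still corresponds to a higher absolute speedup at 50 processors than at 31.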

NASA Technical Reports Server

Distributed Discrete Time Network Simulator

Author: Helminen Väinö
Publication date: 02/06/2010

The Discrete Time Network Simulator (DTNS) is a System-on-Chip (SoC) simulator developed at Tampere University of Technology. It is used to analyze interconnection architectures and systems built around them. The abstraction level is between the conventional hardware simulators, such as Mentor Graphics ModelSim, and algorithm-level simulators, such as Synopsys System Studio. DTNS makes it possible to get cycle-accurate information about the communication, while other high-level tools often lack timing completely. This is because DTNS is a time-driven simulator where the simulated time advances in fixed increments of half a clock cycle and the system bus is always simulated at that level of detail. The simulator itself is programmed in C, and the system design is described in the C or C++ programming language, with the level of detail ranging from high-level functional code to code that is almost at the level of a hardware description language. The accuracy of the simulation increases as the model is refined. However, this also makes the simulation times longer. The goal of this thesis work was to develop a distributed version of DTNS as a remedy. The emphasis on the system bus made it a natural point of partitioning. The simulator was split into a central core process and separate processes for system component models, which can then be executed in parallel. Processes communicate any writes to the system bus to the core process, which then accumulates them and communicates changes to all other processes. At the same time, all processes synchronize with the core process after every simulation step. For communication between processes executed on different computers, it was originally given as a premise that Common Object Request Broker Architecture (CORBA) should be used. However, the amount of overhead was found to be too significant, and another implementation that communicated directly over the TCP/IP protocol was also created. For performance testing, a statistical model of an H.263 video encoder was used. The model was instrumented to mimic different complexity levels with artificial delays. Multiple simulations were then executed using both communication implementations while varying the delay gradually from a very high-level and fast model to a very complex and slow one. Also, the number of computers was varied from one to three. The measured wall clock times of these simulations clearly show the high overhead of CORBA in comparison to TCP/IP. Both implementations were able to speed up the simulation as the models became slower. The performance of the TCP/IP implementation seems rather impressive. The distribution method of DTNS was also used to distribute a commercial simulator, ModelSim. The system consisted of two to eight TUTWLAN terminals, and this was distributed across up to eight simulators executed in parallel, with signals passed between them using the TCP/IP protocol. The simulation times show that this method is capable of significant speed-up even in real-world simulations if the system model is in fine enough detail. In conclusion, the distribution of DTNS is not very useful in real life, as the models are unlikely to be slow enough to see any speed-up. This new version of DTNS is, however, also capable of parallel execution on a single computer with, for example, a multi-core processor, and without network overhead simulation times can be improved noticeably even for higher-level models.
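The partitioning described in this abstract, where component models report bus writes to a core that accumulates them and redistributes the bus state after every step, can be sketched in a few lines of Python. This is a single-process illustration only: the `Component` class, its `period` parameter, and the `run` function are hypothetical, and the real DTNS uses separate processes communicating over CORBA or TCP/IP.

```python
class Component:
    """Hypothetical component model: writes to its own bus signal periodically."""

    def __init__(self, name: str, period: int):
        self.name, self.period = name, period
        self.bus_view = {}  # this component's copy of the system bus state

    def step(self, tick: int) -> dict:
        """Advance one fixed time increment and return any bus writes."""
        if tick % self.period == 0:
            return {self.name: tick}
        return {}

    def observe(self, bus: dict) -> None:
        """Receive the bus state communicated by the core."""
        self.bus_view = dict(bus)


def run(components, ticks: int) -> dict:
    """Time-driven loop: every tick stands for one fixed time increment."""
    bus = {}
    for tick in range(ticks):
        writes = {}
        for comp in components:   # in the distributed version these run in parallel
            writes.update(comp.step(tick))
        bus.update(writes)        # the core accumulates the writes
        for comp in components:   # the core shares the result; this acts as the
            comp.observe(bus)     # per-step synchronization point
    return bus
```

Because every component waits for the core after each step, the per-step communication cost is paid regardless of how much work a model does, which is consistent with the abstract's finding that distribution only pays off once the models are slow enough.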

Trepo - Institutional Repository of Tampere University

Holographic and 3D teleconferencing and visualization: implications for terabit networked applications

Author: Gharai L.
Perkins C.S.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006

Abstract not available

Crossref

Enlighten

Deep Space Network information system architecture study

Author: Atkinson D. J.
Beswick C. A.
Cooper L. P.
Crowe R. A.
Jenkins J. S.
Markley R. W.
Masline R. C.
Stoloff M. J.
Tausworthe R. C.
Thomas J. L.
Publication date: 15/05/1992

The purpose of this article is to describe an architecture for the Deep Space Network (DSN) information system in the years 2000-2010 and to provide guidelines for its evolution during the 1990s. The study scope is defined to be from the front-end areas at the antennas to the end users (spacecraft teams, principal investigators, archival storage systems, and non-NASA partners). The architectural vision provides guidance for major DSN implementation efforts during the next decade. A strong motivation for the study is an expected dramatic improvement in information-systems technologies, such as the following: computer processing, automation technology (including knowledge-based systems), networking and data transport, software and hardware engineering, and human-interface technology. The proposed Ground Information System has the following major features: unified architecture from the front-end area to the end user; open-systems standards to achieve interoperability; DSN production of level 0 data; delivery of level 0 data from the Deep Space Communications Complex, if desired; dedicated telemetry processors for each receiver; security against unauthorized access and errors; and highly automated monitor and control

NASA Technical Reports Server

Network Virtual Machine (NetVM): A New Architecture for Efficient and Portable Packet Processing Applications

Author: Baldi Mario
Buffa D.
Degioanni L.
Risso Fulvio Giovanni Ottavio
Stirano F.
Varenni G.
Publication venue: IEEE
Publication date: 01/01/2005

A challenge facing network device designers, besides increasing the speed of network gear, is improving its programmability in order to simplify the implementation of new applications (see, for example, active networks, content networking, etc.). This paper presents our work on designing and implementing a virtual network processor, called NetVM, which has an instruction set optimized for packet processing applications, i.e., for handling network traffic. Just as a Java Virtual Machine virtualizes a CPU, a NetVM virtualizes a network processor. The NetVM is expected to provide a compatibility layer for networking tasks (e.g., packet filtering, packet counting, string matching) performed by various packet processing applications (firewalls, network monitors, intrusion detectors) so that they can be executed on any network device, ranging from expensive routers to small appliances (e.g., smart phones). Moreover, the NetVM will provide efficient mapping of the elementary functionalities used to realize the above-mentioned networking tasks onto specific hardware functional units (e.g., ASICs, FPGAs, and network processing elements) included in special-purpose hardware systems possibly deployed to implement network devices.
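To make the idea of a virtual machine with a packet-oriented instruction set concrete, here is a minimal Python sketch of a stack-based filter interpreter. It does not reproduce NetVM: the opcodes `PUSH`, `LOAD_BYTE`, `EQ`, `AND`, and `RET` and the function `run_filter` are invented for illustration. The only external fact used is that the IPv4 header carries the protocol number at byte offset 9, with 6 meaning TCP.

```python
def run_filter(program, packet: bytes) -> bool:
    """Interpret a tiny, hypothetical packet-filter program on a stack machine.

    Returns True if the program accepts the packet, False otherwise.
    """
    stack = []
    for op, *args in program:
        if op == "PUSH":                      # push a constant
            stack.append(args[0])
        elif op == "LOAD_BYTE":               # push one packet byte, -1 if out of range
            offset = args[0]
            stack.append(packet[offset] if offset < len(packet) else -1)
        elif op == "EQ":                      # compare the top two values
            b, a = stack.pop(), stack.pop()
            stack.append(int(a == b))
        elif op == "AND":                     # logical AND of the top two values
            b, a = stack.pop(), stack.pop()
            stack.append(int(bool(a) and bool(b)))
        elif op == "RET":                     # return the verdict on top of the stack
            return bool(stack.pop())
        else:
            raise ValueError(f"unknown opcode: {op}")
    return False                              # programs without RET reject


# Accept packets whose IPv4 protocol field (byte 9) is 6, i.e. TCP.
IS_TCP = [("LOAD_BYTE", 9), ("PUSH", 6), ("EQ",), ("RET",)]
```

The same program runs unchanged wherever the interpreter exists, which is the compatibility-layer property the abstract attributes to NetVM. Mapping individual opcodes onto dedicated hardware units would be the analogue of the efficient mapping it mentions.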

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino