Modular Acquisition and Stimulation System for Timestamp-Driven Neuroscience Experiments
Dedicated systems are fundamental for neuroscience experimental protocols
that require timing determinism and synchronous stimuli generation. We
developed a data acquisition and stimulus generation system for neuroscience
research, optimized for recording timestamps from up to 6 spiking neurons and
entirely specified in a high-level Hardware Description Language (HDL). Despite
the logic complexity penalty of synthesizing from such a language, it was
possible to implement our design in a small, low-cost reconfigurable device.
Under a modular framework, we explored two different memory arbitration schemes
for our system, evaluating both their logic element usage and resilience to
input activity bursts. One of them was designed with a decoupled and latency
insensitive approach, allowing for easier code reuse, while the other adopted a
centralized scheme, constructed specifically for our application. The usage of
a high-level HDL allowed straightforward and stepwise code modifications to
transform one architecture into the other. The achieved modularity is very
useful for rapidly prototyping novel electronic instrumentation systems
tailored to scientific research.Comment: Preprint submitted to ARC 2015. Extended: 16 pages, 10 figures. The
final publication is available at link.springer.co
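As an editorial illustration of the arbitration trade-off this abstract describes, the Python sketch below models a round-robin arbiter multiplexing timestamp words from six buffered input channels onto a single memory write port. The FIFO depth, cycle model, and all names are invented for illustration; the actual design is specified in an HDL, not software.

from collections import deque

# Hypothetical software model of a round-robin memory arbiter: each input
# channel buffers timestamp words in a small FIFO, and at most one word per
# cycle is granted access to the single memory write port.

FIFO_DEPTH = 4          # assumed per-channel buffer depth
NUM_CHANNELS = 6        # up to 6 spiking-neuron inputs, as in the abstract

def run_arbiter(events, cycles):
    """events[c] holds the cycles at which channel c produces a word."""
    fifos = [deque() for _ in range(NUM_CHANNELS)]
    memory, dropped, grant = [], 0, 0
    for t in range(cycles):
        # Input stage: push new timestamps, dropping on FIFO overflow.
        for c in range(NUM_CHANNELS):
            if t in events[c]:
                if len(fifos[c]) < FIFO_DEPTH:
                    fifos[c].append((c, t))
                else:
                    dropped += 1
        # Arbitration stage: rotate the grant to the next non-empty FIFO.
        for i in range(NUM_CHANNELS):
            c = (grant + i) % NUM_CHANNELS
            if fifos[c]:
                memory.append(fifos[c].popleft())
                grant = (c + 1) % NUM_CHANNELS
                break
    return memory, dropped

# A simultaneous burst on all channels stresses the single write port.
burst = [set(range(10)) for _ in range(NUM_CHANNELS)]
mem, lost = run_arbiter(burst, cycles=40)
print(len(mem), "words stored,", lost, "dropped")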
FOS: A Modular FPGA Operating System for Dynamic Workloads
With FPGAs now being deployed in the cloud and at the edge, there is a need
for scalable design methods which can incorporate the heterogeneity present in
the hardware and software components of FPGA systems. Moreover, these FPGA
systems need to be maintainable and adaptable to changing workloads while
improving accessibility for the application developers. However, current FPGA
systems fail to achieve modularity and support for multi-tenancy due to
dependencies between system components and lack of standardised abstraction
layers. To solve this, we introduce a modular FPGA operating system -- FOS,
which adopts a modular FPGA development flow to allow each system component to
be changed and be agnostic to the heterogeneity of EDA tool versions, hardware
and software layers. Further, to maximise utilisation dynamically and transparently to users, FOS employs resource-elastic scheduling to arbitrate the FPGA resources in both the time and spatial domains for any type of accelerator. Our evaluation on different FPGA boards shows that FOS can
provide performance improvements in both single-tenant and multi-tenant
environments while substantially reducing development time and improving flexibility.
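To make the resource-elastic idea concrete, here is a minimal Python sketch assuming a device partitioned into a fixed number of reconfigurable slots that are redistributed whenever a tenant arrives or departs. The slot count, tenant names, and even-split policy are assumptions for illustration, not FOS internals.

# Minimal model of resource-elastic scheduling: accelerators receive a share
# of the device's reconfigurable slots, and shares are recomputed on every
# tenant arrival or departure.  Slot count and policy are assumptions.

TOTAL_SLOTS = 8

def allocate(tenants):
    """Split slots evenly, giving leftovers to the earliest-arrived tenants."""
    if not tenants:
        return {}
    base, extra = divmod(TOTAL_SLOTS, len(tenants))
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(tenants)}

tenants = []
for op, name in [("add", "aes"), ("add", "sobel"), ("add", "matmul"), ("del", "aes")]:
    if op == "add":
        tenants.append(name)
    else:
        tenants.remove(name)
    print(f"after {op} {name}: {allocate(tenants)}")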
Towards Power- and Energy-Efficient Datacenters
As the Internet evolves, cloud computing has become a dominant form of computation in modern life. Warehouse-scale computers (WSCs), or datacenters, which form the foundation of this cloud-centric web, have been able to deliver satisfactory performance to both Internet companies and their customers. As the cloud grows in focus and popularity, however, datacenter loads rise rapidly, and Internet companies need more computing capacity to serve this demand. Unfortunately, power and energy are often the major limiting factors prohibiting datacenter growth: it is often the case that no more servers can be added to a datacenter without surpassing the capacity of the existing power infrastructure.
This dissertation aims to investigate the issues of power and energy usage in a modern datacenter environment. We identify the sources of power and energy inefficiency at three levels in a modern datacenter environment and provide insights and solutions to address each of these problems, aiming to prepare datacenters for critical future growth.

We start at the datacenter level and find that peak provisioning and improper service placement in multi-level power delivery infrastructures fragment the power budget inside production datacenters, degrading the compute capacity the existing infrastructure can support. We find that the heterogeneity among datacenter workloads is key to addressing this issue and design systematic methods to reduce the fragmentation and improve the utilization of the power budget.

This dissertation then narrows its focus to examine the energy usage of individual servers running cloud workloads. In particular, we examine the power management mechanisms employed in these servers and find that the coarse time granularity of these mechanisms is one critical factor that leads to excessive energy consumption. We propose an intelligent, low-overhead solution on top of emerging finer-granularity voltage/frequency boosting circuits that pinpoints and boosts the queries that are likely to lengthen the tail of the latency distribution and can reap more benefit from the voltage/frequency boost, improving energy efficiency without sacrificing quality of service.

The final focus of this dissertation takes a further step to investigate how a fundamentally more efficient computing substrate, field programmable gate arrays (FPGAs), benefits datacenter power and energy efficiency. Unlike other types of hardware acceleration, FPGAs can be reconfigured on the fly to provide fine-grain control over hardware resource allocation, and they present a unique set of challenges for optimal workload scheduling and resource allocation. We aim to design a set of coordinated algorithms to manage these two key factors simultaneously and to fully explore the benefits of deploying FPGAs in the highly varying cloud environment.

PhD; Computer Science & Engineering; University of Michigan, Horace H. Rackham School of Graduate Studies; https://deepblue.lib.umich.edu/bitstream/2027.42/144043/1/hsuch_1.pd
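As a hedged illustration of the server-level idea of boosting only tail queries, the sketch below compares a selective boost policy against boosting every query, using invented per-query latency predictions, power levels, and speedup. None of these numbers or names come from the dissertation.

# Toy comparison of two voltage/frequency boost policies: boost only the
# queries predicted to miss a latency target, versus boosting everything.
# All parameters below are assumptions for illustration.

TARGET_MS = 10.0            # hypothetical tail-latency target
BASE_POWER = 1.0            # assumed power at the base V/f level (W)
BOOST_POWER = 1.6           # assumed power at the boosted level (W)
SPEEDUP = 1.3               # assumed latency reduction from boosting

def energy_mj(latency_ms, power_w):
    # 1 W sustained for 1 ms dissipates 1 mJ
    return latency_ms * power_w

queries = [("q1", 4.0), ("q2", 12.0), ("q3", 6.0), ("q4", 15.0)]  # (id, predicted ms)

selective = sum(
    energy_mj(t / SPEEDUP, BOOST_POWER) if t > TARGET_MS else energy_mj(t, BASE_POWER)
    for _, t in queries)
always = sum(energy_mj(t / SPEEDUP, BOOST_POWER) for _, t in queries)
print(f"boost tail only: {selective:.1f} mJ, boost everything: {always:.1f} mJ")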
Improving low latency applications for reconfigurable devices
This thesis seeks to improve low latency application performance via architectural improvements in reconfigurable devices. This is achieved by improving resource utilisation and access, and by exploiting the different environments within which reconfigurable devices are deployed.
Our first contribution leverages devices deployed at the network level to enable the low latency processing of financial market data feeds. Financial exchanges transmit messages via two identical data feeds to reduce the chance of message loss. We present an approach to arbitrate these redundant feeds at the network level using a Field-Programmable Gate Array (FPGA). With support for any messaging protocol, we evaluate our design using the NASDAQ TotalView-ITCH, OPRA, and ARCA data feed protocols, and provide two simultaneous outputs: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods.
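The duplicate-elimination core of such an arbiter can be sketched in a few lines, assuming sequence-numbered messages with per-message arrival times. The real design runs in FPGA logic at the network level and adds configurable windowing and gap handling, which this illustration omits.

import heapq

# Sketch of A/B feed arbitration: market data arrives as sequence-numbered
# messages on two redundant feeds, and the arbiter forwards whichever copy
# of each sequence number arrives first, dropping the late duplicate.

def arbitrate(feed_a, feed_b):
    """feed_x: list of (arrival_time, seq, payload), sorted by arrival time.
    Returns the forwarded stream with the winning feed noted per message."""
    events = heapq.merge(
        ((t, seq, pay, "A") for t, seq, pay in feed_a),
        ((t, seq, pay, "B") for t, seq, pay in feed_b),
    )
    seen, out = set(), []
    for t, seq, pay, feed in events:
        if seq not in seen:          # first arrival wins
            seen.add(seq)
            out.append((seq, pay, feed))
    return out

# Feed B delivers message 2 earlier than feed A; the arbiter takes B's copy.
fa = [(1.0, 1, "m1"), (3.0, 2, "m2"), (4.0, 3, "m3")]
fb = [(1.2, 1, "m1"), (2.0, 2, "m2"), (4.5, 3, "m3")]
print(arbitrate(fa, fb))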
Our second contribution is a new ring-based architecture for low latency, parallel access to FPGA memory. Traditional FPGA memory is formed by grouping block memories (BRAMs) together and accessing them as a single device. Our architecture accesses these BRAMs independently and in parallel. Targeting memory-based computing, which stores pre-computed function results in memory, we benefit low latency applications that rely on: highly-complex functions; iterative computation; or many parallel accesses to a shared resource. We assess square root, power, trigonometric, and hyperbolic functions within the FPGA, and provide a tool to convert Python functions to our new architecture.
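A minimal software analogue of memory-based computing, assuming a fixed-point input step and a square-root table sized to the domain (both invented parameters), looks like the sketch below. In the FPGA the table would be spread across independently accessed BRAMs.

import math

# Memory-based computing in miniature: tabulate a complex function once at a
# fixed input resolution so each later evaluation is a single table lookup
# (in hardware, one BRAM read).  Table size and step are assumptions.

STEP = 1.0 / 256                                     # hypothetical resolution
TABLE = [math.sqrt(i * STEP) for i in range(4096)]   # covers [0, 16)

def sqrt_lookup(x):
    """Approximate sqrt(x) with one read of the precomputed table."""
    return TABLE[min(int(x / STEP), len(TABLE) - 1)]

print(sqrt_lookup(2.0), math.sqrt(2.0))   # lookup vs. reference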
Our third contribution extends the ring-based architecture to support any FPGA processing element. We unify E heterogeneous processing elements within compute pools, with each element implementing the same function, and the pool serving D parallel function calls. Our implementation-agnostic approach supports processing elements with different latencies, implementations, and pipeline lengths, as well as non-deterministic latencies. Compute pools evenly balance access to processing elements across the entire application, and are evaluated by implementing eight different neural network activation functions within an FPGA.
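The load-balancing behaviour of such a compute pool can be sketched with an earliest-free dispatcher over per-element latencies. The element count, latencies, and dispatch policy below are illustrative assumptions, not the thesis's implementation.

import heapq

# Toy model of a compute pool: E processing elements implement the same
# function with different latencies, and D parallel calls are balanced by
# always dispatching to the element that becomes free earliest.

def dispatch(latencies, num_calls):
    """latencies[e] = cycles element e needs per call.  Returns per-call
    (finish_time, element) assignments in dispatch order."""
    ready = [(0, e) for e in range(len(latencies))]   # (free_at, element)
    heapq.heapify(ready)
    schedule = []
    for _ in range(num_calls):
        free_at, e = heapq.heappop(ready)
        finish = free_at + latencies[e]
        schedule.append((finish, e))
        heapq.heappush(ready, (finish, e))
    return schedule

# Three heterogeneous elements (E=3) serving eight parallel calls (D=8).
print(dispatch([3, 5, 4], 8))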
Efficient runtime placement management for high performance and reliability in COTS FPGAs
Designing high-performance, fault-tolerant multisensory electronic systems for
hostile environments such as nuclear plants and outer space within the constraints of
cost, power and flexibility is challenging. Issues such as ionizing radiation, extreme
temperature and ageing can lead to faults in the electronics of these systems. In
addition, the remote nature of these environments demands a level of flexibility and
autonomy in their operations. The standard practice of using specially hardened
electronic devices for such systems is not only very expensive but also has limited
flexibility.
This thesis proposes novel techniques that promote the use of Commercial Off-The-
Shelf (COTS) reconfigurable devices to meet the challenges of high-performance
systems for hostile environments. Reconfigurable hardware such as Field
Programmable Gate Arrays (FPGA) have a unique combination of flexibility and
high performance. The flexibility offered through features such as dynamic partial
reconfiguration (DPR) can be harnessed not only to achieve cost-effective designs as
a smaller area can be used to execute multiple tasks, but also to improve the
reliability of a system as a circuit on one portion of the device can be physically
relocated to another portion when a fault occurs. However, to harness this potential for high performance and reliability in a cost-effective manner, novel
runtime management tools are required. Most runtime support tools for
reconfigurable devices are based on ideal models which do not adequately consider
the limitations of realistic FPGAs, in particular modern FPGAs which are
increasingly heterogeneous. Specifically, these tools lack efficient mechanisms for
ensuring a high utilization of FPGA resources, including the FPGA area and the
configuration port and clocking resources, in a reliable manner.
To ensure high utilization of reconfigurable device area, placement management is a
key aspect of these tools. This thesis presents novel techniques for the management
of hardware task placement on COTS reconfigurable devices for high performance
and reliability. To this end, it addresses design-time issues that affect efficient
hardware task placement, with a focus on reliability. It also presents techniques to
maximize the utilization of the FPGA area at runtime, including techniques to minimize fragmentation. Fragmentation creates unusable areas as a result of the dynamic placement of tasks and the heterogeneity of the resources on the chip.
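One hedged way to quantify such fragmentation, using an invented metric over a 1-D occupancy map rather than the thesis's actual formulation, is to compare the largest contiguous free run against the total free area:

# Illustrative fragmentation measure over a 1-D map of reconfigurable
# columns (1 = occupied, 0 = free): when the largest free run is small
# relative to the total free area, placements fail despite spare capacity.
# The metric itself is an assumption, not the thesis's definition.

def fragmentation(columns):
    free = columns.count(0)
    if free == 0:
        return 0.0
    run, best = 0, 0
    for c in columns:
        run = run + 1 if c == 0 else 0
        best = max(best, run)
    return 1.0 - best / free

print(fragmentation([0, 0, 1, 0, 1, 0, 0, 0]))   # scattered free space -> 0.5
print(fragmentation([1, 1, 0, 0, 0, 0, 0, 0]))   # contiguous free space -> 0.0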
Moreover, this thesis also presents an efficient task reuse mechanism to improve the
availability of the internal configuration infrastructure of the FPGA for critical
responsibilities like error mitigation. The task reuse scheme, unlike previous
approaches, also improves the utilization of the chip area by offering
defragmentation.
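The reuse idea can be illustrated with a simple residency check before each configuration, under invented per-task configuration costs; the roughly 29% figure reported later in this abstract comes from the thesis's own evaluation, not from this toy model.

# Sketch of task reuse: before writing a bitstream through the configuration
# port, check whether the same task is already resident from an earlier
# execution and, if so, reuse it at zero configuration cost.
# Costs and task names are invented for illustration.

CONFIG_COST_US = {"fft": 120, "filter": 80, "crc": 40}   # assumed per-task
resident = set()

def place(task):
    if task in resident:
        return 0                      # reuse: configuration port stays free
    resident.add(task)
    return CONFIG_COST_US[task]

trace = ["fft", "filter", "fft", "crc", "fft", "filter"]
with_reuse = sum(place(t) for t in trace)
without = sum(CONFIG_COST_US[t] for t in trace)
print(f"configuration time with reuse: {with_reuse} us, without: {without} us")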
Task relocation, which involves changing the physical location of circuits, is a technique for both error mitigation and high performance. Hence, this thesis also provides
a functionality-based relocation mechanism for improving the number of locations to
which tasks can be relocated on heterogeneous FPGAs. As tasks are relocated, clock
networks need to be routed to them. As such, a reliability-aware technique of clock
network routing to tasks after placement is also proposed.
Finally, this thesis offers a prototype implementation and characterization of a
placement management system (PMS) which is an integration of the aforementioned
techniques. The performance of most of the proposed techniques is tested using data processing tasks from a NASA JPL spectrometer application. The results show that the proposed techniques have the potential to improve the reliability and performance of applications in hostile environments compared to state-of-the-art techniques. The task optimization technique presented provides a greater capacity to circumvent permanent faults on COTS FPGAs than state-of-the-art approaches (48.6% more errors were circumvented for the JPL spectrometer application). The proposed task reuse scheme saves approximately 29% of configuration time, freeing up the internal configuration interface for more error mitigation operations. In addition, the proposed PMS has a worst-case latency of less than 50% of that of state-of-the-art runtime placement systems, while maintaining the same level of placement quality and resource overhead.
Solutions and application areas of flip-flop metastability
PhD Thesis

The state space of every continuous multi-stable system is bound to contain one or more
metastable regions where the net attraction to the stable states can be arbitrarily small.
Flip-flops are among these systems and can take an unbounded amount of time to decide
which logic state to settle to once they become metastable. This problematic behavior is
often prevented by placing the setup and hold time conditions on the flip-flop’s input.
However, in applications such as clock domain crossing, where these constraints cannot be enforced, flip-flops can become metastable and induce catastrophic failures. These
events are fundamentally impossible to prevent but their probability can be significantly
reduced by employing synchronizer circuits. The latter grant flip-flops longer decision
time at the expense of introducing latency in processing the synchronized input.
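The standard mean-time-between-failures model for synchronizers makes this latency/reliability trade-off concrete: MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the resolution time granted before the next stage samples. The device parameters in the sketch below are illustrative, not drawn from the thesis.

import math

# Standard flip-flop synchronizer reliability model.  Each extra stage adds
# roughly one clock period of resolution time at the cost of one cycle of
# added latency.  All device parameters are assumed values.

TAU = 50e-12      # assumed metastability resolution time constant (s)
T_W = 20e-12      # assumed metastability window (s)
F_CLK = 500e6     # clock frequency (Hz)
F_DATA = 100e6    # data transition rate (Hz)

def mtbf(stages):
    resolution_time = (stages - 1) / F_CLK   # time available to resolve
    return math.exp(resolution_time / TAU) / (T_W * F_CLK * F_DATA)

for n in (2, 3):
    print(f"{n}-flop synchronizer MTBF: {mtbf(n):.3e} s")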
This thesis presents a collection of research work involving the phenomenon of
flip-flop metastability in digital systems. The main contributions include three novel
solutions for the problem of synchronization. Two of these solutions are speculative
methods that rely on duplicate state machines to pre-compute data-dependent states
ahead of the completion of synchronization. Speculation is a core theme of this thesis
and is investigated in terms of its functional correctness, cost efficacy and fitness for
being automated by electronic design automation tools. It is shown that speculation
can outperform conventional synchronization solutions in practical terms and is a viable
option for future technologies. The third solution attempts to address the problem of
synchronization in the more-specific context of variable supply voltages. Finally, the
thesis also identifies a novel application of metastability as a means of quantifying
intra-chip physical parameters. A digital sensor is proposed based on the sensitivity
of metastable flip-flops to changes in their environmental parameters and is shown to
have better precision while being more compact than conventional digital sensors.
Design Space Exploration and Resource Management of Multi/Many-Core Systems
The increasing demand for processing ever more applications and data on computing platforms has resulted in a reliance on multi-/many-core chips, as they facilitate parallel processing. However, these platforms are also expected to be energy-efficient and reliable, and to perform secure computations in the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers, in terms of state-of-the-art contributions and upcoming trends.
Low-Impact Profiling of Streaming, Heterogeneous Applications
Computer engineers are continually faced with the task of translating improvements in fabrication process technology (i.e., Moore's Law) into architectures that allow computer scientists to accelerate application performance. As feature size continues to shrink, architects of commodity processors are designing increasingly more cores on a chip. While additional cores can operate independently on some tasks (e.g., the OS and user tasks), many applications see little to no improvement from adding more processor cores alone. For many applications, heterogeneous systems offer a path toward higher performance. Significant performance and power gains have been realized by combining specialized processors (e.g., Field-Programmable Gate Arrays, Graphics Processing Units) with general-purpose multi-core processors. Heterogeneous applications need to be programmed differently than traditional software. One approach, stream processing, fits these systems particularly well because of the segmented memories and explicit expression of parallelism. Unfortunately, debugging and performance tools that support streaming, heterogeneous applications do not exist. This dissertation presents TimeTrial, a performance measurement system that enables performance optimization of streaming applications by profiling the application deployed on a heterogeneous system. TimeTrial performs low-impact measurements by dedicating computing resources to monitoring and by aggressively compressing performance traces into statistical summaries guided by user specification of the performance queries of interest.
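In the spirit of compressing performance traces into statistical summaries, a generic single-pass aggregator (Welford's algorithm, constant memory per monitored stream) can be sketched as follows; this is an editorial illustration, not TimeTrial's actual measurement pipeline.

# Online (single-pass, constant-memory) summary of a stream of latency
# samples: count, mean, variance, min, and max, without storing the trace.

class StreamSummary:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.lo, self.hi = float("inf"), float("-inf")

    def add(self, x):
        # Welford's update keeps a numerically stable running variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.lo, self.hi = min(self.lo, x), max(self.hi, x)

    def report(self):
        var = self.m2 / (self.n - 1) if self.n > 1 else 0.0
        return dict(count=self.n, mean=self.mean, variance=var,
                    min=self.lo, max=self.hi)

s = StreamSummary()
for sample in [1.2, 0.9, 1.5, 3.8, 1.1]:
    s.add(sample)
print(s.report())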