141 research outputs found
Energy-Aware Scheduling of Conditional Task Graphs on NoC-Based MPSoCs
We investigate the problem of scheduling a set of tasks with individual deadlines and conditional precedence constraints on a heterogeneous Network on Chip (NoC)-based Multi-Processor System-on-Chip (MPSoC) such that the total expected energy consumption of all the tasks is minimized, and propose a novel approach. Our approach consists of a scheduling heuristic for constructing a single unified schedule for all the tasks and assigning a frequency to each task and each communication assuming continuous frequencies, an Integer Linear Programming (ILP)-based algorithm and a polynomial time heuristic for assigning discrete frequencies and voltages to tasks and communications. We have performed experiments on 16 synthetic and 4 real-world benchmarks. The experimental results show that compared to the state-of-the-art approach, our approach using the ILP-based algorithm and our approach using the polynomial-time heuristic achieve average improvements of 31% and 20%, respectively, in terms of energy reduction
Dynamic Power Management for Neuromorphic Many-Core Systems
This work presents a dynamic power management architecture for neuromorphic
many core systems such as SpiNNaker. A fast dynamic voltage and frequency
scaling (DVFS) technique is presented which allows the processing elements (PE)
to change their supply voltage and clock frequency individually and
autonomously within less than 100 ns. This is employed by the neuromorphic
simulation software flow, which defines the performance level (PL) of the PE
based on the actual workload within each simulation cycle. A test chip in 28 nm
SLP CMOS technology has been implemented. It includes 4 PEs which can be scaled
from 0.7 V to 1.0 V with frequencies from 125 MHz to 500 MHz at three distinct
PLs. By measurement of three neuromorphic benchmarks it is shown that the total
PE power consumption can be reduced by 75%, with 80% baseline power reduction
and a 50% reduction of energy per neuron and synapse computation, all while
maintaining temporary peak system performance to achieve biological real-time
operation of the system. A numerical model of this power management model is
derived which allows DVFS architecture exploration for neuromorphics. The
proposed technique is to be used for the second generation SpiNNaker
neuromorphic many core system
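The per-cycle choice of a performance level can be illustrated with a minimal sketch. It is not the authors' implementation: the names `PerformanceLevel` and `select_level` are hypothetical, the middle level's voltage and frequency are assumed (the abstract gives only the 0.7 V / 125 MHz and 1.0 V / 500 MHz endpoints), and the 1 ms simulation tick used in the usage note is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceLevel:
    name: str
    voltage_v: float
    freq_mhz: float


# Endpoints follow the abstract; the middle level is an assumed placeholder.
LEVELS = (
    PerformanceLevel("PL1", 0.7, 125.0),
    PerformanceLevel("PL2", 0.85, 250.0),  # assumed values
    PerformanceLevel("PL3", 1.0, 500.0),
)


def select_level(workload_cycles, tick_us, levels=LEVELS):
    """Return the slowest level that finishes the workload within one tick.

    MHz multiplied by microseconds gives the clock cycles available per tick.
    If no level is fast enough, fall back to the fastest one.
    """
    ordered = sorted(levels, key=lambda lvl: lvl.freq_mhz)
    for level in ordered:
        if workload_cycles <= level.freq_mhz * tick_us:
            return level
    return ordered[-1]
```

With an assumed 1 ms tick, `select_level(100_000, 1000.0)` picks the lowest level, while a workload of 400,000 cycles needs the highest one.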
Query processing on low-energy many-core processors
Aside from performance, energy efficiency is an increasing challenge in database systems. To tackle both aspects in an integrated fashion, we pursue a hardware/software co-design approach. To fulfill the energy requirement from the hardware perspective, we utilize a low-energy processor design that lets us place hundreds to millions of chips on a single board without any thermal restrictions. Furthermore, we address the performance requirement by developing several database-specific instruction set extensions to customize each core, although not every core carries all extensions. Therefore, our hardware foundation is a low-energy processor consisting of a high number of heterogeneous cores. In this paper, we introduce our hardware setup on a system level and present several challenges for query processing. Based on these challenges, we describe two implementation concepts and a comparison between these concepts. Finally, we conclude the paper with some lessons learned and an outlook on our upcoming research directions
SoC-Based In-Storage Processing: Bringing Flexibility and Efficiency to Near-Data Processing
Data are among the most valuable assets in the modern world, and they have caused a revolutionary stage in human life. Nowadays, companies make knowledge-based decisions by analyzing a huge volume of data, super-scale data centers are used to process customers’ data to suggest products to them, government services rely on the data people provide to them, and there are many similar cases wherein data are used as an important asset. Data are originally stored in storage systems. To process data, application servers need to fetch the data from storage units, which imposes the cost of moving the data to the system. This cost has a direct relationship to the distance of the processing engines from the data, and this is the key motivation for the emergence of distributed processing platforms such as Hadoop, which bring the process closer to the data. In-storage processing (ISP) pushes the “bring the process to data” paradigm to its ultimate boundaries by utilizing processing engines inside the storage units to process data. The architecture of modern solid-state drives (SSDs) provides a suitable environment for implementing such technology. Thus, this dissertation focuses on SSD architectures that are able to run user applications in-place, which are called computational storage devices (CSDs). In this dissertation, we propose CSD architectures and investigate the benefits of deploying CSDs for running different applications. This research uses a practical approach that includes building fully functional prototypes of the proposed CSD architectures, developing storage systems equipped with the CSDs, and running different benchmarks to investigate the benefits of deploying the CSDs in the systems.
This research proposes two different CSD architectures, namely CompStor and Catalina. These are the first CSDs to be equipped with a dedicated ISP engine for running user applications in-place that includes a quad-core ARM Cortex-A53 processor together with FPGA- and application-specific integrated circuit (ASIC) based accelerators. The proposed architectures run a full-fledged operating system inside, which provides a flexible environment for running a wide range of user applications in-place. The system-on-chip (SoC) based architecture of Catalina CSD, together with a software stack developed for seamless deployment of the CSD, makes it a platform for the implementation of different ISP concepts and ideas. To the best of our knowledge, Catalina is the only ISP platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and message passing interface (MPI) based applications in-place without any modifications to the underlying distributed processing framework. We performed extensive experimental tests using several datasets on both CompStor and Catalina CSDs. The experimental results show up to 2.2x and 4.3x improvements in performance and energy consumption, respectively, for running Hadoop MapReduce benchmarks using Catalina CSDs and up to 5.4x and 8.9x improvements for running 1-, 2-, and 3-dimensional DFT algorithms due to the Neon SIMD engines inside Catalina. Additionally, using FPGA-based accelerators, Catalina CSDs can improve the performance and energy consumption of a highly demanding image similarity search application up to 11x and 7x, respectively
A database accelerator for energy-efficient query processing and optimization
Data processing on a continuously growing amount of information and increasing power restrictions have become a ubiquitous challenge in our world today. Besides parallel computing, a promising approach to improve the energy efficiency of current systems is to integrate specialized hardware. This paper presents a Tensilica RISC processor extended with an instruction set to accelerate basic database operators frequently used in modern database systems. The core was taped out in a 28 nm SLP CMOS technology and allows energy-efficient query processing as well as query optimization by applying selectivity estimation techniques. Our chip measurements show a 1000x energy improvement on selected database operators compared to state-of-the-art systems
Towards Efficient Resource Allocation for Embedded Systems
The main topic is dynamic resource allocation in embedded systems, especially the allocation of computing time and network traffic on a multiprocessor system on chip (MPSoC). The idea is to dynamically schedule a mobile communication signal processing pipeline on the chip to improve hardware resource efficiency without dramatically increasing resource consumption through dynamic scheduling overhead. Both software and hardware modules are examined for resource consumption hotspots and optimized to remove them. Since signal processing applications can usually be described with the help of static data flow (SDF) graphs, their dynamic scheduling is optimized to improve resource consumption over the commonly used static scheduling approach. A hybrid dynamic scheduler is presented that combines the benefits of process networks and task graph scheduling. It allows the scheduler to optimally balance the parallelization of computation against the added dynamic scheduling overhead. The resulting dynamically created schedule reduces resource consumption by about 50%, with a runtime increase of only 20% compared to a static schedule. Additionally, a distributed dynamic SDF scheduler is proposed that splits the scheduling into different parts, which are then connected into a scheduling pipeline to incorporate multiple processors working in parallel. Each scheduling stage is reworked into a load-balanced cluster to further increase the number of parallel scheduling jobs. This way, the remaining dynamic scheduling bottleneck of a centralized scheduler is widened, so that 7x more processors can be handled with the pipelined, clustered dynamic scheduler for a typical signal processing application.
The presented dynamic scheduling system assumes the presence of three different communication modes between the processing cores. When these are emulated on top of the commonly used remote direct memory access (RDMA) protocol, performance issues are encountered. RDMA itself is well suited to single-shot point-to-point data transfers, as used in task graph scheduling. Process networks, in contrast, usually make use of high-volume, high-bandwidth data streams. A first in first out (FIFO) communication solution is presented that implements a cyclic buffer on both sender and receiver to serve this need. The buffer handling and the data transfer between the buffers are done purely in hardware to remove software overhead from the application. The implementation improves multi-user access to area-efficient single-port on-chip memory modules. It achieves 0.8 of the theoretically possible bandwidth, which is usually only achieved with area-expensive dual-port memories. The third communication mode defines a lightweight message passing (MP) implementation that is truly connectionless. It is needed for efficient inter-process communication within the distributed and clustered scheduling system and for the tight coupling of the worker processing units. A hardware flow control ensures that an arbitrary number of senders can spontaneously start sending messages to the same receiver. All messages are nevertheless guaranteed to be received correctly, without the need for connection establishment and with a low message delay.
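The idea of a cyclic buffer on both the sender and the receiver side can be illustrated with a small software model. This is only a sketch under stated assumptions: the thesis describes a hardware mechanism, and the class `CyclicBuffer` and the function `transfer` are hypothetical names that model the behaviour, not the actual design.

```python
class CyclicBuffer:
    """Fixed-capacity FIFO backed by a ring of slots."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._head = 0   # index of the oldest element
        self._count = 0  # number of stored elements

    def __len__(self):
        return self._count

    def is_full(self):
        return self._count == len(self._slots)

    def push(self, item):
        """Append an item; raise IndexError if the buffer is full."""
        if self.is_full():
            raise IndexError("buffer full")
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = item
        self._count += 1

    def pop(self):
        """Remove and return the oldest item; raise IndexError if empty."""
        if self._count == 0:
            raise IndexError("buffer empty")
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return item


def transfer(sender, receiver):
    """Move items until the sender is empty or the receiver is full.

    Stands in for the hardware data mover; returns the number of items moved.
    """
    moved = 0
    while len(sender) > 0 and not receiver.is_full():
        receiver.push(sender.pop())
        moved += 1
    return moved
```

In this model the producer only writes into the sender-side buffer and the consumer only reads from the receiver-side buffer; `transfer` stops as soon as the receiver is full, which acts as a simple form of back-pressure.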
The work focuses on the hardware-software codesign optimization to increase the uncompromised resource efficiency of dynamic SDF graph scheduling. Special attention is paid to the inter-level dependencies in developing a distributed scheduling system, which relies on the availability of specific hardware-accelerated communication methods.
1 Introduction
1.1 Motivation
1.2 The Multiprocessor System on Chip Architecture
1.3 Concrete MPSoC Architecture
1.4 Representing LTE/5G baseband processing as Static Data Flow
1.5 Computation Stack
1.6 Performance Hotspots Addressed
1.7 State of the Art
1.8 Overview of the Work
2 Hybrid SDF Execution
2.1 Addressed Performance Hotspot
2.2 State of the Art
2.3 Static Data Flow Graphs
2.4 Runtime Environment
2.5 Overhead of Deploying Tasks to an MPSoC
2.6 Interpretation of SDF Graphs as Task Graphs
2.7 Interpreting SDF Graphs as Process Networks
2.8 Hybrid Interpretation
2.9 Graph Topology Considerations
2.10 Theoretic Impact of Hybrid Interpretation
2.11 Simulating Hybrid Execution
2.12 Pipeline SDF Graph Example
2.13 Random SDF Graphs
2.14 LTE-like SDF Graph
2.15 Key Learnings
3 Distribution of Management
3.1 Addressed Performance Hotspot
3.2 State of the Art
3.3 Revising Deployment Overhead
3.4 Distribution of Overhead
3.5 Impact of Management Distribution to Resource Utilization
3.6 Reconfigurability
3.7 Key Learnings
4 Sliced FIFO Hardware
4.1 Addressed Performance Hotspot
4.2 State of the Art
4.3 System Environment
4.4 Sliced Windowed FIFO buffer
4.5 Single FIFO Evaluation
4.6 Multiple FIFO Evaluation
4.7 Hardware Implementation
4.8 Key Learnings
5 Message Passing Hardware
5.1 Addressed Performance Hotspot
5.2 State of the Art
5.3 Message Passing Regarded as Queueing
5.4 A Remote Direct Memory Access Based Implementation
5.5 Hardware Implementation Concept
5.6 Evaluation of Performance
5.7 Key Learnings
6 Summary
Next generation of Exascale-class systems: ExaNeSt project and the status of its interconnect and storage development
The ExaNeSt project started in December 2015 and is funded by the EU H2020 research framework (call H2020-FETHPC-2014, n. 671553) to study the adoption of low-cost, Linux-based, power-efficient 64-bit ARM processor clusters for Exascale-class systems. The ExaNeSt consortium pools partners with industrial and academic research expertise in storage, interconnects and applications that share a vision of a European Exascale-class supercomputer. The common goal is to design and implement a physical rack prototype together with its cooling system, the non-volatile memory (NVM) architecture and a unified low-latency interconnect able to test different options for network and storage. Furthermore, the consortium aims to provide real HPC applications to validate the system. In this paper we describe the unified data and storage network architecture, reporting on the status of development of different testbeds and highlighting preliminary benchmark results obtained through the execution of scientific, engineering and data analytics scalable application kernels
Chat Robot for Medical Applications
The Medical bot project is built using artificial intelligence algorithms that analyse the user’s queries and
understand the user’s message. The system is a web application that answers the queries of
patients. Patients simply put their query to the bot, which is used for chatting. They can chat in
any format; there is no specific format the user has to follow. The system uses built-in artificial
intelligence to answer the query, and the answers are appropriate to what the user asks. The user can
query any medical-related activities through the system and does not have to go to the
medical facility in person for an enquiry. The system analyses the question and then answers the user
as if the query were answered by a person. The system replies through an effective graphical user
interface, which gives the impression that a real person is talking to the user. The user only has to register
with the system and log in. After login, the user can access the various help pages.
These help pages contain the bot, through which the user can chat by asking queries related to medical
activities. The user can
query medical-related activities online with the help of this web application,
such as the date and timing of annual day, sports day, and other
cultural activities. This system helps patients stay updated about the medical activities
Identification of plant Syndrome using IPT
Agricultural productivity is something on which the Indian economy depends heavily.
This is one of the reasons that disease detection in plants plays a vital role in agriculture,
as disease in plants is unavoidable. If proper care is not taken in this area,
plants are seriously affected and the overall agricultural yield suffers.
For instance, little leaf disease is a hazardous disease found in pine trees in
the United States. Detecting plant disease through an automatic technique is beneficial because it
greatly reduces the monitoring work on large crop farms, and identifying the symptoms of disease
properly at a very early stage can increase productivity.
This paper presents an image segmentation algorithm that is used for the
automatic detection and classification of plant leaf diseases. It also covers disease
classification techniques that can be used for plant leaf disease detection. Image segmentation
is a method that segments the raw images into two or more clusters; the
programmed algorithm then analyzes these clusters for disease classification and
prediction of type of disease that a plant leaf gets affecte