Abstract-"Rebooting Computing" (RC) is an IEEE effort to rethink future computers. RC was started in 2012 by its co-chairs, Elie Track (IEEE Council on Superconductivity) and Tom Conte (Computer Society). RC takes a holistic approach, considering revolutionary as well as evolutionary solutions needed to advance computer technologies. Three summits were held in 2013 and 2014, discussing different technologies, from emerging devices to user interfaces, from security to energy efficiency, and from neuromorphic to reversible computing. The first part of this paper introduces RC to the design automation community and solicits revolutionary ideas from the community for the directions of future computer research.
I. IEEE REBOOTING COMPUTING

A. Introduction
The microelectronic revolution has provided one of the major transformations of the 20th century, enabling the computer and telecommunications industries, with profound technological and societal implications. This was driven by transistor scaling known as "Moore's Law", whereby performance and cost improved as the devices were scaled down. Such exponential improvements cannot go on forever, and it is now widely accepted that after 50 years, Moore's Law is coming to an end. We propose that the end of the traditional Moore's Law scaling provides an opportunity to review and reinvent the entire basis for computing, in order to continue the computer revolution well into the 21st century. Early computers required an initialization process to load the operating system into memory, which became known as "booting up," based on the old saying about "pulling yourself up by your own bootstraps." Even now, if a computer freezes or overloads, a power cycle or "reboot" may be necessary to reinitialize the system. Can we apply this concept metaphorically to the entire computer industry? IEEE Rebooting Computing (RC) is an inter-society initiative of the IEEE Future Directions Committee, started in 2012, to identify future trends in the technology of computing, a goal which is intentionally distinct from refinement of present-day trends.
This is an ambitious endeavor that requires reconsideration of hardware and software at all levels, from nanoscale devices to supercomputers to international networks. For this reason, the RC initiative is a joint project of members of 9 IEEE Societies and Councils, including the Council on Electronic Design Automation (CEDA), reflecting the important role of computer-aided design in the initiative. The RC participating organizations are:
• IEEE Societies:
- Computer Society (CS)
- Circuits and Systems Society (CAS)
- Electron Devices Society (EDS)
- Magnetics Society (MAG)
- Reliability Society (RS)
- Solid-State Circuits Society (SSCS)
• IEEE Councils:
- Electronic Design Automation (CEDA)
- Nanotechnology Council (NANO)
- Council on Superconductivity (CSC)
• Partner: International Technology Roadmap for Semiconductors (ITRS)
RC is co-chaired by Tom Conte and Elie Track. It consists of a team of volunteers from the various IEEE Societies and Councils, as well as staff from IEEE Future Directions. RC has an active Web Portal [1], as well as a presence on various social media sites.
B. RC Summits
In 2013 and 2014, the primary activity was to organize three Summits (known as RCS1-RCS3), bringing together a range of leaders of industry, academia, and government to discuss the future of computing. Reports and presentations from these Summits are available on the RC Web Portal [2].
The primary conclusion from RCS1 is that any future computing technology must be built on three "pillars": Energy Efficiency, Security, and Human-Computer Interactions (see Figure 1). This paper focuses on Energy Efficiency, which is relevant for devices and circuits, for chips and processors, for mobile systems and sensors, all the way to data centers and supercomputers. For mobile systems the primary concern is battery lifetime; for data centers it is the large electricity bill; at the chip level excessive heating leads to the dominance of "dark silicon". RCS2 and RCS3 also addressed the energy efficiency issue by considering several alternative computing approaches. For example, conventional silicon devices and circuits consume orders of magnitude more energy per switching operation than the fundamental lower limit. Alternative circuit designs based on adiabatic and reversible computing may permit drastic reductions in energy, although thus far only in novel device technologies such as quantum dots and superconducting Josephson junction circuits.
Neuromorphic computing is a radically different computer architecture based on a massively parallel network of interconnected electronic neurons, inspired by biological brains. We still don't understand how brains work (or how they can be programmed), but they are undeniably energy efficient. The human brain consumes only about 20 W, and is composed of slow, unreliable elements, yet many of its capabilities are beyond those of a supercomputer.
Approximate computing is an approach that recognizes that there is often a tradeoff between precision and power consumption, and that many modern computer problems (such as image recognition) do not always require calculation of precise results. This approach also includes dealing with random errors associated with low-power operation of nanoscale devices.
Finally, supercomputing technology is moving toward exascale performance (10^18 floating point operations per second), but the total projected power of such a system, of order 200 MW (comparable to a power generator), would be prohibitive. The solution will require either a completely different technology (such as cryogenic superconducting computers), or a different architecture that operates much more efficiently by distributing memory and logic in a way that minimizes shuttling of data between components.
Many of these alternative computing approaches are not yet fully mature, but it is clear that automated design tools are needed and can incorporate power budgets and power-performance tradeoffs at all levels, from circuit design to software implementations. A user can optimize for power efficiency only if power metrics are built into the design tools.
C. 2015 and Future
In 2015, the RC efforts have expanded to include cooperative efforts with several other organizations. For example, recognizing the central importance of energy efficiency in computing, RC created a Low-Power Image Recognition Challenge (LPIRC) [3] as a one-day workshop at the 2015 Design Automation Conference in San Francisco in June 2015. The LPIRC is described further below.
RC also recently entered into a strategic partnership with the International Technology Roadmap for Semiconductors (ITRS) [4]. ITRS has traditionally focused on generating an industry-wide roadmap for Moore's Law scaling, but has recently recognized that the landscape has changed, necessitating a new type of Roadmap, known as ITRS 2.0. As part of this partnership, RC is now working with ITRS, including joint meetings held in February and July [5]. In addition, the 4th RC Summit is being planned for December 2015, at the International Electron Devices Meeting in Washington, DC [6].
RC is also working to promote the concepts of Rebooting Computing to a broader audience. In cooperation with the IEEE Computer Society, the December issue of Computer magazine is dedicated to Rebooting Computing [7] . The issue is being edited by the RC Co-Chairs, and will present a variety of potential approaches for future computing.
Plans for 2016 and beyond are still in development, but may include conferences, roadmaps, and standards for future computing. Standards may include incorporating metrics for energy efficiency throughout the design stack.
II. LOW-POWER IMAGE RECOGNITION CHALLENGE
A. Origin
As wearable cameras become commercially available, real-time visual recognition using battery-powered systems is attracting attention from researchers and engineers. Even though many papers have been published on relevant topics (IEEE Xplore finds more than 1,200 papers when searching "low power" and "image processing"), there is no common benchmark for comparing solutions. A challenge can bring together researchers, take a snapshot of the current technology as a reference for further improvement, and hopefully highlight future progress in this area. The idea of a competition exploring low-power systems and visual recognition was proposed by Yung-Hsiang Lu and David Kirk during the first RC Summit in December 2013.
Identifying objects in images seems a straightforward problem: many people can easily find everyday objects, such as fruit, cars, and microwaves. Writing a computer program to identify objects, however, is surprisingly difficult. Instead of creating a new set of images, LPIRC uses the data from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [8]. Figure 2 shows a sample image with four objects and their bounding boxes. Alexander C. Berg is one of the ILSVRC organizers; he and Lu served as the co-chairs of the first LPIRC. Two sets of images were used for training and testing. The former was released in November 2014 and the latter was used for the challenge on June 7, 2015.
B. Rules
The first LPIRC aimed to impose as few restrictions as possible so that contestants could present their best solutions. It took place during a one-day workshop as part of the 2015 Design Automation Conference. Contestants brought their systems to the San Francisco Convention Center. It was an opportunity for the contestants, as well as spectators, to exchange ideas. The same test images and power meter were used for all teams to ensure fairness. An intranet (both Wi-Fi and RJ45) was established for the contestants' systems to retrieve image files from the referee system and to send the answers back to it. Hypertext Transfer Protocol (HTTP) was used because it is widely supported, including by many embedded systems. The source code of the referee system was released in March 2015 through a public repository [9].
LPIRC evaluates both the accuracy of detection and the amount of energy used. Most evaluations of detection accuracy are based on precision vs. recall plots of the detector evaluated on previously unseen test data. A detector produces a list of detections ordered by confidence, and each point in the list has a cumulative value for precision (the fraction of detections seen so far that are correct) and for recall (the fraction of possible detections that have been seen so far). A target can only be detected once. Subsequent putative detections of the same target are counted as false detections: they do not change the recall, but they do decrease the precision. LPIRC follows standard practice (e.g. [10]), first computing a precision-recall curve for each object category, then determining the average precision for each category, and finally aggregating these by computing the mean average precision across categories, the mAP score. Before the challenge, the training data for detection from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was released. The 5,000 test images were a mix of newly collected and annotated images and some existing test images from ILSVRC 2014. These were retrieved from the referee system during the challenge, and blocks of images were permuted in order between contestants. For many more details about the ILSVRC detection data used for this challenge and on evaluating detection, please see [11]. The detection accuracy (mAP) was divided by the energy used (in watt-hours) to produce a single score for ranking the results. Each team had 10 minutes to process the images. To encourage exploring accuracy-energy trade-offs, each team could present multiple solutions (if the team had multiple registrations).
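The scoring procedure described above can be illustrated with a minimal sketch. It assumes a simple non-interpolated form of average precision (the mean of the precision values at each newly detected target, divided by the number of targets); the exact formula used by the referee system follows [10] and may differ, for instance by interpolating the precision-recall curve. The function names and the input layout are hypothetical.

```python
def average_precision(detections, num_targets):
    """Non-interpolated AP for one category (an assumed, simplified variant).

    detections: list of (confidence, target_id); target_id is None when the
    detection matches no target. A repeated target_id is a duplicate and is
    counted as a false detection.
    """
    seen = set()
    true_positives = 0
    precision_sum = 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    for rank, (_, target) in enumerate(ranked, start=1):
        if target is not None and target not in seen:
            seen.add(target)
            true_positives += 1
            # Precision at the point where recall increases.
            precision_sum += true_positives / rank
    return precision_sum / num_targets if num_targets else 0.0


def mean_average_precision(per_category):
    """per_category: dict mapping category -> (detections, num_targets)."""
    aps = [average_precision(d, n) for d, n in per_category.values()]
    return sum(aps) / len(aps)


def lpirc_score(map_value, energy_watt_hours):
    """Single ranking score: accuracy divided by energy."""
    return map_value / energy_watt_hours
```

In this sketch a duplicate detection lowers the precision of every later correct detection without adding recall, which matches the rule that a target can only be detected once.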
C. Challenge Day
In total, 34 people registered for the challenge, forming 10 teams from the USA, China, Canada, and Taiwan. Two teams were unable to present their solutions on June 7. The remaining 8 teams presented 20 solutions. Among the eight teams, seven chose Track one (no offloading) and one chose Track two (offloading). The order of the teams was determined by drawing lots. The team in Track two had a software mistake and was unable to report any detected objects. The first prize in Track one was given to the Tsinghua-Huawei team, and the second prize in Track one was given to the Chinese Academy of Sciences Institute of Automation (CASIA)-Huawei team. These two teams describe their approaches later in this paper. The other prizes were: Third prize: Tsinghua-Huawei team; Highest accuracy with low energy: Tsinghua-Huawei team; Least energy with high accuracy: CASIA-Huawei team; "Ready-to-Go" Prize: Carnegie Mellon University; "Standing Alone" Prize: Rice University. More details of the first LPIRC can be found in the report [12].
D. Future LPIRC
Preparation for the second LPIRC has already started. For more information, please visit the web site [3], or contact the LPIRC Co-Chairs, Yung-Hsiang Lu and Alexander C. Berg.
III. PIPELINED FAST RCNN ON EMBEDDED GPU
Figure 3 illustrates an overview of the Tsinghua-Huawei team's system. The system is based on the Fast R-CNN framework [13] and the NVIDIA Jetson TK1 embedded GPU platform.
For the detection algorithm, Fast R-CNN was selected after comparing several recent detection frameworks, including Region-based Convolutional Neural Network (R-CNN) [14], Spatial Pyramid Pooling Network (SPPnet) [15], and Fast R-CNN [13]. In the framework, multiple bottom-up region proposals (∼200 proposals) were extracted from the input image by EdgeBoxes (EB) [16] or Binarized Normed Gradients (BING) [17]. Then the input image was resized to a large scale and fed into the convolutional (Conv) layers of a Convolutional Neural Network (CNN) to extract the convolutional feature map only once. The feature map was then reused to provide convolutional features for every region proposal. Regions in the feature map corresponding to the locations of the region proposals were spatially pooled into fixed-size features. Those features were input into the fully-connected (FC) layers of the CNN, and the FC layers output two vectors for each region proposal: softmax probabilities for classification, and per-class bounding-box regression (BB) offsets. Finally, a list of bounding box coordinates, each with an object class label and a confidence score, was output as the detection result.
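The step of pooling a region of the shared feature map into a fixed-size feature can be sketched as follows. This is an illustrative, single-channel version written with plain Python lists; the 2x2 output size, the inclusive cell-index convention for the region, and the function name are assumptions made for the example and are not taken from the team's implementation.

```python
import math


def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (single channel, for illustration only).
    roi: (x1, y1, x2, y2) in feature-map cells, inclusive on both ends.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Row range of bin i; floor/ceil keeps every bin non-empty.
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Because every proposal is pooled from the same feature map, the convolutional layers run once per image regardless of the number of proposals; only this pooling step and the FC layers are repeated per proposal.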
For the hardware implementation, the system used the NVIDIA Jetson TK1 development kit with an embedded GPU and a quad-core ARM CPU. The detection algorithm was decoupled into (1) the region proposal part and (2) the CNN part, which were executed by the ARM cores and the embedded GPU, respectively. Moreover, LPIRC required a network interface to download data and upload results. A two-stage pipeline was used to take full advantage of the computing resources. Specifically, the first stage of the pipeline downloaded images from the referee, opened the image file, and extracted region proposals with the ARM cores. The second stage ran the CNN on the GPU and uploaded the results back to the referee using the ARM cores.
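The two-stage pipeline can be sketched with two threads connected by a bounded queue. This is only an illustration of the structure: the team's system used shared memory between the ARM cores and the GPU rather than Python threads, and the two stage functions passed in here (`fetch_and_propose`, `detect_and_upload`) are hypothetical placeholders for the download/proposal stage and the CNN/upload stage.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the image stream


def run_pipeline(image_ids, fetch_and_propose, detect_and_upload, depth=4):
    """Run stage 1 and stage 2 concurrently, linked by a bounded queue."""
    handoff = queue.Queue(maxsize=depth)
    results = []

    def stage_one():
        for image_id in image_ids:
            # put() blocks while the queue is full, so stage 1 cannot run
            # arbitrarily far ahead of stage 2.
            handoff.put(fetch_and_propose(image_id))
        handoff.put(_DONE)

    def stage_two():
        while True:
            item = handoff.get()
            if item is _DONE:
                break
            results.append(detect_and_upload(item))

    workers = [threading.Thread(target=stage_one),
               threading.Thread(target=stage_two)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

The `depth` parameter controls how many images can wait between the stages; a deeper queue absorbs temporary speed differences between the two stages at the cost of more memory.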
The following sections explain the implementation in detail. The two solutions with BING and EdgeBoxes (EB) won the 1st and 3rd prizes of LPIRC 2015, respectively.
B. Region Proposals
A variety of papers have offered methods for generating category-independent region proposals. The team prepared four methods [18], namely SelectiveSearch (SS), EdgeBoxes (EB), Binarized Normed Gradients (BING), and Geodesic (GOP), before the competition. The team finally developed two solutions with EB [16] and BING [17], respectively. BING aims at a faster implementation, while EB balances speed and detection accuracy (mAP). Moreover, recent work has demonstrated that mAP increases slowly with the number of region proposals [18, 13]. However, more region proposals also demand more processing energy, which increases faster than the mAP. This implementation set the maximum number of region proposals to 200 in order to trade off mAP against energy.
C. Feature Extraction and Object Classification
The Convolutional Neural Network (CNN) plays a key role in modern object detection frameworks in achieving high detection accuracy. The team chose CaffeNet (essentially AlexNet [19]) to fit within the limited memory (2 GB) of the TK1. Following the Fast R-CNN framework [13], when extracting the convolutional feature map for the solution with EdgeBoxes (EB), the shortest side of the input image was resized to 600 pixels and the longest side was capped at 1000 pixels to avoid running out of memory. Experiments in which the scale was increased to 750 pixels or decreased to 450 pixels suggested that 600 pixels might be an optimal choice for mAP. In order to increase the detection accuracy of EB, only the top-30 region proposals with a score above 10^-3 after bounding box regression (BB) were kept, and non-maximum suppression (NMS) was then performed on them independently for each class. In contrast, in the BING solution the team shrank the input scale from 600-1000 to 450-750, removed BB, and increased the threshold of the region proposal score before NMS to 0.1 in order to increase the processing speed. These techniques made the BING solution ∼1.5× faster than the EB solution.
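The input scaling rule and the score filtering followed by NMS can be sketched as below. The 600/1000 scale, the 10^-3 score threshold, and the top-30 cut-off come from the text; the IoU threshold of 0.3 is an assumption (the paper does not state it), and the function names and box format (x1, y1, x2, y2) are hypothetical.

```python
def resize_scale(width, height, target_short=600, max_long=1000):
    """Scale factor that maps the short side to target_short pixels,
    reduced if needed so that the long side does not exceed max_long."""
    short_side, long_side = min(width, height), max(width, height)
    scale = target_short / short_side
    if scale * long_side > max_long:
        scale = max_long / long_side
    return scale


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def filter_and_nms(boxes, scores, score_min=1e-3, top_k=30, iou_max=0.3):
    """Keep the top_k boxes scoring above score_min, then apply greedy NMS.

    Returns indices into boxes for one class. iou_max is an assumed value.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    candidates = [i for i in order if scores[i] > score_min][:top_k]
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_max for j in kept):
            kept.append(i)
    return kept
```

For the BING configuration described above, the same helpers would be called with `target_short=450`, `max_long=750`, and `score_min=0.1`.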
D. Training
The training procedure was based on the Fast R-CNN framework with some modifications for the ImageNet dataset. The CaffeNet was trained on the ILSVRC2014 [8] val1 dataset and validated on the val2 dataset [14]. Images without any ground-truth object were ignored during training. Instead of fine-tuning the CaffeNet for classification directly on the detection dataset, the team started with the CaffeNet that had been fine-tuned by R-CNN [14]. In order to prepare more training data, both the SelectiveSearch (SS) and EdgeBoxes (EB) methods were used to generate region proposals (∼4k proposals per image in total) when fine-tuning the network in the EB solution. It was observed that training with both proposal methods provided better results than training with either method alone. After the network for EB was trained, the team fine-tuned the EB network (with only a different scale and proposal method) for the BING solution with a very small learning rate (1e-4 to 1e-6).
E. Pipeline with Shared Memory
The heterogeneous ARM cores and GPU share the main memory on the TK1. The shared memory was used extensively to facilitate information sharing, message passing, and memory reservation. Specifically, the shared memory was split into six zones: one for sharing the downloaded image data, one serving as the queue for region proposals that cascades the two pipeline stages, and one for the final results. The other zones were reserved to avoid frequent memory swapping.
The size of each memory zone and the depth of the queue were carefully configured in order to balance the speed of the two pipeline stages. For example, EdgeBoxes (EB) runs much slower on the ARM cores than BING, and may sometimes even fall behind the CNN kernel. Therefore, the queue depth was increased to cope with the imbalance. The other zones were reduced accordingly for the best performance.
F. Results
The team first tested the mAP of the two solutions on the val2 dataset of R-CNN [14]. With 200 proposals, the EdgeBoxes (EB) solution achieved 26% mAP and the BING solution achieved 15.3% mAP. The speeds of the two solutions were 1.2 s/image and 0.8 s/image, respectively. The power consumption of the TK1 was ∼9.6 W with the USB and VGA ports unplugged. The two solutions achieved similar results in terms of mAP/energy in this offline test. Surprisingly, the BING solution outperformed the EB solution by ∼60% in the final competition. The BING solution achieved an mAP of 2.971e-2 with an energy of 1.634 watt-hours. The EB solution achieved an mAP of 1.816e-2 with an energy of 1.574 watt-hours. The final mAP/energy scores for BING and EB were 1.818e-2 and 1.154e-2, respectively. The results imply that a faster solution that processes more images may achieve a better score in terms of mAP/energy. The energy cost was similar for the two solutions because neither of them finished processing all 5,000 images within the time limit (10 minutes).
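The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the offline per-image speeds and the ∼9.6 W power draw also held during the 10-minute run, which ignores network transfer time, so the image counts are upper bounds rather than measured values. The helper names are hypothetical.

```python
TIME_LIMIT_MS = 10 * 60 * 1000  # 10-minute limit from the rules
POWER_W = 9.6                   # approximate TK1 power reported above


def max_images(ms_per_image, time_limit_ms=TIME_LIMIT_MS):
    """Upper bound on images processed at a fixed per-image time."""
    return time_limit_ms // ms_per_image


def energy_wh(power_w, seconds):
    """Energy in watt-hours for a constant power draw."""
    return power_w * seconds / 3600.0


def score(map_value, energy):
    """LPIRC ranking score: mAP divided by energy in watt-hours."""
    return map_value / energy


if __name__ == "__main__":
    print("BING images (bound):", max_images(800))
    print("EB images (bound):", max_images(1200))
    print("Energy over 10 min (Wh):", energy_wh(POWER_W, 600))
    print("BING score:", score(2.971e-2, 1.634))
    print("EB score:", score(1.816e-2, 1.574))
```

Under these assumptions, neither solution can reach 5,000 images in 10 minutes (at most 750 for BING and 500 for EB), and ten minutes at ∼9.6 W corresponds to about 1.6 watt-hours, close to the measured 1.634 and 1.574 watt-hours. The ratio of the two scores is about 1.58, consistent with the ∼60% advantage quoted for BING.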
G. Related Work and Discussion
Modern general object detection systems are mainly based on the R-CNN framework [14], which includes three steps: 1) coarse-grained region proposal extraction; 2) CNN feature extraction and object classification; and 3) fine-grained bounding box regression. The original R-CNN requires a full execution of the CNN for each region proposal, which makes it slow [14]. SPPnet improves the speed by ∼10× by reusing the convolutional feature map [15]. Fast R-CNN [13] uses a simplified version of SPPnet and merges steps 2 and 3 into the CNN, improving detection accuracy through better bounding box regression. However, all of the above frameworks use traditional bottom-up region proposal methods such as SelectiveSearch (SS) or EdgeBoxes (EB). Such a framework is not an "End-to-End" solution, and region proposal methods are currently its speed bottleneck, especially for Fast R-CNN. Therefore, a recently proposed method, Faster R-CNN [20], merges all three steps into the CNN and realizes an "End-to-End" detection framework by using the CNN to extract region proposals. State-of-the-art results have been reported for this method. The team believes the "End-to-End" model will attract more attention in the future. Finally, the team used an embedded GPU platform for ease of development, although FPGAs have been demonstrated to be more energy efficient than CPUs and GPUs. A more energy-efficient FPGA-based detection solution is expected to enable better low-power image recognition systems.
IV. OBJECT DETECTION BASED ON FAST OBJECT PROPOSAL AND REPRESENTATION
A. Introduction
The CASIA-Huawei team introduced two integrated frameworks of BING (Binarized Normed Gradients) and fast-RCNN (region-based convolutional neural networks) for object detection of 200 classes. The two frameworks were embedded in the NVIDIA Jetson TK1 platform [21] . 
B. Method
With the power of convolutional neural networks (CNN) [19], modern object detection methods significantly improve the accuracy. With the power of object proposal methods, modern detection methods are sped up by excluding many detection windows. To meet the requirements of LPIRC, the team adopted BING [17] to extract object proposals, and two different CNNs were used to extract feature representations in the framework of fast-RCNN [13] for 200-class object detection. BING [17], proposed by Cheng et al., is a very fast method for extracting object proposals. According to the paper, BING generates a small set of category-independent, high-quality windows. Fast-RCNN [13], proposed by Girshick, is an enhancement of the well-known object detection framework called region-based CNN (RCNN) [14]. There are two main improvements. First, the Region-Of-Interest Pooling layer is used to achieve 146 times faster performance for the VGG16 network [22]. Second, adding truncated SVD of the fully-connected layers in the CNN further enhances the speed. In total, Fast-RCNN runs 213 times faster than RCNN. The pipeline of the team's method is shown in Figure 4.
C. Implementation
BING: The original BING extracts features from rectangles of different scales. To make it faster, the team directly dropped the smallest 10 scales. Moreover, the team only stored the top 100 scored proposals for further use in fast-RCNN.
Fast-RCNN: The team prepared two CNNs: Alex-net [19] and small-net. Alex-net was downloaded directly from Caffe's official website. The small-net was designed by the team to process all 5,000 images in 10 minutes. Its structure is: K5N64S3-K4N128S2-K3N192-K3N128S1-Fc512-Fc512-Fc1000, where "K" stands for kernel size, "N" for the number of channels, and "S" for stride. The small-net was trained from scratch on ILSVRC 1000-class images from the training set, reaching a top-1 accuracy of 0.284. Both networks were fine-tuned in Fast-RCNN using the validation set of ILSVRC 2013. Table I reports the competition results and Table II reports the offline results. For the offline results, 15,121 images from the ILSVRC 2013 validation dataset were used for fine-tuning and the remaining 5,000 images for testing, and the accuracy was measured by mAP. In the competition, the Alex-net solution was ranked 2nd among all 20 solutions and the small-net solution won the prize for the least energy with high accuracy. The difference (about 4 times in accuracy) between the offline results and the competition results came mainly from the slow image downloading and result uploading in the competition. Since the method demonstrates a processing speed near 150 ms per image, its speed is very close to the rate of image downloading and result uploading via the network, so the network had a great influence on the final score. A possible solution is to make a separate pipeline stage for image downloading and uploading, which is the team's advice to future contestants.
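The compact structure string of the small-net can be expanded into an explicit layer list with a short parser, which makes the notation easier to read. This is a sketch only: the function names are hypothetical, and treating a missing "S" field (as in K3N192) as stride 1 is an assumption, since the paper does not state the default.

```python
import re

SMALL_NET = "K5N64S3-K4N128S2-K3N192-K3N128S1-Fc512-Fc512-Fc1000"


def parse_layer(token):
    """Parse one token, e.g. 'K5N64S3' (conv) or 'Fc512' (fully connected)."""
    if token.startswith("Fc"):
        return {"type": "fc", "outputs": int(token[2:])}
    match = re.fullmatch(r"K(\d+)N(\d+)(?:S(\d+))?", token)
    if match is None:
        raise ValueError("unrecognized layer token: " + token)
    kernel, channels, stride = match.groups()
    return {
        "type": "conv",
        "kernel": int(kernel),
        "channels": int(channels),
        # Assumption: a token without an S field uses stride 1.
        "stride": int(stride) if stride else 1,
    }


def parse_spec(spec):
    """Split a '-'-separated structure string into a list of layer dicts."""
    return [parse_layer(token) for token in spec.split("-")]
```

Applied to the string above, the parser yields four convolutional layers followed by three fully-connected layers, the last of which has 1000 outputs for the 1000-class pre-training task.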
D. Results
[Table I: competition results. Table II: offline results.]
V. CONCLUSION
IEEE Rebooting Computing aims to rethink future directions of computer technologies. This initiative explores many possible solutions and has obtained support from many organizations. Energy efficiency is one of the pillars of future computer technologies, and the first Low-Power Image Recognition Challenge serves as a benchmark. Future RC activities may include conferences, roadmaps, and standards. The second LPIRC will be held in 2016.
