
    Integrated Photonic Tensor Processing Unit for a Matrix Multiply: A Review

    The explosion of artificial intelligence and machine-learning algorithms, coupled with the exponential growth of exchanged data, is driving the search for novel application-specific hardware accelerators. Among the many candidates, photonics is particularly well placed to handle this global data explosion, thanks to its almost unlimited bandwidth combined with low energy consumption. In this review, we outline the main advantages that photonics has over electronics for hardware accelerators, and then compare the major architectures implemented on Photonic Integrated Circuits (PICs) for both the linear and nonlinear parts of neural networks. Finally, we highlight the main driving forces for the next generation of photonic accelerators, as well as the main limits that must be overcome.

    Evaluation of Clustering Algorithms on HPC Platforms

    Clustering algorithms are among the most widely used kernels for generating knowledge from large datasets. These algorithms group a set of data elements (e.g., images, points, patterns) into clusters to identify patterns or common features of a sample. However, they are computationally expensive, as they often involve costly fitness functions that must be evaluated for every point in the dataset. This cost is even higher for fuzzy methods, where each data point may belong to more than one cluster. In this paper, we evaluate different parallelisation strategies on different heterogeneous platforms for fuzzy clustering algorithms typically used in the state of the art, such as Fuzzy C-means (FCM), Gustafson-Kessel FCM (GK-FCM) and Fuzzy Minimals (FM). The experimental evaluation includes performance and energy trade-offs. Our results show that, depending on the computational pattern of each algorithm, its mathematical foundation and the amount of data to be processed, each algorithm performs better on a different platform.
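    As an illustration of the computational pattern these parallelisation strategies target, here is a minimal NumPy sketch of the standard FCM iteration (not the paper's heterogeneous implementations); GK-FCM and FM change the distance measure and cost function but share the same compute-heavy all-points-against-all-centroids structure:

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, tol=1e-5, seed=0):
    """X: (n, d) data; c: number of clusters; m > 1: fuzzifier."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)            # fuzzy memberships: each row sums to 1
    for _ in range(iters):
        Um = U ** m
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared distance of every point to every centroid: an (n, c) matrix.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        d2 = np.fmax(d2, 1e-12)                  # guard against division by zero
        # Membership update: u_ij = 1 / sum_k (d_ij^2 / d_ik^2)^(1/(m-1))
        U_new = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centroids, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
centroids, U = fcm(X, c=2)
print(centroids.round(2))                        # roughly the two cluster centres
```

    The (n, c) distance computation and membership update are the parts that map naturally onto the GPU and many-core platforms evaluated in the paper.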

    TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training and Inference

    TensorDash is a hardware-level technique that enables data-parallel MAC units to take advantage of sparsity in their input operand streams. When used to compose a hardware accelerator for deep learning, TensorDash can speed up the training process while also increasing energy efficiency. It combines a low-cost, sparse input-operand interconnect, comprising an 8-input multiplexer per multiplier input, with an area-efficient hardware scheduler. While the interconnect allows only a very limited set of movements per operand, the scheduler can effectively extract sparsity when it is present in the activations, weights or gradients of neural networks. Over a wide set of models covering various applications, TensorDash accelerates the training process by 1.95× while being 1.89× more energy efficient (1.6× when taking on-chip and off-chip memory accesses into account). Although TensorDash works with any datatype, we demonstrate it with both single-precision floating-point and bfloat16 units.
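    To make the sparsity argument concrete, the following toy model (an assumption-laden behavioural sketch, not the TensorDash RTL) counts MAC cycles when zero-valued products can be skipped; the restricted 8-input-mux operand movement is approximated here by allowing packing only within small fixed groups:

```python
import numpy as np

def effective_cycles(a, b, lanes=16, group=8):
    """a, b: equal-length operand streams feeding `lanes` parallel MAC units."""
    assert a.shape == b.shape
    useful = (a != 0) & (b != 0)                 # products that actually contribute
    dense_cycles = int(np.ceil(useful.size / lanes))
    sparse_cycles = 0
    step = group * lanes                         # bounded scheduler window
    for start in range(0, useful.size, step):
        nnz = int(useful[start:start + step].sum())
        sparse_cycles += max(1, int(np.ceil(nnz / lanes)))
    return dense_cycles, sparse_cycles

rng = np.random.default_rng(1)
n = 1 << 14
a = rng.random(n) * (rng.random(n) > 0.5)        # ~50% zero activations
b = rng.random(n)                                # dense weights
dense, sparse = effective_cycles(a, b)
print(f"idealised cycle reduction: {dense / sparse:.2f}x")
```

    The real hardware's gains are lower than this idealised packing bound because the interconnect only permits a limited set of operand movements per lane.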

    A Survey of Software-Defined Networks-on-Chip: Motivations, Challenges and Opportunities

    Current computing platforms encourage the integration of thousands of processing cores, and their interconnections, into a single chip. Mobile smartphones, IoT, embedded devices, desktops, and data centers use many-core Systems-on-Chip (SoCs) to exploit their compute power and parallelism to meet dynamic workload requirements. Networks-on-Chip (NoCs) provide scalable connectivity for diverse applications with distinct traffic patterns and data dependencies. However, when the system executes various applications on traditional NoCs, which are optimized and fixed at synthesis time, the mismatch between the interconnect and the requirements of the different applications limits performance. In the literature, NoC designs have embraced the Software-Defined Networking (SDN) strategy to evolve into an adaptable interconnection solution for future chips. However, the works surveyed implement only a partial Software-Defined Network-on-Chip (SDNoC) approach, leaving aside the SDN layered architecture that brings interoperability to conventional networking. This paper explores the SDNoC literature and classifies it according to the SDN features that each work presents. We then describe the challenges and opportunities detected in the literature survey. Moreover, we explain the motivation for an SDNoC approach and expose both the SDN and SDNoC concepts and architectures. We observe that works in the literature employ an incomplete layered SDNoC approach, which leaves several fertile areas in the SDNoC architecture where researchers may contribute to many-core SoC designs.
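    The layered separation the survey argues for can be pictured with a minimal sketch (hypothetical names, not any surveyed design): routers hold only flow tables (data plane), while a central controller with a global view of the mesh computes and installs routes (control plane):

```python
class Router:
    """Data plane: a dumb flow table mapping destination id -> next router id."""
    def __init__(self, rid):
        self.rid = rid
        self.flow_table = {}

    def forward(self, dst, controller):
        if dst not in self.flow_table:           # table miss: ask the controller
            controller.install_path(self.rid, dst)
        return self.flow_table[dst]

class Controller:
    """Control plane: global view of a width x height mesh, installs XY routes."""
    def __init__(self, width, height):
        self.w = width
        self.routers = {i: Router(i) for i in range(width * height)}

    def install_path(self, src, dst):
        dx, dy = dst % self.w, dst // self.w
        cur = src
        while cur != dst:
            cx, cy = cur % self.w, cur // self.w
            if cx != dx:                         # route in X first, then in Y
                nxt = cur + (1 if dx > cx else -1)
            else:
                nxt = cur + (self.w if dy > cy else -self.w)
            self.routers[cur].flow_table[dst] = nxt
            cur = nxt

ctrl = Controller(4, 4)
first_hop = ctrl.routers[0].forward(dst=15, controller=ctrl)
print(first_hop)                                 # 1: router 0 sends east first
```

    In a full SDNoC stack this controller would sit behind a standard southbound interface, which is exactly the layer the survey finds missing in most proposals.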

    Neuromorphic Optical Flow and Real-time Implementation with Event Cameras

    Optical flow provides information on relative motion, an important component in many computer vision pipelines. Neural networks provide high-accuracy optical flow, yet their complexity is often prohibitive for application at the edge or in robots, where efficiency and latency play a crucial role. To address this challenge, we build on the latest developments in event-based vision and spiking neural networks. We propose a new network architecture, inspired by Timelens, that improves the state-of-the-art self-supervised optical-flow accuracy when operated in both spiking and non-spiking mode. To implement a real-time pipeline with a physical event camera, we propose a methodology for principled model simplification based on activity and latency analysis. We demonstrate high-speed optical-flow prediction with almost two orders of magnitude reduced complexity while maintaining accuracy, opening the path to real-time deployments. (Accepted for the IEEE CVPR Workshops, Vancouver, 2023.)
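    As a concrete example of the event-based input side of such a pipeline (one common representation, assumed here for illustration; the paper's exact preprocessing may differ), events (x, y, t, polarity) can be accumulated into a voxel grid that a spiking or conventional network then consumes:

```python
import numpy as np

def events_to_voxel_grid(x, y, t, p, bins, height, width):
    """x, y: pixel coordinates; t: timestamps; p: polarity in {-1, +1}."""
    grid = np.zeros((bins, height, width), dtype=np.float32)
    t = (t - t[0]) / max(t[-1] - t[0], 1e-9)     # normalise timestamps to [0, 1]
    tb = t * (bins - 1)
    lo = np.floor(tb).astype(int)
    frac = tb - lo
    # Bilinear vote of each event into its two neighbouring time bins.
    np.add.at(grid, (lo, y, x), p * (1.0 - frac))
    np.add.at(grid, (np.minimum(lo + 1, bins - 1), y, x), p * frac)
    return grid

# Tiny synthetic burst of events sweeping left-to-right across a 32x32 sensor.
n = 500
rng = np.random.default_rng(0)
t = np.sort(rng.random(n))
x = (t * 31).astype(int)
y = rng.integers(0, 32, n)
p = rng.choice([-1, 1], n)
voxels = events_to_voxel_grid(x, y, t, p, bins=5, height=32, width=32)
print(voxels.shape)                              # (5, 32, 32)
```

    Activity analysis of the kind the paper describes would then look at how densely such tensors (and the downstream layers) are actually exercised when deciding what to simplify.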

    Revisiting the high-performance reconfigurable computing for future datacenters

    Modern datacenters are reinforcing their computational power and energy efficiency by assimilating field-programmable gate arrays (FPGAs). The sustainability of this large-scale integration depends on enabling multi-tenant FPGAs. This requirement amplifies the importance of the communication architecture and the virtualization method, which must provide the features needed to meet this high-end objective. Consequently, over the last decade, academia and industry have proposed several virtualization techniques and hardware architectures addressing resource management, scheduling, adoptability, segregation, scalability, performance overhead, availability, programmability, time-to-market, security and, above all, multi-tenancy. This paper provides an extensive survey covering three important aspects: a discussion of non-standard terms used in the existing literature, network-on-chip evaluation choices as a means to explore the communication architecture, and virtualization methods under the latest classification. The purpose is to emphasize the importance of choosing an appropriate communication architecture, virtualization technique and standard language to evolve multi-tenant FPGAs in datacenters. None of the previous surveys encapsulates these aspects in a single work. Open problems are indicated for the scientific community as well.

    Evaluation of Clustering Algorithms on GPU-Based Edge Computing Platforms

    Internet of Things (IoT) is becoming a new socioeconomic revolution in which data and immediacy are the main ingredients. IoT generates large datasets on a daily basis, but these are currently considered "dark data", i.e., data that are generated but never analyzed. Efficient analysis of this data is mandatory to create intelligent applications for the next generation of IoT that benefit society. Artificial Intelligence (AI) techniques are well suited to identifying hidden patterns and correlations in this data deluge. In particular, clustering algorithms are of the utmost importance for exploratory data analysis, identifying sets (a.k.a. clusters) of similar objects. Clustering algorithms are computationally heavy workloads and are usually executed on high-performance computing clusters, especially for large datasets. Execution on HPC infrastructures, however, is an energy-hungry procedure with additional issues such as high-latency communication and privacy. Edge computing is a recently proposed paradigm that enables lightweight computation at the edge of the network to address these issues. In this paper, we provide an in-depth analysis of emerging edge-computing architectures that include low-power Graphics Processing Units (GPUs) to speed up these workloads. Our analysis includes performance and power-consumption figures for the latest Nvidia AGX Xavier, comparing the energy-performance ratio of these low-cost platforms with a high-performance cloud-based counterpart. Three clustering algorithms (k-means, Fuzzy Minimals (FM) and Fuzzy C-Means (FCM)) are designed to be optimally executed on the edge and cloud platforms, showing a speed-up factor of up to 11x for the GPU code compared to sequential counterparts on the edge platforms, and energy savings of up to 150% between the edge-computing and HPC platforms.
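    For reference, the data-parallel kernel that dominates k-means, the simplest of the three algorithms evaluated, looks as follows (a CPU NumPy sketch of the computation pattern that GPU versions typically offload, not the paper's CUDA code):

```python
import numpy as np

def kmeans_step(X, centroids):
    """One Lloyd iteration: X is (n, d), centroids is (k, d)."""
    # (n, k) matrix of squared distances: every point against every centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    new_centroids = np.vstack([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(centroids.shape[0])
    ])
    return labels, new_centroids

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
C = X[rng.choice(len(X), 4, replace=False)]      # random initial centroids
for _ in range(20):
    labels, C = kmeans_step(X, C)
print(np.bincount(labels))                       # points per cluster
```

    The fuzzy variants replace the hard argmin assignment with a full membership matrix, which is why their GPU kernels are heavier and their energy-performance trade-offs differ across platforms.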

    Balancing Static Islands in Dynamically Scheduled Circuits using Continuous Petri Nets

    High-level synthesis (HLS) tools automatically transform a high-level program, for example in C/C++, into a low-level hardware description. A key challenge in HLS is scheduling, i.e., determining the start time of all operations in the untimed program. A major shortcoming of existing approaches to scheduling, whether static (start times determined at compile time), dynamic (start times determined at run time), or a hybrid of both, is that static analysis cannot efficiently explore run-time hardware behaviours. Existing approaches either assume the timing behaviour of extreme cases, which can cause sub-optimal performance or larger area, or use simulation-based approaches, which take a long time to explore enough program traces. In this article, we propose an efficient probabilistic-analysis approach that lets HLS tools explore the timing behaviour of scheduled hardware. We capture the performance of the hardware using timed continuous Petri nets with immediate transitions, allowing us to leverage efficient Petri-net analysis tools for making HLS decisions. We demonstrate the utility of our approach by using it to automatically estimate hardware throughput when balancing statically scheduled components (also known as static islands) computing within a dynamically scheduled circuit. Over a set of benchmarks, we show that our approach on average incurs only a 2% overhead in area-delay product compared to optimal designs found by exhaustive search.
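    The article's analysis relies on timed continuous Petri nets with immediate transitions; as a much simpler illustration of why Petri-net structure bounds throughput (an assumed toy model, not the authors' tool), the steady-state throughput of a timed marked graph is limited by the worst cycle ratio of tokens to delay:

```python
# Places are edges between transitions: (src_transition, dst_transition, initial_tokens).
# For a timed marked graph, throughput <= min over directed cycles of
# (tokens in the cycle) / (sum of transition delays around the cycle).

def cycle_throughput_bound(delays, places):
    """delays[i]: latency of transition i; places: list of (src, dst, tokens)."""
    n = len(delays)
    adj = [[] for _ in range(n)]
    for u, v, tok in places:
        adj[u].append((v, tok))

    best = float("inf")

    def dfs(start, node, tokens, delay, visited):
        nonlocal best
        for nxt, tok in adj[node]:
            if nxt == start:                     # closed a simple cycle
                best = min(best, (tokens + tok) / delay)
            elif nxt > start and nxt not in visited:
                dfs(start, nxt, tokens + tok, delay + delays[nxt], visited | {nxt})

    for s in range(n):                           # each cycle counted from its minimal node
        dfs(s, s, 0, delays[s], {s})
    return best

# Two "islands" in a feedback loop: a fast one (2 cycles) feeding a slow one
# (5 cycles) through a 3-deep buffer, plus self-loop places modelling non-reentrance.
delays = [2, 5]
places = [(0, 1, 0), (1, 0, 3), (0, 0, 1), (1, 1, 1)]
print(cycle_throughput_bound(delays, places))    # 0.2: limited by the slow island
```

    The example shows the balancing intuition: once the buffer cycle is no longer the bottleneck, throughput is pinned by the slowest island, so adding more buffering (or speeding up the fast island) buys nothing.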