22 research outputs found

    A matrix-multiply unit for posits in reconfigurable logic leveraging (Open)CAPI

    In this paper, we present the design in reconfigurable logic of a matrix multiplier for matrices of 32-bit posit numbers with es=2 [1]. Vector dot products are computed without intermediate rounding, as suggested by the proposed posit standard, to maximally retain precision. An initial implementation targets the CAPI 1.0 interface on the POWER8 processor and achieves about 10 Gpops (giga posit operations per second). Follow-on implementations targeting CAPI 2.0 and OpenCAPI 3.0 on POWER9 are expected to achieve up to 64 Gpops. Our design is available under a permissive open source license at https://github.com/ChenJianyunp/Unum_matrix_multiplier. We hope this work targeting CAPI 1.0, along with future community contributions, will help enable a more extensive exploration of this proposed new format.
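
    The effect of accumulating without intermediate rounding can be illustrated in software. Below is a minimal sketch (not the paper's hardware design): it contrasts per-step rounding with exact accumulation rounded once at the end, which is the behaviour the posit quire provides; NumPy's float16 stands in for a low-precision format since no posit library is assumed.

```python
# Toy illustration of fused (un-rounded) accumulation, as used by the
# quire-based dot product above. float16 is a stand-in low-precision
# format; this is a conceptual sketch, not the paper's RTL.
import numpy as np

x = np.full(4096, 0.1, dtype=np.float16)
y = np.full(4096, 0.1, dtype=np.float16)

# Rounding after every multiply-accumulate (a naive unit):
acc = np.float16(0.0)
for a, b in zip(x, y):
    acc = np.float16(acc + np.float16(a * b))

# Accumulating exactly in a wide type and rounding once at the end
# (the quire behaviour):
fused = np.float16((x.astype(np.float64) * y.astype(np.float64)).sum())

print(acc, fused)  # the step-rounded result drifts away from ~40.96
```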

    Benchmarking Apache Arrow Flight - A wire-speed protocol for data transfer, querying and microservices

    Moving structured data between different big data frameworks and/or data warehouses/storage systems often causes significant overhead. In most cases, more than 80% of the total time spent accessing data is consumed by the serialization/deserialization step. Columnar data formats are gaining popularity in both analytics and transactional databases. Apache Arrow, a unified columnar in-memory data format, promises to provide efficient data storage, access, manipulation and transport. In addition, with the introduction of Arrow Flight, a communication framework built on top of gRPC, Arrow enables high-performance data transfer over TCP networks. Arrow Flight allows parallel Arrow RecordBatch transfer over networks in a platform- and language-independent way, and offers high performance, parallelism and security based on open-source standards. In this paper, we bring together some recently implemented use cases of Arrow Flight with their benchmarking results. These use cases include bulk Arrow data transfer, querying subsystems, and Flight as a microservice integrated into different frameworks, to show the throughput and scalability of the protocol. We show that Flight is able to achieve up to 6000 MB/s and 4800 MB/s throughput for DoGet() and DoPut() operations, respectively. On nodes with Mellanox ConnectX-3 or Connect-IB interconnects, Flight can utilize up to 95% of the total available bandwidth. Flight is scalable and can efficiently use up to half of the available system cores for bidirectional communication. For query systems like Dremio, Flight is an order of magnitude faster than the ODBC and turbodbc protocols: the Arrow Flight-based implementation on Dremio performs 20x and 30x better than turbodbc and ODBC connections, respectively. We briefly outline some recent Flight-based use cases, both in big data frameworks like Apache Spark and Dask and in remote Arrow data processing tools. We also discuss some limitations and the future outlook of Apache Arrow and Arrow Flight as a whole.
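
    For concreteness, here is a minimal pyarrow.flight client exercising the two calls benchmarked above; the endpoint grpc://localhost:8815 and the dataset name are illustrative placeholders, not values from the paper.

```python
# Minimal Arrow Flight client sketch: DoGet streams RecordBatches from a
# server, DoPut streams a local table to it. Endpoint and names assumed.
import pyarrow as pa
import pyarrow.flight as fl

client = fl.FlightClient("grpc://localhost:8815")

# DoGet: fetch all record batches for a ticket and assemble a Table.
reader = client.do_get(fl.Ticket(b"example-dataset"))
table = reader.read_all()

# DoPut: upload a local table under a flight descriptor.
table_out = pa.table({"id": [1, 2, 3], "val": [0.1, 0.2, 0.3]})
writer, _ = client.do_put(fl.FlightDescriptor.for_path("example-dataset"),
                          table_out.schema)
writer.write_table(table_out)
writer.close()
```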

    VC@Scale: Scalable and high-performance variant calling on cluster environments

    Background: Recently, many new deep learning–based variant-calling methods such as DeepVariant have emerged that are more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and FreeBayes, albeit at higher computational cost. There is therefore a need for more scalable, higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark merely to distribute and schedule data among loosely coupled applications, or using I/O-based storage for the output of intermediate applications, does not exploit the full benefit of Apache Spark's in-memory processing. To do so, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between workflow stages. This benefits from the ease of programmability of Python and the high efficiency of Arrow's columnar in-memory data transformations.
    Results: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by more than 2x for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant on both CPU-only and CPU + GPU clusters.
    Conclusions: We show the feasibility and easy scalability of our approach for achieving high performance and efficient resource utilization in variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations. All code, scripts, and configurations used to run our implementations are publicly available and open source; see https://github.com/abs-tudelft/variant-calling-at-scale.
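
    As a sketch of what "Spark built-in functions" for these stages can look like (the schema and the duplicate criterion below are simplified assumptions, not the paper's exact implementation):

```python
# Hedged sketch: coordinate sort and duplicate marking with native Spark
# DataFrame operations. Column names are illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("vc-at-scale-sketch").getOrCreate()
reads = spark.read.parquet("aligned_reads.parquet")  # placeholder input

# Distributed coordinate sort, as required before variant calling.
sorted_reads = reads.orderBy("contig", "position")

# Mark (not drop) duplicates: flag all but the highest-quality read
# sharing the same (contig, position, strand) key.
w = Window.partitionBy("contig", "position", "strand").orderBy(F.desc("mapq"))
marked = sorted_reads.withColumn("is_duplicate", F.row_number().over(w) > 1)
marked.write.parquet("marked_reads.parquet")  # feeds the calling stage
```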

    NASB: Neural Architecture Search for Binary Convolutional Neural Networks

    Binary Convolutional Neural Networks (CNNs) significantly reduce the number of arithmetic operations and the size of memory storage needed for CNNs, which makes their deployment on mobile and embedded systems more feasible. However, after binarization, the CNN architecture has to be redesigned and refined significantly, for two reasons: 1. the large accumulation error of binarization in the forward propagation, and 2. the severe gradient mismatch problem of binarization in the backward propagation. Even though substantial effort has been invested in designing architectures for single and multiple binary CNNs, it is still difficult to find an optimized architecture for binary CNNs. In this paper, we propose a strategy, named NASB, which adapts Neural Architecture Search (NAS) to find an optimized architecture for the binarization of CNNs. In the NASB strategy, the operations and their connections define a unique search space, and the training and binarization of the network proceed in a three-stage training algorithm. Due to the flexibility of this automated strategy, the obtained architecture is not only suitable for binarization but also has low overhead, achieving a better trade-off between accuracy and computational complexity compared to hand-optimized binary CNNs. The NASB strategy is evaluated on the ImageNet dataset and demonstrated to be a better solution compared to existing quantized CNNs. With an insignificant overhead increase, NASB outperforms existing single and multiple binary CNNs by up to 4.0% and 1.0% Top-1 accuracy, respectively, bringing them closer to the accuracy of their full-precision counterparts.
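
    The gradient-mismatch problem mentioned above is conventionally worked around with a straight-through estimator (STE). The sketch below shows that generic background technique in PyTorch; it is not NASB's specific formulation.

```python
# Sign binarization with a straight-through estimator (STE): the forward
# pass binarizes, while the backward pass lets gradients through inside
# [-1, 1]. Generic background, not the paper's exact training scheme.
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass the gradient through unchanged where |x| <= 1, zero elsewhere.
        return grad_out * (x.abs() <= 1).float()

w = torch.randn(8, requires_grad=True)
BinarizeSTE.apply(w).sum().backward()
print(w.grad)  # nonzero only where |w| <= 1
```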

    ReAF: Reducing approximation of channels by reducing feature reuse within convolution

    High-level feature maps of Convolutional Neural Networks are computed by reusing their corresponding low-level feature maps, which fully exploits feature reuse to improve computational efficiency. This form of feature reuse is referred to as feature reuse between convolutional layers. A second type is feature reuse within the convolution, where the channels of the output feature maps are computed by reusing the same channels of the input feature maps, resulting in an approximation of the output channels. To compute them accurately, we need specialized input feature maps for every channel of the output feature maps. In this paper, we first discuss the approximation problem introduced by full feature reuse within the convolution and then propose a new feature reuse scheme called Reducing Approximation of channels by Reducing Feature reuse (REAF). We also show that group convolution is a special case of our REAF scheme and analyze the advantage of REAF over such group convolution. Moreover, we develop the REAF+ scheme and integrate it with group convolution-based models. Compared with baselines, experiments on image classification demonstrate the effectiveness of our REAF and REAF+ schemes. Under a given computational complexity budget, the Top-1 accuracy of REAF-ResNet50 and REAF+-MobileNetV2 on ImageNet increases by 0.37% and 0.69%, respectively. The code and pre-trained models will be made publicly available.
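
    To make the group-convolution connection concrete: with groups=1 every output channel reuses all input channels (full feature reuse within the convolution), while groups>1 restricts which input channels each output channel may reuse; REAF generalizes this partitioning. A small PyTorch illustration (not the paper's code):

```python
# groups=1 gives full feature reuse within the convolution; groups=4
# lets each output channel see only a quarter of the input channels,
# cutting parameters and computation accordingly.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

full_reuse = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=1)
grouped    = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=4)

print(sum(p.numel() for p in full_reuse.parameters()))  # 73856
print(sum(p.numel() for p in grouped.parameters()))     # 18560
print(full_reuse(x).shape, grouped(x).shape)            # same output shape
```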

    Communication-Efficient Cluster Scalable Genomics Data Processing Using Apache Arrow Flight

    Current cluster-scaled genomics data processing solutions rely on big data frameworks like Apache Spark, Hadoop and HDFS for data scheduling, processing and storage. These frameworks come with additional computation and memory overheads by default, and it has been observed that scaling genomics dataset processing beyond 32 nodes is not efficient on them. To overcome these inefficiencies when processing genomics data on clusters, we introduce a low-overhead and highly scalable solution on a SLURM-based HPC batch system. This solution uses Apache Arrow as an in-memory columnar data format to store genomics data efficiently, and Arrow Flight as a network protocol to move and schedule this data across the HPC nodes with low communication overhead. As a use case, we use NGS short-read DNA sequencing data for pre-processing and variant-calling applications. This solution outperforms existing Apache Spark-based big data solutions in both computation time (2x faster) and communication overhead (20-60% lower, depending on cluster size). Our solution has similar performance to MPI-based HPC solutions, with the added advantages of easy programmability and transparent big data scalability. The whole solution is Python and shell script based, which makes it flexible to update and to integrate alternative variant callers. Our solution is publicly available on GitHub at https://github.com/abs-tudelft/time-to-fly-high/tree/main/genomics
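
    The per-node data-serving pattern described above can be sketched with a minimal pyarrow.flight server; the schema, port, and partition naming below are illustrative assumptions, not the paper's implementation.

```python
# Minimal Arrow Flight server sketch: each HPC node serves its partition
# of Arrow record batches to peers, avoiding a heavyweight big data stack.
import pyarrow as pa
import pyarrow.flight as fl

class NodeDataServer(fl.FlightServerBase):
    def __init__(self, location, tables):
        super().__init__(location)
        self.tables = tables  # partition name -> pyarrow.Table

    def do_get(self, context, ticket):
        # Stream the requested partition back to the caller.
        return fl.RecordBatchStream(self.tables[ticket.ticket.decode()])

if __name__ == "__main__":
    part = pa.table({"read_name": ["r1"], "contig": ["chr1"], "pos": [12345]})
    NodeDataServer("grpc://0.0.0.0:8815", {"chunk-0": part}).serve()
```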

    Tydi-Chisel: Collaborative and Interface-Driven Data-Streaming Accelerators

    In spite of progress in hardware design languages, the design of high-performance hardware accelerators forces many design decisions that specialize the interfaces of these accelerators in ways that complicate the understanding of the design and hinder modularity and collaboration. In response to this challenge, Tydi was presented as an open specification for streaming dataflow designs in digital circuits, allowing designers to express how composite and variable-length data structures are transferred over streams using clear, data-centric types. Chisel, with its high level of abstraction and customizability, offers a suitable platform to implement Tydi-based components. In this paper, Tydi-Chisel is presented along with an A-to-Z description of the design process. Tydi-Chisel aims to simplify the design of data-streaming accelerators through the integration of the Tydi interface standard in Chisel, along with helper components and syntactic sugar. In combination, Chisel and Tydi help bridge the hardware-software divide, making both solo design and collaboration between designers easier. Project repository: https://github.com/ccromjongh/Tydi-Chisel
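
    To convey the core Tydi idea in pseudo-form: a composite, variable-length value is carried as a stream of elements plus "last" flags that close each nesting dimension. The Python model below is purely conceptual; the actual components are written in Chisel.

```python
# Conceptual model of a Tydi-style stream: each transfer carries one
# element plus last-flags, one per nesting dimension (inner = end of a
# string, outer = end of the list of strings). Purely illustrative.
from dataclasses import dataclass

@dataclass
class Transfer:
    data: str                 # one element (here, a character)
    last: tuple               # (end of inner string, end of outer list)

def to_stream(strings):
    out = []
    for i, s in enumerate(strings):
        for j, ch in enumerate(s):
            inner_last = j == len(s) - 1
            outer_last = inner_last and i == len(strings) - 1
            out.append(Transfer(ch, (inner_last, outer_last)))
    return out

for t in to_stream(["hi", "tydi"]):
    print(t)
```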

    An Attention Module for Convolutional Neural Networks

    The attention mechanism has been regarded as an advanced technique for capturing long-range feature interactions and boosting the representation capability of convolutional neural networks. However, we found two overlooked problems in current attentional activations-based models: the approximation problem and the insufficient capacity problem of the attention maps. To solve both problems together, we propose an attention module for convolutional neural networks by developing an AW-convolution, where the shape of the attention maps matches that of the weights rather than the activations. Our proposed attention module is complementary to previous attention-based schemes, such as those that apply the attention mechanism to explore the relationship between channel-wise and spatial features. Experiments on several datasets for image classification and object detection tasks show the effectiveness of our proposed attention module. In particular, it achieves a 1.00% Top-1 accuracy improvement on ImageNet classification over a ResNet101 baseline, and a 0.63 COCO-style Average Precision improvement on COCO object detection on top of a Faster R-CNN baseline with a ResNet101-FPN backbone. When integrated with previous attentional activations-based models, our attention module can further increase their Top-1 accuracy on ImageNet classification by up to 0.57% and their COCO-style Average Precision on COCO object detection by up to 0.45. Code and pre-trained models will be made publicly available.
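
    Below is a hedged sketch of the stated idea, attention maps shaped like the weights rather than the activations; the attention-generation network is collapsed into a single learned tensor here for brevity, which is an assumption, not the paper's architecture.

```python
# AW-convolution sketch: an attention tensor with the same shape as the
# convolution weights modulates those weights before the convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AWConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        # Weight-shaped attention; a full implementation would compute
        # this from the input instead of learning it directly.
        self.attention = nn.Parameter(torch.ones(out_ch, in_ch, k, k))

    def forward(self, x):
        return F.conv2d(x, self.weight * torch.sigmoid(self.attention),
                        padding=1)

y = AWConv2d(16, 32, 3)(torch.randn(2, 16, 8, 8))
print(y.shape)  # torch.Size([2, 32, 8, 8])
```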

    Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework

    Background: Immense improvements in sequencing technologies enable the production of large amounts of high-throughput, cost-effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for further downstream analyses, and computing systems need it close to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access, and processing this data incurs huge disk I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. Recent developments in storage-class and non-volatile memory technologies, however, have enabled computing systems to place large data sets in memory and process them directly from memory, avoiding disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, properly formatted data placement in memory and high-throughput access to it are necessary, avoiding (de)serialization and copy overheads between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in memory, without accessing disk storage and without (de)serialization and copy overheads.
    Implementation: We integrate an Apache Arrow in-memory based Sequence Alignment/Map (SAM) format and its shared-memory object store library into widely used genomics high-throughput data processing applications, namely BWA-MEM, Picard and GATK, to allow in-memory communication between these applications. In addition, this allows us to exploit the cache locality of tabular data and parallel processing capabilities through shared-memory objects.
    Results: Our implementation shows that adopting the in-memory SAM representation in genomics high-throughput data processing applications results in better system resource utilization, fewer memory accesses due to high cache locality, and parallel scalability through shared-memory objects. Our implementation focuses on the GATK best-practices workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare a number of existing in-memory data placement and sharing techniques, such as ramDisk and Unix pipes, and show that the columnar in-memory data representation outperforms both. We achieve speedups of 4.85x and 4.76x for WGS and WES data, respectively, in the overall execution time of variant-calling workflows, and speedups of 1.45x and 1.27x, respectively, compared to the second-fastest workflow. In some individual tools, particularly sorting, duplicate removal and base quality score recalibration, the speedup is even more promising.
    Availability: The code and scripts used in our experiments are available in both container and repository form at https://github.com/abs-tudelft/ArrowSAM.
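
    The copy-free, in-memory sharing pattern this relies on can be sketched with pyarrow's IPC and memory mapping; the column names below are illustrative, not the exact ArrowSAM schema, and ArrowSAM itself uses a shared-memory object store rather than a /dev/shm file.

```python
# One process writes SAM-like records as an Arrow table into memory-backed
# storage; another memory-maps it and reads with zero-copy, with no
# (de)serialization between the two.
import pyarrow as pa

sam = pa.table({
    "qname": ["r001", "r002"],
    "flag": pa.array([99, 147], type=pa.int32()),
    "pos": pa.array([7, 37], type=pa.int32()),
})

# Producer: write once into a memory-backed file.
with pa.OSFile("/dev/shm/sam.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, sam.schema) as writer:
        writer.write_table(sam)

# Consumer (possibly another process or language): zero-copy memory map.
with pa.memory_map("/dev/shm/sam.arrow", "rb") as source:
    shared = pa.ipc.open_file(source).read_all()
print(shared.num_rows)
```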

    An Accelerator for Posit Arithmetic Targeting Posit Level 1 BLAS Routines and Pair-HMM

    The newly proposed posit number format uses a significantly different approach to represent floating-point numbers. This paper introduces a framework for posit arithmetic in reconfigurable logic that maintains full precision in intermediate results. We present the design and implementation of an L1 BLAS arithmetic accelerator on posit vectors leveraging Apache Arrow. For a vector dot product with an input vector length of 10^6 elements, a hardware speedup of approximately 10^4 is achieved compared to posit software emulation. For 32-bit numbers, the decimal accuracy of the posit dot product results improves by one decimal on average compared to a software implementation, and by two extra decimals compared to the IEEE 754 format. We also present a posit-based implementation of pair-HMM. In this case, the hardware speedup over a posit-based software implementation ranges from 10^5 to 10^6. With appropriate initial scaling constants, accuracy improves over an implementation based on IEEE 754.
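
    The "decimals of accuracy" figure quoted above is commonly computed, following the posit literature, as -log10(|log10(computed/exact)|); the small helper below sketches that metric (an assumption about the exact definition used in the paper).

```python
# Decimal accuracy between a computed value and a reference, as commonly
# defined in the posit literature: larger is better, infinity if exact.
import math

def decimal_accuracy(computed: float, exact: float) -> float:
    """Approximate number of matching decimal digits."""
    if computed == exact:
        return math.inf
    return -math.log10(abs(math.log10(computed / exact)))

print(decimal_accuracy(3.14159, math.pi))  # ~6.4 decimals
```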