    The End of Slow Networks: It's Time for a Redesign

    Next generation high-performance RDMA-capable networks will require a fundamental rethinking of the design and architecture of modern distributed DBMSs. These systems are commonly designed and optimized under the assumption that the network is the bottleneck: the network is slow and "thin", and thus needs to be avoided as much as possible. Yet this assumption no longer holds true. With InfiniBand FDR 4x, the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel, and it increases even further with the most recent EDR standard. Moreover, with the ongoing advances in RDMA, latency is improving just as quickly. In this paper, we first argue that the "old" distributed database design is not capable of taking full advantage of the network. Second, we propose architectural redesigns for OLTP, OLAP and advanced analytical frameworks to take better advantage of the improved bandwidth, latency and RDMA capabilities. Finally, for each of the workload categories, we show that remarkable performance improvements can be achieved.
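
    The bandwidth claim is easy to check with back-of-the-envelope arithmetic. A minimal sketch, assuming standard published peak rates for FDR 4x, EDR 4x and one DDR3-1600 memory channel (these figures are ours, not the paper's):

```python
# Back-of-the-envelope comparison of network vs. memory bandwidth.
# Assumed peak figures (not taken from the paper):
#   InfiniBand FDR 4x: 14 Gb/s per lane * 4 lanes = 56 Gb/s signalling,
#                      ~54.3 Gb/s usable after 64b/66b encoding.
#   InfiniBand EDR 4x: 25 Gb/s per lane * 4 lanes = 100 Gb/s signalling.
#   One DDR3-1600 memory channel: 1600 MT/s * 8 bytes = 12.8 GB/s.

fdr_gbps = 4 * 14 * (64 / 66)        # usable Gb/s for FDR 4x
edr_gbps = 4 * 25 * (64 / 66)        # usable Gb/s for EDR 4x
ddr3_channel_gbs = 1600e6 * 8 / 1e9  # GB/s for one DDR3-1600 channel

print(f"FDR 4x       : {fdr_gbps / 8:5.1f} GB/s")   # ~6.8 GB/s
print(f"EDR 4x       : {edr_gbps / 8:5.1f} GB/s")   # ~12.1 GB/s
print(f"DDR3 channel : {ddr3_channel_gbs:5.1f} GB/s")
```

    With EDR, a single 4x link roughly matches one memory channel, which is exactly the paper's point: the network can no longer be dismissed as orders of magnitude slower than local memory.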

    A Rapid Prototyping Environment for Wireless Communication Embedded Systems

    This paper introduces a rapid prototyping methodology which overcomes important barriers in the design and implementation of digital signal processing (DSP) algorithms and systems on embedded hardware platforms, such as cellular phones. This paper describes rapid prototyping in terms of a simulation/prototype bridge and in terms of appropriate language design. The simulation/prototype bridge combines the strengths of simulation and of prototyping, allowing the designer to develop and evaluate next-generation communications systems, partly in simulation on a host computer and partly as a prototype on embedded hardware. Appropriate language design allows designers to express a communications system as a block diagram, in which each block represents an algorithm specified by a set of equations. Software tools developed for this paper implement both concepts, and have been successfully used in the development of a next-generation code division multiple access (CDMA) cellular wireless communications system. Funding: Nokia, Texas Instruments, The Texas Advanced Technology Program, and the National Science Foundation.
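
    To make the block-diagram idea concrete, here is a minimal sketch of blocks specified by equations and chained into a diagram; the class names and the trivial filter/quantizer chain are invented for illustration and are not the paper's actual tools:

```python
import numpy as np

class Block:
    """One node in a block diagram: an algorithm defined by equations."""
    def process(self, x: np.ndarray) -> np.ndarray:
        raise NotImplementedError

class FIRFilter(Block):
    """y[n] = sum_k h[k] * x[n-k] -- a block specified by one equation."""
    def __init__(self, taps):
        self.taps = np.asarray(taps, dtype=float)
    def process(self, x):
        return np.convolve(x, self.taps, mode="same")

class Quantizer(Block):
    """y[n] = round(x[n] / step) * step -- uniform quantization."""
    def __init__(self, step):
        self.step = step
    def process(self, x):
        return np.round(x / self.step) * self.step

class Diagram:
    """A chain of blocks; running it on a host computer is the
    'simulation' side of the simulation/prototype bridge."""
    def __init__(self, *blocks):
        self.blocks = blocks
    def run(self, x):
        for b in self.blocks:
            x = b.process(x)
        return x

signal = np.sin(2 * np.pi * 0.05 * np.arange(128))
chain = Diagram(FIRFilter([0.25, 0.5, 0.25]), Quantizer(step=0.1))
out = chain.run(signal)
```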

    Geometry-Oblivious FMM for Compressing Dense SPD Matrices

    We present GOFMM (geometry-oblivious FMM), a novel method that creates a hierarchical low-rank approximation, "compression," of an arbitrary dense symmetric positive definite (SPD) matrix. For many applications, GOFMM enables an approximate matrix-vector multiplication in $N \log N$ or even $N$ time, where $N$ is the matrix size. Compression requires $N \log N$ storage and work. In general, our scheme belongs to the family of hierarchical matrix approximation methods. In particular, it generalizes the fast multipole method (FMM) to a purely algebraic setting by only requiring the ability to sample matrix entries. Neither geometric information (i.e., point coordinates) nor knowledge of how the matrix entries have been generated is required, thus the term "geometry-oblivious." Also, we introduce a shared-memory parallel scheme for hierarchical matrix computations that reduces synchronization barriers. We present results on the Intel Knights Landing and Haswell architectures, and on the NVIDIA Pascal architecture, for a variety of matrices. Comment: 13 pages, accepted by SC'17.
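
    The "approximate a matrix from sampled entries" idea can be illustrated with a flat (non-hierarchical) Nyström approximation, a much simpler relative of GOFMM; the kernel matrix and sample size below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a dense SPD matrix from a Gaussian kernel (an illustrative
# stand-in; GOFMM never needs to know the matrix came from points).
pts = rng.standard_normal((1000, 3))
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2) + 1e-6 * np.eye(len(pts))  # SPD

# Nyström: sample m columns, then K ~ C @ pinv(W) @ C.T,
# where C = K[:, idx] and W = K[np.ix_(idx, idx)].
m = 100
idx = rng.choice(K.shape[0], size=m, replace=False)
C = K[:, idx]
Winv = np.linalg.pinv(K[np.ix_(idx, idx)])

# Approximate matvec in O(N*m) instead of O(N^2):
x = rng.standard_normal(K.shape[0])
y_fast = C @ (Winv @ (C.T @ x))

err = np.linalg.norm(K @ x - y_fast) / np.linalg.norm(K @ x)
print(f"relative matvec error: {err:.2e}")
```

    GOFMM improves on this flat scheme by organizing the low-rank blocks hierarchically, which is what brings the cost down to $N \log N$ or $N$.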

    Implementation of Genetic Algorithms in FPGA-based Reconfigurable Computing Systems

    Genetic Algorithms (GAs) are used to solve many optimization problems in science and engineering. A GA is a heuristic approach that relies largely on random numbers to determine the approximate solution of an optimization problem. We use the Mersenne Twister Algorithm (MTA) to generate a non-overlapping sequence of random numbers with a period of 2^19937 − 1. The random numbers are generated from a state vector that consists of 624 elements. Our work on state vector generation and the GA implementation targets the solution of a flow-line scheduling problem, where the flow-lines have jobs to process and the goal is to find a suitable completion time for all jobs using a GA. The state vector generation algorithm (MTA) performs poorly on traditional von Neumann architectures due to its poor temporal and spatial locality, so its performance is limited by the speed at which memory can be accessed. With processor performance increasing by approximately 60% per year while memory latency drops by only 7% per year, a new approach is needed for performance improvement. The GA implementation on a general-purpose microprocessor, though it performs reasonably well, also has scope for performance gain in a parallel implementation; such a parallel GA can serve as a kernel for applications that use a GA to reach a solution. Our approach is to implement the state vector generation process and the GA in an FPGA-based Reconfigurable Computing (RC) system with the goal of improving overall performance. Application design for FPGA-based RC systems is not trivial and performance improvement is not guaranteed: designing for RC systems requires algorithmic parallelism in order to exploit the inherent parallelism of the FPGA, and the high-level language we use, while providing a level of abstraction from the lower-level hardware in the RC system, makes it difficult to fully exploit some of the architectural benefits of the FPGA. Considering these factors, we improve the state vector generation process algorithmically. Our implementation generates state vectors 5X faster than the previous implementation on a 2 GHz Intel Xeon microprocessor. The modified algorithm is also implemented in a Xilinx Virtex-4 FPGA, resulting in a 2.4X speedup. Improvement in this preprocessing step accelerates GA application performance, since the random numbers consumed by the genetic operators are generated from these state vectors. We simulate the basic operations of a GA in an FPGA to study its behavior in a parallel environment and analyze the results. The initial FPGA implementation of the GA runs about 7X slower than its microprocessor counterpart; the reasons are explained along with suggestions for improvement and future work.
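
    As a host-side illustration of the kind of GA kernel being accelerated, here is a minimal flow-line scheduling GA sketch; the job data, operators and parameters are invented and stand in for, rather than reproduce, the thesis's FPGA design:

```python
import random

random.seed(19937)  # Python's random module itself uses the Mersenne Twister

# Invented instance: processing times for 8 jobs on 3 machines (flow line).
times = [[random.randint(1, 9) for _ in range(3)] for _ in range(8)]
n_jobs = len(times)

def makespan(order):
    """Completion time of the last job for a given job permutation."""
    finish = [0, 0, 0]  # running finish time on each machine
    for j in order:
        finish[0] += times[j][0]
        for m in range(1, 3):
            finish[m] = max(finish[m], finish[m - 1]) + times[j][m]
    return finish[-1]

def crossover(a, b):
    """Order crossover: keep a slice of parent a, fill the rest from b."""
    i, k = sorted(random.sample(range(n_jobs), 2))
    child = a[i:k]
    child += [j for j in b if j not in child]
    return child

def mutate(order, rate=0.1):
    """Occasionally swap two jobs in the permutation."""
    if random.random() < rate:
        i, k = random.sample(range(n_jobs), 2)
        order[i], order[k] = order[k], order[i]
    return order

pop = [random.sample(range(n_jobs), n_jobs) for _ in range(30)]
for _ in range(100):
    pop.sort(key=makespan)          # fitness = completion time (lower is better)
    elite = pop[:10]
    pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                   for _ in range(20)]

print("best makespan:", makespan(min(pop, key=makespan)))
```

    Note how every genetic operator draws on the random number stream; this is why accelerating the Mersenne Twister state vector generation feeds directly into overall GA throughput.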

    Modeling software product lines using color-blind transition systems


    Thirty Years of Machine Learning: The Road to Pareto-Optimal Wireless Networks

    Future wireless networks have substantial potential to support a broad range of complex and compelling applications in both military and civilian fields, where users can enjoy high-rate, low-latency, low-cost and reliable information services. Achieving this ambitious goal requires new radio techniques for adaptive learning and intelligent decision making, because of the complex heterogeneous nature of the network structures and wireless services. Machine learning (ML) algorithms have achieved great success in supporting big data analytics, efficient parameter estimation and interactive decision making. Hence, in this article, we review the thirty-year history of ML by elaborating on supervised learning, unsupervised learning, reinforcement learning and deep learning. Furthermore, we investigate their employment in compelling applications of wireless networks, including heterogeneous networks (HetNets), cognitive radios (CR), the Internet of Things (IoT), machine-to-machine (M2M) networks, and so on. This article aims to assist readers in clarifying the motivation and methodology of the various ML algorithms, so as to invoke them for hitherto unexplored services as well as scenarios of future wireless networks. Comment: 46 pages, 22 figures.
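
    Of the ML families surveyed, reinforcement learning maps most directly onto "adaptive learning and intelligent decision making" in a radio. A minimal sketch of tabular, single-state Q-learning for channel selection in a cognitive radio; the toy environment and all parameters are invented, not taken from the article:

```python
import random

random.seed(1)

# Invented toy environment: 3 channels whose (unknown to the learner)
# transmission success probabilities differ.
success_prob = [0.2, 0.9, 0.5]
n_channels = len(success_prob)

Q = [0.0] * n_channels     # learned value estimate per channel
alpha, epsilon = 0.1, 0.1  # learning rate, exploration rate

for step in range(5000):
    # epsilon-greedy action selection: mostly exploit, sometimes explore
    if random.random() < epsilon:
        a = random.randrange(n_channels)
    else:
        a = max(range(n_channels), key=Q.__getitem__)
    reward = 1.0 if random.random() < success_prob[a] else 0.0
    # single-state Q-learning update (effectively a bandit, so no
    # bootstrapped next-state term)
    Q[a] += alpha * (reward - Q[a])

print("learned channel values:", [round(q, 2) for q in Q])
print("preferred channel:", max(range(n_channels), key=Q.__getitem__))
```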

    HeTM: Transactional Memory for Heterogeneous Systems

    Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. Unfortunately, though, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, which we name Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages speculative techniques, aiming to hide the inherently large communication latency between CPUs and discrete GPUs and to minimize inter-device synchronization overhead. SHeTM is based on a modular and extensible design that allows for easily integrating alternative TM implementations on the CPU's and GPU's sides, providing the flexibility to adopt, on either side, the TM implementation (e.g., in hardware or software) that best fits the application's workload and the architectural characteristics of the processing unit. We demonstrate the efficiency of SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a port of a popular object caching system. Comment: This work was accepted at the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT'19).
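
    HeTM's transactions span CPU and GPU memories, but the underlying speculate-then-validate pattern can be sketched with a toy single-process software TM; every name here is invented, and the sketch omits the CPU-GPU transfer and batching machinery that SHeTM is actually about:

```python
import threading

class STM:
    """Toy software TM: versioned cells, buffered (speculative) writes,
    and commit-time validation of the read set."""
    def __init__(self):
        self.values = {}    # key -> value
        self.versions = {}  # key -> version counter
        self.lock = threading.Lock()

class Tx:
    def __init__(self, stm):
        self.stm, self.reads, self.writes = stm, {}, {}

    def read(self, key):
        if key in self.writes:               # read-your-own-writes
            return self.writes[key]
        self.reads[key] = self.stm.versions.get(key, 0)
        return self.stm.values.get(key)

    def write(self, key, value):
        self.writes[key] = value             # speculative: buffered locally

    def commit(self):
        with self.stm.lock:
            # validate: abort if anything we read changed since we read it
            for key, ver in self.reads.items():
                if self.stm.versions.get(key, 0) != ver:
                    return False             # conflict -> abort, caller retries
            for key, value in self.writes.items():
                self.stm.values[key] = value
                self.stm.versions[key] = self.stm.versions.get(key, 0) + 1
            return True

stm = STM()

def transfer(src, dst, amount):
    while True:                              # retry loop on abort
        tx = Tx(stm)
        a, b = tx.read(src) or 0, tx.read(dst) or 0
        tx.write(src, a - amount)
        tx.write(dst, b + amount)
        if tx.commit():
            return

transfer("x", "y", 10)
print(stm.values)  # {'x': -10, 'y': 10}
```

    This toy version validates every commit under one global lock; per the abstract, the point of SHeTM's speculative design is precisely to avoid paying that kind of synchronization per transaction across the slow CPU-GPU boundary.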