
    A Scalable Asynchronous Distributed Algorithm for Topic Modeling

    Learning meaningful topic models from massive document collections which contain millions of documents and billions of tokens is challenging for two reasons: first, one needs to deal with a large number of topics (typically on the order of thousands); second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both these problems. In order to handle a large number of topics we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(\log T) time. Moreover, when topic counts change, the data structure can be updated in O(\log T) time. In order to distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of \cite{YunYuHsietal13}. We show that F+Nomad LDA significantly outperforms the state of the art on massive problems which involve millions of documents, billions of words, and thousands of topics.
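
    The sampling structure the abstract describes is a standard Fenwick (binary indexed) tree over per-topic weights: both updating a count and drawing a topic proportional to the weights cost O(log T). The sketch below is an illustrative re-implementation of that generic trick in Python, not the authors' F+ tree code; class and method names are my own.

        import random

        class FenwickSampler:
            def __init__(self, num_topics):
                self.n = num_topics
                self.tree = [0.0] * (num_topics + 1)   # 1-indexed Fenwick tree of weights

            def update(self, topic, delta):
                """Add delta to the weight of `topic` in O(log T)."""
                i = topic + 1
                while i <= self.n:
                    self.tree[i] += delta
                    i += i & (-i)

            def total(self):
                """Sum of all weights in O(log T)."""
                i, s = self.n, 0.0
                while i > 0:
                    s += self.tree[i]
                    i -= i & (-i)
                return s

            def sample(self):
                """Draw a topic with probability proportional to its weight in O(log T)."""
                u = random.random() * self.total()
                idx, bitmask = 0, 1 << self.n.bit_length()
                while bitmask:
                    nxt = idx + bitmask
                    if nxt <= self.n and self.tree[nxt] < u:
                        u -= self.tree[nxt]
                        idx = nxt
                    bitmask >>= 1
                return idx   # 0-based topic id

        sampler = FenwickSampler(num_topics=8)
        for t, w in enumerate([1, 0, 3, 2, 0, 5, 1, 4]):
            sampler.update(t, w)
        print(sampler.sample())

    In a Gibbs sampler for LDA, the per-topic weights change after every token assignment, which is why the O(log T) update matters as much as the O(log T) draw.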

    How GPU Rendering Affects Image Processing and Scientific Calculation Speed, Power and Energy on a Raspberry Pi

    In this thesis, we explore the speed, power, and energy performance of the same data processing on the central processing unit (CPU) with and without the acceleration of the graphics processing unit (GPU) on the Raspberry Pi (RPI) microcomputer. We tested the RPI in two areas. The first compared the speed, power, and energy usage of image processing with and without GPU acceleration on the RPI model B+. The second compared the speed, power, energy usage, and accuracy of scientific calculation with and without GPU acceleration on the RPI models B+ and 3B. We used a novel method to correlate graphics processing, CPU load, power consumption, and total energy consumption. Three different benchmarks were used to play a short video: OMXplayer with GPU rendering, and Mplayer and VLC player without GPU rendering. A 3D model simulator (3D Slash) benchmark was also used to compare its power usage with that of the previous benchmarks. We used the system counter tool PERF and the system usage monitor TOP to acquire accurate CPU and Random-Access Memory (RAM) usage information. The first study design included a comparison of the running time, frame rate, power usage, and total energy consumed by the benchmarks. We used the Adafruit USB Power Gauge to log the power and energy consumed by the RPI, and its values were output to a CSV file for ease of graphing and calculation. The first study showed that the number of frames rendered per second increased dramatically when hardware rendering was used, as did electrical power consumption. Interestingly, hardware rendering takes less time than software rendering, and the total energy consumed by hardware rendering is lower than that of software rendering despite the higher power draw during hardware rendering.

    In the second study, we used the Fast Fourier Transform (FFT) as the calculation method. We developed six benchmark programs using three libraries: GPU_FFT, the Fastest Fourier Transform in the West (FFTW), and Python SciPy FFTpack (SciPy FFT) [1-3]. They were used to perform FFTs in both one dimension (1D) and two dimensions (2D) with single-precision floating-point numbers as the primary data type. The study design includes the write-up of the involved code, a comparison of the accuracy of the results against the known solution, running time, power consumption during the calculation, and the total energy consumed by the calculation. The Power Gauge was used to measure the power and energy consumed by the RPI as in the first study. In the second study, we found that general-purpose computing on graphics processing units (GPGPU) code was more energy efficient and faster than the serial code on both RPI models without much sacrifice of precision. From the two studies, we conclude that particular types of data processing, such as image processing and calculations over typical complex-valued matrices, gain substantial benefits in speed and energy expenditure from GPU rendering.
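
    For a sense of the CPU-side baseline in the second study, the sketch below times a 1D single-precision FFT with SciPy's FFT pack, one of the three libraries the thesis names. GPU_FFT itself is a Raspberry Pi-specific C library, so only the CPU path is shown; the transform sizes and repeat count here are illustrative assumptions, not the thesis' benchmark configuration.

        import time
        import numpy as np
        from scipy import fftpack

        def time_fft_1d(n, repeats=10):
            """Average wall-clock time of a 1D FFT over `repeats` runs."""
            x = np.random.rand(n).astype(np.float32)   # single-precision input
            start = time.perf_counter()
            for _ in range(repeats):
                fftpack.fft(x)
            return (time.perf_counter() - start) / repeats

        for n in (2**14, 2**18, 2**20):
            print(f"1D FFT, n={n}: {time_fft_1d(n) * 1e3:.2f} ms")

    Pairing such a timing loop with an external power logger (here, the Adafruit USB Power Gauge) is what lets the thesis report energy per computation rather than speed alone.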

    SqORAM: Read-Optimized Sequential Write-Only Oblivious RAM

    Oblivious RAM protocols (ORAMs) allow a client to access data on an untrusted storage device without revealing the access patterns. Typically, the ORAM adversary can observe both read and write accesses. Write-only ORAMs target a more practical, multi-snapshot adversary that only monitors client writes -- typical for plausible-deniability and censorship-resilient systems. This allows write-only ORAMs to achieve significantly better asymptotic performance. However, these apparent gains do not materialize in real deployments, primarily because of the random data placement strategies used to break correlations between the logical and physical namespaces, a property required for write access privacy. Random access performs poorly on both rotational disks and SSDs (often increasing wear significantly and interfering with wear-leveling mechanisms). In this work, we introduce SqORAM, a new locality-preserving write-only ORAM that preserves write access privacy without requiring random data access. Data blocks that are close to each other in the logical domain land in close proximity on the physical media. Importantly, SqORAM maintains this data locality property over time, significantly increasing read throughput. A full Linux kernel-level implementation of SqORAM is 100x faster than non-locality-preserving solutions for standard workloads and is 60-100% faster than the state of the art for typical file-system workloads.
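
    To make the write-only ORAM threat model concrete, here is a toy sketch of the generic idea (not the SqORAM construction itself): every logical write emits a fixed-size batch of physical writes at sequentially advancing positions, mixing the real block with dummy re-encryptions, so an adversary who only snapshots the disk cannot tell which logical block changed. All class, parameter, and constant names below are illustrative assumptions.

        import os, random

        BLOCK = 16    # payload bytes per block (toy value)
        BATCH = 4     # physical blocks written per logical write (real + dummies)

        class ToyWriteOnlyORAM:
            def __init__(self, n_physical):
                # Physical store starts out filled with random ("encrypted") blocks.
                self.store = [os.urandom(BLOCK) for _ in range(n_physical)]
                self.pos_map = {}    # logical id -> physical index (kept client-side)
                self.head = 0        # next sequential physical position to write

            def _next_slot(self):
                slot = self.head % len(self.store)
                self.head += 1
                return slot

            def write(self, logical_id, data):
                # The adversary sees BATCH sequential writes regardless of which
                # logical block is the real target. A real design must also track
                # and relocate still-live blocks before the head wraps around;
                # that bookkeeping is omitted here.
                real_offset = random.randrange(BATCH)
                for i in range(BATCH):
                    slot = self._next_slot()
                    if i == real_offset:
                        self.store[slot] = data.ljust(BLOCK, b"\0")
                        self.pos_map[logical_id] = slot
                    else:
                        self.store[slot] = os.urandom(BLOCK)   # dummy re-encryption

            def read(self, logical_id):
                # Reads are not hidden in this adversary model, so they go straight
                # to the mapped physical block.
                return self.store[self.pos_map[logical_id]]

        oram = ToyWriteOnlyORAM(n_physical=64)
        oram.write(7, b"hello")
        print(oram.read(7).rstrip(b"\0"))

    SqORAM's contribution, per the abstract, is keeping logically adjacent blocks physically adjacent while preserving this write privacy, which the naive sketch above does not attempt.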

    Knowledge representation into Ada parallel processing

    The Knowledge Representation into Ada Parallel Processing project is a joint NASA- and Air Force-funded project to demonstrate the execution of intelligent systems in Ada on the Charles Stark Draper Laboratory fault-tolerant parallel processor (FTPP). Two applications were demonstrated: a portion of the adaptive tactical navigator and a real-time controller. Both systems are implemented as Activation Framework Objects on the Activation Framework intelligent scheduling mechanism developed by Worcester Polytechnic Institute. The implementations and the results of performance analyses, showing speedup due to parallelism and initial efficiency improvements, are detailed, and further areas for performance improvement are suggested.

    Hardware Acceleration of the Robust Header Compression (RoHC) Algorithm

    With the proliferation of Long Term Evolution (LTE) networks, many cellular carriers are embracing the emerging field of mobile Voice over Internet Protocol (VoIP). The robust header compression (RoHC) framework was introduced as a part of the LTE Layer 2 stack to compress the large headers of VoIP packets before they are transmitted over LTE IP-based architectures. The headers, an encapsulated Real-time Transport Protocol (RTP)/User Datagram Protocol (UDP)/Internet Protocol (IP) stack, are large compared to the small payload. This header-compression scheme is especially useful for efficient utilization of the radio bandwidth and network resources. In an LTE base-station implementation, RoHC is a processing-intensive algorithm that may be the bottleneck of the system and thus the limiting factor for the number of users served. In this thesis, a hardware-software and a full-hardware solution are proposed, targeting LTE base stations, to accelerate this computationally intensive algorithm and enhance the throughput and capacity of the system. The results of both solutions are discussed and compared with respect to design metrics such as throughput, capacity, power consumption, chip area, and flexibility. This comparison is instrumental in making architectural-level trade-off decisions in order to meet present-day requirements and also be ready to support future evolution. In terms of throughput, a gain of 20% (6250 packets/sec processed at a frequency of 150 MHz) is achieved in the HW-SW solution compared to the SW-only solution by implementing the Cyclic Redundancy Check (CRC) and Least Significant Bit (LSB) encoding blocks as hardware accelerators, whereas the full-HW implementation achieves a throughput 45 times that of the SW-only solution (244,000 packets/sec processed at a frequency of 100 MHz). However, the full-HW solution consumes more lookup tables (LUTs) when synthesized on a field-programmable gate array (FPGA) platform compared to the HW-SW solution. On an Arria II GX, the HW-SW and full-HW solutions use 2578 and 7477 LUTs and consume 1.5 and 0.9 Watts, respectively. Finally, both solutions are synthesized and verified on Altera's Arria II GX FPGA.
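
    The LSB encoding block that gets moved into hardware is based on a simple idea: the compressor sends only the k low-order bits of a slowly changing header field (for example, an RTP sequence number), and the decompressor reconstructs the full value from a reference it already holds. The sketch below is a simplified software illustration of that idea, not the thesis' accelerator or the exact W-LSB procedure of the RoHC RFC; the parameter names and the interpretation-interval offset p are simplifying assumptions.

        def lsb_encode(value, k):
            """Transmit only the k least significant bits of the field."""
            return value & ((1 << k) - 1)

        def lsb_decode(lsb_bits, k, reference, p=0):
            """Reconstruct the unique value in [reference - p, reference - p + 2**k - 1]
            whose low k bits match the transmitted bits."""
            base = reference - p
            candidate = (base & ~((1 << k) - 1)) | lsb_bits
            if candidate < base:
                candidate += 1 << k
            return candidate

        ref = 1000                        # value the decompressor last saw
        for seq in (1001, 1005, 1013):
            bits = lsb_encode(seq, k=4)   # only 4 bits go on the wire
            print(seq, "->", bits, "->", lsb_decode(bits, 4, ref))
            ref = seq

    Because encode/decode reduce to masking, shifting, and a compare, the block maps naturally onto a small combinational datapath, which is why it is a good candidate for a hardware accelerator alongside CRC.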

    Protein alignment algorithms with an efficient backtracking routine on multiple GPUs

    Background: Pairwise sequence alignment methods are widely used in biological research. The increasing number of sequences is perceived as one of the upcoming challenges for sequence alignment methods in the near future. To overcome this challenge, several GPU (Graphics Processing Unit) computing approaches have been proposed lately. These solutions show the great potential of the GPU platform, but in most cases they address the problem of sequence database scanning and compute only the alignment score, whereas the alignment itself is omitted. Thus, the need arose to implement the global and semiglobal Needleman-Wunsch and Smith-Waterman algorithms with a backtracking procedure, which is needed to construct the alignment.

    Results: In this paper we present a solution that performs the alignment of every given sequence pair, which is a required step for progressive multiple sequence alignment methods as well as for DNA recognition at the DNA assembly stage. The tests performed show that the implementation, with performance up to 6.3 GCUPS on a single GPU for affine gap penalties, is very efficient in comparison to other CPU- and GPU-based solutions. Moreover, multiple-GPU support with load balancing makes the application very scalable.

    Conclusions: The article shows that the backtracking procedure of the sequence alignment algorithms can be designed to fit the GPU architecture. Therefore, our algorithm, apart from scores, is able to compute pairwise alignments. This opens a wide range of new possibilities, allowing other methods from the area of molecular biology to take advantage of the new computational architecture. The tests performed show that the efficiency of the implementation is excellent. Moreover, the speed of our GPU-based algorithms can be increased almost linearly when using more than one graphics card.
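
    As a reference point for the backtracking step the paper ports to the GPU, here is a compact CPU sketch of global alignment with traceback (Needleman-Wunsch with a linear gap penalty). The scoring values are illustrative; the paper's GPU version additionally covers semiglobal alignment, Smith-Waterman, and affine gap penalties.

        def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
            n, m = len(a), len(b)
            # Dynamic-programming score matrix with gap-initialized borders.
            H = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):
                H[i][0] = i * gap
            for j in range(1, m + 1):
                H[0][j] = j * gap
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    s = match if a[i - 1] == b[j - 1] else mismatch
                    H[i][j] = max(H[i - 1][j - 1] + s,   # (mis)match
                                  H[i - 1][j] + gap,     # gap in b
                                  H[i][j - 1] + gap)     # gap in a
            # Backtracking: walk from the bottom-right corner to (0, 0).
            out_a, out_b, i, j = [], [], n, m
            while i > 0 or j > 0:
                s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
                if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + s:
                    out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
                elif i > 0 and H[i][j] == H[i - 1][j] + gap:
                    out_a.append(a[i - 1]); out_b.append("-"); i -= 1
                else:
                    out_a.append("-"); out_b.append(b[j - 1]); j -= 1
            return H[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

        print(needleman_wunsch("GATTACA", "GCATGCU"))

    The GPU challenge the paper addresses is exactly this traceback: the score matrix fills nicely in parallel along anti-diagonals, but reconstructing the path requires either storing the full matrix or recomputing parts of it within the GPU's memory limits.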

    Combining k-Induction with Continuously-Refined Invariants

    Bounded model checking (BMC) is a well-known and successful technique for finding bugs in software. k-induction is an approach to extend BMC-based approaches from falsification to verification. Automatically generated auxiliary invariants can be used to strengthen the induction hypothesis. We improve this approach and further increase effectiveness and efficiency in the following way: we start with light-weight invariants and refine these invariants continuously during the analysis. We present and evaluate an implementation of our approach in the open-source verification framework CPAchecker. Our experiments show that combining k-induction with continuously refined invariants significantly increases effectiveness and efficiency, and outperforms all existing implementations of k-induction-based software verification in terms of successful verification results.
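
    For readers unfamiliar with k-induction itself, the sketch below runs the base case and the step case on a toy transition system using the z3 SMT solver (pip install z3-solver). The system, property, and helper names are illustrative, not taken from the paper or from CPAchecker; the paper's contribution would correspond to conjoining continuously refined auxiliary invariants into the step-case assumptions.

        from z3 import Int, Solver, And, Not, sat

        def init(x):
            return x == 0               # toy program: a counter starting at 0

        def trans(x, x_next):
            return x_next == x + 2      # each step increments the counter by 2

        def prop(x):
            return x % 2 == 0           # safety property: the counter stays even

        def k_induction(k):
            xs = [Int(f"x{i}") for i in range(k + 1)]
            steps = [trans(xs[i], xs[i + 1]) for i in range(k)]

            # Base case: the property holds on every state of a length-k prefix from init.
            base = Solver()
            base.add(init(xs[0]), *steps, Not(And(*[prop(x) for x in xs])))
            if base.check() == sat:
                return "property violated within k steps"

            # Step case: k consecutive states satisfying the property imply the next one
            # does too. Auxiliary invariants would be added here as extra assumptions.
            step = Solver()
            step.add(*steps, *[prop(x) for x in xs[:-1]], Not(prop(xs[-1])))
            if step.check() == sat:
                return "k-induction inconclusive (increase k or strengthen invariants)"
            return "property proved"

        print(k_induction(1))

    When the step case is inconclusive, one can either increase k or, as in the paper, strengthen the hypothesis with invariants that are refined while the analysis runs.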

    Microprocessor Implementation of Autoregressive Analysis of Process Sensor Signals

    Automated signal analysis can aid effective system surveillance and can also be used to analyze the dynamic behavior of a system, such as its impulse response and step response. Autoregressive analysis is a parametric technique widely used for system surveillance and diagnosis. The main objective of this research work is to develop an embedded system for autoregressive analysis of sensor signals in an online fashion for monitoring system parameters. This thesis presents the algorithm, data representation, and performance of the optimized microprocessor implementation of autoregressive analysis. In this work, an autoregressive (AR) model is generated as the solution to a linear system of equations known as the Yule-Walker equations. The generated model is then implemented on the Motorola PowerPC MPC555 processor. The embedded software for autoregressive analysis is written in the C programming language using fixed-point arithmetic. It includes estimation of the autoregressive parameters, recursive estimation of the noise variance using the AR parameters, determination of the optimal model order, and model validation.
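
    A hedged sketch of the fitting pipeline the abstract describes: estimate autocorrelations from a sensor signal, solve the Yule-Walker equations with the Levinson-Durbin recursion, and report the driving-noise variance. This is a floating-point Python illustration of the standard method, whereas the thesis implements it in fixed-point C on the MPC555; the toy AR(2) signal and all function names are my own.

        import numpy as np

        def autocorr(x, max_lag):
            """Biased autocorrelation estimates r[0..max_lag] of a zero-meaned signal."""
            x = np.asarray(x, dtype=float) - np.mean(x)
            return np.array([np.dot(x[:len(x) - k], x[k:]) / len(x)
                             for k in range(max_lag + 1)])

        def levinson_durbin(r, order):
            """Solve the Yule-Walker equations for the AR coefficients."""
            a = np.zeros(order)
            err = r[0]
            for k in range(order):
                acc = r[k + 1] - np.dot(a[:k], r[k:0:-1])
                reflection = acc / err
                a[:k] = a[:k] - reflection * a[:k][::-1]
                a[k] = reflection
                err *= (1.0 - reflection ** 2)
            return a, err          # AR coefficients and residual (noise) variance

        # Toy sensor signal: a 2nd-order AR process driven by white noise.
        rng = np.random.default_rng(0)
        x = np.zeros(2000)
        for t in range(2, len(x)):
            x[t] = 1.5 * x[t - 1] - 0.7 * x[t - 2] + rng.normal()

        r = autocorr(x, max_lag=2)
        coeffs, noise_var = levinson_durbin(r, order=2)
        print("AR coefficients:", coeffs, "noise variance:", noise_var)

    The recovered coefficients should be close to the true values (1.5, -0.7), and the residual variance returned by the recursion is the quantity the thesis tracks recursively for surveillance.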