135 research outputs found

    Rapid Parallelization by Collaboration

    Get PDF
    The widespread adoption of Chip Multiprocessors has renewed the emphasis on the use of parallelism to improve performance. The present and growing diversity in hardware architectures and software environments, however, continues to pose difficulties in the effective use of parallelism thus delaying a quick and smooth transition to the concurrency era. In this document, we describe the research being conducted at the Computer Science Department at Columbia University on a system called COMPASS that aims to simplify this transition by providing advice to programmers considering parallelizing their code. The advice proffered to the programmer is based on the wisdom collected from programmers who have already parallelized some code. The utility of COMPASS rests, not only on its ability to collect the wisdom unintrusively but also on its ability to automatically seek, find and synthesize this wisdom into advice that is tailored to the code the user is considering parallelizing and to the environment in which the optimized program will execute in. COMPASS provides a platform and an extensible framework for sharing human expertise about code parallelization -- widely and on diverse hardware and software. By leveraging the "Wisdom of Crowds" model which has been conjunctured to scale exponentially and which has successfully worked for Wikis, COMPASS aims to enable rapid parallelization of code and thus continue to extend the benefits for Moore's law scaling to science and society

    Big Data for the Real-Time Analysis of the Cherenkov Telescope Array Observatory

    Get PDF
    Lo scopo di questo lavoro di tesi è quello di progettare e sviluppare un framework che supporti l'analisi in tempo reale nel contesto del Cherenkov Telescope Array (CTA). CTA è un consorzio internazionale che comprende 1420 membri provenienti da oltre 200 istituti da 31 Nazioni. CTA punta ad essere il più grande e più sensibile osservatorio ground-based di raggi gamma di prossima generazione in grado di gestire un'elevata quantità di dati e un'alta velocità di trasmissione, compresa tra i 0,5 e i 10 GB/s, con una rate di acquisizione nominale di 6 kHz. A tale riguardo, è stata sviluppata la RTAlib in grado di fornire un'API semplice e ad alte prestazioni per archiviare o fare caching dei dati generati durante la fase di ricostruzione e analisi. Per far fronte alle elevate velocità di trasmissione di CTA, la RTAlib sfrutta il multiprocesso, il multi-threading, le transazioni ed un accesso trasparente a MySQL o Redis per far fronte a diversi casi d’uso. Tutte queste funzionalità sono state testate ottenendo risultati entro i requisiti richiesti. In particolare, con la libreria sviluppata si riesce a fare caching di dati con Redis, con processi scrittori e lettori che lavorano in parallelo, ad una rate di 8 kHz in scrittura e 30 kHz in lettura. Il team in cui ho lavorato ha basato sui principi dell'approccio Scrum e DevOps il proprio processo di sviluppo del software, in particolare dalle unit test fino alla continuous integration, utilizzando tools ad accesso pubblico su GitHub oppure tramite Jenkins. Grazie a questo approccio si è puntato ad avere una elevata qualità del codice fin dall’inizio del progetto, e questo è risultato uno degli approcci più importanti per ottenere i risultati raggiunti

    Hadoop-based solutions for variant calling and variant analysis

    Get PDF

    Enhancing the efficiency and practicality of software transactional memory on massively multithreaded systems

    Get PDF
    Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is one of parallel programming models that aims at simplifying synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model but there are still important challenges that must be addressed to make it more practical and efficient in mainstream parallel programming. The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid with realistic TM benchmarks and being able to run the same benchmarks on different STM systems. We first introduce a benchmark suite, RMS-TM, a comprehensive benchmark suite to evaluate HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error prone and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends. The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors. Performance and scalability of current TM designs, in particular STM desings, do not always meet the programmer's expectation, especially at scale. To overcome this limitation, we propose a new STM design, STM2, based on an assisted execution model in which time-consuming TM operations are offloaded to auxiliary threads while application threads optimistically perform computation. Surprisingly, our results show that STM2 provides, on average, speedups between 1.8x and 5.2x over state-of-the-art STM systems. On the other hand, we notice that assisted-execution systems may show low processor utilization. To alleviate this problem and to increase the efficiency of STM2, we enriched STM2 with a runtime mechanism that automatically and adaptively detects application and auxiliary threads' computing demands and dynamically partition hardware resources between the pair through the hardware thread prioritization mechanism implemented in POWER machines. The third challenge is to define a notion of what it means for a TM program to be correctly synchronized. The current definition of transactional data race requires all transactions to be totally ordered "as if'' serialized by a global lock, which limits the scalability of TM designs. To remove this constraint, we first propose to relax the current definition of transactional data race to allow a higher level of concurrency. Based on this definition we propose the first practical race detection algorithm for C/C++ applications (TRADE) and implement the corresponding race detection tool. Then, we introduce a new definition of transactional data race that is more intuitive, transparent to the underlying TM implementation, can be used for a broad set of C/C++ TM programs. Based on this new definition, we proposed T-Rex, an efficient and scalable race detection tool for C/C++ TM applications. Using TRADE and T-Rex, we have discovered subtle transactional data races in widely-used STAMP applications which have not been reported in the past

    Development of Copy Number Variation Detection Algorithms and Their Application to Genome Diversity Studies

    Full text link
    Copy number variation (CNV) is an important class of variation that contributes to genome evolution and disease. CNVs that become fixed in a species give rise to segmental duplications; and already duplicated sequence is prone to subsequent gain and loss leading to additional copy-number variation. Multiple methods exist for defining CNV based on high-throughput sequencing data, including analysis of mapped read-depth. However, accurately assessing CNV can be computationally costly and multi-mapping-based approaches may not specifically distinguish among paralogs or gene families. We present two rapid CNV estimation algorithms, QuicK-mer and fastCN, for second generation short sequencing data. The QuicK-mer program is a paralog sensitive CNV detector which relies on enumerating unique k-mers from a pre-tabulated reference genome. The latest version of QuicK-mer 2.0 utilizes a newly constructed k-mer counting core based on the DJB hash function and permits multithreaded CNV counting of a large input file. As a result, QuicK-mer 2.0 can produce copy-number profiles form a 10X coverage mammalian genome in less than 5 minutes. The second CNV estimator, fastCN, is based on sequence mapping and has tolerance for mismatches. The pipeline is built around the mrsFAST read mapper and does not use additional time compared to the mrsFAST mapping process. We validated the accuracy of both approaches with existing data on human paralogous regions from the 1000 Genomes Project. We also employed QuicK-mer to perform an assessment of copy number variation on chimpanzee and human Y chromosomes. CNV has also been associated with phenotypic changes that occur also during animal domestication. Large scale CNVs were observed previously in cattle, pigs and chicken domestication. We assessed the role of CNV in dog domestication though a comparison of semi-feral village dogs and a global collection of wolfs. Our CNV selection scan uncovered many previously confirmed duplications and deletions but did not identify fixed variants that may have contributed to the initial domestication process. During this selection study, we uncovered CNVs that are errors in the existing canine reference assembly. We attempted to the complement the current CanFam3.1 reference with the de novo genome assembly of a Great Dane breed dog named Zoey. A 50x PacBio long reads sequencing with median insert size of 7.8kbp was conducted. The resulting assembly shows significant improvement with 20x increased continuity and two third reductions of unplaced contigs. The Zoey Great Dane assembly closes 80% of CanFam3.1 gaps where high GC content was the major culprit in the original assembly. Using unique k-mers assigned in these closed gaps, QuicK-mer was able to find many of these regions are fixed across dogs while small proportion shows variability.PHDHuman GeneticsUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/150064/1/feichens_1.pdfDescription of feichens_1.pdf : Restricted to UM users only

    Doublade: Unknown Vulnerability Detection in Smart Contracts Via Abstract Signature Matching and Refined Detection Rules

    Full text link
    With the prosperity of smart contracts and the blockchain technology, various security analyzers have been proposed from both the academia and industry to address the associated risks. Yet, there does not exist a high-quality benchmark of smart contract vulnerability for security research. In this study, we propose an approach towards building a high-quality vulnerability benchmark. Our approach consists of two parts. First, to improve recall, we propose to search for similar vulnerabilities in an automated way by leveraging the abstract vulnerability signature (AVS). Second, to remove the false positives (FPs) due to AVS-based matching, we summarize the detection rules of existing tools and apply the refined rules by considering various defense mechanisms (DMs). By integrating AVS-based code matching and the refined detection rules (RDR), our approach achieves higher precision and recall. On the collected 76,354 contracts, we build a benchmark consisting of 1,219 vulnerabilities covering five different vulnerability types identified together by our tool (DOUBLADE) and other three scanners. Additionally, we conduct a comparison between DOUBLADE and the others, on an additional 17,770 contracts. Results show that DOUBLADE can yield a better detection accuracy with similar execution time

    Mixed Speculative Multithreaded Execution Models

    Get PDF
    Institute for Computing Systems ArchitectureThe current trend toward chip multiprocessor architectures has placed great pressure on programmers and compilers to generate thread-parallel programs. Improved execution performance can no longer be obtained via traditional single-thread instruction level parallelism (ILP), but, instead, via multithreaded execution. One notable technique that facilitates the extraction of parallel threads from sequential applications is thread-level speculation (TLS). This technique allows programmers/compilers to generate threads without checking for inter-thread data and control dependences, which are then transparently enforced by the hardware. Most prior work on TLS has concentrated on thread selection and mechanisms to efficiently support the main TLS operations, such as squashes, data versioning, and commits. This thesis seeks to enhance TLS functionality by combining it with other speculative multithreaded execution models. The main idea is that TLS already requires extensive hardware support, which when slightly augmented can accommodate other speculative multithreaded techniques. Recognizing that for different applications, or even program phases, the application bottlenecks may be different, it is reasonable to assume that the more versatile a system is, the more efficiently it will be able to execute the given program. As mentioned above, generating thread-parallel programs is hard and TLS has been suggested as an execution model that can speculatively exploit thread-level parallelism (TLP) even when thread independence cannot be guaranteed by the programmer/ compiler. Alternatively, the helper threads (HT) execution model has been proposed where subordinate threads are executed in parallel with a main thread in order to improve the execution efficiency (i.e., ILP) of the latter. Yet another execution model, runahead execution (RA), has also been proposed where subordinate versions of the main thread are dynamically created especially to cope with long-latency operations, again with the aim of improving the execution efficiency of the main thread (ILP). Each one of these multithreaded execution models works best for different applications and application phases. We combine these three models into a single execution model and single hardware infrastructure such that the system can dynamically adapt to find the most appropriate multithreaded execution model. More specifically, TLS is favored whenever successful parallel execution of instructions in multiple threads (i.e., TLP) is possible and the system can seamlessly transition at run-time to the other models otherwise. In order to understand the tradeoffs involved, we also develop a performance model that allows one to quantitatively attribute overall performance gains to either TLP or ILP in such combined multithreaded execution model. Experimental results show that our combined execution model achieves speedups of up to 41.2%, with an average of 10.2%, over an existing state-of-the-art TLS system and speedups of up to 35.2%, with an average of 18.3%, over a flavor of runahead execution for a subset of the SPEC2000 Integer benchmark suite. We then investigate how a common ILP-enhancingmicroarchitectural feature, namely branch prediction, interacts with TLS.We show that branch prediction for TLS is even more important than it is for single core machines. Unfortunately, branch prediction for TLS systems is also inherently harder. Code partitioning and re-executions of squashed threads pollute the branch history making it harder for predictors to be accurate. We thus propose to augment the hardware, so as to accommodate Multi-Path (MP) execution within the existing TLS protocol. Under the MP execution model, all paths following a number of hard-to-predict conditional branches are followed. MP execution thus, removes branches that would have been otherwise mispredicted helping in this way the processor to exploit more ILP. We show that with only minimal hardware support, one can combine these two execution models into a unified one, which can achieve far better performance than both TLS and MP execution. Experimental results show that our combied execution model achieves speedups of up to 20.1%, with an average of 8.8%, over an existing state-of-the-art TLS system and speedups of up to 125%, with an average of 29.0%, when compared with multi-path execution for a subset of the SPEC2000 Integer benchmark suite. Finally, Since systems that support speculative multithreading usually treat all threads equally, they are energy-inefficient. This inefficiency stems from the fact that speculation occasionally fails and, thus, power is spent on threads that will have to be discarded. We propose a profitability-based power allocation scheme, where we “steal” power from non-profitable threads and use it to speed up more useful ones. We evaluate our techniques for a state-of-the-art TLS system and show that, with minimalhardware support, we achieve improvements in ED of up to 25.5% with an average of 18.9%, for a subset of the SPEC 2000 Integer benchmark suite

    Energy-aware scheduling in heterogeneous computing systems

    Get PDF
    In the last decade, the grid computing systems emerged as useful provider of the computing power required for solving complex problems. The classic formulation of the scheduling problem in heterogeneous computing systems is NP-hard, thus approximation techniques are required for solving real-world scenarios of this problem. This thesis tackles the problem of scheduling tasks in a heterogeneous computing environment in reduced execution times, considering the schedule length and the total energy consumption as the optimization objectives. An efficient multithreading local search algorithm for solving the multi-objective scheduling problem in heterogeneous computing systems, named MEMLS, is presented. The proposed method follows a fully multi-objective approach, applying a Pareto-based dominance search that is executed in parallel by using several threads. The experimental analysis demonstrates that the new multithreading algorithm outperforms a set of fast and accurate two-phase deterministic heuristics based on the traditional MinMin. The new ME-MLS method is able to achieve significant improvements in both makespan and energy consumption objectives in reduced execution times for a large set of testbed instances, while exhibiting very good scalability. The ME-MLS was evaluated solving instances comprised of up to 2048 tasks and 64 machines. In order to scale the dimension of the problem instances even further and tackle large-sized problem instances, the Graphical Processing Unit (GPU) architecture is considered. This line of future work has been initially tackled with the gPALS: a hybrid CPU/GPU local search algorithm for efficiently tackling a single-objective heterogeneous computing scheduling problem. The gPALS shows very promising results, being able to tackle instances of up to 32768 tasks and 1024 machines in reasonable execution times.En la última década, los sistemas de computación grid se han convertido en útiles proveedores de la capacidad de cálculo necesaria para la resolución de problemas complejos. En su formulación clásica, el problema de la planificación de tareas en sistemas heterogéneos es un problema NP difícil, por lo que se requieren técnicas de resolución aproximadas para atacar instancias de tamaño realista de este problema. Esta tesis aborda el problema de la planificación de tareas en sistemas heterogéneos, considerando el largo de la planificación y el consumo energético como objetivos a optimizar. Para la resolución de este problema se propone un algoritmo de búsqueda local eficiente y multihilo. El método propuesto se trata de un enfoque plenamente multiobjetivo que consiste en la aplicación de una búsqueda basada en dominancia de Pareto que se ejecuta en paralelo mediante el uso de varios hilos de ejecución. El análisis experimental demuestra que el algoritmo multithilado propuesto supera a un conjunto de heurísticas deterministas rápidas y e caces basadas en el algoritmo MinMin tradicional. El nuevo método, ME-MLS, es capaz de lograr mejoras significativas tanto en el largo de la planificación y como en consumo energético, en tiempos de ejecución reducidos para un gran número de casos de prueba, mientras que exhibe una escalabilidad muy promisoria. El ME-MLS fue evaluado abordando instancias de hasta 2048 tareas y 64 máquinas. Con el n de aumentar la dimensión de las instancias abordadas y hacer frente a instancias de gran tamaño, se consideró la utilización de la arquitectura provista por las unidades de procesamiento gráfico (GPU). Esta línea de trabajo futuro ha sido abordada inicialmente con el algoritmo gPALS: un algoritmo híbrido CPU/GPU de búsqueda local para la planificación de tareas en en sistemas heterogéneos considerando el largo de la planificación como único objetivo. La evaluación del algoritmo gPALS ha mostrado resultados muy prometedores, siendo capaz de abordar instancias de hasta 32768 tareas y 1024 máquinas en tiempos de ejecución razonables
    corecore