
    Towards co-designed optimizations in parallel frameworks: A MapReduce case study

    The explosion of Big Data was followed by the proliferation of numerous complex parallel software stacks whose aim is to tackle the challenges of the data deluge. A drawback of such a multi-layered hierarchical deployment is the inability to maintain and delegate vital semantic information between layers in the stack. Software abstractions increase the semantic distance between an application and its generated code. However, parallel software frameworks contain inherent semantic information that general-purpose compilers are not designed to exploit. This paper presents a case study demonstrating how the specific semantic information of the MapReduce paradigm can be exploited on multicore architectures. The presented framework, MR4J, has been implemented in Java and evaluated against hand-optimized C and C++ equivalents. The initial observed results led to the design of a semantically aware optimizer that runs automatically without requiring modification to application code. The optimizer is able to speed up the execution of MR4J by up to 2.0x. The introduced optimization not only improves the performance of the generated code during the map phase, but also reduces the pressure on the garbage collector. This demonstrates how semantic information can be harnessed without sacrificing sound software engineering practices when using parallel software frameworks. Comment: 8 pages
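The optimization the abstract describes targets the map phase and garbage-collector pressure. A minimal sketch of that idea, assuming a word-count workload (this illustrates the MapReduce pattern and an in-map combiner, not MR4J's actual API): instead of emitting one intermediate key-value pair object per word occurrence, counts are merged into a per-mapper map, and the reduce phase merges the partial maps.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCount {
    // Map phase with an in-map combiner: counts are folded into a local map,
    // avoiding one intermediate pair allocation per word occurrence, which
    // is what reduces garbage-collector pressure in the map phase.
    static Map<String, Integer> mapPhase(List<String> lines) {
        Map<String, Integer> local = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) local.merge(word, 1, Integer::sum);
            }
        }
        return local;
    }

    // Reduce phase: merge the partial maps produced by parallel mappers.
    static Map<String, Integer> reducePhase(List<Map<String, Integer>> partials) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((k, v) -> result.merge(k, v, Integer::sum));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> p1 = mapPhase(List.of("a b a"));
        Map<String, Integer> p2 = mapPhase(List.of("b c"));
        Map<String, Integer> total = reducePhase(List.of(p1, p2));
        System.out.println(total.get("a") + " " + total.get("b") + " " + total.get("c")); // 2 2 1
    }
}
```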

    Research and Education in Computational Science and Engineering

    Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers of all persuasions with algorithmic inventions and software systems that transcend disciplines and scales. Carried on a wave of digital technology, CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society; and the CSE community is at the core of this transformation. However, a combination of disruptive developments---including the architectural complexity of extreme-scale computing, the data revolution that engulfs the planet, and the specialization required to follow the applications to new frontiers---is redefining the scope and reach of the CSE endeavor. This report describes the rapid expansion of CSE and the challenges to sustaining its bold advances. The report also presents strategies and directions for CSE research and education for the next decade. Comment: Major revision, to appear in SIAM Review

    SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures

    Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficient synchronization among the NDP cores of a system is necessary. However, supporting synchronization in many NDP systems is challenging because they lack shared caches and hardware cache coherence support, which are commonly used for synchronization in multicore systems, and communication across different NDP units can be expensive. This paper comprehensively examines the synchronization problem in NDP systems, and proposes SynCron, an end-to-end synchronization solution for NDP systems. SynCron adds low-cost hardware support near memory for synchronization acceleration, and avoids the need for hardware cache coherence support. SynCron has three components: 1) a specialized cache memory structure to avoid memory accesses for synchronization and minimize latency overheads, 2) a hierarchical message-passing communication protocol to minimize expensive communication across NDP units of the system, and 3) a hardware-only overflow management scheme to avoid performance degradation when hardware resources for synchronization tracking are exceeded. We evaluate SynCron using a variety of parallel workloads, covering various contention scenarios. SynCron improves performance by 1.27× on average (up to 1.78×) under high-contention scenarios, and by 1.35× on average (up to 2.29×) under low-contention real applications, compared to state-of-the-art approaches. SynCron reduces system energy consumption by 2.08× on average (up to 4.25×). Comment: To appear in the 27th IEEE International Symposium on High-Performance Computer Architecture (HPCA-27)
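The hierarchical principle behind component 2) can be sketched in software (this is an illustrative analogy, not SynCron's hardware design; the class and method names are ours): cores within an NDP unit update cheap unit-local state, and only one aggregated message per unit crosses the expensive inter-unit interconnect.

```java
import java.util.concurrent.atomic.AtomicLong;

// Two-level combining counter: per-unit partial counts absorb most updates
// locally; flush() sends one aggregated update per unit to the global state,
// mirroring how hierarchical synchronization limits cross-unit traffic.
public class HierarchicalCounter {
    private final AtomicLong global = new AtomicLong(); // shared across units
    private final long[] local;                         // one partial per NDP unit

    public HierarchicalCounter(int units) {
        local = new long[units];
    }

    // Cores of a unit update only their unit-local partial (assumed to be
    // serialized within the unit in this single-threaded sketch).
    public void add(int unit, long delta) {
        local[unit] += delta;
    }

    // One cross-unit update per unit: fold all partials into the global count.
    public long flush() {
        long sum = 0;
        for (int u = 0; u < local.length; u++) {
            sum += local[u];
            local[u] = 0;
        }
        return global.addAndGet(sum);
    }
}
```

The same shape underlies hierarchical locks and barriers: resolve contention locally first, then elect one representative per unit for the global step.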

    Skeleton-based parallel programming for 3D image processing

    The performance gains delivered by parallel computing have led to its growing use in solving computationally demanding problems in many areas of science and engineering. However, because creating parallel programs is complex, tools are needed that simplify their development. One class of problems solved with parallel programming is image processing, in fields such as Materials Science and Medicine. As in other areas, for this type of problem it is possible to identify common parallelization solutions and strategies that capture knowledge accumulated over time. Knowledge of these patterns, and making them available, thus simplifies the development of such parallel programs, but tools are needed that implement them with adequate performance. The patterns should also be easy to adapt and reuse in similar problems, improving productivity in the development of programs in the various areas that require image processing. In parallel computing in general, there are already tools that provide parallelization patterns, allowing non-experts to develop their programs more simply. Algorithmic skeletons are one existing solution for capturing these patterns, and there are frameworks that implement them, freeing programmers from having to know the details of the target architecture. Algorithmic skeletons can also be applied to image-processing problems, capturing patterns in this domain directly or by composition. However, existing algorithmic-skeleton tools do not provide optimized patterns with adaptive properties that can take into account either the characteristics of the executing system (e.g. system load versus energy consumption) or of the image being processed (e.g. images with more or fewer objects). In this context, this work began by studying and comparing implementations of an image-processing algorithm using two algorithmic-skeleton frameworks that can generate code for GPGPUs, in order to identify the underlying patterns and the most suitable framework. A further contribution was the extension of the FastFlow framework with an architecture for measuring the execution state of the farm skeleton, and the extension of the latter with adaptive properties. It is possible to change the number of workers of a farm, to control the distribution of tasks among the workers, and to choose whether the skeleton executes on the CPU or the GPU.
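The farm skeleton with a tunable number of workers can be sketched as follows (a minimal illustration of the pattern, assuming a thread-pool backend; the names are ours, not FastFlow's C++ API, and FastFlow's adaptive extension changes the worker count at runtime rather than per call):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class Farm {
    // Farm skeleton: an emitter distributes independent tasks to nWorkers
    // workers, and a collector gathers the results in task order. Varying
    // nWorkers is the degree of parallelism an adaptive farm would tune
    // against system load or energy consumption.
    public static <T, R> List<R> run(List<T> tasks, Function<T, R> worker, int nWorkers)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T task : tasks) {
                futures.add(pool.submit(() -> worker.apply(task))); // emitter
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // collector, preserving task order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In an image-processing setting, each task would be an image slice or a 3D sub-volume and the worker a per-slice filter; the adaptive extension described above additionally monitors the farm's execution state to decide the worker count and the CPU/GPU placement.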