730 research outputs found

    On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution

    Get PDF
    Technology trends are making the cost of data movement increasingly dominant, in terms of both energy and time, over the cost of performing arithmetic operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to total computational power (also referred to as the machine balance parameter) in parallel computer systems is decreasing. It is therefore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support them on future systems can be well understood. In this paper, we extend the well-known red-blue pebble game to derive lower bounds on the data movement complexity of the parallel execution of computational directed acyclic graphs (CDAGs) on parallel systems. We model multi-node multi-core parallel systems, with the total physical memory distributed across the nodes (which are connected through an interconnection network) and a multi-level shared cache hierarchy for the processors within each node. We also develop new techniques for the lower-bound characterization of non-homogeneous CDAGs. We demonstrate the use of the methodology by analyzing the CDAGs of several numerical algorithms to develop lower bounds on data movement for their parallel execution.
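The red-blue pebble game counts transfers between a bounded fast memory (red pebbles) and unbounded slow memory (blue pebbles). As a rough illustration of the quantity being bounded, the sketch below greedily simulates execution of a CDAG with at most S red pebbles and LRU eviction, counting loads and eviction stores; the function name and the LRU policy are illustrative assumptions, not the paper's lower-bound construction (inputs are treated as computed for free).

```python
from collections import OrderedDict

def io_cost(preds, order, S):
    """Greedy red-blue pebble game simulation: execute the DAG
    (preds: node -> tuple of predecessor nodes) in topological `order`
    with at most S red pebbles (fast-memory slots), LRU eviction.
    Counts blue->red loads plus red->blue stores on eviction.
    An upper-bound-style estimate for illustration only."""
    red = OrderedDict()   # nodes currently holding a red pebble (LRU order)
    io = 0

    def make_room():
        nonlocal io
        while len(red) >= S:
            red.popitem(last=False)   # evict least recently used
            io += 1                   # red -> blue store

    for v in order:
        for p in preds.get(v, ()):
            if p in red:
                red.move_to_end(p)    # refresh LRU position
            else:
                make_room()
                red[p] = True
                io += 1               # blue -> red load
        make_room()
        red[v] = True                 # place the result on a red pebble
    return io
```

With only two red pebbles, a three-node chain already forces one transfer, while a machine with ample fast memory incurs none.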

    Evaluating Multicore Algorithms on the Unified Memory Model

    Get PDF
    One of the challenges to achieving good performance on multicore architectures is the effective utilization of the underlying memory hierarchy. While this is an issue for single-core architectures, it is a critical problem for multicore chips. In this paper, we formulate the unified multicore model (UMM) to help understand the fundamental limits on cache performance on these architectures. The UMM seamlessly handles different types of multi-core processors with varying degrees of cache sharing at different levels. We demonstrate that our model can be used to study a variety of multicore architectures on a variety of applications. In particular, we use it to analyze an option pricing problem using the trinomial model and develop an algorithm for it that has near-optimal memory traffic between cache levels. We have implemented the algorithm on a system with two quad-core Intel Xeon 5310 1.6 GHz processors (8 cores total). It achieves a peak performance of 19.5 GFLOPs, which is 38% of the theoretical peak of the multicore system. We demonstrate that our algorithm outperforms compiler-optimized and auto-parallelized code by a factor of up to 7.5.
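For context on the computation being tiled, a plain (non-cache-optimized) trinomial-tree pricer for a European call can be sketched as below, using a standard Boyle-style parameterization; this is a minimal reference implementation, not the paper's memory-traffic-optimal algorithm.

```python
import math

def trinomial_european_call(S0, K, r, sigma, T, N):
    """European call priced on a recombining trinomial tree with N steps,
    by backward induction over the 2*N+1 terminal nodes. Illustrative
    baseline only; the paper's algorithm restructures this sweep for
    near-optimal traffic between cache levels."""
    dt = T / N
    u = math.exp(sigma * math.sqrt(2 * dt))      # up factor per step
    a = math.exp(r * dt / 2)
    b = math.exp(sigma * math.sqrt(dt / 2))
    pu = ((a - 1 / b) / (b - 1 / b)) ** 2        # up probability
    pd = ((b - a) / (b - 1 / b)) ** 2            # down probability
    pm = 1 - pu - pd                             # middle probability
    disc = math.exp(-r * dt)
    # terminal payoffs at nodes j = -N..N (asset price S0 * u**j)
    vals = [max(S0 * u ** j - K, 0.0) for j in range(-N, N + 1)]
    for step in range(N, 0, -1):                 # backward induction
        vals = [disc * (pd * vals[i] + pm * vals[i + 1] + pu * vals[i + 2])
                for i in range(2 * (step - 1) + 1)]
    return vals[0]
```

With S0 = K = 100, r = 5%, sigma = 20%, T = 1 and a few hundred steps, the result converges toward the Black-Scholes value of about 10.45.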

    A Lower Bound Technique for Communication in BSP

    Get PDF
    Communication is a major factor determining the performance of algorithms on current computing systems; it is therefore valuable to provide tight lower bounds on the communication complexity of computations. This paper presents a lower bound technique for the communication complexity, in the bulk-synchronous parallel (BSP) model, of a given class of DAG computations. The derived bound is expressed in terms of the switching potential of a DAG, that is, the number of permutations that the DAG can realize when viewed as a switching network. The proposed technique yields tight lower bounds for the fast Fourier transform (FFT), and for any sorting and permutation network. A stronger bound is also derived for the periodic balanced sorting network, by applying the technique to suitable subnetworks. Finally, we demonstrate that the switching potential captures communication requirements even in computational models different from BSP, such as the I/O model and the LPRAM.
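The switching potential can be computed by brute force for tiny networks, which makes the definition concrete: enumerate every setting of the 2x2 switches and count the distinct input-to-output permutations. The sketch below is an illustrative assumption about the network encoding (layers of wire pairs), not code from the paper.

```python
from itertools import product

def switching_potential(n, layers):
    """Count distinct permutations of n wires realizable by setting each
    2x2 switch to 'pass' or 'cross'. `layers` is a list of layers, each a
    list of (i, j) wire-index pairs. Exponential brute force: tiny
    networks only."""
    switches = [sw for layer in layers for sw in layer]
    perms = set()
    for setting in product((False, True), repeat=len(switches)):
        wires = list(range(n))                    # identity labeling
        for (i, j), cross in zip(switches, setting):
            if cross:
                wires[i], wires[j] = wires[j], wires[i]
        perms.add(tuple(wires))
    return len(perms)
```

A single 2x2 switch realizes 2 permutations, and a 4-input, 2-stage butterfly realizes 16 of the 24 possible permutations; the paper's bounds are then stated in terms of the logarithm of this count.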

    On Characterizing the Data Access Complexity of Programs

    Full text link
    Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data access complexity will be increasingly important. The problem of developing lower bounds for data access complexity has been modeled using the formalism of Hong & Kung's red/blue pebble game for computational directed acyclic graphs (CDAGs). However, previously developed approaches to lower bounds analysis for the red/blue pebble game are very limited in effectiveness when applied to CDAGs of real programs, whose computations are composed of multiple sub-computations with differing DAG structure. We address this problem by developing an approach for effectively composing lower bounds based on graph decomposition. We also develop a static analysis algorithm to derive the asymptotic data-access lower bounds of programs, as a function of the problem size and cache size.

    Sparse multinomial kernel discriminant analysis (sMKDA)

    No full text
    Dimensionality reduction via canonical variate analysis (CVA) is important for pattern recognition and has been extended variously to permit more flexibility, e.g. by "kernelizing" the formulation. This can lead to over-fitting, usually ameliorated by regularization. Here, a method for sparse, multinomial kernel discriminant analysis (sMKDA) is proposed, using a sparse basis to control complexity. It is based on the connection between CVA and least-squares, and uses forward selection via orthogonal least-squares to approximate a basis, generalizing a similar approach for binomial problems. Classification can be performed directly via minimum Mahalanobis distance in the canonical variates. sMKDA achieves state-of-the-art performance in terms of accuracy and sparseness on 11 benchmark datasets.
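The final classification step mentioned here, minimum Mahalanobis distance with a pooled within-class covariance, can be sketched as follows; the function names are hypothetical, and the sparse kernel basis selection that defines sMKDA itself is not reproduced.

```python
import numpy as np

def fit_classes(X, y):
    """Per-class means plus a pooled within-class covariance, as in
    CVA/LDA. Returns the means and the inverse covariance."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    centered = np.vstack([X[y == c] - means[c] for c in classes])
    cov = centered.T @ centered / (len(X) - len(classes))
    return means, np.linalg.inv(cov)

def predict(X, means, cov_inv):
    """Assign each row of X to the class whose mean is nearest in
    Mahalanobis distance under the pooled covariance."""
    labels = list(means)
    # squared Mahalanobis distance to each class mean, shape (N, classes)
    d = np.stack([np.einsum('ij,jk,ik->i', X - means[c], cov_inv, X - means[c])
                  for c in labels], axis=1)
    return np.array([labels[i] for i in d.argmin(axis=1)])
```

In sMKDA the same rule is applied in the canonical-variate space spanned by the selected sparse basis rather than on raw features.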

    Practical security limits of continuous-variable quantum key distribution

    Get PDF
    Discrete Modulation Continuous Variable Quantum Key Distribution (DM-CV-QKD) systems are very attractive for modern quantum cryptography, since they overcome the disadvantages of Gaussian modulation (GM) systems while maintaining the advantages of using CVs. Nonetheless, DM-CV-QKD is still underdeveloped, with very limited study of large constellations. This work aims to increase the knowledge of DM-CV-QKD systems with large constellations, namely irregular and regular M-symbol Amplitude Phase Shift Keying (M-APSK) constellations. To this end, a complete DM-CV-QKD system was implemented, considering collective attacks and reverse reconciliation under the realistic scenario, assuming Bob has knowledge of his detector's noise. Tight security bounds were obtained for M-APSK constellations and GM, both for the mutual information between Bob and Alice and for the Holevo bound between Bob and Eve. M-APSK constellations with binomial distribution can approximate GM's results for the secret key rate. Without consideration of finite size effects (FSEs), the regular constellation 256-APSK (reg. 32) with binomial distribution reaches 242.9 km, only 7.2 km less than GM, for a secret key rate of 10⁻⁶ photons per symbol. Considering FSEs, 256-APSK (reg. 32) achieves 96.4% of GM's maximum transmission distance (2.3 times more than 4-PSK), and 78.4% of GM's maximum compatible excess noise (10.2 times more than 4-PSK). Additionally, larger constellations allow the use of higher values of modulation variance in a practical implementation, i.e., we are no longer subject to the sub-one limit on the mean number of photons per symbol. The information reconciliation step, considering a binary symmetric channel, the sum-product algorithm, and multi-edge-type low-density parity-check matrices constructed with the progressive edge growth algorithm, allowed the correction of keys up to 18 km. The consideration of multidimensional reconciliation allows 256-APSK (reg. 32) to reconcile keys up to 55 km. Privacy amplification was carried out by applying fast Fourier transforms to the Toeplitz extractor, which was unable to extract keys beyond approximately 49 km, almost half the theoretical value, or for excess noise larger than 0.16 SNU, close to the theoretical value.
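The FFT-accelerated Toeplitz extractor used for privacy amplification can be sketched as below: the m x n Toeplitz hash over GF(2) is embedded in a circulant matrix so the product becomes a circular convolution computed with real FFTs. The function name and seed layout (first column, then first row) are assumptions for illustration; practical systems use an NTT or bit-sliced arithmetic to avoid floating-point rounding on very long keys.

```python
import numpy as np

def toeplitz_extract(key_bits, seed_bits, m):
    """Hash an n-bit raw key down to m bits with a random m x n Toeplitz
    matrix T defined by the (m + n - 1)-bit seed: seed[:m] is the first
    column, seed[m:] the rest of the first row. T @ key over GF(2) is
    computed as an FFT-based circular convolution, then reduced mod 2."""
    n = len(key_bits)
    assert len(seed_bits) == m + n - 1
    L = 1
    while L < m + n - 1:          # FFT length: next power of two
        L *= 2
    # first column of the circulant embedding: c[k] = T[k, 0] for k < m,
    # and c[L - k] = T[0, k] for k = 1..n-1 (first row, wrapped)
    col = np.zeros(L)
    col[:m] = seed_bits[:m]
    col[L - (n - 1):] = seed_bits[m:][::-1]
    x = np.zeros(L)
    x[:n] = key_bits
    conv = np.fft.irfft(np.fft.rfft(col) * np.fft.rfft(x), L)
    return np.rint(conv[:m]).astype(int) % 2   # first m entries = T @ key
```

Because L >= m + n - 1, the wrap-around of the circular convolution never contaminates the first m outputs, so the result matches the direct matrix product mod 2.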