14 research outputs found

    The I/O Complexity of Hybrid Algorithms for Square Matrix Multiplication

    Get PDF
    Asymptotically tight lower bounds are derived for the I/O complexity of a general class of hybrid algorithms computing the product of n x n square matrices combining "Strassen-like" fast matrix multiplication approach with computational complexity Theta(n^{log_2 7}), and "standard" matrix multiplication algorithms with computational complexity Omega (n^3). We present a novel and tight Omega ((n/max{sqrt M, n_0})^{log_2 7}(max{1,(n_0)/M})^3M) lower bound for the I/O complexity of a class of "uniform, non-stationary" hybrid algorithms when executed in a two-level storage hierarchy with M words of fast memory, where n_0 denotes the threshold size of sub-problems which are computed using standard algorithms with algebraic complexity Omega (n^3). The lower bound is actually derived for the more general class of "non-uniform, non-stationary" hybrid algorithms which allow recursive calls to have a different structure, even when they refer to the multiplication of matrices of the same size and in the same recursive level, although the quantitative expressions become more involved. Our results are the first I/O lower bounds for these classes of hybrid algorithms. All presented lower bounds apply even if the recomputation of partial results is allowed and are asymptotically tight. The proof technique combines the analysis of the Grigoriev\u27s flow of the matrix multiplication function, combinatorial properties of the encoding functions used by fast Strassen-like algorithms, and an application of the Loomis-Whitney geometric theorem for the analysis of standard matrix multiplication algorithms. Extensions of the lower bounds for a parallel model with P processors are also discussed

    Lyra2: Efficient Password Hashing with High Security against Time-Memory Trade-Offs

    Get PDF
    We present Lyra2, a password hashing scheme (PHS) based on cryptographic sponges. Lyra2 was designed to be strictly sequential (i.e., not easily parallelizable), providing strong security even against attackers that uses multiple processing cores (e.g., custom hardware or a powerful GPU). At the same time, it is very simple to implement in software and allows legitimate users to fine tune its memory and processing costs according to the desired level of security against brute force password-guessing. Lyra2 is an improvement of the recently proposed Lyra algorithm, providing an even higher security level against different attack venues and overcoming some limitations of this and other existing schemes

    Strategies for including cloud-computing into an engineering modeling workflow

    Get PDF
    With the advent of cloud computing, high-end computing, networking, and storage resources are available on-demand at a relatively low price point. Internet applications in the consumer and increasingly in the enterprise space are making use of these resources to upgrade existing applications and build new ones. This is made possible by building decentralized applications that can be integrated with one another through web-enabled application programming interfaces (APIs). However, in the fields of engineering and computational science, cloud computing resources have been utilized primarily to augment existing high-performance computing hardware, but engineering model integrations still occur by the use of software libraries. In this research, a novel approach is proposed where engineering models are constructed as independent services that publish web-enabled APIs. To enable this, the engineering models are built as stateless microservices that solve a single computational problem. Composite services are then built utilizing these independent component models, much like in the consumer application space. Interactions between component models is orchestrated by a federation management system. This proposed approach is then demonstrated by disaggregating an existing monolithic model for a cookstove into a set of component models. The component models are then reintegrated and compared with the original model for computational accuracy and run-time. Additionally, a novel engineering workflow is proposed that reuses computational data by constructing reduced-order models (ROMs). This framework is evaluated empirically for a number of producers and consumers of engineering models based on computation and data synchronization aspects. The framework is also evaluated by simulating an engineering design workflow with multiple producers and consumers at various stages during the design process. Finally, concepts from the federated system of models and ROMs are combined to propose the concept of a hybrid model (information artefact). The hybrid model is a web-enabled microservice that encapsulates information from multiple engineering models at varying fidelities, and responds to queries based on the best available information. Rules for the construction of hybrid models have been proposed and evaluated in the context of engineering workflows

    Register Optimizations for Stencils on GPUs

    Get PDF
    International audienceThe recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes

    Provably Secure Countermeasures against Side-channel Attacks

    Get PDF
    Side-channel attacks exploit the fact that the implementations of cryptographic algorithms leak information about the secret key. In power analysis attacks, the observable leakage is the power consumption of the device, which is dependent on the processed data and the performed operations.\ignore{While Simple Power Analysis (SPA) attacks try to recover the secret value by directly interpreting the power measurements with the corresponding operations, Differential Power Analysis (DPA) attacks are more sophisticated and aim to recover the secret value by applying statistical techniques on multiple measurements from the same operation.} Masking is a widely used countermeasure to thwart the powerful Differential Power Analysis (DPA) attacks. It uses random variables called masks to reduce the correlation between the secret key and the obtained leakage. The advantage with masking countermeasure is that one can formally prove its security under reasonable assumptions on the device leakage model. This thesis proposes several new masking schemes along with the analysis and improvement of few existing masking schemes. The first part of the thesis addresses the problem of converting between Boolean and arithmetic masking. To protect a cryptographic algorithm which contains a mixture of Boolean and arithmetic operations, one uses both Boolean and arithmetic masking. Consequently, these masks need to be converted between the two forms based on the sequence of operations. The existing conversion schemes are secure against first-order DPA attacks only. This thesis proposes first solution to switch between Boolean and arithmetic masking that is secure against attacks of any order. Secondly, new solutions are proposed for first-order secure conversion with logarithmic complexity (O(logk){\cal O}(\log k) for kk-bit operands) compared to the existing solutions with linear complexity (O(k){\cal O}(k)). It is shown that this new technique also improves the complexity of the higher-order conversion algorithms from O(n2k){\cal O}(n^2 k) to O(n2logk){\cal O}(n^2 \log k) secure against attacks of order dd, where n=2d+1n = 2d+1. Thirdly, for the special case of second-order masking, the running times of the algorithms are further improved by employing lookup tables. The second part of the thesis analyzes the security of two existing Boolean masking schemes. Firstly, it is shown that a higher-order masking scheme claimed to be secure against attacks of order dd can be broken with an attack of order d/2+1d/2+1. An improved scheme is proposed to fix the flaw. Secondly, a new issue concerning the problem of converting the security proofs from one leakage model to another is examined. It is shown that a second-order masking scheme secure in the Hamming weight model can be broken with a first-order attack on a device leaking in the Hamming distance model. This result underlines the importance of re-evaluating the security proofs for devices leaking in different models

    Scalable High-Quality Graph and Hypergraph Partitioning

    Get PDF
    The balanced hypergraph partitioning problem (HGP) asks for a partition of the node set of a hypergraph into kk blocks of roughly equal size, such that an objective function defined on the hyperedges is minimized. In this work, we optimize the connectivity metric, which is the most prominent objective function for HGP. The hypergraph partitioning problem is NP-hard and there exists no constant factor approximation. Thus, heuristic algorithms are used in practice with the multilevel scheme as the most successful approach to solve the problem: First, the input hypergraph is coarsened to obtain a hierarchy of successively smaller and structurally similar approximations. The smallest hypergraph is then initially partitioned into kk blocks, and subsequently, the contractions are reverted level-by-level, and, on each level, local search algorithms are used to improve the partition (refinement phase). In recent years, several new techniques were developed for sequential multilevel partitioning that substantially improved solution quality at the cost of an increased running time. These developments divide the landscape of existing partitioning algorithms into systems that either aim for speed or high solution quality with the former often being more than an order of magnitude faster than the latter. Due to the high running times of the best sequential algorithms, it is currently not feasible to partition the largest real-world hypergraphs with the highest possible quality. Thus, it becomes increasingly important to parallelize the techniques used in these algorithms. However, existing state-of-the-art parallel partitioners currently do not achieve the same solution quality as their sequential counterparts because they use comparatively weak components that are easier to parallelize. Moreover, there has been a recent trend toward simpler methods for partitioning large hypergraphs that even omit the multilevel scheme. In contrast to this development, we present two shared-memory multilevel hypergraph partitioners with parallel implementations of techniques used by the highest-quality sequential systems. Our first multilevel algorithm uses a parallel clustering-based coarsening scheme, which uses substantially fewer locking mechanisms than previous approaches. The contraction decisions are guided by the community structure of the input hypergraph obtained via a parallel community detection algorithm. For initial partitioning, we implement parallel multilevel recursive bipartitioning with a novel work-stealing approach and a portfolio of initial bipartitioning techniques to compute an initial solution. In the refinement phase, we use three different parallel improvement algorithms: label propagation refinement, a highly-localized direct kk-way FM algorithm, and a novel parallelization of flow-based refinement. These algorithms build on our highly-engineered partition data structure, for which we propose several novel techniques to compute accurate gain values of node moves in the parallel setting. Our second multilevel algorithm parallelizes the nn-level partitioning scheme used in the highest-quality sequential partitioner KaHyPar. Here, only a single node is contracted on each level, leading to a hierarchy with approximately nn levels where nn is the number of nodes. Correspondingly, in each refinement step, only a single node is uncontracted, allowing a highly-localized search for improvements. We show that this approach, which seems inherently sequential, can be parallelized efficiently without compromises in solution quality. To this end, we design a forest-based representation of contractions from which we derive a feasible parallel schedule of the contraction operations that we apply on a novel dynamic hypergraph data structure on-the-fly. In the uncoarsening phase, we decompose the contraction forest into batches, each containing a fixed number of nodes. We then uncontract each batch in parallel and use highly-localized versions of our refinement algorithms to improve the partition around the uncontracted nodes. We further show that existing sequential partitioning algorithms considerably struggle to find balanced partitions for weighted real-world hypergraphs. To this end, we present a technique that enables partitioners based on recursive bipartitioning to reliably compute balanced solutions. The idea is to preassign a small portion of the heaviest nodes to one of the two blocks of each bipartition and optimize the objective function on the remaining nodes. We integrated the approach into the sequential hypergraph partitioner KaHyPar and show that our new approach can compute balanced solutions for all tested instances without negatively affecting the solution quality and running time of KaHyPar. In our experimental evaluation, we compare our new shared-memory (hyper)graph partitioner Mt-KaHyPar to 2525 different graph and hypergraph partitioners on over 800800 (hyper)graphs with up to two billion edges/pins. The results indicate that already our fastest configuration outperforms almost all existing hypergraph partitioners with regards to both solution quality and running time. Our highest-quality configuration (nn-level with flow-based refinement) achieves the same solution quality as the currently best sequential partitioner KaHyPar, while being almost an order of magnitude faster with ten threads. In addition, we optimize our data structures for graph partitioning, which improves the running times of both multilevel partitioners by almost a factor of two for graphs. As a result, Mt-KaHyPar also outperforms most of the existing graph partitioning algorithms. While the shared-memory graph partitioner KaMinPar is still faster than Mt-KaHyPar, its produced solutions are worse by 10%10\% in the median. The best sequential graph partitioner KaFFPa-StrongS computes slightly better partitions than Mt-KaHyPar (median improvement is 1%1\%), but is more than an order of magnitude slower on average

    Simplifying the Analysis of C++ Programs

    Get PDF
    Based on our experience of working with different C++ front ends, this thesis identifies numerous problems that complicate the analysis of C++ programs along the entire spectrum of analysis applications. We utilize library, language, and tool extensions to address these problems and offer solutions to many of them. In particular, we present efficient, expressive and non-intrusive means of dealing with abstract syntax trees of a program, which together render the visitor design pattern obsolete. We further extend C++ with open multi-methods to deal with the broader expression problem. Finally, we offer two techniques, one based on refining the type system of a language and the other on abstract interpretation, both of which allow developers to statically ensure or verify various run-time properties of their programs without having to deal with the full language semantics or even the abstract syntax tree of a program. Together, the solutions presented in this thesis make ensuring properties of interest about C++ programs available to average language users

    A stochastic Reputation System Architecture to support the Partner Selection in Virtual Organisations

    Get PDF
    In recent business environments, collaborations among organisations raise an increased demand for swift establishment. Such collaborations are increasingly formed without prior experience of the other partner\u27s previous performance. The STochastic REputation system (STORE) is designed to provide swift, automated decision support for selecting partner organisations. STORE is based on a stochastic trust model and evaluated by means of multi agent simulations in Virtual Organisation scenarios

    Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference

    Get PDF
    corecore