248 research outputs found

    Optimization as a design strategy. Considerations based on building simulation-assisted experiments about problem decomposition

    Full text link
    In this article the most fundamental decomposition-based optimization method - block coordinate search, based on the sequential decomposition of problems in subproblems - and building performance simulation programs are used to reason about a building design process at micro-urban scale and strategies are defined to make the search more efficient. Cyclic overlapping block coordinate search is here considered in its double nature of optimization method and surrogate model (and metaphore) of a sequential design process. Heuristic indicators apt to support the design of search structures suited to that method are developed from building-simulation-assisted computational experiments, aimed to choose the form and position of a small building in a plot. Those indicators link the sharing of structure between subspaces ("commonality") to recursive recombination, measured as freshness of the search wake and novelty of the search moves. The aim of these indicators is to measure the relative effectiveness of decomposition-based design moves and create efficient block searches. Implications of a possible use of these indicators in genetic algorithms are also highlighted.Comment: 48 pages. 12 figures, 3 table

    Performance Analysis Of Pde Based Parallel Algorithms On Different Computer Architectures

    Get PDF
    Tez (Yüksek Lisans) -- İstanbul Teknik Üniversitesi, Bilişim Enstitüsü, 2009Thesis (M.Sc.) -- İstanbul Technical University, Institute of Informatics, 2009Son yıllarda dağıtık algoritmaların farklı platformlarda kullanılabilmesi platform ve uygulama bağımsız performans analizi uygulamaları ihtiyacını arttırmıştır. Farklı donanımları ve haberleşme metodlarını destekleyen uygulamalar kullanıcılara donanım ve yazılımdan bağımsız ortak bir zemin hazırladıkları için kolaylık sağlamaktadır. Kısmi fark denklemleri hesaplamalı bilim ve mühendisliğin bir çok alanında kullanılmaktadır (ısı, dalga yayılımı gibi). Bu denklemlerin sayısal çözümü yinelemeli yöntemler kullanılarak yapılmaktadır. Problemin boyutu ve hata değerine göre çözüme ulaşmak için gereken yineleme sayısı ve buna bağlı olarak süresi değişmektedir. Kısmi fark denklemelerinin tek işlemcili bilgisayarlardaki çözümü uzun sürdüğü ve yüksek boyutlarda hafızaları yetersiz kaldığı için paralelleştirilerek birden fazla bilgisayarın işlemcisi ve hafızası kullanılarak çözülmektedir. Tezimde eliptik kısmi fark denklemlerini Gauss-Seidel ve Successive Over-Relaxation (SOR) metodlarını kullanarak çözen paralel algoritmalar kullanılmıştır. Performans analizi ve eniyilemesi kabaca üç adımdan oluşmaktadır; ölçüm, sonuçların analizi, darboğazların tespit edilip yazılımda iyileştirme yapılması. Ölçüm aşamasında programın koşarken ürettiği performans bilgisi toplanır, toplanan bu veriler görselleştirme araçları ile anlaşılır hale getirilerek yorumlanır. Yorumlama aşamasında tespit edilen dar boğazlar belirlenir ve giderilme yöntemleri araştırılır. Gerekli iyileştirmeler yapılarak program yeniden analiz edilir. Bu aşamaların her birinde farklı uygulamalar kullanılabilir fakat tez çalışmamda uygulamaları tek çatı altında toplayan TAU kullanılmıştır. TAU (Tuning and Analysis Utilities) farklı donanımları ve işletim sistemlerini destekleyerek farklı paralelleştirme metodlarını analiz edebilmektedir. Açık kaynak kodlu olan TAU diğer açık kaynak kodlu uygulamalar ile uyumlu olup birçok seviyede bütünleşme sağlanmıştır. Bu tez çalışmasında, iki farklı platformda aynı uygulamanın performans analizi yapılarak platform farkının getirdiği farklılıklar incelenmektedir. Performans analizinde bir algoritmanın eniyilemesini yapmak için genel bir kural olmadığından her algoritma her platformda incelenerek gerekli değişiklikler yapılmalıdır. Bu bağlamda kullandığım PDE algoritmasının her iki sistemdeki analizi sonucu elde edilen bilgiler yorumlanmıştır.In last two decades, use of parallel algorithms on different architectures increased the need of architecture and application independent performance analysis tools. Tools that support different communication methods and hardware prepare a common ground regardless of equipments provided. Partial differential equations (PDE) are used in several applications (such as propagation of heat, wave) in computational science and engineering. These equations can be solved using iterative numerical methods. Problem size and error tolerance effects iteration count and computation time to solve equation. PDE computations take long time using single processor computers with sequential algorithms, and if data size gets bigger single processors memory may be insufficient. Thus, PDE?s are solved using parallel algorithms on multiple processors. In this thesis, elliptic partial differential equation is solved using Gauss-Seidel and Successive Over-Relaxation (SOR) methods parallel algorithms. Performance analysis and optimization basically has three steps; evaluation, analysis of gathered information, defining and optimizing bottlenecks. In evaluation, performance information is gathered while program runs, then observations are made on gathered information by using visualization tools. Bottlenecks are defined and optimization techniques are researched. Necessary improvements are made to analyze the program again. Different applications in each of these stages can be used but in this thesis TAU is used, which collects these applications under one roof. TAU (Tuning and Analysis Utilities) supports many hardware, operating systems and parallelization methods. TAU is an open source application and collaborates with other open source applications at different levels. In this thesis, differences based on performance analysis of an algorithm in different two architectures are investigated. In performance analysis and optimization there is no golden rule to speed up algorithm. Each algorithm must be analyzed on that specific architecture. In this context, the performance analysis of a PDE algorithm on two architectures has been interpreted.Yüksek LisansM.Sc

    Effective data parallel computing on multicore processors

    Get PDF
    The rise of chip multiprocessing or the integration of multiple general purpose processing cores on a single chip (multicores), has impacted all computing platforms including high performance, servers, desktops, mobile, and embedded processors. Programmers can no longer expect continued increases in software performance without developing parallel, memory hierarchy friendly software that can effectively exploit the chip level multiprocessing paradigm of multicores. The goal of this dissertation is to demonstrate a design process for data parallel problems that starts with a sequential algorithm and ends with a high performance implementation on a multicore platform. Our design process combines theoretical algorithm analysis with practical optimization techniques. Our target multicores are quad-core processors from Intel and the eight-SPE IBM Cell B.E. Target applications include Matrix Multiplications (MM), Finite Difference Time Domain (FDTD), LU Decomposition (LUD), and Power Flow Solver based on Gauss-Seidel (PFS-GS) algorithms. These applications are popular computation methods in science and engineering problems and are characterized by unit-stride (MM, LUD, and PFS-GS) or 2-point stencil (FDTD) memory access pattern. The main contributions of this dissertation include a cache- and space-efficient algorithm model, integrated data pre-fetching and caching strategies, and in-core optimization techniques. Our multicore efficient implementations of the above described applications outperform nai¨ve parallel implementations by at least 2x and scales well with problem size and with the number of processing cores

    Doctor of Philosophy in Computer Science

    Get PDF
    dissertationStencil computations are operations on structured grids. They are frequently found in partial differential equation solvers, making their performance critical to a range of scientific applications. On modern architectures where data movement costs dominate computation, optimizing stencil computations is a challenging task. Typically, domain scientists must reduce and orchestrate data movement to tackle the memory bandwidth and latency bottlenecks. Furthermore, optimized code must map efficiently to ever increasing parallelism on a chip. This dissertation studies several stencils with varying arithmetic intensities, thus requiring contrasting optimization strategies. Stencils traditionally have low arithmetic intensity, making their performance limited by memory bandwidth. Contemporary higher-order stencils are designed to require smaller grids, hence less memory, but are bound by increased floating-point operations. This dissertation develops communication-avoiding optimizations to reduce data movement in memory-bound stencils. For higher-order stencils, a novel transformation, partial sums, is designed to reduce the number of floating-point operations and improve register reuse. These optimizations are implemented in a compiler framework, which is further extended to generate parallel code targeting multicores and graphics processor units (GPUs). The augmented compiler framework is then combined with autotuning to productively address stencil optimization challenges. Autotuning explores a search space of possible implementations of a computation to find the optimal code for an execution context. In this dissertation, autotuning is used to compose sequences of optimizations to drive the augmented compiler framework. This compiler-directed autotuning approach is used to optimize stencils in the context of a linear solver, Geometric Multigrid (GMG). GMG uses sequences of stencil computations, and presents greater optimization challenges than isolated stencils, as interactions between stencils must also be considered. The efficacy of our approach is demonstrated by comparing the performance of generated code against manually tuned code, over commercial compiler-generated code, and against analytic performance bounds. Generated code outperforms manually optimized codes on multicores and GPUs. Against Intel's compiler on multicores, generated code achieves up to 4x speedup for stencils, and 3x for the solver. On GPUs, generated code achieves 80% of an analytically computed performance bound

    Designing a Real-Time Grid Simulator for use in Market and A.G.C. Studies

    Get PDF
    Market based generation dispatch is becoming the industry norm in advanced electrical jurisdictions. Due to the continuous evolution of markets and their potential impacts on system operation, studies are performed from economic and social perspectives in order to gauge the effects of any changes before implementation into live systems. In addition, it is essential to verify the effect of changes in market design on power systems from a technical perspective. The main objective of this thesis is to develop a real-time power system simulator for use in the investigation of market designs and automatic generation control schemes. The scope of this thesis is the mathematical algorithms used in the simulator, hardware and software implementation, and validation of the implemented simulator. The simulator is based on a modified version of the power flow calculation using an innovative combination of a standard numerical technique implemented on a readily available computing hardware platform. The result is a significant decrease in computation time. The power flow is performed repeatedly, with frequency being calculated between time steps to provide system dynamics. Frequency is calculated using a modified version of the generic generator swing equation. Generator and load models and their respective control systems are provided for the purposes of simulator validation and testing, although do not fall within the scope of the simulator itself.1 yea

    The Four-C Framework for High Capacity Ultra-Low Latency in 5G Networks: A Review

    Get PDF
    Network latency will be a critical performance metric for the Fifth Generation (5G) networks expected to be fully rolled out in 2020 through the IMT-2020 project. The multi-user multiple-input multiple-output (MU-MIMO) technology is a key enabler for the 5G massive connectivity criterion, especially from the massive densification perspective. Naturally, it appears that 5G MU-MIMO will face a daunting task to achieve an end-to-end 1 ms ultra-low latency budget if traditional network set-ups criteria are strictly adhered to. Moreover, 5G latency will have added dimensions of scalability and flexibility compared to prior existing deployed technologies. The scalability dimension caters for meeting rapid demand as new applications evolve. While flexibility complements the scalability dimension by investigating novel non-stacked protocol architecture. The goal of this review paper is to deploy ultra-low latency reduction framework for 5G communications considering flexibility and scalability. The Four (4) C framework consisting of cost, complexity, cross-layer and computing is hereby analyzed and discussed. The Four (4) C framework discusses several emerging new technologies of software defined network (SDN), network function virtualization (NFV) and fog networking. This review paper will contribute significantly towards the future implementation of flexible and high capacity ultra-low latency 5G communications

    Research and Education in Computational Science and Engineering

    Get PDF
    Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers of all persuasions with algorithmic inventions and software systems that transcend disciplines and scales. Carried on a wave of digital technology, CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society; and the CSE community is at the core of this transformation. However, a combination of disruptive developments---including the architectural complexity of extreme-scale computing, the data revolution that engulfs the planet, and the specialization required to follow the applications to new frontiers---is redefining the scope and reach of the CSE endeavor. This report describes the rapid expansion of CSE and the challenges to sustaining its bold advances. The report also presents strategies and directions for CSE research and education for the next decade.Comment: Major revision, to appear in SIAM Revie

    Multiphysics simulations: challenges and opportunities.

    Full text link

    Exploiting task-based programming models for resilience

    Get PDF
    Hardware errors become more common as silicon technologies shrink and become more vulnerable, especially in memory cells, which are the most exposed to errors. Permanent and intermittent faults are caused by manufacturing variability and circuits ageing. While these can be mitigated once they are identified, their continuous rate of appearance throughout the lifetime of memory devices will always cause unexpected errors. In addition, transient faults are caused by effects such as radiation or small voltage/frequency margins, and there is no efficient way to shield against these events. Other constraints related to the diminishing sizes of transistors, such as power consumption and memory latency have caused the microprocessor industry to turn to increasingly complex processor architectures. To solve the difficulties arising from programming such architectures, programming models have emerged that rely on runtime systems. These systems form a new intermediate layer on the hardware-software abstraction stack, that performs tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime systems dispose of a lot of information, both from the hardware and the applications, and offer thus many possibilities for optimisations. This thesis proposes solutions to the increasing fault rates in memory, across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error correcting codes, through OS reliability strategies. These solutions rely for their efficiency on the opportunities presented by runtime systems. The first contribution of this thesis is an algorithmic-based resilience technique, allowing to tolerate detected errors in memory. This technique allows to recover data that is lost by performing computations that rely on simple redundancy relations identified in the program. The recovery is demonstrated for a family of iterative solvers, the Krylov subspace methods, and evaluated for the conjugate gradient solver. The runtime can transparently overlap the recovery with the computations of the algorithm, which allows to mask the already low overheads of this technique. The second part of this thesis proposes a metric to characterise the impact of faults in memory, which outperforms state-of-the-art metrics in precision and assurances on the error rate. This metric reveals a key insight into data that is not relevant to the program, and we propose an OS-level strategy to ignore errors in such data, by delaying the reporting of detected errors. This allows to reduce failure rates of running programs, by ignoring errors that have no impact. The architectural-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme, that can increase protection of memory regions where the impact of errors is highest. A runtime methodology is presented to estimate the fault rate at runtime using our metric, through performance monitoring tools of current commodity processors. Guiding the dynamic ECC scheme online using the methodology's vulnerability estimates allows to decrease error rates of programs at a fraction of the redundancy cost required for a uniformly stronger ECC. This provides a useful and wide range of trade-offs between redundancy and error rates. The work presented in this thesis demonstrates that runtime systems allow to make the most of redundancy stored in memory, to help tackle increasing error rates in DRAM. This exploited redundancy can be an inherent part of algorithms that allows to tolerate higher fault rates, or in the form of dead data stored in memory. Redundancy can also be added to a program, in the form of ECC. In all cases, the runtime allows to decrease failure rates efficiently, by diminishing recovery costs, identifying redundant data, or targeting critical data. It is thus a very valuable tool for the future computing systems, as it can perform optimisations across different layers of abstractions.Los errores en memoria se vuelven más comunes a medida que las tecnologías de silicio reducen su tamaño. La variabilidad de fabricación y el envejecimiento de los circuitos causan fallos permanentes e intermitentes. Aunque se pueden mitigar una vez identificados, su continua tasa de aparición siempre causa errores inesperados. Además, la memoria también sufre de fallos transitorios contra los cuales no se puede proteger eficientemente. Estos fallos están causados por efectos como la radiación o los reducidos márgenes de voltaje y frecuencia. Otras restricciones coetáneas, como el consumo de energía y la latencia de la memoria, obligaron a las arquitecturas de computadores a volverse cada vez más complejas. Para programar tales procesadores, se desarrollaron modelos de programación basados en entornos de ejecución. Estos sistemas forman una nueva abstracción entre hardware y software, realizando tareas como la distribución del trabajo entre recursos informáticos: núcleos de procesadores, aceleradores, etc. Estos entornos de ejecución disponen de mucha información tanto sobre el hardware como sobre las aplicaciones, y ofrecen así muchas posibilidades de optimización. Esta tesis propone soluciones a los fallos en memoria entre múltiples disciplinas de resiliencia, desde la tolerancia a fallos basada en algoritmos, hasta los códigos de corrección de errores en hardware, incluyendo estrategias de resiliencia del sistema operativo. La eficiencia de estas soluciones depende de las oportunidades que presentan los entornos de ejecución. La primera contribución de esta tesis es una técnica a nivel algorítmico que permite corregir fallos encontrados mientras el programa su ejecuta. Para corregir fallos se han identificado redundancias simples en los datos del programa para toda una clase de algoritmos, los métodos del subespacio de Krylov (gradiente conjugado, GMRES, etc). La estrategia de recuperación de datos desarrollada permite corregir errores sin tener que reinicializar el algoritmo, y aprovecha el modelo de programación para superponer las computaciones del algoritmo y de la recuperación de datos. La segunda parte de esta tesis propone una métrica para caracterizar el impacto de los fallos en la memoria. Esta métrica supera en precisión a las métricas de vanguardia y permite identificar datos que son menos relevantes para el programa. Se propone una estrategia a nivel del sistema operativo retrasando la notificación de los errores detectados, que permite ignorar fallos en estos datos y reducir la tasa de fracaso del programa. Por último, la contribución a nivel arquitectónico de esta tesis es un esquema de Código de Corrección de Errores (ECC por sus siglas en inglés) adaptable dinámicamente. Este esquema puede aumentar la protección de las regiones de memoria donde el impacto de los errores es mayor. Se presenta una metodología para estimar el riesgo de fallo en tiempo de ejecución utilizando nuestra métrica, a través de las herramientas de monitorización del rendimiento disponibles en los procesadores actuales. El esquema de ECC guiado dinámicamente con estas estimaciones de vulnerabilidad permite disminuir la tasa de fracaso de los programas a una fracción del coste de redundancia requerido para un ECC uniformemente más fuerte. El trabajo presentado en esta tesis demuestra que los entornos de ejecución permiten aprovechar al máximo la redundancia contenida en la memoria, para contener el aumento de los errores en ella. Esta redundancia explotada puede ser una parte inherente de los algoritmos que permite tolerar más fallos, en forma de datos inutilizados almacenados en la memoria, o agregada a la memoria de un programa en forma de ECC. En todos los casos, el entorno de ejecución permite disminuir los efectos de los fallos de manera eficiente, disminuyendo los costes de recuperación, identificando datos redundantes, o focalizando esfuerzos de protección en los datos críticos.Postprint (published version

    Algebraic Multigrid for Meshfree Methods

    Get PDF
    This thesis deals with the development of a new Algebraic Multigrid method (AMG) for the solution of linear systems arising from Generalized Finite Difference Methods (GFDM). In particular, we consider the Finite Pointset Method, which is based on GFDM. Being a meshfree method, FPM does not rely on a mesh and can therefore deal with moving geometries and free surfaces is a natural way and it does not require the generation of a mesh before the actual simulation. In industrial use cases the size of the linear systems often becomes large, which means that classical linear solvers often become the bottleneck in terms of simulation run time, because their convergence rate depends on the discretization size. Multigrid methods have proven to be very efficient linear solvers in the domain of mesh-based methods. Their convergence is independent of the discretization size, yielding a run time that only scales linearly with the problem size. AMG methods are a natural candidate for the solution of the linear systems arising in the FPM, as this thesis will show. They need to be tuned to the specific characteristics of GFDM, though. The AMG methods that are developed in this thesis achieve a speed-up of up to 33x compared to the classical linear solvers and therefore allow much more accurate simulations in the future.Diese Dissertation beschäftigt sich mit der Entwicklung einer neuen Algebraischen Mehrgittermethode für die Lösung linearer Gleichungssysteme aus Generalisierten Finite Differenzen Methoden. Im Speziellen betrachten wir die sogenannte Finite Pointset Method, eine gitterfreie Lagrange Methode, welche auf Generalisierten Finite Differenzen Methoden basiert. Die Finite Pointset Method wurde insbesondere für Simulationen von Vorgängen mit freien Oberflächen und bewegten Geometrien entwickelt, bei denen der gitterfreie Charakter der Methode besonders große Vorteile liefert: An den freien Oberflächen und nahe der Geometrie muss zu keinem Zeitpunkt – auch nicht zu Beginn der Simulation – ein Gitter erstellt oder angepasst werden. Dies ist ein großer Vorteil gegenüber klassischen gitterbasierten Methoden. Wie in gitterbasierten Methoden entstehen auch in der Finite Pointset Method und anderen Generalisierten Finite Differenzen Methoden große, dünn besetze lineare Gleichungssysteme. Das Lösen dieser Gleichungssysteme wird bei fein aufgelösten Simulationen, wie sie in der Industrie oft nötig sind, schnell zum zeitlichen Flaschenhals der Gesamtsimulation. Ohne eine geeignete Methode zur Lösung dieser Gleichungssysteme dauern Simulationen oft sehr lange oder sind praktisch nicht durchführbar. Auch kann es vorkommen, dass klassische Lösungsverfahren divergieren und die Simulation damit unmöglich wird. Im Kontext von gitterbasierten Methoden sind Mehrgittermethoden ein etabliertes Werkzeug, um die entstehenden linearen Gleichungssysteme effizient und robust zu lösen. Besonders hervorzuheben ist dabei die lineare Skalierbarkeit dieser Methoden in der Größe der Matrix. Damit eignen sie sich besonders für fein aufgelöste Simulationen. Algebraische Mehrgittermethoden sind natürliche Kandidaten für die Lösung der Gleichungssysteme aus Generalisierten Finite Differenzen Methoden, wie diese Dissertation zeigen wird. Außerdem entwickeln wir eine neue Algebraische Mehrgittermethode, die auf den Einsatz in der Finite Pointset Method zugeschnitten ist und die Besonderheiten dieser Methode beachtet. Dazu zählen die Eigenschaften der einzelnen Matrizen, die wir ebenfalls analysieren werden, und auch die Veränderung der Matrizen über mehrere Zeitschritte hinweg, die im Vergleich mit gitterbasierten Verfahren eine größere Schwierigkeit darstellt. Wir evaluieren unsere neue Methode anhand von akademischen und realen Beispielen, sowohl mit nur einem Prozess als auch mit mehreren (MPI-)Prozessen. Die hier neu entwickelte Algebraische Mehrgittermethode ist um ein Vielfaches schneller als klassische Verfahren zur Lösung linearer Gleichungssysteme und erlaubt damit neue, genauere Simulationen mit gitterfreien Methoden
    corecore