    Numerical Solution of the Expanding Stellar Atmosphere Problem

    In this paper we discuss numerical methods and algorithms for the solution of NLTE stellar atmosphere problems involving expanding atmospheres, e.g., those found in novae, supernovae, and stellar winds. We show how a scheme of nested iterations can be used to reduce the high dimension of the problem to a number of problems with smaller dimensions. As examples of these sub-problems, we discuss the numerical solution of the radiative transfer equation for relativistically expanding media with spherical symmetry, the solution of the multi-level non-LTE statistical equilibrium problem for extremely large model atoms, and our temperature correction procedure. Although modern iteration schemes are very efficient, parallel algorithms are essential in making large-scale calculations feasible; we therefore discuss some parallelization schemes that we have developed.
    Comment: JCAM, in press. 28 pages; also available at ftp://calvin.physast.uga.edu:/pub/preprints/CompAstro.ps.g
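    The nested-iteration idea, replacing one large coupled solve by repeated solves of smaller sub-problems inside an outer loop, can be pictured with a toy sketch. This is our illustration, not the authors' code: the two scalar fixed-point equations below are hypothetical stand-ins for the radiative transfer and statistical equilibrium sub-problems.

```python
# Toy nested-iteration scheme: split one coupled problem into two smaller
# fixed-point sub-problems and alternate them until global convergence.

def solve_sub(f, x0, tol=1e-12, max_iter=200):
    """Inner iteration: fixed-point solve of x = f(x)."""
    x = x0
    for _ in range(max_iter):
        x_new = f(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def nested_iteration(tol=1e-10, max_outer=100):
    """Outer loop: alternate the sub-problem solves until they agree."""
    u, v = 0.0, 0.0  # stand-ins for, e.g., radiation field and populations
    for _ in range(max_outer):
        u_new = solve_sub(lambda x: 0.5 * (x + v + 1.0), u)  # sub-problem 1
        v_new = solve_sub(lambda x: 0.25 * (x + u_new), v)   # sub-problem 2
        if abs(u_new - u) + abs(v_new - v) < tol:
            break
        u, v = u_new, v_new
    return u, v

print(nested_iteration())  # converges to (1.5, 0.5)
```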

    Cache Conscious Data Layouting for In-Memory Databases

    Many applications with manually implemented data management exhibit a data storage pattern in which semantically related data items are stored closer in memory than unrelated data items. The strong semantic relationship between these data items commonly induces contemporaneous accesses to them. This is called the principle of data locality; hardware vendors have recognized it and commonly exploit it to improve hardware performance. General-Purpose Database Management Systems (DBMSs), whose main goal is to simplify optimal data storage and processing, generally fall short of this goal because the usage pattern of the stored data cannot be anticipated when the system is designed. The current interest in column-oriented databases indicates that one strategy does not fit all applications. A DBMS that automatically adapts its storage strategy to the workload of the database promises a significant performance increase by maximizing the benefit of hardware optimizations that are based on the principle of data locality. This thesis gives an overview of optimizations that are based on the principle of data locality and the effect they have on the data access performance of applications. Based on the findings, a model is introduced that allows estimating the cost of data accesses based on the arrangement of the data in main memory. This model is evaluated through a series of experiments and incorporated into an automatic layouting component for a DBMS. This layouting component allows the calculation of an analytically optimal storage layout. The performance benefits brought by this component are evaluated in an application benchmark.
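    As a concrete illustration of why the arrangement matters, the sketch below (our own simplification, not the thesis's model) estimates access cost as the number of cache lines a full scan of one field touches under a row-oriented versus a column-oriented layout; the 64-byte line size and the field widths are assumptions.

```python
import math

CACHE_LINE = 64  # bytes; a typical cache-line size, assumed here

def lines_touched_row_layout(n_rows, row_bytes, field_offset, field_bytes):
    """Cache lines touched when scanning one field stored inside full rows."""
    touched, prev_last = 0, -1
    for r in range(n_rows):
        start = r * row_bytes + field_offset
        first = start // CACHE_LINE
        last = (start + field_bytes - 1) // CACHE_LINE
        touched += last - first + 1
        if first == prev_last:
            touched -= 1  # line already counted for the previous row
        prev_last = last
    return touched

def lines_touched_column_layout(n_rows, field_bytes):
    """Cache lines touched when the same field is stored contiguously."""
    return math.ceil(n_rows * field_bytes / CACHE_LINE)

# Scanning a 4-byte field of 1,000,000 rows that are 100 bytes wide:
print(lines_touched_row_layout(1_000_000, 100, 0, 4))   # 1,000,000 lines
print(lines_touched_column_layout(1_000_000, 4))        # 62,500 lines
```

    The 16x gap in cache lines touched is exactly the kind of effect a locality-aware layouting component can exploit.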

    10381 Summary and Abstracts Collection -- Robust Query Processing

    Dagstuhl seminar 10381 on robust query processing (held 19.09.10 - 24.09.10) brought together a diverse set of researchers and practitioners with a broad range of expertise for the purpose of fostering discussion and collaboration regarding causes, opportunities, and solutions for achieving robust query processing. The seminar strove to build a unified view across the loosely-coupled system components responsible for the various stages of database query processing. Participants were chosen for their experience with database query processing and, where possible, their prior work in academic research or in product development towards robustness in database query processing. In order to pave the way to motivate, measure, and protect future advances in robust query processing, seminar 10381 focused on developing tests for measuring the robustness of query processing. In these proceedings, we first review the seminar topics, goals, and results, then present abstracts or notes of some of the seminar break-out sessions. We also include, as an appendix, the robust query processing reading list that was collected and distributed to participants before the seminar began, as well as summaries of a few of those papers that were contributed by some participants.

    Extending Event Sequence Processing: New Models and Optimization Techniques

    Many modern applications, including online financial feeds, tag-based mass transit systems, and RFID-based supply chain management systems, transmit real-time data streams. There is a need for event stream processing technology to analyze this vast amount of sequential data to enable online operational decision making. This dissertation focuses on innovating several techniques at the core of a scalable E-Analytic system to achieve efficient, scalable and robust methods for in-memory multi-dimensional nested pattern analysis over high-speed event streams. First, I address the problem of processing flat pattern queries on event streams with out-of-order data arrival. I design two alternative solutions: an aggressive and a conservative strategy. The aggressive strategy produces maximal output under the optimistic assumption that out-of-order event arrival is rare. The conservative method works under the assumption that out-of-order data may be common, and thus produces output only when its correctness can be guaranteed. Second, I design the integration of CEP and OLAP techniques (the E-Cube model) for efficient multi-dimensional event pattern analysis at different abstraction levels. Strategies of drill-down (refinement from abstract to specific patterns) and of roll-up (generalization from specific to abstract patterns) are developed for efficient workload evaluation. I design a cost-driven adaptive optimizer called Chase that exploits reuse strategies for optimal E-Cube hierarchy execution. Then, I explore novel optimization techniques to support the high-performance processing of powerful nested CEP patterns. A CEP query language called NEEL is designed to express nested CEP pattern queries composed of sequence, negation, AND, and OR operators. To allow flexible execution ordering, I devise a normalization procedure that employs rewriting rules for flattening a nested complex event expression. To reduce CPU and memory consumption, I propose several strategies for efficient shared processing of groups of normalized NEEL subexpressions. Our comprehensive experimental studies, using both synthetic and real data streams, demonstrate the superiority of the proposed strategies over alternative methods in the literature in both effectiveness and efficiency.
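    The conservative strategy can be pictured with a minimal sketch of the general buffering idea (often called K-slack). This is our illustration, not the dissertation's algorithm: assuming a known bound K on out-of-order delay, events are released in timestamp order only once no earlier event can still arrive, so every SEQ(A, B) match emitted is guaranteed correct.

```python
import heapq

def conservative_seq_ab(stream, K):
    """Match SEQ(A, B) over (timestamp, kind) events whose delay is <= K."""
    buf, seen_a, matches = [], [], []
    max_ts = float("-inf")

    def release(limit):
        # Pop events whose timestamps can no longer be preceded by late data.
        while buf and buf[0][0] <= limit:
            ts, kind = heapq.heappop(buf)
            if kind == "A":
                seen_a.append(ts)
            elif kind == "B":
                matches.extend((ta, ts) for ta in seen_a if ta < ts)

    for ts, kind in stream:
        max_ts = max(max_ts, ts)
        heapq.heappush(buf, (ts, kind))
        release(max_ts - K)
    release(float("inf"))  # flush at end of stream
    return matches

# A late A (timestamp 2) arrives after B; buffering still yields both matches.
print(conservative_seq_ab([(1, "A"), (4, "B"), (2, "A")], K=3))
# [(1, 4), (2, 4)]
```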

    Managing Distributed Cloud Applications and Infrastructure

    The emergence of the Internet of Things (IoT), combined with greater heterogeneity not only in cloud computing architectures but also across the cloud-to-edge continuum, is introducing new challenges for managing applications and infrastructure across this continuum. The scale and complexity are now so great that it is no longer realistic for IT teams to manually foresee potential issues and manage the dynamism and dependencies across an increasingly interdependent chain of service provision. This Open Access Pivot explores these challenges and offers a solution for the intelligent and reliable management of physical infrastructure and the optimal placement of applications for the provision of services on distributed clouds. The book provides a conceptual reference model for reliable capacity provisioning for distributed clouds and discusses how data analytics and machine learning, application and infrastructure optimization, and simulation can deliver quality-of-service requirements cost-efficiently in this complex feature space. These are illustrated through a series of case studies in cloud computing, telecommunications, big data analytics, and smart cities.
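    To make the placement problem concrete, here is a toy greedy sketch (ours, not the book's reference model; node and component names are invented): latency-sensitive components are packed onto the lowest-latency nodes with spare capacity, and the rest go wherever the most capacity remains.

```python
def place(components, nodes):
    """components: list of (name, cpu_demand, latency_sensitive).
    nodes: dict name -> {"cpu": capacity, "latency_ms": network latency}."""
    free = {n: spec["cpu"] for n, spec in nodes.items()}
    placement = {}
    # Place latency-sensitive components first.
    for name, cpu, sensitive in sorted(components, key=lambda c: not c[2]):
        candidates = sorted(nodes, key=lambda n: nodes[n]["latency_ms"]
                            if sensitive else -free[n])
        for n in candidates:
            if free[n] >= cpu:
                placement[name] = n
                free[n] -= cpu
                break
        else:
            raise RuntimeError(f"no capacity for {name}")
    return placement

nodes = {"edge1": {"cpu": 4, "latency_ms": 5},
         "cloud1": {"cpu": 64, "latency_ms": 40}}
comps = [("sensor-agg", 2, True), ("ui-gateway", 1, True),
         ("batch-train", 16, False)]
print(place(comps, nodes))
# {'sensor-agg': 'edge1', 'ui-gateway': 'edge1', 'batch-train': 'cloud1'}
```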

    Intermediate Results Materialization Selection and Format for Data-Intensive Flows

    Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results shared among multiple flows brings benefits not only in terms of performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, where they are studied as the materialized view selection problem. With the rise of Big Data systems, new challenges emerge due to new quality metrics captured by service-level agreements, which must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for the automatic, multi-objective selection of materialized intermediate results in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for the selected materialized intermediate results based on subsequent access patterns. The experimental results show that our approach provides a 40% better average speedup than the current state of the art, as well as an 18% improvement in disk access time compared to fixed-format solutions.
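    The selection step can be pictured with a deliberately simplified sketch (ours, not the paper's multi-objective method, which also balances conflicting quality objectives): a greedy knapsack that materializes the intermediates with the highest recomputation savings per stored byte.

```python
from dataclasses import dataclass

@dataclass
class Intermediate:
    name: str
    size_bytes: int        # storage cost if materialized
    recompute_cost: float  # cost of recomputing it once
    reuse_count: int       # how many flows read it

def select_materializations(candidates, storage_budget):
    # Benefit of materializing: recomputations avoided beyond the first run.
    def benefit(c):
        return c.recompute_cost * (c.reuse_count - 1)
    ranked = sorted(candidates, key=lambda c: benefit(c) / c.size_bytes,
                    reverse=True)
    chosen, used = [], 0
    for c in ranked:
        if benefit(c) > 0 and used + c.size_bytes <= storage_budget:
            chosen.append(c.name)
            used += c.size_bytes
    return chosen

cands = [Intermediate("join_out", 10_000, 50.0, 4),
         Intermediate("agg_out", 2_000, 20.0, 3),
         Intermediate("scan_out", 50_000, 5.0, 2)]
print(select_materializations(cands, storage_budget=12_000))
# ['agg_out', 'join_out']
```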

    Storage Format Selection and Optimization for Materialized Intermediate Results in Data-Intensive Flows

    Modern organizations produce and collect large volumes of data that need to be processed repeatedly and quickly to gain business insights. For such processing, typically, Data-intensive Flows (DIFs) are deployed on distributed processing frameworks. The DIFs of different users have many computation overlaps (i.e., parts of the processing are duplicated), thus wasting computational resources and increasing the overall cost. The output of these computation overlaps (known as intermediate results) can be materialized for reuse, which, if done properly, reduces cost and saves computational resources. Furthermore, the way such outputs are materialized must be considered, as different storage layouts (i.e., horizontal, vertical, and hybrid) can be used to reduce the I/O cost. In this PhD work, we first propose a novel approach for automatically materializing the intermediate results of DIFs through a multi-objective optimization method, which can tackle multiple and conflicting quality metrics. Next, we study the behavior of different operators of DIFs that are the first to process the loaded materialized results. Based on this study, we devise a rule-based approach that decides the storage layout for materialized results based on the subsequent operation types. Despite improving the cost in general, the heuristic rules do not consider the amount of data read while making the choice, which could lead to a wrong decision. Thus, we design a cost model that is capable of finding the right storage layout for every scenario. The cost model uses data and workload characteristics to estimate the I/O cost of a materialized intermediate result with different storage layouts and chooses the one with the minimum cost. The results show that storage layouts help to reduce the loading time of materialized results and, overall, improve the performance of DIFs. The thesis also focuses on the optimization of the configurable parameters of hybrid layouts. We propose ATUN-HL (Auto TUNing Hybrid Layouts), which, based on the same cost model and given the workload and data characteristics, finds the optimal values for the configurable parameters of hybrid layouts (e.g., Parquet). Finally, the thesis also studies the impact of parallelism in DIFs and hybrid layouts. Our proposed cost model helps to devise an approach for fine-tuning the parallelism by deciding the number of tasks and machines to process the data. Thus, the cost model proposed in this thesis enables choosing the best possible storage layout for materialized intermediate results, tuning the configurable parameters of hybrid layouts, and estimating the number of tasks and machines for the execution of DIFs.
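    The cost-model idea can be sketched as follows. This is our simplified single-objective illustration, not the thesis's formulas: it estimates the I/O cost of reading a materialized result under a horizontal and a vertical layout and keeps the cheaper one; the thesis additionally covers hybrid (Parquet) layouts and their configurable parameters. The seek constant and all sizes are assumptions.

```python
SEEK_COST = 1_000_000  # one random access, in streamed-byte equivalents

def io_cost(layout, n_rows, row_bytes, accessed_bytes, accessed_cols):
    if layout == "horizontal":   # whole rows, one sequential run
        return SEEK_COST + n_rows * row_bytes
    if layout == "vertical":     # one contiguous region per accessed column
        return accessed_cols * SEEK_COST + n_rows * accessed_bytes
    raise ValueError(layout)

def choose_layout(n_rows, row_bytes, accessed_bytes, accessed_cols):
    return min(("horizontal", "vertical"),
               key=lambda l: io_cost(l, n_rows, row_bytes,
                                     accessed_bytes, accessed_cols))

# A narrow scan (12 of 100 bytes per row) favors the vertical layout ...
print(choose_layout(10_000_000, 100, 12, 2))     # -> 'vertical'
# ... while touching every column favors the horizontal one.
print(choose_layout(10_000_000, 100, 100, 25))   # -> 'horizontal'
```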
