105 research outputs found

RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)

Author: Thomasian Alexander
Publication date: 06/01/2024

The RAID proposal advocated replacing large disks with arrays of PC disks, but as the capacity of small disks increased 100-fold in the 1990s, the production of large disks was discontinued. Storage dependability is increased via replication or erasure coding. Cloud storage providers store multiple copies of data, obviating the need for further redundancy. Variations of RAID based on local recovery codes and partial MDS codes reduce recovery cost. NAND flash Solid State Disks (SSDs) have low latency and high bandwidth, are more reliable, consume less power, and have a lower TCO than Hard Disk Drives, which are more viable for hyperscalers. Comment: Submitted to ACM Computing Surveys. arXiv admin note: substantial text overlap with arXiv:2306.0876

arXiv.org e-Print Archive

Optimizations for Energy-Aware, High-Performance and Reliable Distributed Storage Systems

Author: Karakoyunlu Cengiz
Publication venue: OpenCommons@UConn
Publication date: 15/01/2016

With the decreasing cost and widespread use of commodity hard drives, it has become possible to create very large-scale storage systems at lower expense. However, as we approach exabyte-scale storage systems, maintaining important features such as energy efficiency, performance, reliability and usability has become increasingly difficult. Despite the decreasing cost of storage systems, the energy consumption of these systems still needs to be addressed in order to retain cost-effectiveness. Any improvements in a storage system can be outweighed by high energy costs. On the other hand, large-scale storage systems can benefit more from object storage features for improved performance and usability. One area of concern is the metadata performance bottleneck of applications reading large directories or creating a large number of files. Similarly, computation on big data, where data needs to be transferred between compute and storage clusters, adversely affects I/O performance. As storage systems become larger and more complex, transferring data between remote compute and storage tiers becomes impractical. Furthermore, storage systems typically implement reliability at the file system or client level. This approach might not always be practical in terms of performance. Lastly, object storage features are usually tailored to specific use cases, which makes it harder to use them in various contexts. In this thesis, we present several approaches to enhance the energy efficiency, performance, reliability and usability of large-scale storage systems. To begin with, we improve the energy efficiency of storage systems by moving I/O load to a subset of the storage nodes with energy-aware node allocation methods and turning off the unused nodes, while preserving load balance on demand. To address the metadata performance issue associated with large creates and directory reads, we represent directories with object storage collections and implement lazy creation of objects. Similarly, in-situ computation on large-scale data is enabled by using object storage features to integrate a computational framework with the existing object storage layer, eliminating the need to transfer data between compute and storage silos for better performance. We then present parity-based redundancy using object storage features to achieve reliability with less performance impact. Finally, unified storage brings together the object storage features to meet the needs of distinct use cases, such as cloud storage, big data or high-performance computing, to alleviate the unnecessary fragmentation of storage resources. We evaluate each proposed approach thoroughly and validate its effectiveness in terms of improving the energy efficiency, performance, reliability and usability of a large-scale storage system.
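The parity-based redundancy mentioned in this abstract can be illustrated with a minimal sketch. It assumes the simplest possible scheme, a single XOR parity chunk over equal-sized object chunks; the thesis's actual layout is not described here, and the function names (`xor_bytes`, `compute_parity`, `recover_chunk`) are hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("chunks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(chunks: Sequence[bytes]) -> bytes:
    """Fold all data chunks into one parity chunk (assumed single-parity scheme)."""
    return reduce(xor_bytes, chunks)


def recover_chunk(surviving: Sequence[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing chunk from the survivors and the parity chunk."""
    return reduce(xor_bytes, surviving, parity)


# Hypothetical object chunks; real systems would pad chunks to a common size.
chunks = [b"obj0", b"obj1", b"obj2"]
parity = compute_parity(chunks)

lost = 1  # pretend the chunk at index 1 is unavailable
surviving = [c for i, c in enumerate(chunks) if i != lost]
rebuilt = recover_chunk(surviving, parity)
```

A single parity chunk tolerates the loss of any one chunk at the storage cost of one extra chunk, rather than a full extra copy as with replication.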

DigitalCommons@UConn

OpenCommons at University of Connecticut

An Introduction to Hyperdex and the Brave New World of High Performance, Scalable, Consistent, Faulttolerant Data Stores

Author: Bernard Wong, Emin Gün Sirer, Robert Escriva

CiteSeerX

Cluster Computing: A Novel Peer-to-Peer Cluster for Generic Application Sharing

Author: GUO CHEN
Publication date: 19/08/2013

Ph.D., Doctor of Philosophy

ScholarBank@NUS

Blockchain for secured IoT and D2D applications over 5G cellular networks : a thesis by publications presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer and Electronics Engineering, Massey University, Albany, New Zealand

Author: Honar Pajooh Houshyar
Publication venue: 'Massey University'
Publication date: 01/01/2021

Author's Declaration: "In accordance with Sensors, SpringerOpen, and IEEE’s copyright policy, this thesis contains the accepted and published version of each manuscript as the final version. Consequently, the content is identical to the published versions." The Internet of things (IoT) is in continuous development with ever-growing popularity. It brings significant benefits through enabling humans and the physical world to interact using various technologies from small sensors to cloud computing. IoT devices and networks are appealing targets of various cyber attacks and can be hampered by malicious intervening attackers if the IoT is not appropriately protected. However, IoT security and privacy remain a major challenge due to characteristics of the IoT, such as heterogeneity, scalability, nature of the data, and operation in open environments. Moreover, many existing cloud-based solutions for IoT security rely on central remote servers over vulnerable Internet connections. The decentralized and distributed nature of blockchain technology has attracted significant attention as a suitable solution to tackle the security and privacy concerns of the IoT and device-to-device (D2D) communication. This thesis explores the possible adoption of blockchain technology to address the security and privacy challenges of the IoT under the 5G cellular system. This thesis makes four novel contributions. First, a Multi-layer Blockchain Security (MBS) model is proposed to protect IoT networks while simplifying the implementation of blockchain technology. The concept of clustering is utilized to facilitate multi-layer architecture deployment and increase scalability. The K-unknown clusters are formed within the IoT network by applying a hybrid Evolutionary Computation Algorithm using Simulated Annealing (SA) and Genetic Algorithms (GA) to structure the overlay nodes. The open-source Hyperledger Fabric (HLF) Blockchain platform is deployed for the proposed model development. 
Base stations adopt a global blockchain approach to communicate with each other securely. The quantitative arguments demonstrate that the proposed clustering algorithm performs well when compared to the earlier reported methods. The proposed lightweight blockchain model is also better suited to balance network latency and throughput compared to a traditional global blockchain. Next, a model is proposed to integrate IoT systems and blockchain by implementing the permissioned blockchain Hyperledger Fabric. The security of the edge computing devices is provided by employing a local authentication process. A lightweight mutual authentication and authorization solution is proposed to ensure the security of tiny IoT devices within the ecosystem. In addition, the proposed model provides traceability for the data generated by the IoT devices. The performance of the proposed model is validated with practical implementation by measuring performance metrics such as transaction throughput and latency, resource consumption, and network use. The results indicate that the proposed platform with the HLF implementation is promising for the security of resource-constrained IoT devices and is scalable for deployment in various IoT scenarios. Despite the increasing development of blockchain platforms, there is still no comprehensive method for adopting blockchain technology on IoT systems due to the blockchain's limited capability to process substantial transaction requests from a massive number of IoT devices. The Fabric comprises various components such as smart contracts, peers, endorsers, validators, committers, and Orderers. A comprehensive empirical model is proposed that measures HLF's performance and identifies potential performance bottlenecks to better meet blockchain-based IoT applications' requirements. The implementation of HLF on distributed large-scale IoT systems is proposed. 
The performance of the HLF is evaluated in terms of throughput, latency, network sizes, scalability, and the number of peers serviceable by the platform. The experimental results demonstrate that the proposed framework can provide a detailed and real-time performance evaluation of blockchain systems for large-scale IoT applications. The diversity and the sheer increase in the number of connected IoT devices have brought significant concerns about storing and protecting the large IoT data volume. Dependencies of the centralized server solution impose significant trust issues and make it vulnerable to security risks. A layer-based distributed data storage design and implementation of a blockchain-enabled large-scale IoT system is proposed to mitigate these challenges by using the HLF platform for distributed ledger solutions. The need for a centralized server and third-party auditor is eliminated by leveraging HLF peers who perform transaction verification and records audits in a big data system with the help of blockchain technology. The HLF blockchain facilitates storing the lightweight verification tags on the blockchain ledger. In contrast, the actual metadata is stored in the off-chain big data system to reduce the communication overheads and enhance data integrity. Finally, experiments are conducted to evaluate the performance of the proposed scheme in terms of throughput, latency, communication, and computation costs. The results indicate the feasibility of the proposed solution to retrieve and store the provenance of large-scale IoT data within the big data ecosystem using the HLF blockchain

Massey Research Online

HEC: Collaborative Research: SAM^2 Toolkit: Scalable and Adaptive Metadata Management for High-End Computing

Author: Zhu Yifeng
Publication venue: DigitalCommons@UMaine
Publication date: 04/06/2010

The increasing demand for exabyte-scale storage capacity by high-end computing applications requires a higher level of scalability and dependability than that provided by current file and storage systems. The proposal deals with file systems research for metadata management of scalable cluster-based parallel and distributed file storage systems in the HEC environment. It aims to develop a scalable and adaptive metadata management (SAM2) toolkit to extend the features of, and fully leverage the peak performance promised by, state-of-the-art cluster-based parallel and distributed file storage systems used by the high-performance computing community. There is a large body of research on scaling data movement and management; however, the need to scale up the attributes of cluster-based file systems and I/O, that is, metadata, has been underestimated. An understanding of the characteristics of metadata traffic, and the application of proper load-balancing, caching, prefetching and grouping mechanisms to metadata management, will lead to high scalability. It is anticipated that by appropriately plugging the scalable and adaptive metadata management components into state-of-the-art cluster-based parallel and distributed file storage systems, one could potentially increase the performance of applications and file systems, and help translate the promise and potential of high peak performance of such systems into real application performance improvements. The project involves the following components:
1. Develop multi-variable forecasting models to analyze and predict file metadata access patterns.
2. Develop scalable and adaptive file name mapping schemes using the duplicative Bloom filter array technique to enforce load balance and increase scalability.
3. Develop decentralized, locality-aware metadata grouping schemes to facilitate bulk metadata operations such as prefetching.
4. Develop an adaptive cache coherence protocol using a distributed shared object model for client-side and server-side metadata caching.
5. Prototype the SAM2 components into the state-of-the-art parallel virtual file system PVFS2 and a distributed storage data caching system, set up an experimental framework for a DOE CMS Tier 2 site at the University of Nebraska-Lincoln, and conduct benchmark, evaluation and validation studies.
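The file name mapping idea in component 2 can be sketched as follows. This is a minimal illustration of a plain Bloom filter array, with one filter per metadata server, and not the project's "duplicative" variant, whose details are not given here. The class names, filter sizes, and the example path are all assumptions.

```python
import hashlib
from typing import Iterator, List


class BloomFilter:
    """Fixed-size Bloom filter; uses one byte per bit for readability."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several hash positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key: str) -> bool:
        # May report false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(key))


class BloomFilterArray:
    """One Bloom filter per metadata server (hypothetical layout)."""

    def __init__(self, num_servers: int) -> None:
        self.filters = [BloomFilter() for _ in range(num_servers)]

    def register(self, server_id: int, path: str) -> None:
        self.filters[server_id].add(path)

    def candidates(self, path: str) -> List[int]:
        """Servers whose filter claims to hold metadata for this path."""
        return [i for i, f in enumerate(self.filters) if path in f]


bfa = BloomFilterArray(num_servers=4)
bfa.register(2, "/home/alice/data.h5")
hits = bfa.candidates("/home/alice/data.h5")
```

A client can consult the array locally to find the likely home server for a file's metadata; because Bloom filters admit false positives, a lookup that returns several candidates would need a fallback such as querying each candidate.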

University of Maine

Developing New Power Management and High-Reliability Schemes in Data-Intensive Environment

Author: Wang Ruijun
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/01/2016

With the increasing popularity of data-intensive applications and of large-scale computing and storage systems, current data centers and supercomputers often deal with extremely large data sets. To store and process this huge amount of data reliably and energy-efficiently, system designers should take three major challenges into consideration. Firstly, power conservation: multicore processors, or CMPs, have become mainstream in the current processor market because of the tremendous improvement in transistor density and the advancement in semiconductor technology. However, the increasing number of transistors on a single die or chip brings a super-linear growth in power consumption [4]. Thus, how to balance system performance and power saving is a critical issue that needs to be solved effectively. Secondly, system reliability: reliability is a critical metric in the design and development of replication-based big data storage systems such as the Hadoop Distributed File System (HDFS). In a system with thousands of machines and storage devices, even infrequent failures become likely. In the Google File System, the annual disk failure rate is 2.88%, which means one would expect to see 8,760 disk failures in a year. Unfortunately, given an increasing number of node failures, how often a cluster starts losing data when being scaled out is not well investigated. Thirdly, energy efficiency: the fast processing speeds of the current generation of supercomputers provide a great convenience to scientists dealing with extremely large data sets. The next generation of exascale supercomputers could provide accurate simulation results for the automobile industry, the aerospace industry, and even nuclear fusion reactors for the very first time. However, the energy cost of supercomputing is extremely high, with a total electricity bill of 9 million dollars per year. Thus, conserving energy and increasing the energy efficiency of supercomputers has become critical in recent years. This dissertation proposes new solutions to address the above three key challenges for current large-scale storage and computing systems. Firstly, we propose a novel power management scheme called MAR (model-free, adaptive, rule-based) for multiprocessor systems to minimize CPU power consumption subject to performance constraints. By introducing a new I/O wait status, MAR is able to accurately describe the relationship between core frequencies, performance and power consumption. Moreover, we adopt a model-free control method to filter out the I/O wait status from the traditional CPU busy/idle model in order to achieve fast responsiveness to burst situations and take full advantage of power saving. Our extensive experiments on a physical testbed demonstrate that, for SPEC benchmarks and data-intensive (TPC-C) benchmarks, an MAR prototype system achieves 95.8-97.8% accuracy relative to the ideal power saving strategy calculated offline. Compared with baseline solutions, MAR is able to save 12.3-16.1% more power while maintaining a comparable performance loss of about 0.78-1.08%. In addition, further simulation results indicate that our design achieves 3.35-14.2% more power saving efficiency and 4.2-10.7% less performance loss under various CMP configurations as compared with baseline approaches such as LAST, Relax, PID and MPC. Secondly, we create a new reliability model by incorporating the probability of replica loss to investigate the system reliability of multi-way declustering data layouts and analyze their potential parallel recovery possibilities. Our comprehensive simulation results on Matlab and SHARPE show that the shifted declustering data layout outperforms the random declustering layout in a multi-way replication scale-out architecture, in terms of data loss probability and system reliability, by up to 63% and 85% respectively. Our study of both 5-year and 10-year system reliability under various recovery bandwidth settings shows that the shifted declustering layout surpasses the two baseline approaches in both cases by consuming up to 79% and 87% less recovery bandwidth for copyset, as well as 4.8% and 10.2% less recovery bandwidth for the random layout. Thirdly, we develop a power-aware job scheduler by applying a rule-based control method and taking into account real-world power and speedup profiles to improve power efficiency while adhering to predetermined power constraints. The intensive simulation results show that our proposed method is able to achieve the maximum utilization of computing resources as compared to baseline scheduling algorithms while keeping the energy cost under the threshold. Moreover, by introducing a Power Performance Factor (PPF) based on the real-world power and speedup profiles, we are able to increase the power efficiency by up to 75%.

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Plataforma ABAC para aplicações da IoT baseada na norma OASIS XACML

Author: Semenski Vedran
Publication venue: Universidade de Aveiro
Publication date: 01/01/2015

Master's degree in Computer and Telematics Engineering. IoT (Internet of Things) is an area which offers great opportunities and, although a lot of its issues already have satisfactory solutions, security has remained somewhat unaddressed and remains a major issue. Among the security aspects, we emphasize access control. Access control is a way of enforcing security that involves evaluating requests for accessing resources and denying access if it is unauthorised, thereby providing security for vulnerable resources. Access control is a broad term that covers several methodologies, of which the most significant are IBAC (Identity Based Access Control), RBAC (Role Based Access Control) and ABAC (Attribute Based Access Control). In this work ABAC will be used, as it offers the most flexibility compared to IBAC and RBAC. Also, because of ABAC's adaptive nature, it offers longevity and lower maintenance requirements. OASIS (Organization for the Advancement of Structured Information Standards) developed the XACML (eXtensible Access Control Markup Language) standard for writing and defining requests and policies, and for evaluating requests over sets of policies, for the purpose of enforcing access control over resources. It is defined so that requests and policies are readable by humans but also have a well-defined structure allowing for precise evaluation. The standard uses ABAC. This work aims to create a security framework that utilizes ABAC and the XACML standard so that it can be used by other systems to enforce access control over resources that need to be protected, allowing access only to authorised subjects. It will also allow for fine-grained definition of rules and requests for more precise evaluation and therefore a greater level of security. The primary use-case scenarios are large IoT applications such as Smart City applications, including smart traffic monitoring, energy and utility consumption, personal healthcare monitoring, etc. These applications deal with large quantities (Big Data) of confidential and/or personal data. A number of NoSQL (Not Only SQL) solutions exist for solving the problem of volume, but security is still an issue. This work will use two NoSQL databases: a key-value database (Redis) for storing policies and a wide-column database (Cassandra) for storing sensor data and additional attribute data during testing.

Repositório Institucional da Universidade de Aveiro

Autonomic Management of Cloud Virtual Infrastructures

Author: Loreti Daniela
Publication venue: Alma Mater Studiorum - Università di Bologna
Publication date: 12/05/2016

The new model of interaction suggested by Cloud Computing has spread significantly over the last years thanks to its capability of providing customers with the illusion of an infinite amount of reliable resources. Nevertheless, the challenge of efficiently managing a large collection of virtual computing nodes has only partially been moved from the customer's private datacenter to the larger provider's infrastructure that we generally refer to as “the cloud”. A lot of effort, in both academia and industry, is therefore concentrated on policies for the efficient and autonomous management of virtual infrastructures. Research on this topic is further encouraged by the diffusion of cheap and portable sensors and the availability of almost ubiquitous Internet connectivity, which are constantly creating large flows of information about the environment we live in. The need for fast and reliable mechanisms to process these considerable volumes of data has inevitably pushed the evolution from the initial scenario of a single (private or public) cloud towards cloud interoperability, giving birth to several forms of collaboration between clouds. Efficient resource management is further complicated in these heterogeneous environments, making autonomous administration more and more desirable. In this thesis, we initially focus on the challenges of autonomic management in a single-cloud scenario, considering the benefits and shortcomings of centralized and distributed solutions and proposing an original decentralized model. Later in this dissertation, we face the challenge of autonomic management in large interconnected cloud environments, where the movement of virtual resources across the infrastructure nodes is further complicated by the intrinsic heterogeneity of the scenario and the difficulties introduced by the higher-latency medium between datacenters. Accordingly, we focus on the cost model for the execution of distributed data-intensive applications on multiple clouds and we propose different management policies leveraging cloud interoperability.

AMS Tesi di Dottorato

Arquitectura, técnicas y modelos para posibilitar la Ciencia de Datos en el Archivo de la Misión Gaia

Author: Tapiador de Pedro Daniel
Publication venue: 'Universidad Complutense de Madrid (UCM)'
Publication date: 29/10/2018

Unpublished thesis of the Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática, defended on 26/05/2017. The massive amounts of data that the world produces every day pose new challenges to modern societies in terms of how to leverage their inherent value. Social networks, instant messaging, video, smart devices and scientific missions are mere examples of the vast number of sources generating data every second. As the world becomes more and more digitalized, new needs arise for organizing, archiving, sharing, analyzing, visualizing and protecting the ever-increasing data sets, so that we can truly develop into a data-driven economy that reduces inefficiencies and increases sustainability, creating new business opportunities on the way. Traditional approaches for harnessing data are no longer suitable, as they lack the means for scaling to the larger volumes in a timely and cost-efficient manner. This has changed somewhat with the advent of Internet companies like Google and Facebook, which have devised new ways of tackling this issue. However, the variety and complexity of the value chains in the private sector, as well as the increasing demands and constraints under which the public sector operates, call for ongoing research that can yield newer strategies for dealing with data, facilitate the integration of providers and consumers of information, and guarantee a smooth and prompt transition when adopting these cutting-edge technological advances. This thesis aims at providing novel architectures and techniques that will help perform this transition towards Big Data in massive scientific archives. It highlights the common pitfalls that must be faced when embracing it and how to overcome them, especially when the data sets, their transformation pipelines and the tools used for the analysis are already present in the organizations. Furthermore, a new perspective for facilitating a smoother transition is laid out. 
It involves the usage of higher-level and use case specific frameworks and models, which will naturally bridge the gap between the technological and scientific domains. This alternative will effectively widen the possibilities of scientific archives and therefore will contribute to the reduction of the time to science. The research will be applied to the European Space Agency cornerstone mission Gaia, whose final data archive will represent a tremendous discovery potential. It will create the largest and most precise three dimensional chart of our galaxy (the Milky Way), providing unprecedented position, parallax and proper motion measurements for about one billion stars. The successful exploitation of this data archive will depend to a large degree on the ability to offer the proper architecture, i.e. infrastructure and middleware, upon which scientists will be able to do exploration and modeling with this huge data set. In consequence, the approach taken needs to enable data fusion with other scientific archives, as this will produce the synergies leading to an increment in scientific outcome, both in volume and in quality. The set of novel techniques and frameworks presented in this work addresses these issues by contextualizing them with the data products that will be generated in the Gaia mission. All these considerations have led to the foundations of the architecture that will be leveraged by the Science Enabling Applications Work Package. Last but not least, the effectiveness of the proposed solution will be demonstrated through the implementation of some ambitious statistical problems that will require significant computational capabilities, and which will use Gaia-like simulated data (the first Gaia data release has recently taken place on September 14th, 2016). 
These ambitious problems will be referred to as the Grand Challenge, a somewhat grandiloquent name that consists in inferring a set of parameters from a probabilistic point of view for the Initial Mass Function (IMF) and Star Formation Rate (SFR) of a given set of stars (with a huge sample size), from noisy estimates of their masses and ages respectively. This will be achieved by using Hierarchical Bayesian Modeling (HBM). In principle, the HBM can incorporate stellar evolution models to infer the IMF and SFR directly, but in this first step presented in this thesis, we will start with a somewhat less ambitious goal: inferring the PDMF and PDAD. Moreover, the performance and scalability analyses carried out will also prove the suitability of the models for the large amounts of data that will be available in the Gaia data archive.Las grandes cantidades de datos que se producen en el mundo diariamente plantean nuevos retos a la sociedad en términos de cómo extraer su valor inherente. Las redes sociales, mensajería instantánea, los dispositivos inteligentes y las misiones científicas son meros ejemplos del gran número de fuentes generando datos en cada momento. Al mismo tiempo que el mundo se digitaliza cada vez más, aparecen nuevas necesidades para organizar, archivar, compartir, analizar, visualizar y proteger la creciente cantidad de datos, para que podamos desarrollar economías basadas en datos e información que sean capaces de reducir las ineficiencias e incrementar la sostenibilidad, creando nuevas oportunidades de negocio por el camino. La forma en la que se han manejado los datos tradicionalmente no es la adecuada hoy en día, ya que carece de los medios para escalar a los volúmenes más grandes de datos de una forma oportuna y eficiente. Esto ha cambiado de alguna manera con la llegada de compañías que operan en Internet como Google o Facebook, ya que han concebido nuevas aproximaciones para abordar el problema. 
Sin embargo, la variedad y complejidad de las cadenas de valor en el sector privado y las crecientes demandas y limitaciones en las que el sector público opera, necesitan una investigación continua en la materia que pueda proporcionar nuevas estrategias para procesar las enormes cantidades de datos, facilitar la integración de productores y consumidores de información, y garantizar una transición rápida y fluida a la hora de adoptar estos avances tecnológicos innovadores. Esta tesis tiene como objetivo proporcionar nuevas arquitecturas y técnicas que ayudarán a realizar esta transición hacia Big Data en archivos científicos masivos. La investigación destaca los escollos principales a encarar cuando se adoptan estas nuevas tecnologías y cómo afrontarlos, principalmente cuando los datos y las herramientas de transformación utilizadas en el análisis existen en la organización. Además, se exponen nuevas medidas para facilitar una transición más fluida. Éstas incluyen la utilización de software de alto nivel y específico al caso de uso en cuestión, que haga de puente entre el dominio científico y tecnológico. Esta alternativa ampliará de una forma efectiva las posibilidades de los archivos científicos y por tanto contribuirá a la reducción del tiempo necesario para generar resultados científicos a partir de los datos recogidos en las misiones de astronomía espacial y planetaria. La investigación se aplicará a la misión de la Agencia Espacial Europea (ESA) Gaia, cuyo archivo final de datos presentará un gran potencial para el descubrimiento y hallazgo desde el punto de vista científico. La misión creará el catálogo en tres dimensiones más grande y preciso de nuestra galaxia (la Vía Láctea), proporcionando medidas sin precedente acerca del posicionamiento, paralaje y movimiento propio de alrededor de mil millones de estrellas. 
The opportunities for the successful exploitation of this data archive will depend to a large extent on the ability to offer the right architecture, i.e. infrastructure and services, on which scientists can explore and model this immense amount of data. The strategy must therefore be able to combine the data with other scientific archives, since this will produce synergies that contribute to an increase in the science produced, both in volume and in quality. The set of innovative techniques and infrastructures presented in this work addresses these problems, placing them in the context of the data products that will be generated by the Gaia mission. All these considerations have led to the foundations of the architecture that will be used in the work package of applications that will enable science on the Gaia mission archive (Science Enabling Applications). Finally, the effectiveness of the proposed solution will be demonstrated through the implementation of two statistical problems that will require significant amounts of computation and that will use simulated data in the same format in which they will be produced in the Gaia mission archive (the first release of data collected by the mission has been available since 14 September 2016). These ambitious problems represent the Grand Challenge, a grandiloquent name for inferring, from a probabilistic point of view, a set of parameters for the Initial Mass Function and the Star Formation Rate of a given set of stars (with a large sample), from noisy estimates of their masses and ages respectively. This will be addressed using Hierarchical Bayesian Modeling.
In principle, the proposed models can incorporate other stellar evolution models to infer the Initial Mass Function and the Star Formation Rate directly, but in the first step presented in this thesis, we start with a somewhat less ambitious goal: inferring the present-day mass function and age distribution (Present-Day Mass Function and Present-Day Age Distribution respectively). In addition, performance and scalability analyses will be carried out to test the suitability of the implementation of these models given the enormous amounts of data that will be available in the Gaia mission archive.
Depto. de Arquitectura de Computadores y Automática, Fac. de Informática

Docta Complutense