
    Dynamic re-optimization techniques for stream processing engines and object stores

    Large-scale data storage and processing systems are strongly motivated by the need to store and analyze massive datasets. The complexity of a large class of these systems is rooted in their distributed nature, extreme scale, need for real-time response, and streaming nature. The use of these systems in multi-tenant cloud environments with potential resource interference necessitates fine-grained monitoring and control. In this dissertation, we present efficient, dynamic techniques for re-optimizing stream-processing systems and transactional object-storage systems. In the context of stream-processing systems, we present VAYU, a per-topology controller. VAYU uses novel methods and protocols for dynamic, network-aware tuple routing in the dataflow. We show that the feedback-driven controller in VAYU helps achieve high pipeline throughput over long execution periods, as it dynamically detects and diagnoses pipeline bottlenecks. We also present novel heuristics to optimize overlays for group communication operations in the streaming model. In the context of object-storage systems, we present M-Lock, a novel lock-localization service for distributed transaction protocols on scale-out object stores that increases transaction throughput. Lock localization refers to the dynamic migration and partitioning of locks across nodes in the scale-out store to reduce cross-partition lock acquisition. The service leverages observed object-access patterns to achieve lock clustering and deliver high performance. Finally, we present TransMR, a framework that uses distributed, transactional object stores to orchestrate and execute asynchronous components in amorphous data-parallel applications on scale-out architectures.
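
    As a rough illustration of the lock-localization idea described above: the sketch below is hypothetical (not M-Lock's implementation; the LockClusterer class, its methods and the greedy placement rule are invented for the example). It shows how observed co-access patterns could drive lock clustering, so that objects frequently locked together by the same transactions end up with their locks on the same node and a transaction avoids most cross-partition lock acquisitions.

```python
# Hypothetical sketch of lock clustering driven by observed access patterns.
# Not M-Lock's actual algorithm: it only illustrates co-locating locks for
# objects that are frequently locked together by the same transactions.
from collections import Counter, defaultdict
from itertools import combinations

class LockClusterer:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.co_access = Counter()   # (obj_a, obj_b) -> co-access count
        self.home = {}               # obj -> node currently holding its lock

    def observe(self, txn_objects):
        """Record which objects one transaction locked together."""
        for a, b in combinations(sorted(txn_objects), 2):
            self.co_access[(a, b)] += 1

    def rebalance(self):
        """Greedily place each object's lock on the node where its most
        frequently co-accessed partners already live."""
        for obj in sorted({o for pair in self.co_access for o in pair}):
            votes = defaultdict(int)
            for (a, b), n in self.co_access.items():
                partner = b if a == obj else a if b == obj else None
                if partner is not None and partner in self.home:
                    votes[self.home[partner]] += n
            # Fall back to hashing when there is no co-access information yet.
            self.home[obj] = max(votes, key=votes.get) if votes else hash(obj) % self.num_nodes
        return self.home

# Usage: after observing the workload, the locks for x and y migrate to one
# node, so a transaction touching both avoids a cross-partition acquisition.
clusterer = LockClusterer(num_nodes=4)
for _ in range(10):
    clusterer.observe(["x", "y"])
print(clusterer.rebalance())
```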

    Arquitectura, técnicas y modelos para posibilitar la Ciencia de Datos en el Archivo de la Misión Gaia

    Unpublished thesis of the Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática, defended on 26/05/2017. The massive amounts of data that the world produces every day pose new challenges to modern societies in terms of how to leverage their inherent value. Social networks, instant messaging, video, smart devices and scientific missions are just a few examples of the vast number of sources generating data every second. As the world becomes more and more digitalized, new needs arise for organizing, archiving, sharing, analyzing, visualizing and protecting the ever-increasing data sets, so that we can truly develop into a data-driven economy that reduces inefficiencies and increases sustainability, creating new business opportunities along the way. Traditional approaches for harnessing data are no longer suitable, as they lack the means to scale to these larger volumes in a timely and cost-efficient manner. This has changed to some extent with the advent of Internet companies like Google and Facebook, which have devised new ways of tackling this issue. However, the variety and complexity of the value chains in the private sector, as well as the increasing demands and constraints under which the public sector operates, call for ongoing research that can yield new strategies for dealing with data, facilitate the integration of providers and consumers of information, and guarantee a smooth and prompt transition when adopting these cutting-edge technological advances. This thesis aims to provide novel architectures and techniques that will help perform this transition towards Big Data in massive scientific archives. It highlights the common pitfalls that must be faced when embracing Big Data and how to overcome them, especially when the data sets, their transformation pipelines and the tools used for the analysis are already present in the organizations. Furthermore, a new perspective for facilitating a smoother transition is laid out. It involves the usage of higher-level, use-case-specific frameworks and models, which naturally bridge the gap between the technological and scientific domains. This alternative will effectively widen the possibilities of scientific archives and therefore contribute to reducing the time to science. The research is applied to the European Space Agency cornerstone mission Gaia, whose final data archive will represent a tremendous discovery potential. It will create the largest and most precise three-dimensional chart of our galaxy (the Milky Way), providing unprecedented position, parallax and proper motion measurements for about one billion stars. The successful exploitation of this data archive will depend to a large degree on the ability to offer the proper architecture, i.e. infrastructure and middleware, upon which scientists will be able to explore and model this huge data set. Consequently, the approach taken needs to enable data fusion with other scientific archives, as this will produce the synergies leading to an increase in scientific output, both in volume and in quality. The set of novel techniques and frameworks presented in this work addresses these issues by contextualizing them with the data products that will be generated in the Gaia mission. All these considerations have led to the foundations of the architecture that will be leveraged by the Science Enabling Applications Work Package.
Last but not least, the effectiveness of the proposed solution will be demonstrated through the implementation of some ambitious statistical problems that require significant computational capabilities and use Gaia-like simulated data (the first Gaia data release took place on September 14th, 2016). These problems will be referred to as the Grand Challenge, a somewhat grandiloquent name for the task of inferring, from a probabilistic point of view, a set of parameters of the Initial Mass Function (IMF) and Star Formation Rate (SFR) of a given set of stars (with a huge sample size) from noisy estimates of their masses and ages, respectively. This will be achieved by using Hierarchical Bayesian Modeling (HBM). In principle, the HBM can incorporate stellar evolution models to infer the IMF and SFR directly, but in this first step presented in this thesis we start with a somewhat less ambitious goal: inferring the Present-Day Mass Function (PDMF) and Present-Day Age Distribution (PDAD). Moreover, the performance and scalability analyses carried out will also prove the suitability of the models for the large amounts of data that will be available in the Gaia data archive.
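
    To make the hierarchical-inference idea concrete, the following is a minimal, self-contained sketch of inferring a present-day mass function from noisy mass estimates. It is not the thesis' implementation and makes simplifying assumptions (a single power-law slope alpha, Gaussian mass errors, a grid posterior instead of a full sampler); all parameter values and names are illustrative.

```python
# Minimal illustrative sketch (not the thesis' code): infer the slope of a
# power-law present-day mass function p(m | alpha) ~ m^(-alpha) from noisy
# mass estimates, marginalising the latent true mass of each star on a grid
# and evaluating the posterior of alpha on a grid (flat prior on alpha).
import numpy as np

rng = np.random.default_rng(0)
M_LO, M_HI, SIGMA = 0.5, 10.0, 0.2            # mass range and assumed error

def sample_powerlaw(alpha, n):
    """Draw true masses from p(m) ~ m^(-alpha) on [M_LO, M_HI] (inverse CDF)."""
    u = rng.uniform(size=n)
    a = 1.0 - alpha
    return (M_LO**a + u * (M_HI**a - M_LO**a)) ** (1.0 / a)

# Simulated catalogue: true masses are latent, only noisy estimates are seen.
true_alpha = 2.35
m_true = sample_powerlaw(true_alpha, 2000)
m_obs = m_true + rng.normal(0.0, SIGMA, size=m_true.size)

m_grid = np.linspace(M_LO, M_HI, 300)         # grid over latent true masses
dm = m_grid[1] - m_grid[0]
alphas = np.linspace(1.5, 3.5, 81)
log_post = np.empty_like(alphas)
for i, alpha in enumerate(alphas):
    imf = m_grid ** (-alpha)
    imf /= imf.sum() * dm                     # normalised mass function
    # Gaussian measurement likelihood of every observation at every grid mass
    like = np.exp(-0.5 * ((m_obs[:, None] - m_grid[None, :]) / SIGMA) ** 2)
    marginal = (like * imf[None, :]).sum(axis=1) * dm
    log_post[i] = np.log(marginal).sum()

print("posterior mode of alpha:", alphas[np.argmax(log_post)])
```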

    Raphtory: Modelling, Maintenance and Analysis of Distributed Temporal Graphs.

    PhD thesis. Temporal graphs capture the development of relationships within data throughout time. This model fits naturally within a streaming architecture, where new events can be inserted directly into the graph upon arrival from a data source and be compared to related entities or historical state. However, the majority of graph processing systems only consider traditional graph analysis on static data, whilst those which do expand past this often only support batched updating and delta analysis across graph snapshots. In this work we define a temporal property graph model and the semantics for updating it in both a distributed and non-distributed context. We have built Raphtory, a distributed temporal graph analytics platform which maintains the full graph history in memory, leveraging the defined update semantics to insert streamed events directly into the model without batching or centralised ordering. In parallel with the ingestion, traditional and time-aware analytics may be performed on the most up-to-date version of the graph, as well as at any point throughout its history. The depth of history viewed from the perspective of a time point may also be varied to explore both short- and long-term patterns within the data. Through this we extract novel insights over a variety of use cases, including phenomena never seen before in social networks. Finally, we demonstrate Raphtory's ability to scale both vertically and horizontally, handling consistent throughput in excess of 100,000 updates a second alongside the ingestion and maintenance of graphs built from billions of events.
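
    As a rough illustration of the temporal-graph idea described above (a hypothetical sketch, not Raphtory's API or implementation; the class and method names are invented), a graph can record every timestamped edge addition and removal as an event and materialise a view of the graph as it existed at any time point, optionally restricted to a recent window of history:

```python
# Hypothetical sketch of a temporal graph that ingests timestamped edge
# events and can be viewed "as of" any time point, with an optional window
# limiting how much history the view considers. Not Raphtory's actual API.
from collections import defaultdict

class TemporalGraph:
    def __init__(self):
        self.events = []                        # (time, "add"/"del", src, dst)

    def add_edge(self, t, src, dst):
        self.events.append((t, "add", src, dst))

    def remove_edge(self, t, src, dst):
        self.events.append((t, "del", src, dst))

    def view(self, at_time, window=None):
        """Adjacency of the graph as it existed at `at_time`.
        With `window`, only events in (at_time - window, at_time] count,
        which exposes short-term versus long-term structure."""
        start = at_time - window if window is not None else float("-inf")
        alive = set()
        for t, kind, src, dst in sorted(self.events):
            if start < t <= at_time:
                (alive.add if kind == "add" else alive.discard)((src, dst))
        adj = defaultdict(set)
        for src, dst in alive:
            adj[src].add(dst)
        return adj

# Usage: the same graph inspected at two points in its history.
g = TemporalGraph()
g.add_edge(1, "alice", "bob")
g.add_edge(5, "bob", "carol")
g.remove_edge(8, "alice", "bob")
print(dict(g.view(at_time=6)))                  # both edges present at t=6
print(dict(g.view(at_time=6, window=2)))        # only the edge added at t=5
```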

    Design and development of a system for vario-scale maps

    Nowadays, there are many geo-information data sources available, such as maps on the Internet, in-car navigation devices and mobile apps. All datasets used in these applications are the same in principle and face the same issues, namely: maps of different scales are stored separately; with many separate fixed levels, a lot of information is the same but still needs to be included, which leads to duplication; and with much redundant data across the scales, features are represented again and again, which may lead to inconsistency. Currently available maps contain significantly more levels of detail (twenty map scales on average) than in the past. These levels must be created, but the optimal strategy to do so is not known. For every user data request, a significant part of the data remains the same but still needs to be included, which leads to more data transfer and slower responses. The interactive Internet environment is not used to its full potential for user navigation: it is common to observe lagging, popping features or flickering of newly retrieved map-scale features while using the map. This research develops principles of variable-scale (vario-scale) maps to address these issues. The vario-scale approach is an alternative for obtaining and maintaining geographical data sets at different map scales. It is based on a specific topological structure called tGAP (topological Generalized Area Partitioning), which addresses the main open issues of current solutions for managing spatial data sets at different scales: data redundancy, inconsistency between map scales and dynamic transfer. The objective of this thesis is to design, develop and extend the variable-scale data structures, expressed in the following research question: How to design and develop a system for vario-scale maps? To address this research question, the work has been conducted along the following outline: 1) investigate the state of the art in map generalization; 2) study the development of the vario-scale structure so far; 3) propose techniques for generating better vario-scale map content; 4) implement strategies to process truly massive datasets; 5) research smooth representation of map features and its impact on user interaction. The results of our research led to new functionality, were addressed in prototype developments and were tested against real-world data sets. Throughout this research we have made the following main contributions to the design and development of a system for vario-scale maps. We have: studied past vario-scale development and identified the most urgent research needs; designed the concept of granularity and presented our strategy that changes in map content should be as small and as gradual as possible (e.g. use groups, maintain the road network, support line-feature representation); introduced line features into the solution and presented a fully automated generalization process that preserves road network features throughout all scales; proposed an approach to create a vario-scale data structure for massive datasets; demonstrated a method to generate an explicit 3D representation from the structure, which can provide a smoother user experience; developed a software prototype in which a 3D vario-scale dataset can be used to its full potential; and conducted an initial usability test. All these aspects, together with the already developed functionality, provide a more complete and more unified solution for vario-scale mapping.
Based on our research, the design and development of a system for vario-scale maps should now be clearer. In addition, it is easier to identify the steps that need to be taken towards an optimal solution. Our recommendations for future work are as follows. One of the contributions has been the integration of road features in the structure and their automated generalization throughout the process; integrating more map features besides roads deserves attention. We have investigated how to deal with massive datasets which do not fit in the main memory of the computer; our experiments used datasets of a single province or state, with records in the order of millions, so to verify our findings it will be interesting to process even bigger datasets, with records in the order of billions (a whole continent). We have introduced a representation in which map content changes as gradually as possible, based on a process in which: 1) explicit 3D geometry is generated from the structure; 2) a slice of the geometry is calculated; and 3) the final map is constructed from the slice. How to integrate this into a server-client pipeline on the Internet is another point for further research. Our research has focused mainly on one specific aspect of the concept at a time; bringing all aspects together, where integration, tuning and orchestration play an important role, is another interesting direction that deserves attention. Finally, more user testing should be carried out, including: 1) maps of sufficient cartographic quality, 2) a large testing region, and 3) the most refined version of the visualization prototype.
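
    The tGAP idea behind the vario-scale structure can be sketched roughly as follows (a simplified, hypothetical illustration, not the thesis' implementation; the importance measure, the merge rule and all names are invented): the least important area is repeatedly merged into its most compatible neighbour, and each face records the range of generalization steps at which it is valid, so a map at any level of detail is obtained by selecting the faces whose range covers the requested step instead of storing separate fixed scales.

```python
# Simplified tGAP-style sketch (hypothetical, not the actual tGAP code):
# repeatedly merge the least important face into a neighbour and record, for
# every face, the interval of generalization steps in which it exists. A map
# at step s is then simply the set of faces whose interval contains s.
class Face:
    def __init__(self, name, importance, neighbours):
        self.name, self.importance = name, importance
        self.neighbours = set(neighbours)
        self.valid = [0, None]                  # [born_at_step, dies_at_step)

def build_tgap(faces):
    alive = {f.name: f for f in faces}
    step = 0
    while len(alive) > 1:
        step += 1
        victim = min(alive.values(), key=lambda f: f.importance)
        nbrs = [alive[n] for n in victim.neighbours if n in alive]
        target = max(nbrs, key=lambda f: f.importance)   # absorb into strongest neighbour
        victim.valid[1] = step                  # victim disappears at this step
        target.importance += victim.importance  # merged area becomes more important
        target.neighbours |= victim.neighbours - {target.name, victim.name}
        del alive[victim.name]
    return step

def map_at(faces, step):
    """Faces present at a given generalization step (0 = most detailed)."""
    return [f.name for f in faces
            if f.valid[0] <= step and (f.valid[1] is None or step < f.valid[1])]

faces = [Face("meadow", 1, {"lake", "forest"}),
         Face("lake", 3, {"meadow", "forest"}),
         Face("forest", 5, {"meadow", "lake"})]
last_step = build_tgap(faces)
for s in range(last_step + 1):
    print("step", s, "->", map_at(faces, s))   # progressively generalized map
```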

    Month of birth and its relationship to streaming in the primary school

    This study is concerned with an investigation into the relationship between month of birth and stream placement in the primary school. It is particularly concerned with the possibility that, where traditional streaming is implemented, there may be an under-estimation of the younger children in a school year age group. Streaming is usually defined as "grouping according to ability with considerations of attainment", but, in practice, only attainment seems to be assessed adequately, and ability tends to be given less attention. In the traditionally streamed primary school, allocation is usually based on attainment level at the time of leaving the infant department. It is possible that some of the younger children in the year group, who have matured less intellectually, and who have had less time in the infant department to benefit from early formal tuition, may be under-estimated and placed in lower streams than their potential would warrant. In the study 1000 children from 5 schools, 500 in the first year of the junior department and 500 in the fourth year, were investigated with respect to Month of Birth, I.Q., and Stream Placement. Results showed that, although, in general, the children were successfully streamed, and although no birth months were superior with respect to intelligence, the younger children tended to be placed more readily in the lower streams. This was the case at first year level but not at fourth year level. Thus, although there was a tendency for early underestimation of the younger children of the school year group, this seemed to be rectified later to a great extent