38 research outputs found

    PPCAS: Implementation of a Probabilistic Pairwise Model for Consistency-Based Multiple Alignment in Apache Spark

    Get PDF
    Large-scale data processing techniques, currently known as Big-Data, are used to manage the huge amount of data that are generated by sequencers. Although these techniques have significant advantages, few biological applications have adopted them. In the Bioinformatic scientific area, Multiple Sequence Alignment (MSA) tools are widely applied for evolution and phylogenetic analysis, homology and domain structure prediction. Highly-rated MSA tools, such as MAFFT, ProbCons and T-Coffee (TC), use the probabilistic consistency as a prior step to the progressive alignment stage in order to improve the final accuracy. In this paper, a novel approach named PPCAS (Probabilistic Pairwise model for Consistency-based multiple alignment in Apache Spark) is presented. PPCAS is based on the MapReduce processing paradigm in order to enable large datasets to be processed with the aim of improving the performance and scalability of the original algorithm.This work was supported by the MEyC-Spain [contract TIN2014-53234-C2-2-R]

    Smart Monetization - Telecom Revenue Management beyond the traditional invoice

    Get PDF
    Nowadays, there is a fast and unpredictable technological evolution, with new systems constantly emerging on the market, with the capability of being monetized. However, these systems are not always fully and flexibly explored. Many hardware distributors are selling products without a clear view on sustainable business models for them, leaving these as an afterthought. Communications Service Providers are suddenly under pres sure to modernize and expande their business models as to regain the ground claimed by Over-the-top service providers, who make use of existing infrastructures to provide their own services, which naturally may lead to substantial revenue loss from the actual infras tructure owners. The Smart Monetization project aims to explore this paradigm, with the design and implementation of a reusable asset, making use of Big Data and Analytics tools that can ingest and process usage and billing data from customers, detecting event patterns and correlations that can be monetized, leading to improved and new service experiences and ensuring, as well, greater transparency on the process of billing and charging of these services

    A framework for multidimensional indexes on distributed and highly-available data stores

    Get PDF
    Spatial Big Data is considered an essential trend in future scientific and business applications. Indeed, research instruments, medical devices, and social networks generate hundreds of peta bytes of spatial data per year. However, as many authors have pointed out, the lack of specialized frameworks dealing with such kind of data is limiting possible applications and probably precluding many scientific breakthroughs. In this thesis, we describe three HPC scientific applications, ranging from molecular dynamics, neuroscience analysis, and physics simulations, where we experience first hand the limits of the existing technologies. Thanks to our experience, we define the desirable missing functionalities, and we focus on two features that when combined significantly improve the way scientific data is analyzed. On one side, scientific simulations generate complex datasets where multiple correlated characteristics describe each item. For instance, a particle might have a space position (x,y,z) at a given time (t). If we want to find all elements within the same area and period, we either have to scan the whole dataset, or we must organize the data so that all items in the same space and time are stored together. The second approach is called Multidimensional Indexing (MI), and it uses different techniques to cluster and to organize similar data together. On the other side, approximate analytics has been often indicated as a smart and flexible way to explore large datasets in a short period. Approximate analytics includes a broad family of algorithms which aims to speed up analytical workloads by relaxing the precision of the results within a specific interval of confidence. For instance, if we want to know the average age in a group with 1-year precision, we can consider just a random fraction of all the people, thus reducing the amount of calculation. But if we also want less I/O operations, we need efficient data sampling, which means organizing data in a way that we do not need to scan the whole data set to generate a random sample of it. According to our analysis, combining Multidimensional Indexing with efficient data Sampling (MIS) is a vital missing feature not available in the current distributed data management solutions. This thesis aims to solve such a shortcoming and it provides novel scalable solutions. At first, we describe the existing data management alternatives; then we motivate our preference for NoSQL key-value databases. Secondly, we propose an analytical model to study the influence of data models on the scalability and performance of this kind of distributed database. Thirdly, we use the analytical model to design two novel multidimensional indexes with efficient data sampling: the D8tree and the AOTree. Our first solution, the D8tree, improves state of the art for approximate spatial queries on static and mostly read dataset. Later, we enhanced the data ingestion capability or our approach by introducing the AOTree, an algorithm that enables the query performance of the D8tree even for HPC write-intensive applications. We compared our solution with PostgreSQL and plain storage, and we demonstrate that our proposal has better performance and scalability. Finally, we describe Qbeast, the novel distributed system that implements the D8tree and the AOTree using NoSQL technologies, and we illustrate how Qbeast simplifies the workflow of scientists in various HPC applications providing a scalable and integrated solution for data analysis and management.La gestión de BigData con información espacial está considerada como una tendencia esencial en el futuro de las aplicaciones científicas y de negocio. De hecho, se generan cientos de petabytes de datos espaciales por año mediante instrumentos de investigación, dispositivos médicos y redes sociales. Sin embargo, tal y como muchos autores han señalado, la falta de entornos especializados en manejar este tipo de datos está limitando sus posibles aplicaciones y está impidiendo muchos avances científicos. En esta tesis, describimos 3 aplicaciones científicas HPC, que cubren los ámbitos de dinámica molecular, análisis neurocientífico y simulaciones físicas, donde hemos experimentado en primera mano las limitaciones de las tecnologías existentes. Gracias a nuestras experiencias, hemos podido definir qué funcionalidades serían deseables y no existen, y nos hemos centrado en dos características que, al combinarlas, mejoran significativamente la manera en la que se analizan los datos científicos. Por un lado, las simulaciones científicas generan conjuntos de datos complejos, en los que cada elemento es descrito por múltiples características correlacionadas. Por ejemplo, una partícula puede tener una posición espacial (x, y, z) en un momento dado (t). Si queremos encontrar todos los elementos dentro de la misma área y periodo, o bien recorremos y analizamos todo el conjunto de datos, o bien organizamos los datos de manera que se almacenen juntos todos los elementos que comparten área en un momento dado. Esta segunda opción se conoce como Indexación Multidimensional (IM) y usa diferentes técnicas para agrupar y organizar datos similares. Por otro lado, se suele señalar que las analíticas aproximadas son una manera inteligente y flexible de explorar grandes conjuntos de datos en poco tiempo. Este tipo de analíticas incluyen una amplia familia de algoritmos que acelera el tiempo de procesado, relajando la precisión de los resultados dentro de un determinado intervalo de confianza. Por ejemplo, si queremos saber la edad media de un grupo con precisión de un año, podemos considerar sólo un subconjunto aleatorio de todas las personas, reduciendo así la cantidad de cálculo. Pero si además queremos menos operaciones de entrada/salida, necesitamos un muestreo eficiente de datos, que implica organizar los datos de manera que no necesitemos recorrerlos todos para generar una muestra aleatoria. De acuerdo con nuestros análisis, la combinación de Indexación Multidimensional con Muestreo eficiente de datos (IMM) es una característica vital que no está disponible en las soluciones actuales de gestión distribuida de datos. Esta tesis pretende resolver esta limitación y proporciona unas soluciones novedosas que son escalables. En primer lugar, describimos las alternativas de gestión de datos que existen y motivamos nuestra preferencia por las bases de datos NoSQL basadas en clave-valor. En segundo lugar, proponemos un modelo analítico para estudiar la influencia que tienen los modelos de datos sobre la escalabilidad y el rendimiento de este tipo de bases de datos distribuidas. En tercer lugar, usamos el modelo analítico para diseñar dos novedosos algoritmos IMM: el D8tree y el AOTree. Nuestra primera solución, el D8tree, mejora el estado del arte actual para consultas espaciales aproximadas, cuando el conjunto de datos es estático y mayoritariamente de lectura. Después, mejoramos la capacidad de ingestión introduciendo el AOTree, un algoritmo que conserva el rendimiento del D8tree incluso para aplicaciones HPC intensivas en escritura. Hemos comparado nuestra solución con PostgreSQL y almacenamiento plano demostrando que nuestra propuesta mejora tanto el rendimiento como la escalabilidad. Finalmente, describimos Qbeast, el sistema que implementa los algoritmos D8tree y AOTree, e ilustramos cómo Qbeast simplifica el flujo de trabajo de los científicos ofreciendo una solución escalable e integraPostprint (published version

    OASIS - Identifying the Core Attributes for RDBMS Alternatives

    Get PDF
    Since their introduction in the 1970s, relational database management systems have served as the dominate data storage technology. However, the demands of big data and Web 2.0 necessitated a change in the market, sparking the beginning of the NoSQL movement in the late 2000s. NoSQL databases exchanged the relational model and the guaranteed consistency of ACID transactions for improved performance and massive scalability [1]. While the benefits NoSQL provided proved useful, the lack of sufficient SQL functionality presented a major hurdle for organizations which require it to properly operate. It was clear that new RDBMS solutions which did not compromise functionality or scalability were necessary, which has led to the rise of a new class of modern relational database management systems, NewSQL [2]. This paper seeks to identify a consistent set of requirements necessary for an ideal RDBMS substitute. Among these requirements include possessing the features of a modern RDBMS, which includes support of the relational data model and standard ANSI SQL, ACID transactions, and ODBC/JDBC drivers. Additionally, the substitute must address typical RDBMS’ shortcomings in scalability by providing cost-effective scale-out capabilities. These requirements will then be used to filter out existing NoSQL and NewSQL database systems which could serve as viable substitutes to a typical RDBMS

    A framework for multidimensional indexes on distributed and highly-available data stores

    Get PDF
    Spatial Big Data is considered an essential trend in future scientific and business applications. Indeed, research instruments, medical devices, and social networks generate hundreds of peta bytes of spatial data per year. However, as many authors have pointed out, the lack of specialized frameworks dealing with such kind of data is limiting possible applications and probably precluding many scientific breakthroughs. In this thesis, we describe three HPC scientific applications, ranging from molecular dynamics, neuroscience analysis, and physics simulations, where we experience first hand the limits of the existing technologies. Thanks to our experience, we define the desirable missing functionalities, and we focus on two features that when combined significantly improve the way scientific data is analyzed. On one side, scientific simulations generate complex datasets where multiple correlated characteristics describe each item. For instance, a particle might have a space position (x,y,z) at a given time (t). If we want to find all elements within the same area and period, we either have to scan the whole dataset, or we must organize the data so that all items in the same space and time are stored together. The second approach is called Multidimensional Indexing (MI), and it uses different techniques to cluster and to organize similar data together. On the other side, approximate analytics has been often indicated as a smart and flexible way to explore large datasets in a short period. Approximate analytics includes a broad family of algorithms which aims to speed up analytical workloads by relaxing the precision of the results within a specific interval of confidence. For instance, if we want to know the average age in a group with 1-year precision, we can consider just a random fraction of all the people, thus reducing the amount of calculation. But if we also want less I/O operations, we need efficient data sampling, which means organizing data in a way that we do not need to scan the whole data set to generate a random sample of it. According to our analysis, combining Multidimensional Indexing with efficient data Sampling (MIS) is a vital missing feature not available in the current distributed data management solutions. This thesis aims to solve such a shortcoming and it provides novel scalable solutions. At first, we describe the existing data management alternatives; then we motivate our preference for NoSQL key-value databases. Secondly, we propose an analytical model to study the influence of data models on the scalability and performance of this kind of distributed database. Thirdly, we use the analytical model to design two novel multidimensional indexes with efficient data sampling: the D8tree and the AOTree. Our first solution, the D8tree, improves state of the art for approximate spatial queries on static and mostly read dataset. Later, we enhanced the data ingestion capability or our approach by introducing the AOTree, an algorithm that enables the query performance of the D8tree even for HPC write-intensive applications. We compared our solution with PostgreSQL and plain storage, and we demonstrate that our proposal has better performance and scalability. Finally, we describe Qbeast, the novel distributed system that implements the D8tree and the AOTree using NoSQL technologies, and we illustrate how Qbeast simplifies the workflow of scientists in various HPC applications providing a scalable and integrated solution for data analysis and management.La gestión de BigData con información espacial está considerada como una tendencia esencial en el futuro de las aplicaciones científicas y de negocio. De hecho, se generan cientos de petabytes de datos espaciales por año mediante instrumentos de investigación, dispositivos médicos y redes sociales. Sin embargo, tal y como muchos autores han señalado, la falta de entornos especializados en manejar este tipo de datos está limitando sus posibles aplicaciones y está impidiendo muchos avances científicos. En esta tesis, describimos 3 aplicaciones científicas HPC, que cubren los ámbitos de dinámica molecular, análisis neurocientífico y simulaciones físicas, donde hemos experimentado en primera mano las limitaciones de las tecnologías existentes. Gracias a nuestras experiencias, hemos podido definir qué funcionalidades serían deseables y no existen, y nos hemos centrado en dos características que, al combinarlas, mejoran significativamente la manera en la que se analizan los datos científicos. Por un lado, las simulaciones científicas generan conjuntos de datos complejos, en los que cada elemento es descrito por múltiples características correlacionadas. Por ejemplo, una partícula puede tener una posición espacial (x, y, z) en un momento dado (t). Si queremos encontrar todos los elementos dentro de la misma área y periodo, o bien recorremos y analizamos todo el conjunto de datos, o bien organizamos los datos de manera que se almacenen juntos todos los elementos que comparten área en un momento dado. Esta segunda opción se conoce como Indexación Multidimensional (IM) y usa diferentes técnicas para agrupar y organizar datos similares. Por otro lado, se suele señalar que las analíticas aproximadas son una manera inteligente y flexible de explorar grandes conjuntos de datos en poco tiempo. Este tipo de analíticas incluyen una amplia familia de algoritmos que acelera el tiempo de procesado, relajando la precisión de los resultados dentro de un determinado intervalo de confianza. Por ejemplo, si queremos saber la edad media de un grupo con precisión de un año, podemos considerar sólo un subconjunto aleatorio de todas las personas, reduciendo así la cantidad de cálculo. Pero si además queremos menos operaciones de entrada/salida, necesitamos un muestreo eficiente de datos, que implica organizar los datos de manera que no necesitemos recorrerlos todos para generar una muestra aleatoria. De acuerdo con nuestros análisis, la combinación de Indexación Multidimensional con Muestreo eficiente de datos (IMM) es una característica vital que no está disponible en las soluciones actuales de gestión distribuida de datos. Esta tesis pretende resolver esta limitación y proporciona unas soluciones novedosas que son escalables. En primer lugar, describimos las alternativas de gestión de datos que existen y motivamos nuestra preferencia por las bases de datos NoSQL basadas en clave-valor. En segundo lugar, proponemos un modelo analítico para estudiar la influencia que tienen los modelos de datos sobre la escalabilidad y el rendimiento de este tipo de bases de datos distribuidas. En tercer lugar, usamos el modelo analítico para diseñar dos novedosos algoritmos IMM: el D8tree y el AOTree. Nuestra primera solución, el D8tree, mejora el estado del arte actual para consultas espaciales aproximadas, cuando el conjunto de datos es estático y mayoritariamente de lectura. Después, mejoramos la capacidad de ingestión introduciendo el AOTree, un algoritmo que conserva el rendimiento del D8tree incluso para aplicaciones HPC intensivas en escritura. Hemos comparado nuestra solución con PostgreSQL y almacenamiento plano demostrando que nuestra propuesta mejora tanto el rendimiento como la escalabilidad. Finalmente, describimos Qbeast, el sistema que implementa los algoritmos D8tree y AOTree, e ilustramos cómo Qbeast simplifica el flujo de trabajo de los científicos ofreciendo una solución escalable e integr

    Big Data Now, 2015 Edition

    Get PDF
    Now in its fifth year, O’Reilly’s annual Big Data Now report recaps the trends, tools, applications, and forecasts we’ve talked about over the past year. For 2015, we’ve included a collection of blog posts, authored by leading thinkers and experts in the field, that reflect a unique set of themes we’ve identified as gaining significant attention and traction. Our list of 2015 topics include: Data-driven cultures Data science Data pipelines Big data architecture and infrastructure The Internet of Things and real time Applications of big data Security, ethics, and governance Is your organization on the right track? Get a hold of this free report now and stay in tune with the latest significant developments in big data
    corecore