926 research outputs found

    An Empirical Evaluation of Columnar Storage Formats

    Full text link
    Columnar storage is one of the core components of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support to open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed significantly. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. Our analysis identifies important considerations that may guide future formats to better fit modern technology trends

    Engineering for a Science-Centric Experimentation Platform

    Get PDF
    Netflix is an internet entertainment service that routinely employs experimentation to guide strategy around product innovations. As Netflix grew, it had the opportunity to explore increasingly specialized improvements to its service, which generated demand for deeper analyses supported by richer metrics and powered by more diverse statistical methodologies. To facilitate this, and more fully harness the skill sets of both engineering and data science, Netflix engineers created a science-centric experimentation platform that leverages the expertise of data scientists from a wide range of backgrounds by allowing them to make direct code contributions in the languages used by scientists (Python and R). Moreover, the same code that runs in production is able to be run locally, making it straightforward to explore and graduate both metrics and causal inference methodologies directly into production services. In this paper, we utilize a case-study research method to provide two main contributions. Firstly, we report on the architecture of this platform, with a special emphasis on its novel aspects: how it supports science-centric end-to-end workflows without compromising engineering requirements. Secondly, we describe its approach to causal inference, which leverages the potential outcomes conceptual framework to provide a unified abstraction layer for arbitrary statistical models and methodologies.Comment: 10 page

    New Secure IoT Architectures, Communication Protocols and User Interaction Technologies for Home Automation, Industrial and Smart Environments

    Get PDF
    Programa Oficial de Doutoramento en Tecnoloxías da Información e das Comunicacións en Redes Móbiles. 5029V01Tese por compendio de publicacións[Abstract] The Internet of Things (IoT) presents a communication network where heterogeneous physical devices such as vehicles, homes, urban infrastructures or industrial machinery are interconnected and share data. For these communications to be successful, it is necessary to integrate and embed electronic devices that allow for obtaining environmental information (sensors), for performing physical actuations (actuators) as well as for sending and receiving data (network interfaces). This integration of embedded systems poses several challenges. It is needed for these devices to present very low power consumption. In many cases IoT nodes are powered by batteries or constrained power supplies. Moreover, the great amount of devices needed in an IoT network makes power e ciency one of the major concerns of these deployments, due to the cost and environmental impact of the energy consumption. This need for low energy consumption is demanded by resource constrained devices, con icting with the second major concern of IoT: security and data privacy. There are critical urban and industrial systems, such as tra c management, water supply, maritime control, railway control or high risk industrial manufacturing systems such as oil re neries that will obtain great bene ts from IoT deployments, for which non-authorized access can posse severe risks for public safety. On the other hand, both these public systems and the ones deployed on private environments (homes, working places, malls) present a risk for the privacy and security of their users. These IoT deployments need advanced security mechanisms, both to prevent access to the devices and to protect the data exchanged by them. As a consequence, it is needed to improve two main aspects: energy e ciency of IoT devices and the use of lightweight security mechanisms that can be implemented by these resource constrained devices but at the same time guarantee a fair degree of security. The huge amount of data transmitted by this type of networks also presents another challenge. There are big data systems capable of processing large amounts of data, but with IoT the granularity and dispersion of the generated information presents a new scenario very di erent from the one existing nowadays. Forecasts anticipate that there will be a growth from the 15 billion installed devices in 2015 to more than 75 billion devices in 2025. Moreover, there will be much more services exploiting the data produced by these networks, meaning the resulting tra c will be even higher. The information must not only be processed in real time, but data mining processes will have to be performed to historical data. The main goal of this Ph.D. thesis is to analyze each one of the previously described challenges and to provide solutions that allow for an adequate adoption of IoT in Industrial, domestic and, in general, any scenario that can obtain any bene t from the interconnection and exibility that IoT brings.[Resumen] La internet de las cosas (IoT o Internet of Things) representa una red de intercomunicaciones en la que participan dispositivos físicos de toda índole, como vehículos, viviendas, electrodomésticos, infraestructuras urbanas o maquinaria y dispositivos industriales. Para que esta comunicación se pueda llevar a cabo es necesario integrar elementos electr onicos que permitan obtener informaci on del entorno (sensores), realizar acciones f sicas (actuadores) y enviar y recibir la informaci on necesaria (interfaces de comunicaciones de red). La integración y uso de estos sistemas electrónicos embebidos supone varios retos. Es necesario que dichos dispositivos presenten un consumo reducido. En muchos casos deberían ser alimentados por baterías o fuentes de alimentación limitadas. Además, la gran cantidad de dispositivos que involucra la IoT hace necesario que la e ciencia energética de los mismos sea una de las principales preocupaciones, por el coste e implicaciones medioambientales que supone el consumo de electricidad de los mismos. Esta necesidad de limitar el consumo provoca que dichos dispositivos tengan unas prestaciones muy limitadas, lo que entra en conflicto con la segunda mayor preocupación de la IoT: la seguridad y privacidad de los datos. Por un lado existen sistemas críticos urbanos e industriales, como puede ser la regulación del tráfi co, el control del suministro de agua, el control marítimo, el control ferroviario o los sistemas de producción industrial de alto riesgo, como refi nerías, que son claros candidatos a benefi ciarse de la IoT, pero cuyo acceso no autorizado supone graves problemas de seguridad ciudadana. Por otro lado, tanto estos sistemas de naturaleza publica, como los que se desplieguen en entornos privados (viviendas, entornos de trabajo o centros comerciales, entre otros) suponen un riesgo para la privacidad y también para la seguridad de los usuarios. Todo esto hace que sean necesarios mecanismos de seguridad avanzados, tanto de acceso a los dispositivos como de protección de los datos que estos intercambian. En consecuencia, es necesario avanzar en dos aspectos principales: la e ciencia energética de los dispositivos y el uso de mecanismos de seguridad e ficientes, tanto computacional como energéticamente, que permitan la implantación de la IoT sin comprometer la seguridad y la privacidad de los usuarios. Por otro lado, la ingente cantidad de información que estos sistemas puede llegar a producir presenta otros dos retos que deben ser afrontados. En primer lugar, el tratamiento y análisis de datos toma una nueva dimensión. Existen sistemas de big data capaces de procesar cantidades enormes de información, pero con la internet de las cosas la granularidad y dispersión de los datos plantean un escenario muy distinto al actual. La previsión es pasar de 15.000.000.000 de dispositivos instalados en 2015 a más de 75.000.000.000 en 2025. Además existirán multitud de servicios que harán un uso intensivo de estos dispositivos y de los datos que estos intercambian, por lo que el volumen de tráfico será todavía mayor. Asimismo, la información debe ser procesada tanto en tiempo real como a posteriori sobre históricos, lo que permite obtener información estadística muy relevante en diferentes entornos. El principal objetivo de la presente tesis doctoral es analizar cada uno de estos retos (e ciencia energética, seguridad, procesamiento de datos e interacción con el usuario) y plantear soluciones que permitan una correcta adopción de la internet de las cosas en ámbitos industriales, domésticos y en general en cualquier escenario que se pueda bene ciar de la interconexión y flexibilidad de acceso que proporciona el IoT.[Resumo] O internet das cousas (IoT ou Internet of Things) representa unha rede de intercomunicaci óns na que participan dispositivos físicos moi diversos, coma vehículos, vivendas, electrodomésticos, infraestruturas urbanas ou maquinaria e dispositivos industriais. Para que estas comunicacións se poidan levar a cabo é necesario integrar elementos electrónicos que permitan obter información da contorna (sensores), realizar accións físicas (actuadores) e enviar e recibir a información necesaria (interfaces de comunicacións de rede). A integración e uso destes sistemas electrónicos integrados supón varios retos. En primeiro lugar, é necesario que estes dispositivos teñan un consumo reducido. En moitos casos deberían ser alimentados por baterías ou fontes de alimentación limitadas. Ademais, a gran cantidade de dispositivos que se empregan na IoT fai necesario que a e ciencia enerxética dos mesmos sexa unha das principais preocupacións, polo custo e implicacións medioambientais que supón o consumo de electricidade dos mesmos. Esta necesidade de limitar o consumo provoca que estes dispositivos teñan unhas prestacións moi limitadas, o que entra en con ito coa segunda maior preocupación da IoT: a seguridade e privacidade dos datos. Por un lado existen sistemas críticos urbanos e industriais, como pode ser a regulación do tráfi co, o control de augas, o control marítimo, o control ferroviario ou os sistemas de produción industrial de alto risco, como refinerías, que son claros candidatos a obter benefi cios da IoT, pero cuxo acceso non autorizado supón graves problemas de seguridade cidadá. Por outra parte tanto estes sistemas de natureza pública como os que se despreguen en contornas privadas (vivendas, contornas de traballo ou centros comerciais entre outros) supoñen un risco para a privacidade e tamén para a seguridade dos usuarios. Todo isto fai que sexan necesarios mecanismos de seguridade avanzados, tanto de acceso aos dispositivos como de protección dos datos que estes intercambian. En consecuencia, é necesario avanzar en dous aspectos principais: a e ciencia enerxética dos dispositivos e o uso de mecanismos de seguridade re cientes, tanto computacional como enerxéticamente, que permitan o despregue da IoT sen comprometer a seguridade e a privacidade dos usuarios. Por outro lado, a inxente cantidade de información que estes sistemas poden chegar a xerar presenta outros retos que deben ser tratados. O tratamento e a análise de datos toma unha nova dimensión. Existen sistemas de big data capaces de procesar cantidades enormes de información, pero coa internet das cousas a granularidade e dispersión dos datos supón un escenario moi distinto ao actual. A previsión e pasar de 15.000.000.000 de dispositivos instalados no ano 2015 a m ais de 75.000.000.000 de dispositivos no ano 2025. Ademais existirían multitude de servizos que farían un uso intensivo destes dispositivos e dos datos que intercambian, polo que o volume de tráfico sería aínda maior. Do mesmo xeito a información debe ser procesada tanto en tempo real como posteriormente sobre históricos, o que permite obter información estatística moi relevante en diferentes contornas. O principal obxectivo da presente tese doutoral é analizar cada un destes retos (e ciencia enerxética, seguridade, procesamento de datos e interacción co usuario) e propor solucións que permitan unha correcta adopción da internet das cousas en ámbitos industriais, domésticos e en xeral en todo aquel escenario que se poda bene ciar da interconexión e flexibilidade de acceso que proporciona a IoT

    Design and evaluation of a scalable Internet of Things backend for smart ports

    Get PDF
    Internet of Things (IoT) technologies, when adequately integrated, cater for logistics optimisation and operations' environmental impact monitoring, both key aspects for today's EU ports management. This article presents Obelisk, a scalable and multi-tenant cloud-based IoT integration platform used in the EU H2020 PortForward project. The landscape of IoT protocols being particularly fragmented, the first role of Obelisk is to provide uniform access to data originating from a myriad of devices and protocols. Interoperability is achieved through adapters that provide flexibility and evolvability in protocol and format mapping. Additionally, due to ports operating in a hub model with various interacting actors, a second role of Obelisk is to secure access to data. This is achieved through encryption and isolation for data transport and processing, respectively, while user access control is ensured through authentication and authorisation standards. Finally, as ports IoTisation will further evolve, a third need for Obelisk is to scale with the data volumes it must ingest and process. Platform scalability is achieved by means of a reactive micro-services based design. Those three essential characteristics are detailed in this article with a specific focus on how to achieve IoT data platform scalability. By means of an air quality monitoring use-case deployed in the city of Antwerp, the scalability of the platform is evaluated. The evaluation shows that the proposed reactive micro-service based design allows for horizontal scaling of the platform as well as for logarithmic time complexity of its service time

    Engineering for a science-centric experimentation platform

    Get PDF
    Netflix is an internet entertainment service that routinely employs experimentation to guide strategy around product innovations. As Netflix grew, it had the opportunity to explore increasingly specialized improvements to its service, which generated demand for deeper analyses supported by richer metrics and powered by more diverse statistical methodologies. To facilitate this, and more fully harness the skill sets of both engineering and data science, Netflix engineers created a science-centric experimentation platform that leverages the expertise of scientists from a wide range of backgrounds working on data science tasks by allowing them to make direct code contributions in the languages used by them (Python and R). Moreover, the same code that runs in production is able to be run locally, making it straightforward to explore and graduate both metrics and causal inference methodologies directly into production services. In this paper, we provide two main contributions. Firstly, we report on the architecture of this platform, with a special emphasis on its novel aspects: how it supports science-centric end-to-end workflows without compromising engineering requirements. Secondly, we describe its approach to causal inference, which leverages the potential outcomes conceptual framework to provide a unified abstarction layer for arbitrary statistical models and methodologies

    Strategies for digital preservation of information

    Get PDF
    With the advent of the digital era, the digital information produced has grown exponentially. According to Seamus Ross, digital information is a cultural product [1]. The growing dependency on digital information is changing the way our culture is recorded. There is no longer a strict relation between the logical structure of information, the physical storage support and its interpretation. Internet provided the right environment not only for the thriving of new communities but also for the growth of the information produced by them. Software and hardware have also evolved. Along with them came new capabilities of producing more accurate and space demanding information. One good example is multimedia content, audio and video, but there are many more. This new reality rose an awareness for the need to preserve all this information for future generations to come. Unlike their analogue peers, digital formats require a different effort to maintain as they are exposed to such threats as the deterioration of the medium they are stored in or formats obsolescence. A number of actions must be taken to ensure their long-term access. To address this need, digital repositories have evolved, accommodating now sets of features capable of implementing different strategies for the digital preservation of information. This thesis presents an analysis of the current state of the art open source repository software for the digital preservation of information identifying the five most relevant solutions. From those solutions, we picked the most feature rich and broader user community software, RODA, to which we propose and implement further improvements to an existing preservation strategy: federation. These improvements consist in building into the system an interoperability mechanism capable of allowing RODA to interact with other systems. This improvement is made by implementing a prototype composed by a CMIS server inside the repository, which communicates with client applications through the implementation of the CMIS protocol. We name this prototype RODA OpenCMIS Server, or in short, RODA-OCS. The prototype allows RODA to expose contents stored inside it publicly, under a controlled environment, in an authenticated and secure way. This goal is achieved by integrating our prototype with RODA native permissions system. RODA-OCS not only implements a file browser mechanism for content navigation but also a query engine capable of searching and retrieving contents based on their metadata, either technical or descriptive. Finally, we present a demonstration of the functioning of RODA-OCS. Through the use of a dataset conceived to test the functionalitie

    Pet sense: sistema de monitorização de animais em hospitalização

    Get PDF
    The observation and treatment of animals in veterinary hospitals is still very dependent on manual procedures, including the collection of vital signs (temperature, heart rate, respiratory rate and blood pressure). These manual procedures are time-consuming and invasive, affecting the animal’s well-being. In this work, we purpose the use of IoT technologies to monitor animals in hospitalization, wearing sensors to collect vitals, and low-cost hardware to forward them into a cloud backend that analyses and stores data. The history of observed vitals and alarms can be accessed in the web, included in the Pet Universal software suite. The overall architecture follows a stream processing approach, using telemetry protocols to transport data, and Apache Kafka Streams to analyse streams and trigger alarms on potential hazard conditions. The system was fully implemented, although with laboratory sensors to emulate the smart devices to be worn by the animals. We were able to implement a data gathering and processing pipeline and integrate with the existing clinical management information system. The proposed solution can offer a practical way for long-term monitoring and detect abnormal values of temperature and heart rate in hospitalized animals, taking into consideration the characteristics of the monitored individual (species and state).A observação e tratamento de animais hospitalizados continua muito dependente de procedimentos manuais, especialmente no que diz respeito à colheita de sinais vitais (temperatura, frequência cardíaca, frequência respiratória e pressão arterial). Estes procedimentos manuais são dispendiosos em termos de tempo e afetam o bem-estar do animal. Neste projeto, propomos o recurso a tecnologias IoT para monitorizar animais hospitalizados equipados com sensores que medem sinais vitais, com hardware acessível, e envio dos dados para um serviço na cloud que os analisa e armazena. O histórico dos valores e alarmes podem ser acedidos na web e incluídos na plataforma comercial da Pet Universal. A arquitetura geral segue uma abordagem de processamento funcional, usando protocolos de telemetria para transportar dados e Apache Kafka Streams, analisando e lançando alarmes sobre potenciais condições de risco de acordo com a temperatura e pulsação. O sistema foi totalmente implementado, embora com sensores de laboratório para simular os dispositivos a serem usados pelos animais. Conseguimos implementar um circuito de colheita e processamento de dados e integrar com o sistema de gestão clínica já existente. A solução proposta oferece uma forma prática de monitorização continuada e de deteção de valores anormais de temperatura e frequência cardíaca em animais hospitalizados, tomando em conta as características do indivíduo monitorado (espécie e estado).Mestrado em Engenharia Informátic

    Description and Experience of the Clinical Testbeds

    Get PDF
    This deliverable describes the up-to-date technical environment at three clinical testbed demonstrator sites of the 6WINIT Project, including the adapted clinical applications, project components and network transition technologies in use at these sites after 18 months of the Project. It also provides an interim description of early experiences with deployment and usage of these applications, components and technologies, and their clinical service impact
    corecore