7 research outputs found

    Online Person Identification based on Multitask Learning

    In the digital world, everything is digitized and data are generated continuously over time. Incremental learning plays an important role in dealing with this situation, and person identification is one of its important applications. At the same time, passwords and codes are no longer the only way to prevent unauthorized access to information, and they tend to be forgotten; biometric systems were therefore introduced to address these problems. However, recognition based on a single biometric may not be effective, so multitask learning is needed. To solve these problems, incremental learning is applied to person identification based on multitask learning. Since complete data cannot be collected at one time, online learning is adopted to update the system accordingly. Linear Discriminant Analysis (LDA) is used to create a feature space, while Incremental LDA (ILDA) is adopted to update it. Through multitask learning, not only human faces but also fingerprint images are trained in order to improve performance. The system is evaluated on 50 datasets, including both male and female subjects. Experimental results demonstrate that ILDA learns faster than LDA. In addition, the learning accuracies, evaluated with K-Nearest Neighbor (KNN), exceed 80% in most of the simulations. In the future, the system could be improved with better sensors for all biometrics, and incremental feature extraction could be extended to deal with other online learning problems.
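The pipeline the abstract describes, an LDA feature space queried with KNN, can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: a two-class Fisher LDA projection in pure Python, followed by a k-nearest-neighbour vote on the projected values. The toy 2-D feature vectors stand in for face or fingerprint features, and the incremental (ILDA) update step is omitted.

```python
# Hypothetical sketch: Fisher LDA projection + KNN classification
# for two classes with 2-D features, pure standard library.

def mean(vs):
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(2)]

def scatter(vs, m):
    # 2x2 within-class scatter matrix
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vs:
        d = [v[0] - m[0], v[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def lda_direction(class_a, class_b):
    # Fisher direction w = Sw^-1 (m_b - m_a), 2x2 inverse done by hand
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = [mb[0] - ma[0], mb[1] - ma[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

def project(w, v):
    return w[0] * v[0] + w[1] * v[1]

def knn_1d(train, label_of, x, k=3):
    # k-nearest-neighbour vote on the 1-D projected feature
    ranked = sorted(train, key=lambda t: abs(t - x))
    votes = [label_of[t] for t in ranked[:k]]
    return max(set(votes), key=votes.count)

# Toy "enrollment" data: invented feature vectors for two persons.
class_a = [[1.0, 2.0], [2.0, 1.0], [1.0, 1.0]]
class_b = [[6.0, 7.0], [7.0, 6.0], [7.0, 7.0]]
w = lda_direction(class_a, class_b)
label_of = {project(w, v): "A" for v in class_a}
label_of.update({project(w, v): "B" for v in class_b})
print(knn_1d(list(label_of), label_of, project(w, [6.0, 6.0])))  # -> B
```

An incremental variant (ILDA) would update the class means and scatter matrices in place as new samples arrive instead of recomputing them from the full dataset.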


    NebulaStream: Complex Analytics Beyond the Cloud

    The emerging Internet of Things (IoT) will require significant changes to current stream processing engines (SPEs) to enable large-scale IoT applications. In this paper, we present challenges and opportunities for an IoT data management system that enables complex analytics beyond the cloud. As one of the most important upcoming IoT applications, we focus on the vision of a smart city. The goal of this paper is to bridge the gap between the requirements of upcoming IoT applications and the features supported by an IoT data management system. To this end, we outline how state-of-the-art SPEs have to change to exploit the new capabilities of the IoT, and showcase how we tackle IoT challenges in our own system, NebulaStream. This paper lays the foundation for a new type of system that leverages the IoT to enable large-scale applications over millions of IoT devices in highly dynamic and geo-distributed environments.

    Performance Optimizations and Operator Semantics for Streaming Data Flow Programs

    Modern companies collect more data and require insights from it faster than ever before. Relational databases do not meet the requirements for processing the often unstructured data sets with reasonable performance. The database research community started to address these trends in the early 2000s, and two new research directions have attracted major interest since: large-scale non-relational data processing and low-latency data stream processing. Large-scale non-relational data processing, commonly known as "Big Data" processing, was quickly adopted in industry. In parallel, low-latency data stream processing was mainly driven by the research community, which developed new systems that embrace a distributed architecture and scalability and exploit data parallelism. While these systems have gained more and more attention in industry, there are still major challenges in operating them at large scale. The goal of this dissertation is two-fold: first, to investigate the runtime characteristics of large-scale data-parallel distributed streaming systems; and second, to propose the "Dual Streaming Model" to express the semantics of continuous queries over data streams and tables. Our goal is to improve the understanding of system and query runtime behavior with the aim of provisioning queries automatically. We introduce a cost model for streaming data flow programs that takes into account the two techniques of record batching and data parallelization, along with optimization algorithms that leverage this model for cost-based query provisioning. The proposed Dual Streaming Model expresses the result of a streaming operator as a stream of successive updates to a result table, inducing a duality between streams and tables. Our model handles the discrepancy between the logical and the physical order of records within a data stream natively, which allows for deterministic semantics as well as low-latency query execution.
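The stream/table duality at the heart of the Dual Streaming Model can be illustrated with a minimal sketch (a hypothetical example, not the dissertation's implementation): an aggregating operator whose output is a changelog of updates to its result table. Using a commutative aggregate such as a per-key count also shows why a mismatch between physical arrival order and logical order need not change the final table.

```python
# Hypothetical sketch of the stream/table duality: an aggregating
# operator emits its result as a changelog stream of table updates.

def aggregate_count(stream):
    table = {}          # current result table: key -> count
    changelog = []      # output stream: successive updates to the table
    for key, _value in stream:
        table[key] = table.get(key, 0) + 1
        changelog.append((key, table[key]))  # each input row updates one table row
    return table, changelog

events = [("a", 1), ("b", 1), ("a", 1)]
table, updates = aggregate_count(events)
# table   == {"a": 2, "b": 1}
# updates == [("a", 1), ("b", 1), ("a", 2)]
```

Because counting is commutative, a late-arriving record simply produces one more update to the same table row; downstream consumers see a deterministic final table regardless of the physical arrival order.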

    Big Data na gestão eficiente das Smart Grids. HDS: Uma Plataforma Híbrida, Dinâmica e Inteligente

    In recent years, there has been an exponential increase in the information generated and made available every day. Due to rapid technological advancement (e.g., mobile devices, sensors, wireless communication), billions and billions of bytes are created every day. This phenomenon, called Big Data, is characterized by 5 Vs (i.e., Volume, Velocity, Variety, Veracity, Value), and each represents real challenges (e.g., how to collect and transport a large volume of information; how to store it; how to mine it, analyze it and extract knowledge; how to ensure its security and privacy; how to process it in real time). The scientific community unanimously agrees that the value to be extracted from all this information will be a factor of extreme importance for decision making, determining the success of the most varied economic areas, as well as the resolution of numerous problems. These areas include the energy ecosystem, which for ecological, economic and political reasons has led us to rethink the way we consume and produce energy. Due to the increase in energy needs caused by technological advances, the expected depletion of non-renewable energy resources and the energy efficiency directives imposed by the European Union, many studies have been carried out in the area of energy resource management. The term Smart Grid has emerged in the last decades to define an intelligent energy ecosystem, which aims not only to integrate intelligence but also to automate the extremely complex operation of all its processes. Smart Grids have been the subject of major studies and investments, which have resulted in significant advances. However, some challenges remain, notably in the management of their complex data flow. It is in this context that the present dissertation falls, with the main objective of obtaining solutions to some of the problems identified in the field of Smart Grids using new techniques and methodologies proposed in the area of Big Data. This work presents a study of the recent and growing technological advances in the area of Big Data, identifying its major challenges. These include the complexity of managing continuous and out-of-order flows, the need to reduce the time spent on data pre-preparation, and the challenge of exploring solutions that provide analytical automation. The study also analyzes the impact of the new technologies on the development of Smart Grids, concluding that, although embryonic, their application is essential for the evolution of the energy ecosystem. This study also identified the main challenges in the area of Smart Grids, notably the complexity of managing their data flow in real time and the need to improve the accuracy of energy consumption and production forecasts. Given these challenges, a conceptual model based on the Docker container architecture was proposed for the development of a platform. This model aims at flexibility and agility, allowing the integration and validation of the new and growing technological approaches proposed in the area of Big Data that are necessary for the development of Smart Grids. To validate the proposed model, a stack was developed in which several services were implemented to address the challenges identified in the areas of Big Data and Smart Grids, namely: visualization and monitoring of data collected in real time; preparation of data collected in real time; real-time forecasting of multiple time series simultaneously; anomaly detection; evaluation of forecast accuracy; and generation of new models for forecasting energy consumption and production according to certain criteria. Finally, a number of case studies were developed whose results allowed us to draw conclusions on the importance of pre-preparing the data in the analytical phase, on the efficiency of analytical automation, and on the advantages of Edge Analytics. Unlike more traditional approaches, which execute the analytic process centrally, edge analytics explores the possibility of performing data analysis in a decentralized way at non-central points of the system. The results allowed us to conclude that edge analytics brings added advantages to forecast precision. They also allowed us to infer how to collect the data in order to obtain better forecast precision, i.e., the more specific and context-adjusted the forecasts, the greater their accuracy.
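The edge-analytics idea described above, running forecasts at the data source and shipping only the results upstream, can be sketched as follows. This is a hypothetical toy example (the meter names and the naive moving-average forecaster are invented for illustration), not the dissertation's platform:

```python
# Hypothetical sketch of edge analytics: each node forecasts locally
# from its own readings and ships only the forecast upstream, instead
# of sending raw data to a central analytics process.

def moving_average_forecast(readings, window=3):
    # naive one-step-ahead forecast: mean of the last `window` readings
    recent = readings[-window:]
    return sum(recent) / len(recent)

node_readings = {
    "meter_1": [10.0, 12.0, 11.0, 13.0],
    "meter_2": [5.0, 5.5, 6.0, 6.5],
}
# edge step: a single small number per node crosses the network
edge_forecasts = {n: moving_average_forecast(r) for n, r in node_readings.items()}
# edge_forecasts == {"meter_1": 12.0, "meter_2": 6.0}
```

The design trade-off matches the abstract's finding: each forecast is computed on the specific context of one node, which is what makes it more precise than a single model fitted centrally over pooled, heterogeneous readings.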

    Design and implementation of an efficient data stream processing system

    In standard database scenarios, an end user assumes that all data (e.g., sensor readings) is stored in a database, so one can simply submit arbitrarily complex processing in the form of SQL queries or stored procedures to a database server. Data stream oriented applications, however, typically deal with huge volumes of data, and storing the data for off-line processing can be costly, time consuming and impractical. This work describes our research results in designing and implementing an efficient data management system for online and off-line processing of data streams in the field of environmental monitoring. Our target data sources are wireless sensor networks. Although our focus is on a specific application domain, the results of this thesis are designed in a generic way, so that they can be applied to a wide variety of data stream oriented applications. This thesis starts by presenting the state of the art in data stream processing research, specifically window processing concepts, continuous queries, stream filtering query languages, and in-network data processing (with a particular focus on TinyOS-based approaches). We present key existing data stream processing engines and their internal architectures, and compare them to our platform, the Global Sensor Network (GSN) middleware. GSN enables fast and flexible deployment and interconnection of sensor networks, provides simple and uniform access to a comprehensive set of heterogeneous technologies, offers zero-programming deployment and data-oriented integration of sensor networks, and supports dynamic reconfiguration and adaptation at runtime. We present the virtual sensor concept, which offers a high-level view of arbitrary stream data sources, together with its powerful declarative specification and query tools. Furthermore, we describe the design, conceptual, architectural and optimization decisions of the GSN platform in detail.
In order to achieve high efficiency while processing large volumes of streaming data with window-based continuous queries, we present a set of optimization algorithms and techniques to intelligently group and process different types of continuous queries. While adapting GSN to large-scale sensor network deployments, we encountered several performance bottlenecks. One challenge was the scalable delivery of streaming data for high-rate streams: we found that we could dramatically improve the performance of the query processor by grouping similar user queries, thus sharing both processing and memory costs among them. We encountered a similar performance issue while scheduling continuous queries. The problem of efficiently scheduling the execution of continuous queries with window and slide parameters is not addressed in depth in the literature, and it becomes severe for large volumes of high-rate streams. In these cases, an efficient query scheduler not only increases performance by at least an order of magnitude but also decreases response time and memory requirements. Finally, we present how our GSN platform integrates with an external data sharing and visualization framework, Microsoft's SenseWeb platform, which provides a sensor network data gathering and visualization infrastructure globally accessible to end users. This integration (initiated by the Swiss Experiment project and demanded by GSN users) not only shows the scalability of the GSN platform when combined with optimized algorithms, but also demonstrates its flexibility.
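The query-grouping optimization described above, sharing one window buffer and its materialization among continuous queries with identical window parameters, can be sketched roughly like this. It is a hypothetical illustration, not GSN's actual code; the tuple-count windows and `run_grouped` helper are invented for the example:

```python
# Hypothetical sketch of grouping continuous queries that share the same
# window parameters, so the window buffer and its materialization are
# maintained once per group rather than once per query.
from collections import defaultdict, deque

def run_grouped(queries, stream):
    # queries: list of (window_size, slide, aggregate_fn, name),
    # with count-based windows for simplicity
    groups = defaultdict(list)
    for size, slide, fn, name in queries:
        groups[(size, slide)].append((fn, name))
    buffers = {k: deque(maxlen=k[0]) for k in groups}  # one buffer per group
    counts = {k: 0 for k in groups}
    results = defaultdict(list)
    for x in stream:
        for (size, slide), members in groups.items():
            buf = buffers[(size, slide)]
            buf.append(x)
            counts[(size, slide)] += 1
            if len(buf) == size and counts[(size, slide)] % slide == 0:
                window = list(buf)          # materialize the window once
                for fn, name in members:    # share it among all grouped queries
                    results[name].append(fn(window))
    return dict(results)

queries = [(3, 1, max, "max3"), (3, 1, min, "min3")]
out = run_grouped(queries, [1, 2, 3, 4])
# out == {"max3": [3, 4], "min3": [1, 2]}
```

Both queries here share the `(size=3, slide=1)` group, so the stream is buffered and each window materialized once instead of twice; with thousands of similar queries, this is the kind of sharing that yields the memory and processing savings the abstract reports.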