
    Outlier Detection In Big Data

    The dissertation focuses on scaling outlier detection to both huge static and dynamic streaming datasets. Outliers are patterns in the data that do not conform to expected behavior. Outlier detection techniques are broadly applied in applications ranging from credit-fraud prevention and network intrusion detection to tactical planning for stock investment. For such mission-critical applications a timely response is often of paramount importance, yet processing outlier detection requests is algorithmically complex and resource-intensive. In this dissertation we investigate the challenges of detecting outliers in big data, in particular those caused by the high velocity of streaming data, the large volume of static data, and the large cardinality of the input parameter space for tuning outlier mining algorithms. Effective optimization techniques are proposed to ensure the responsiveness of outlier detection in big data.

    First, we propose a novel optimization framework called LEAP to continuously detect outliers over data streams. The continuous discovery of outliers is critical for the large range of online applications that monitor high-volume, continuously evolving streaming data. LEAP encompasses two general optimization principles that exploit the rarity of outliers and the temporal priority relationships among stream data points. Leveraging these two principles, LEAP not only continuously delivers outliers with respect to a set of popular outlier models, but also provides near-real-time support for powerful outlier analytics workloads composed of large numbers of outlier mining requests with varied parameter settings.

    Second, we develop a distributed approach to efficiently detect outliers over massive static datasets. In this big data era, as data volumes advance to new levels, the power of distributed compute clusters must be employed to detect outliers within a short turnaround time. Our approach optimizes the key factors determining the efficiency of distributed data analytics, namely communication costs and load balancing. In particular, we prove that the traditional frequency-based load-balancing assumption is not effective, and we design a novel cost-driven data partitioning strategy that achieves load balancing. Furthermore, we abandon the traditional one-detection-algorithm-for-all-nodes approach and instead propose a multi-tactic methodology that adaptively selects the most appropriate algorithm for each node based on the characteristics of its assigned data partition.

    Third, traditional outlier detection systems process each individual outlier detection request, instantiated with a particular parameter setting, one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to home in on the most appropriate parameter setting or the desired results. We therefore design an interactive outlier exploration paradigm that not only answers traditional outlier detection requests in near real time, but also offers innovative outlier analytics tools to help analysts quickly extract, interpret, and understand the outliers of interest.

    Our experimental studies, including performance evaluations and user studies conducted on real-world stock, sensor, moving-object, and geolocation datasets, confirm both the effectiveness and efficiency of the proposed approaches.
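
    As an illustration of the two LEAP principles, the sketch below shows distance-based outlier detection over a count-based sliding window; it is a minimal simplification under assumed parameters (1-D points, radius r, neighbour count k), not the dissertation's implementation. Probing the window from newest to oldest exploits temporal priority (a neighbour that arrives after a point can never expire before it), while the early exit once k neighbours are found exploits the rarity of outliers.

        # Minimal sketch: distance-based stream outlier detection.
        # Assumed, illustrative parameters: radius r, neighbour count k.
        from collections import deque

        def is_inlier(p, window, r, k):
            """Probe current window points newest-first; stop at k neighbours."""
            count = 0
            for q in reversed(window):          # temporal priority: newest first
                if abs(q - p) <= r:
                    count += 1
                    if count >= k:              # rarity: most points exit early
                        return True
            return False

        def stream_outliers(stream, window_size=1000, r=0.5, k=3):
            """Yield (point, flagged_as_outlier) for each arriving point."""
            window = deque(maxlen=window_size)  # oldest points expire automatically
            for p in stream:
                flagged = not is_inlier(p, window, r, k)
                window.append(p)
                yield p, flagged

        print(list(stream_outliers([1.0, 1.1, 1.2, 9.9], window_size=3, r=0.5, k=1)))
        # [(1.0, True), (1.1, False), (1.2, False), (9.9, True)]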

    Processing Rank-Aware Queries in Schema-Based P2P Systems

    In recent years there has been considerable research on query processing in data integration and P2P systems. Conventional data integration systems consist of multiple sources with possibly different schemas, adhere to a hierarchical structure, and have a central component, the mediator, that manages a global schema. Queries are formulated against this global schema, and the mediator processes them by retrieving relevant data from the sources transparently to the user. Arising from these systems, Peer Data Management Systems (PDMSs), also known as schema-based P2P systems, have attracted attention. Peers participating in a PDMS can act both as mediators and as data sources, are autonomous, and may leave or join the network at will. For these reasons peers often hold incomplete or erroneous data sets and mappings. The potentially huge amount of data available in such a network typically results in large query result sets that are hard to manage, so retrieving the complete result set is in most cases difficult or even impossible. Applying rank-aware query operators such as top-N and skyline, possibly in conjunction with approximation techniques, is a remedy for these problems, as these operators select only those result records that are most relevant to the user according to user-defined ranking functions. Since in most cases only a small fraction of the complete result set is actually output to the user, retrieving the complete set before evaluating such operators is obviously inefficient. The questions we answer in this dissertation are therefore how to compute such queries in PDMSs, and how to do so efficiently.

    We propose strategies for efficient query processing in PDMSs that exploit the characteristics of rank-aware queries and optionally apply approximation techniques. A peer's relevance is determined on two levels, the schema level and the data level, and according to its relevance a peer is either included in or excluded from query processing. Because of the peers' heterogeneity, queries need to be rewritten to enable cooperation between peers that use different schemas. As existing query rewriting techniques mostly consider conjunctive queries only, we present an extension that supports queries involving rank-aware operators. As PDMSs are dynamic systems in which peers may change their data at any time, this dissertation addresses not only how routing indexes are used to determine a peer's relevance on the data level, but also how they can be kept up to date. Finally, we provide a system-level evaluation by presenting SmurfPDMS (SiMUlating enviRonment For Peer Data Management Systems), a system created in the context of this dissertation that implements all presented techniques.
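
    To make the two rank-aware operators concrete, the following sketch shows a top-N operator driven by a user-defined scoring function and a block-nested-loops skyline over tuples of numeric attributes where smaller values are preferred. The data and names are illustrative assumptions, not SmurfPDMS's API.

        # Minimal sketch of rank-aware operators; illustrative names and data.
        import heapq

        def top_n(records, score, n):
            """Return the n records with the smallest user-defined score."""
            return heapq.nsmallest(n, records, key=score)

        def dominates(a, b):
            """a dominates b: no worse in every dimension, better in at least one."""
            return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

        def skyline(records):
            """Block-nested-loops skyline: keep records no other record dominates."""
            result = []
            for r in records:
                if any(dominates(s, r) for s in result):
                    continue                  # r is dominated, discard it
                result = [s for s in result if not dominates(r, s)] + [r]
            return result

        # Example: hotels as (price, distance), both minimised.
        hotels = [(50, 8), (80, 2), (60, 5), (90, 9)]
        print(skyline(hotels))                                # [(50, 8), (80, 2), (60, 5)]
        print(top_n(hotels, lambda h: h[0] + 10 * h[1], 2))   # [(80, 2), (60, 5)]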

    Scalable diversification for data exploration platforms


    Recommendation Support for Multi-Attribute Databases


    Contributions in image and video coding

    Advisor: Max Henrique Machado Costa. Thesis (doctorate) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Doctorate in Electrical Engineering, Telecommunications and Telematics.

    The image and video coding community has been working on advances that go beyond traditional image and video coding architectures. This work is a set of contributions to various topics that have received increasing attention from researchers in the community, namely scalable coding, low-complexity coding for portable devices, multiview video coding, and run-time adaptive coding.

    The first contribution studies the performance of three fast block-based 3-D transforms in a low-complexity video codec, named the Fast Embedded Video Codec (FEVC). New implementation methods and scanning orders are proposed for the transforms. The 3-D coefficients are encoded bit plane by bit plane by entropy coders, producing a fully embedded output bitstream. All implementation is performed using 16-bit integer arithmetic; only additions and bit shifts are necessary, which lowers the computational complexity. Even with these constraints, reasonable rate versus distortion performance can be achieved, and the encoding time is significantly smaller (around 160 times) than that of the H.264/AVC standard.

    The second contribution is the optimization of a recently proposed approach to multiview video coding in videoconferencing and other similar unicast-like applications. The target scenario is providing realistic 3-D video with free viewpoint at good compression rates. To achieve this objective, weights are computed for each view and mapped into quantization parameters. In this work, the previously proposed ad hoc mapping between weights and quantization parameters is shown to be quasi-optimal for a Gaussian source, and an optimal mapping is derived for a typical video source.

    The third contribution explores several strategies for adaptive scanning of transform coefficients in the JPEG XR standard. The original global adaptive scanning order applied in JPEG XR is compared with the localized and hybrid scanning methods proposed in this work. These new orders require changes neither in the other coding and decoding stages nor in the bitstream definition.

    The fourth and last contribution proposes a hierarchical, signal-dependent block-based transform. Hierarchical transforms usually exploit residual cross-level information at the entropy coding step, but not at the transform step. The transform proposed in this work is an energy compaction technique that can also exploit these cross-resolution structural similarities. The core idea is to include in the hierarchical transform a number of adaptive basis functions derived from the lower resolution of the signal. A full image codec was developed to measure the performance of the new transform, and the obtained results are discussed in this work.
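
    The adaptive-scanning idea behind the third contribution can be illustrated with a small sketch. In the simplified model below, loosely patterned on JPEG XR's adaptive scan but not the standard's exact rule, the scan order is updated on the fly: whenever a position yields a nonzero coefficient more often than the position just before it in the scan, the two are swapped, so frequently nonzero positions drift toward the front. The block data are assumed for illustration.

        # Minimal sketch of adaptive coefficient scanning (illustrative rule,
        # not the JPEG XR specification).

        def scan_block(block, order, totals):
            """Scan one flattened 4x4 block in the current adaptive order,
            updating per-position nonzero statistics as we go."""
            coeffs = [block[pos] for pos in order]
            for i in range(len(order)):
                pos = order[i]
                if block[pos] != 0:
                    totals[pos] += 1
                    # Bubble frequently nonzero positions toward the front.
                    if i > 0 and totals[pos] > totals[order[i - 1]]:
                        order[i - 1], order[i] = order[i], order[i - 1]
            return coeffs

        order = list(range(16))      # start from raster order of a 4x4 block
        totals = [0] * 16            # per-position nonzero counts
        blocks = [                   # two toy coefficient blocks (assumed data)
            [9, 0, 0, 3, 0, 0, 0, 1] + [0] * 8,
            [7, 0, 0, 5, 0, 2, 0, 0] + [0] * 8,
        ]
        for block in blocks:
            print(scan_block(block, order, totals), order[:4])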

    An integrated associative processing system

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1994. Includes bibliographical references (p. 97-105). By Frederick Paul Herrmann.

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II

    Fruit Detection and Tree Segmentation for Yield Mapping in Orchards

    Accurate information gathering and processing is critical for precision horticulture, as growers aim to optimise their farm management practices. An accurate inventory of the crop that details its spatial distribution, along with health and maturity, can help farmers efficiently target processes such as chemical and fertiliser spraying, crop thinning, harvest management, labour planning and marketing. Growers have traditionally obtained this information through manual sampling techniques, which tend to be labour intensive, spatially sparse, expensive, inaccurate and prone to subjective biases. Recent advances in sensing and automation for field robotics allow key measurements to be made for individual plants throughout an orchard in a timely and accurate manner. Farmer-operated machines or unmanned robotic platforms can be equipped with a range of sensors to capture a detailed representation over large areas. Robust and accurate data processing techniques are therefore required to extract the high-level information needed by the grower to support precision farming.

    This thesis focuses on yield mapping in orchards using image and light detection and ranging (LiDAR) data captured with an unmanned ground vehicle (UGV). The contribution is a framework and its algorithmic components for orchard mapping and yield estimation, applicable to different fruit types and orchard configurations. The framework detects fruits in individual images and tracks them over subsequent frames. The fruit counts are then associated with individual trees, which are segmented from image and LiDAR data, resulting in a structured spatial representation of yield.

    The first contribution of this thesis is a generic and robust fruit detection algorithm. Images captured in the outdoor environment are subject to highly variable external factors that lead to significant appearance variations; in orchards specifically, variability is caused by changes in illumination, target pose, tree types, etc. The proposed techniques address these issues by using state-of-the-art feature learning approaches for image classification, while investigating the utility of orchard domain knowledge for fruit detection. Detection is performed using both pixel-wise classification of images followed by instance segmentation, and bounding-box regression approaches. The experimental results illustrate the versatility of complex deep learning approaches over a multitude of fruit types.

    The second contribution of this thesis is a tree segmentation approach to detect the individual trees that serve as the standard unit for structured orchard information systems. The work focuses on trellised trees, which present unique challenges for segmentation algorithms due to their intertwined nature. LiDAR data are used to segment the trellis face and to generate proposals for individual tree trunks. Additional trunk proposals are provided by pixel-wise classification of the image data. The multi-modal observations are fine-tuned by modelling trunk locations with a hidden semi-Markov model (HSMM), within which prior knowledge of tree spacing is incorporated.

    The final component of this thesis addresses the visual occlusion of fruit within geometrically complex canopies by using a multi-view detection and tracking approach. Single-image fruit detections are tracked over a sequence of images and associated with individual trees or farm rows, with the spatial distribution of the fruit counts forming a yield map over the farm. The results show the advantage of using multi-view imagery, instead of single-view analysis, for fruit counting and yield mapping.

    This thesis includes extensive experimentation in almond, apple and mango orchards, with data captured by a UGV spanning a total of 5 hectares of farm area, over 30 km of vehicle traversal and more than 7,000 trees. The different processes are validated using manual annotations, including fruit and tree locations in image and LiDAR data respectively. Yield mapping is further evaluated by comparison against fruit counts on trees at the farm and counts made by the growers post-harvest. The framework developed in this thesis is demonstrated to be accurate compared to ground truth at all scales of the pipeline, including fruit detection and tree mapping, leading to accurate yield estimation, per tree and per row, for the different crops. Through the multitude of field experiments conducted over multiple seasons and years, the thesis presents key practical insights necessary for the commercial development of an information gathering system in orchards.
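
    The final association step of the pipeline, turning tracked fruit into a per-tree yield map, can be sketched as follows. The example assumes fruit tracks and trunk detections have already been reduced to 1-D along-row coordinates; the data and function names are illustrative placeholders, not the thesis's format.

        # Minimal sketch: associate fruit tracks to the nearest segmented trunk.
        from collections import Counter

        def yield_map(track_positions, trunk_positions):
            """Assign each fruit track to its nearest trunk and count per tree."""
            counts = Counter()
            for x in track_positions:
                nearest = min(range(len(trunk_positions)),
                              key=lambda i: abs(trunk_positions[i] - x))
                counts[nearest] += 1
            return counts

        trunks = [0.0, 3.5, 7.0, 10.5]              # assumed ~3.5 m tree spacing
        fruit = [0.4, 0.9, 3.2, 3.8, 4.1, 6.7, 9.9, 10.2]
        print(yield_map(fruit, trunks))             # Counter({1: 3, 0: 2, 3: 2, 2: 1})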