    Harvesting Discriminative Meta Objects with Deep CNN Features for Scene Classification

    Recent work on scene classification still uses generic CNN features in a rudimentary manner. In this ICCV 2015 paper, we present a novel pipeline built upon deep CNN features to harvest discriminative visual objects and parts for scene classification. We first use a region proposal technique to generate a set of high-quality patches potentially containing objects, and apply a pre-trained CNN to extract generic deep features from these patches. We then perform both unsupervised and weakly supervised learning to screen these patches and discover discriminative ones representing category-specific objects and parts. We further apply discriminative clustering, enhanced with local CNN fine-tuning, to aggregate similar objects and parts into groups called meta objects. A scene image representation is constructed by pooling the feature response maps of all the learned meta objects at multiple spatial scales. Experiments confirm that the scene image representation obtained with this pipeline delivers state-of-the-art performance on two popular scene benchmark datasets, MIT Indoor 67~\cite{MITIndoor67} and Sun397~\cite{Sun397}.
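    The final pooling step can be illustrated with a minimal sketch. It assumes max pooling over a grid pyramid (1x1 and 2x2 cells); the abstract does not state the pooling operator or the scales, so the function names `pool_response_map` and `image_representation` and the `levels` parameter are hypothetical choices made only for illustration.

```python
def pool_response_map(response, levels=(1, 2)):
    """Max-pool one 2D response map over a grid pyramid.

    response: list of rows (H x W) of floats, the response of one meta-object
              detector over the image (assumed input format).
    levels:   grid sizes; level g splits the map into g x g cells.
              Assumes H >= g and W >= g so that no cell is empty.
    Returns a flat list with sum(g * g for g in levels) pooled values.
    """
    height, width = len(response), len(response[0])
    pooled = []
    for g in levels:
        for i in range(g):
            r0, r1 = i * height // g, (i + 1) * height // g
            for j in range(g):
                c0, c1 = j * width // g, (j + 1) * width // g
                pooled.append(max(response[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def image_representation(response_maps, levels=(1, 2)):
    """Concatenate the pooled values of every meta-object response map."""
    features = []
    for response in response_maps:
        features.extend(pool_response_map(response, levels))
    return features
```

    With one response map per meta object, the resulting vector has a fixed length regardless of image size, which is what makes it usable as input to a standard classifier.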

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that differs significantly from the other values in a data set. Outliers may be instances of error or may indicate events of interest. Outlier detection aims to identify such observations in order to improve data analysis and to discover interesting and useful knowledge about unusual events in numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets, and we provide a comprehensive taxonomy framework and two decision trees for selecting the most suitable technique for a given data set. Furthermore, we highlight the advantages, disadvantages, and performance issues of each class of outlier detection techniques under this taxonomy framework.
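    As a concrete instance of unsupervised outlier detection, the sketch below scores each point by the distance to its k-th nearest neighbour, a common distance-based approach. It is only an illustration of one class of technique and is not taken from the paper's taxonomy; the function name `knn_outlier_scores` and the parameter `k` are assumptions.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour.

    points: list of equal-length numeric tuples (needs more than k points).
    A larger score means the point lies farther from its neighbours and is
    therefore a stronger outlier candidate. Brute force, O(n^2) distances.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points, k=2)
# The isolated point (10, 10) receives the largest score.
most_outlying = points[scores.index(max(scores))]
```

    No labels are needed: the score depends only on the geometry of the data, which is what makes the method unsupervised.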

    A framework for clustering and adaptive topic tracking on evolving text and social media data streams.

    Recent advances in and widespread usage of online web services and social media platforms, coupled with ubiquitous low-cost devices, mobile technologies, and the increasing capacity of lower-cost storage, have led to a proliferation of Big Data, ranging from news, e-commerce clickstreams, and online business transactions to continuous event logs and social media expressions. These large amounts of online data, often referred to as data streams because they are generated at extremely high throughput or velocity, can make conventional and classical data analytics methodologies obsolete. For these reasons, the management and analysis of data streams have been researched extensively in recent years. The special case of social media Big Data brings additional challenges, particularly because of the unstructured nature of the data, specifically free text. One classical approach to mining text data has been Topic Modeling. Topic models are statistical models that can be used to discover the abstract "topics" that may occur in a corpus of documents. Topic models have emerged as a powerful technique in machine learning and data science, providing a good balance between simplicity and complexity. They also provide sophisticated insight without the need for real natural language understanding. However, they were not designed to cope with the type of text data that is abundant on social media platforms, but rather for traditional medium-size corpora consisting of longer documents, adhering to a specific language and typically spanning a stable set of topics. Unlike traditional document corpora, social media messages tend to be very short, sparse, and noisy, and they do not adhere to a standard vocabulary, linguistic patterns, or stable topic distributions. 
They are also generated at high velocity, which imposes high demands on topic modeling, and their evolving, dynamic nature makes any set of topic modeling results quickly become stale as the textual content and the topics discussed within social media streams change. In this dissertation, we propose an integrated topic modeling framework built on top of an existing stream-clustering framework called Stream-Dashboard, which can extract, isolate, and track topics over any given time period. In this new framework, Stream-Dashboard first clusters the data stream points into homogeneous groups. Data from each group is then passed to the topic modeling framework, which extracts finer topics from the group. The proposed framework tracks the evolution of the clusters over time to detect milestones corresponding to changes in topic evolution, and to trigger an adaptation of the learned groups and topics at each milestone. The proposed approach differs from a generic Topic Modeling approach because it works in a compartmentalized fashion: the input document stream is split into distinct compartments, and Topic Modeling is applied to each compartment separately. Furthermore, we propose extensions to existing topic modeling and stream clustering methods, including an adaptive query reformulation approach that helps focus topic discovery over time; a topic modeling extension with adaptive hyper-parameters and an infinite vocabulary; and an adaptive stream clustering algorithm that incorporates the automated estimation of dynamic, cluster-specific temporal scales for adaptive forgetting, to facilitate clustering in a fast-evolving data stream. 
Our experimental results show that the proposed adaptive forgetting clustering algorithm mines better-quality clusters; that the proposed compartmentalized framework mines topics of better quality than competitive baselines; and that the proposed framework can automatically adapt to focus on changing topics using the proposed query reformulation strategy.
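    The idea of forgetting with a cluster-specific temporal scale can be sketched as follows. This is a minimal illustration that assumes exponential decay with a fixed, given time constant `tau` per cluster; the dissertation estimates these scales automatically, which is not reproduced here. The class name `DecayingCluster` and its attributes are hypothetical.

```python
import math


class DecayingCluster:
    """Running centroid in which old points fade with time constant tau."""

    def __init__(self, first_point, time, tau):
        self.center = list(first_point)
        self.weight = 1.0      # decayed count of the points seen so far
        self.last_time = time
        self.tau = tau         # assumed given; larger tau forgets more slowly

    def update(self, point, time):
        # Down-weight everything seen so far by the elapsed time.
        decay = math.exp(-(time - self.last_time) / self.tau)
        self.weight = self.weight * decay + 1.0
        # Incremental form of the decayed weighted mean.
        self.center = [c + (x - c) / self.weight
                       for c, x in zip(self.center, point)]
        self.last_time = time


cluster = DecayingCluster([0.0], time=0.0, tau=1.0)
cluster.update([2.0], time=0.0)    # no time elapsed: plain average
```

    A cluster with a small `tau` tracks recent data closely, while one with a large `tau` behaves almost like an ordinary running mean, which is why a per-cluster scale can matter in a stream whose topics evolve at different speeds.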

    Denial of Service in Web-Domains: Building Defenses Against Next-Generation Attack Behavior

    The existing state of the art in application-layer Distributed Denial of Service (DDoS) protection is generally designed for, and thus effective only in, static web domains. To the best of our knowledge, our work is the first to study the problem of application-layer DDoS defense in web domains of dynamic content and organization, and for next-generation bot behaviour. In the first part of this thesis, we focus on the following research tasks: 1) we identify the main weaknesses of the existing application-layer anti-DDoS solutions proposed in the research literature and in industry, 2) we obtain a comprehensive picture of current-day as well as next-generation application-layer attack behaviour, and 3) we propose novel techniques, based on a multidisciplinary approach that combines offline machine learning algorithms and statistical analysis, for detecting suspicious web visitors in static web domains. Then, in the second part of the thesis, we propose and evaluate a novel anti-DDoS system that detects a broad range of application-layer DDoS attacks, in both static and dynamic web domains, through the use of advanced data mining techniques. The key advantage of our system over other systems that rely on challenge-response tests (such as CAPTCHAs) to combat malicious bots is that it minimizes the number of such tests presented to valid human visitors while still preventing most malicious attackers from accessing the web site. The experimental evaluation of the proposed system demonstrates effective detection of current and future variants of application-layer DDoS attacks.

    Data mining as a tool for environmental scientists

    Over recent years a huge library of data mining algorithms has been developed to tackle a variety of problems in fields such as medical imaging and network traffic analysis. Many of these techniques are far more flexible than more classical modelling approaches and could be usefully applied to data-rich environmental problems. Certain techniques, such as Artificial Neural Networks, Clustering, Case-Based Reasoning and, more recently, Bayesian Decision Networks, have found application in environmental modelling, while other methods, for example classification and association rule extraction, have not yet been taken up on any wide scale. We propose that these and other data mining techniques could be usefully applied to difficult problems in the field. This paper introduces several data mining concepts and briefly discusses their application to environmental modelling, where data may be sparse, incomplete, or heterogeneous.

    Data-driven covariance estimation for the iterative closest point algorithm

    Three-dimensional point clouds are a ubiquitous data format in robotics. They are produced by specialized sensors such as lidars or depth cameras. The point clouds generated by those sensors are used for state estimation tasks like mapping and localization. Point cloud registration algorithms, such as Iterative Closest Point (ICP), allow us to make the ego-motion measurements necessary for those tasks. 
The fusion of ICP registrations in existing state estimation frameworks relies on an accurate estimate of their uncertainty. Unfortunately, existing covariance estimation methods often scale poorly to the 3D case. This thesis aims to estimate the uncertainty of ICP registrations for 3D point clouds. First, it lays the theoretical foundations from which we can articulate a covariance estimation method. It reviews the ICP algorithm, with a special focus on the parts that are pertinent to covariance estimation. Then, an inserted article introduces CELLO-3D, our data-driven covariance estimation method for ICP. The article contains a thorough experimental validation of the new algorithm, which is shown to perform better than existing covariance estimation techniques in a wide variety of environments. Finally, this thesis concludes with supplementary experiments that complement the article.
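    For readers unfamiliar with ICP, the registration step whose uncertainty the thesis estimates can be sketched as follows. This is a minimal point-to-point variant, written in 2D with brute-force nearest-neighbour matching for brevity; the thesis addresses the 3D case. It does not implement CELLO-3D or any covariance estimation. The function name `icp_2d` and its parameters are hypothetical.

```python
import math


def icp_2d(source, target, iterations=20):
    """Estimate (theta, tx, ty) so that rotating source by theta and
    translating it by (tx, ty) aligns it with target.

    source, target: lists of (x, y) tuples. Like any ICP variant, this only
    converges when the initial misalignment is small.
    """
    theta, tx, ty = 0.0, 0.0, 0.0
    n = len(source)
    sx = sum(x for x, _ in source) / n
    sy = sum(y for _, y in source) / n
    for _ in range(iterations):
        c, s = math.cos(theta), math.sin(theta)
        moved = [(c * x - s * y + tx, s * x + c * y + ty) for x, y in source]
        # Matching step: pair each moved point with its closest target point.
        matches = [min(target, key=lambda q: math.dist(p, q)) for p in moved]
        mx = sum(x for x, _ in matches) / n
        my = sum(y for _, y in matches) / n
        # Alignment step: closed-form least-squares rigid transform in 2D.
        num = sum((x - sx) * (qy - my) - (y - sy) * (qx - mx)
                  for (x, y), (qx, qy) in zip(source, matches))
        den = sum((x - sx) * (qx - mx) + (y - sy) * (qy - my)
                  for (x, y), (qx, qy) in zip(source, matches))
        theta = math.atan2(num, den)
        c, s = math.cos(theta), math.sin(theta)
        tx = mx - (c * sx - s * sy)
        ty = my - (s * sx + c * sy)
    return theta, tx, ty
```

    The output is a single transform estimate with no indication of how reliable it is, which is the gap a covariance estimation method is meant to fill before the result is fused into a state estimator.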