202 research outputs found

    Large Scale Spectral Clustering Using Approximate Commute Time Embedding

    Spectral clustering is a clustering method that can detect clusters of complex shape. However, it requires the eigendecomposition of the graph Laplacian matrix, whose cost is O(n^3), and it is therefore not suitable for large-scale problems. Recently, many methods have been proposed to accelerate spectral clustering. These approximate methods usually involve sampling techniques, through which much of the information in the original data may be lost. In this work, we propose a fast and accurate spectral clustering approach using an approximate commute time embedding, which is similar to the spectral embedding. The method does not require any sampling technique and does not compute any eigenvector at all. Instead, it uses random projection and a linear-time solver to find the approximate embedding. Experiments on several synthetic and real datasets show that the proposed approach has better clustering quality and is faster than state-of-the-art approximate spectral clustering methods.
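    As a rough illustration of the idea described above, the sketch below builds an approximate commute time embedding from a sparse adjacency matrix: the weighted edge-vertex incidence matrix is compressed with a random +-1 projection, and each projected row is pushed through a Laplacian solve. All names and parameters are illustrative, and SciPy's conjugate-gradient solver stands in for the near-linear-time Laplacian solver the abstract refers to; this is a sketch, not the authors' code.

```python
# Sketch: approximate commute time embedding via random projection + CG solves.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg
from scipy.sparse.csgraph import laplacian

def approx_commute_embedding(W, k_proj=50):
    """W: sparse symmetric adjacency matrix (n x n).
    Returns an (n x k_proj) embedding whose pairwise squared distances
    approximate effective resistances (commute times up to a constant)."""
    n = W.shape[0]
    L = sp.csr_matrix(laplacian(W))               # graph Laplacian
    U = sp.triu(W, k=1).tocoo()                   # one entry per edge
    rows, cols, weights = U.row, U.col, U.data
    m = len(weights)
    # weighted signed incidence matrix  W^{1/2} B  (m x n)
    data = np.concatenate([np.sqrt(weights), -np.sqrt(weights)])
    idx = (np.concatenate([np.arange(m), np.arange(m)]),
           np.concatenate([rows, cols]))
    Bw = sp.csr_matrix((data, idx), shape=(m, n))
    # random +-1/sqrt(k) projection (Johnson-Lindenstrauss style)
    Q = np.random.choice([-1.0, 1.0], size=(k_proj, m)) / np.sqrt(k_proj)
    Y = (Bw.T @ Q.T).T                            # k_proj x n projected rows
    Z = np.zeros((k_proj, n))
    for i in range(k_proj):                       # solve L z_i = y_i
        Z[i], _ = cg(L, Y[i])
    return Z.T                                    # n x k_proj embedding

# The rows of the returned matrix can then be clustered with k-means
# (e.g. sklearn.cluster.KMeans), exactly as in spectral clustering.
```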

    Harnessing rare category trinity for complex data

    In the era of big data, we are inundated with the sheer volume of data being collected from various domains. In contrast, it is often the rare occurrences that are crucially important to many high-impact domains with diverse data types. For example, in online transaction platforms, the percentage of fraudulent transactions might be small, but the resultant financial loss could be significant; in social networks, a novel topic is often neglected by the majority of users at the initial stage, but it could burst into an emerging trend afterward; in the Sloan Digital Sky Survey, the vast majority of sky images (e.g., known stars, comets, nebulae, etc.) are of no interest to astronomers, while only 0.001% of the sky images lead to novel scientific discoveries; in worldwide pandemics (e.g., SARS, MERS, COVID-19), the primary cases might be limited, but the consequences could be catastrophic (e.g., mass mortality and economic recession). Therefore, studying such complex rare categories has profound significance and long-standing impact on many aspects of modern society, from preventing financial fraud to uncovering hot topics and trends, and from supporting scientific research to forecasting pandemics and natural disasters. In this thesis, we propose a generic learning mechanism with trinity modules for complex rare category analysis: (M1) Rare Category Characterization - characterizing the rare patterns with a compact representation; (M2) Rare Category Explanation - interpreting the prediction results and providing relevant clues for the end users; (M3) Rare Category Generation - producing synthetic rare category examples that resemble the real ones. The key philosophy of our mechanism lies in "all for one and one for all": each module makes unique contributions to the whole mechanism and thus receives support from its companions. In particular, M1 serves as the de-novo step to discover rare category patterns in complex data; M2 provides a proper lens for the end users to examine the outputs and understand the learning process; and M3 synthesizes realistic rare category examples for data augmentation to further improve M1 and M2. To enrich the learning mechanism, we develop principled theorems and solutions to characterize, understand, and synthesize rare categories in complex scenarios, ranging from static rare categories to time-evolving rare categories, from attributed data to graph-structured data, from homogeneous data to heterogeneous data, and from low-order connectivity patterns to high-order connectivity patterns. It is worth mentioning that we have also launched one of the first visual analytics systems for dynamic rare category analysis, which integrates our developed techniques and enables users to investigate complex rare categories in practice.

    Approximate Data Analytics Systems

    Today, most modern online services use big data analytics systems to extract useful information from raw digital data. The data normally arrives as a continuous stream, at high speed and in huge volumes. The cost of handling this massive data can be significant. Providing interactive latency when processing the data is often impractical because the data grows exponentially, even faster than Moore's law predicts. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to an approximate, rather than exact, output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a subset of the input data instead of the entire input. Unfortunately, advances in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees for stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed and large-scale stream data to achieve low latency and efficient utilization of resources. To achieve these goals, we have designed and built the following approximate data analytics systems:
    • StreamApprox: a data stream analytics system for approximate computing. It supports approximate computing for low-latency stream analytics in a transparent way and can adapt to rapid fluctuations of input data streams. For this system, we designed an online adaptive stratified reservoir sampling algorithm to produce approximate output with bounded error (a simplified sketch of stratified reservoir sampling follows this abstract).
    • IncApprox: a data analytics system for incremental approximate computing. It combines approximate and incremental computing in stream processing to achieve high throughput and low latency with efficient resource utilization. For this system, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error.
    • PrivApprox: a data stream analytics system for privacy-preserving and approximate computing. It supports high-utility, low-latency data analytics while preserving users' privacy, by combining privacy-preserving data analytics with approximate computing.
    • ApproxJoin: a system for approximate distributed joins. It improves the performance of joins, which are critical but expensive operations in big data systems. Here we employ a sketching technique (Bloom filters) to avoid shuffling non-joinable data items through the network, and we propose a novel sampling mechanism that executes during the join to obtain an unbiased, representative sample of the join output.
    Our evaluation on micro-benchmarks and real-world case studies shows that these systems achieve significant performance speedups compared to state-of-the-art systems while tolerating negligible accuracy loss in the analytics output. In addition, our systems allow users to systematically trade off accuracy against throughput/latency and require no or only minor modifications to existing applications.
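    To make the sampling idea concrete, the sketch below shows a plain stratified reservoir sampler over a keyed stream: it keeps a fixed-size reservoir per stratum and scales each stratum's sample by its observed count when estimating aggregates. This is only a simplified illustration of the kind of sampling StreamApprox and IncApprox build on; the systems' online adaptive algorithms, error bounds, and self-adjusting computation are not reproduced, and all names here are hypothetical.

```python
# Sketch: per-stratum reservoir sampling with a scaled-up sum estimate.
import random
from collections import defaultdict

class StratifiedReservoir:
    def __init__(self, per_stratum_size):
        self.size = per_stratum_size
        self.reservoirs = defaultdict(list)   # stratum -> sampled items
        self.seen = defaultdict(int)          # stratum -> items observed so far

    def add(self, stratum, item):
        self.seen[stratum] += 1
        res = self.reservoirs[stratum]
        if len(res) < self.size:
            res.append(item)                  # reservoir not yet full
        else:
            # keep the new item with probability size / seen
            j = random.randrange(self.seen[stratum])
            if j < self.size:
                res[j] = item

    def estimate_sum(self, value=lambda x: x):
        # scale each stratum's sample mean by the stratum's true count
        total = 0.0
        for s, res in self.reservoirs.items():
            if res:
                total += self.seen[s] * sum(map(value, res)) / len(res)
        return total

# usage: sample a keyed stream and estimate the total
# r = StratifiedReservoir(per_stratum_size=100)
# for key, v in stream: r.add(key, v)
# print(r.estimate_sum())
```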

    Acquisition and Exploitation of Prior Knowledge to Predict Pedestrian Behaviour Around Autonomous Vehicles in Urban Environments

    Autonomous vehicles navigating in urban areas interact with pedestrians and other shared-space users, such as cyclists, throughout their journey, whether in open areas, like urban city centres, or closed areas, like parking lots. As more and more autonomous vehicles take to the city streets, their ability to understand and predict pedestrian behaviour becomes paramount. This is achieved by learning through continuous observation of the area being driven in. Human drivers, on the other hand, can instinctively infer pedestrian motion on an urban street even in previously unseen areas. This need to increase a vehicle's situational awareness to reach parity with human drivers fuels the need for larger and deeper data on pedestrian motion in a myriad of situations and varying environments. This thesis focuses on the problem of reducing this dependency on large amounts of data while still predicting pedestrian motion accurately over an extended horizon. Instead, this work relies on Prior Knowledge, itself derived from J. J. Gibson's sociological principles of "Natural Vision" and "Natural Movement". It assumes that pedestrian behaviour is a function of the built environment and that all motion is directed towards reaching a goal. From this underlying principle, the cost of traversing a scene from a pedestrian's perspective can be derived, and inference on pedestrian behaviour can then be performed. This work contributes to the framework of understanding pedestrian behaviour as a confluence of probabilistic graphical models and sociological principles in three ways: modelling the environment, learning, and predicting. Concerning modelling, the work assumes that some parts of the observed scene are more attractive to pedestrians and others repulsive. By quantifying these "affordances" as a consequence of certain Points of Interest (POIs) and the different elements in the scene, it is possible to model the scene under observation with different costs based on the features it contains. Concerning learning, this work primarily extends the Growing Hidden Markov Model (GHMM) method, a variant of the Hidden Markov Model (HMM), by applying Prior Knowledge to initialise a topology able to infer accurately on "typical motions" in the scene. Secondly, the generated model behaves as a self-organising map, incrementally learning non-typical pedestrian behaviour and encoding it within the topology while updating the parameters of the underlying HMM. On prediction, this work carries out Bayesian inference on the generated model and, as a result of the Prior Knowledge, performs better than the existing implementation of the GHMM method in predicting future pedestrian positions without training trajectories being available, thereby allowing its use in an urban scene with only environmental data. The contributions of this thesis are validated through experimental results on real data captured from an overhead camera overlooking a busy urban street, depicting a structured built environment, and from the car's perspective in a parking lot, depicting a semi-structured environment, tested on typical and non-typical trajectories in each case.
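    For intuition, the sketch below shows the Bayesian filtering and prediction step on a plain discrete HMM: a belief over scene states is updated from observations, then propagated forward without observations to predict future positions. It is only a simplified stand-in for the Growing HMM used in the thesis, whose topology and transition priors are initialised from the built environment; the matrices and dimensions here are hypothetical.

```python
# Sketch: HMM filtering and k-step-ahead prediction over discrete scene states.
import numpy as np

def predict_distribution(A, B, observations, horizon):
    """A: (S x S) row-stochastic state transition matrix,
    B: (S x O) observation likelihoods, observations: observation indices.
    Returns the predicted state distribution `horizon` steps ahead."""
    S = A.shape[0]
    belief = np.full(S, 1.0 / S)          # uniform prior over states
    for o in observations:                # Bayesian filtering (forward pass)
        belief = belief @ A               # predict one step
        belief *= B[:, o]                 # update with observation likelihood
        belief /= belief.sum()
    for _ in range(horizon):              # propagate without observations
        belief = belief @ A
    return belief

# In the spirit of the thesis, A could be derived from environment costs:
# transitions towards goals / low-cost cells get higher probability,
# encoding the prior knowledge instead of learned trajectories.
```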

    Machine Learning-based Methods for Driver Identification and Behavior Assessment: Applications for CAN and Floating Car Data

    The exponential growth of car-generated data, increased connectivity, and advances in artificial intelligence (AI) enable novel mobility applications. This dissertation focuses on two use cases of driving data, namely distraction detection and driver identification (ID). Low- and middle-income countries account for 93% of traffic deaths, and a major contributing factor to road crashes is distracted driving. Motivated by this, the first part of this thesis explores the possibility of an easy-to-deploy solution for distracted-driving detection. Most related work uses sophisticated sensors or cameras, which raises privacy concerns and increases cost. Therefore, a machine learning (ML) approach is proposed that only uses signals from the CAN bus and the inertial measurement unit (IMU). It is evaluated against a hand-annotated dataset of 13 drivers and delivers reasonable accuracy. The approach is limited in detecting short-term distractions but demonstrates that a viable solution is possible. The second part focuses on effective identification of drivers from their driving behaviour, with the aim of addressing the shortcomings of state-of-the-art methods. First, a driver ID mechanism based on discriminative classifiers is used to find a set of suitable signals and features. It uses five signals from the CAN bus with hand-engineered features, an improvement over the current state of the art, which mainly relies on external sensors. The second approach is based on Gaussian mixture models (GMMs); although it uses two signals and fewer features, it shows improved accuracy. In this system, enrolling a new driver does not require retraining the models, which was a limitation of the previous approach. To reduce the amount of training data, a triplet network is used to train a deep neural network (DNN) that learns to discriminate drivers. Training the DNN does not require any driving data from the target set of drivers. The DNN encodes pieces of driving data into an embedding space so that examples of the same driver appear close to each other and far from examples of other drivers. This technique reduces the amount of data needed for accurate prediction to under a minute of driving. These three solutions are validated against a real-world dataset of 57 drivers. Lastly, the possibility of a driver ID system is explored that uses only floating car data (FCD), in particular GPS data from smartphones. A DNN architecture is designed that encodes routes, origin and destination coordinates, and various other features computed from contextual information. The proposed model is evaluated against a dataset of 678 drivers and shows high accuracy. In a nutshell, this work demonstrates that proper driver ID is achievable. Constraints imposed by the use case and data availability negatively affect performance; in such cases, efficient use of the available data is crucial.
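    As an illustration of the triplet idea mentioned above, the sketch below trains a small encoder with a triplet margin loss so that windows of driving data from the same driver map close together in the embedding space, while windows from different drivers are pushed apart. The network, feature dimension, and random data are placeholders; the thesis' actual architecture and CAN/FCD-derived features are not reproduced here.

```python
# Sketch: triplet-loss training of a driver-embedding network (PyTorch).
import torch
import torch.nn as nn

class DriverEncoder(nn.Module):
    def __init__(self, in_dim, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim))

    def forward(self, x):
        z = self.net(x)
        return nn.functional.normalize(z, dim=-1)   # unit-norm embeddings

encoder = DriverEncoder(in_dim=64)
loss_fn = nn.TripletMarginLoss(margin=0.2)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# anchor/positive: windows of driving data from the same driver,
# negative: a window from a different driver (random data stands in here)
anchor, positive, negative = (torch.randn(16, 64) for _ in range(3))

loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
opt.zero_grad(); loss.backward(); opt.step()
```

    At identification time, a short driving segment would be embedded and assigned to the nearest enrolled driver's embedding (for example a per-driver centroid), so enrolling a new driver requires no retraining of the network.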