933 research outputs found

    Forecasting People Trajectories and Head Poses by Jointly Reasoning on Tracklets and Vislets

    In this work, we explore the correlation between people's trajectories and their head orientations. We argue that trajectory and head pose forecasting can be modelled as a joint problem. Recent approaches to trajectory forecasting leverage short-term trajectories (aka tracklets) of pedestrians to predict their future paths. In addition, sociological cues, such as expected destination or pedestrian interaction, are often combined with tracklets. In this paper, we propose MiXing-LSTM (MX-LSTM) to capture the interplay between positions and head orientations (vislets) thanks to a joint unconstrained optimization of full covariance matrices during the LSTM backpropagation. We additionally exploit the head orientations as a proxy for visual attention when modeling social interactions. MX-LSTM predicts pedestrians' future locations and head poses, extending the standard capabilities of current approaches to long-term trajectory forecasting. Compared to the state of the art, our approach shows better performance on an extensive set of public benchmarks. MX-LSTM is particularly effective when people move slowly, i.e. the most challenging scenario for all other models. The proposed approach also allows for accurate predictions on a longer time horizon. Comment: Accepted at IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. arXiv admin note: text overlap with arXiv:1805.0065

    COMIC: Towards A Compact Image Captioning Model with Attention

    Recent works in image captioning have shown very promising raw performance. However, we observe that most of these encoder-decoder style networks with attention do not scale naturally to large vocabulary sizes, making them difficult to deploy on embedded systems with limited hardware resources. This is because the sizes of the word and output embedding matrices grow proportionally with the size of the vocabulary, adversely affecting the compactness of these networks. To address this limitation, this paper tackles the compactness of image captioning models, a problem that is hitherto unexplored. We show that our proposed model, named COMIC for COMpact Image Captioning, achieves results comparable to state-of-the-art approaches in five common evaluation metrics on both the MS-COCO and InstaPIC-1.1M datasets, despite having an embedding vocabulary size that is 39x - 99x smaller. The source code and models are available at: https://github.com/jiahuei/COMIC-Compact-Image-Captioning-with-Attention Comment: Added source code link and new results in Table

    Logsig-RNN: a novel network for robust and efficient skeleton-based action recognition

    This paper contributes to the challenge of skeleton-based human action recognition in videos. The key step is to develop a generic network architecture to extract discriminative features from spatio-temporal skeleton data. In this paper, we propose a novel module, namely Logsig-RNN, which combines a log-signature layer with recurrent neural networks (RNNs). The former comes from the mathematically principled technology of signatures and log-signatures as representations for streamed data, which can manage high sample rate streams, non-uniform sampling and time series of variable length. It serves as an enhancement of the recurrent layer and can be conveniently plugged into neural networks. In addition, we propose two path transformation layers to significantly reduce the path dimension while retaining the essential information fed into the Logsig-RNN module. (The network architecture is illustrated in Figure 1 (Right).) Finally, numerical results demonstrate that replacing the RNN module with the Logsig-RNN module in SOTA networks consistently improves performance on both Chalearn gesture data and NTU RGB+D 120 action data in terms of accuracy and robustness. In particular, we achieve state-of-the-art accuracy on Chalearn2013 gesture data by combining simple path transformation layers with the Logsig-RNN.

    Response Characterization for Auditing Cell Dynamics in Long Short-term Memory Networks

    In this paper, we introduce a novel method to interpret recurrent neural networks (RNNs), particularly long short-term memory networks (LSTMs), at the cellular level. We propose a systematic pipeline for interpreting individual hidden state dynamics within the network using response characterization methods. The ranked contribution of individual cells to the network's output is computed by analyzing a set of interpretable metrics of their decoupled step and sinusoidal responses. As a result, our method is able to uniquely identify neurons with insightful dynamics, quantify relationships between dynamical properties and test accuracy through ablation analysis, and interpret the impact of network capacity on a network's dynamical distribution. Finally, we demonstrate the generalizability and scalability of our method by evaluating it on a series of different benchmark sequential datasets.

    Learning stochastic differential equations using RNN with log signature features

    This paper contributes to the challenge of learning a function on streamed multimodal data through evaluation. The core result of our paper is the combination of two quite different approaches to this problem. One comes from the mathematically principled technology of signatures and log-signatures as representations for streamed data, while the other draws on the techniques of recurrent neural networks (RNNs). The ability of the former to manage high sample rate streams and of the latter to manage large-scale nonlinear interactions allows hybrid algorithms that are easy to code, quicker to train, and of lower complexity for a given accuracy. We illustrate the approach by approximating the unknown functional as a controlled differential equation. Linear functionals on solutions of controlled differential equations are the natural universal class of functions on data streams. Following this approach, we propose a hybrid Logsig-RNN algorithm that learns functionals on streamed data. Tested on various datasets, i.e. synthetic data, NTU RGB+D 120 skeletal action data, and Chalearn2013 gesture data, our algorithm achieves outstanding accuracy with superior efficiency and robustness.
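
    The abstract above describes summarising a high sample rate stream with log-signatures before passing it to an RNN. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a piecewise-linear path, truncates the log-signature at depth 2 (total increment plus Lévy area), and uses hypothetical function names (`logsig_depth2`, `logsig_sequence`). The resulting shorter feature sequence is what a recurrent layer would consume.

```python
import numpy as np


def logsig_depth2(path):
    """Depth-2 log-signature of a piecewise-linear path of shape (T, d).

    Returns the total increment (level 1) and the antisymmetric
    Levy-area matrix of shape (d, d) (level 2).
    """
    path = np.asarray(path, dtype=float)
    inc = np.diff(path, axis=0)        # per-segment increments, shape (T-1, d)
    start = path[:-1] - path[0]        # displacement from the start before each segment
    outer = start.T @ inc              # sum over segments of start_k (outer) inc_k
    area = 0.5 * (outer - outer.T)     # antisymmetric part is the Levy area
    return path[-1] - path[0], area


def logsig_sequence(path, n_segments):
    """Split a stream into n_segments windows and compute a depth-2
    log-signature on each, giving features of shape
    (n_segments, d + d * (d - 1) // 2)."""
    path = np.asarray(path, dtype=float)
    n_points, dim = path.shape
    # Window boundaries; adjacent windows share an endpoint so no increment is lost.
    cuts = np.linspace(0, n_points - 1, n_segments + 1).round().astype(int)
    upper = np.triu_indices(dim, k=1)  # independent entries of the Levy-area matrix
    feats = []
    for a, b in zip(cuts[:-1], cuts[1:]):
        inc, area = logsig_depth2(path[a:b + 1])
        feats.append(np.concatenate([inc, area[upper]]))
    return np.stack(feats)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stream = rng.normal(size=(101, 3)).cumsum(axis=0)  # synthetic 3-d stream, 101 samples
    features = logsig_sequence(stream, n_segments=10)
    print(features.shape)
```

    Here a 101-sample stream is reduced to 10 time steps of 6 features each, so a recurrent layer sees a sequence roughly ten times shorter while each step still encodes the order in which the coordinates moved within its window.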

    Deep Learning Approaches to Goal Recognition

    Recognising the goal of an agent from a trace of observations is an important task with many applications. In the literature, many approaches to goal recognition (GR) rely on the application of automated planning techniques, which require a model of the domain actions and of the initial domain state (written, e.g., in PDDL). In this thesis, we study three alternative approaches (GRNet, Fast and Slow Goal Recognition, and a BERT-based approach) in which goal recognition is formulated as a classification task addressed by machine learning. All these approaches are primarily aimed at solving GR instances in a given domain, which is specified by a set of propositions and a set of action names. In GRNet, the goal classification instances in the domain are solved by an LSTM network. The only information required as input to the trained network is a trace of action names, each one indicating just the name of an observed action. A run of the LSTM processes a trace of observed actions to compute how likely it is that each domain proposition is part of the agent's goal. Fast and Slow Goal Recognition, inspired by the "Thinking Fast and Slow" framework, is a dual-process model that integrates the aforementioned LSTM with automated planning techniques. This architecture can exploit both the fast, experience-based goal recognition provided by the network and the slow, deliberate analysis provided by the planning techniques. Finally, we study how a BERT model trained on plans is able to understand how a domain works, its actions, and how they are related to each other. This model is then fine-tuned to classify goal recognition instances. Experimental analyses confirm that the presented architectures achieve good performance in terms of both goal classification accuracy and runtime, often obtaining better results than a state-of-the-art GR system on the considered benchmarks.