226 research outputs found

    An Architectural Approach to Ensuring Consistency in Hierarchical Execution

    Full text link
    Hierarchical task decomposition is a method used in many agent systems to organize agent knowledge. This work shows how the combination of a hierarchy and persistent assertions of knowledge can lead to difficulty in maintaining logical consistency in asserted knowledge. We explore the problematic consequences of persistent assumptions in the reasoning process and introduce novel potential solutions. Having implemented one of the possible solutions, Dynamic Hierarchical Justification, its effectiveness is demonstrated with an empirical analysis

    Deep neural networks for video classification in ecology

    Get PDF
    Analyzing large volumes of video data is a challenging and time-consuming task. Automating this process would very valuable, especially in ecological research where massive amounts of video can be used to unlock new avenues of ecological research into the behaviour of animals in their environments. Deep Neural Networks, particularly Deep Convolutional Neural Networks, are a powerful class of models for computer vision. When combined with Recurrent Neural Networks, Deep Convolutional models can be applied to video for frame level video classification. This research studies two datasets: penguins and seals. The purpose of the research is to compare the performance of image-only CNNs, which treat each frame of a video independently, against a combined CNN-RNN approach; and to assess whether incorporating the motion information in the temporal aspect of video improves the accuracy of classifications in these two datasets. Video and image-only models offer similar out-of-sample performance on the simpler seals dataset but the video model led to moderate performance improvements on the more complex penguin action recognition dataset

    Review of Deep Learning Algorithms and Architectures

    Get PDF
    Deep learning (DL) is playing an increasingly important role in our lives. It has already made a huge impact in areas, such as cancer diagnosis, precision medicine, self-driving cars, predictive forecasting, and speech recognition. The painstakingly handcrafted feature extractors used in traditional learning, classification, and pattern recognition systems are not scalable for large-sized data sets. In many cases, depending on the problem complexity, DL can also overcome the limitations of earlier shallow networks that prevented efficient training and abstractions of hierarchical representations of multi-dimensional training data. Deep neural network (DNN) uses multiple (deep) layers of units with highly optimized algorithms and architectures. This paper reviews several optimization methods to improve the accuracy of the training and to reduce training time. We delve into the math behind training algorithms used in recent deep networks. We describe current shortcomings, enhancements, and implementations. The review also covers different types of deep architectures, such as deep convolution networks, deep residual networks, recurrent neural networks, reinforcement learning, variational autoencoders, and others.https://doi.org/10.1109/ACCESS.2019.291220

    Towards Video Transformers for Automatic Human Analysis

    Full text link
    [eng] With the aim of creating artificial systems capable of mirroring the nuanced understanding and interpretative powers inherent to human cognition, this thesis embarks on an exploration of the intersection between human analysis and Video Transformers. The objective is to harness the potential of Transformers, a promising architectural paradigm, to comprehend the intricacies of human interaction, thus paving the way for the development of empathetic and context-aware intelligent systems. In order to do so, we explore the whole Computer Vision pipeline, from data gathering, to deeply analyzing recent developments, through model design and experimentation. Central to this study is the creation of UDIVA, an expansive multi-modal, multi-view dataset capturing dyadic face-to-face human interactions. Comprising 147 participants across 188 sessions, UDIVA integrates audio-visual recordings, heart-rate measurements, personality assessments, socio- demographic metadata, and conversational transcripts, establishing itself as the largest dataset for dyadic human interaction analysis up to this date. This dataset provides a rich context for probing the capabilities of Transformers within complex environments. In order to validate its utility, as well as to elucidate Transformers' ability to assimilate diverse contextual cues, we focus on addressing the challenge of personality regression within interaction scenarios. We first adapt an existing Video Transformer to handle multiple contextual sources and conduct rigorous experimentation. We empirically observe a progressive enhancement in model performance as more context is added, reinforcing the potential of Transformers to decode intricate human dynamics. Building upon these findings, the Dyadformer emerges as a novel architecture, adept at long-range modeling of dyadic interactions. By jointly modeling both participants in the interaction, as well as embedding multi- modal integration into the model itself, the Dyadformer surpasses the baseline and other concurrent approaches, underscoring Transformers' aptitude in deciphering multifaceted, noisy, and challenging tasks such as the analysis of human personality in interaction. Nonetheless, these experiments unveil the ubiquitous challenges when training Transformers, particularly in managing overfitting due to their demand for extensive datasets. Consequently, we conclude this thesis with a comprehensive investigation into Video Transformers, analyzing topics ranging from architectural designs and training strategies, to input embedding and tokenization, traversing through multi-modality and specific applications. Across these, we highlight trends which optimally harness spatio-temporal representations that handle video redundancy and high dimensionality. A culminating performance comparison is conducted in the realm of video action classification, spotlighting strategies that exhibit superior efficacy, even compared to traditional CNN-based methods.[cat] Aquesta tesi busca crear sistemes artificials que reflecteixin les habilitats de comprensió i interpretació humanes a través de l'ús de Transformers per a vídeo. L'objectiu és utilitzar aquestes arquitectures per comprendre millor la interacció humana i desenvolupar sistemes intel·ligents i conscients de l'entorn. Això implica explorar àmplies àrees de la Visió per Computador, des de la recopilació de dades fins a l'anàlisi de l'estat de l'art i la prova experimental d'aquests models. Una part essencial d'aquest estudi és la creació d'UDIVA, un ampli conjunt de dades multimodal i multivista que enregistra interaccions humanes cara a cara. Amb 147 participants i 188 sessions, UDIVA inclou contingut audiovisual, freqüència cardíaca, perfils de personalitat, dades sociodemogràfiques i transcripcions de les converses. És el conjunt de dades més gran conegut per a l'anàlisi de la interacció humana diàdica i proporciona un context ric per a l'estudi de les capacitats dels Transformers en entorns complexos. Per tal de validar la seva utilitat i les habilitats dels Transformers, ens centrem en la regressió de la personalitat. Inicialment, adaptem un Transformer de vídeo per integrar diverses fonts de context. Mitjançant experiments exhaustius, observem millores progressives en els resultats amb la inclusió de més context, confirmant la capacitat dels Transformers. Motivats per aquests resultats, desenvolupem el Dyadformer, una arquitectura per interaccions diàdiques de llarga duració. Aquesta nova arquitectura considera simultàniament els dos participants en la interacció i incorpora la multimodalitat en un sol model. El Dyadformer supera la nostra proposta inicial i altres treballs similars, destacant la capacitat dels Transformers per abordar tasques complexes. No obstant això, aquestos experiments revelen reptes d'entrenament dels Transformers, com el sobreajustament, per la seva necessitat de grans conjunts de dades. La tesi conclou amb una anàlisi profunda dels Transformers per a vídeo, incloent dissenys arquitectònics, estratègies d'entrenament, preprocessament de vídeos, tokenització i multimodalitat. S'identifiquen tendències per gestionar la redundància i alta dimensionalitat de vídeos i es realitza una comparació de rendiment en la classificació d'accions a vídeo, destacant estratègies d'eficàcia superior als mètodes tradicionals basats en convolucions

    Classification of Broadcast News Audio Data Employing Binary Decision Architecture

    Get PDF
    A novel binary decision architecture (BDA) for broadcast news audio classification task is presented in this paper. The idea of developing such architecture came from the fact that the appropriate combination of multiple binary classifiers for two-class discrimination problem can reduce a miss-classification error without rapid increase in computational complexity. The core element of classification architecture is represented by a binary decision (BD) algorithm that performs discrimination between each pair of acoustic classes, utilizing two types of decision functions. The first one is represented by a simple rule-based approach in which the final decision is made according to the value of selected discrimination parameter. The main advantage of this solution is relatively low processing time needed for classification of all acoustic classes. The cost for that is low classification accuracy. The second one employs support vector machine (SVM) classifier. In this case, the overall classification accuracy is conditioned by finding the optimal parameters for decision function resulting in higher computational complexity and better classification performance. The final form of proposed BDA is created by combining four BD discriminators supplemented by decision table. The effectiveness of proposed BDA, utilizing rule-based approach and the SVM classifier, is compared with two most popular strategies for multiclass classification, namely the binary decision trees (BDT) and the One-Against-One SVM (OAOSVM). Experimental results show that the proposed classification architecture can decrease the overall classification error in comparison with the BDT architecture. On the contrary, an optimization technique for selecting the optimal set of training data is needed in order to overcome the OAOSVM

    Mixing Deep Networks and Entangled Forests for the Semantic Segmentation of 3D Indoor Scenes

    Get PDF
    This work focuses on semantic segmentation over indoor 3D data, that is, to assign labels to every point in the point clouds representing working spaces: after researching the current state of the art, traditional approaches like random forests and deep neural networks based on PointNet are evaluated. The Superpoint Graph architecture and the 3D Entangled Forests algorithm are selected for mixing their features to try to enhance their performance
    corecore