15,905 research outputs found

    Automatic Interaction and Activity Recognition from Videos of Human Manual Demonstrations with Application to Anomaly Detection

    This paper presents a new method to describe spatio-temporal relations between objects and hands, in order to recognize both interactions and activities within video demonstrations of manual tasks. The approach exploits Scene Graphs to extract key interaction features from image sequences while encoding motion patterns and context at the same time. Additionally, the method introduces an event-based automatic video segmentation and clustering step, which groups similar events and also detects on the fly whether a monitored activity is executed correctly. The effectiveness of the approach was demonstrated in two multi-subject experiments, showing the ability to recognize and cluster hand-object and object-object interactions without prior knowledge of the activity, as well as to match the same activity performed by different subjects. Comment: 8 pages, 8 figures, submitted to IEEE RAS International Symposium on Robot and Human Interactive Communication (RO-MAN); for the associated video see https://youtu.be/Ftu_EHAtH4
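
    As a rough illustration of the kind of per-frame spatial relations between hands and objects that a scene graph can encode, the sketch below builds a small graph from bounding-box detections; the labels, coordinates, and edge features are hypothetical and are not taken from the paper.

    import numpy as np

    def box_center(box):
        # box = (x_min, y_min, x_max, y_max)
        x0, y0, x1, y1 = box
        return np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])

    def frame_scene_graph(detections):
        # detections maps an entity label (e.g. 'hand', 'cup') to its bounding box.
        # Each edge stores the displacement and distance between two entities,
        # a simple stand-in for richer spatio-temporal relation features.
        nodes = list(detections)
        edges = {}
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                d = box_center(detections[b]) - box_center(detections[a])
                edges[(a, b)] = {"displacement": d, "distance": float(np.linalg.norm(d))}
        return {"nodes": nodes, "edges": edges}

    # Toy usage: relations between a hand and two objects in one frame.
    graph = frame_scene_graph({
        "hand": (10, 10, 50, 60),
        "cup": (40, 20, 80, 70),
        "plate": (150, 30, 220, 90),
    })
    print(graph["edges"][("hand", "cup")]["distance"])

    Chaining such per-frame graphs over time would then yield the event sequence on which segmentation and clustering could operate.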

    CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition

    Vision-language models like CLIP have been widely adopted for various tasks due to their impressive zero-shot capabilities. However, CLIP is not suited to extracting 3D geometric features, as it was trained only on images and text with natural language supervision. We address this limitation and propose a new framework termed CG3D (CLIP Goes 3D), in which a 3D encoder is learned to exhibit zero-shot capabilities. CG3D is trained with natural language supervision on triplets of point clouds, corresponding rendered 2D images, and texts. To align the features in a multimodal embedding space, we use a contrastive loss on the 3D features obtained from the 3D encoder as well as the visual and text features extracted from CLIP. We note that there is a distribution shift between the natural images used to train CLIP and the rendered 2D images in CG3D. Attempting to train the visual and text encoders to account for this shift results in catastrophic forgetting and a notable decrease in performance. To solve this, we employ prompt tuning and introduce trainable parameters in the input space to shift CLIP towards the 3D pre-training dataset utilized in CG3D. We extensively test our pre-trained CG3D framework and demonstrate its impressive capabilities in zero-shot, open scene understanding, and retrieval tasks. Further, it also serves as strong starting weights for fine-tuning on downstream 3D recognition tasks. Comment: Website: https://jeya-maria-jose.github.io/cg3d-web
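
    A minimal PyTorch-style sketch of the kind of multimodal contrastive alignment the abstract describes is given below: a symmetric InfoNCE loss pulls the 3D encoder's output towards the CLIP image and text embeddings of the same triplet. The loss form, temperature, and feature dimensions are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def info_nce(a, b, temperature=0.07):
        # Symmetric InfoNCE loss between two batches of embeddings.
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def cg3d_style_loss(point_feats, clip_image_feats, clip_text_feats):
        # Align the trainable 3D encoder's output with the (frozen) CLIP
        # visual and text embeddings of the same training triplet.
        return (info_nce(point_feats, clip_image_feats) +
                info_nce(point_feats, clip_text_feats))

    # Toy usage with random features standing in for the real encoders.
    pts = torch.randn(8, 512)      # 3D encoder output
    img = torch.randn(8, 512)      # CLIP image embeddings of the renders
    txt = torch.randn(8, 512)      # CLIP text embeddings of the captions
    print(cg3d_style_loss(pts, img, txt).item())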

    Graph-based Algorithm Unfolding for Energy-aware Power Allocation in Wireless Networks

    We develop a novel graph-based trainable framework to maximize the weighted sum energy efficiency (WSEE) for power allocation in wireless communication networks. To address the non-convex nature of the problem, the proposed method consists of modular structures inspired by a classical iterative suboptimal approach and enhanced with learnable components. More precisely, we propose a deep unfolding of the successive concave approximation (SCA) method. In our unfolded SCA (USCA) framework, the originally preset parameters are now learnable via graph convolutional neural networks (GCNs) that directly exploit the multi-user channel state information as the underlying graph adjacency matrix. We show the permutation equivariance of the proposed architecture, which is a desirable property for models applied to wireless network data. The USCA framework is trained through stochastic gradient descent using a progressive training strategy. The unsupervised loss is carefully devised to reflect the monotonic property of the objective under maximum power constraints. Comprehensive numerical results demonstrate its generalizability across different network topologies of varying size, density, and channel distribution. Thorough comparisons illustrate the improved performance and robustness of USCA over state-of-the-art benchmarks. Comment: Published in IEEE Transactions on Wireless Communications
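
    For context on the objective being maximized and on how a GCN can consume channel state information as a graph, the sketch below writes out a generic WSEE expression for a K-user interference channel and a one-layer graph convolution that uses the channel matrix as its adjacency. The system-model constants and the layer design are illustrative assumptions, not the paper's unfolded architecture.

    import torch

    def wsee(p, H, weights, mu=1.0, p_circuit=1.0, noise=1.0):
        # Weighted sum energy efficiency for a K-user interference channel.
        # p: (K,) transmit powers; H: (K, K) channel gains, H[i, j] being the
        # gain from transmitter j to receiver i; weights: (K,) per-user weights.
        # Amplifier efficiency mu and circuit power p_circuit are generic
        # system-model constants assumed here for illustration.
        signal = torch.diag(H) * p
        interference = H @ p - signal
        rate = torch.log2(1.0 + signal / (interference + noise))
        return (weights * rate / (mu * p + p_circuit)).sum()

    class CSIGraphConv(torch.nn.Module):
        # One graph-convolution layer that treats the (row-normalised) channel
        # matrix as the adjacency, in the spirit of the GCNs the abstract describes.
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.lin = torch.nn.Linear(in_dim, out_dim)

        def forward(self, x, H):
            deg = H.sum(dim=1, keepdim=True).clamp_min(1e-9)
            return torch.relu(self.lin((H / deg) @ x))

    K = 4
    H = torch.rand(K, K)
    p = torch.rand(K)
    print(wsee(p, H, torch.ones(K)).item())
    node_feats = CSIGraphConv(1, 8)(p.unsqueeze(1), H)   # per-user features from CSI
    print(node_feats.shape)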

    NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior

    Training a Neural Radiance Field (NeRF) without pre-computed camera poses is challenging. Recent advances in this direction demonstrate the possibility of jointly optimising a NeRF and camera poses in forward-facing scenes. However, these methods still face difficulties during dramatic camera movement. We tackle this challenging problem by incorporating undistorted monocular depth priors. These priors are generated by correcting scale and shift parameters during training, and they then allow us to constrain the relative poses between consecutive frames. This constraint is imposed through our proposed novel loss functions. Experiments on real-world indoor and outdoor scenes show that our method can handle challenging camera trajectories and outperforms existing methods in terms of novel view rendering quality and pose estimation accuracy. Our project page is https://nope-nerf.active.vision
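
    To make the idea of constraining relative poses with undistorted depth concrete, here is a rough sketch: each monocular depth map is corrected with a learnable scale and shift, back-projected into a point cloud, and the cloud from frame t is moved by the current relative-pose estimate and compared against the cloud from frame t+1. The Chamfer distance and the single global scale/shift used here are illustrative choices, not the paper's exact loss functions.

    import torch

    def backproject(depth, K):
        # Back-project a depth map into a 3D point cloud using pinhole intrinsics K.
        H, W = depth.shape
        v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                              torch.arange(W, dtype=torch.float32), indexing="ij")
        pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)
        rays = (torch.linalg.inv(K) @ pix.T).T
        return rays * depth.reshape(-1, 1)

    def chamfer(a, b):
        # Symmetric Chamfer distance between point clouds a and b.
        d = torch.cdist(a, b)
        return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

    def relative_pose_loss(depth_t, depth_t1, K, scale, shift, R, t):
        # Correct both depth maps, back-project them, move frame t's cloud by the
        # current relative pose (R, t), and measure the mismatch to frame t+1.
        p_t = backproject(scale * depth_t + shift, K)
        p_t1 = backproject(scale * depth_t1 + shift, K)
        return chamfer((R @ p_t.T).T + t, p_t1)

    # Toy usage with small random depth maps and an identity relative pose.
    depth_a = torch.rand(8, 10) + 1.0
    depth_b = torch.rand(8, 10) + 1.0
    intrinsics = torch.tensor([[50.0, 0.0, 5.0], [0.0, 50.0, 4.0], [0.0, 0.0, 1.0]])
    loss = relative_pose_loss(depth_a, depth_b, intrinsics,
                              scale=torch.tensor(1.0), shift=torch.tensor(0.0),
                              R=torch.eye(3), t=torch.zeros(3))
    print(loss.item())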

    Quantifying the Benefit of Artificial Intelligence for Scientific Research

    The ongoing artificial intelligence (AI) revolution has the potential to change almost every line of work. As AI capabilities continue to improve in accuracy, robustness, and reach, AI may outperform and even replace human experts across many valuable tasks. Despite enormous efforts devoted to understanding AI's impact on labor and the economy, and its recent success in accelerating scientific discovery and progress, we lack a systematic understanding of how advances in AI may benefit scientific research across disciplines and fields. Here we develop a measurement framework to estimate both the direct use of AI and the potential benefit of AI in scientific research, applying natural language processing techniques to 87.6 million publications and 7.1 million patents. We find that the use of AI in research appears widespread throughout the sciences, growing especially rapidly since 2015, and that papers which use AI exhibit an impact premium, being more likely to be highly cited both within and outside their disciplines. While almost every discipline contains some subfields that benefit substantially from AI, an analysis of 4.6 million course syllabi across various educational disciplines reveals a systematic misalignment between AI education and AI's impact on research, suggesting that the supply of AI talent in scientific disciplines is not commensurate with AI research demands. Lastly, examining who benefits from AI within the scientific workforce, we find that disciplines with a higher proportion of women or Black scientists tend to be associated with less benefit, suggesting that AI's growing impact on research may further exacerbate existing inequalities in science. As the connection between AI and scientific research deepens, our findings may become increasingly valuable, with important implications for the equity and sustainability of the research enterprise. Comment: 23 pages, 4 figures

    MENLI: Robust Evaluation Metrics from Natural Language Inference

    Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., those relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling choice. We design a preference-based adversarial attack framework and show that our NLI-based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI-based metrics outperform existing summarization metrics, but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (15%-30%) and higher-quality metrics as measured on standard benchmarks (+5% to 30%). Comment: TACL 2023 camera-ready; GitHub link fixed, Fig. 3 legend fixed
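
    One way to picture an NLI-based evaluation signal is to score a generated text by the probability that the source entails it, and then mix that score with an existing similarity-based metric. The sketch below uses an off-the-shelf NLI checkpoint and a simple weighted average; the model choice and the 50/50 combination are assumptions for illustration, not the paper's exact recipe.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_name = "roberta-large-mnli"   # an off-the-shelf NLI model; any NLI model would do
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    def entailment_prob(premise, hypothesis):
        # Probability that `premise` entails `hypothesis` under the NLI model.
        inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[0]
        # Look the entailment class up in the model config rather than hard-coding it.
        ent_idx = next(i for i, lab in model.config.id2label.items()
                       if lab.lower() == "entailment")
        return probs[ent_idx].item()

    def combined_metric(source, hypothesis, similarity_score, weight=0.5):
        # Mixing the NLI score with an existing metric's score; a weighted average
        # is one simple (assumed) way to combine them.
        return weight * entailment_prob(source, hypothesis) + (1 - weight) * similarity_score

    print(combined_metric("The cat sat on the mat.", "A cat is on a mat.", similarity_score=0.8))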

    DASS Good: Explainable Data Mining of Spatial Cohort Data

    Developing applicable clinical machine learning models is a difficult task when the data includes spatial information, for example, radiation dose distributions across adjacent organs at risk. We describe the co-design of a modeling system, DASS, to support the hybrid human-machine development and validation of predictive models for estimating long-term toxicities related to radiotherapy doses in head and neck cancer patients. Developed in collaboration with domain experts in oncology and data mining, DASS incorporates human-in-the-loop visual steering, spatial data, and explainable AI to augment domain knowledge with automatic data mining. We demonstrate DASS with the development of two practical clinical stratification models and report feedback from domain experts. Finally, we describe the design lessons learned from this collaborative experience. Comment: 10 pages, 9 figures

    Colour technologies for content production and distribution of broadcast content

    Accurate colour reproduction has long been a priority driving the development of new colour imaging systems that maximise human perceptual plausibility. This thesis explores machine learning algorithms for colour processing to assist both content production and distribution. First, this research studies colourisation technologies with practical use cases in the restoration and processing of archived content. The research targets practical deployable solutions, developing a cost-effective pipeline which integrates the activity of the producer into the processing workflow. In particular, a fully automatic image colourisation paradigm using Conditional GANs is proposed to improve the content generalisation and colourfulness of existing baselines. Moreover, a more conservative solution is considered in which references are provided to guide the system towards more accurate colour predictions. A fast end-to-end architecture is proposed to improve existing exemplar-based image colourisation methods while decreasing complexity and runtime. Finally, the proposed image-based methods are integrated into a video colourisation pipeline. A general framework is proposed to reduce temporal flickering and the propagation of errors when such methods are applied frame by frame. The proposed model is jointly trained to stabilise the input video and to cluster its frames, with the aim of learning scene-specific modes. Second, this research explores colour processing technologies for content distribution, with the aim of effectively delivering the processed content to a broad audience. In particular, video compression is tackled by introducing a novel methodology for chroma intra prediction based on attention models. Although the proposed architecture helped to gain control over the reference samples and better understand the prediction process, the complexity of the underlying neural network significantly increased the encoding and decoding time. Therefore, aiming at efficient deployment within the latest video coding standards, this work also focused on simplifying the proposed architecture to obtain a more compact and explainable model.
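
    As a generic illustration of conditional-GAN colourisation in Lab space, the sketch below has a tiny generator predict the two chroma channels (ab) from the luminance channel (L) and trains it with an adversarial term plus an L1 reconstruction term. This is the standard pix2pix-style recipe, shown only to convey the idea; the thesis's actual architectures and losses are not reproduced here.

    import torch
    import torch.nn as nn

    class TinyGenerator(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 2, 3, padding=1), nn.Tanh(),   # ab channels in [-1, 1]
            )
        def forward(self, L):
            return self.net(L)

    def generator_loss(fake_ab, real_ab, disc_logits, l1_weight=100.0):
        adv = nn.functional.binary_cross_entropy_with_logits(
            disc_logits, torch.ones_like(disc_logits))       # fool the discriminator
        rec = nn.functional.l1_loss(fake_ab, real_ab)         # stay close to ground truth
        return adv + l1_weight * rec

    # Toy usage with random tensors standing in for real L/ab image data.
    G = TinyGenerator()
    L_channel = torch.rand(4, 1, 64, 64)
    real_ab = torch.rand(4, 2, 64, 64) * 2 - 1
    fake_ab = G(L_channel)
    disc_logits = torch.randn(4, 1)   # stand-in for a discriminator's output
    print(generator_loss(fake_ab, real_ab, disc_logits).item())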

    Optimizations of Autoencoders for Analysis and Classification of Microscopic In Situ Hybridization Images

    Currently, analysis of microscopic In Situ Hybridization images is done manually by experts. Precise evaluation and classification of such microscopic images can ease experts' work and reveal further insights about the data. In this work, we propose a deep-learning framework to detect and classify areas of microscopic images with similar levels of gene expression. The data we analyze requires an unsupervised learning model, for which we employ a type of artificial neural network: deep autoencoders. The model's performance is optimized by balancing the length and complexity of the latent layers and by fine-tuning hyperparameters. The results are validated by adapting the mean-squared error (MSE) metric and by comparison with experts' evaluations. Comment: 9 pages; 9 figures
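
    The sketch below shows a minimal convolutional autoencoder with an MSE reconstruction objective, along the lines the abstract describes; the layer sizes, latent width, and input resolution are illustrative assumptions, not the paper's tuned values.

    import torch
    import torch.nn as nn

    class TinyAutoencoder(nn.Module):
        def __init__(self, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
                nn.Flatten(),
                nn.Linear(32 * 16 * 16, latent_dim),                   # bottleneck
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 32 * 16 * 16), nn.ReLU(),
                nn.Unflatten(1, (32, 16, 16)),
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
            )
        def forward(self, x):
            return self.decoder(self.encoder(x))

    # Toy usage with random patches standing in for microscopy image tiles.
    model = TinyAutoencoder()
    images = torch.rand(8, 1, 64, 64)
    recon = model(images)
    loss = nn.functional.mse_loss(recon, images)   # the MSE signal used for validation
    print(loss.item())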

    Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification

    Generalizable person re-identification (Re-ID) is an active research topic in machine learning and computer vision that plays a significant role in realistic scenarios due to its applications in public security and video surveillance. However, previous methods mainly focus on visual representation learning and neglect to explore the potential of semantic features during training, which easily leads to poor generalization capability when the model is adapted to a new domain. In this paper, we propose a Multi-Modal Equivalent Transformer called MMET for more robust visual-semantic embedding learning on visual, textual, and visual-textual tasks. To further enhance robust feature learning in the context of the transformer, a dynamic masking mechanism called the Masked Multimodal Modeling (MMM) strategy is introduced to mask both image patches and text tokens; it jointly works on multimodal or unimodal data and significantly boosts the performance of generalizable person Re-ID. Extensive experiments on benchmark datasets demonstrate the competitive performance of our method over previous approaches. We hope this method will advance research towards visual-semantic representation learning. Our source code is publicly available at https://github.com/JeremyXSC/MMET
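
    To illustrate the spirit of a masked multimodal modeling step, the sketch below randomly zeroes out image patch embeddings and replaces text tokens with a mask id before they would be fed to a vision-language transformer. The mask ratio and the exact masking scheme here are illustrative assumptions, not the paper's settings.

    import torch

    def mask_multimodal(patch_embeds, token_ids, mask_token_id, mask_ratio=0.15):
        # Randomly drop image patch embeddings and replace text tokens with [MASK].
        patch_mask = torch.rand(patch_embeds.shape[:2]) < mask_ratio   # (B, N_patches)
        masked_patches = patch_embeds * (~patch_mask).unsqueeze(-1)

        token_mask = torch.rand(token_ids.shape) < mask_ratio          # (B, N_tokens)
        masked_tokens = torch.where(token_mask,
                                    torch.full_like(token_ids, mask_token_id),
                                    token_ids)
        return masked_patches, masked_tokens, patch_mask, token_mask

    # Toy usage with random inputs standing in for a transformer's patch and token inputs.
    patches = torch.randn(2, 16, 64)            # batch of 2, 16 patches, 64-dim embeddings
    tokens = torch.randint(5, 1000, (2, 12))    # batch of 2, 12 text tokens
    print([t.shape for t in mask_multimodal(patches, tokens, mask_token_id=103)[:2]])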