Automatic Interaction and Activity Recognition from Videos of Human Manual Demonstrations with Application to Anomaly Detection
This paper presents a new method to describe spatio-temporal relations
between objects and hands, to recognize both interactions and activities within
video demonstrations of manual tasks. The approach exploits Scene Graphs to
extract key interaction features from image sequences, encoding motion
patterns and context at the same time. Additionally, the method introduces an
event-based automatic video segmentation and clustering scheme, which groups
similar events and detects on the fly whether a monitored activity is executed
correctly. The effectiveness of the approach was demonstrated in two
multi-subject experiments, showing the ability to recognize and cluster
hand-object and object-object interactions without prior knowledge of the
activity, as well as to match the same activity performed by different
subjects.
Comment: 8 pages, 8 figures, submitted to IEEE RAS International Symposium on
Robot and Human Interactive Communication (RO-MAN), for associated video see
https://youtu.be/Ftu_EHAtH4
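As a rough illustration of the clustering step, here is a minimal sketch (not the authors' implementation) in which per-event interaction descriptors, abstracted here as plain feature vectors rather than Scene Graph features, are grouped without fixing the number of activities in advance:

```python
# Hypothetical sketch: group segmented events by descriptor similarity.
# Scene-Graph feature extraction is abstracted away as random vectors.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy per-event descriptors (e.g., pooled hand-object relation features).
event_features = np.random.rand(12, 16)

# Cluster similar events without specifying the number of activities upfront.
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.5, linkage="average"
)
labels = clustering.fit_predict(event_features)
print("event cluster labels:", labels)
```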
CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition
Vision-Language models like CLIP have been widely adopted for various tasks
due to their impressive zero-shot capabilities. However, CLIP is not suitable
for extracting 3D geometric features, as it was trained only on images and
text with natural language supervision. We address this limitation and
propose a new framework termed CG3D (CLIP Goes 3D) where a 3D encoder is
learned to exhibit zero-shot capabilities. CG3D is trained on triplets of
point clouds, corresponding rendered 2D images, and texts with natural language
supervision. To align the features in a multimodal embedding space, we utilize
contrastive loss on 3D features obtained from the 3D encoder, as well as visual
and text features extracted from CLIP. We note that the natural images used to
train CLIP and the rendered 2D images in CG3D have a distribution shift.
Attempting to train the visual and text encoder to account for this shift
results in catastrophic forgetting and a notable decrease in performance. To
solve this, we employ prompt tuning and introduce trainable parameters in the
input space to shift CLIP towards the 3D pre-training dataset utilized in CG3D.
We extensively test our pre-trained CG3D framework and demonstrate its
impressive capabilities in zero-shot, open scene understanding, and retrieval
tasks. Further, it provides strong starting weights for fine-tuning on
downstream 3D recognition tasks.
Comment: Website: https://jeya-maria-jose.github.io/cg3d-web
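As a hedged sketch of the alignment objective described above, the following shows a CLIP-style symmetric contrastive (InfoNCE) loss between 3D-encoder features and frozen CLIP text features; the dimensions, temperature, and variable names are illustrative and not taken from the CG3D implementation:

```python
# Minimal sketch (assumed, not the CG3D code): contrastive alignment of
# 3D-encoder features with CLIP text/image features in a shared space.
import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, temperature=0.07):
    """Symmetric InfoNCE between two L2-normalised feature batches."""
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    logits = feat_a @ feat_b.t() / temperature
    targets = torch.arange(feat_a.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: features from the trainable 3D encoder vs. frozen CLIP text features.
point_feats = torch.randn(8, 512)
text_feats = torch.randn(8, 512)
print(contrastive_loss(point_feats, text_feats).item())
```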
Graph-based Algorithm Unfolding for Energy-aware Power Allocation in Wireless Networks
We develop a novel graph-based trainable framework to maximize the weighted
sum energy efficiency (WSEE) for power allocation in wireless communication
networks. To address the non-convex nature of the problem, the proposed method
consists of modular structures inspired by a classical iterative suboptimal
approach and enhanced with learnable components. More precisely, we propose a
deep unfolding of the successive concave approximation (SCA) method. In our
unfolded SCA (USCA) framework, the originally preset parameters are now
learnable via graph convolutional neural networks (GCNs) that directly exploit
multi-user channel state information as the underlying graph adjacency matrix.
We show the permutation equivariance of the proposed architecture, which is a
desirable property for models applied to wireless network data. The USCA
framework is trained through a stochastic gradient descent approach using a
progressive training strategy. The unsupervised loss is carefully devised to
feature the monotonic property of the objective under maximum power
constraints. Comprehensive numerical results demonstrate its generalizability
across different network topologies of varying size, density, and channel
distribution. Thorough comparisons illustrate the improved performance and
robustness of USCA over state-of-the-art benchmarks.
Comment: Published in IEEE Transactions on Wireless Communications
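To make the graph-based parameterisation more concrete, here is a toy sketch (not the published USCA code) of a single GCN-style layer that uses the multi-user channel-gain matrix as the adjacency matrix; the normalisation and dimensions are assumptions for illustration only:

```python
# Hedged sketch: a graph convolution whose adjacency is the channel-gain matrix,
# producing per-link features that could parameterise an unfolded SCA step.
import torch
import torch.nn as nn

class CSIGraphConv(nn.Module):
    """One GCN-style layer aggregating neighbour features weighted by channel gains."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, H):
        # Row-normalise the channel matrix so aggregation is a weighted average.
        A = H / (H.sum(dim=-1, keepdim=True) + 1e-9)
        return torch.relu(self.lin(A @ x))

n_links, feat_dim = 5, 4
H = torch.rand(n_links, n_links)   # toy |h_ij|^2 channel gains
x = torch.rand(n_links, feat_dim)  # toy per-link node features
layer = CSIGraphConv(feat_dim, 8)
print(layer(x, H).shape)           # torch.Size([5, 8])
```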
NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior
Training a Neural Radiance Field (NeRF) without pre-computed camera poses is
challenging. Recent advances in this direction demonstrate the possibility of
jointly optimising a NeRF and camera poses in forward-facing scenes. However,
these methods still face difficulties during dramatic camera movement. We
tackle this challenging problem by incorporating undistorted monocular depth
priors. These priors are generated by correcting scale and shift parameters
during training, with which we are then able to constrain the relative poses
between consecutive frames. This constraint is achieved using our proposed
novel loss functions. Experiments on real-world indoor and outdoor scenes show
that our method can handle challenging camera trajectories and outperforms
existing methods in terms of novel view rendering quality and pose estimation
accuracy. Our project page is https://nope-nerf.active.vision
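One way to picture the scale-and-shift correction of the monocular depth priors is a per-image least-squares alignment; the sketch below illustrates that idea only and is not the NoPe-NeRF procedure, which optimises these parameters jointly during training:

```python
# Toy sketch: recover the scale s and shift t that best align a relative
# monocular depth map with a reference depth, via closed-form least squares.
import numpy as np

def fit_scale_shift(mono_depth, ref_depth):
    """Solve min_{s,t} || s * mono_depth + t - ref_depth ||^2."""
    A = np.stack([mono_depth.ravel(), np.ones(mono_depth.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref_depth.ravel(), rcond=None)
    return s, t

mono = np.random.rand(32, 32)      # relative depth from a monocular network
ref = 2.5 * mono + 0.3             # toy depth it should be aligned to
s, t = fit_scale_shift(mono, ref)
print(round(float(s), 3), round(float(t), 3))   # ~2.5, ~0.3
```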
Quantifying the Benefit of Artificial Intelligence for Scientific Research
The ongoing artificial intelligence (AI) revolution has the potential to
change almost every line of work. As AI capabilities continue to improve in
accuracy, robustness, and reach, AI may outperform and even replace human
experts across many valuable tasks. Despite enormous efforts devoted to
understanding AI's impact on labor and the economy and its recent success in
accelerating scientific discovery and progress, we lack a systematic
understanding of how advances in AI may benefit scientific research across
disciplines and fields. Here we develop a measurement framework to estimate
both the direct use of AI and the potential benefit of AI in scientific
research by applying natural language processing techniques to 87.6 million
publications and 7.1 million patents. We find that the use of AI in research
appears widespread throughout the sciences, growing especially rapidly since
2015, and papers that use AI exhibit an impact premium, being more likely to be
highly cited both within and outside their disciplines. While almost every
discipline contains some subfields that benefit substantially from AI,
our analysis of 4.6 million course syllabi across various educational
disciplines reveals a systematic misalignment between AI education and AI's
impact on research, suggesting that the supply of AI talent in scientific
disciplines is not
commensurate with AI research demands. Lastly, examining who benefits from AI
within the scientific workforce, we find that disciplines with a higher
proportion of women or black scientists tend to be associated with less
benefit, suggesting that AI's growing impact on research may further exacerbate
existing inequalities in science. As the connection between AI and scientific
research deepens, our findings may have an increasing value, with important
implications for the equity and sustainability of the research enterprise.
Comment: 23 pages, 4 figures
MENLI: Robust Evaluation Metrics from Natural Language Inference
Recently proposed BERT-based evaluation metrics for text generation perform
well on standard benchmarks but are vulnerable to adversarial attacks, e.g.,
relating to information correctness. We argue that this stems (in part) from
the fact that they are models of semantic similarity. In contrast, we develop
evaluation metrics based on Natural Language Inference (NLI), which we deem a
more appropriate modeling. We design a preference-based adversarial attack
framework and show that our NLI-based metrics are much more robust to these
attacks than recent BERT-based metrics. On standard benchmarks, our NLI-based
metrics outperform existing summarization metrics but perform below SOTA
MT metrics. However, when combining existing metrics with our NLI metrics, we
obtain both higher adversarial robustness (15%-30%) and higher quality metrics
as measured on standard benchmarks (+5% to 30%).
Comment: TACL 2023 camera-ready; GitHub link fixed and Fig. 3 legend fixed
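The metric combination can be pictured as a simple weighted blend of an existing similarity-based score with an NLI-based score; the function and weight below are illustrative assumptions, not the scheme reported in the paper:

```python
# Hypothetical sketch of blending an existing metric with an NLI-based metric.
def combined_score(base_score: float, nli_score: float, w: float = 0.3) -> float:
    """Linear interpolation between a similarity-based metric and an NLI metric."""
    return (1 - w) * base_score + w * nli_score

print(combined_score(base_score=0.82, nli_score=0.67))
```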
DASS Good: Explainable Data Mining of Spatial Cohort Data
Developing applicable clinical machine learning models is a difficult task
when the data includes spatial information, for example, radiation dose
distributions across adjacent organs at risk. We describe the co-design of a
modeling system, DASS, to support the hybrid human-machine development and
validation of predictive models for estimating long-term toxicities related to
radiotherapy doses in head and neck cancer patients. Developed in collaboration
with domain experts in oncology and data mining, DASS incorporates
human-in-the-loop visual steering, spatial data, and explainable AI to augment
domain knowledge with automatic data mining. We demonstrate DASS with the
development of two practical clinical stratification models and report feedback
from domain experts. Finally, we describe the design lessons learned from this
collaborative experience.
Comment: 10 pages, 9 figures
Colour technologies for content production and distribution of broadcast content
The requirement of colour reproduction has long been a priority driving the development of new colour imaging systems that maximise human perceptual plausibility. This thesis explores machine learning algorithms for colour processing to assist both content production and distribution.

First, this research studies colourisation technologies with practical use cases in restoration and processing of archived content. The research targets practical, deployable solutions, developing a cost-effective pipeline which integrates the activity of the producer into the processing workflow. In particular, a fully automatic image colourisation paradigm using Conditional GANs is proposed to improve content generalisation and colourfulness over existing baselines. Moreover, a more conservative solution is considered by providing references to guide the system towards more accurate colour predictions. A fast end-to-end architecture is proposed to improve existing exemplar-based image colourisation methods while decreasing the complexity and runtime. Finally, the proposed image-based methods are integrated into a video colourisation pipeline. A general framework is proposed to reduce the generation of temporal flickering or propagation of errors when such methods are applied frame-to-frame. The proposed model is jointly trained to stabilise the input video and to cluster its frames with the aim of learning scene-specific modes.

Second, this research explores colour processing technologies for content distribution with the aim of effectively delivering the processed content to a broad audience. In particular, video compression is tackled by introducing a novel methodology for chroma intra prediction based on attention models. Although the proposed architecture helped to gain control over the reference samples and better understand the prediction process, the complexity of the underlying neural network significantly increased the encoding and decoding time. Therefore, aiming at efficient deployment within the latest video coding standards, this work also focused on the simplification of the proposed architecture to obtain a more compact and explainable model.
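As a speculative illustration of attention-based chroma intra prediction (not the thesis implementation or a standardised coding tool), each block sample can attend to boundary reference samples by luma similarity and take a weighted average of their chroma values:

```python
# Toy sketch: predict chroma for a block by attending over boundary reference
# samples, weighting them by similarity of the co-located luma values.
import torch
import torch.nn.functional as F

def chroma_from_attention(block_luma, ref_luma, ref_chroma, temperature=0.1):
    """block_luma: (N,), ref_luma: (M,), ref_chroma: (M,) -> predicted chroma (N,)."""
    sim = -(block_luma[:, None] - ref_luma[None, :]) ** 2 / temperature
    attn = F.softmax(sim, dim=-1)          # attention over reference samples
    return attn @ ref_chroma

block_luma = torch.rand(16)                # toy 4x4 block, flattened
ref_luma = torch.rand(8)                   # toy boundary reference luma
ref_chroma = torch.rand(8)                 # toy boundary reference chroma
print(chroma_from_attention(block_luma, ref_luma, ref_chroma).shape)
```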
Optimizations of Autoencoders for Analysis and Classification of Microscopic In Situ Hybridization Images
Currently, analysis of microscopic In Situ Hybridization images is done
manually by experts. Precise evaluation and classification of such microscopic
images can ease experts' work and reveal further insights about the data. In
this work, we propose a deep-learning framework to detect and classify areas of
microscopic images with similar levels of gene expression. The data we analyze
requires an unsupervised learning model, for which we employ a type of
Artificial Neural Network: Deep Learning Autoencoders. The model's performance
is optimized by balancing the latent layers' length and complexity and
fine-tuning hyperparameters. The results are validated by adapting the
mean-squared error (MSE) metric and by comparison with experts' evaluation.
Comment: 9 pages; 9 figures
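A minimal sketch of the kind of model involved, assuming a simple fully connected autoencoder whose latent size is the hyperparameter being balanced and whose reconstruction is scored with MSE; the layer sizes are illustrative, not those of the paper:

```python
# Hedged sketch (not the authors' architecture): a small autoencoder whose
# latent dimension is tuned, validated by mean-squared reconstruction error.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=4096, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.rand(16, 4096)                       # toy flattened image patches
model = Autoencoder(latent_dim=64)
mse = nn.functional.mse_loss(model(x), x)      # reconstruction error
print(mse.item())
```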
Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification
Generalizable person re-identification (Re-ID) is an active research topic in
machine learning and computer vision, and it plays a significant role in
realistic scenarios due to its various applications in public security and
video surveillance. However, previous methods mainly focus on visual
representation learning while neglecting the potential of semantic features
during training, which easily leads to poor generalization capability
when adapted to the new domain. In this paper, we propose a Multi-Modal
Equivalent Transformer called MMET for more robust visual-semantic embedding
learning on visual, textual, and visual-textual tasks. To further
enhance the robust feature learning in the context of transformer, a dynamic
masking mechanism called Masked Multimodal Modeling strategy (MMM) is
introduced to mask both the image patches and the text tokens, which can
jointly work on multimodal or unimodal data and significantly boost the
performance of generalizable person Re-ID. Extensive experiments on benchmark
datasets demonstrate the competitive performance of our method over previous
approaches. We hope this method can advance research towards
visual-semantic representation learning. Our source code is also publicly
available at https://github.com/JeremyXSC/MMET
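A rough sketch of the masking idea only (the mask ratio, shapes, and names are illustrative and not taken from MMET): randomly replace a subset of image-patch tokens and text tokens with a mask embedding before they enter a shared transformer:

```python
# Toy sketch: random masking applied to both image-patch and text token embeddings.
import torch

def random_mask(tokens, mask_token, mask_ratio=0.15):
    """Replace a random subset of token embeddings with a mask embedding."""
    batch, num_tokens, _ = tokens.shape
    mask = torch.rand(batch, num_tokens) < mask_ratio
    out = tokens.clone()
    out[mask] = mask_token
    return out, mask

patch_tokens = torch.randn(2, 196, 768)    # toy ViT patch embeddings
text_tokens = torch.randn(2, 32, 768)      # toy text token embeddings
mask_token = torch.zeros(768)              # a learned [MASK] embedding in practice
masked_patches, _ = random_mask(patch_tokens, mask_token)
masked_text, _ = random_mask(text_tokens, mask_token)
print(masked_patches.shape, masked_text.shape)
```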