A Taxonomy of Deep Convolutional Neural Nets for Computer Vision
Traditional architectures for solving computer vision problems, and the degree
of success they enjoyed, have relied heavily on hand-crafted features.
However, of late, deep learning techniques have offered a compelling
alternative -- that of automatically learning problem-specific features. With
this new paradigm, every problem in computer vision is now being re-examined
from a deep learning perspective. Therefore, it has become important to
understand what kind of deep networks are suitable for a given problem.
Although general surveys of this fast-moving paradigm (i.e., deep networks)
exist, a survey specific to computer vision is missing. We specifically
consider one form of deep network widely used in computer vision:
convolutional neural networks (CNNs). We start with "AlexNet" as our base CNN
and then examine the broad variations proposed over time to suit different
applications. We hope that our recipe-style survey will serve as a guide,
particularly for novice practitioners intending to use deep-learning techniques
for computer vision.
Comment: Published in Frontiers in Robotics and AI (http://goo.gl/6691Bm)
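As a rough illustration of the kind of base CNN the survey starts from, the sketch below builds a small AlexNet-style network in PyTorch. The layer sizes are illustrative assumptions, not the exact AlexNet configuration discussed in the survey.

# Minimal AlexNet-style CNN sketch in PyTorch; layer sizes are illustrative
# and not the exact AlexNet configuration described in the survey.
import torch
import torch.nn as nn

class TinyAlexNet(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        # Stacked convolution + ReLU + pooling stages, as in AlexNet-style nets.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        # Fully connected classifier head.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = TinyAlexNet(num_classes=10)(torch.randn(1, 3, 224, 224))  # -> shape [1, 10]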
LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer
Graphic layout designs play an essential role in visual communication. Yet
handcrafting layout designs is skill-demanding, time-consuming, and
non-scalable to batch production. Generative models have emerged to make design
automation scalable, but it remains non-trivial to produce designs that comply
with designers' multimodal desires, i.e., designs constrained by background images
and driven by foreground content. We propose LayoutDETR, which inherits the high
quality and realism from generative modeling, while reformulating content-aware
requirements as a detection problem: we learn to detect in a background image
the reasonable locations, scales, and spatial relations for multimodal
foreground elements in a layout. Our solution sets a new state-of-the-art
performance for layout generation on public benchmarks and on our newly-curated
ad banner dataset. We integrate our solution into a graphical system that
facilitates user studies, and show that users prefer our designs over baselines
by significant margins. Our code, models, dataset, graphical system, and demos
are available at https://github.com/salesforce/LayoutDETR
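To make the "layout as detection" reformulation concrete, here is a minimal sketch of a DETR-style decoder in which learned element queries cross-attend to background-image features and are decoded into boxes and class scores for layout elements. All module names and sizes are assumptions for illustration; this is not the actual LayoutDETR implementation.

# DETR-style layout decoder sketch: learned queries attend to background-image
# features and are decoded into normalized boxes for layout elements.
# Illustrative assumption, not the released LayoutDETR code.
import torch
import torch.nn as nn

class LayoutDecoderSketch(nn.Module):
    def __init__(self, num_elements=8, d_model=256, num_classes=4):
        super().__init__()
        self.queries = nn.Embedding(num_elements, d_model)   # one query per candidate element
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.box_head = nn.Linear(d_model, 4)                 # (cx, cy, w, h), normalized
        self.cls_head = nn.Linear(d_model, num_classes)       # element type (logo, text, button, ...)

    def forward(self, image_tokens):                          # image_tokens: [B, N, d_model]
        B = image_tokens.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        h = self.decoder(tgt=q, memory=image_tokens)          # cross-attend to background features
        return torch.sigmoid(self.box_head(h)), self.cls_head(h)

boxes, logits = LayoutDecoderSketch()(torch.randn(2, 196, 256))  # boxes: [2, 8, 4]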
Discriminative feature learning for multimodal classification
The purpose of this thesis is to tackle two related topics: multimodal classification and objective functions that improve the discriminative power of features.
First, I worked on image and text classification tasks and performed many experiments to show the effectiveness of different approaches available in the literature.
Then, I introduced a novel methodology that classifies multimodal documents using single-modal classifiers by merging textual and visual information into images, together with a novel loss function that improves the separability between samples of a dataset.
Results show that exploiting multimodal data improves classification performance compared with traditional single-modality methods.
Moreover, the introduced GIT loss function enhances the discriminative power of features, lowering intra-class distances and raising inter-class distances between samples of a multi-class dataset.
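As a rough sketch of the intra-/inter-class idea behind such a loss (a simplified formulation, not the thesis' exact GIT objective), the snippet below adds to the usual cross-entropy a term pulling features toward their class centers and a term penalizing closeness to the centers of other classes.

# Simplified intra-/inter-class distance loss in the spirit of the GIT loss
# described above; the weights and exact formulation are assumptions.
import torch
import torch.nn.functional as F

def git_style_loss(features, logits, labels, centers,
                   lambda_intra=0.1, lambda_inter=0.1, eps=1e-6):
    # Standard classification term.
    ce = F.cross_entropy(logits, labels)
    # Intra-class term: pull each feature toward its own class center.
    intra = ((features - centers[labels]) ** 2).sum(dim=1).mean()
    # Inter-class term: penalize features that sit close to other classes' centers.
    d2 = torch.cdist(features, centers) ** 2                  # [B, num_classes]
    own = F.one_hot(labels, centers.size(0)).bool()
    inter = (1.0 / (d2[~own] + eps)).mean()
    return ce + lambda_intra * intra + lambda_inter * inter

# Toy usage: 5 samples, 8-dim features, 3 classes.
feats, lbls = torch.randn(5, 8), torch.randint(0, 3, (5,))
logits, centers = torch.randn(5, 3), torch.randn(3, 8)
loss = git_style_loss(feats, logits, lbls, centers)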
End-to-end Lip-reading: A Preliminary Study
Deep lip-reading is the combination of the domains of computer vision and natural language processing. It uses deep neural networks to extract speech from silent videos. Most works in lip-reading use a multi-staged training approach due to the complex nature of the task. A single-stage, end-to-end, unified training approach, which is an ideal of machine learning, is also the goal in lip-reading. However, pure end-to-end systems have not yet been able to perform as well as non-end-to-end systems. Some exceptions to this are the very recent Temporal Convolutional Network (TCN) based architectures. This work lays out a preliminary study of deep lip-reading, with a special focus on various end-to-end approaches. The research aims to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To achieve this, the meaning of pure end-to-end is first defined and several lip-reading systems that follow the definition are analysed. The system that most closely matches the definition is then adapted for pure end-to-end experiments. Four main contributions have been made: i) an analysis of 9 different end-to-end deep lip-reading systems, ii) the creation and public release of a pipeline to adapt the sentence-level Lipreading Sentences in the Wild 3 (LRS3) dataset into word level, iii) pure end-to-end training of a TCN-based network and evaluation on the LRS3 word-level dataset as a proof of concept, iv) a public online portal to analyse visemes and experiment with live end-to-end lip-reading inference. The study is able to verify that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading.
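Since TCN-based architectures are the main end-to-end candidates mentioned above, here is a minimal sketch of the dilated temporal-convolution residual block such networks stack. It is a generic TCN building block under assumed sizes, not the specific lip-reading model evaluated in the study.

# Generic dilated temporal-convolution (TCN) residual block; an illustrative
# sketch of the building block used by TCN-style back-ends, not the exact
# architecture evaluated in this study.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels=256, kernel_size=3, dilation=2):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2               # keep sequence length unchanged
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                      # x: [batch, channels, time]
        return self.relu(x + self.net(x))                      # residual connection

frames = torch.randn(4, 256, 29)                               # e.g. features for 29 video frames
out = TCNBlock()(frames)                                       # same shape: [4, 256, 29]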
Force-Aware Interface via Electromyography for Natural VR/AR Interaction
While tremendous advances in visual and auditory realism have been made for
virtual and augmented reality (VR/AR), introducing a plausible sense of
physicality into the virtual world remains challenging. Closing the gap between
real-world physicality and immersive virtual experience requires a closed
interaction loop: applying user-exerted physical forces to the virtual
environment and generating haptic sensations back to the users. However,
existing VR/AR solutions either completely ignore the force inputs from the
users or rely on obtrusive sensing devices that compromise user experience.
By identifying users' muscle activation patterns while engaging in VR/AR, we
design a learning-based neural interface for natural and intuitive force
inputs. Specifically, we show that lightweight electromyography sensors,
resting non-invasively on users' forearm skin, inform and establish a robust
understanding of their complex hand activities. Fuelled by a
neural-network-based model, our interface can decode finger-wise forces in
real time with 3.3% mean error, and generalize to new users with little
calibration. Through an interactive psychophysical study, we show that human
perception of virtual objects' physical properties, such as stiffness, can be
significantly enhanced by our interface. We further demonstrate that our
interface enables ubiquitous control via finger tapping. Ultimately, we
envision our findings to push forward research towards more realistic
physicality in future VR/AR.
Comment: ACM Transactions on Graphics (SIGGRAPH Asia 2022)
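As a rough illustration of the kind of mapping such an interface learns (an assumption-level sketch, not the paper's model), the snippet below regresses per-finger forces from a window of multi-channel surface-EMG samples with a small 1-D CNN.

# Sketch of an EMG-to-force regressor: a small 1-D CNN maps a window of
# multi-channel surface-EMG samples to one force value per finger. Channel
# count, window length, and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class EMGForceRegressor(nn.Module):
    def __init__(self, emg_channels=8, num_fingers=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(emg_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                     # pool over the time window
        )
        self.head = nn.Linear(64, num_fingers)           # one force estimate per finger

    def forward(self, emg_window):                        # emg_window: [batch, channels, samples]
        return self.head(self.encoder(emg_window).squeeze(-1))

forces = EMGForceRegressor()(torch.randn(1, 8, 200))      # -> [1, 5] finger-wise forces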
Pattern Recognition
A wealth of advanced pattern recognition algorithms is emerging at the intersection of effective visual-feature technologies and the study of the human-brain cognition process. Effective visual features are made possible by rapid developments in appropriate sensor equipment, novel filter designs, and viable information processing architectures, while the understanding of the human-brain cognition process broadens the ways in which computers can perform pattern recognition tasks. The present book collects representative research from around the globe focusing on low-level vision, filter design, features and image descriptors, data mining and analysis, and biologically inspired algorithms. The 27 chapters covered in this book disclose recent advances and new ideas in promoting the techniques, technology, and applications of pattern recognition.
Classification of Frequency and Phase Encoded Steady State Visual Evoked Potentials for Brain Computer Interface Speller Applications using Convolutional Neural Networks
Over the past decade there have been substantial improvements in vision-based Brain-Computer Interface (BCI) spellers for quadriplegic patient populations. This thesis contains a review of the numerous bio-signals available to BCI researchers, as well as a brief chronology of the foremost decoding methodologies used to date. Recent advances in classification accuracy and information transfer rate can be primarily attributed to time-consuming, patient-specific parameter optimization procedures. The aim of the current study was to develop analysis software with potential ‘plug-in-and-play’ functionality. To this end, convolutional neural networks, presently established as state-of-the-art analytical techniques for image processing, were utilized. The thesis herein defines a deep convolutional neural network architecture for the offline classification of phase- and frequency-encoded SSVEP bio-signals. Networks were trained using an extensive 35-participant, open-source Electroencephalographic (EEG) benchmark dataset (Department of Bio-medical Engineering, Tsinghua University, Beijing). Average classification accuracies of 82.24% and information transfer rates of 22.22 bpm were achieved on a BCI-naïve participant dataset for a 40-target alphanumeric display, in the absence of any patient-specific parameter optimization.
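To give a sense of the kind of network involved (an illustrative sketch, not the thesis' exact architecture), the snippet below classifies a multi-channel EEG window into one of 40 SSVEP targets using a spatial convolution across channels followed by temporal convolutions; all dimensions are assumptions.

# Illustrative CNN for SSVEP classification: a spatial convolution across EEG
# channels followed by temporal convolutions and a 40-way classifier. Channel
# count, window length, and layer sizes are assumptions, not the thesis' model.
import torch
import torch.nn as nn

class SSVEPNet(nn.Module):
    def __init__(self, eeg_channels=9, num_targets=40):
        super().__init__()
        self.net = nn.Sequential(
            # Spatial filter: mix all EEG channels at every time step.
            nn.Conv2d(1, 16, kernel_size=(eeg_channels, 1)), nn.ReLU(),
            # Temporal filters over the resulting 1 x T feature maps.
            nn.Conv2d(16, 32, kernel_size=(1, 11), padding=(0, 5)), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(32, num_targets),
        )

    def forward(self, x):                                  # x: [batch, 1, channels, samples]
        return self.net(x)

logits = SSVEPNet()(torch.randn(2, 1, 9, 250))             # -> [2, 40] target scores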
Assessing Facial Symmetry and Attractiveness using Augmented Reality
Facial symmetry is a key component in quantifying the perception of beauty. In this paper, we propose a set of facial features computed from facial landmarks which can be extracted at a low computational cost. We quantitatively evaluated our proposed features for predicting perceived attractiveness from human portraits on four benchmark datasets (SCUT-FBP, SCUT-FBP5500, FACES and Chicago Face Database). Experimental results showed that the performance of our features is comparable to those extracted from a set with much denser facial landmarks. The computation of facial features was also implemented as an Augmented Reality (AR) app developed on Android OS. The app overlays four types of measurements and guide lines over a live video stream, while the facial measurements are computed from the tracked facial landmarks at run-time. The developed app can be used to assist plastic surgeons in assessing facial symmetry when planning reconstructive facial surgeries
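As a sketch of how symmetry-style features can be computed from sparse landmarks (a simplified example, not the paper's feature set), the snippet below mirrors each landmark across an approximate vertical facial midline and measures how far the mirrored point lands from its paired counterpart; the landmark pairing is an assumption.

# Simplified symmetry feature from 2-D facial landmarks: mirror each landmark
# across the vertical facial midline and measure the distance to its paired
# landmark. The pairing and normalization are illustrative assumptions.
import numpy as np

def symmetry_score(landmarks: np.ndarray, pairs: list[tuple[int, int]]) -> float:
    """landmarks: [N, 2] (x, y) points; pairs: indices of left/right counterparts."""
    midline_x = landmarks[:, 0].mean()                     # approximate vertical midline
    dists = []
    for left, right in pairs:
        mirrored = landmarks[left].copy()
        mirrored[0] = 2 * midline_x - mirrored[0]          # reflect across the midline
        dists.append(np.linalg.norm(mirrored - landmarks[right]))
    # Normalize by the landmark bounding-box diagonal so the score is size-invariant.
    scale = np.linalg.norm(landmarks.max(axis=0) - landmarks.min(axis=0))
    return float(np.mean(dists) / scale)

# Toy usage with 4 landmarks: two eye corners and two mouth corners.
pts = np.array([[30.0, 40.0], [70.0, 41.0], [35.0, 80.0], [66.0, 79.0]])
print(symmetry_score(pts, pairs=[(0, 1), (2, 3)]))         # smaller = more symmetric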
Turning a CLIP Model into a Scene Text Spotter
We exploit the potential of the large-scale Contrastive Language-Image
Pretraining (CLIP) model to enhance scene text detection and spotting tasks,
transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes
visual prompt learning and cross-attention in CLIP to extract image and
text-based prior knowledge. Using predefined and learnable prompts,
FastTCM-CR50 introduces an instance-language matching process to enhance the
synergy between image and text embeddings, thereby refining text regions. Our
Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt
generation, enabling offline computations and improving performance.
FastTCM-CR50 offers several advantages: 1) It can enhance existing text
detectors and spotters, improving performance by an average of 1.7% and 1.5%,
respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an
average improvement of 0.2% and 0.56% in text detection and spotting tasks,
along with a 48.5% increase in inference speed. 3) It showcases robust few-shot
training capabilities. Utilizing only 10% of the supervised data, FastTCM-CR50
improves performance by an average of 26.5% and 5.5% for text detection and
spotting tasks, respectively. 4) It consistently enhances performance on
out-of-distribution text detection and spotting datasets, particularly the
NightTime-ArT subset from ICDAR2019-ArT and the DOTA dataset for oriented
object detection. The code is available at https://github.com/wenwenyu/TCM.
Comment: arXiv admin note: text overlap with arXiv:2302.1433
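To illustrate the general idea of matching image regions against text prompts with CLIP (a generic sketch built on OpenAI's public clip package, not FastTCM-CR50 or its Bimodal Similarity Matching module), the snippet below scores candidate region crops against a text prompt by cosine similarity; the crop file names are hypothetical.

# Generic instance-language matching sketch with the public OpenAI CLIP
# package: candidate region crops are scored against a text prompt by cosine
# similarity. Illustration only; not the FastTCM-CR50 implementation.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)        # ResNet-50 CLIP backbone

prompt = clip.tokenize(["a photo of text"]).to(device)
crops = [Image.open(p) for p in ["region_0.png", "region_1.png"]]   # hypothetical region crops
images = torch.stack([preprocess(c) for c in crops]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(images)
    txt_feat = model.encode_text(prompt)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    scores = (img_feat @ txt_feat.T).squeeze(-1)             # higher = more text-like region

print(scores.tolist())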