268 research outputs found
Sparsity-inducing dictionaries for effective action classification
Action recognition in unconstrained videos is one of the most important challenges in computer vision. In this paper, we propose sparsity-inducing dictionaries as an effective representation for action classification in videos. We demonstrate that features obtained from sparsity based representation provide discriminative information useful for classification of action videos into various action classes. We show that the constructed dictionaries are distinct for a large number of action classes resulting in a significant improvement in classification accuracy on the HMDB51 dataset. We further demonstrate the efficacy of dictionaries and sparsity based classification on other large action video datasets like UCF50
REPRESENTATION LEARNING FOR ACTION RECOGNITION
The objective of this research work is to develop discriminative representations for human
actions. The motivation stems from the fact that there are many issues encountered while
capturing actions in videos like intra-action variations (due to actors, viewpoints, and duration),
inter-action similarity, background motion, and occlusion of actors. Hence, obtaining
a representation which can address all the variations in the same action while maintaining
discrimination with other actions is a challenging task. In literature, actions have been represented
either using either low-level or high-level features. Low-level features describe
the motion and appearance in small spatio-temporal volumes extracted from a video. Due
to the limited space-time volume used for extracting low-level features, they are not able
to account for viewpoint and actor variations or variable length actions. On the other hand,
high-level features handle variations in actors, viewpoints, and duration but the resulting
representation is often high-dimensional which introduces the curse of dimensionality. In
this thesis, we propose new representations for describing actions by combining the advantages
of both low-level and high-level features. Specifically, we investigate various linear
and non-linear decomposition techniques to extract meaningful attributes in both high-level
and low-level features. In the first approach, the sparsity of high-level feature descriptors is leveraged to build
action-specific dictionaries. Each dictionary retains only the discriminative information
for a particular action and hence reduces inter-action similarity. Then, a sparsity-based
classification method is proposed to classify the low-rank representation of clips obtained
using these dictionaries. We show that this representation based on dictionary learning improves
the classification performance across actions. Also, a few of the actions consist of
rapid body deformations that hinder the extraction of local features from body movements.
Hence, we propose to use a dictionary which is trained on convolutional neural network
(CNN) features of the human body in various poses to reliably identify actors from the
background. Particularly, we demonstrate the efficacy of sparse representation in the identification
of the human body under rapid and substantial deformation.
In the first two approaches, sparsity-based representation is developed to improve discriminability
using class-specific dictionaries that utilize action labels. However, developing
an unsupervised representation of actions is more beneficial as it can be used to both
recognize similar actions and localize actions. We propose to exploit inter-action similarity
to train a universal attribute model (UAM) in order to learn action attributes (common and
distinct) implicitly across all the actions. Using maximum aposteriori (MAP) adaptation,
a high-dimensional super action-vector (SAV) for each clip is extracted. As this SAV contains
redundant attributes of all other actions, we use factor analysis to extract a novel lowvi
dimensional action-vector representation for each clip. Action-vectors are shown to suppress
background motion and highlight actions of interest in both trimmed and untrimmed
clips that contributes to action recognition without the help of any classifiers.
It is observed during our experiments that action-vector cannot effectively discriminate
between actions which are visually similar to each other. Hence, we subject action-vectors
to supervised linear embedding using linear discriminant analysis (LDA) and probabilistic
LDA (PLDA) to enforce discrimination. Particularly, we show that leveraging complimentary
information across action-vectors using different local features followed by discriminative
embedding provides the best classification performance. Further, we explore
non-linear embedding of action-vectors using Siamese networks especially for fine-grained
action recognition. A visualization of the hidden layer output in Siamese networks shows
its ability to effectively separate visually similar actions. This leads to better classification
performance than linear embedding on fine-grained action recognition.
All of the above approaches are presented on large unconstrained datasets with hundreds
of examples per action. However, actions in surveillance videos like snatch thefts are
difficult to model because of the diverse variety of scenarios in which they occur and very
few labeled examples. Hence, we propose to utilize the universal attribute model (UAM)
trained on large action datasets to represent such actions. Specifically, we show that there
are similarities between certain actions in the large datasets with snatch thefts which help
in extracting a representation for snatch thefts using the attributes from the UAM. This
representation is shown to be effective in distinguishing snatch thefts from regular actions
with high accuracy.In summary, this thesis proposes both supervised and unsupervised approaches for representing
actions which provide better discrimination than existing representations. The
first approach presents a dictionary learning based sparse representation for effective discrimination
of actions. Also, we propose a sparse representation for the human body based
on dictionaries in order to recognize actions with rapid body deformations. In the next
approach, a low-dimensional representation called action-vector for unsupervised action
recognition is presented. Further, linear and non-linear embedding of action-vectors is
proposed for addressing inter-action similarity and fine-grained action recognition, respectively.
Finally, we propose a representation for locating snatch thefts among thousands of
regular interactions in surveillance videos
Going Deep in Medical Image Analysis: Concepts, Methods, Challenges and Future Directions
Medical Image Analysis is currently experiencing a paradigm shift due to Deep
Learning. This technology has recently attracted so much interest of the
Medical Imaging community that it led to a specialized conference in `Medical
Imaging with Deep Learning' in the year 2018. This article surveys the recent
developments in this direction, and provides a critical review of the related
major aspects. We organize the reviewed literature according to the underlying
Pattern Recognition tasks, and further sub-categorize it following a taxonomy
based on human anatomy. This article does not assume prior knowledge of Deep
Learning and makes a significant contribution in explaining the core Deep
Learning concepts to the non-experts in the Medical community. Unique to this
study is the Computer Vision/Machine Learning perspective taken on the advances
of Deep Learning in Medical Imaging. This enables us to single out `lack of
appropriately annotated large-scale datasets' as the core challenge (among
other challenges) in this research direction. We draw on the insights from the
sister research fields of Computer Vision, Pattern Recognition and Machine
Learning etc.; where the techniques of dealing with such challenges have
already matured, to provide promising directions for the Medical Imaging
community to fully harness Deep Learning in the future
Peripheral mechanisms of cancer-induced bone pain
With the rise in cancer rates and better treatments which prolong life expectancy for patients, the prevalence of cancer associated pain is also increasing. Many cancer patients suffer from painful metastases to the bone, which have a significant impact on quality of life and constitute an economic burden on society. Opioids, the most widely used treatment for metastatic cancer pain, especially in advanced stages of the disease, are associated with severe side effects. While preclinical work suggests anti-NGF therapy may be a useful strategy for malignant bone pain relief, it’s efficacy in the clinic has been disappointing. Like for other forms of chronic pain, the establishment and maintenance of metastatic bone pain depend on peripheral inputs. Elucidating novel peripheral mechanisms and targets for therapy remains a key challenge in the field. This thesis investigates peripheral mechanisms involved in cancer-induced bone pain, by using molecular, transgenic and in vivo imaging approaches, providing new evidence for peripheral changes in cancer-induced bone pain and potential targets for further therapeutic investigation. Firstly, we explored the role of the voltage gated sodium channel Nav1.9, which together with Nav1.7 and Nav1.8 is preferentially expressed in the peripheral nervous system, in metastatic cancer pain. Nav1.9 does not play a role in bone cancer pain, as identified in global knockout mice. Through microarray analysis we identified novel genes and pathways which are dysregulated in the peripheral nervous system in cancer-induced bone pain. A large proportion of differentially expressed genes were microRNAs, suggesting large changes at the posttranscriptional level. Five identified protein coding genes had been previously associated with pain (Adamts5, Adcyap1, Calca, Gal, Nts), but only neurotensin and galanin have been described in the context of cancer-induced bone pain. The three other genes may constitute novel targets for analgesia. Secondly, we investigated the molecular profile of mouse bone marrow afferent neurons by using transgenic mouse reporter lines. Contrary to previous reports in the literature based on immunohistochemistry, we found more than three quarters of bone marrow afferents express the nociceptive neuron’s marker Nav1.8. Additionally, bone afferents never expressed the marker parvalbumin, indicating they are not involved in proprioception. Thirdly, using in vivo calcium imaging we found no increase in the excitability of bone marrow afferents in cancer-bearing animals. However, a larger proportion of cutaneous afferents responded to mechanical stimulation in animals with metastatic bone pain, reflecting behavioural mechanical hypersensitivity. Cutaneous afferents showed increased calcium transients to both thermal and mechanical stimuli in animals with cancer-induced bone pain, suggesting hyperexcitability in the peripheral nervous system contributes to secondary hyperalgesia. Fourthly, based on clinical and pre-clinical evidence of pain relief by osteoclast targeting agents, we wondered if increased activity of these cells alone is sufficient to induce bone pain. While a localized increase in osteoclast activity could not be achieved, a widespread model of osteoclast activation through multiple nuclear factor κB ligand (RANKL) injections, resulted in decreased bone mineral density, without producing any symptoms of pain. Pain may be induced only after reaching a certain threshold of osteoclast activity, or additional changes in the bone microenvironment are needed for a phenotypic switch from physiologic to inflammatory osteoclast
VIDEO FOREGROUND LOCALIZATION FROM TRADITIONAL METHODS TO DEEP LEARNING
These days, detection of Visual Attention Regions (VAR), such as moving objects has become an integral part of many Computer Vision applications, viz. pattern recognition, object detection and classification, video surveillance, autonomous driving, human-machine interaction (HMI), and so forth. The moving object identification using bounding boxes has matured to the level of localizing the objects along their rigid borders and the process is called foreground localization (FGL). Over the decades, many image segmentation methodologies have been well studied, devised, and extended to suit the video FGL. Despite that, still, the problem of video foreground (FG) segmentation remains an intriguing task yet appealing due to its ill-posed nature and myriad of applications. Maintaining spatial and temporal coherence, particularly at object boundaries, persists challenging, and computationally burdensome. It even gets harder when the background possesses dynamic nature, like swaying tree branches or shimmering water body, and illumination variations, shadows cast by the moving objects, or when the video sequences have jittery frames caused by vibrating or unstable camera mounts on a surveillance post or moving robot. At the same time, in the analysis of traffic flow or human activity, the performance of an intelligent system substantially depends on its robustness of localizing the VAR, i.e., the FG. To this end, the natural question arises as what is the best way to deal with these challenges? Thus, the goal of this thesis is to investigate plausible real-time performant implementations from traditional approaches to modern-day deep learning (DL) models for FGL that can be applicable to many video content-aware applications (VCAA). It focuses mainly on improving existing methodologies through harnessing multimodal spatial and temporal cues for a delineated FGL. The first part of the dissertation is dedicated for enhancing conventional sample-based and Gaussian mixture model (GMM)-based video FGL using probability mass function (PMF), temporal median filtering, and fusing CIEDE2000 color similarity, color distortion, and illumination measures, and picking an appropriate adaptive threshold to extract the FG pixels. The subjective and objective evaluations are done to show the improvements over a number of similar conventional methods. The second part of the thesis focuses on exploiting and improving deep convolutional neural networks (DCNN) for the problem as mentioned earlier. Consequently, three models akin to encoder-decoder (EnDec) network are implemented with various innovative strategies to improve the quality of the FG segmentation. The strategies are not limited to double encoding - slow decoding feature learning, multi-view receptive field feature fusion, and incorporating spatiotemporal cues through long-shortterm memory (LSTM) units both in the subsampling and upsampling subnetworks. Experimental studies are carried out thoroughly on all conditions from baselines to challenging video sequences to prove the effectiveness of the proposed DCNNs. The analysis demonstrates that the architectural efficiency over other methods while quantitative and qualitative experiments show the competitive performance of the proposed models compared to the state-of-the-art
- …