384 research outputs found
ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain
Transformer design is the de facto standard for natural language processing
tasks. The success of the transformer design in natural language processing has
lately piqued the interest of researchers in the domain of computer vision.
When compared to Convolutional Neural Networks (CNNs), Vision Transformers
(ViTs) are becoming more popular and dominant solutions for many vision
problems. Transformer-based models outperform other types of networks, such as
convolutional and recurrent neural networks, in a range of visual benchmarks.
We evaluate various vision transformer models in this work by dividing them
into distinct tasks and examining their benefits and drawbacks. ViTs can
overcome several possible difficulties with convolutional neural networks
(CNNs). The goal of this survey is to showcase the first uses of ViTs in CV. In the
first phase, we categorize various CV applications where ViTs are appropriate,
including image classification, object detection, image segmentation, video
transformers, image denoising, and neural architecture search (NAS). We then
analyze the state of the art in each area and identify the models that are
currently available. In addition, we outline numerous open research
challenges as well as prospective research directions.
Comment: ICCD-2023. arXiv admin note: substantial text overlap with arXiv:2208.04309 by other authors
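The common starting point of the ViT models surveyed above is treating an image as a sequence of flattened patches. A minimal sketch of that tokenization step follows; the image size, patch size, and nested-list image representation are illustrative assumptions, not taken from any particular model:

```python
# Split an H x W x C image (nested lists) into non-overlapping P x P patches,
# each flattened into one vector -- the token sequence a ViT consumes.
def patchify(image, patch_size):
    h, w, c = len(image), len(image[0]), len(image[0][0])
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for py in range(0, h, patch_size):
        for px in range(0, w, patch_size):
            flat = []
            for dy in range(patch_size):
                for dx in range(patch_size):
                    flat.extend(image[py + dy][px + dx])
            patches.append(flat)
    return patches

# A 224x224 RGB image with 16x16 patches yields the familiar 196-token sequence.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))  # 196 tokens, each of length 16*16*3 = 768
```

In a full model, each flattened patch would then be projected to the embedding dimension and prepended with a class token before entering the transformer encoder.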
On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator
Deployed image classification pipelines typically depend on images captured in real-world environments. This means that images might be affected by different sources of perturbation (e.g., sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks, which has attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
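The evaluation protocol above (train on clean images, test under synthetic perturbation) is straightforward to reproduce. The sketch below applies the two noise types mentioned, Gaussian and uniform, to a grayscale test image; the noise levels and image representation are illustrative, and the CORF operator itself is not implemented here:

```python
import random

def add_gaussian_noise(image, sigma, seed=0):
    """Perturb each pixel of a grayscale image (rows of 0-255 values), clamping to range."""
    rng = random.Random(seed)
    return [[min(255.0, max(0.0, px + rng.gauss(0, sigma))) for px in row]
            for row in image]

def add_uniform_noise(image, amplitude, seed=0):
    """Same, with noise drawn uniformly from [-amplitude, amplitude]."""
    rng = random.Random(seed)
    return [[min(255.0, max(0.0, px + rng.uniform(-amplitude, amplitude))) for px in row]
            for row in image]

clean = [[128.0] * 28 for _ in range(28)]   # a flat 28x28 stand-in test image
noisy_g = add_gaussian_noise(clean, sigma=25)
noisy_u = add_uniform_noise(clean, amplitude=40)
```

Sweeping `sigma` and `amplitude` over a range of values and plotting test accuracy per level reproduces the kind of robustness curves such studies report.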
A Closer Look into Recent Video-based Learning Research: A Comprehensive Review of Video Characteristics, Tools, Technologies, and Learning Effectiveness
People increasingly use videos on the Web as a source for learning. To
support this way of learning, researchers and developers are continuously
developing tools, proposing guidelines, analyzing data, and conducting
experiments. However, it is still not clear what characteristics a video should
have to be an effective learning medium. In this paper, we present a
comprehensive review of 257 articles on video-based learning for the period
from 2016 to 2021. One of the aims of the review is to identify the video
characteristics that have been explored by previous work. Based on our
analysis, we suggest a taxonomy which organizes the video characteristics and
contextual aspects into eight categories: (1) audio features, (2) visual
features, (3) textual features, (4) instructor behavior, (5) learners
activities, (6) interactive features (quizzes, etc.), (7) production style, and
(8) instructional design. Also, we identify four representative research
directions: (1) proposals of tools to support video-based learning, (2) studies
with controlled experiments, (3) data analysis studies, and (4) proposals of
design guidelines for learning videos. We find that the most explored
characteristics are textual features followed by visual features, learner
activities, and interactive features. Text of transcripts, video frames, and
images (figures and illustrations) are most frequently used by tools that
support learning through videos. The learner activity is heavily explored
through log files in data analysis studies, and interactive features have been
frequently scrutinized in controlled experiments. We complement our review by
contrasting research findings that investigate the impact of video
characteristics on the learning effectiveness, report on tasks and technologies
used to develop tools that support learning, and summarize trends of design
guidelines to produce learning videos.
Deep Learning in Medical Image Analysis
The accelerating power of deep learning in diagnosing diseases will empower physicians and speed up decision making in clinical environments. Applications of modern medical instruments and the digitalization of medical care have generated enormous amounts of medical images in recent years. In this big-data arena, new deep learning methods and computational models for efficient data processing, analysis, and modeling of the generated data are crucially important for clinical applications and for understanding the underlying biological processes. This book presents and highlights novel algorithms, architectures, techniques, and applications of deep learning for medical image analysis.
Superpixel labeling for medical image segmentation
Nowadays, most methods for image segmentation treat images in a pixel-wise
manner, which is laborious and time-consuming. Superpixel labeling, on the
other hand, can make the segmentation task easier in several respects. First,
superpixels carry more information than pixels because they usually follow the
edges present in the image. Furthermore, superpixels have perceptual meaning,
and finally, they can be very useful in computationally demanding problems,
since mapping pixels to superpixels reduces the complexity of the problem. In
this thesis, we propose superpixel-wise labeling on two medical image datasets,
ISIC Skin Lesion and Chest X-ray, and feed them to the U-Net and DoubleU-Net
Convolutional Neural Networks (CNNs) and the Dual-Aggregation Transformer
(DuAT) network to segment the images in terms of superpixels. Three labeling
methods are used in this thesis: Superpixel Labeling, Extended Superpixel
Labeling (Distance-based Labeling), and Random Walk Superpixel Labeling. The
superpixel-labeled ground truths are used only for training; for evaluation, we
use the original image and the original binary ground truth. We considered four
superpixel algorithms, namely Simple Linear Iterative Clustering (SLIC),
Felzenszwalb-Huttenlocher (FH), QuickShift (QS), and Superpixels Extracted via
Energy-Driven Sampling (SEEDS). We evaluate the segmentation results with
metrics such as the Dice coefficient, precision, Intersection over Union (IoU),
and sensitivity. Our results show Dice coefficients of 0.89 and 0.95 for the
skin lesion and chest X-ray datasets, respectively.
Key Words: Superpixels, Medical Images, U-Net, DoubleU-Net, Image segmentation, CNN, DuAT, SEEDS.
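The Dice coefficient and IoU used for evaluation above have simple set definitions: for binary masks they reduce to counts of overlapping foreground pixels. A minimal sketch (the example masks are made up for illustration):

```python
def dice(pred, gt):
    """Dice = 2|A∩B| / (|A|+|B|) over binary masks given as flat 0/1 lists."""
    inter = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2.0 * inter / total if total else 1.0

def iou(pred, gt):
    """IoU = |A∩B| / |A∪B| over the same flat 0/1 masks."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0

pred = [1, 1, 1, 0, 0, 0]
gt   = [1, 1, 0, 0, 1, 0]
print(dice(pred, gt))  # 2*2 / (3+3) = 0.666...
print(iou(pred, gt))   # 2 / 4 = 0.5
```

Note that Dice is always at least as large as IoU on the same masks, which is why the two are reported together rather than interchangeably.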
Improving and Scaling Mobile Learning via Emotion and Cognitive-state Aware Interfaces
Massive Open Online Courses (MOOCs) provide high-quality learning materials at low cost to millions of learners. Current MOOC designs, however, have minimal learner-instructor communication channels. This limitation restricts MOOCs from addressing major challenges: low retention rates, frequent distractions, and little personalization in instruction. Previous work enriched learner-instructor communication with physiological signals but was not scalable because of the additional hardware requirement. Large MOOC providers, such as Coursera, have released mobile apps providing more flexibility with “on-the-go” learning environments. This thesis reports an iterative process for the design of mobile intelligent interfaces that can run on unmodified smartphones, implicitly sense multiple modalities from learners, infer learner emotions and cognitive states, and intervene to provide gains in learning.
The first part of this research explores the usage of photoplethysmogram (PPG) signals collected implicitly by the back camera of unmodified smartphones. I explore different deep neural networks in DeepHeart to improve the accuracy (+2.2%) and robustness of heart-rate sensing from noisy PPG signals. The second project, AttentiveLearner, infers mind-wandering events from the collected PPG signals at a performance comparable to systems relying on dedicated physiological sensors (Kappa = 0.22). By leveraging these fine-grained cognitive states, the third project, AttentiveReview, achieves significant (+17.4%) learning gains by providing personalized interventions based on learners’ perceived difficulty.
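At its simplest, the heart-rate sensing step underlying this line of work reduces to counting pulse peaks in the PPG waveform over a time window. The sketch below uses a synthetic sinusoidal pulse in place of real camera PPG data; the sampling rate, peak threshold, and signal shape are illustrative assumptions, not details from the thesis:

```python
import math

def heart_rate_bpm(signal, fs, threshold=0.5):
    """Estimate heart rate by counting local maxima above a threshold."""
    peaks = 0
    for i in range(1, len(signal) - 1):
        if signal[i] > threshold and signal[i - 1] < signal[i] >= signal[i + 1]:
            peaks += 1
    duration_s = len(signal) / fs
    return 60.0 * peaks / duration_s

# Synthetic "PPG" pulse: a 1.2 Hz sine sampled at 30 fps for 10 seconds,
# i.e. a heart rate of about 72 bpm.
fs = 30.0
signal = [math.sin(2 * math.pi * 1.2 * i / fs) for i in range(300)]
print(heart_rate_bpm(signal, fs))  # ~72 bpm
```

Real camera PPG is far noisier than this, which is exactly the motivation for replacing threshold-based peak counting with learned models such as DeepHeart.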
The latter part of this research adds real-time facial analysis from the front camera in addition to the PPG sensing from the back camera. AttentiveLearner2 achieves more robust emotion inference (average accuracy = 84.4%) in mobile MOOC learning. According to a longitudinal study with 28 subjects for three weeks, AttentiveReview2, with the multimodal sensing component, improves learning gain by 28.0% with high usability ratings (average System Usability Scale = 80.5).
Finally, I show that technologies in this dissertation not only benefit MOOC learning, but also other emerging areas such as computational advertising and behavior targeting. AttentiveVideo, building on top of the sensing architecture in AttentiveLearner2, quantifies emotional responses to mobile video advertisements. In a 24-participant study, AttentiveVideo achieved good accuracy on a wide range of emotional measures (best accuracy = 82.6% across 9 measures)
Model-driven and Data-driven Methods for Recognizing Compositional Interactions from Videos
The ability to accurately understand how humans interact with their surroundings is critical for many vision-based intelligent systems. Compared to simple atomic actions (e.g., raise hand), many interactions found in our daily lives are defined as a composition of an atomic action with a variety of arguments (e.g., pick up a pen). Despite recent progress in the literature, there still remain fundamental challenges unique to recognizing interactions from videos. First, most of the action recognition literature assumes a problem setting where a pre-defined set of action labels is supported by a large and relatively balanced set of training examples for those labels. There are many realistic cases where this data assumption breaks down, either because the application demands fine-grained classification of a potentially combinatorial number of activities, and/or because the problem at hand is an “open-set” problem where new labels may be defined at test time. Second, many deep video models simply represent video as a three-dimensional tensor and ignore the differences between the spatial and temporal dimensions during the representation learning stage. As a result, data-driven bottom-up action models frequently over-fit to the static content of the video and fail to accurately capture the dynamic changes in relations among actors in the video.
In this dissertation, we address the aforementioned challenges of recognizing fine-grained interactions from videos by developing solutions that explicitly represent interactions as compositions of simpler static and dynamic elements. By exploiting the power of composition, our ``detection by description'' framework expresses a very rich space of interactions using only a small set of static visual attributes and a few dynamic patterns. A definition of an interaction is constructed on the fly from first-principles state machines which leverage bottom-up deep-learned components such as object detectors. Compared to existing model-driven methods for video understanding, we introduce the notion of dynamic action signatures which allows a practitioner to express the expected temporal behavior of various elements of an interaction. We show that our model-driven approach using dynamic action signatures outperforms other zero-shot methods on multiple public action classification benchmarks and even some fully supervised baselines under realistic problem settings.
Next, we extend our approach to a setting where the static and dynamic action signatures are not given by the user but rather learned from data. We do so by borrowing ideas from data-driven, two-stream action recognition and model-driven, structured human-object interaction detection. The key idea behind our approach is that we can learn the static and dynamic decomposition of an interaction using a dual-pathway network by leveraging object detections. To do so, we introduce the Motion Guided Attention Fusion mechanism which transfers the motion-centric features learned using object detections to the representation learned from the RGB-based motion pathway.
Finally, we conclude with a comprehensive case study on vision-based activity detection applied to video surveillance. Using the methods presented in this dissertation, we step towards an intelligent vision system that can detect a particular interaction instance given only a description from a user, departing from the requirement of massive datasets of labeled training videos. Moreover, as our framework naturally defines a decompositional structure of activities into detectable static/visual attributes, we show that we can simulate the necessary training data to acquire attribute detectors when the desired detector is otherwise unavailable. Our approach achieves competitive or superior performance over existing approaches for recognizing fine-grained interactions from realistic videos.
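The "detection by description" idea of composing an interaction from attribute detections and a first-principles state machine can be illustrated with a toy example. The states, the per-frame predicates, and the "pick up" definition below are hypothetical stand-ins for what per-frame object and hand detectors would produce, not the dissertation's actual signatures:

```python
# Toy state machine for a "pick up object" interaction, driven by per-frame
# boolean predicates that would normally come from bottom-up detectors.
def detect_pickup(frames):
    state = "idle"
    for f in frames:
        if state == "idle" and f["hand_near_object"]:
            state = "reaching"
        elif state == "reaching" and f["hand_touching_object"]:
            state = "grasping"
        elif state == "grasping" and f["object_moving"]:
            return True  # dynamic signature satisfied: pick-up detected
    return False

frames = [
    {"hand_near_object": False, "hand_touching_object": False, "object_moving": False},
    {"hand_near_object": True,  "hand_touching_object": False, "object_moving": False},
    {"hand_near_object": True,  "hand_touching_object": True,  "object_moving": False},
    {"hand_near_object": True,  "hand_touching_object": True,  "object_moving": True},
]
print(detect_pickup(frames))  # True
```

Because the machine is assembled from named predicates, swapping "pen" for any other detectable object, or reordering states to define "put down", requires no retraining, which is the zero-shot appeal of the approach.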
AI and IoT Meet Mobile Machines: Towards a Smart Working Site
Infrastructure construction is society's cornerstone and the economy's catalyst. Therefore, improving mobile machinery's efficiency and reducing its cost of use have enormous economic benefits in the vast and growing construction market. In this thesis, I envision a novel concept, the smart working site, to increase productivity through fleet management from multiple aspects, using Artificial Intelligence (AI) and the Internet of Things (IoT).
- …