384 research outputs found

    ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain

    Full text link
    The transformer architecture is the de facto standard for natural language processing tasks. Its success in natural language processing has lately piqued the interest of researchers in the domain of computer vision. Compared to Convolutional Neural Networks (CNNs), Vision Transformers (ViTs) are becoming more popular and dominant solutions for many vision problems, and transformer-based models outperform other types of networks, such as convolutional and recurrent neural networks, on a range of visual benchmarks. In this work, we evaluate various vision transformer models by grouping them into distinct tasks and examining their benefits and drawbacks. ViTs can overcome several potential difficulties of CNNs. The goal of this survey is to present the first uses of ViTs in computer vision (CV). In the first phase, we categorize the CV applications where ViTs are appropriate: image classification, object identification, image segmentation, video transformers, image denoising, and neural architecture search (NAS). We then analyze the state of the art in each area and identify the models that are currently available. In addition, we outline numerous open research challenges as well as prospective research directions. Comment: ICCD-2023. arXiv admin note: substantial text overlap with arXiv:2208.04309 by other authors.
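    As a concrete illustration of the architecture this survey covers, the following is a minimal PyTorch sketch of a ViT-style classifier (patch embedding, class token, transformer encoder, linear head). All layer sizes and hyperparameters here are illustrative assumptions and are not taken from any model reviewed in the paper.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patch embedding + transformer encoder + linear head."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Non-overlapping patches are embedded with a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.head(self.encoder(tokens)[:, 0])             # classify from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 10)
```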

    Irish Machine Vision and Image Processing Conference Proceedings 2017

    Get PDF

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Get PDF
    Deployed image classification pipelines typically depend on images captured in real-world environments, which means the images may be affected by different sources of perturbation (e.g., sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks, and it has therefore attracted wide interest within the computer vision community. We propose a transformation step that aims to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are computed using the CORF push-pull inhibition operator. This operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classifier without CORF delineation maps, but it consistently and significantly outperformed that baseline on test images perturbed with different levels of Gaussian and uniform noise.
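    For intuition only, the sketch below approximates the push-pull idea with opposing Gabor responses and produces an edge-like map that could be fed to a classifier in place of the raw image. It is a toy stand-in under loose assumptions, not the CORF operator used in the paper, and the frequency, inhibition strength, and orientation count are arbitrary choices.

```python
import numpy as np
from skimage import color, data
from skimage.filters import gabor

def push_pull_edge_map(image, frequency=0.3, alpha=1.0, n_orientations=8):
    """Toy polarity-opponent (push-pull-like) edge map, not the actual CORF operator."""
    responses = []
    for theta in np.linspace(0, np.pi, n_orientations, endpoint=False):
        real, _ = gabor(image, frequency=frequency, theta=theta)
        push = np.maximum(real, 0)    # response of the preferred-polarity filter
        pull = np.maximum(-real, 0)   # response of the opposite-polarity filter
        responses.append(np.maximum(push - alpha * pull, 0))
    # Keep the strongest oriented response per pixel.
    return np.max(responses, axis=0)

gray = color.rgb2gray(data.astronaut())
delineation = push_pull_edge_map(gray)  # this map, not the raw image, would go to the CNN
```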

    A Closer Look into Recent Video-based Learning Research: A Comprehensive Review of Video Characteristics, Tools, Technologies, and Learning Effectiveness

    Full text link
    People increasingly use videos on the Web as a source for learning. To support this way of learning, researchers and developers are continuously developing tools, proposing guidelines, analyzing data, and conducting experiments. However, it is still not clear what characteristics a video should have to be an effective learning medium. In this paper, we present a comprehensive review of 257 articles on video-based learning published between 2016 and 2021. One aim of the review is to identify the video characteristics that have been explored by previous work. Based on our analysis, we suggest a taxonomy that organizes video characteristics and contextual aspects into eight categories: (1) audio features, (2) visual features, (3) textual features, (4) instructor behavior, (5) learner activities, (6) interactive features (quizzes, etc.), (7) production style, and (8) instructional design. We also identify four representative research directions: (1) proposals of tools to support video-based learning, (2) studies with controlled experiments, (3) data analysis studies, and (4) proposals of design guidelines for learning videos. We find that the most explored characteristics are textual features, followed by visual features, learner activities, and interactive features. Transcript text, video frames, and images (figures and illustrations) are the inputs most frequently used by tools that support learning through videos. Learner activity is heavily explored through log files in data analysis studies, and interactive features have been frequently scrutinized in controlled experiments. We complement our review by contrasting research findings on the impact of video characteristics on learning effectiveness, reporting on the tasks and technologies used to develop tools that support learning, and summarizing trends in design guidelines for producing learning videos.

    Deep Learning in Medical Image Analysis

    Get PDF
    The accelerating power of deep learning in diagnosing diseases will empower physicians and speed up decision making in clinical environments. Applications of modern medical instruments and the digitization of medical care have generated enormous amounts of medical images in recent years. In this big-data arena, new deep learning methods and computational models for efficient processing, analysis, and modeling of the generated data are crucially important for clinical applications and for understanding the underlying biological processes. This book presents and highlights novel algorithms, architectures, techniques, and applications of deep learning for medical image analysis.

    Superpixel labeling for medical image segmentation

    Get PDF
    Nowadays, most methods for image segmentation treat images in a pixel-wise manner, which is both laborious and time-consuming. Superpixel labeling, on the other hand, can make the segmentation task easier in several respects. First, superpixels carry more information than pixels because they usually follow the edges present in the image. Furthermore, superpixels have perceptual meaning, and they can be very useful in computationally demanding problems, since mapping pixels to superpixels reduces the complexity of the problem. In this thesis, we propose superpixel-wise labeling on two medical image datasets, ISIC Skin Lesion and Chest X-ray, and feed them to the U-Net Convolutional Neural Network (CNN), DoubleU-Net, and Dual Aggregation Transformer (DuAT) networks to segment the images in terms of superpixels. Three different labeling methods are used in this thesis: Superpixel Labeling, Extended Superpixel Labeling (Distance-based Labeling), and Random Walk Superpixel Labeling. The superpixel-labeled ground truths are used only for training; for evaluation, we consider the original image and the original binary ground truth. We considered four different superpixel algorithms, namely Simple Linear Iterative Clustering (SLIC), Felzenszwalb-Huttenlocher (FH), QuickShift (QS), and Superpixels Extracted via Energy-Driven Sampling (SEEDS). We evaluate the segmentation results with metrics such as the Dice coefficient, precision, Intersection over Union (IoU), and sensitivity. Our results show Dice coefficients of 0.89 and 0.95 for the skin lesion and chest X-ray datasets, respectively. Key Words: Superpixels, Medical Images, U-Net, DoubleU-Net, Image segmentation, CNN, DuAT, SEEDS.
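    As a rough illustration of superpixel-wise labeling and Dice evaluation, the sketch below converts a pixel-wise binary ground truth into a superpixel-wise one with SLIC and scores a prediction with the Dice coefficient. The majority-vote threshold and SLIC parameters are assumptions, and the thesis's distance-based and random-walk labeling variants are not reproduced here.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_label(image, binary_mask, n_segments=400, threshold=0.5):
    """Label each SLIC superpixel as foreground if enough of its pixels are foreground.
    `image` is assumed to be RGB (H, W, 3); for grayscale pass channel_axis=None to slic."""
    segments = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    sp_mask = np.zeros_like(binary_mask, dtype=np.uint8)
    for sp in np.unique(segments):
        region = segments == sp
        if binary_mask[region].mean() >= threshold:
            sp_mask[region] = 1
    return sp_mask

def dice(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Usage: train on superpixel_label(img, gt), evaluate predictions with dice(pred, gt).
```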

    Improving and Scaling Mobile Learning via Emotion and Cognitive-state Aware Interfaces

    Get PDF
    Massive Open Online Courses (MOOCs) provide high-quality learning materials at low cost to millions of learners. Current MOOC designs, however, have minimal learner-instructor communication channels. This limitation keeps MOOCs from addressing major challenges: low retention rates, frequent distractions, and little personalization in instruction. Previous work enriched learner-instructor communication with physiological signals but did not scale because of the additional hardware required. Large MOOC providers, such as Coursera, have released mobile apps providing more flexibility with “on-the-go” learning environments. This thesis reports an iterative process for the design of mobile intelligent interfaces that run on unmodified smartphones, implicitly sense multiple modalities from learners, infer learner emotions and cognitive states, and intervene to provide gains in learning. The first part of this research explores photoplethysmogram (PPG) signals collected implicitly with the back camera of unmodified smartphones. I explore different deep neural networks (DeepHeart) to improve the accuracy (+2.2%) and robustness of heart rate sensing from noisy PPG signals. The second project, AttentiveLearner, infers mind-wandering events from the collected PPG signals at a performance comparable to systems relying on dedicated physiological sensors (Kappa = 0.22). By leveraging these fine-grained cognitive states, the third project, AttentiveReview, achieves significant (+17.4%) learning gains by providing personalized interventions based on learners’ perceived difficulty. The latter part of this research adds real-time facial analysis from the front camera to the PPG sensing from the back camera. AttentiveLearner2 achieves more robust emotion inference (average accuracy = 84.4%) in mobile MOOC learning. In a three-week longitudinal study with 28 subjects, AttentiveReview2, with its multimodal sensing component, improved learning gains by 28.0% with high usability ratings (average System Usability Scale = 80.5). Finally, I show that the technologies in this dissertation benefit not only MOOC learning but also other emerging areas such as computational advertising and behavior targeting. AttentiveVideo, built on top of the sensing architecture of AttentiveLearner2, quantifies emotional responses to mobile video advertisements. In a 24-participant study, AttentiveVideo achieved good accuracy on a wide range of emotional measures (best accuracy = 82.6% across 9 measures).
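    For context, the sketch below shows a classical (non-deep) baseline for the kind of camera-based PPG sensing this thesis builds on: average the red channel of fingertip frames, band-pass the trace to the plausible heart-rate band, and take the dominant frequency. This is not DeepHeart; the frame rate and cut-off frequencies are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def heart_rate_from_ppg(frames, fps=30.0, low_hz=0.7, high_hz=3.0):
    """Estimate heart rate (BPM) from back-camera frames covering a fingertip."""
    trace = np.array([frame[..., 0].mean() for frame in frames])  # mean red intensity per frame
    trace = trace - trace.mean()
    # Band-pass to roughly 42-180 BPM, then take the dominant frequency.
    b, a = butter(3, [low_hz / (fps / 2), high_hz / (fps / 2)], btype="band")
    filtered = filtfilt(b, a, trace)
    spectrum = np.abs(np.fft.rfft(filtered))
    freqs = np.fft.rfftfreq(len(filtered), d=1.0 / fps)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    return 60.0 * freqs[band][np.argmax(spectrum[band])]

# frames: e.g. ~10 seconds of (H, W, 3) uint8 arrays captured at 30 fps
```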

    Model-driven and Data-driven Methods for Recognizing Compositional Interactions from Videos

    Get PDF
    The ability to accurately understand how humans interact with their surroundings is critical for many vision-based intelligent systems. Compared to simple atomic actions (e.g., raising a hand), many interactions found in our daily lives are defined as a composition of an atomic action with a variety of arguments (e.g., picking up a pen). Despite recent progress in the literature, there remain fundamental challenges unique to recognizing interactions from videos. First, most of the action recognition literature assumes a problem setting where a pre-defined set of action labels is supported by a large and relatively balanced set of training examples for those labels. There are many realistic cases where this data assumption breaks down, either because the application demands fine-grained classification of a potentially combinatorial number of activities, or because the problem at hand is an "open-set" problem where new labels may be defined at test time. Second, many deep video models simply represent video as a three-dimensional tensor and ignore the differences between the spatial and temporal dimensions during representation learning. As a result, data-driven bottom-up action models frequently overfit to the static content of the video and fail to accurately capture the dynamic changes in relations among actors in the video. In this dissertation, we address these challenges of recognizing fine-grained interactions from videos by developing solutions that explicitly represent interactions as compositions of simpler static and dynamic elements. By exploiting the power of composition, our "detection by description" framework expresses a very rich space of interactions using only a small set of static visual attributes and a few dynamic patterns. A definition of an interaction is constructed on the fly from first-principles state machines which leverage bottom-up deep-learned components such as object detectors. Compared to existing model-driven methods for video understanding, we introduce the notion of dynamic action signatures, which allow a practitioner to express the expected temporal behavior of the various elements of an interaction. We show that our model-driven approach using dynamic action signatures outperforms other zero-shot methods on multiple public action classification benchmarks, and even some fully supervised baselines under realistic problem settings. Next, we extend our approach to a setting where the static and dynamic action signatures are not given by the user but learned from data. We do so by borrowing ideas from data-driven, two-stream action recognition and from model-driven, structured human-object interaction detection. The key idea behind our approach is that we can learn the static and dynamic decomposition of an interaction with a dual-pathway network by leveraging object detections. To do so, we introduce the Motion Guided Attention Fusion mechanism, which transfers the motion-centric features learned using object detections to the representation learned from the RGB-based motion pathway. Finally, we conclude with a comprehensive case study on vision-based activity detection applied to video surveillance. Using the methods presented in this dissertation, we step towards an intelligent vision system that can detect a particular interaction instance given only a description from a user, departing from the need for massive datasets of labeled training videos. Moreover, as our framework naturally decomposes activities into detectable static visual attributes, we show that we can simulate the training data needed to acquire attribute detectors when the desired detector is otherwise unavailable. Our approach achieves competitive or superior performance over existing approaches for recognizing fine-grained interactions from realistic videos.
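    To make the "detection by description" idea concrete, the sketch below implements a toy state machine that declares an interaction detected once a sequence of per-frame predicates over detector outputs has been satisfied in order. The attribute names and thresholds are hypothetical placeholders and do not come from the dissertation.

```python
from typing import Callable, Dict, List

# A per-frame observation is a dict of attribute values produced by upstream
# components (object detector, tracker, ...). The keys below are hypothetical.
Observation = Dict[str, float]

class InteractionFSM:
    """Toy state machine: an interaction is a fixed sequence of per-frame predicates,
    detected once every stage has been satisfied in order."""
    def __init__(self, stages: List[Callable[[Observation], bool]]):
        self.stages = stages
        self.current = 0

    def step(self, obs: Observation) -> bool:
        if self.current < len(self.stages) and self.stages[self.current](obs):
            self.current += 1
        return self.current == len(self.stages)  # True once all stages have matched

# "Pick up a pen" composed from simple static attributes and one dynamic cue.
pick_up_pen = InteractionFSM([
    lambda o: o["hand_pen_distance"] < 0.1,    # hand approaches the pen
    lambda o: o["hand_holds_pen"] > 0.5,       # contact established
    lambda o: o["pen_vertical_velocity"] > 0,  # pen moves upward
])

for obs in [{"hand_pen_distance": 0.05, "hand_holds_pen": 0.1, "pen_vertical_velocity": 0.0},
            {"hand_pen_distance": 0.02, "hand_holds_pen": 0.9, "pen_vertical_velocity": 0.0},
            {"hand_pen_distance": 0.02, "hand_holds_pen": 0.9, "pen_vertical_velocity": 0.3}]:
    detected = pick_up_pen.step(obs)
print(detected)  # True after the three illustrative frames
```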

    AI and IoT Meet Mobile Machines: Towards a Smart Working Site

    Get PDF
    Infrastructure construction is a cornerstone of society and a catalyst of the economy. Improving the efficiency of mobile machinery and reducing its cost of use therefore carries enormous economic benefits in the vast and growing construction market. In this thesis, I envision a novel concept, the smart working site, which increases productivity through fleet management from multiple aspects, enabled by Artificial Intelligence (AI) and the Internet of Things (IoT).