
    Semantic Image Segmentation Using Region Bank

    Semantic image segmentation assigns a predefined class label to each pixel. This paper proposes a unified framework that uses a region bank to solve this task. Images are hierarchically segmented, yielding region banks. Local features and high-level descriptors are extracted for each region in the banks. Discriminative classifiers are learned from histograms of feature descriptors computed over the training region bank (TRB). Optimally merging the predicted regions of the query region bank (QRB) yields the semantic labeling. This paper details each algorithmic module used in our system; however, any algorithm that fits the corresponding module can be plugged into the proposed framework. Experiments on the challenging Microsoft Research Cambridge (MSRC 21) dataset show that the proposed approach achieves state-of-the-art performance.
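    The pipeline described above can be sketched end to end. The following is a minimal toy illustration, not the authors' implementation: it assumes grayscale intensity histograms as the region descriptor, recursive quadrant splitting as the hierarchical segmenter, and a nearest-centroid classifier standing in for the discriminative classifiers.

```python
import numpy as np

def region_histogram(region, bins=8):
    # Quantize pixel intensities into a normalized histogram descriptor.
    hist, _ = np.histogram(region, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def hierarchical_regions(image, depth=2):
    # Toy hierarchical segmentation: recursively split into quadrants,
    # collecting every region at every level into the "bank".
    bank = [image]
    if depth > 0:
        h, w = image.shape[:2]
        for r in (slice(0, h // 2), slice(h // 2, h)):
            for c in (slice(0, w // 2), slice(w // 2, w)):
                bank += hierarchical_regions(image[r, c], depth - 1)
    return bank

# Build a training region bank (TRB) from synthetic two-class data:
# dark "background" pixels (class 0) vs. bright "object" pixels (class 1).
rng = np.random.default_rng(0)
dark = rng.integers(0, 80, size=(16, 16))
bright = rng.integers(170, 255, size=(16, 16))

trb = [(region_histogram(r), 0) for r in hierarchical_regions(dark)]
trb += [(region_histogram(r), 1) for r in hierarchical_regions(bright)]

# Nearest-centroid classifier over the histogram descriptors.
centroids = {c: np.mean([h for h, lab in trb if lab == c], axis=0)
             for c in (0, 1)}

def predict(region):
    h = region_histogram(region)
    return min(centroids, key=lambda c: np.linalg.norm(h - centroids[c]))

query = rng.integers(180, 255, size=(8, 8))  # a bright query region
print(predict(query))  # expected: class 1
```

    In the paper's full framework, the predicted regions of the query region bank would then be optimally merged into a per-pixel labeling; the sketch stops at per-region prediction.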

    Multi-modal Video Content Understanding

    Video is an important format of information. Humans use videos for a variety of purposes such as entertainment, education, communication, information sharing, and capturing memories. To date, humankind has accumulated a colossal amount of freely available video material online. Manual processing at this scale is simply impossible, so many research efforts have been dedicated to the automatic processing of video content. At the same time, human perception of the world is multi-modal: a human uses multiple senses to understand the environment, objects, and their interactions. When watching a video, we perceive the content via both the audio and visual modalities, and removing one of them results in a less immersive experience. Similarly, if the information in the two modalities does not correspond, it may create a sense of dissonance. Therefore, joint modelling of multiple modalities (such as audio, visual, and text) within one model is an active research area. In the last decade, the fields of automatic video understanding and multi-modal modelling have seen exceptional progress due to the ubiquitous success of deep learning models and, more recently, transformer-based architectures in particular. Our work draws on these advances and pushes the state of the art of multi-modal video understanding forward. Applications of automatic multi-modal video processing are broad and exciting! For instance, a content-based textual description of a video (video captioning) may allow a visually or auditory impaired person to understand the content and, thus, engage in richer social interactions. However, prior work in video content description relies on the visual input alone, missing vital information only available in the audio stream. To this end, we proposed two novel multi-modal transformer models that encode audio and visual interactions simultaneously. 
More specifically, we first introduced a late-fusion multi-modal transformer that is highly modular and allows the processing of an arbitrary set of modalities. Second, an efficient bi-modal transformer was presented that encodes audio-visual cues starting from the lower network layers, allowing richer audio-visual features and, as a result, stronger performance. Another application is automatic visually guided sound generation, which might help professional sound (foley) designers who spend hours searching a database for relevant audio for a movie scene. Previous approaches to automatic conditional audio generation support only one class (e.g. “dog barking”), while real-life applications may require generation for hundreds of classes, and training one model per class can be infeasible. To bridge this gap, we introduced a novel two-stage model that first efficiently encodes audio as a set of codebook vectors (i.e. learns to make “building blocks”) and then learns to sample these audio vectors given visual inputs to produce an audio track relevant to that visual input. Moreover, we studied the automatic evaluation of conditional audio generation models and proposed metrics that measure both the quality and the relevance of the generated samples. Finally, as video editing becomes more common among non-professionals due to the increased popularity of such services as YouTube, automatic assistance during video editing, e.g. off-sync detection between the audio and visual tracks, grows in demand. Prior work in audio-visual synchronization was devoted to solving the task on lip-syncing datasets with “dense” signals, such as interviews and presentations. In such videos, synchronization cues occur “densely” across time, and processing just a few tenths of a second is enough to synchronize the tracks. In contrast, open-domain videos mostly have only “sparse” cues that occur just once in a seconds-long video clip (e.g. “chopping wood”). 
To address this, we (a) proposed a novel dataset with “sparse” sounds and (b) designed a model that efficiently encodes seconds-long audio-visual tracks into a small set of “learnable selectors” that is then used for synchronization. In addition, we explored the temporal artefacts that common audio and video compression algorithms leave in data streams and, to prevent a model from learning to rely on these artefacts, introduced a list of recommendations on how to mitigate them. This thesis provides the details of the proposed methodologies as well as a comprehensive overview of advances in the relevant fields of multi-modal video understanding. In addition, we discuss potential research directions that could bring significant contributions to the field.
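    The audio-visual interaction encoding at the heart of such bi-modal transformers comes down to cross-attention between the two streams. A minimal NumPy sketch follows; the dimensions, pooling, and fusion scheme are illustrative assumptions, not the thesis architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query position aggregates the
    # other modality's values, weighted by query-key similarity.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

rng = np.random.default_rng(0)
d_model = 16
audio = rng.normal(size=(10, d_model))   # 10 audio time steps
visual = rng.normal(size=(6, d_model))   # 6 video frames

# Bi-modal encoding: audio attends to visual cues and vice versa, so
# each stream's representation is conditioned on the other modality.
audio_av = cross_attention(audio, visual, visual)
visual_va = cross_attention(visual, audio, audio)

# Late fusion: pool each conditioned stream over time and concatenate
# into a single clip-level representation.
fused = np.concatenate([audio_av.mean(axis=0), visual_va.mean(axis=0)])
print(fused.shape)  # (32,)
```

    Starting this exchange from the lower layers, rather than only at the final fusion step, is what the abstract credits for the richer audio-visual features.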

    Multifeature analysis and semantic context learning for image classification

    This article introduces an image classification approach in which the semantic context of images and multiple low-level visual features are jointly exploited. The context consists of a set of semantic terms defining the classes to be associated with unclassified images. Initially, a multiobjective optimization technique is used to define a multifeature fusion model for each semantic class. Then, a Bayesian learning procedure is applied to derive a context model representing relationships among semantic classes. Finally, this ..

    Computer-Aided Cancer Diagnosis and Grading via Sparse Directional Image Representations

    Prostate cancer and breast cancer are the second leading causes of cancer death in males and females, respectively. If not diagnosed, prostate and breast cancers can spread and metastasize to other organs and bones, making treatment impossible. Hence, early diagnosis of cancer is vital for patient survival. Histopathological evaluation of tissue is used for cancer diagnosis: the tissue is taken during a biopsy and stained using hematoxylin and eosin (H&E) stain, and a pathologist then looks for abnormal changes in the tissue to diagnose and grade the cancer. This process can be time-consuming and subjective. A reliable and repeatable automatic cancer diagnosis method can greatly reduce the time while producing more reliable results. The scope of this dissertation is developing computer vision and machine learning algorithms for automatic cancer diagnosis and grading with accuracy acceptable to expert pathologists. Automatic image classification relies on feature representation methods. In this dissertation we developed methods utilizing sparse directional multiscale transforms, specifically the shearlet transform, for medical image analysis. We particularly designed these computer-vision-based algorithms and methods to work with H&E images and MRI images. Traditional signal processing methods (e.g. the Fourier transform, the wavelet transform) are not suitable for detecting carcinoma cells due to their lack of directional sensitivity. The shearlet transform, however, has inherent directional sensitivity and a multiscale framework that enable it to detect different edges in tissue images. We developed techniques for extracting holistic and local texture features from the histological and MRI images using histograms and co-occurrences of shearlet coefficients, respectively. 
We then combined these features with color and morphological features using the multiple kernel learning (MKL) algorithm and employed support vector machines (SVM) with MKL to classify the medical images. We further investigated the impact of deep neural networks in representing the medical images for cancer detection. The aforementioned engineered features have a few limitations: they lack generalizability due to being tailored to the specific texture and structure of the tissues, they are time-consuming and expensive to compute and need preprocessing, and it is sometimes difficult to extract discriminative features from the images. Feature learning techniques, on the other hand, use multiple processing layers and learn feature representations directly from the data. To address these issues, we developed a deep neural network containing multiple convolution, max-pooling, and fully connected layers, trained on the Red, Green, and Blue (RGB) images along with the magnitude and phase of the shearlet coefficients. We then developed a weighted decision fusion deep neural network that assigns weights to the output probabilities and updates those weights via backpropagation. The final decision was a weighted sum of the decisions from the RGB network and the magnitude and phase shearlet networks. We used the trained networks for the classification of benign and malignant H&E images and for Gleason grading. Our experimental results show that our proposed methods, based on both feature engineering and feature learning, outperform the state of the art and are even near perfect (100%) on some databases in terms of classification accuracy, sensitivity, specificity, F1 score, and area under the curve (AUC), and hence are promising computer-based methods for cancer diagnosis and grading using images.
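    The weighted decision fusion step can be sketched in isolation: softmax-normalized fusion weights over the per-stream class probabilities, updated by backpropagating a cross-entropy loss. The stream probabilities below are made-up numbers for illustration, not results from the dissertation's networks.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-stream class probabilities (benign, malignant) for
# one H&E image from the three networks being fused.
streams = np.array([
    [0.30, 0.70],   # RGB network
    [0.20, 0.80],   # shearlet-magnitude network
    [0.60, 0.40],   # shearlet-phase network (wrong on this image)
])

logits = np.zeros(3)              # fusion weights, learned via backprop
target = np.array([0.0, 1.0])     # ground truth: malignant

for _ in range(100):
    w = softmax(logits)
    fused = w @ streams                       # weighted sum of decisions
    loss_grad = -(target / fused)             # d(cross-entropy)/d(fused)
    grad_w = streams @ loss_grad              # chain rule to the weights
    grad_logits = w * (grad_w - w @ grad_w)   # softmax Jacobian
    logits -= 0.5 * grad_logits

w = softmax(logits)
print(np.round(w, 2))  # the less reliable phase stream is down-weighted
```

    In the actual network, these weights are trained jointly with the streams over the whole training set rather than fit to a single example.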

    Tree-structured CRF Models for Interactive Image Labeling

    We propose structured prediction models for image labeling that explicitly take into account the dependencies among image labels. In our tree-structured models, image labels are nodes, and edges encode dependency relations. To allow for more complex dependencies, we combine labels in a single node and use mixtures of trees. Our models are more expressive than independent predictors and lead to more accurate label predictions. The gain becomes more significant in an interactive scenario where a user provides the values of some of the image labels at test time. Such an interactive scenario offers an interesting trade-off between label accuracy and manual labeling effort. The structured models are used to decide which labels should be set by the user and to transfer the user input into more accurate predictions on the other image labels. We also apply our models to attribute-based image classification, where the attribute predictions for a test image are mapped to class probabilities by means of a given attribute-class mapping. Experimental results on three publicly available benchmark data sets show that in all scenarios our structured models lead to more accurate predictions and leverage user input much more effectively than state-of-the-art independent models.
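    As a toy illustration of the idea, consider a tiny tree of binary labels where exact marginals can be computed and user-provided evidence propagates to the remaining labels. The label names and potentials below are invented for the example; the paper's models use learned potentials, combined labels, and mixtures of trees.

```python
import itertools
import numpy as np

# Toy tree-structured CRF over three binary image labels:
# "outdoor" -- "sky" -- "sunset". Edges encode label dependencies.
unary = {                        # per-label potentials phi(y)
    "outdoor": np.array([1.0, 2.0]),
    "sky":     np.array([1.0, 1.5]),
    "sunset":  np.array([2.0, 1.0]),
}
edges = [("outdoor", "sky"), ("sky", "sunset")]
pairwise = np.array([[2.0, 1.0],   # psi(y_i, y_j): agreement scores higher
                     [1.0, 2.0]])

def marginals(evidence=None):
    # Exact inference by enumeration (fine for a tiny model; sum-product
    # belief propagation gives the same result in linear time on a tree).
    evidence = evidence or {}
    names = list(unary)
    scores = {n: np.zeros(2) for n in names}
    z = 0.0
    for assign in itertools.product([0, 1], repeat=len(names)):
        y = dict(zip(names, assign))
        if any(y[k] != v for k, v in evidence.items()):
            continue
        p = np.prod([unary[n][y[n]] for n in names])
        p *= np.prod([pairwise[y[a], y[b]] for a, b in edges])
        z += p
        for n in names:
            scores[n][y[n]] += p
    return {n: s / z for n, s in scores.items()}

free = marginals()
# Interactive scenario: the user asserts "sunset" is present; the tree
# propagates that evidence and raises belief in the connected labels.
fixed = marginals(evidence={"sunset": 1})
print(free["sky"][1], fixed["sky"][1])
```

    This conditioning step is what lets the model turn a few user-set labels into more accurate predictions on the rest.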

    Per-exemplar analysis with MFoM fusion learning for multimedia retrieval and recounting

    As a large volume of digital video data becomes available, along with revolutionary advances in multimedia technologies, the demand for efficiently retrieving and recounting multimedia data has grown. However, the inherent complexity of representing and recognizing multimedia data, especially for large-scale and unconstrained consumer videos, poses significant challenges. In particular, the following challenges are the major concerns of the proposed research. One challenge is that consumer-video data (e.g., videos on YouTube) are mostly unstructured; therefore, evidence for a targeted semantic category is often sparsely located across time. To address this issue, a segmental multi-way local feature pooling method using scene concept analysis is proposed. In particular, the proposed method utilizes scene concepts that are pre-constructed by clustering video segments into categories in an unsupervised manner. A video is then represented with multiple feature descriptors with respect to the scene concepts. Finally, multiple kernels are constructed from the feature descriptors and combined into a final kernel that improves the discriminative power for multimedia event detection. Another challenge is that most semantic categories used for multimedia retrieval have inherent within-class diversity that can be dramatic, raising the question of whether conventional approaches are still successful and scalable. To handle such huge variability and further improve recounting capabilities, a per-exemplar learning scheme is proposed, with a focus on fusing multiple types of heterogeneous features for video retrieval. While the conventional approach for multimedia retrieval involves learning a single classifier per category, the proposed scheme learns multiple detection models, one for each training exemplar. In particular, a local distance function is defined as a linear combination of the element distances measured by each feature. 
A weight vector for the local distance function is then learned with a discriminative method that takes only the samples neighboring an exemplar as training samples. In this way, the retrieval problem is redefined as an association problem, i.e., test samples are retrieved by association-based rules. In addition, the quality of a multimedia-retrieval system is often evaluated by domain-specific performance metrics that serve sophisticated user needs. To address such criteria, novel MFoM learning algorithms were proposed that explicitly optimize two challenging metrics: average precision (AP) and a weighted sum of the probabilities of false alarms and missed detections at a target error ratio. Most conventional learning schemes attempt to optimize their own learning criteria, as opposed to domain-specific performance measures. To address this discrepancy, the proposed learning scheme approximates the given performance measure, which is discrete and therefore difficult to handle with conventional optimization schemes, with a continuous and differentiable loss function that can be directly optimized. A generalized probabilistic descent (GPD) algorithm is then applied to optimize this loss function.
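    The local distance function can be sketched as a non-negative weight vector over per-feature distances, learned with a margin-style update so that same-category neighbors of the exemplar score small combined distances and other samples score large ones. The feature names, data, and update rule here are illustrative assumptions, not the dissertation's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-feature distances from one exemplar video to its
# neighbors: column 0 an "appearance" distance, column 1 a "motion"
# distance. Here appearance is discriminative and motion is noise.
pos = np.column_stack([
    np.abs(rng.normal(0.1, 0.05, 20)),   # same category: close in appearance
    np.abs(rng.normal(0.2, 0.10, 20)),
])
neg = np.column_stack([
    np.abs(rng.normal(0.6, 0.05, 20)),   # other categories: far in appearance
    np.abs(rng.normal(0.2, 0.10, 20)),
])

# Learn non-negative weights w so the combined distance w @ d falls
# below 1 for same-category neighbors and above 1 for the rest.
w = np.ones(2)
for _ in range(200):
    for d, sign in [(pos, -1.0), (neg, +1.0)]:
        violated = sign * (d @ w - 1.0) < 0      # margin not yet satisfied
        w = np.maximum(w + 0.01 * sign * d[violated].sum(axis=0), 0.0)

print(np.round(w, 2))  # the discriminative appearance distance dominates
```

    A test sample is then associated with the exemplar (retrieved) when its learned combined distance is small, which is the association-based rule the abstract refers to.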
