Redundant Multi-Modal Integration of Machine Vision and Programmable Mechanical Manipulation for Scene Segmentation
The main idea in this paper is that one cannot discern the part-whole relationship of three-dimensional objects in a passive mode without a great deal of a priori information. Perceptual activity is exploratory, probing, and searching. Physical scene segmentation is the first step in active perception. The task of perception is greatly simplified if one has to deal with only one object at a time. This work adapts the non-deterministic Turing machine model and develops strategies to control the interaction between sensors and actions for physical segmentation. Scene segmentation is formulated in graph-theoretic terms as a graph generation/decomposition problem. The isomorphism between manipulation actions and graph decomposition operations is defined. The non-contact sensors generate the directed graphs representing the spatial relations among surface regions. The manipulator decomposes these graphs under contact-sensor supervision. Assuming a finite number of sensors and actions and a goal state that is reachable and measurable with the available sensors, the control strategies converge. This was experimentally verified in a real, noisy, and dynamic environment.
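The graph view above can be sketched concretely. In this illustrative encoding (the names and relations are assumptions, not the paper's actual data structures), surface regions are nodes, directed edges are spatial relations such as "rests on", and a manipulation action that physically removes an object corresponds to deleting its node and incident edges; the remaining connected components are the segmented objects.

```python
# Sketch: scene segmentation as graph decomposition (hypothetical encoding).
# Nodes are surface regions; directed edges are spatial relations
# ("wedge rests on cube"). Removing a region via a manipulation action
# deletes its node and incident edges; the resulting components are objects.

def connected_components(edges, nodes):
    """Weakly connected components of a directed graph."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def remove_region(edges, nodes, region):
    """Graph analogue of physically removing one object from the scene."""
    nodes = [n for n in nodes if n != region]
    edges = [(a, b) for a, b in edges if region not in (a, b)]
    return edges, nodes

nodes = ["cube", "wedge", "ball"]
edges = [("wedge", "cube")]            # wedge rests on cube; ball is separate
edges, nodes = remove_region(edges, nodes, "wedge")
print(connected_components(edges, nodes))  # [{'cube'}, {'ball'}]
```

The convergence claim then amounts to the action sequence monotonically reducing the graph until every component is a single, verifiable object.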
Mix and Localize: Localizing Sound Sources in Mixtures
We present a method for simultaneously localizing multiple sound sources
within a visual scene. This task requires a model to both group a sound mixture
into individual sources, and to associate them with a visual signal. Our method
jointly solves both tasks at once, using a formulation inspired by the
contrastive random walk of Jabri et al. We create a graph in which images and
separated sounds correspond to nodes, and train a random walker to transition
between nodes from different modalities with high return probability. The
transition probabilities for this walk are determined by an audio-visual
similarity metric that is learned by our model. We show through experiments
with musical instruments and human speech that our model can successfully
localize multiple sounds, outperforming other self-supervised methods. Project
site: https://hxixixh.github.io/mix-and-localize
Comment: CVPR 202
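The return-probability objective above can be sketched numerically. This is a minimal sketch of the contrastive-random-walk idea (after Jabri et al.), assuming a learned audio-visual similarity matrix, which is replaced here by random logits; the model's actual encoders and loss details are not reproduced.

```python
import numpy as np

# Sketch of the contrastive random-walk objective: rows index image nodes,
# columns index separated-sound nodes. The walker steps image -> sound ->
# image; training would maximize the probability of returning to the start.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 4))            # similarity logits (assumed learned)

p_img_to_snd = softmax(sim, axis=1)      # transitions image -> sound
p_snd_to_img = softmax(sim.T, axis=1)    # transitions sound -> image
round_trip = p_img_to_snd @ p_snd_to_img # image -> sound -> image

# Loss: cross-entropy against the identity (reward high return probability).
loss = -np.log(np.diag(round_trip)).mean()
print(round_trip.shape, float(loss) > 0)
```

Because the similarity metric parameterizes both transition matrices, minimizing this loss forces each sound to align with its own image node, which yields localization for free.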
Semantically Informed Multiview Surface Refinement
We present a method to jointly refine the geometry and semantic segmentation
of 3D surface meshes. Our method alternates between updating the shape and the
semantic labels. In the geometry refinement step, the mesh is deformed with
variational energy minimization, such that it simultaneously maximizes
photo-consistency and the compatibility of the semantic segmentations across a
set of calibrated images. Label-specific shape priors account for interactions
between the geometry and the semantic labels in 3D. In the semantic
segmentation step, the labels on the mesh are updated with MRF inference, such
that they are compatible with the semantic segmentations in the input images.
Also, this step includes prior assumptions about the surface shape of different
semantic classes. The priors induce a tight coupling, where semantic
information influences the shape update and vice versa. Specifically, we
introduce priors that favor (i) adaptive smoothing, depending on the class
label; (ii) straightness of class boundaries; and (iii) semantic labels that
are consistent with the surface orientation. The novel mesh-based
reconstruction is evaluated in a series of experiments with real and synthetic
data. We compare both to state-of-the-art, voxel-based semantic 3D
reconstruction, and to purely geometric mesh refinement, and demonstrate that
the proposed scheme yields improved 3D geometry as well as an improved semantic
segmentation.
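The alternation described above can be illustrated on a toy problem. In this sketch a 1-D "mesh" of heights stands in for the surface, and the class-dependent smoothing weight plays the role of the adaptive-smoothing prior (i); all thresholds and weights are illustrative assumptions, not the paper's energy terms.

```python
import numpy as np

# Toy sketch of the alternating scheme: geometry step = class-dependent
# Laplacian smoothing (label 0 = smooth surface, label 1 = sharp feature);
# label step = reassign each vertex to the class that best explains its
# local curvature. Values are illustrative, not the paper's priors.

def geometry_step(h, labels, lam=(0.5, 0.05)):
    h = h.copy()
    for i in range(1, len(h) - 1):
        l = lam[labels[i]]                      # smoothing weight per class
        h[i] += l * (0.5 * (h[i - 1] + h[i + 1]) - h[i])
    return h

def label_step(h, thresh=0.2):
    curv = np.abs(np.convolve(h, [1, -2, 1], mode="same"))
    return (curv > thresh).astype(int)          # high curvature -> "sharp"

h = np.array([0.0, 0.1, 0.0, 1.0, 0.0, 0.1, 0.0])
labels = np.zeros(len(h), dtype=int)
for _ in range(3):                              # alternate the two steps
    h = geometry_step(h, labels)
    labels = label_step(h)
print(labels)
```

The coupling is visible even in this toy: the spike region keeps its geometry because its label exempts it from strong smoothing, while the smoothed geometry in turn stabilizes the labels.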
Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation
Video topic segmentation unveils the coarse-grained semantic structure
underlying videos and is essential for other video understanding tasks. Given
the recent surge in multi-modal content, relying solely on a single modality is
arguably insufficient. On the other hand, prior solutions for similar tasks
like video scene/shot segmentation cater to short videos with clear visual
shifts but falter for long videos with subtle changes, such as livestreams. In
this paper, we introduce a multi-modal video topic segmenter that utilizes both
video transcripts and frames, bolstered by a cross-modal attention mechanism.
Furthermore, we propose a dual-contrastive learning framework adhering to the
unsupervised domain adaptation paradigm, enhancing our model's adaptability to
longer, more semantically complex videos. Experiments on short and long video
corpora demonstrate that our proposed solution significantly surpasses
baseline methods in terms of both accuracy and transferability, in both intra-
and cross-domain settings.
Comment: Accepted at the 30th International Conference on Multimedia Modeling
(MMM 2024)
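The cross-modal attention mechanism mentioned above can be sketched with plain scaled dot-product attention. This assumes sentence (transcript) and frame embeddings are already extracted; the paper's actual fusion architecture is not specified here, so the shapes and names are illustrative.

```python
import numpy as np

# Minimal sketch of cross-modal attention for topic segmentation:
# transcript sentences act as queries attending over sampled video frames,
# producing frame-informed text features for the segmenter.

def cross_attention(q_text, k_frames, v_frames):
    d = q_text.shape[-1]
    scores = q_text @ k_frames.T / np.sqrt(d)       # (sentences, frames)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # softmax over frames
    return w @ v_frames                             # fused text features

rng = np.random.default_rng(1)
text = rng.normal(size=(5, 64))      # 5 transcript sentences, dim 64
frames = rng.normal(size=(12, 64))   # 12 sampled video frames, dim 64
fused = cross_attention(text, frames, frames)
print(fused.shape)  # (5, 64)
```

Topic boundaries would then be predicted from the fused per-sentence features, letting subtle visual cues disambiguate transcripts in long videos such as livestreams.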
DRINet for medical image segmentation
Convolutional neural networks (CNNs) have revolutionized medical image analysis over the past few years. The U-Net architecture is one of the most well-known CNN architectures for semantic segmentation and has achieved remarkable successes in many different medical image segmentation applications. The U-Net architecture consists of standard convolution layers, pooling layers, and upsampling layers. These convolution layers learn representative features of input images and construct segmentations based on the features. However, the features learned by standard convolution layers are not distinctive when the differences among different categories are subtle in terms of intensity, location, shape, and size. In this paper, we propose a novel CNN architecture, called Dense-Res-Inception Net (DRINet), which addresses this challenging problem. The proposed DRINet consists of three blocks, namely a convolutional block with dense connections, a deconvolutional block with residual Inception modules, and an unpooling block. Our proposed architecture outperforms the U-Net in three different challenging applications, namely multi-class segmentation of cerebrospinal fluid (CSF) on brain CT images, multi-organ segmentation on abdominal CT images, and multi-class brain tumour segmentation on MR images.
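The dense connectivity behind DRINet's convolutional block can be sketched in a few lines. For brevity this uses plain matrix multiplies in place of convolutions, and the layer count and growth rate are assumptions, not DRINet's actual hyperparameters; the point is only the concatenation pattern.

```python
import numpy as np

# Sketch of dense connectivity: each layer sees the concatenation of all
# preceding feature maps, and the channel count grows by `growth` per layer.
# Matrix multiplies stand in for convolutions.

def dense_block(x, n_layers=3, growth=4, rng=np.random.default_rng(2)):
    feats = [x]
    for _ in range(n_layers):
        inp = np.concatenate(feats, axis=-1)        # dense connection
        w = rng.normal(size=(inp.shape[-1], growth))
        feats.append(np.maximum(inp @ w, 0))        # "conv" + ReLU
    return np.concatenate(feats, axis=-1)

x = np.zeros((10, 8))       # 10 "pixels", 8 input channels
out = dense_block(x)
print(out.shape)            # (10, 8 + 3 * 4) = (10, 20)
```

Reusing all earlier features in every layer is what lets the network pick up the subtle intensity and shape differences that standard convolution stacks wash out.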
Histopathological image analysis : a review
Over the past decade, dramatic increases in computational power and improvements in image analysis algorithms have allowed the development of powerful computer-assisted analytical approaches to radiological data. With the recent advent of whole-slide digital scanners, tissue histopathology slides can now be digitized and stored in digital image form. Consequently, digitized tissue histopathology has become amenable to the application of computerized image analysis and machine learning techniques. Analogous to the role of computer-assisted diagnosis (CAD) algorithms in medical imaging to complement the opinion of a radiologist, CAD algorithms have begun to be developed for disease detection, diagnosis, and prognosis prediction to complement the opinion of the pathologist. In this paper, we review the state-of-the-art CAD technology for digitized histopathology. This paper also briefly describes the development and application of novel image analysis technology for a few specific histopathology-related problems being pursued in the United States and Europe.