Project RISE: Recognizing Industrial Smoke Emissions
Industrial smoke emissions pose a significant concern to human health. Prior
works have shown that using Computer Vision (CV) techniques to identify smoke
as visual evidence can influence the attitude of regulators and empower
citizens to pursue environmental justice. However, existing datasets are not of
sufficient quality nor quantity to train the robust CV models needed to support
air quality advocacy. We introduce RISE, the first large-scale video dataset
for Recognizing Industrial Smoke Emissions. We adopted a citizen science
approach to collaborate with local community members to annotate whether a
video clip has smoke emissions. Our dataset contains 12,567 clips from 19
distinct views from cameras that monitored three industrial facilities. These
daytime clips span 30 days over two years, including all four seasons. We ran
experiments using deep neural networks to establish a strong performance
baseline and reveal smoke recognition challenges. Our survey study discussed
community feedback, and our data analysis displayed opportunities for
integrating citizen scientists and crowd workers into the application of
Artificial Intelligence for social good.
Comment: Technical report
Surgical Phase Recognition of Short Video Shots Based on Temporal Modeling of Deep Features
Recognizing the phases of a laparoscopic surgery (LS) operation from its
video constitutes a fundamental step for efficient content representation,
indexing and retrieval in surgical video databases. In the literature, most
techniques focus on phase segmentation of the entire LS video using
hand-crafted visual features, instrument usage signals, and recently
convolutional neural networks (CNNs). In this paper we address the problem of
phase recognition of short video shots (10s) of the operation, without
utilizing information about the preceding/forthcoming video frames, their phase
labels or the instruments used. We investigate four state-of-the-art CNN
architectures (AlexNet, VGG19, GoogLeNet, and ResNet101), for feature
extraction via transfer learning. Visual saliency was employed for selecting
the most informative region of the image as input to the CNN. Video shot
representation was based on two temporal pooling mechanisms. Most importantly,
we investigate the role of 'elapsed time' (from the beginning of the
operation), and we show that inclusion of this feature can increase performance
dramatically (from 69% to 75% mean accuracy). Finally, a long short-term memory
(LSTM) network was trained for video shot classification based on the fusion of
CNN features with 'elapsed time', increasing the accuracy to 86%. Our results
highlight the prominent role of visual saliency, long-range temporal recursion
and 'elapsed time' (a feature so far ignored), for surgical phase recognition.Comment: 6 pages, 4 figures, 6 table
Recurrent 3D Pose Sequence Machines
3D human articulated pose recovery from monocular image sequences is very
challenging due to diverse appearances, viewpoints, and occlusions, and
because 3D pose is inherently ambiguous given monocular imagery alone. It is
thus critical to exploit rich spatial and temporal long-range dependencies
among body joints for accurate 3D pose sequence prediction. Existing approaches
usually manually design some elaborate prior terms and human body kinematic
constraints for capturing structures, which are often insufficient to exploit
all intrinsic structures and not scalable for all scenarios. In contrast, this
paper presents a Recurrent 3D Pose Sequence Machine (RPSM) to automatically
learn the image-dependent structural constraint and sequence-dependent temporal
context by using a multi-stage sequential refinement. At each stage, our RPSM
is composed of three modules to predict the 3D pose sequences based on the
previously learned 2D pose representations and 3D poses: (i) a 2D pose module
extracting the image-dependent pose representations, (ii) a 3D pose recurrent
module regressing 3D poses and (iii) a feature adaption module serving as a
bridge between modules (i) and (ii) to enable the representation transformation
from 2D to 3D domain. These three modules are then assembled into a sequential
prediction framework to refine the predicted poses with multiple recurrent
stages. Extensive evaluations on the Human3.6M dataset and HumanEva-I dataset
show that our RPSM outperforms all state-of-the-art approaches for 3D pose
estimation.
Comment: Published in CVPR 201
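The multi-stage sequential refinement described above can be sketched schematically: at each stage, image-dependent 2D pose features and the previous 3D estimate are fused into an updated 3D pose. The linear maps below are hypothetical stand-ins for the learned 2D-pose, feature-adaption, and 3D-recurrent modules, not the paper's trained networks:

```python
import numpy as np

def refine_pose(pose_2d: np.ndarray, n_joints: int = 17,
                n_stages: int = 3) -> np.ndarray:
    """Multi-stage refinement in the spirit of RPSM: each stage combines
    the previous 3D estimate with 2D pose features (assumed 17 joints)."""
    rng = np.random.default_rng(42)
    # Hypothetical fixed linear projections standing in for learned modules.
    W_feat = rng.standard_normal((n_joints * 3, n_joints * 2)) * 0.01
    W_prev = np.eye(n_joints * 3)
    pose_3d = np.zeros(n_joints * 3)          # stage-0 initial estimate
    for _ in range(n_stages):
        # Fuse image-dependent features with the previous 3D prediction,
        # mirroring the sequential prediction framework described above.
        pose_3d = W_prev @ pose_3d + W_feat @ pose_2d.ravel()
    return pose_3d.reshape(n_joints, 3)

pose_2d = np.random.default_rng(1).standard_normal((17, 2))
pose_3d = refine_pose(pose_2d)
print(pose_3d.shape)  # (17, 3)
```

The point of the sketch is the control flow: the 3D estimate is revisited at every stage rather than predicted once, which is what lets later stages correct earlier structural errors.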
UbiEar: Bringing location-independent sound awareness to the hard-of-hearing people with smartphones
Non-speech sound awareness is important for improving the quality of life of deaf and hard-of-hearing (DHH) people. DHH people, especially the young, are not always satisfied with their hearing aids. According to interviews with 60 young hard-of-hearing students, a ubiquitous sound-awareness tool for emergency and social events that works in diverse environments is desired. In this paper, we design UbiEar, a smartphone-based acoustic event sensing and notification system. The core techniques in UbiEar are a light-weight deep convolutional neural network that enables location-independent acoustic event recognition on commodity smartphones, and a set of mechanisms for prompt and energy-efficient acoustic sensing. We conducted both controlled experiments and user studies with 86 DHH students and showed that UbiEar can assist young DHH students in becoming aware of important acoustic events in their daily life.
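A light-weight CNN for acoustic event recognition of the kind UbiEar describes can be illustrated with a toy forward pass: one small convolution over a log-mel spectrogram, global average pooling, and a linear classifier. All shapes, filter counts, and weights here are assumptions for illustration; UbiEar's actual architecture is not reproduced:

```python
import numpy as np

def conv2d_valid(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Naive single-channel 'valid' 2-D convolution, for illustration only."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def classify_event(spectrogram, kernels, W_out) -> int:
    """Tiny CNN sketch: conv + ReLU per filter, global average pooling,
    then a linear layer scoring each acoustic event class."""
    feats = np.array([np.maximum(conv2d_valid(spectrogram, k), 0).mean()
                      for k in kernels])      # global average pooling
    logits = W_out @ feats
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 100))     # assumed log-mel spectrogram input
kernels = rng.standard_normal((8, 3, 3))  # 8 hypothetical 3x3 filters
W_out = rng.standard_normal((5, 8))       # 5 hypothetical event classes
print(classify_event(spec, kernels, W_out))
```

Keeping the filter bank small and pooling globally is the kind of design that keeps per-inference cost low enough for commodity smartphones.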
Deep Convolution and Correlated Manifold Embedded Distribution Alignment for Forest Fire Smoke Prediction
This paper proposes the deep convolution and correlated manifold embedded distribution alignment (DC-CMEDA) model, which enables transfer-learning classification across various small datasets and greatly shortens training time. First, a pre-trained ResNet50 network is used for feature transfer to extract smoke features, because training on a small forest-fire smoke dataset is difficult; second, correlated manifold embedded distribution alignment (CMEDA) is proposed to register the smoke features so that the input feature distributions of the source and target domains are aligned; and finally, a trainable network model is constructed. The model is evaluated on satellite remote-sensing and video image datasets. Compared with the deep convolutional integrated long short-term memory (DC-ILSTM) network, DC-CMEDA improves accuracy by 1.50% on video images and by 4.00% on satellite remote-sensing images. Compared with the ILSTM algorithm, CMEDA converges in 10 or fewer iterations and has lower algorithmic complexity, giving DC-CMEDA a clear advantage in convergence speed. The experimental results show that DC-CMEDA can solve the problem of detecting and recognizing smoke from small sample datasets.
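CMEDA itself is the paper's alignment algorithm and is not reproduced here; as a rough illustration of what "aligning the input feature distributions of the source and target domains" means, the sketch below uses a generic correlation-alignment (CORAL-style) transform: whiten the source features, then re-color them with the target covariance. Feature dimensions and data are invented for the example:

```python
import numpy as np

def coral_align(Xs: np.ndarray, Xt: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Map source features Xs toward the target distribution of Xt by
    matching second-order statistics (a CORAL-style stand-in, not CMEDA)."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])

    def mat_pow(C, p):
        # Symmetric matrix power via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return (V * np.power(w, p)) @ V.T

    # Whiten with Cs^{-1/2}, re-color with Ct^{1/2}, shift to target mean.
    return (Xs - Xs.mean(0)) @ mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5) + Xt.mean(0)

rng = np.random.default_rng(0)
Xs = rng.standard_normal((200, 16)) * 2.0 + 1.0   # hypothetical source features
Xt = rng.standard_normal((300, 16)) * 0.5 - 1.0   # hypothetical target features
Xs_aligned = coral_align(Xs, Xt)
```

After alignment the transformed source features share the target's mean and (approximately) its covariance, so a classifier trained on them transfers more readily to target-domain inputs.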
Gas Detection and Identification Using Multimodal Artificial Intelligence Based Sensor Fusion
With the rapid industrialization and technological advancements, innovative
engineering technologies which are cost effective, faster and easier to
implement are essential. One such area of concern is the rising number of
accidents happening due to gas leaks at coal mines, chemical industries, home
appliances etc. In this paper we propose a novel approach to detect and
identify the gaseous emissions using the multimodal AI fusion techniques. Most
of the gases and their fumes are colorless, odorless, and tasteless, thereby
challenging our normal human senses. Sensing based on a single sensor may not
be accurate, and sensor fusion is essential for robust and reliable detection
in several real-world applications. We manually collected 6400 gas samples
(1600 samples per class for four classes) using two specific sensors: a
7-semiconductor gas sensor array and a thermal camera. The early fusion
method of multimodal AI is applied. The network architecture consists of a
feature extraction module for each modality, whose outputs are fused using a
merged layer followed by a dense layer, which provides a single output for
identifying the gas. We obtained the testing accuracy of 96% (for fused model)
as opposed to individual model accuracies of 82% (based on Gas Sensor data
using LSTM) and 93% (based on thermal images data using CNN model). Results
demonstrate that the fusion of multiple sensors and modalities outperforms the
outcome of a single sensor.
Comment: 14 pages, 9 figures
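The early-fusion architecture described above (a per-modality feature extractor, a merged layer, then a dense layer giving one gas prediction) can be sketched as follows. The hand-crafted extractors below are simplified stand-ins for the paper's LSTM and CNN branches, and all shapes and weights are assumptions:

```python
import numpy as np

def extract_sensor_feats(readings: np.ndarray) -> np.ndarray:
    """Stand-in for the LSTM branch: summary statistics over the
    7-channel gas-sensor time series (14-D output)."""
    return np.concatenate([readings.mean(0), readings.std(0)])

def extract_thermal_feats(image: np.ndarray) -> np.ndarray:
    """Stand-in for the CNN branch: coarse 4x4 average pooling of the
    thermal frame (16-D output)."""
    h, w = image.shape[0] // 4, image.shape[1] // 4
    return image[:4 * h, :4 * w].reshape(4, h, 4, w).mean(axis=(1, 3)).ravel()

def early_fusion_predict(readings, image, W, b) -> int:
    """Early fusion: concatenate both modality features into one merged
    vector, then apply a single dense layer scoring each gas class."""
    fused = np.concatenate([extract_sensor_feats(readings),
                            extract_thermal_feats(image)])   # 30-D
    return int(np.argmax(W @ fused + b))

rng = np.random.default_rng(0)
readings = rng.standard_normal((100, 7))  # 100 timesteps, 7 gas sensors
image = rng.standard_normal((32, 32))     # assumed thermal frame size
W, b = rng.standard_normal((4, 30)), np.zeros(4)  # 4 gas classes
print(early_fusion_predict(readings, image, W, b))
```

Concatenating before the dense layer (rather than averaging per-branch predictions) is what lets the classifier learn cross-modal interactions, which is where the reported fused-model gain over either single modality comes from.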