Cascaded Boundary Regression for Temporal Action Detection
Temporal action detection in long videos is an important problem.
State-of-the-art methods address this problem by applying action classifiers on
sliding windows. Although sliding windows may contain an identifiable portion
of the actions, they may not necessarily cover the entire action instance,
which would lead to inferior performance. We adapt a two-stage temporal action
detection pipeline with a Cascaded Boundary Regression (CBR) model.
Class-agnostic proposals and specific actions are detected respectively in the
first and the second stage. CBR uses temporal coordinate regression to refine
the temporal boundaries of the sliding windows. The salient aspect of the
refinement process is that, inside each stage, the temporal boundaries are
adjusted in a cascaded way by feeding the refined windows back to the system
for further boundary refinement. We test CBR on THUMOS-14 and TVSeries, and
achieve state-of-the-art performance on both datasets. The performance gain is
especially remarkable under high IoU thresholds, e.g. mAP@tIoU=0.5 on THUMOS-14
is improved from 19.0% to 31.0%.
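The two ideas in this abstract, temporal IoU (tIoU) and cascaded refinement, can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_regressor` is a hypothetical stand-in for the learned boundary regressor, and the window format, step count, and ground-truth segment are assumptions chosen for illustration.

```python
from typing import Callable, Tuple

Window = Tuple[float, float]  # (start, end) of a temporal window, in seconds


def temporal_iou(a: Window, b: Window) -> float:
    """Temporal IoU: length of the overlap divided by length of the union."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def cascaded_refine(window: Window,
                    regress: Callable[[Window], Tuple[float, float]],
                    steps: int = 3) -> Window:
    """Feed each refined window back into the regressor `steps` times."""
    start, end = window
    for _ in range(steps):
        d_start, d_end = regress((start, end))  # predicted boundary offsets
        start, end = start + d_start, end + d_end
    return (start, end)


# Hypothetical regressor (assumption): moves each boundary halfway toward a
# fixed ground-truth segment. A real model would predict offsets from features.
def toy_regressor(w: Window, gt: Window = (10.0, 20.0)) -> Tuple[float, float]:
    return (0.5 * (gt[0] - w[0]), 0.5 * (gt[1] - w[1]))


if __name__ == "__main__":
    gt = (10.0, 20.0)
    initial = (12.0, 16.0)
    refined = cascaded_refine(initial, toy_regressor, steps=3)
    print(temporal_iou(initial, gt), temporal_iou(refined, gt))
```

In this toy setting the initial window covers only part of the action, and each cascade step moves both boundaries closer to the full instance, so the tIoU with the ground truth rises.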
Training-Free Instance Segmentation from Semantic Image Segmentation Masks
In recent years, the development of instance segmentation has garnered
significant attention in a wide range of applications. However, the training of
a fully-supervised instance segmentation model requires costly
instance-level and pixel-level annotations. In contrast, weakly-supervised
instance segmentation methods (i.e., with image-level class labels or point
labels) struggle to satisfy the accuracy and recall requirements of practical
scenarios. In this paper, we propose a novel paradigm for instance segmentation
called training-free instance segmentation (TFISeg), which achieves instance
segmentation results from image masks predicted using off-the-shelf semantic
segmentation models. TFISeg does not require training a semantic and/or
instance segmentation model and avoids the need for instance-level image
annotations. Therefore, it is highly efficient. Specifically, we first obtain a
semantic segmentation mask of the input image via a trained semantic
segmentation model. Then, we calculate a displacement field vector for each
pixel based on the segmentation mask, which can indicate representations
belonging to the same class but different instances, i.e., obtaining the
instance-level object information. Finally, instance segmentation results are
obtained after being refined by a learnable category-agnostic object boundary
branch. Extensive experimental results on two challenging datasets and
representative semantic segmentation baselines (including CNNs and
Transformers) demonstrate that TFISeg can achieve competitive results compared
to the state-of-the-art fully-supervised instance segmentation methods without
the need for additional human resources or increased computational costs. The
code is available at: TFISeg
Comment: 14 pages, 5 figures
Wireless Deep Video Semantic Transmission
In this paper, we design a new class of high-efficiency deep joint
source-channel coding methods to achieve end-to-end video transmission over
wireless channels. The proposed methods exploit nonlinear transform and
conditional coding architecture to adaptively extract semantic features across
video frames, and transmit semantic feature domain representations over
wireless channels via deep joint source-channel coding. We refer to our
framework as deep video semantic transmission (DVST). In
particular, benefiting from the strong temporal prior provided by the feature
domain context, the learned nonlinear transform function becomes temporally
adaptive, resulting in a richer and more accurate entropy model guiding the
transmission of the current frame. Accordingly, a novel rate-adaptive transmission
mechanism is developed to customize deep joint source-channel coding for video
sources. It learns to allocate the limited channel bandwidth within and among
video frames to maximize the overall transmission performance. The whole DVST
design is formulated as an optimization problem whose goal is to optimize the
end-to-end transmission rate-distortion performance under perceptual quality
metrics or machine vision task performance metrics. Across standard video
source test sequences and various communication scenarios, experiments show
that our DVST can generally surpass traditional wireless video coded
transmission schemes. The proposed DVST framework can well support future
semantic communications due to its video content-aware and machine vision task
integration abilities.
Comment: published in IEEE JSA
The TREC-2002 video track report
TREC-2002 saw the second running of the Video Track, the goal of which was to promote progress in content-based retrieval from digital video via open, metrics-based evaluation. The track used 73.3 hours of publicly available digital video (in MPEG-1/VCD format) downloaded by the participants directly from the Internet Archive (Prelinger Archives) (internetarchive, 2002) and some from the Open
Video Project (Marchionini, 2001). The material comprised advertising, educational, industrial, and amateur films produced between the 1930s and the 1970s by corporations, nonprofit organizations, trade associations, community and interest groups, educational institutions, and individuals. 17 teams representing 5 companies and 12 universities - 4 from Asia, 9 from Europe, and 4 from the US - participated in one or more of three tasks in the 2002 video track: shot boundary determination, feature extraction, and search (manual or interactive). Results were scored by NIST using manually created truth data for shot boundary determination and manual assessment of feature extraction and search results. This paper is an introduction to, and an overview
of, the track framework - the tasks, data, and measures - the approaches taken by the participating groups, the results, and issues regarding the evaluation. For detailed information about the approaches and results, the reader should see the various site reports in the final workshop proceedings.
A Case-Based Reasoning Model Powered by Deep Learning for Radiology Report Recommendation
Case-Based Reasoning models are among the most widely used reasoning paradigms in expert-knowledge-driven areas. One of the most prominent fields of use of these systems is the medical sector, where explainable models are required. However, these models rely considerably on user input and the introduction of relevant curated data. Deep learning approaches offer an analogous solution, where user input is not required. This paper proposes a hybrid Case-Based Reasoning and Deep Learning framework for medical-related applications, focusing on the generation of medical reports. The proposal combines the explainability and user-focused approach of case-based reasoning models with the performance of deep learning techniques. Moreover, the framework is fully modular to fit a wide variety of tasks and data, such as real-time sensor-captured data, images, or text. An implementation of the proposed framework focusing on radiology report generation assistance is provided. This implementation is used to evaluate the proposal, showing that it can provide meaningful and accurate corrections, even when the amount of information available is minimal. Additional tests on the optimization degree of the case base are also performed, showing how the proposed framework can optimize this base to achieve optimal performance.