Facial Video-based Remote Physiological Measurement via Self-supervised Learning
Facial video-based remote physiological measurement aims to estimate remote
photoplethysmography (rPPG) signals from human face videos and then measure
multiple vital signs (e.g., heart rate, respiration frequency) from rPPG
signals. Recent approaches achieve it by training deep neural networks, which
normally require abundant facial videos and synchronously recorded
photoplethysmography (PPG) signals for supervision. However, the collection of
these annotated corpora is not easy in practice. In this paper, we introduce a
novel frequency-inspired self-supervised framework that learns to estimate rPPG
signals from facial videos without the need for ground-truth PPG signals. Given
a video sample, we first augment it into multiple positive/negative samples
which contain similar/dissimilar signal frequencies to the original one.
Specifically, positive samples are generated using spatial augmentation.
Negative samples are generated via a learnable frequency augmentation module,
which performs non-linear signal frequency transformation on the input without
excessively changing its visual appearance. Next, we introduce a local rPPG
expert aggregation module to estimate rPPG signals from augmented samples. It
encodes complementary pulsation information from different face regions and
aggregates it into one rPPG prediction. Finally, we propose a series of
frequency-inspired losses, i.e., frequency contrastive loss, frequency ratio
consistency loss, and cross-video frequency agreement loss, for the
optimization of estimated rPPG signals from multiple augmented video samples
and across temporally neighboring video samples. We conduct rPPG-based heart
rate, heart rate variability and respiration frequency estimation on four
standard benchmarks. The experimental results demonstrate that our method
improves the state of the art by a large margin. Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
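As a rough illustration of how a frequency contrastive objective of this general shape can be written, the sketch below compares power-spectral-density representations of the anchor's rPPG prediction with those of positive (spatially augmented) and negative (frequency-transformed) samples. The PSD features, cosine similarity, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def psd(signal: torch.Tensor) -> torch.Tensor:
    """Normalized power spectral density along the last (time) dimension."""
    spectrum = torch.fft.rfft(signal, dim=-1).abs() ** 2
    return spectrum / spectrum.sum(dim=-1, keepdim=True).clamp_min(1e-8)

def frequency_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss in the frequency domain: the anchor prediction's PSD
    should match the PSDs of positive samples (same underlying pulse frequency)
    and differ from those of negative, frequency-transformed samples.
    anchor: [B, T]; positives: [B, Kp, T]; negatives: [B, Kn, T]."""
    a = psd(anchor).unsqueeze(1)                                 # [B, 1, F]
    pos_sim = F.cosine_similarity(a, psd(positives), dim=-1)     # [B, Kp]
    neg_sim = F.cosine_similarity(a, psd(negatives), dim=-1)     # [B, Kn]
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # [B, Kp+Kn]
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Maximize the probability mass assigned to the positive samples.
    return -log_prob[:, : pos_sim.shape[1]].mean()
```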
Enhancing Space-time Video Super-resolution via Spatial-temporal Feature Interaction
The target of space-time video super-resolution (STVSR) is to increase both
the frame rate (also referred to as the temporal resolution) and the spatial
resolution of a given video. Recent approaches solve STVSR with end-to-end deep
neural networks. A popular solution is to first increase the frame rate of the
video; then perform feature refinement among different frame features; and finally
increase the spatial resolutions of these features. The temporal correlation
among features of different frames is carefully exploited in this process. The
spatial correlation among features of different (spatial) resolutions, despite
also being very important, is however not emphasized. In this paper, we propose
a spatial-temporal feature interaction network to enhance STVSR by exploiting
both spatial and temporal correlations among features of different frames and
spatial resolutions. Specifically, the spatial-temporal frame interpolation
module is introduced to interpolate low- and high-resolution intermediate frame
features simultaneously and interactively. The spatial-temporal local and
global refinement modules are respectively deployed afterwards to exploit the
spatial-temporal correlation among different features for their refinement.
Finally, a novel motion consistency loss is employed to enhance the motion
continuity among reconstructed frames. We conduct experiments on three standard
benchmarks, Vid4, Vimeo-90K and Adobe240, and the results demonstrate that our
method improves on state-of-the-art methods by a considerable margin. Our
code will be available at
https://github.com/yuezijie/STINet-Space-time-Video-Super-resolution
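A minimal sketch of one possible motion consistency term, assuming precomputed optical flows and bilinear backward warping; the paper's actual loss may be defined differently (for example, on flows estimated from the reconstructed frames), so treat this as an illustration only.

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` [B, C, H, W] with optical flow [B, 2, H, W]
    given in pixels as (dx, dy) (ordering assumed for illustration)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)   # [2, H, W]
    coords = grid.unsqueeze(0) + flow                              # [B, 2, H, W]
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)          # [B, H, W, 2]
    return F.grid_sample(frame, grid_norm, align_corners=True)

def motion_consistency_loss(recon: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
    """Photometric-style consistency between consecutive reconstructed frames:
    frame t+1, warped back by the flow from t to t+1, should match frame t.
    recon: [B, N, C, H, W]; flows: [B, N-1, 2, H, W]."""
    loss = 0.0
    for t in range(recon.shape[1] - 1):
        warped = warp(recon[:, t + 1], flows[:, t])
        loss = loss + (warped - recon[:, t]).abs().mean()
    return loss / (recon.shape[1] - 1)
```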
LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation
Referring video object segmentation (RVOS) aims to segment the target
instance referred by a given text expression in a video clip. The text
expression normally contains a sophisticated description of the instance's
appearance, action, and relation with others. It is therefore rather difficult
for an RVOS model to capture all these attributes correspondingly in the video;
in fact, the model often favours the action- and relation-related
visual attributes of the instance. This can result in partial or even
incorrect mask prediction of the target instance. We tackle this problem by
taking a subject-centric short text expression from the original long text
expression. The short one retains only the appearance-related information of
the target instance so that we can use it to focus the model's attention on the
instance's appearance. We let the model make joint predictions using both the long
and short text expressions, and insert a long-short cross-attention module to
let the joint features interact, along with a long-short prediction intersection loss to
regulate the joint predictions. Besides the improvement on the linguistic side,
we also introduce a forward-backward visual consistency loss, which utilizes
optical flows to warp visual features between the annotated frames and their
temporal neighbors for consistency. We build our method on top of two
state-of-the-art pipelines. Extensive experiments on A2D-Sentences, Refer-YouTube-VOS,
JHMDB-Sentences and Refer-DAVIS17 show impressive improvements from our
method. Code is available at https://github.com/LinfengYuan1997/Losh. Comment: CVPR202
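One plausible reading of the long-short prediction intersection loss is a Dice-style penalty on the element-wise intersection of the two predicted masks against the ground truth; the sketch below follows that reading and is an assumption, not LoSh's exact definition.

```python
import torch

def intersection_loss(long_logits, short_logits, gt_mask, eps=1e-6):
    """Encourage the masks predicted from the long and short expressions to
    agree on the target: supervise their element-wise intersection (product of
    probabilities) against the ground-truth mask with a Dice-style loss.
    long_logits, short_logits, gt_mask: [B, H, W]."""
    inter_prob = torch.sigmoid(long_logits) * torch.sigmoid(short_logits)
    num = 2.0 * (inter_prob * gt_mask).sum(dim=(1, 2)) + eps
    den = inter_prob.sum(dim=(1, 2)) + gt_mask.sum(dim=(1, 2)) + eps
    return (1.0 - num / den).mean()
```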
Large Model driven Radiology Report Generation with Clinical Quality Reinforcement Learning
Radiology report generation (RRG) has attracted significant attention due to
its potential to reduce the workload of radiologists. Current RRG approaches
still fall short of clinical standards. This paper introduces a
novel RRG method, \textbf{LM-RRG}, that integrates large models (LMs) with
clinical quality reinforcement learning to generate accurate and comprehensive
chest X-ray radiology reports. Our method first designs a large language
model-driven feature extractor to analyze and interpret different regions of the
chest X-ray image, emphasizing specific regions with medical significance.
Next, based on the large model's decoder, we develop a multimodal report
generator that leverages multimodal prompts from visual features and textual
instruction to produce the radiology report in an auto-regressive way. Finally,
to better reflect the clinically significant and insignificant errors that
radiologists would normally identify in a report, we introduce a novel clinical
quality reinforcement learning strategy. It utilizes the radiology report
clinical quality (RadCliQ) metric as a reward function in the learning process.
Extensive experiments on the MIMIC-CXR and IU-Xray datasets demonstrate the
superiority of our method over the state of the art.
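A hedged sketch of how a clinical-quality metric can act as a reward in a policy-gradient (self-critical style) update; `radcliq_score` is a placeholder for an external RadCliQ implementation, and treating lower RadCliQ as better (hence negating it for the reward) is an assumption here, not a statement of the paper's exact training recipe.

```python
import torch

def rl_loss(sample_log_probs, sampled_reports, greedy_reports,
            reference_reports, radcliq_score):
    """Self-critical policy-gradient loss with a clinical-quality reward.
    sample_log_probs: [B] summed token log-probabilities of sampled reports.
    radcliq_score(pred, ref) -> float, assumed lower-is-better, so the reward
    is its negation; greedy decoding serves as the baseline."""
    rewards, baselines = [], []
    for samp, greedy, ref in zip(sampled_reports, greedy_reports, reference_reports):
        rewards.append(-radcliq_score(samp, ref))      # reward for the sampled report
        baselines.append(-radcliq_score(greedy, ref))  # baseline from greedy decoding
    advantage = torch.tensor(rewards) - torch.tensor(baselines)   # [B]
    # REINFORCE with baseline: raise log-probs of samples that beat the baseline.
    return -(advantage.to(sample_log_probs.device) * sample_log_probs).mean()
```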
Explicit Interaction for Fusion-Based Place Recognition
Fusion-based place recognition is an emerging technique jointly utilizing
multi-modal perception data to recognize previously visited places in
GPS-denied scenarios for robots and autonomous vehicles. Recent fusion-based
place recognition methods combine multi-modal features in an implicit manner.
While achieving remarkable results, they do not explicitly consider what the
individual modality affords in the fusion system. Therefore, the benefit of
multi-modal feature fusion may not be fully explored. In this paper, we propose
a novel fusion-based network, dubbed EINet, to achieve explicit interaction of
the two modalities. EINet uses LiDAR ranges to supervise more robust vision
features over long time spans, and simultaneously uses camera RGB data to
improve the discrimination of LiDAR point clouds. In addition, we develop a new
benchmark for the place recognition task based on the nuScenes dataset. To
establish this benchmark for future research with comprehensive comparisons, we
introduce both supervised and self-supervised training schemes alongside
evaluation protocols. We conduct extensive experiments on the proposed
benchmark, and the experimental results show that our EINet exhibits better
recognition performance as well as solid generalization ability compared to the
state-of-the-art fusion-based place recognition approaches. Our open-source
code and benchmark are released at: https://github.com/BIT-XJY/EINet
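For context on the evaluation protocols mentioned above, a common place-recognition metric is Recall@1 under a ground-truth distance threshold; the sketch below shows one such protocol with an assumed threshold and descriptor layout, not necessarily the exact protocol of the EINet benchmark.

```python
import numpy as np

def recall_at_1(query_desc, db_desc, query_pos, db_pos, dist_thresh=10.0):
    """Recall@1: a query counts as correctly recognized if its nearest database
    descriptor comes from a place within `dist_thresh` meters.
    query_desc: [Q, D], db_desc: [N, D], query_pos: [Q, 3], db_pos: [N, 3]."""
    # Cosine similarity between L2-normalized global descriptors.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    nearest = np.argmax(q @ d.T, axis=1)                  # top-1 retrieval per query
    dists = np.linalg.norm(query_pos - db_pos[nearest], axis=1)
    return float(np.mean(dists < dist_thresh))
```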
A Novel Non-Volatile Inverter-based CiM: Continuous Sign Weight Transition and Low Power on-Chip Training
In this work, we report a novel design, one-transistor-one-inverter (1T1I),
to satisfy high-speed and low-power on-chip training requirements. By
leveraging doped HfO2 with ferroelectricity, a non-volatile inverter is
successfully demonstrated, enabling desired continuous weight transition
between negative and positive via the programmable threshold voltage (VTH) of
ferroelectric field-effect transistors (FeFETs). Compared with commonly used
designs with a similar function, 1T1I uniquely achieves purely on-chip
weight transition at an optimized working current without relying on assistance
from off-chip calculation units for signed-weight comparison, facilitating
high-speed training at low power consumption. Further improvements in linearity
and training speed can be obtained via a two-transistor-one-inverter (2T1I)
design. Overall, focusing on energy and time efficiencies, this work provides a
valuable design strategy for future FeFET-based computing-in-memory (CiM).
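As a purely behavioral illustration of a continuous sign-weight transition controlled by a programmable VTH, the toy model below reads an idealized inverter at a fixed input and takes the deviation from mid-rail as the stored weight; the sigmoid transfer curve and the VTH-to-weight mapping are assumptions for illustration, not measured 1T1I behavior.

```python
import numpy as np

def inverter_output(v_in, v_switch, vdd=1.0, gain=20.0):
    """Idealized inverter transfer curve: the output swings from VDD to 0 around
    the switching point `v_switch`, which the FeFET's programmable VTH shifts."""
    return vdd / (1.0 + np.exp(gain * (v_in - v_switch)))

def stored_weight(v_th, v_read=0.5, vdd=1.0):
    """Toy signed weight: read the inverter at a fixed input voltage and take the
    deviation from mid-rail, so programming VTH sweeps the weight continuously
    through zero from negative to positive (illustrative mapping only)."""
    return inverter_output(v_read, v_th, vdd) - vdd / 2.0

# Sweeping the programmed VTH moves the weight continuously across zero,
# with no separate sign bit or off-chip signed-weight comparison.
for v_th in np.linspace(0.3, 0.7, 5):
    print(f"VTH = {v_th:.2f} V -> weight = {stored_weight(v_th):+.3f}")
```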
A promoting role of androgen receptor in androgen-sensitive and -insensitive prostate cancer cells
Although the vital role of the androgen receptor (AR) has been well demonstrated in primary prostate cancers, its role in androgen-insensitive prostate cancers remains unclear. Here, we used a small hairpin RNA approach to directly assess AR activity in prostate cancer cells. Reduction of AR expression in the two androgen-sensitive prostate cancer cell lines, LNCaP and LAPC4, significantly decreased AR-mediated transcription and cell growth. Intriguingly, in two androgen-insensitive prostate cell lines, LNCaP-C42B4 and CWR22Rv1, knockdown of AR expression showed a more pronounced effect on AR-induced transcription and cell growth than androgen depletion. Using cDNA microarrays, we also compared the transcriptional profiles induced by either androgen depletion or AR knockdown. Although a significant number of transcripts appear to be regulated by both androgen depletion and AR knockdown, we observed a subset of transcripts affected only by androgen depletion but not by AR knockdown, and vice versa. Finally, we demonstrated a direct role for AR in promoting tumor formation and growth in a xenograft model. Taken together, our results elucidate an important role for the AR in androgen-insensitive prostate cancer cells, and suggest that AR can be used as a therapeutic target for androgen-insensitive prostate cancers.