5,590 research outputs found
Towards a Scalable Hardware/Software Co-Design Platform for Real-time Pedestrian Tracking Based on a ZYNQ-7000 Device
Currently, designers engaged in hardware/software co-design face the daunting
task of researching different design flows and learning the intricacies of
vendor-specific software from various manufacturers. Creating a scalable
hardware/software co-design platform has therefore become a key strategic
element in developing hardware/software integrated systems. In this paper, we
propose a new design flow for building a scalable co-design platform on an
FPGA-based system-on-chip. We employ an integrated approach to implement
histogram of oriented gradients (HOG) feature extraction and support vector
machine (SVM) classification on a programmable device for pedestrian tracking.
We report a hardware resource analysis and also analyse the precision and
success rates of pedestrian tracking on nine open-access image data sets.
Finally, our proposed design flow can be applied to any real-time
image-processing-related product on programmable ZYNQ-based embedded systems,
reducing design time and providing a scalable solution for embedded image
processing products.
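As a software-side reference for this kind of pipeline, the sketch below runs HOG feature extraction with OpenCV's pre-trained linear-SVM people detector on a single frame. The input file name and the winStride/scale parameters are illustrative assumptions; the hardware-accelerated ZYNQ implementation described in the paper is not reproduced here.

```python
# Minimal HOG + linear-SVM pedestrian detection sketch (CPU reference, not the
# ZYNQ hardware pipeline from the paper). Requires opencv-python.
import cv2

# HOG descriptor with OpenCV's default people detector (a pre-trained linear SVM).
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# "frame.png" is a placeholder input; any street-scene frame works.
frame = cv2.imread("frame.png")

# Sliding-window detection over an image pyramid; winStride and scale are
# illustrative speed/accuracy trade-off parameters.
boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)

# Draw each detection on the frame and save the result.
for (x, y, w, h), score in zip(boxes, weights):
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detections.png", frame)
```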
WCCNet: Wavelet-integrated CNN with Crossmodal Rearranging Fusion for Fast Multispectral Pedestrian Detection
Multispectral pedestrian detection achieves better visibility in challenging
conditions and thus has a broad application in various tasks, for which both
the accuracy and computational cost are of paramount importance. Most existing
approaches treat the RGB and infrared modalities equally, typically adopting two
symmetrical CNN backbones for multimodal feature extraction. This ignores the
substantial differences between the modalities and makes it difficult both to
reduce computational cost and to achieve effective crossmodal fusion. In
this work, we propose a novel and efficient framework named WCCNet that is able
to differentially extract rich features of different spectra with lower
computational complexity and semantically rearranges these features for
effective crossmodal fusion. Specifically, the discrete wavelet transform (DWT),
which allows fast inference and training, is embedded to construct a
dual-stream backbone for efficient feature extraction. The DWT layers of WCCNet
extract frequency components for infrared modality, while the CNN layers
extract spatial-domain features for RGB modality. This methodology not only
significantly reduces the computational complexity, but also improves the
extraction of infrared features to facilitate the subsequent crossmodal fusion.
Based on the well extracted features, we elaborately design the crossmodal
rearranging fusion module (CMRF), which can mitigate spatial misalignment and
merge semantically complementary features of spatially-related local regions to
amplify the crossmodal complementary information. We conduct comprehensive
evaluations on KAIST and FLIR benchmarks, in which WCCNet outperforms
state-of-the-art methods with considerable computational efficiency and
competitive accuracy. We also perform an ablation study and thoroughly analyze
the impact of different components on the performance of WCCNet.
Comment: Submitted to TPAM
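To make the dual-stream idea concrete, the sketch below pairs a fixed Haar wavelet decomposition of the infrared input with a small convolutional branch for the RGB input and fuses them by channel concatenation. The layer sizes and the naive concatenation fusion are illustrative assumptions, not the CMRF module or the WCCNet backbone itself.

```python
# Illustrative dual-stream sketch: a Haar DWT branch for infrared and a small
# CNN branch for RGB, fused by channel concatenation. Sizes and the fusion
# strategy are placeholder assumptions, not the WCCNet architecture.
import torch
import torch.nn as nn


class HaarDWT(nn.Module):
    """One-level 2-D Haar DWT implemented as a fixed, stride-2 convolution."""

    def __init__(self, channels: int):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        filters = torch.stack([ll, lh, hl, hh]).unsqueeze(1)  # (4, 1, 2, 2)
        # One set of four sub-band filters per input channel (grouped conv).
        self.conv = nn.Conv2d(channels, 4 * channels, kernel_size=2,
                              stride=2, groups=channels, bias=False)
        self.conv.weight.data = filters.repeat(channels, 1, 1, 1)
        self.conv.weight.requires_grad_(False)

    def forward(self, x):
        return self.conv(x)  # frequency sub-bands at half resolution


class DualStreamSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.ir_dwt = HaarDWT(channels=1)      # infrared: 1 channel in, 4 sub-bands out
        self.rgb_cnn = nn.Sequential(          # RGB: spatial features at half resolution
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb, ir):
        ir_feat = self.ir_dwt(ir)      # (B, 4, H/2, W/2) frequency components
        rgb_feat = self.rgb_cnn(rgb)   # (B, 16, H/2, W/2) spatial features
        return torch.cat([ir_feat, rgb_feat], dim=1)  # naive fusion by concatenation


if __name__ == "__main__":
    rgb = torch.randn(2, 3, 64, 64)
    ir = torch.randn(2, 1, 64, 64)
    fused = DualStreamSketch()(rgb, ir)
    print(fused.shape)  # torch.Size([2, 20, 32, 32])
```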
ContextVP: Fully Context-Aware Video Prediction
Video prediction models based on convolutional networks, recurrent networks,
and their combinations often result in blurry predictions. We identify an
important contributing factor for imprecise predictions that has not been
studied adequately in the literature: blind spots, i.e., lack of access to all
relevant past information for accurately predicting the future. To address this
issue, we introduce a fully context-aware architecture that captures the entire
available past context for each pixel using Parallel Multi-Dimensional LSTM
units and aggregates it using blending units. Our model outperforms a strong
baseline network of 20 recurrent convolutional layers and yields
state-of-the-art performance for next step prediction on three challenging
real-world video datasets: Human 3.6M, Caltech Pedestrian, and UCF-101.
Moreover, it does so with fewer parameters than several recently proposed
models, and does not rely on deep convolutional networks, multi-scale
architectures, separation of background and foreground modeling, motion flow
learning, or adversarial training. These results highlight that full awareness
of past context is of crucial importance for video prediction.
Comment: 19 pages. ECCV 2018 oral presentation. Project webpage is at
https://wonmin-byeon.github.io/publication/2018-ecc
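As a rough illustration of the aggregation idea only (not the authors' PMD-LSTM formulation), the sketch below shows a blending unit that merges per-direction context feature maps using a learned, softmax-normalized weight per direction followed by a 1x1 convolution. The number of directions and channel size are placeholder choices.

```python
# Illustrative blending-unit sketch: combine context features gathered from
# several directional recurrent passes into one per-pixel representation.
# A simplified stand-in, not the ContextVP architecture.
import torch
import torch.nn as nn


class BlendingUnit(nn.Module):
    def __init__(self, num_directions: int, channels: int):
        super().__init__()
        # One learnable scalar weight per direction, normalized with softmax.
        self.logits = nn.Parameter(torch.zeros(num_directions))
        # A 1x1 convolution mixes the blended channels.
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, direction_feats):
        # direction_feats: list of (B, C, H, W) tensors, one per direction.
        stacked = torch.stack(direction_feats, dim=0)      # (D, B, C, H, W)
        weights = torch.softmax(self.logits, dim=0)        # (D,)
        blended = (weights.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)
        return self.mix(blended)


if __name__ == "__main__":
    feats = [torch.randn(2, 32, 16, 16) for _ in range(4)]  # e.g. 4 directions
    out = BlendingUnit(num_directions=4, channels=32)(feats)
    print(out.shape)  # torch.Size([2, 32, 16, 16])
```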
Multi-scale Spatial-temporal Interaction Network for Video Anomaly Detection
Video anomaly detection (VAD) is an essential yet challenging task in signal
processing. Since certain anomalies cannot be detected by analyzing temporal or
spatial information alone, the interaction between two types of information is
considered crucial for VAD. However, current dual-stream architectures either
limit interaction between the two types of information to the bottleneck of
autoencoder or incorporate background pixels irrelevant to anomalies into the
interaction. To this end, we propose a multi-scale spatial-temporal interaction
network (MSTI-Net) for VAD. First, to pay particular attention to objects and
reconcile the significant semantic differences between the two types of information, we
propose an attention-based spatial-temporal fusion module (ASTM) as a
substitute for conventional direct fusion. Furthermore, we inject multiple
ASTM-based connections between the appearance and motion pathways of a
dual-stream network to facilitate spatial-temporal interaction at all possible
scales. Finally, the regular information learned from multiple scales is
recorded in memory to enhance the differentiation between anomalies and normal
events during the testing phase. Solid experimental results on three standard
datasets validate the effectiveness of our approach, which achieves AUCs of
96.8% on UCSD Ped2, 87.6% on CUHK Avenue, and 73.9% on the ShanghaiTech
dataset.
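To give a flavour of attention-driven fusion between an appearance stream and a motion stream (a generic sketch, not the ASTM module defined in the paper), the example below lets motion features attend to appearance features with standard multi-head attention over flattened spatial positions; the channel count and single-head choice are placeholder assumptions.

```python
# Generic attention-based fusion sketch between appearance and motion feature
# maps, illustrating one stream attending to the other. Not the ASTM module
# from MSTI-Net; sizes are placeholders.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, channels: int, num_heads: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, appearance, motion):
        # appearance, motion: (B, C, H, W) feature maps from the two pathways.
        b, c, h, w = appearance.shape
        app_seq = appearance.flatten(2).transpose(1, 2)  # (B, H*W, C)
        mot_seq = motion.flatten(2).transpose(1, 2)      # (B, H*W, C)
        # Motion features query appearance features at every spatial position.
        fused, _ = self.attn(query=mot_seq, key=app_seq, value=app_seq)
        fused = self.norm(fused + mot_seq)               # residual connection
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    app = torch.randn(2, 64, 16, 16)
    mot = torch.randn(2, 64, 16, 16)
    out = AttentionFusion(channels=64)(app, mot)
    print(out.shape)  # torch.Size([2, 64, 16, 16])
```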