Semantic Adversarial Network with Multi-scale Pyramid Attention for Video Classification
Two-stream architectures have shown strong performance in video classification
tasks. The key idea is to learn spatio-temporal features by fusing convolutional
networks spatially and temporally. However, such architectures have several
problems. First, they rely on optical flow to model temporal information,
which is often expensive to compute and store. Second, they have limited ability
to capture details and local context information in video data. Third, they
lack explicit semantic guidance, which greatly decreases classification
performance. In this paper, we propose a new two-stream-based deep framework
for video classification that discovers spatial and temporal information from
RGB frames only. Moreover, a multi-scale pyramid attention (MPA) layer and a
semantic adversarial learning (SAL) module are introduced and integrated into our
framework. The MPA enables the network to capture global and local features to
generate a comprehensive representation for video, and the SAL makes this
representation gradually approximate the real video semantics in an
adversarial manner. Experimental results on two public benchmarks demonstrate
that our proposed method achieves state-of-the-art results on standard video
datasets.
PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer
We present PBFormer, an efficient yet powerful scene text detector that
unifies the transformer with a novel text shape representation, Polynomial Band
(PB). The representation has four polynomial curves to fit a text's top,
bottom, left, and right sides, which can capture a text with a complex shape by
varying polynomial coefficients. PB has appealing features compared with
conventional representations: 1) It can model different curvatures with a fixed
number of parameters, while polygon-points-based methods need to utilize a
different number of points. 2) It can distinguish adjacent or overlapping texts
as they have clearly different curve coefficients, while segmentation-based or
points-based methods suffer from adhesive spatial positions. PBFormer combines
the PB with the transformer, which can directly generate smooth text contours
sampled from predicted curves without interpolation. A parameter-free
cross-scale pixel attention (CPA) module is employed to highlight the feature
map of a suitable scale while suppressing the other feature maps. The simple
operation can help detect small-scale texts and is compatible with the
one-stage DETR framework, which needs no NMS postprocessing. Furthermore,
PBFormer is trained with a shape-contained loss, which not only enforces the
piecewise alignment between the ground truth and the predicted curves but also
makes curves' positions and shapes consistent with each other. Without bells
and whistles about text pre-training, our method is superior to the previous
state-of-the-art text detectors on the arbitrary-shaped text datasets.
Comment: 9 pages, 8 figures, accepted by ACM MM 202
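As a rough illustration of how a contour can be sampled directly from four polynomial curves, here is a minimal Python sketch. It is not the PBFormer implementation. The parameterization is an assumption (top and bottom sides as y = p(x), left and right sides as x = q(y)), and the names `eval_poly` and `sample_pb_contour` and the example coefficients are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def eval_poly(coeffs: Sequence[float], t: float) -> float:
    """Evaluate c0 + c1*t + c2*t**2 + ... using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def sample_pb_contour(top: Sequence[float], bottom: Sequence[float],
                      left: Sequence[float], right: Sequence[float],
                      x_min: float, x_max: float, n: int = 5) -> List[Point]:
    """Sample a closed contour from four polynomial sides.

    Assumption for this sketch: top/bottom give y as a polynomial in x,
    left/right give x as a polynomial in y. Requires n >= 2.
    """
    ts = [i / (n - 1) for i in range(n)]
    xs = [x_min + t * (x_max - x_min) for t in ts]

    # Top side runs left to right, bottom side runs right to left.
    top_pts = [(x, eval_poly(top, x)) for x in xs]
    bottom_pts = [(x, eval_poly(bottom, x)) for x in reversed(xs)]

    # Right side connects the end of the top side to the start of the bottom side.
    y0, y1 = top_pts[-1][1], bottom_pts[0][1]
    right_pts = [(eval_poly(right, y0 + t * (y1 - y0)), y0 + t * (y1 - y0)) for t in ts]

    # Left side connects the end of the bottom side back to the start of the top side.
    y0, y1 = bottom_pts[-1][1], top_pts[0][1]
    left_pts = [(eval_poly(left, y0 + t * (y1 - y0)), y0 + t * (y1 - y0)) for t in ts]

    # Drop the side endpoints that duplicate corner points of top/bottom.
    return top_pts + right_pts[1:-1] + bottom_pts + left_pts[1:-1]


# Hypothetical curved text band: y_top = 2 - x + 0.25*x**2, y_bottom = y_top + 1.
TOP = [2.0, -1.0, 0.25]
BOTTOM = [3.0, -1.0, 0.25]
LEFT = [0.0]    # x = 0 on the left side
RIGHT = [4.0]   # x = 4 on the right side

contour = sample_pb_contour(TOP, BOTTOM, LEFT, RIGHT, x_min=0.0, x_max=4.0, n=5)
```

The number of coefficients stays fixed regardless of how many contour points are sampled, so increasing `n` only densifies the contour.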
Viia-hand: a Reach-and-grasp Restoration System Integrating Voice interaction, Computer vision and Auditory feedback for Blind Amputees
Visual feedback plays a crucial role when amputees perform grasping tasks
under prosthesis control. However, for blind and
visually impaired (BVI) amputees, the loss of both visual and grasping
abilities makes the "easy" reach-and-grasp task a formidable challenge. In this
paper, we propose a novel multi-sensory prosthesis system helping BVI amputees
with sensing, navigation and grasp operations. It combines modules of voice
interaction, environmental perception, grasp guidance, collaborative control,
and auditory/tactile feedback. In particular, the voice interaction module
receives user instructions and invokes other functional modules according to
the instructions. The environmental perception and grasp guidance module
obtains environmental information through computer vision and feeds the
information back to the user through auditory feedback modules (voice prompts and
spatial sound sources) and tactile feedback modules (vibration stimulation).
The prosthesis collaborative control module obtains the context information of
the grasp guidance process and completes the collaborative control of grasp
gestures and wrist angles of prosthesis in conjunction with the user's control
intention in order to achieve stable grasp of various objects. This paper
details a prototype design (named viia-hand) and presents its preliminary
experimental verification on healthy subjects completing specific
reach-and-grasp tasks. Our results showed that, with the help of our new
design, the subjects were able to achieve a precise reach and reliable grasp of
the target objects in a relatively cluttered environment. Additionally, the
system is extremely user-friendly, as users can quickly adapt to it with
minimal training.
Free-Form Composition Networks for Egocentric Action Recognition
Egocentric action recognition is gaining significant attention in the field
of human action recognition. In this paper, we address the data scarcity issue in
egocentric action recognition from a compositional generalization perspective.
To tackle this problem, we propose a free-form composition network (FFCN) that
can simultaneously learn disentangled verb, preposition, and noun
representations, and then use them to compose new samples in the feature space
for rare classes of action videos. First, we use a graph to capture the
spatial-temporal relations among different hand/object instances in each action
video. We thus decompose each action into a set of verb and preposition
spatial-temporal representations using the edge features in the graph. The
temporal decomposition extracts verb and preposition representations from
different video frames, while the spatial decomposition adaptively learns verb
and preposition representations from action-related instances in each frame.
With these spatial-temporal representations of verbs and prepositions, we can
compose new samples for those rare classes in a free-form manner, which is not
restricted to a rigid form of a verb and a noun. The proposed FFCN can directly
generate new training data samples for rare classes and hence significantly
improves action recognition performance. We evaluated our method on three
popular egocentric action recognition datasets, Something-Something V2, H2O,
and EPIC-KITCHENS-100, and the experimental results demonstrate the
effectiveness of the proposed method for handling data scarcity problems,
including long-tailed and few-shot egocentric action recognition.
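The idea of composing new feature-space samples for a rare class can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FFCN method itself: the feature banks, the function name `compose_sample`, and the use of plain concatenation as the composition operator are all hypothetical, since the abstract does not specify how the parts are combined.

```python
import random
from typing import Dict, List, Tuple

Feature = List[float]
Bank = Dict[str, List[Feature]]


def compose_sample(verb_bank: Bank, prep_bank: Bank, noun_bank: Bank,
                   label: Tuple[str, str, str], rng: random.Random) -> Feature:
    """Build one synthetic feature vector for a (verb, preposition, noun) class.

    Assumption: composition is plain concatenation of one stored feature
    per component; the real composition operator may differ.
    """
    verb, prep, noun = label
    return (rng.choice(verb_bank[verb])
            + rng.choice(prep_bank[prep])
            + rng.choice(noun_bank[noun]))


# Hypothetical 2-dimensional component features collected from frequent classes.
VERB_BANK: Bank = {"put": [[0.9, 0.1], [0.8, 0.2]]}
PREP_BANK: Bank = {"into": [[0.3, 0.7]]}
NOUN_BANK: Bank = {"cup": [[0.5, 0.5], [0.4, 0.6]]}

rng = random.Random(0)
rare_sample = compose_sample(VERB_BANK, PREP_BANK, NOUN_BANK,
                             ("put", "into", "cup"), rng)
```

Because each component is drawn independently, a class such as "put into cup" can receive training samples even when that exact combination rarely appears in the data.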
Ultrafast Spin-To-Charge Conversion at the Surface of Topological Insulator Thin Films
Strong spin-orbit coupling, resulting in the formation of
spin-momentum-locked surface states, endows topological insulators with
superior spin-to-charge conversion characteristics, though the dynamics that
govern it have remained elusive. Here, we present an all-optical method that
enables unprecedented tracking of the ultrafast dynamics of spin-to-charge
conversion in a prototypical topological insulator Bi2Se3/ferromagnetic
Co heterostructure, down to the sub-picosecond timescale. Compared to pure
Bi2Se3 or Co, we observe a giant terahertz emission in the
heterostructure that originates from spin-to-charge conversion, in which the
topological surface states play a crucial role. We identify a 0.12-picosecond
timescale that sets a technological speed limit of spin-to-charge conversion
processes in topological insulators. In addition, we show that the
spin-to-charge conversion efficiency is temperature independent in Bi2Se3
as expected from the nature of the surface states, paving the way for designing
next-generation high-speed opto-spintronic devices based on topological
insulators at room temperature.
Comment: 19 pages, 4 figure
Oxygen-vacancy effect on structural, magnetic, and ferroelectric properties in multiferroic YMnO3 single crystals
We have investigated the structural, magnetic, and ferroelectric properties of magnetically frustrated multiferroic YMnO3 single crystals. The ferroelectric domain structures of YMnO3 samples were studied by piezoresponse force microscopy. Instead of the domain vortex structure seen in stoichiometric crystals, YMnO3-delta exhibits a random domain configuration with straight domain walls. In magnetic measurements, the YMnO3-delta crystal shows typical antiferromagnetic behavior with a higher Néel temperature and lower magnetization compared to the stoichiometric sample. The ordered oxygen vacancies dominate multiferroicity through tailoring the domain wall structure. (C) 2012 American Institute of Physics. [doi:10.1063/1.3676000]