268 research outputs found

    Semantic Adversarial Network with Multi-scale Pyramid Attention for Video Classification

    Two-stream architectures have shown strong performance in video classification. The key idea is to learn spatio-temporal features by fusing convolutional networks spatially and temporally. However, such architectures have several problems. First, they rely on optical flow to model temporal information, which is often expensive to compute and store. Second, they have limited ability to capture details and local context in video data. Third, they lack explicit semantic guidance, which greatly decreases classification performance. In this paper, we propose a new two-stream-based deep framework for video classification that discovers spatial and temporal information from RGB frames only. Moreover, a multi-scale pyramid attention (MPA) layer and a semantic adversarial learning (SAL) module are introduced and integrated into our framework. The MPA enables the network to capture global and local features and generate a comprehensive video representation, and the SAL makes this representation gradually approximate the real video semantics in an adversarial manner. Experimental results on two public benchmarks demonstrate that our proposed method achieves state-of-the-art results on standard video datasets.

    PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer

    We present PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, Polynomial Band (PB). The representation uses four polynomial curves to fit a text's top, bottom, left, and right sides, and can capture a text with a complex shape by varying the polynomial coefficients. PB has appealing features compared with conventional representations: 1) It can model different curvatures with a fixed number of parameters, while polygon-points-based methods need a different number of points. 2) It can distinguish adjacent or overlapping texts because they have clearly different curve coefficients, while segmentation-based or points-based methods suffer from adhesive spatial positions. PBFormer combines the PB with the transformer, which can directly generate smooth text contours sampled from the predicted curves without interpolation. A parameter-free cross-scale pixel attention (CPA) module is employed to highlight the feature map of a suitable scale while suppressing the other feature maps. This simple operation helps detect small-scale texts and is compatible with the one-stage DETR framework, which needs no NMS post-processing. Furthermore, PBFormer is trained with a shape-contained loss, which not only enforces piecewise alignment between the ground truth and the predicted curves but also makes the curves' positions and shapes consistent with each other. Without bells and whistles such as text pre-training, our method is superior to the previous state-of-the-art text detectors on arbitrary-shaped text datasets. Comment: 9 pages, 8 figures, accepted by ACM MM 202
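    The idea of sampling a closed text contour directly from four polynomial curves can be illustrated with a minimal sketch. This is not the PBFormer implementation: the function name `sample_band_contour`, the parameterization (top and bottom as y = f(x), left and right as x = g(y)), and the example coefficients are all assumptions made for illustration.

```python
import numpy as np

def sample_band_contour(top, bottom, left, right, x0, x1, n=16):
    """Sample a closed contour from four polynomial curves.

    Hypothetical parameterization (an assumption, not taken from the paper):
    top and bottom are y = f(x) on [x0, x1]; left and right are x = g(y).
    Coefficients are ordered highest degree first, as np.polyval expects.
    """
    xs = np.linspace(x0, x1, n)
    # Top side, traversed left to right.
    top_pts = np.stack([xs, np.polyval(top, xs)], axis=1)
    # Bottom side, traversed right to left.
    bottom_pts = np.stack([xs[::-1], np.polyval(bottom, xs[::-1])], axis=1)
    # Right side, from the end of the top curve down to the start of the bottom curve.
    ys_right = np.linspace(top_pts[-1, 1], bottom_pts[0, 1], n)
    right_pts = np.stack([np.polyval(right, ys_right), ys_right], axis=1)
    # Left side, from the end of the bottom curve back up to the start of the top curve.
    ys_left = np.linspace(bottom_pts[-1, 1], top_pts[0, 1], n)
    left_pts = np.stack([np.polyval(left, ys_left), ys_left], axis=1)
    return np.concatenate([top_pts, right_pts, bottom_pts, left_pts], axis=0)

# Illustrative coefficients: a curved band of height 2 spanning x in [0, 4].
contour = sample_band_contour(
    top=[0.25, -1.0, 1.0],     # y = 0.25x^2 - x + 1
    bottom=[0.25, -1.0, 3.0],  # y = 0.25x^2 - x + 3
    left=[0.0],                # x = 0
    right=[4.0],               # x = 4
    x0=0.0,
    x1=4.0,
)
```

    Because every side is a polynomial, the contour can be sampled at any density by changing `n` while the number of shape parameters stays fixed, which is the property the abstract contrasts with polygon-point representations.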

    Viia-hand: a Reach-and-grasp Restoration System Integrating Voice interaction, Computer vision and Auditory feedback for Blind Amputees

    Visual feedback plays a crucial role in prosthesis control when amputees perform grasping. However, for blind and visually impaired (BVI) amputees, the loss of both visual and grasping abilities makes the "easy" reach-and-grasp task a serious challenge. In this paper, we propose a novel multi-sensory prosthesis system that helps BVI amputees with sensing, navigation, and grasp operations. It combines modules for voice interaction, environmental perception, grasp guidance, collaborative control, and auditory/tactile feedback. In particular, the voice interaction module receives user instructions and invokes other functional modules accordingly. The environmental perception and grasp guidance module obtains environmental information through computer vision and feeds it back to the user through auditory feedback modules (voice prompts and spatial sound sources) and tactile feedback modules (vibration stimulation). The prosthesis collaborative control module obtains the context of the grasp guidance process and, in conjunction with the user's control intention, coordinates the grasp gestures and wrist angles of the prosthesis to achieve a stable grasp of various objects. This paper details a prototype design (named viia-hand) and presents its preliminary experimental verification on healthy subjects completing specific reach-and-grasp tasks. Our results showed that, with the help of our new design, the subjects were able to achieve a precise reach and a reliable grasp of the target objects in a relatively cluttered environment. Additionally, the system is extremely user-friendly, as users can quickly adapt to it with minimal training.

    Free-Form Composition Networks for Egocentric Action Recognition

    Egocentric action recognition is gaining significant attention in the field of human action recognition. In this paper, we address the data scarcity issue in egocentric action recognition from a compositional generalization perspective. To tackle this problem, we propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations, and then use them to compose new samples in the feature space for rare classes of action videos. First, we use a graph to capture the spatial-temporal relations among different hand/object instances in each action video. We then decompose each action into a set of verb and preposition spatial-temporal representations using the edge features in the graph. The temporal decomposition extracts verb and preposition representations from different video frames, while the spatial decomposition adaptively learns verb and preposition representations from action-related instances in each frame. With these spatial-temporal representations of verbs and prepositions, we can compose new samples for rare classes in a free-form manner, which is not restricted to the rigid form of a verb and a noun. The proposed FFCN can directly generate new training samples for rare classes and hence significantly improves action recognition performance. We evaluated our method on three popular egocentric action recognition datasets, Something-Something V2, H2O, and EPIC-KITCHENS-100, and the experimental results demonstrate the effectiveness of the proposed method for handling data scarcity problems, including long-tailed and few-shot egocentric action recognition.

    Ultrafast Spin-To-Charge Conversion at the Surface of Topological Insulator Thin Films

    Strong spin-orbit coupling, resulting in the formation of spin-momentum-locked surface states, endows topological insulators with superior spin-to-charge conversion characteristics, though the dynamics that govern it have remained elusive. Here, we present an all-optical method that enables unprecedented tracking of the ultrafast dynamics of spin-to-charge conversion in a prototypical topological insulator Bi2Se3/ferromagnetic Co heterostructure, down to the sub-picosecond timescale. Compared to pure Bi2Se3 or Co, we observe a giant terahertz emission in the heterostructure that originates from spin-to-charge conversion, in which the topological surface states play a crucial role. We identify a 0.12-picosecond timescale that sets a technological speed limit of spin-to-charge conversion processes in topological insulators. In addition, we show that the spin-to-charge conversion efficiency is temperature independent in Bi2Se3, as expected from the nature of the surface states, paving the way for designing next-generation high-speed opto-spintronic devices based on topological insulators at room temperature. Comment: 19 pages, 4 figures

    Oxygen-vacancy effect on structural, magnetic, and ferroelectric properties in multiferroic YMnO3 single crystals

    We have investigated the structural, magnetic, and ferroelectric properties of magnetically frustrated multiferroic YMnO3 single crystals. The ferroelectric domain structures of YMnO3 samples were studied by piezoresponse force microscopy. Instead of the domain vortex structure found in stoichiometric crystals, YMnO3-δ exhibits a random domain configuration with straight domain walls. In magnetic measurements, the YMnO3-δ crystal shows typical antiferromagnetic behavior with a higher Néel temperature and lower magnetization than the stoichiometric sample. The ordered oxygen vacancies dominate multiferroicity through tailoring the domain wall structure. © 2012 American Institute of Physics. [doi:10.1063/1.3676000]