12,210 research outputs found

    Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification

    Full text link
    Generalizable person re-identification (Re-ID) is a very hot research topic in machine learning and computer vision, which plays a significant role in realistic scenarios due to its various applications in public security and video surveillance. However, previous methods mainly focus on the visual representation learning, while neglect to explore the potential of semantic features during training, which easily leads to poor generalization capability when adapted to the new domain. In this paper, we propose a Multi-Modal Equivalent Transformer called MMET for more robust visual-semantic embedding learning on visual, textual and visual-textual tasks respectively. To further enhance the robust feature learning in the context of transformer, a dynamic masking mechanism called Masked Multimodal Modeling strategy (MMM) is introduced to mask both the image patches and the text tokens, which can jointly works on multimodal or unimodal data and significantly boost the performance of generalizable person Re-ID. Extensive experiments on benchmark datasets demonstrate the competitive performance of our method over previous approaches. We hope this method could advance the research towards visual-semantic representation learning. Our source code is also publicly available at https://github.com/JeremyXSC/MMET

    ADS_UNet: A Nested UNet for Histopathology Image Segmentation

    Full text link
    The UNet model consists of fully convolutional network (FCN) layers arranged as contracting encoder and upsampling decoder maps. Nested arrangements of these encoder and decoder maps give rise to extensions of the UNet model, such as UNete and UNet++. Other refinements include constraining the outputs of the convolutional layers to discriminate between segment labels when trained end to end, a property called deep supervision. This reduces feature diversity in these nested UNet models despite their large parameter space. Furthermore, for texture segmentation, pixel correlations at multiple scales contribute to the classification task; hence, explicit deep supervision of shallower layers is likely to enhance performance. In this paper, we propose ADS UNet, a stage-wise additive training algorithm that incorporates resource-efficient deep supervision in shallower layers and takes performance-weighted combinations of the sub-UNets to create the segmentation model. We provide empirical evidence on three histopathology datasets to support the claim that the proposed ADS UNet reduces correlations between constituent features and improves performance while being more resource efficient. We demonstrate that ADS_UNet outperforms state-of-the-art Transformer-based models by 1.08 and 0.6 points on CRAG and BCSS datasets, and yet requires only 37% of GPU consumption and 34% of training time as that required by Transformers.Comment: To be published in Expert Systems With Application

    ShakingBot: Dynamic Manipulation for Bagging

    Full text link
    Bag manipulation through robots is complex and challenging due to the deformability of the bag. Based on dynamic manipulation strategy, we propose a new framework, ShakingBot, for the bagging tasks. ShakingBot utilizes a perception module to identify the key region of the plastic bag from arbitrary initial configurations. According to the segmentation, ShakingBot iteratively executes a novel set of actions, including Bag Adjustment, Dual-arm Shaking, and One-arm Holding, to open the bag. The dynamic action, Dual-arm Shaking, can effectively open the bag without the need to account for the crumpled configuration.Then, we insert the items and lift the bag for transport. We perform our method on a dual-arm robot and achieve a success rate of 21/33 for inserting at least one item across various initial bag configurations. In this work, we demonstrate the performance of dynamic shaking actions compared to the quasi-static manipulation in the bagging task. We also show that our method generalizes to variations despite the bag's size, pattern, and color.Comment: Manipulating bag through robots to baggin

    Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention

    Full text link
    Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors mounted at different locations to monitor the driver and the vehicle's interior scene and employ decision-level fusion to integrate these heterogenous data. However, this fusion method may not fully utilize the complementarity of different data sources and may overlook their relative importance. To address these limitations, we propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA). We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF). We also present a novel GPU-friendly supervised contrastive learning framework SuMoCo to learn better representations. Furthermore, We fine-grained the test split of the DAD dataset to enable the multi-class recognition of drivers' activities. Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC: 97.0\%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses. The code and annotations are publicly available.Comment: 9 pages (1 for reference); accepted by the 6th Multimodal Learning and Applications Workshop (MULA) at CVPR 202

    BotMoE: Twitter Bot Detection with Community-Aware Mixtures of Modal-Specific Experts

    Full text link
    Twitter bot detection has become a crucial task in efforts to combat online misinformation, mitigate election interference, and curb malicious propaganda. However, advanced Twitter bots often attempt to mimic the characteristics of genuine users through feature manipulation and disguise themselves to fit in diverse user communities, posing challenges for existing Twitter bot detection models. To this end, we propose BotMoE, a Twitter bot detection framework that jointly utilizes multiple user information modalities (metadata, textual content, network structure) to improve the detection of deceptive bots. Furthermore, BotMoE incorporates a community-aware Mixture-of-Experts (MoE) layer to improve domain generalization and adapt to different Twitter communities. Specifically, BotMoE constructs modal-specific encoders for metadata features, textual content, and graphical structure, which jointly model Twitter users from three modal-specific perspectives. We then employ a community-aware MoE layer to automatically assign users to different communities and leverage the corresponding expert networks. Finally, user representations from metadata, text, and graph perspectives are fused with an expert fusion layer, combining all three modalities while measuring the consistency of user information. Extensive experiments demonstrate that BotMoE significantly advances the state-of-the-art on three Twitter bot detection benchmarks. Studies also confirm that BotMoE captures advanced and evasive bots, alleviates the reliance on training data, and better generalizes to new and previously unseen user communities.Comment: Accepted at SIGIR 202

    Self-Supervised Learning to Prove Equivalence Between Straight-Line Programs via Rewrite Rules

    Full text link
    We target the problem of automatically synthesizing proofs of semantic equivalence between two programs made of sequences of statements. We represent programs using abstract syntax trees (AST), where a given set of semantics-preserving rewrite rules can be applied on a specific AST pattern to generate a transformed and semantically equivalent program. In our system, two programs are equivalent if there exists a sequence of application of these rewrite rules that leads to rewriting one program into the other. We propose a neural network architecture based on a transformer model to generate proofs of equivalence between program pairs. The system outputs a sequence of rewrites, and the validity of the sequence is simply checked by verifying it can be applied. If no valid sequence is produced by the neural network, the system reports the programs as non-equivalent, ensuring by design no programs may be incorrectly reported as equivalent. Our system is fully implemented for a given grammar which can represent straight-line programs with function calls and multiple types. To efficiently train the system to generate such sequences, we develop an original incremental training technique, named self-supervised sample selection. We extensively study the effectiveness of this novel training approach on proofs of increasing complexity and length. Our system, S4Eq, achieves 97% proof success on a curated dataset of 10,000 pairs of equivalent programsComment: 30 pages including appendi

    ARA-net: an attention-aware retinal atrophy segmentation network coping with fundus images

    Get PDF
    BackgroundAccurately detecting and segmenting areas of retinal atrophy are paramount for early medical intervention in pathological myopia (PM). However, segmenting retinal atrophic areas based on a two-dimensional (2D) fundus image poses several challenges, such as blurred boundaries, irregular shapes, and size variation. To overcome these challenges, we have proposed an attention-aware retinal atrophy segmentation network (ARA-Net) to segment retinal atrophy areas from the 2D fundus image.MethodsIn particular, the ARA-Net adopts a similar strategy as UNet to perform the area segmentation. Skip self-attention connection (SSA) block, comprising a shortcut and a parallel polarized self-attention (PPSA) block, has been proposed to deal with the challenges of blurred boundaries and irregular shapes of the retinal atrophic region. Further, we have proposed a multi-scale feature flow (MSFF) to challenge the size variation. We have added the flow between the SSA connection blocks, allowing for capturing considerable semantic information to detect retinal atrophy in various area sizes.ResultsThe proposed method has been validated on the Pathological Myopia (PALM) dataset. Experimental results demonstrate that our method yields a high dice coefficient (DICE) of 84.26%, Jaccard index (JAC) of 72.80%, and F1-score of 84.57%, which outperforms other methods significantly.ConclusionOur results have demonstrated that ARA-Net is an effective and efficient approach for retinal atrophic area segmentation in PM

    Copy-paste data augmentation for domain transfer on traffic signs

    Get PDF
    City streets carry a lot of information that can be exploited to improve the quality of the services the citizens receive. For example, autonomous vehicles need to act accordingly to all the element that are nearby the vehicle itself, like pedestrians, traffic signs and other vehicles. It is also possible to use such information for smart city applications, for example to predict and analyze the traffic or pedestrian flows. Among all the objects that it is possible to find in a street, traffic signs are very important because of the information they carry. This information can in fact be exploited both for autonomous driving and for smart city applications. Deep learning and, more generally, machine learning models however need huge quantities to learn. Even though modern models are very good at gener- alizing, the more samples the model has, the better it can generalize between different samples. Creating these datasets organically, namely with real pictures, is a very tedious task because of the wide variety of signs available in the whole world and especially because of all the possible light, orientation conditions and con- ditions in general in which they can appear. In addition to that, it may not be easy to collect enough samples for all the possible traffic signs available, cause some of them may be very rare to find. Instead of collecting pictures manually, it is possible to exploit data aug- mentation techniques to create synthetic datasets containing the signs that are needed. Creating this data synthetically allows to control the distribution and the conditions of the signs in the datasets, improving the quality and quantity of training data that is going to be used. This thesis work is about using copy-paste data augmentation to create synthetic data for the traffic sign recognition task

    Identifying Student Profiles Within Online Judge Systems Using Explainable Artificial Intelligence

    Get PDF
    Online Judge (OJ) systems are typically considered within programming-related courses as they yield fast and objective assessments of the code developed by the students. Such an evaluation generally provides a single decision based on a rubric, most commonly whether the submission successfully accomplished the assignment. Nevertheless, since in an educational context such information may be deemed insufficient, it would be beneficial for both the student and the instructor to receive additional feedback about the overall development of the task. This work aims to tackle this limitation by considering the further exploitation of the information gathered by the OJ and automatically inferring feedback for both the student and the instructor. More precisely, we consider the use of learning-based schemes—particularly, Multi-Instance Learning and classical Machine Learning formulations—to model student behaviour. Besides, Explainable Artificial Intelligence is contemplated to provide human-understandable feedback. The proposal has been evaluated considering a case of study comprising 2,500 submissions from roughly 90 different students from a programming-related course in a Computer Science degree. The results obtained validate the proposal: the model is capable of significantly predicting the user outcome (either passing or failing the assignment) solely based on the behavioural pattern inferred by the submissions provided to the OJ. Moreover, the proposal is able to identify prone-to-fail student groups and profiles as well as other relevant information, which eventually serves as feedback to both the student and the instructor.This work has been partially funded by the “Programa Redes-I3CE de investigacion en docencia universitaria del Instituto de Ciencias de la Educacion (REDES-I3CE-2020-5069)” of the University of Alicante. The third author is supported by grant APOSTD/2020/256 from “Programa I+D+I de la Generalitat Valenciana”

    Neural Architecture Search: Insights from 1000 Papers

    Full text link
    In the past decade, advances in deep learning have resulted in breakthroughs in a variety of areas, including computer vision, natural language understanding, speech recognition, and reinforcement learning. Specialized, high-performing neural architectures are crucial to the success of deep learning in these areas. Neural architecture search (NAS), the process of automating the design of neural architectures for a given task, is an inevitable next step in automating machine learning and has already outpaced the best human-designed architectures on many tasks. In the past few years, research in NAS has been progressing rapidly, with over 1000 papers released since 2020 (Deng and Lindauer, 2021). In this survey, we provide an organized and comprehensive guide to neural architecture search. We give a taxonomy of search spaces, algorithms, and speedup techniques, and we discuss resources such as benchmarks, best practices, other surveys, and open-source libraries
    • …
    corecore