107 research outputs found

    MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary

    Document dewarping from a distorted camera-captured image is of great value for OCR and document understanding. The document boundary plays an important role in document dewarping, as it is more evident than the inner region. Current learning-based methods mainly focus on complete boundary cases, leading to poor correction performance on documents with incomplete boundaries. In contrast to these methods, this paper proposes MataDoc, the first method focusing on arbitrary boundary document dewarping with margin and text aware regularizations. Specifically, we design the margin regularization by explicitly considering background consistency to enhance boundary perception. Moreover, we introduce word position consistency to keep text lines straight in rectified document images. To produce a comprehensive evaluation of MataDoc, we propose a novel benchmark ArbDoc, mainly consisting of document images with arbitrary boundaries in four typical scenarios. Extensive experiments confirm the superiority of MataDoc with consideration for the incomplete boundary on ArbDoc and also demonstrate the effectiveness of the proposed method on DocUNet, DIR300, and WarpDoc datasets. Comment: 12 pages

    Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

    In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e.g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR. We present a simple yet effective transformer approach, named Group Pose. We simply regard $K$-keypoint pose estimation as predicting a set of $N\times K$ keypoint positions, each from a keypoint query, as well as representing each pose with an instance query for scoring $N$ pose predictions. Motivated by the intuition that the interaction, among across-instance queries of different types, is not directly helpful, we make a simple modification to decoder self-attention. We replace single self-attention over all the $N\times(K+1)$ queries with two subsequent group self-attentions: (i) $N$ within-instance self-attention, with each over $K$ keypoint queries and one instance query, and (ii) $(K+1)$ same-type across-instance self-attention, each over $N$ queries of the same type. The resulting decoder removes the interaction among across-instance type-different queries, easing the optimization and thus improving the performance. Experimental results on MS COCO and CrowdPose show that our approach without human box supervision is superior to previous methods with complex decoders, and even is slightly better than ED-Pose that uses human box supervision. Paddle (https://github.com/Michel-liu/GroupPose-Paddle) and PyTorch (https://github.com/Michel-liu/GroupPose) code are available. Comment: Accepted by ICCV 202
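    The two group self-attentions described in this abstract can be pictured as two boolean attention masks over the N×(K+1) queries. The following is a minimal sketch, not the authors' implementation: the function name `group_attention_masks` and the query layout (each instance owns K+1 consecutive slots, with slot 0 as the instance query) are assumptions made for illustration.

```python
def group_attention_masks(num_instances, num_keypoints):
    """Build two masks (True = attention allowed) over N*(K+1) queries.

    Assumed layout (hypothetical): instance n owns indices
    n*(K+1) .. n*(K+1)+K, slot 0 being the instance query and
    slots 1..K the keypoint queries.
    """
    per_instance = num_keypoints + 1
    total = num_instances * per_instance
    within = [[False] * total for _ in range(total)]
    across = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            # (i) within-instance: both queries belong to the same pose
            within[i][j] = i // per_instance == j // per_instance
            # (ii) same-type across-instance: both queries occupy the same slot
            across[i][j] = i % per_instance == j % per_instance
    return within, across


# Example with N = 2 instances and K = 3 keypoints (8 queries in total).
within, across = group_attention_masks(num_instances=2, num_keypoints=3)
```

    In this sketch each row of `within` allows K+1 queries and each row of `across` allows N queries; a pair that differs in both instance and type is allowed by neither mask, which is the interaction the abstract says is removed.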

    Learning Structure-Guided Diffusion Model for 2D Human Pose Estimation

    One of the mainstream schemes for 2D human pose estimation (HPE) is learning keypoints heatmaps by a neural network. Existing methods typically improve the quality of heatmaps by customized architectures, such as high-resolution representation and vision Transformers. In this paper, we propose DiffusionPose, a new scheme that formulates 2D HPE as a keypoints heatmaps generation problem from noised heatmaps. During training, the keypoints are diffused to random distribution by adding noises and the diffusion model learns to recover ground-truth heatmaps from noised heatmaps with respect to conditions constructed by image feature. During inference, the diffusion model generates heatmaps from initialized heatmaps in a progressive denoising way. Moreover, we further explore improving the performance of DiffusionPose with conditions from human structural information. Extensive experiments show the prowess of our DiffusionPose, with improvements of 1.6, 1.2, and 1.2 mAP on widely-used COCO, CrowdPose, and AI Challenge datasets, respectively.
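    The training-time step of adding noise to ground-truth heatmaps can be illustrated with the standard DDPM forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The abstract does not state which noise schedule DiffusionPose uses, so the linear schedule, its default values, and the helper names below are assumptions chosen only to make the sketch concrete.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, values = 1.0, []
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        alpha_bar *= 1.0 - beta
        values.append(alpha_bar)
    return values


def noise_heatmap(heatmap, alpha_bar_t, rng):
    """Apply x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps per pixel."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in heatmap]


# Hypothetical 3x3 heatmap with a single keypoint peak in the centre.
clean = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
alpha_bars = linear_alpha_bar(num_steps=1000)
noised = noise_heatmap(clean, alpha_bars[-1], random.Random(0))
unchanged = noise_heatmap(clean, 1.0, random.Random(0))
```

    At the last step almost no signal remains, which corresponds to the "random distribution" the abstract mentions; the denoising network that reverses this process, conditioned on image features, is not shown here.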

    Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment

    Detection transformer (DETR) relies on one-to-one assignment, assigning one ground-truth object to one prediction, for end-to-end detection without NMS post-processing. It is known that one-to-many assignment, assigning one ground-truth object to multiple predictions, succeeds in detection methods such as Faster R-CNN and FCOS. However, the naive one-to-many assignment does not work for DETR, and it remains challenging to apply one-to-many assignment to DETR training. In this paper, we introduce Group DETR, a simple yet efficient DETR training approach that introduces a group-wise way for one-to-many assignment. This approach involves using multiple groups of object queries, conducting one-to-one assignment within each group, and performing decoder self-attention separately. It resembles data augmentation with automatically-learned object query augmentation. It is also equivalent to simultaneously training parameter-sharing networks of the same architecture, introducing more supervision and thus improving DETR training. The inference process is the same as DETR trained normally and only needs one group of queries without any architecture modification. Group DETR is versatile and is applicable to various DETR variants. The experiments show that Group DETR significantly speeds up the training convergence and improves the performance of various DETR-based models. Code will be available at https://github.com/Atten4Vis/GroupDETR. Comment: ICCV23 camera ready version
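    The group-wise scheme in this abstract combines two pieces: self-attention restricted to queries of the same group, and an independent one-to-one assignment inside each group. The sketch below illustrates both under stated assumptions. The function names and cost values are hypothetical, and a brute-force search stands in for the Hungarian matching that DETR-style detectors normally use, purely to keep the example self-contained.

```python
from itertools import permutations


def group_self_attention_mask(num_groups, queries_per_group):
    """Mask (True = attention allowed) keeping attention inside each query group."""
    total = num_groups * queries_per_group
    return [[i // queries_per_group == j // queries_per_group for j in range(total)]
            for i in range(total)]


def one_to_one_match(cost):
    """Brute-force minimum-cost assignment for a tiny cost[query][gt] matrix.

    Returns a tuple whose g-th entry is the query matched to ground truth g.
    """
    num_queries, num_gt = len(cost), len(cost[0])
    return min(permutations(range(num_queries), num_gt),
               key=lambda p: sum(cost[q][g] for g, q in enumerate(p)))


def group_wise_match(costs_per_group):
    """Run one-to-one matching independently inside every query group."""
    return [one_to_one_match(cost) for cost in costs_per_group]


# Hypothetical costs: 2 groups, 3 queries per group, 2 ground-truth objects.
costs = [
    [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]],
    [[0.7, 0.3], [0.6, 0.6], [0.2, 0.9]],
]
mask = group_self_attention_mask(num_groups=2, queries_per_group=3)
matches = group_wise_match(costs)
```

    Each ground-truth object ends up with one positive query per group, so across groups it is assigned to several predictions, while the assignment inside any single group stays one-to-one. At inference only one group would be kept, as the abstract notes.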

    MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

    Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images, which allows us to learn strong visual representations. In contrast to introducing linguistic knowledge with an additional language model, we directly pre-train the sequence decoder. Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language modeling scheme. Significantly, the encoder is frozen during the pre-training phase of the sequence decoder. Experimental results demonstrate that our proposed method achieves superior performance on benchmark datasets, including Chinese and English text images.

    Statins and Thyroid Carcinoma: a Meta-Analysis

    Background/Aims: Experimental studies have reported the antineoplastic effects of statins in thyroid carcinoma; however, observational studies suggested that statins might increase the risk of thyroid carcinoma. Therefore, this study evaluated the antineoplastic effects of statins in both in vitro studies and animal models, as well as the epidemiological evidence. Methods: Databases (PubMed, Cochrane Library, SinoMed, CNKI, Wanfang, and clinical trial registries) were searched. A meta-analysis was performed with sufficiently homogeneous studies. Eighteen articles were included. Results: In in vitro studies, statins showed a concentration-dependent inhibition of cell line growth (weighted mean difference –34.68, 95% confidence interval –36.53 to –32.83). A significant efficacy of statin-induced apoptosis was observed (weighted mean difference [95% confidence interval]: 24 h, 57.50 [55.98–59.03]; 48 h, 23.43 [22.19–24.66]; 72 h, 51.29 [47.52–55.07]). Early apoptosis was increased in a dose- and time-dependent manner. In in vivo antitumor studies, lovastatin inhibited tumor growth, as shown by a reduction in tumor volume. However, two clinical studies showed discordant results from the experimental studies. Conclusion: Experimental studies revealed the antineoplastic efficacy of statins, but statins were associated with thyroid carcinoma in clinical studies. This discrepancy may be due to the different concentrations of statins used and the effects of hyperlipidemia interventions, and thus further study is required.

    Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

    Transformer-based Large Language Models (LLMs) have been applied in diverse areas such as knowledge bases, human interfaces, and dynamic agents, marking a stride towards achieving Artificial General Intelligence (AGI). However, current LLMs are predominantly pretrained on short text snippets, which compromises their effectiveness in processing the long-context prompts that are frequently encountered in practical scenarios. This article offers a comprehensive survey of the recent advancement in Transformer-based LLM architectures aimed at enhancing the long-context capabilities of LLMs throughout the entire model lifecycle, from pre-training through to inference. We first delineate and analyze the problems of handling long-context input and output with the current Transformer-based models. We then provide a taxonomy and the landscape of upgrades on Transformer architecture to solve these problems. Afterwards, we provide an investigation on widely used evaluation necessities tailored for long-context LLMs, including datasets, metrics, and baseline models, as well as optimization toolkits such as libraries, frameworks, and compilers to boost the efficacy of LLMs across different stages in runtime. Finally, we discuss the challenges and potential avenues for future research. A curated repository of relevant literature, continuously updated, is available at https://github.com/Strivin0311/long-llms-learning. Comment: 40 pages, 3 figures, 4 tables

    Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

    We present a strong object detector with encoder-decoder pretraining and finetuning. Our method, called Group DETR v2, is built upon a vision transformer encoder ViT-Huge, a DETR variant DINO, and an efficient DETR training method Group DETR. The training process consists of self-supervised pretraining and finetuning a ViT-Huge encoder on ImageNet-1K, pretraining the detector on Object365, and finally finetuning it on COCO. Group DETR v2 achieves 64.5 mAP on COCO test-dev, and establishes a new SoTA on the COCO leaderboard https://paperswithcode.com/sota/object-detection-on-coco. Comment: Tech report, 3 pages. We establish a new SoTA (64.5 mAP) on the COCO test-dev

    HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

    Model pre-training is essential in human-centric perception. In this paper, we first introduce masked image modeling (MIM) as a pre-training approach for this task. Upon revisiting the MIM training strategy, we reveal that human structure priors offer significant potential. Motivated by this insight, we further incorporate an intuitive human structure prior (human parts) into pre-training. Specifically, we employ this prior to guide the mask sampling process. Image patches corresponding to human part regions have high priority to be masked out. This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks. To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views, guided by the human part prior, to be closely aligned for the same image. We term the entire method HAP. HAP simply uses a plain ViT as the encoder yet establishes new state-of-the-art performance on 11 human-centric benchmarks, and an on-par result on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation. Comment: Accepted by NeurIPS 202
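    The idea of giving human-part patches priority during mask sampling can be sketched as a scored top-k selection. This is only one plausible reading of the abstract, not HAP's actual sampling rule: the function name `part_guided_mask`, the `part_bonus` parameter, and the patch flags in the example are all hypothetical.

```python
import random


def part_guided_mask(is_part_patch, mask_ratio, part_bonus=1.0, seed=0):
    """Pick patch indices to mask, favouring patches flagged as human parts.

    Each patch receives a random score in [0, 1); part patches get an extra
    `part_bonus`. With part_bonus >= 1 every part patch outranks every
    background patch; smaller values give a softer preference.
    """
    rng = random.Random(seed)
    num_masked = int(len(is_part_patch) * mask_ratio)
    scores = [rng.random() + (part_bonus if flag else 0.0) for flag in is_part_patch]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_masked])


# Hypothetical image of 8 patches, where patches 1, 2 and 5 overlap human parts.
flags = [False, True, True, False, False, True, False, False]
masked = part_guided_mask(flags, mask_ratio=0.5)
```

    With these defaults all three part patches are masked and the remaining masked patch is drawn at random from the background, so the visible patches contain no body structure and the model has to reconstruct it.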