MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary
Document dewarping from a distorted camera-captured image is of great value
for OCR and document understanding. The document boundary plays an important
role, being more visually evident than the inner region during dewarping. Current
learning-based methods mainly focus on complete-boundary cases, leading to poor
correction performance on documents with incomplete boundaries. In
contrast to these methods, this paper proposes MataDoc, the first method
focusing on arbitrary boundary document dewarping with margin and text aware
regularizations. Specifically, we design the margin regularization by
explicitly considering background consistency to enhance boundary perception.
Moreover, we introduce word position consistency to keep text lines straight in
rectified document images. To produce a comprehensive evaluation of MataDoc, we
propose a novel benchmark ArbDoc, mainly consisting of document images with
arbitrary boundaries in four typical scenarios. Extensive experiments confirm
the superiority of MataDoc with consideration for the incomplete boundary on
ArbDoc and also demonstrate the effectiveness of the proposed method on
DocUNet, DIR300, and WarpDoc datasets. Comment: 12 pages
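The word position consistency idea above (keeping text lines straight after rectification) can be sketched as a penalty on the vertical spread of word centers along each text line. This is an illustrative reconstruction, not MataDoc's actual loss; the function name and inputs are hypothetical.

```python
import numpy as np

def text_line_straightness_loss(word_centers_per_line):
    """Penalize vertical variance of word centers on each rectified text line.

    word_centers_per_line: list of (n_i, 2) arrays of (x, y) word centers,
    one array per detected text line. Zero when every line is horizontal.
    """
    loss = 0.0
    for centers in word_centers_per_line:
        y = centers[:, 1]
        loss += np.mean((y - y.mean()) ** 2)  # zero for a perfectly straight line
    return loss / len(word_centers_per_line)

# A straight line incurs no penalty; a curved one does.
straight = [np.array([[0.0, 5.0], [3.0, 5.0], [6.0, 5.0]])]
curved = [np.array([[0.0, 4.0], [3.0, 6.0], [6.0, 4.0]])]
```

A dewarping network could add such a term, weighted against the main rectification loss, to discourage residual curvature in text regions.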
Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation
In this paper, we study the problem of end-to-end multi-person pose
estimation. State-of-the-art solutions adopt the DETR-like framework, and
mainly develop the complex decoder, e.g., regarding pose estimation as keypoint
box detection and combining with human detection in ED-Pose, hierarchically
predicting with pose decoder and joint (keypoint) decoder in PETR. We present a
simple yet effective transformer approach, named Group Pose. We simply regard
K-keypoint pose estimation as predicting a set of K keypoint
positions, each from a keypoint query, as well as representing each pose with
an instance query for scoring pose predictions. Motivated by the intuition
that the interaction among across-instance queries of different types is not
directly helpful, we make a simple modification to decoder self-attention. We
replace the single self-attention over all the queries with two subsequent
group self-attentions: (i) within-instance self-attention, each over the
keypoint queries and one instance query of a single instance, and (ii)
same-type across-instance self-attention, each over the queries of the same
type. The resulting decoder removes the interaction among across-instance
type-different queries, easing the optimization and thus improving the
performance. Experimental results on MS COCO and CrowdPose show that our
approach without human box supervision is superior to previous methods with
complex decoders, and even is slightly better than ED-Pose that uses human box
supervision. Code is available. Comment: Accepted by ICCV 2023
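The two group self-attentions can be illustrated as boolean attention masks over the decoder queries, laid out as N blocks of K keypoint queries plus one instance query. This is a sketch of the masking pattern only; sizes, layout, and names are assumptions, not the paper's code.

```python
import numpy as np

# Hypothetical sizes: N pose instances, K keypoints each.
N, K = 3, 4  # total queries = N * (K + 1)

def within_instance_mask(n=N, k=K):
    """Attention allowed only among the K+1 queries of the same instance."""
    q = n * (k + 1)
    mask = np.zeros((q, q), dtype=bool)
    for i in range(n):
        s = i * (k + 1)
        mask[s:s + k + 1, s:s + k + 1] = True
    return mask

def same_type_mask(n=N, k=K):
    """Attention allowed only among queries of the same type (same slot)."""
    q = n * (k + 1)
    mask = np.zeros((q, q), dtype=bool)
    for t in range(k + 1):
        idx = [i * (k + 1) + t for i in range(n)]
        mask[np.ix_(idx, idx)] = True
    return mask

m1, m2 = within_instance_mask(), same_type_mask()
# Cross-instance, type-different query pairs are excluded from both attentions.
```

Applying the two masked self-attentions in sequence reproduces the decoder modification the abstract describes: no interaction between queries that differ in both instance and type.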
Learning Structure-Guided Diffusion Model for 2D Human Pose Estimation
One of the mainstream schemes for 2D human pose estimation (HPE) is learning
keypoint heatmaps with a neural network. Existing methods typically improve the
quality of heatmaps by customized architectures, such as high-resolution
representation and vision Transformers. In this paper, we propose
\textbf{DiffusionPose}, a new scheme that formulates 2D HPE as a keypoint
heatmap generation problem from noised heatmaps. During training, the
keypoint heatmaps are diffused toward a random distribution by adding noise,
and the diffusion model learns to recover the ground-truth heatmaps from the
noised heatmaps, conditioned on image features. During inference, the
diffusion model generates heatmaps from initialized heatmaps in a progressive
denoising way. Moreover, we further explore improving the performance of
DiffusionPose with conditions from human structural information. Extensive
experiments show the prowess of our DiffusionPose, with improvements of 1.6,
1.2, and 1.2 mAP on widely-used COCO, CrowdPose, and AI Challenge datasets,
respectively.
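The training-time corruption of heatmaps can be sketched with a standard DDPM-style forward process: a ground-truth keypoint heatmap is progressively noised, and a network would be trained to recover it conditioned on image features. The linear schedule and step count below are common defaults, assumed here rather than taken from the paper.

```python
import numpy as np

# Standard DDPM forward-process quantities (assumed linear beta schedule).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

def q_sample(heatmap, t, rng):
    """Sample a noised heatmap x_t from the clean heatmap x_0 at step t."""
    noise = rng.normal(size=heatmap.shape)
    return np.sqrt(alphas_bar[t]) * heatmap + np.sqrt(1 - alphas_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = np.zeros((64, 48))
x0[20, 24] = 1.0          # a toy one-keypoint heatmap
xT = q_sample(x0, T - 1, rng)  # at the last step, nearly pure Gaussian noise
```

At inference, the reverse direction applies: starting from initialized noisy heatmaps, the model denoises step by step toward clean keypoint heatmaps, as the abstract describes.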
Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment
Detection transformer (DETR) relies on one-to-one assignment, assigning one
ground-truth object to one prediction, for end-to-end detection without NMS
post-processing. It is known that one-to-many assignment, assigning one
ground-truth object to multiple predictions, succeeds in detection methods such
as Faster R-CNN and FCOS. However, naive one-to-many assignment does not work
for DETR, and it remains challenging to apply one-to-many assignment to DETR
training. In this paper, we introduce Group DETR, a simple yet efficient DETR
training approach that introduces a group-wise way for one-to-many assignment.
This approach involves using multiple groups of object queries, conducting
one-to-one assignment within each group, and performing decoder self-attention
separately. It resembles data augmentation with automatically-learned object
query augmentation. It is also equivalent to simultaneously training
parameter-sharing networks of the same architecture, introducing more
supervision and thus improving DETR training. The inference process is the same
as DETR trained normally and only needs one group of queries without any
architecture modification. Group DETR is versatile and is applicable to various
DETR variants. The experiments show that Group DETR significantly speeds up the
training convergence and improves the performance of various DETR-based models.
Code will be available at \url{https://github.com/Atten4Vis/GroupDETR}. Comment: ICCV23 camera ready version
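The group-wise scheme (multiple query groups, self-attention restricted to each group during training, one group kept at inference) can be sketched with a block-diagonal attention mask. Group count, query count, and dimensions below are illustrative, not values from the paper.

```python
import numpy as np

# Hypothetical sketch: G groups of M object queries each. During training,
# each group gets its own one-to-one assignment against the ground truth,
# and decoder self-attention is restricted to queries of the same group.
G, M, D = 3, 5, 8
rng = np.random.default_rng(0)
queries = rng.normal(size=(G * M, D))  # learned object queries, all groups

def group_attn_mask(g=G, m=M):
    """Block-diagonal mask: queries attend only within their own group."""
    mask = np.zeros((g * m, g * m), dtype=bool)
    for i in range(g):
        mask[i * m:(i + 1) * m, i * m:(i + 1) * m] = True
    return mask

mask = group_attn_mask()
# At inference, only the first group of M queries is kept; the architecture
# itself is unchanged, matching standard DETR inference.
inference_queries = queries[:M]
```

Because each group is trained with its own one-to-one assignment, the ground truth is matched G times per image, which is the extra supervision the abstract credits for faster convergence.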
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
Text images contain both visual and linguistic information. However, existing
pre-training techniques for text recognition mainly focus on either visual
representation learning or linguistic knowledge learning. In this paper, we
propose a novel approach MaskOCR to unify vision and language pre-training in
the classical encoder-decoder recognition framework. We adopt the masked image
modeling approach to pre-train the feature encoder using a large set of
unlabeled real text images, which allows us to learn strong visual
representations. In contrast to introducing linguistic knowledge with an
additional language model, we directly pre-train the sequence decoder.
Specifically, we transform text data into synthesized text images to unify the
data modalities of vision and language, and enhance the language modeling
capability of the sequence decoder using a proposed masked image-language
modeling scheme. Significantly, the encoder is frozen during the pre-training
phase of the sequence decoder. Experimental results demonstrate that our
proposed method achieves superior performance on benchmark datasets, including
Chinese and English text images.
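The masked image modeling step used to pre-train the feature encoder can be sketched as random patch masking on a text-line image; the encoder would then be trained to reconstruct the hidden patches. Patch size and mask ratio here are assumptions, not MaskOCR's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(image, patch=4, ratio=0.6, rng=rng):
    """Zero out a random subset of non-overlapping patches of a 2D image.

    Returns the masked image and the flat indices of masked patches, which
    a reconstruction loss would be computed over.
    """
    h, w = image.shape
    gh, gw = h // patch, w // patch
    n = gh * gw
    masked_idx = rng.choice(n, size=int(ratio * n), replace=False)
    out = image.copy()
    for idx in masked_idx:
        r, c = divmod(idx, gw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return out, masked_idx

img = rng.random((32, 128))  # stand-in for a grayscale text-line image
masked, idx = mask_patches(img)
```

The same masking machinery, applied to synthesized text images with character-level targets, gestures at the masked image-language modeling used for the sequence decoder.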
Statins and Thyroid Carcinoma: a Meta-Analysis
Background/Aims: Experimental studies have reported antineoplastic effects of statins in thyroid carcinoma; however, observational studies suggest that statins might increase the risk of thyroid carcinoma. Therefore, this study evaluated the antineoplastic effects of statins in both in vitro studies and animal models, as well as the epidemiological evidence. Methods: Databases (PubMed, Cochrane Library, SinoMed, CNKI, Wanfang, and clinical trial registries) were searched. A meta-analysis was performed with sufficiently homogeneous studies. Eighteen articles were included. Results: In in vitro studies, statins showed a concentration-dependent inhibition of cell line growth (weighted mean difference –34.68, 95% confidence interval –36.53 to –32.83). A significant efficacy of statin-induced apoptosis was observed (weighted mean difference [95% confidence interval]: 24 h, 57.50 [55.98–59.03]; 48 h, 23.43 [22.19–24.66]; 72 h, 51.29 [47.52–55.07]). Early apoptosis increased in a dose- and time-dependent manner. In in vivo antitumor studies, lovastatin inhibited tumor growth, as shown by a reduction in tumor volume. However, two clinical studies showed results discordant with the experimental studies. Conclusion: Experimental studies revealed the antineoplastic efficacy of statins, but statins were associated with thyroid carcinoma in clinical studies. This discrepancy may be due to the different concentrations of statins used and the effects of hyperlipidemia interventions, and thus further study is required.
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
Transformer-based Large Language Models (LLMs) have been applied in diverse
areas such as knowledge bases, human interfaces, and dynamic agents,
marking a stride toward achieving Artificial General Intelligence (AGI).
However, current LLMs are predominantly pretrained on short text snippets,
which compromises their effectiveness in processing the long-context prompts
that are frequently encountered in practical scenarios. This article offers a
comprehensive survey of the recent advancement in Transformer-based LLM
architectures aimed at enhancing the long-context capabilities of LLMs
throughout the entire model lifecycle, from pre-training through to inference.
We first delineate and analyze the problems of handling long-context input and
output with the current Transformer-based models. We then provide a taxonomy
and the landscape of upgrades on Transformer architecture to solve these
problems. Afterwards, we investigate widely used evaluation
necessities tailored for long-context LLMs, including datasets, metrics, and
baseline models, as well as optimization toolkits such as libraries,
frameworks, and compilers that boost the efficacy of LLMs at different
runtime stages. Finally, we discuss the challenges and potential avenues for future
research. A curated repository of relevant literature, continuously updated, is
available at https://github.com/Strivin0311/long-llms-learning. Comment: 40 pages, 3 figures, 4 tables
Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining
We present a strong object detector with encoder-decoder pretraining and
finetuning. Our method, called Group DETR v2, is built upon a vision
transformer encoder ViT-Huge~\cite{dosovitskiy2020image}, a DETR variant
DINO~\cite{zhang2022dino}, and an efficient DETR training method Group
DETR~\cite{chen2022group}. The training process consists of self-supervised
pretraining and finetuning a ViT-Huge encoder on ImageNet-1K, pretraining the
detector on Object365, and finally finetuning it on COCO. Group DETR v2
achieves 64.5 mAP on COCO test-dev and establishes a new SoTA on
the COCO leaderboard (https://paperswithcode.com/sota/object-detection-on-coco). Comment: Tech report, 3 pages. We establish a new SoTA (64.5 mAP) on the
COCO test-dev.
HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception
Model pre-training is essential in human-centric perception. In this paper,
we first introduce masked image modeling (MIM) as a pre-training approach for
this task. Upon revisiting the MIM training strategy, we reveal that human
structure priors offer significant potential. Motivated by this insight, we
further incorporate an intuitive human structure prior - human parts - into
pre-training. Specifically, we employ this prior to guide the mask sampling
process. Image patches, corresponding to human part regions, have high priority
to be masked out. This encourages the model to concentrate more on body
structure information during pre-training, yielding substantial benefits across
a range of human-centric perception tasks. To further capture human
characteristics, we propose a structure-invariant alignment loss that enforces
different masked views, guided by the human part prior, to be closely aligned
for the same image. We term the entire method as HAP. HAP simply uses a plain
ViT as the encoder yet establishes new state-of-the-art performance on 11
human-centric benchmarks, and an on-par result on one dataset. For example, HAP
achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K
for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose
estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation. Comment: Accepted by NeurIPS 2023
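The part-guided mask sampling can be sketched as weighted patch sampling that favors patches overlapping human-part regions, so body structure is hidden (and must be reconstructed) more often than background. The bias factor, grid size, and mask ratio are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mask(part_map, mask_ratio=0.75, part_bias=4.0, rng=rng):
    """Sample a patch mask biased toward human-part patches.

    part_map: boolean array over the patch grid, True where a human part lies.
    Returns a boolean mask of the same shape; True means the patch is masked.
    """
    n = part_map.size
    n_mask = int(round(mask_ratio * n))
    # Higher sampling weight for part patches -> masked out with priority.
    weights = np.where(part_map.ravel(), part_bias, 1.0)
    weights /= weights.sum()
    chosen = rng.choice(n, size=n_mask, replace=False, p=weights)
    mask = np.zeros(n, dtype=bool)
    mask[chosen] = True
    return mask.reshape(part_map.shape)

parts = np.zeros((14, 14), dtype=bool)  # 14x14 patch grid, as for a ViT
parts[4:10, 5:9] = True                 # pretend a torso occupies these patches
mask = sample_mask(parts)
```

Feeding two such differently masked views of the same image to the encoder, and pulling their representations together, is the shape of the structure-invariant alignment loss the abstract mentions.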