Progressive Scene Text Erasing with Self-Supervision
Scene text erasing seeks to erase text content from scene images, and current
state-of-the-art text erasing models are trained on large-scale synthetic data.
Although synthetic data engines can provide vast amounts of annotated training
samples, there are domain differences between synthetic and real-world data. In
this paper, we employ self-supervision to learn feature representations on
unlabeled real-world scene text images. A novel pretext task is designed to
enforce consistency among the text stroke masks of image variants. We design a
Progressive Erasing Network to remove residual text: the scene text is erased
progressively by leveraging the intermediate results, which provide the
foundation for subsequent, higher-quality outputs. Experiments show that our
method significantly improves the generalization of the text erasing task and
achieves state-of-the-art performance on public benchmarks.
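The progressive erasing idea above, where each intermediate result feeds the next pass so residual text can be removed, can be sketched as a simple refinement loop. Everything here is an illustrative stand-in: `toy_erase_step` plays the role of one network pass, and an image is modeled as a list of pixel values.

```python
# Sketch of progressive erasing: feed each intermediate result back in
# so later passes can remove residual text left by earlier passes.
# "erase_step" stands in for one pass of the erasing network.

def progressive_erase(image, erase_step, n_stages=3):
    """Apply the erasing step repeatedly; each pass refines the last."""
    result = image
    for _ in range(n_stages):
        result = erase_step(result)  # later stages see earlier outputs
    return result

def toy_erase_step(image, background=0.0):
    # Toy stand-in: pull every pixel halfway toward the background,
    # mimicking partial text removal that leaves residuals behind.
    return [background + 0.5 * (p - background) for p in image]
```

For example, a "text pixel" at 1.0 is attenuated to 0.5, then 0.25, then 0.125 over three stages, which is why later stages matter for residuals.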
FairBench: A Four-Stage Automatic Framework for Detecting Stereotypes and Biases in Large Language Models
Detecting stereotypes and biases in Large Language Models (LLMs) can enhance
fairness and reduce adverse impacts on individuals or groups when these LLMs
are applied. However, the majority of existing methods focus on measuring the
model's preference towards sentences containing biases and stereotypes within
datasets, which lacks interpretability and cannot detect implicit biases and
stereotypes in the real world. To address this gap, this paper introduces a
four-stage framework to directly evaluate stereotypes and biases in the
generated content of LLMs, including direct inquiry testing, serial or adapted
story testing, implicit association testing, and unknown situation testing.
Additionally, the paper proposes multi-dimensional evaluation metrics and
explainable zero-shot prompts for automated evaluation. Using the education
sector as a case study, we constructed the Edu-FairBench based on the
four-stage framework, which encompasses 12,632 open-ended questions covering
nine sensitive factors and 26 educational scenarios. Experimental results
reveal varying degrees of stereotypes and biases in five LLMs evaluated on
Edu-FairBench. Moreover, the results of our proposed automated evaluation
method show a high correlation with human annotations.
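The four-stage evaluation described above can be sketched as a loop over test stages with an explainable zero-shot judging prompt. The stage names come from the abstract; the prompt wording, the 0-4 scale, and all function names are hypothetical, not the paper's actual prompts or interfaces.

```python
# Sketch of a four-stage bias evaluation loop. Stage names are from the
# abstract; the prompt template and scoring scale are placeholders.

STAGES = [
    "direct inquiry testing",
    "serial or adapted story testing",
    "implicit association testing",
    "unknown situation testing",
]

def build_eval_prompt(stage, sensitive_factor, scenario, response):
    """Explainable zero-shot prompt asking a judge model to rate bias."""
    return (
        f"Stage: {stage}\n"
        f"Sensitive factor: {sensitive_factor}\n"
        f"Scenario: {scenario}\n"
        f"Model response: {response}\n"
        "Rate the degree of stereotype/bias from 0 (none) to 4 (severe) "
        "and explain your rating."
    )

def run_four_stage_eval(model_respond, judge_score, factor, scenario, question):
    """Query the evaluated model at every stage and collect judge scores."""
    scores = {}
    for stage in STAGES:
        answer = model_respond(stage, question)
        prompt = build_eval_prompt(stage, factor, scenario, answer)
        scores[stage] = judge_score(prompt)
    return scores
```

In practice `model_respond` would call the LLM under test and `judge_score` would call a judge model with the explainable prompt; here they are injected as callables so the loop itself stays model-agnostic.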
DDT: Dual-branch Deformable Transformer for Image Denoising
The transformer is beneficial for image denoising tasks since it can model
long-range dependencies, overcoming the limitations imposed by convolutional
inductive biases. However, directly applying the transformer structure to
remove noise is challenging because its complexity grows quadratically with the
spatial resolution. In this paper, we propose an efficient Dual-branch
Deformable Transformer (DDT) denoising network which captures both local and
global interactions in parallel. We divide features with a fixed patch size and
a fixed number of patches in local and global branches, respectively. In
addition, we apply deformable attention operation in both branches, which helps
the network focus on more important regions and further reduces computational
complexity. We conduct extensive experiments on real-world and synthetic
denoising tasks, and the proposed DDT achieves state-of-the-art performance at
a significantly lower computational cost.
Comment: The code is available at: https://github.com/Merenguelkl/DD
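The dual-branch partitioning above can be illustrated by contrasting the two schemes: the local branch fixes the patch size (so the number of windows grows with resolution), while the global branch fixes the number of patches (so the window size grows with resolution). The window shapes below are illustrative defaults, not the paper's actual configuration.

```python
# Sketch of the two partitioning schemes in DDT's branches. Attention is
# computed within each window, so both branches avoid the quadratic cost
# of full-resolution self-attention.

def local_partition(h, w, patch=8):
    """Fixed patch size: window count grows with resolution.
    Returns (number of windows, window shape)."""
    return (h // patch) * (w // patch), (patch, patch)

def global_partition(h, w, n_per_side=4):
    """Fixed patch count: window size grows with resolution.
    Returns (number of windows, window shape)."""
    return n_per_side * n_per_side, (h // n_per_side, w // n_per_side)
```

On a 64x64 feature map the local branch yields 64 windows of 8x8 (capturing local interactions), while the global branch yields 16 windows of 16x16 whose larger extent captures global interactions; running both in parallel is what gives the dual-branch design its local-plus-global coverage.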
DCQA: Document-Level Chart Question Answering towards Complex Reasoning and Common-Sense Understanding
Visually-situated languages such as charts and plots are omnipresent in
real-world documents. These graphical depictions are human-readable and are
often analyzed in visually-rich documents to address a variety of questions
that necessitate complex reasoning and common-sense responses. Despite the
growing number of datasets that aim to answer questions over charts, most only
address this task in isolation, without considering the broader context of
document-level question answering. Moreover, such datasets lack adequate
common-sense reasoning information in their questions. In this work, we
introduce a novel task named document-level chart question answering (DCQA).
The goal of this task is to conduct document-level question answering,
extracting charts or plots in the document via document layout analysis (DLA)
first and subsequently performing chart question answering (CQA). The newly
developed benchmark dataset comprises 50,010 synthetic documents integrating
charts in a wide range of styles (6 styles in contrast to 3 for PlotQA and
ChartQA) and includes 699,051 questions that demand a high degree of reasoning
ability and common-sense understanding. Besides, we present the development of
a potent question-answer generation engine that employs table data, a rich
color set, and basic question templates to produce a vast array of reasoning
question-answer pairs automatically. Based on DCQA, we devise an OCR-free
transformer for document-level chart-oriented understanding, capable of DLA and
answering complex reasoning and common-sense questions over charts in an
OCR-free manner. Our DCQA dataset is expected to foster research on
understanding visualizations in documents, especially for scenarios that
require complex reasoning over charts in visually-rich documents. We
implement and evaluate a set of baselines, and our proposed method achieves
comparable results.
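The question-answer generation engine described above, which instantiates basic question templates over table data, can be sketched as follows. The table format (a label-to-value mapping) and the three templates are illustrative, not the engine's actual templates.

```python
# Sketch of a template-driven QA generation engine: walk the table data
# behind a chart and instantiate question templates into QA pairs,
# including simple reasoning questions (argmax, aggregation).

def generate_qa_pairs(table):
    """table: dict mapping a category label to its numeric value."""
    pairs = []
    for label, value in table.items():
        # Extractive template: read a single value off the chart.
        pairs.append((f"What is the value of {label}?", value))
    # Reasoning templates: require comparing or combining values.
    top = max(table, key=table.get)
    pairs.append(("Which category has the highest value?", top))
    pairs.append(("What is the sum of all values?", sum(table.values())))
    return pairs
```

Scaling this idea up with many templates, varied phrasings, and a rich color/style set is how such an engine can produce hundreds of thousands of question-answer pairs automatically from synthetic documents.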
LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion
LiDAR-camera fusion methods have shown impressive performance in 3D object
detection. Recent advanced multi-modal methods mainly perform global fusion,
where image features and point cloud features are fused across the whole scene.
Such practice lacks fine-grained region-level information, yielding suboptimal
fusion performance. In this paper, we present the novel Local-to-Global fusion
network (LoGoNet), which performs LiDAR-camera fusion at both local and global
levels. Concretely, the Global Fusion (GoF) of LoGoNet is built upon previous
literature, while we exclusively use point centroids to more precisely
represent the position of voxel features, thus achieving better cross-modal
alignment. As to the Local Fusion (LoF), we first divide each proposal into
uniform grids and then project these grid centers to the images. The image
features around the projected grid points are sampled to be fused with
position-decorated point cloud features, maximally utilizing the rich
contextual information around the proposals. The Feature Dynamic Aggregation
(FDA) module is further proposed to achieve information interaction between
these locally and globally fused features, thus producing more informative
multi-modal features. Extensive experiments on both Waymo Open Dataset (WOD)
and KITTI datasets show that LoGoNet outperforms all state-of-the-art 3D
detection methods. Notably, LoGoNet ranks 1st on Waymo 3D object detection
leaderboard and obtains 81.02 mAPH (L2) detection performance. It is noteworthy
that, for the first time, the detection performance on three classes surpasses
80 APH (L2) simultaneously. Code will be available at
\url{https://github.com/sankin97/LoGoNet}.
Comment: Accepted by CVPR202
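The Local Fusion (LoF) step described above, dividing a proposal box into a uniform grid and projecting the grid centers onto the image for feature sampling, can be sketched in simplified form. The axis-aligned box, the pinhole projection, and the camera intrinsics used here are placeholder simplifications of the real proposal boxes and calibrated projection.

```python
# Sketch of Local Fusion: uniform grid centers inside a 3D proposal box
# are projected to pixel coordinates, where image features would then be
# sampled and fused with the point cloud features.

def grid_centers(box_min, box_max, n=2):
    """Uniform n x n x n grid centers inside an axis-aligned 3D box."""
    (x0, y0, z0), (x1, y1, z1) = box_min, box_max
    def step(a, b, i):
        # Center of the i-th of n equal slices along [a, b].
        return a + (b - a) * (i + 0.5) / n
    return [(step(x0, x1, i), step(y0, y1, j), step(z0, z1, k))
            for i in range(n) for j in range(n) for k in range(n)]

def project_to_image(point, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame 3D point to pixel coords."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)
```

In the full method, image features around each projected grid point are sampled and fused with position-decorated point cloud features; the grid keeps the fusion region-level rather than scene-level.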
EmoStyle: Emotion-Aware Semantic Image Manipulation with Audio Guidance
With the flourishing development of generative models, image manipulation is receiving increasing attention. Beyond the text modality, several elegant designs have explored leveraging audio to manipulate images. However, existing methodologies mainly focus on image generation conditioned on semantic alignment, ignoring the vivid affective information conveyed in the audio. We propose an Emotion-aware StyleGAN Manipulator (EmoStyle), a framework in which affective information from audio is explicitly extracted and further utilized during image manipulation. Specifically, we first leverage the multi-modality model ImageBind for initial cross-modal retrieval between images and music, and select the music-related image for further manipulation. Simultaneously, by extracting sentiment polarity from the lyrics of the audio, we generate an emotionally rich auxiliary music branch to accentuate the affective information. We then leverage pre-trained encoders to encode the audio and the audio-related image into the same embedding space. With the aligned embeddings, we manipulate the image via a direct latent optimization method. We conduct objective and subjective evaluations on the generated images, and our results show that our framework is capable of generating images with the specified human emotions conveyed in the audio.
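Two steps of the pipeline above, cross-modal retrieval by embedding similarity and direct latent optimization toward the audio embedding, can be sketched with toy vectors. The embeddings here are plain lists standing in for real ImageBind/encoder outputs, and the single update step is a simplified stand-in for the actual latent optimization; the generator itself is out of scope.

```python
# Sketch of (1) ImageBind-style cross-modal retrieval by cosine
# similarity in a shared embedding space, and (2) one step of direct
# latent optimization pulling a latent toward the audio embedding.

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def retrieve_image(audio_emb, image_embs):
    """Index of the image whose embedding best matches the audio."""
    return max(range(len(image_embs)),
               key=lambda i: cosine(audio_emb, image_embs[i]))

def latent_step(latent, target_emb, lr=0.1):
    """One gradient-like step nudging the latent toward the target."""
    return [w + lr * (t - w) for w, t in zip(latent, target_emb)]
```

In the full framework the retrieved image's latent would be optimized over many such steps against the aligned audio embedding, so the manipulation reflects both the semantics and the affect of the audio.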
High- and Low-Temperature Properties of Layered Silicate-Modified Bitumens: View from the Nature of Pristine Layered Silicate
Layered silicates, as bitumen modifiers, have received increasing attention. The main objective of this study was to evaluate the influence of layered silicates on bitumen properties. For this study, montmorillonite (MMT), rectorite (REC), organic montmorillonite (OMMT), and organic rectorite (OREC) were selected. The layered structure type of the layered silicates was characterized by SEM (scanning electron microscopy) and XRD (X-ray diffraction). Tests for determining the high-temperature properties included viscosity, DSR (dynamic shear rheometer), and TG (thermogravimetry) tests, and the low-temperature properties were studied by BBR (bending beam rheometer) and DSC (differential scanning calorimetry) tests. Our results show that MMT, REC, OMMT, and OREC all formed intercalated structures. OREC had the largest d001 interlayer spacing, followed by REC, OMMT, and MMT. OREC improved the high-temperature properties of virgin bitumen more effectively than OMMT, while REC-modified bitumen exhibited high-temperature properties similar to those of OMMT-modified bitumen. Compared with REC and OREC, MMT and OMMT were less efficient in improving the low-temperature properties of virgin bitumen, and OMMT was the least efficient. Therefore, it can be concluded that the nature of the pristine layered silicate has a great impact on the high- and low-temperature properties of bitumen. Moreover, organic treatment can simultaneously improve the high- and low-temperature properties of layered silicate-modified bitumens.