Optimal subsampling for large scale Elastic-net regression
Massive datasets are now generated in fields including computer vision,
medical imaging, and astronomy, whose large-scale, high-dimensional
properties hamper the implementation of classical statistical models. To
tackle these computational challenges, one efficient approach is subsampling,
which draws subsamples from the original large dataset according to a
carefully designed, task-specific probability distribution to form an
informative sketch. The computational cost is reduced by applying the original
algorithm to the substantially smaller sketch. Previous studies of
subsampling have focused on non-regularized regression, such as ordinary
least squares and logistic regression, from the perspectives of computational
efficiency and theoretical guarantees. In this article, we introduce a
randomized algorithm under the subsampling scheme for Elastic-net regression,
which gives novel insights into the L1-norm regularized regression problem.
To effectively conduct the consistency analysis, a smooth approximation
technique based on the α-absolute function is first employed and
theoretically verified. The concentration bounds and asymptotic normality for
the proposed randomized algorithm are then established under mild conditions.
Moreover, an optimal subsampling probability is constructed according to
A-optimality. The effectiveness of the proposed algorithm is demonstrated on
synthetic and real datasets.
Comment: 28 pages, 7 figures
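
The recipe the abstract describes (a pilot fit, data-dependent sampling probabilities, then a weighted fit on the sketch) can be sketched in a few lines. The following is a minimal illustration, not the paper's algorithm: the score |residual| · ||x_i|| is a common A-optimality-style heuristic standing in for the paper's derived optimal probabilities, and the α-absolute smoothing (needed only for the consistency theory) is omitted.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Toy data: n is "large"; only r rows are ever used for the final fit.
n, d, r = 100_000, 20, 1_000
X = rng.standard_normal((n, d))
beta_true = np.zeros(d)
beta_true[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ beta_true + rng.standard_normal(n)

# Step 1: uniform pilot subsample to get a rough coefficient estimate.
pilot_idx = rng.choice(n, size=r, replace=False)
pilot = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X[pilot_idx], y[pilot_idx])

# Step 2: A-optimality-style scores. |residual| * ||x_i|| is a common
# leverage-type heuristic; the paper derives its own optimal probabilities,
# which this stand-in only approximates.
resid = np.abs(y - X @ pilot.coef_ - pilot.intercept_)
scores = resid * np.linalg.norm(X, axis=1)
probs = scores / scores.sum()

# Step 3: draw the informative subsample and fit a weighted elastic-net.
# Weights 1/(r * p_i) keep the subsampled loss unbiased for the full one.
idx = rng.choice(n, size=r, replace=True, p=probs)
w = 1.0 / (r * probs[idx])
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X[idx], y[idx], sample_weight=w)

print("coef error, pilot vs informed subsample:",
      np.linalg.norm(pilot.coef_ - beta_true),
      np.linalg.norm(model.coef_ - beta_true))
```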
RADAR: Robust AI-Text Detection via Adversarial Learning
Recent advances in large language models (LLMs) and the intensifying
popularity of ChatGPT-like applications have blurred the boundary of
high-quality text generation between humans and machines. However, in addition
to the anticipated revolutionary changes to our technology and society, the
difficulty of distinguishing LLM-generated texts (AI-text) from human-generated
texts poses new challenges of misuse and fairness, such as fake content
generation, plagiarism, and false accusations against innocent writers.
Existing works show that current AI-text detectors are not robust to LLM-based
paraphrasing; this paper aims to bridge this gap by proposing a new framework
called RADAR, which jointly trains a Robust AI-text Detector via Adversarial
leaRning. RADAR is based on adversarial training of a paraphraser and a
detector. The paraphraser's goal is to generate realistic content to evade
AI-text detection. RADAR uses the feedback from the detector to update the
paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly
2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets,
experimental results show that RADAR significantly outperforms existing AI-text
detection methods, especially when paraphrasing is in place. We also identify
the strong transferability of RADAR from instruction-tuned LLMs to other LLMs,
and evaluate the improved capability of RADAR via GPT-3.5.
Comment: Preprint. Project page and demos: https://radar.vizhub.a
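
The adversarial loop described above can be illustrated schematically. In the toy sketch below, "texts" are continuous feature vectors, the detector is a logistic regression, and the paraphraser is an additive shift trained by direct gradients; these are illustrative stand-ins, since RADAR itself pairs LLM-based components and feeds the detector's output back to the paraphraser through reinforcement learning over discrete text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, steps, lr = 8, 256, 500, 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy "texts": human text ~ N(0, I); raw AI text ~ N(mu, I).
mu = np.full(d, 1.0)

# Detector: logistic regression (label 1 = AI). Paraphraser: additive shift.
w, b = np.zeros(d), 0.0
theta = np.zeros(d)

for step in range(steps):
    human = rng.standard_normal((m, d))
    ai = rng.standard_normal((m, d)) + mu
    para = ai + theta                      # paraphraser "rewrites" AI text

    # Detector update: separate human (label 0) from paraphrased AI (label 1).
    X = np.vstack([human, para])
    t = np.concatenate([np.zeros(m), np.ones(m)])
    p = sigmoid(X @ w + b)
    w -= lr * X.T @ (p - t) / len(t)
    b -= lr * np.mean(p - t)

    # Paraphraser update: push the detector toward "human" (0) on its output.
    # Gradient of the cross-entropy w.r.t. theta is mean(p_ai * w).
    p_ai = sigmoid(para @ w + b)
    theta -= lr * np.mean(p_ai[:, None] * w, axis=0)

print("detector AI-score on fresh paraphrased text:",
      sigmoid((rng.standard_normal((m, d)) + mu + theta) @ w + b).mean())
```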
IPAD: Iterative, Parallel, and Diffusion-based Network for Scene Text Recognition
Scene text recognition has attracted increasing attention due to its diverse
applications. Most state-of-the-art methods adopt an
encoder-decoder framework with the attention mechanism, autoregressively
generating text from left to right. Despite their convincing performance, this
sequential decoding strategy constrains inference speed. Conversely,
non-autoregressive models provide faster, simultaneous predictions but often
sacrifice accuracy. Although utilizing an explicit language model can improve
performance, it increases the computational load. Moreover, separating
linguistic knowledge from visual information may harm the final prediction. In
this paper,
we propose an alternative solution, using a parallel and iterative decoder that
adopts an easy-first decoding strategy. Furthermore, we regard text recognition
as an image-based conditional text generation task and utilize the discrete
diffusion strategy, ensuring exhaustive exploration of bidirectional contextual
information. Extensive experiments demonstrate that the proposed approach
achieves superior results on the benchmark datasets, including both Chinese and
English text images.
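
The easy-first decoding strategy mentioned above (predict every position in parallel, commit the most confident ones, and re-predict the rest) can be shown with a toy loop. The predict function below is a random stub standing in for the paper's diffusion-based vision-language decoder, which would condition on the image and on already-committed characters.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK = list("abcdefgh"), "?"

def predict(seq):
    """Stub for the real decoder: returns (char, confidence) for every
    position given the current partially decoded sequence."""
    probs = rng.dirichlet(np.ones(len(VOCAB)), size=len(seq))
    return [(VOCAB[int(p.argmax())], float(p.max())) for p in probs]

def easy_first_decode(length, commits_per_iter=2):
    seq = [MASK] * length
    while MASK in seq:
        preds = predict(seq)
        # Rank still-masked positions by confidence; commit the easiest ones.
        open_pos = sorted((i for i, c in enumerate(seq) if c == MASK),
                          key=lambda i: preds[i][1], reverse=True)
        for i in open_pos[:commits_per_iter]:
            seq[i] = preds[i][0]
        print("".join(seq))        # show each parallel refinement step
    return "".join(seq)

easy_first_decode(8)
```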
Fish-T1K (Transcriptomes of 1,000 Fishes) Project: Large-Scale Transcriptome Data for Fish Evolution Studies
Ray-finned fishes (Actinopterygii) represent more than 50% of extant vertebrates and are of great evolutionary, ecological and economic significance, but they are relatively underrepresented in ‘omics studies. Increased availability of transcriptome data for these species will allow researchers to better understand changes in gene expression and to carry out functional analyses. An international project known as “Transcriptomes of 1,000 Fishes” (Fish-T1K) has been established to generate RNA-seq transcriptome sequences for 1,000 diverse species of ray-finned fishes. The first phase of this project has produced transcriptomes from more than 180 ray-finned fishes, representing 142 species and covering 51 orders and 109 families. Here we provide an overview of the goals of this project and the work done so far.
Masked and Permuted Implicit Context Learning for Scene Text Recognition
Scene Text Recognition (STR) is difficult because of the variations in text
styles, shapes, and backgrounds. Though the integration of linguistic
information enhances models' performance, existing methods based on either
permuted language modeling (PLM) or masked language modeling (MLM) have their
pitfalls. PLM's autoregressive decoding lacks foresight into subsequent
characters, while MLM overlooks inter-character dependencies. Addressing these
problems, we propose a masked and permuted implicit context learning network
for STR, which unifies PLM and MLM within a single decoder, inheriting the
advantages of both approaches. We utilize the training procedure of PLM, and to
integrate MLM, we incorporate word length information into the decoding process
and replace the undetermined characters with mask tokens. In addition,
perturbation training is employed to make the model more robust against
potential length prediction errors. Our empirical evaluations demonstrate the
effectiveness of our model: it not only achieves superior performance on the
common benchmarks but also achieves a substantial improvement on the more
challenging Union14M-Benchmark.
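
How a single decoder can serve both objectives can be sketched with two standard ingredients: a permutation-based attention mask for the PLM view, and mask-token replacement with sequence length preserved for the MLM view. The construction below follows the generic PLM/MLM recipes and is only an assumption about the paper's exact unification.

```python
import numpy as np

rng = np.random.default_rng(0)

def plm_attention_mask(length):
    """PLM view: sample a factorization order and let each position attend
    only to positions that come earlier in that order (True = may attend)."""
    order = rng.permutation(length)
    rank = np.empty(length, dtype=int)
    rank[order] = np.arange(length)          # rank[i] = i's place in the order
    return rank[:, None] > rank[None, :]

def mlm_view(chars, mask_ratio=0.3, mask_tok="[M]"):
    """MLM view: replace a random subset with mask tokens. Keeping the full
    sequence length exposes word-length information to the decoder."""
    chars = list(chars)
    n_mask = max(1, int(len(chars) * mask_ratio))
    for i in rng.choice(len(chars), size=n_mask, replace=False):
        chars[i] = mask_tok
    return chars

print(plm_attention_mask(5).astype(int))
print(mlm_view("street"))
```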
