399 research outputs found

    Optimal subsampling for large scale Elastic-net regression

    Massive datasets are generated in fields such as computer vision, medical imaging, and astronomy, and their scale and high dimensionality hamper the use of classical statistical models. One efficient way to tackle the computational challenge is subsampling, which draws subsamples from the original large dataset according to a carefully designed, task-specific probability distribution to form an informative sketch. Computation cost is reduced by applying the original algorithm to the substantially smaller sketch. Previous studies of subsampling focused on the computational efficiency and theoretical guarantees of non-regularized regression, such as ordinary least squares regression and logistic regression. In this article, we introduce a randomized algorithm under the subsampling scheme for Elastic-net regression, which gives novel insights into the L1-norm regularized regression problem. To enable the consistency analysis, a smooth approximation technique based on the alpha absolute function is first employed and theoretically verified. Concentration bounds and asymptotic normality for the proposed randomized algorithm are then established under mild conditions. Moreover, an optimal subsampling probability is constructed according to A-optimality. The effectiveness of the proposed algorithm is demonstrated on synthetic and real datasets. Comment: 28 pages, 7 figures
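As a rough illustration of the subsample-then-fit idea described in this abstract, the sketch below draws a weighted subsample and fits an Elastic-net model on it with inverse-probability weights. It is a minimal sketch under stated assumptions and not the authors' algorithm: the sampling probabilities are taken proportional to row norms (a generic stand-in for the A-optimal probabilities the abstract mentions), the solver is plain proximal gradient descent, and all function and parameter names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_elastic_net(X, y, w, lam1, lam2, n_iter=500):
    """Minimise 0.5 * sum_i w_i (y_i - x_i^T b)^2 + lam1 * ||b||_1
    + 0.5 * lam2 * ||b||_2^2 by proximal gradient descent,
    with the weights normalised to sum to one."""
    w = w / w.sum()
    beta = np.zeros(X.shape[1])
    # Lipschitz constant of the smooth part: lambda_max(X^T W X) + lam2.
    lipschitz = np.linalg.eigvalsh(X.T @ (w[:, None] * X))[-1] + lam2
    for _ in range(n_iter):
        grad = X.T @ (w * (X @ beta - y)) + lam2 * beta
        beta = soft_threshold(beta - grad / lipschitz, lam1 / lipschitz)
    return beta

def subsample_elastic_net(X, y, r, lam1, lam2, rng):
    """Draw r rows with replacement and fit on the reweighted sketch."""
    # ASSUMPTION: probabilities proportional to row norms, chosen only
    # for illustration; the paper derives A-optimal probabilities instead.
    p = np.linalg.norm(X, axis=1)
    p = p / p.sum()
    idx = rng.choice(X.shape[0], size=r, replace=True, p=p)
    # Inverse-probability weights correct for the nonuniform sampling.
    return weighted_elastic_net(X[idx], y[idx], 1.0 / p[idx], lam1, lam2)

# Synthetic example: 20,000 rows reduced to a sketch of 1,000 rows.
rng = np.random.default_rng(0)
n, d = 20000, 10
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)
beta_sub = subsample_elastic_net(X, y, r=1000, lam1=0.01, lam2=0.01, rng=rng)
print(np.round(beta_sub, 2))
```

The fit touches only the 1,000 sampled rows, which is where the computational saving comes from; the inverse-probability weights keep the subsample objective an unbiased estimate of the full-data objective up to scaling.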

    RADAR: Robust AI-Text Detection via Adversarial Learning

    Recent advances in large language models (LLMs) and the growing popularity of ChatGPT-like applications have blurred the boundary between human and machine generation of high-quality text. However, in addition to the anticipated revolutionary changes to our technology and society, the difficulty of distinguishing LLM-generated texts (AI-text) from human-generated texts poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations against innocent writers. Existing work shows that current AI-text detectors are not robust to LLM-based paraphrasing. This paper aims to bridge this gap by proposing a new framework called RADAR, which jointly trains a Robust AI-text Detector via Adversarial leaRning. RADAR is based on adversarial training of a paraphraser and a detector. The paraphraser's goal is to generate realistic content that evades AI-text detection. RADAR uses the feedback from the detector to update the paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, experimental results show that RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. We also identify the strong transferability of RADAR from instruction-tuned LLMs to other LLMs, and evaluate the improved capability of RADAR via GPT-3.5. Comment: Preprint. Project page and demos: https://radar.vizhub.a

    Automating Intersection Marking Data Collection and Condition Assessment at Scale With An Artificial Intelligence-Powered System

    Intersection markings play a vital role in providing road users with guidance and information. Their condition degrades gradually due to vehicular traffic, rain, and/or snowplowing. Degraded markings can confuse drivers, leading to an increased risk of traffic crashes. Timely, high-quality information on intersection markings lays a foundation for informed decisions in safety management and maintenance prioritization. However, current labor-intensive and high-cost data collection practices make it very challenging to gather intersection data on a large scale. This paper develops an automated system that detects intersection markings and assesses their degradation conditions using existing roadway geographic information system (GIS) data and aerial images. The system harnesses emerging artificial intelligence (AI) techniques such as deep learning and multi-task learning to enhance its robustness, accuracy, and computational efficiency. AI models were developed to detect lane-use arrows (85% mean average precision) and crosswalks (89% mean average precision) and to assess the degradation conditions of markings (91% overall accuracy for lane-use arrows and 83% for crosswalks). The data acquisition and computer vision modules were integrated, and a graphical user interface (GUI) was built for the system. The proposed system can fully automate marking data collection and condition assessment on a large scale with almost zero cost and short processing time. The developed system has great potential to propel urban science forward by providing fundamental urban infrastructure data for analysis and decision-making across critical areas such as data-driven safety management and prioritization of infrastructure maintenance.

    A gene catalogue for post-diapause development of an anhydrobiotic arthropod Artemia franciscana

    Background: Diapause is a reversible state of developmental suspension found among diverse taxa, from plants to animals, including marsupials and some other mammals. Although previous work has accumulated ample data, the molecular mechanisms underlying diapause and reactivation from it remain elusive. Results: Using Artemia franciscana, a model organism for studying the development of post-diapause embryos in arthropods, we sequenced random clones up to a total of 28,039 ESTs from four cDNA libraries made from dehydrated cysts and three time points after rehydration/reactivation, which were assembled into 8,018 unigene clusters. We identified 324 differentially expressed genes (DEGs, P < 0.05) based on pairwise comparisons of the four cDNA libraries. We identified a group of genes that are involved in an anti-water-deficit system, including proteases, protease inhibitors, heat shock proteins, and several novel members of the late embryogenesis abundant (LEA) protein family. In addition, we classified most of the genes up-regulated after cyst reactivation into metabolism, biosynthesis, transcription, and translation, a result consistent with the rapid development of the embryo. The expression of some DEGs was confirmed experimentally by quantitative real-time PCR. Conclusion: We found that the first 5-hour period after rehydration is the most important for embryonic reactivation of Artemia. As the total number of expressed genes increases significantly, the majority of DEGs were also identified in this period, including a group of water-deficit-induced genes. A group of genes with similar functions has been described in plant seeds; for instance, one of the novel LEA members shares ~70% amino-acid identity with an Arabidopsis EM (embryonic abundant) protein, the closest animal relative to plant LEA families identified thus far. Our findings also suggest that not only nutrients but also mRNAs are produced and stored during cyst formation to support rapid development after reactivation.