93 research outputs found

    Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision

    Programmatic Weak Supervision (PWS) has emerged as a widespread paradigm to synthesize training labels efficiently. The core component of PWS is the label model, which infers true labels by aggregating the outputs of multiple noisy supervision sources abstracted as labeling functions (LFs). Existing statistical label models typically rely only on the outputs of LFs, ignoring the instance features when modeling the underlying generative process. In this paper, we attempt to incorporate the instance features into a statistical label model via the proposed FABLE. In particular, it is built on a mixture of Bayesian label models, each corresponding to a global pattern of correlation, and the coefficients of the mixture components are predicted by a Gaussian Process classifier based on instance features. We adopt an auxiliary variable-based variational inference algorithm to tackle the non-conjugate issue between the Gaussian Process and Bayesian label models. Extensive empirical comparison on eleven benchmark datasets sees FABLE achieving the highest averaged performance across nine baselines. Comment: 16 pages

    Adaptive Ranking-based Sample Selection for Weakly Supervised Class-imbalanced Text Classification

    To obtain a large amount of training labels inexpensively, researchers have recently adopted the weak supervision (WS) paradigm, which leverages labeling rules to synthesize training labels rather than using individual annotations to achieve competitive results for natural language processing (NLP) tasks. However, data imbalance is often overlooked in applying the WS paradigm, despite being a common issue in a variety of NLP tasks. To address this challenge, we propose Adaptive Ranking-based Sample Selection (ARS2), a model-agnostic framework to alleviate the data imbalance issue in the WS paradigm. Specifically, it calculates a probabilistic margin score based on the output of the current model to measure and rank the cleanliness of each data point. Then, the ranked data are sampled based on both class-wise and rule-aware ranking. In particular, the two sampling strategies correspond to our motivations: (1) to train the model with balanced data batches to reduce the data imbalance issue and (2) to exploit the expertise of each labeling rule for collecting clean samples. Experiments on four text classification datasets with four different imbalance ratios show that ARS2 outperformed the state-of-the-art imbalanced learning and WS methods, leading to a 2%-57.8% improvement on their F1-score.

    Optimized Vectorization Implementation of CRYSTALS-Dilithium

    CRYSTALS-Dilithium is a lattice-based signature scheme to be standardized by NIST as the primary post-quantum signature algorithm. In this work, we make a thorough study of optimizing the implementations of Dilithium by utilizing the Advanced Vector Extensions (AVX) instructions, specifically AVX2 and the latest AVX-512. We first present an improved parallel small polynomial multiplication with tailored early evaluation (PSPM-TEE) to further speed up the signing procedure, which results in a speedup of 5%-6% compared with the original PSPM Dilithium implementation. We then present a tailored reduction method that is simpler and faster than Montgomery reduction. Our optimized AVX2 implementation exhibits a speedup of 3%-8% compared with the state-of-the-art Dilithium AVX2 software. Finally, for the first time, we propose a fully and highly vectorized implementation of Dilithium using AVX-512. This is achieved by carefully vectorizing most of Dilithium's functions with the AVX-512 instructions in order to improve efficiency in both time and space. With all the optimization efforts, our AVX-512 implementation improves the performance by 37.3%/50.7%/39.7% in key generation, 34.1%/37.1%/42.7% in signing, and 38.1%/38.7%/40.7% in verification for the parameter sets of Dilithium2/3/5, respectively. To the best of our knowledge, our AVX-512 implementation has the best performance for Dilithium on the Intel x64 CPU platform to date. Comment: 13 pages, 5 figures

    Different responses in leaf-level physiology to competition and facilitation under different soil types and N fertilization

    Knowledge of how competition and facilitation affect photosynthetic traits and nitrogen metabolism contributes to understanding of plant-plant interaction mechanisms. We transplanted two larch species, Larix kaempferi and L. olgensis, to establish intra- and interspecific interaction experiments under different types of soil. Experiment 1: Two different soil types were selected, one from a c. twenty-year-old L. kaempferi plantation (named larch soil) and another from a secondary natural forest (named mixed forest soil). The experiment included three types of plant interactions (L. kaempferi + L. kaempferi, L. olgensis + L. olgensis, and L. kaempferi + L. olgensis) and two soil types. Experiment 2: N fertilization was applied to larch soil. The experiment included the same three types of plant interactions as in Experiment 1 and two N treatments. The growth of L. kaempferi was negatively affected by larch soil and accelerated by N fertilization, particularly under interspecific interaction. The effects of soil type combined with plant-plant interactions or N fertilization influenced the chlorophyll pigment content, net photosynthetic rate (Pn), photosynthetic N use efficiency (PNUE) and total non-structural carbohydrates of leaves (TNC). Chl a/Chl b (ratio of chlorophyll a to chlorophyll b) was higher when the growth of L. kaempferi was facilitated by the presence of L. olgensis in mixed forest soil. However, the ratio significantly declined when L. kaempferi confronted strong competition from L. olgensis in larch soil without N fertilization. Under N fertilization in larch soil, Chl a/Chl b of L. olgensis increased significantly in the presence of L. kaempferi. Plant-plant interactions and soil types affected the number of chloroplasts, especially in L. kaempferi, which had a greater number of chloroplasts under interspecific interactions than in monoculture when growing in mixed forest soil. L. olgensis enhanced its ability to absorb N-NO3- under interspecific interactions in larch N- soil, while L. kaempferi enhanced its ability to absorb N-NH4+ under interspecific competition in mixed forest soil. Competition or facilitation modified the photosynthetic traits and nitrogen metabolism depending on the type of soil. Differences in these physiological processes contribute to divergent performance among individuals growing under interspecific or intraspecific competition, or in isolation. Peer reviewed

    NLPBench: Evaluating Large Language Models on Solving NLP Problems

    Recent developments in large language models (LLMs) have shown promise in enhancing the capabilities of natural language processing (NLP). Despite these successes, there remains a dearth of research dedicated to the NLP problem-solving abilities of LLMs. To fill the gap in this area, we present a unique benchmarking dataset, NLPBench, comprising 378 college-level NLP questions spanning various NLP topics sourced from Yale University's prior final exams. NLPBench includes questions with context, in which multiple sub-questions share the same public information, and diverse question types, including multiple choice, short answer, and math. Our evaluation, centered on LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, incorporates advanced prompting strategies like chain-of-thought (CoT) and tree-of-thought (ToT). Our study reveals that the effectiveness of the advanced prompting strategies can be inconsistent, occasionally damaging LLM performance, especially in smaller models like LLAMA-2 (13b). Furthermore, our manual assessment illuminated specific shortcomings in LLMs' scientific problem-solving skills, with weaknesses in logical decomposition and reasoning notably affecting results.

    Metagenomic analysis reveals gut plasmids as diagnosis markers for colorectal cancer

    Background: Colorectal cancer (CRC) is linked to distinct gut microbiome patterns. The efficacy of gut bacteria as diagnostic biomarkers for CRC has been confirmed. Despite the potential to influence microbiome physiology and evolution, the set of plasmids in the gut microbiome remains understudied. Methods: We investigated the essential features of gut plasmids using metagenomic data of 1,242 samples from eight distinct geographic cohorts. We identified 198 plasmid-related sequences that differed in abundance between CRC patients and controls and screened 21 markers for the CRC diagnosis model. We used these plasmid markers combined with bacteria to construct a random forest classifier model to diagnose CRC. Results: The plasmid markers were able to distinguish between the CRC patients and controls (mean area under the receiver operating characteristic curve [AUC] = 0.70) and maintained accuracy in two independent cohorts. In comparison to the bacteria-only model, the performance of the composite panel created by combining plasmid and bacteria features was significantly improved in all training cohorts (mean AUC composite = 0.804 and mean AUC bacteria = 0.787) and maintained high accuracy in all independent cohorts (mean AUC composite = 0.839 and mean AUC bacteria = 0.821). In comparison to controls, we found that the bacteria-plasmid correlation strength was weaker in CRC patients. Additionally, the KEGG orthology (KO) genes in plasmids that are independent of bacteria or plasmids significantly correlated with CRC. Conclusion: We identified plasmid features associated with CRC and showed how plasmid and bacterial markers could be combined to further enhance CRC diagnosis accuracy.

    DataComp: In search of the next generation of multimodal datasets

    Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai

    Utility of clinical metagenomics in diagnosing malignancies in a cohort of patients with Epstein-Barr virus positivity

    Background: Differentiation between benign and malignant diseases in EBV-positive patients poses a significant challenge due to the lack of efficient diagnostic tools. Metagenomic Next-Generation Sequencing (mNGS) is commonly used to identify pathogens in patients with fever of unknown origin (FUO). Recent studies have extended the application of Next-Generation Sequencing (NGS) to identifying tumors in body fluids and cerebrospinal fluids. In light of these, we conducted this study to develop and apply metagenomic methods to validate their role in identifying EBV-associated malignant disease. Methods: We enrolled 29 patients with positive EBV results in the cohort of FUO in the Department of Infectious Diseases of Huashan Hospital affiliated with Fudan University from 2018 to 2019. Upon enrollment, these patients were grouped into benign diseases, CAEBV, and malignant diseases according to their final diagnosis, and CNV analysis was retrospectively performed in 2022 using samples from 2018 to 2019. Results: Among the 29 patients, 16 were diagnosed with benign diseases, 3 with CAEBV, and 10 with malignant diseases. 29 blood samples from 29 patients were tested by mNGS. Among all 10 patients with malignant diagnosis, CNV analysis suggested neoplasms in 9 patients. Of all 19 patients with benign or CAEBV diagnosis, 2 patients showed abnormal CNV results. The sensitivity and specificity of CNV analysis for the identification of tumors were 90% and 89.5%, respectively. Conclusions: The application of mNGS could assist in the identification of microbial infection and malignancies in EBV-related diseases. Our results demonstrate that CNV detection through mNGS is faster compared to conventional oncology tests. Moreover, the convenient collection of peripheral blood samples adds to the advantages of this approach.
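
The sensitivity and specificity quoted in this abstract can be reproduced from the reported counts. The sketch below is only an illustration: it assumes the 9 of 10 malignant cases flagged by CNV analysis are true positives and the 2 abnormal results among the 19 benign or CAEBV cases are false positives. The function name is hypothetical and not taken from the paper.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of malignant cases that were flagged
    specificity = tn / (tn + fp)  # share of non-malignant cases that were cleared
    return sensitivity, specificity


# Counts as read from the abstract (assumed mapping, see lead-in):
# 10 malignant patients, 9 flagged; 19 benign/CAEBV patients, 2 flagged.
sens, spec = sensitivity_specificity(tp=9, fn=1, tn=17, fp=2)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

With only 29 patients, each misclassified case shifts these percentages by several points, so the figures are best read as rough estimates.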