159 research outputs found

    Evolving Knowledge Distillation with Large Language Models and Active Learning

    Large language models (LLMs) have demonstrated remarkable capabilities across various NLP tasks. However, their computational costs are prohibitively high. To address this issue, previous research has attempted to distill the knowledge of LLMs into smaller models by generating annotated data. Nonetheless, these works have mainly focused on the direct use of LLMs for text generation and labeling, without fully exploring their potential to comprehend the target task and acquire valuable knowledge. In this paper, we propose EvoKD: Evolving Knowledge Distillation, which leverages the concept of active learning to interactively enhance the process of data generation using large language models, simultaneously improving the task capabilities of a small domain model (the student model). Unlike previous work, we actively analyze the student model's weaknesses, and then synthesize labeled samples based on the analysis. In addition, we provide iterative feedback to the LLMs regarding the student model's performance to continuously construct diversified and challenging samples. Experiments and analysis on different NLP tasks, namely text classification and named entity recognition, show the effectiveness of EvoKD. Comment: Accepted by COLING 202
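
The loop described in this abstract (find the student's weaknesses, synthesize labeled samples that target them, retrain, repeat) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the student is a toy word-vote classifier, and `mock_llm_generate` is a hypothetical stand-in for the LLM call, which in practice would prompt a model with the observed mistakes.

```python
from collections import Counter


def train_student(samples):
    """Toy student: remember the majority label seen for each word."""
    votes = {}
    for text, label in samples:
        for word in text.lower().split():
            votes.setdefault(word, Counter())[label] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}


def predict(student, text, default="neg"):
    """Majority vote over known words; fall back to a default label."""
    labels = [student[w] for w in text.lower().split() if w in student]
    return Counter(labels).most_common(1)[0][0] if labels else default


def mock_llm_generate(mistakes):
    """Hypothetical stand-in for the LLM: emit new labeled samples
    that resemble the inputs the student got wrong."""
    return [(f"{text} indeed", label) for text, label in mistakes]


def evolve(seed, probe, rounds=3):
    """Iteratively probe the student, synthesize targeted data, and retrain."""
    train = list(seed)
    student = train_student(train)
    for _ in range(rounds):
        # Weakness analysis: which probe inputs does the student misclassify?
        mistakes = [(t, y) for t, y in probe if predict(student, t) != y]
        if not mistakes:
            break
        # Feedback step: ask the (mock) LLM for samples targeting the mistakes.
        train.extend(mock_llm_generate(mistakes))
        student = train_student(train)
    return student, train


seed = [("great movie", "pos"), ("terrible plot", "neg")]
probe = [("wonderful", "pos"), ("awful", "neg")]
student, train = evolve(seed, probe)
```

The design point the sketch tries to capture is that new training data is conditioned on the student's current errors rather than generated once up front.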

    Comprehensive Information Integration Modeling Framework for Video Titling

    In e-commerce, consumer-generated videos, which in general deliver consumers' individual preferences for the different aspects of certain products, are massive in volume. To recommend these videos to potential consumers more effectively, diverse and catchy video titles are critical. However, consumer-generated videos are seldom accompanied by appropriate titles. To bridge this gap, we integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework. Although automatic video titling is very useful and in high demand, it is much less addressed than video captioning. The latter focuses on generating sentences that describe videos as a whole, while our task requires product-aware multi-grained video analysis. To tackle this issue, the proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization. Specifically, the granular-level interaction modeling first utilizes temporal-spatial landmark cues, descriptive words, and abstractive attributes to build three individual graphs and recognizes the intra-actions in each graph through Graph Neural Networks (GNN). Then the global-local aggregation module is proposed to model inter-actions across graphs and aggregate heterogeneous graphs into a holistic graph representation. The abstraction-level story-line summarization further considers both frame-level video features and the holistic graph to utilize the interactions between products and backgrounds, and generate the story-line topic of the video. We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform, and will make the desensitized version publicly available to nourish further development of the research community... Comment: 11 pages, 6 figures, to appear in KDD 2020 proceeding

    Single-cell immune profiling reveals immune responses in oral lichen planus

    Introduction: Oral lichen planus (OLP) is a common chronic inflammatory disorder of the oral mucosa with an unclear etiology. Several types of immune cells are involved in the pathogenesis of OLP. Methods: We used single-cell RNA sequencing and immune repertoire sequencing to characterize the mucosal immune microenvironment of OLP. The presence of tissue-resident memory CD8+ T cells was validated by multiplex immunofluorescence. Results: We generated a transcriptome atlas from four OLP biopsy samples and their paired peripheral blood mononuclear cells (PBMCs), and compared them with two healthy tissues and three healthy PBMC samples. Our analysis revealed activated tissue-resident memory CD8+ T cells in OLP tissues. T cell receptor repertoires displayed apparent clonal expansion and preferential gene pairing in OLP patients. Additionally, obvious BCR clonal expansion was observed in OLP lesions. Plasmacytoid dendritic cells, a subtype that can promote dendritic cell maturation and enhance lymphocyte cytotoxicity, were identified in OLP. Conventional dendritic cells and macrophages were also found to exhibit pro-inflammatory activity in OLP. Cell-cell communication analysis revealed that fibroblasts might promote the recruitment and extravasation of immune cells into connective tissue. Discussion: Our study provides insights into the immune ecosystem of OLP, serving as a valuable resource for precision diagnosis and therapy of OLP

    Lithium, an anti-psychotic drug, greatly enhances the generation of induced pluripotent stem cells

    Somatic cells can be reprogrammed into induced pluripotent stem cells (iPSCs) by defined factors. The low efficiency of reprogramming and the genomic integration of oncogenes and viral vectors limit the potential application of iPSCs. Here we report that Lithium (Li), a drug used to treat mood disorders, greatly enhances iPSC generation from both mouse embryonic fibroblasts and human umbilical vein endothelial cells. Li facilitates iPSC generation with one (Oct4) or two factors (OS or OK). The effect of Li on promoting reprogramming only partially depends on its major target GSK3β. Unlike other GSK3β inhibitors, Li not only increases the expression of Nanog, but also enhances the transcriptional activity of Nanog. We also found that Li exerts its effect by promoting epigenetic modifications via downregulation of LSD1, an H3K4-specific histone demethylase. Knocking down LSD1 partially mimics Li's effect in enhancing reprogramming. Our results not only provide a straightforward method to improve iPSC generation efficiency, but also identify a histone demethylase as a critical modulator of somatic cell reprogramming

    Histone deacetylase HD2 interacts with ERF1 and is involved in longan fruit senescence

    Histone deacetylation plays an important role in epigenetic control of gene expression. HD2 is a plant-specific histone deacetylase that is able to mediate transcriptional repression in many biological processes. To investigate the epigenetic and transcriptional mechanisms of longan fruit senescence, one histone deacetylase 2-like gene, DlHD2, and two ethylene-responsive factor-like genes, DlERF1 and DlERF2, were cloned and characterized from longan fruit. Expression of these genes was examined during fruit senescence under different storage conditions. The accumulation of DlHD2 reached a peak at 2 d and 30 d in the fruit stored at 25 °C (room temperature) and 4 °C (low temperature), respectively, or 6 h after the fruit was transferred from 4 °C to 25 °C, when fruit senescence was initiated. However, the DlERF1 transcript accumulated mostly at the later stage of fruit senescence, reaching a peak at 5 d and 35 d in the fruit stored at 25 °C and 4 °C, respectively, or 36 h after the fruit was transferred from low temperature to room temperature. Moreover, application of nitric oxide (NO) delayed fruit senescence, enhanced the expression of DlHD2, but suppressed the expression of DlERF1 and DlERF2. These results indicated a possible interaction between DlHD2 and DlERFs in regulating longan fruit senescence, and the direct interaction between DlHD2 and DlERF1 was confirmed by yeast two-hybrid and bimolecular fluorescence complementation (BiFC) assays. Taken together, the results suggested that DlHD2 may act with DlERF1 to regulate gene expression involved in longan fruit senescence

    Gene ontology based transfer learning for protein subcellular localization

    Background: Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting multi-aspect protein feature information. Gene ontology, hereinafter referred to as GO, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenate the GO terms into a flat binary vector or apply majority-vote based ensemble learning for protein subcellular localization, neither of which can estimate the individual discriminative abilities of the three aspects of gene ontology. Results: In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse effect of the potential false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time and space computational complexities are greatly reduced when compared to the complicated semi-definite programming and semi-indefinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. We evaluate GO-TLM performance against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets the baseline models adopted. 5-fold cross validation experiments show that GO-TLM achieves substantial accuracy improvement against the baseline models: 80.38% against model Euk-mPLoc 67.40%, a 12.98% increase; 96.65% and 96.27% against model MultiLoc-GO 89.60% and 89.60%, with 7.05% and 6.67% accuracy increase on dataset MultiLoc plant and dataset MultiLoc animal, respectively; 97.14%, 95.90% and 96.85% against model MultiLoc-GO 83.70%, 90.10% and 85.70%, with accuracy increase 13.44%, 5.8% and 11.15% on dataset BaCelLoc plant, dataset BaCelLoc fungi and dataset BaCelLoc animal, respectively. For BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on dataset BaCelLoc plant holdout, dataset BaCelLoc plant holdout and dataset BaCelLoc animal holdout, respectively, as compared against baseline model MultiLoc-GO 76%, 60.00% and 73.00%, with accuracy increase 5.25%, 20.45% and 6.46%, respectively. Conclusions: Since direct homology-based GO term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test experimental results show that the homology-based GO term transfer and explicitly weighing the GO kernels substantially improve the prediction performance.
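
The idea of weighing several kernels by their individual cross-validation performance and then merging them linearly can be sketched as follows. This is a minimal illustration under stated assumptions, not the GO-TLM implementation: the abstract does not give the weighting formula, so normalizing the scores to sum to one is an assumption, the two toy kernels stand in for the five GO and spectrum kernels, and the `cv_scores` values are made-up placeholders rather than results from the paper.

```python
import numpy as np


def combine_kernels(kernels, cv_scores):
    """Linearly merge kernel matrices, weighting each one by its
    normalized cross-validation score (an assumed weighting rule)."""
    scores = np.asarray(cv_scores, dtype=float)
    weights = scores / scores.sum()
    combined = sum(w * K for w, K in zip(weights, kernels))
    return combined, weights


# Toy data: 6 samples with 4 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Two stand-in kernels: linear and Gaussian (RBF).
linear_kernel = X @ X.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
rbf_kernel = np.exp(-0.5 * sq_dists)

# Placeholder per-kernel cross-validation accuracies (hypothetical values).
combined, weights = combine_kernels([linear_kernel, rbf_kernel], cv_scores=[0.6, 0.9])
```

Because a non-negative weighted sum of positive semi-definite kernels is itself positive semi-definite, the merged matrix can be passed to any kernel classifier that accepts a precomputed kernel.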

    Molecular Characterization of a Strawberry FaASR Gene in Relation to Fruit Ripening

    BACKGROUND: ABA-, stress- and ripening-induced (ASR) proteins have been reported to act as a downstream component involved in ABA signal transduction. Although much attention has been paid to the roles of ASR in plant development and stress responses, the mechanisms by which ABA regulates fruit ripening at the molecular level are not fully understood. In the present work, a strawberry ASR gene (FaASR) was isolated and characterized, and a polyclonal antibody against FaASR protein was prepared. Furthermore, the effects of ABA, applied at two different developmental stages of strawberry, on fruit ripening and the expression of FaASR at transcriptional and translational levels were investigated. METHODOLOGY/PRINCIPAL FINDINGS: FaASR, localized in the cytoplasm and nucleus, contained 193 amino acids and shared common features with other plant ASRs. It also functioned as a transcriptional activator in yeast with trans-activation activity in the N-terminus. During strawberry fruit development, endogenous ABA content and levels of FaASR mRNA and protein increased significantly at the initiation of ripening at the white (W) fruit developmental stage. More importantly, application of exogenous ABA to large green (LG) fruit and W fruit markedly increased endogenous ABA content, accelerated fruit ripening, and greatly enhanced the expression of FaASR transcripts and the accumulation of FaASR protein simultaneously. CONCLUSIONS: These results indicate that FaASR may be involved in strawberry fruit ripening. The observed increase in endogenous ABA content, and enhanced FaASR expression at transcriptional and translational levels in response to ABA treatment, might partially contribute to the acceleration of strawberry fruit ripening

    Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays.

    Spatially resolved transcriptomic technologies are promising tools to study complex biological processes such as mammalian embryogenesis. However, the imbalance between resolution, gene capture, and field of view of current methodologies precludes their systematic application to analyze relatively large and three-dimensional mid- and late-gestation embryos. Here, we combined DNA nanoball (DNB)-patterned arrays and in situ RNA capture to create spatial enhanced resolution omics-sequencing (Stereo-seq). We applied Stereo-seq to generate the mouse organogenesis spatiotemporal transcriptomic atlas (MOSTA), which maps with single-cell resolution and high sensitivity the kinetics and directionality of transcriptional variation during mouse organogenesis. We used this information to gain insight into the molecular basis of spatial cell heterogeneity and cell fate specification in developing tissues such as the dorsal midbrain. Our panoramic atlas will facilitate in-depth investigation of longstanding questions concerning normal and abnormal mammalian development. This work is part of the "SpatioTemporal Omics Consortium" (STOC) paper package. A list of STOC members is available at: http://sto-consortium.org. We would like to thank the MOTIC China Group, Rongqin Ke (Huaqiao University, Xiamen, China), Jiazuan Ni (Shenzhen University, Shenzhen, China), Wei Huang (Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China), and Jonathan S. Weissman (Whitehead Institute, Boston, USA) for their help. This work was supported by the grant of Top Ten Fundamental Research Institutes of Shenzhen, the Shenzhen Key Laboratory of Single-Cell Omics (ZDSYS20190902093613831), and the Guangdong Provincial Key Laboratory of Genome Read and Write (2017B030301011); Longqi Liu was supported by the National Natural Science Foundation of China (31900466) and Miguel A. Esteban's laboratory at the Guangzhou Institutes of Biomedicine and Health by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA16030502), National Natural Science Foundation of China (92068106), and the Guangdong Basic and Applied Basic Research Foundation (2021B1515120075).

    The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions

    Watermelon, Citrullus lanatus, is an important cucurbit crop grown throughout the world. Here we report a high-quality draft genome sequence of the east Asia watermelon cultivar 97103 (2n = 2x = 22) containing 23,440 predicted protein-coding genes. Comparative genomics analysis provided an evolutionary scenario for the origin of the 11 watermelon chromosomes derived from a 7-chromosome paleohexaploid eudicot ancestor. Resequencing of 20 watermelon accessions representing three different C. lanatus subspecies produced numerous haplotypes and identified the extent of genetic diversity and population structure of watermelon germplasm. Genomic regions that were preferentially selected during domestication were identified. Many disease-resistance genes were also found to be lost during domestication. In addition, integrative genomic and transcriptomic analyses yielded important insights into aspects of phloem-based vascular signaling in common between watermelon and cucumber and identified genes crucial to valuable fruit-quality traits, including sugar accumulation and citrulline metabolism