89 research outputs found

    Sen2Pro: A Probabilistic Perspective to Sentence Embedding from Pre-trained Language Model

    Sentence embedding is one of the most fundamental tasks in Natural Language Processing and plays an important role in various tasks. The recent breakthrough in sentence embedding is achieved by pre-trained language models (PLMs). Despite this success, an embedded vector (Sen2Vec), being a point estimate, does not naturally express uncertainty in a task-agnostic way. This paper therefore proposes an efficient framework for probabilistic sentence embedding (Sen2Pro) from PLMs, which represents a sentence as a probability density distribution in an embedding space to reflect both model uncertainty and data uncertainty (i.e., many-to-one nature) in the sentence representation. The proposed framework works in a plug-and-play way without retraining PLMs, and it is easy to implement and can generally be applied on top of any PLM. The superiority of Sen2Pro over Sen2Vec has been theoretically verified and practically illustrated on different NLP tasks. Comment: Accepted to ACL2023 workshop Rep4NL
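
    The abstract does not say how Sen2Pro builds its distribution, so the following is only an illustrative sketch of the general idea of replacing a point embedding with a distribution: draw several stochastic embeddings of the same sentence and summarize them as a diagonal Gaussian (per-dimension mean and variance). The encoder `toy_stochastic_encoder`, the function `sentence_distribution`, the noise level, and all parameters are hypothetical stand-ins, not the paper's method or any PLM's API.

```python
import math
import random
from statistics import fmean, pvariance
from typing import List, Tuple


def toy_stochastic_encoder(sentence: str, dim: int, rng: random.Random) -> List[float]:
    """Hypothetical stand-in for a PLM run with some source of randomness
    (for example, dropout left active). Returns one noisy embedding."""
    # Deterministic "content" part derived from the characters of the sentence.
    key = sum(ord(ch) for ch in sentence)
    base = [math.sin((i + 1) * key) for i in range(dim)]
    # Random perturbation standing in for model stochasticity (assumed noise level).
    return [b + rng.gauss(0.0, 0.1) for b in base]


def sentence_distribution(
    sentence: str, n_samples: int = 32, dim: int = 4, seed: int = 0
) -> Tuple[List[float], List[float]]:
    """Summarize repeated stochastic embeddings as a diagonal Gaussian.

    Returns (mean, variance), each a list with one entry per dimension.
    """
    rng = random.Random(seed)
    samples = [toy_stochastic_encoder(sentence, dim, rng) for _ in range(n_samples)]
    columns = list(zip(*samples))  # regroup values by embedding dimension
    mean = [fmean(col) for col in columns]
    variance = [pvariance(col) for col in columns]
    return mean, variance


if __name__ == "__main__":
    mu, var = sentence_distribution("a short example sentence")
    print("mean:", mu)
    print("variance:", var)
```

    The per-dimension variance gives a rough uncertainty signal that a single vector cannot express; a real system would replace the toy encoder with an actual PLM.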

    Directed evolution of an orthogonal nucleoside analog kinase via fluorescence-activated cell sorting

    Nucleoside analogs (NAs) represent an important category of prodrugs for the treatment of viral infections and cancer, yet the biological potency of many analogs is compromised by their inefficient activation through cellular 2′-deoxyribonucleoside kinases (dNKs). We herein report the directed evolution and characterization of an orthogonal NA kinase for 3′-deoxythymidine (ddT), using a new FACS-based screening protocol in combination with a fluorescent analog of ddT. Four rounds of random mutagenesis and DNA shuffling of Drosophila melanogaster 2′-deoxynucleoside kinase, followed by FACS analysis, yielded an orthogonal ddT kinase with a 6-fold higher activity for the NA and a 20-fold kcat/KM preference for ddT over thymidine, an overall 10 000-fold change in substrate specificity. The contributions of individual amino acid substitutions in the ddT kinase were evaluated by reverse engineering, enabling a detailed structure–function analysis to rationalize the observed changes in performance. Based on our results, kinase engineering with fluorescent NAs and FACS should prove a highly versatile method for evolving selective kinase:NA pairs and for studying fundamental aspects of the structure–function relationship in dNKs

    Genome-wide selection footprints and deleterious variations in young Asian allotetraploid rapeseed

    Brassica napus (AACC, 2n=38) is an important oilseed crop grown worldwide. However, little is known about the population evolution of this species, the genomic difference between its major genetic groups, such as European and Asian rapeseed, and the impacts of historical large-scale introgression events on this young tetraploid. In this study, we report the de novo assembly of the genome sequences of an Asian rapeseed (B. napus), Ningyou 7, and its four progenitors and compare these genomes with other available genomic data from diverse European and Asian cultivars. Our results show that Asian rapeseed originally derived from European rapeseed but subsequently diverged significantly, with rapid genome differentiation after hybridization and intensive local selective breeding. The first historical introgression of B. rapa dramatically broadened the allelic pool but decreased the deleterious variations of Asian rapeseed. The second historical introgression of the double-low traits of European rapeseed (canola) has reshaped Asian rapeseed into two groups (double-low and double-high), accompanied by an increase in genetic load in the double-low group. This study demonstrates distinctive genomic footprints and deleterious SNP (Single Nucleotide Polymorphism) variants for local adaptation by recent intra- and interspecies introgression events and provides novel insights for understanding the rapid genome evolution of a young allopolyploid crop.

    Finishing the euchromatic sequence of the human genome

    The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ∼99% of the euchromatic genome and is accurate to an error rate of ∼1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

    Trajectory Modeling by Distributed Gaussian Processes in Multiagent Systems

    This paper considers a trajectory modeling problem for a multi-agent system using Gaussian processes. The Gaussian process, as a typical data-driven method, is well suited to characterizing model uncertainties and perturbations in a complex environment. To address model uncertainties and noise disturbances, a distributed Gaussian process is proposed to characterize the system model by using local information exchange among neighboring agents, in which a number of agents cooperate without central coordination to estimate a common Gaussian process function based on local measurements and data received from neighbors. In addition, both the continuous-time system model and the discrete-time system model are considered: we design a control Lyapunov function to learn the continuous-time model, and a distributed model predictive control-based approach is used to learn the discrete-time model. Furthermore, we apply a Kullback–Leibler average consensus fusion algorithm to fuse the local prediction results (mean and variance) of the desired Gaussian process. The performance of the proposed distributed Gaussian process is analyzed and verified by two trajectory tracking examples.
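
    As a rough illustration of the fusion step mentioned in this abstract, the sketch below computes the weighted Kullback–Leibler average of scalar Gaussian predictions. For Gaussians this average is again Gaussian, with precision (inverse variance) equal to the weighted average of the local precisions and precision-weighted mean equal to the weighted average of the local ones. The function name `kl_average_fusion`, the scalar setting, and the uniform default weights are assumptions made for this example; the paper's consensus iteration and weight choice are not given in the abstract.

```python
from typing import Optional, Sequence, Tuple


def kl_average_fusion(
    means: Sequence[float],
    variances: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[float, float]:
    """Fuse scalar Gaussian predictions N(mean_i, var_i) by their weighted
    Kullback-Leibler average. Returns (fused_mean, fused_variance)."""
    n = len(means)
    if n == 0 or len(variances) != n:
        raise ValueError("means and variances must be non-empty and equally long")
    if weights is None:
        weights = [1.0 / n] * n  # assumed default: uniform weights
    if len(weights) != n or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must match the inputs and sum to 1")
    if any(v <= 0.0 for v in variances):
        raise ValueError("variances must be positive")

    # Weighted average of the precisions 1 / var_i.
    precision = sum(w / v for w, v in zip(weights, variances))
    # Weighted average of the precision-weighted means mean_i / var_i.
    info_mean = sum(w * m / v for w, m, v in zip(weights, means, variances))

    fused_variance = 1.0 / precision
    fused_mean = info_mean / precision
    return fused_mean, fused_variance


if __name__ == "__main__":
    # Two agents predict the same quantity with different confidence.
    print(kl_average_fusion([1.0, 3.0], [1.0, 4.0]))
```

    The fused mean is pulled toward the more confident agent (the one with the smaller variance). In a consensus scheme, each agent would repeat such a step with its neighbors' current estimates.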

    A Simple Regularized Multiple Criteria Linear Programs for Binary Classification

    Optimization is an important tool in computational finance and business intelligence. The multiple criteria mathematical program (MCMP), which is concerned with mathematical optimization problems involving more than one objective function to be optimized simultaneously, is one of the ways of utilizing optimization techniques. Due to the existence of multiple objectives, MCMPs are usually difficult to optimize. In fact, for a nontrivial MCMP, there does not exist a single solution that optimizes all the objectives at the same time. In practice, many methods convert the original MCMP into a single-objective program and solve the resulting scalarized optimization problem. If the values of the scalarization parameters, which measure the trade-offs between the conflicting objectives, are not chosen carefully, the converted single-objective optimization problem may not be solvable. Therefore, to make sure an MCMP can always be solved successfully, heuristic search and expert knowledge for deciding the values of the scalarization parameters are always necessary, which is not an easy task and limits the applications of MCMP to some extent. In this paper, we take the multiple criteria linear program (MCLP) for binary classification as the example and discuss how to modify the formulation of MCLP directly to guarantee solvability. In detail, we propose adding a quadratic regularization term to the converted single-objective linear program. The new regularized formulation not only overcomes some defects of the original scalarized problem in modeling, but it can also be shown in theory that finite optimal solutions always exist. To test the performance of the proposed method, we compare our algorithm with several state-of-the-art algorithms for binary classification on several different kinds of datasets. Preliminary experimental results demonstrate the effectiveness of our regularization method.