7 research outputs found

    Winnie09/GPTCelltype: v1.0.1

    Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is a time-consuming process that often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require the acquisition of high-quality reference datasets and the development of additional pipelines. We demonstrated that GPT-4, a highly potent large language model, can automatically and accurately annotate cell types by utilizing marker gene information generated from standard single-cell RNA-seq analysis pipelines in this manuscript (https://www.biorxiv.org/content/10.1101/2023.04.16.537094v1). We developed this software, GPTCelltype, to provide reference-free, cost-effective automated cell type annotation using GPT-4 for single-cell RNA-seq analysis.

    Mathematical modelling and optimization in biological networks and data

    Bioinformatics and network biology provide exciting and challenging research and application areas for applied mathematics and computational science. Bioinformatics is the science of mining, managing and interpreting information from biological structures and sequences, while network biology focuses on analyzing the interactions among components in biological systems. In addition, machine learning and data mining have been advancing in strides, with high-impact applications benefiting science. Although researchers have made efforts to model and analyze biological networks and data, the two areas have largely been developing separately. The theme of this thesis is to derive, analyze and optimize mathematical and numerical models suggested by biological networks, as well as to establish practical algorithms for representing and solving problems in bioinformatics. At the gene level, Boolean networks (BNs) are studied. A Boolean network (BN) is a sequential dynamical system composed of a large number of highly interconnected processing nodes. It is very efficient in modeling genetic regulation, neural networks, cancer networks, quorum sensing circuits, and cellular signaling pathways. To control a BN is to manipulate the values of a subset of the nodes, or to apply external signals to the network, so as to drive it to a desired state. For example, one may need to conduct a therapeutic intervention that drives the cell state of a patient to a benign state. Finding a minimum set of control nodes is shown to be NP-hard. An integer linear programming-based method is then proposed to solve the problem exactly, together with an analysis of bounds. However, previous results imply that $O(N)$ driver nodes are still required if an arbitrary state is specified as the target state, where $N$ is the number of nodes. Considering this complexity, it is proved that only $O(\log_2 M + \log_2 N)$ driver nodes are required for controlling BNs if the targets are restricted to attractors, where $M$ is the number of attractors. Since $M$ is expected not to be very large in many practical networks, this is a significant improvement. This result is based on the discovery of novel relationships between control problems on BNs and the coupon collector's problem, a well-known concept in combinatorics. We also provide an analysis of bounds. Simulation results using artificial and realistic network data support our theoretical findings. In addition, the problem of observability of attractors in BNs is formulated as finding the minimum set of consecutive nodes that determines the attractor cycle. At the molecular level, a framework, K2014, is developed to automatically construct N-glycosylation networks in MATLAB, involving the 27 most up-to-date enzyme reaction rules of 22 enzymes. Our network shows a strong ability to predict a wider range of glycans produced by the enzymes encountered in the Golgi apparatus in human cell expression systems. Furthermore, an orthogonal feature extraction model and a regularized regression method are proposed for biological data analysis. Simulations validate their contribution to the improvement of cancer prognosis and drug side-effect prediction. Thesis metadata: published or final version; Mathematics; Doctoral; Doctor of Philosophy.

    ON CLASSIFICATION OF BIOLOGICAL DATA USING OUTLIER DETECTION

    With the rapid development of information technology, the number of datasets, as well as their complexity and dimension, has been growing dramatically. This dramatic growth of biological data and non-biological commercial databases has become a challenging issue in data mining. Classification is one of the major tools in this research area. However, the performance of classification may be degraded when the databases contain noise. Outlier detection therefore becomes an urgent need, and how to integrate outlier detection methods with classification techniques is an important and challenging issue. In this paper, we propose a novel and effective approach based on k-means clustering to identify outliers in the databases. In particular, we employ one of the best-known classification techniques, the Support Vector Machine (SVM), owing to its ability to handle high-dimensional data sets. We also compare the classification results with those of the multivariate outlier detection method. Numerical results on two different data sets indicate that the classification results after removing the outliers by our proposed method are much better than those obtained with the multivariate outlier detection method.

    GeneSegNet: a deep learning framework for cell segmentation by integrating gene expression and imaging

    When analyzing data from in situ RNA detection technologies, cell segmentation is an essential step in identifying cell boundaries, assigning RNA reads to cells, and studying the gene expression and morphological features of cells. We developed a deep-learning-based method, GeneSegNet, that integrates both gene expression and imaging information to perform cell segmentation. GeneSegNet also employs a recursive training strategy to deal with noisy training labels. We show that GeneSegNet significantly improves cell segmentation performance over existing methods that either ignore gene expression information or underutilize imaging information. ISSN:1474-760

    Additional file 2 of Hadamard Kernel SVM with applications for breast cancer outcome predictions

    Results on RNAseq data. Additional file 2 contains results on RNAseq data for breast cancer outcome predictions. (DOCX 157 kb)