867 research outputs found

    Construction of Gene Regulatory Networks Using Recurrent Neural Networks and Swarm Intelligence


    ์ƒ๋ฌผํ•™์  ์„œ์—ด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ‘œํ˜„ ํ•™์Šต

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2021. Advisor: Sungroh Yoon.
    As we are living in the era of big data, the biomedical domain is no exception. With the advent of technologies such as next-generation sequencing, developing methods to capitalize on the explosion of biomedical data is one of the major challenges in bioinformatics. Representation learning, in particular deep learning, has made significant advances in diverse fields where the artificial intelligence community has struggled for many years. However, although representation learning has also shown great promise in bioinformatics, it is not a silver bullet. Off-the-shelf applications of representation learning cannot always provide successful results for biological sequence data, and many challenges and opportunities remain to be explored. This dissertation presents a set of representation learning methods to address three issues in biological sequence data analysis. First, we propose a two-stage training strategy to address the trade-off between throughput and information in wet-lab CRISPR-Cpf1 activity experiments. Second, we propose an encoding scheme that models the interaction between two sequences for functional microRNA target prediction. Third, we propose a self-supervised pre-training method to bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences. In summary, this dissertation proposes a set of representation learning methods that can derive invaluable information from biological sequence data.
    Table of Contents
    1 Introduction
        1.1 Motivation
        1.2 Contents of Dissertation
    2 Background
        2.1 Representation Learning
        2.2 Deep Neural Networks
            2.2.1 Multi-layer Perceptrons
            2.2.2 Convolutional Neural Networks
            2.2.3 Recurrent Neural Networks
            2.2.4 Transformers
        2.3 Training of Deep Neural Networks
        2.4 Representation Learning in Bioinformatics
        2.5 Biological Sequence Data Analyses
        2.6 Evaluation Metrics
    3 CRISPR-Cpf1 Activity Prediction
        3.1 Methods
            3.1.1 Model Architecture
            3.1.2 Training of Seq-deepCpf1 and DeepCpf1
        3.2 Experiment Results
            3.2.1 Datasets
            3.2.2 Baselines
            3.2.3 Evaluation of Seq-deepCpf1
            3.2.4 Evaluation of DeepCpf1
        3.3 Summary
    4 Functional microRNA Target Prediction
        4.1 Methods
            4.1.1 Candidate Target Site Selection
            4.1.2 Input Encoding
            4.1.3 Residual Network
            4.1.4 Post-processing
        4.2 Experiment Results
            4.2.1 Datasets
            4.2.2 Classification of Functional and Non-functional Targets
            4.2.3 Distinguishing High-functional Targets
            4.2.4 Ablation Studies
        4.3 Summary
    5 Self-supervised Learning of Protein Representations
        5.1 Methods
            5.1.1 Pre-training Procedure
            5.1.2 Fine-tuning Procedure
            5.1.3 Model Architecture
        5.2 Experiment Results
            5.2.1 Experiment Setup
            5.2.2 Pre-training Results
            5.2.3 Fine-tuning Results
            5.2.4 Comparison with Larger Protein Language Models
            5.2.5 Ablation Studies
            5.2.6 Qualitative Interpretation Analyses
        5.3 Summary
    6 Discussion
        6.1 Challenges and Opportunities
    7 Conclusion
    Bibliography
    Abstract in Korean
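    The third contribution above describes self-supervised pre-training on unlabeled protein sequences. As a rough, generic sketch of that idea only (not the dissertation's actual model: the PyTorch framework, amino-acid vocabulary, encoder size, and 15% masking rate are all assumptions made for illustration), a masked-residue prediction objective could look like this:

    # Minimal sketch of masked-token pre-training on protein sequences (PyTorch).
    # Vocabulary, model size, and masking rate are illustrative assumptions.
    import random
    import torch
    import torch.nn as nn

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    PAD, MASK = 0, 1                                  # special token ids
    VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}
    VOCAB_SIZE = len(VOCAB) + 2

    class ProteinEncoder(nn.Module):
        def __init__(self, d_model=64, nhead=4, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD)
            layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.head = nn.Linear(d_model, VOCAB_SIZE)   # predicts the masked residue

        def forward(self, tokens):
            return self.head(self.encoder(self.embed(tokens)))

    def mask_tokens(tokens, rate=0.15):
        """Mask a fraction of residues; unmasked targets are -100 so the loss ignores them."""
        masked = tokens.clone()
        targets = torch.full_like(tokens, -100)
        for i in range(tokens.size(0)):
            length = int((tokens[i] != PAD).sum())
            for j in random.sample(range(length), max(1, int(rate * length))):
                targets[i, j] = tokens[i, j]
                masked[i, j] = MASK
        return masked, targets

    # One toy pre-training step on two hypothetical sequences.
    seqs = ["MKTAYIAKQR", "GAVLIPFMWS"]
    batch = torch.tensor([[VOCAB[aa] for aa in s] for s in seqs])
    model = ProteinEncoder()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    masked, targets = mask_tokens(batch)
    logits = model(masked)
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1), ignore_index=-100)
    loss.backward()
    optim.step()
    print(f"masked-LM loss: {loss.item():.3f}")

    In the usual pattern that chapter 5's pre-training and fine-tuning procedures suggest, an encoder pre-trained this way would subsequently be fine-tuned on a much smaller labeled protein dataset.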

    Deep diversification of an AAV capsid protein by machine learning.

    Modern experimental technologies can assay large numbers of biological sequences, but engineered protein libraries rarely exceed the sequence diversity of natural protein families. Machine learning (ML) models trained directly on experimental data without biophysical modeling provide one route to accessing the full potential diversity of engineered proteins. Here we apply deep learning to design highly diverse adeno-associated virus 2 (AAV2) capsid protein variants that remain viable for packaging of a DNA payload. Focusing on a 28-amino acid segment, we generated 201,426 variants of the AAV2 wild-type (WT) sequence, yielding 110,689 viable engineered capsids, 57,348 of which surpass the average diversity of natural AAV serotype sequences, with 12-29 mutations across this region. Even when trained on limited data, deep neural network models accurately predict capsid viability across diverse variants. This approach unlocks vast areas of functional but previously unreachable sequence space, with many potential applications for the generation of improved viral vectors and protein therapeutics.
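    The abstract states that deep neural network models predict capsid viability from a 28-amino-acid segment even when trained on limited data. Below is a minimal sketch of that general kind of sequence-to-viability classifier, assuming a one-hot encoding, a small 1D CNN in PyTorch, and made-up variant segments and labels; it is not the paper's actual model or data:

    # Minimal sketch: predict viability of a 28-residue capsid segment (PyTorch).
    # The encoding, architecture, toy segments, and labels are illustrative assumptions.
    import torch
    import torch.nn as nn

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    SEG_LEN = 28

    def one_hot(segment):
        """Encode a 28-residue segment as a (20, 28) one-hot tensor."""
        x = torch.zeros(len(AMINO_ACIDS), SEG_LEN)
        for pos, aa in enumerate(segment):
            x[AA_INDEX[aa], pos] = 1.0
        return x

    class ViabilityNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(20, 32, kernel_size=5, padding=2),  # scan local sequence motifs
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),                      # pool over positions
                nn.Flatten(),
                nn.Linear(32, 1),                             # viability logit
            )

        def forward(self, x):
            return self.net(x).squeeze(-1)

    # Toy batch: two synthetic 28-residue segments with made-up viability labels.
    variants = ["ACDEFGHIKLMNPQRSTVWYACDEFGHI", "A" * 28]
    labels = torch.tensor([1.0, 0.0])
    batch = torch.stack([one_hot(v) for v in variants])

    model = ViabilityNet()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = nn.functional.binary_cross_entropy_with_logits(model(batch), labels)
    loss.backward()
    optim.step()
    print(f"toy training loss: {loss.item():.3f}")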

    Cell Type Classification Via Deep Learning On Single-Cell Gene Expression Data

    Single-cell sequencing is a recently developed, revolutionary technology that enables researchers to obtain genomic, transcriptomic, or multi-omics information through gene expression analysis. Compared to traditional sequencing methods, it has the advantage of resolving highly heterogeneous cell-type information, which is making it increasingly popular in the biomedical field. Such analysis can also support early diagnosis and drug development for tumor and cancer cell types. In the gene expression profiling workflow, identification of cell types is an important task, but it faces many challenges, including the curse of dimensionality, sparsity, batch effects, and overfitting. These challenges can be mitigated by feature selection techniques that reduce the feature dimension while retaining the most relevant features. In this work, a recurrent neural network-based feature selection model is proposed to extract relevant features from high-dimensional, low-sample-size data. In addition, a deep learning-based gene embedding model is proposed to reduce the data sparsity of single-cell data for cell type identification. The proposed frameworks have been implemented with different recurrent neural network architectures and evaluated on real-world microarray datasets and single-cell RNA-seq data, where they perform better than other feature selection models. A semi-supervised model is also implemented using the same gene embedding workflow, since labeling data is cumbersome, time consuming, and requires manual effort and domain expertise; different ratios of labeled data are used in the experiments to validate the approach. Experimental results show that the proposed semi-supervised approach performs encouragingly even when only a limited amount of labeled data is available. In addition, a graph attention-based autoencoder model is studied to learn latent features by incorporating prior knowledge with gene expression data for cell type classification. Index Terms: Single-Cell Gene Expression Data, Gene Embedding, Semi-Supervised Model, Incorporating Prior Knowledge, Gene-Gene Interaction Network, Deep Learning, Graph Autoencoder.
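    The abstract proposes a gene embedding model that reduces the sparsity and dimensionality of single-cell expression data before cell-type classification. A minimal, generic sketch of that idea (assuming a plain dense autoencoder in PyTorch with arbitrary layer sizes and random toy data, rather than the proposed RNN-based or graph-attention models) might look like:

    # Minimal sketch: compress sparse single-cell expression profiles with an
    # autoencoder, then classify cell types from the embedding (PyTorch).
    # Gene count, embedding size, and the random data are illustrative assumptions.
    import torch
    import torch.nn as nn

    N_GENES, N_CELL_TYPES, EMBED_DIM = 2000, 5, 32

    class ExpressionAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(N_GENES, 256), nn.ReLU(),
                                         nn.Linear(256, EMBED_DIM))
            self.decoder = nn.Sequential(nn.Linear(EMBED_DIM, 256), nn.ReLU(),
                                         nn.Linear(256, N_GENES))

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), z

    # Toy data: 64 cells with mostly-zero (sparse) expression and random cell-type labels.
    x = torch.rand(64, N_GENES) * (torch.rand(64, N_GENES) < 0.1).float()  # ~90% zeros
    y = torch.randint(0, N_CELL_TYPES, (64,))

    autoencoder = ExpressionAutoencoder()
    classifier = nn.Linear(EMBED_DIM, N_CELL_TYPES)
    optim = torch.optim.Adam(list(autoencoder.parameters()) + list(classifier.parameters()), lr=1e-3)

    recon, z = autoencoder(x)
    loss = nn.functional.mse_loss(recon, x) + nn.functional.cross_entropy(classifier(z), y)
    loss.backward()
    optim.step()
    print(f"reconstruction + classification loss: {loss.item():.3f}")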

    Evaluation of machine learning approaches for prediction of protein coding genes in prokaryotic DNA sequences

    According to the National Human Genome Research Institute, the amount of genomic data generated each year is constantly increasing. This rapid growth has led to a surge in demand for efficient analysis and handling of such data. Gene prediction involves identifying the regions of a DNA sequence that code for proteins, called protein-coding genes. This task falls within the scope of bioinformatics, and there has been surprisingly little development in this field over the past years. Although capable state-of-the-art gene prediction tools exist, there is still room for improvement in terms of efficiency and accuracy. Advances in gene prediction can, among other things, aid the medical and pharmaceutical industries, as well as impact environmental and anthropological research. Machine learning techniques such as Random Forest classifiers and Artificial Neural Networks (ANNs) have proved successful at the task of gene prediction. In this thesis, one deep learning model and two other machine learning models were tested. The first model implemented was the established Random Forest classifier. For ensemble methods such as the Random Forest classifier, feature engineering is critical to success. The exploration of different feature selection and extraction techniques underscored its relevance; it also showed that feature importance varies greatly among genomes and revealed possibilities that can be explored in future work. The second model tested was the ensemble method Extreme Gradient Boosting (XGBoost), which served as a strong competitor to the Random Forest classifier. Finally, a Recurrent Neural Network (RNN) was implemented; RNNs are known to handle sequential data well, which makes them a good candidate for gene prediction. The 15 prokaryotic genomes used to train the models were extracted from the NCBI genome database. Each model was tasked with classifying sub-sequences of the genomes, called open reading frames (ORFs), as either protein-coding ORFs or non-coding ORFs. One challenge when preparing these datasets was that the number of protein-coding ORFs was very small compared to the number of non-coding ORFs. Another was that protein-coding ORFs are in general longer than non-coding ORFs, which can bias the models to simply classify long ORFs as protein-coding and short ORFs as non-coding. For these reasons, two datasets were created for each genome, taking each imbalance into account. The models were trained, tuned and tested on both datasets for all genomes, as well as on a combination of genomes, and were evaluated with regard to accuracy, precision and recall. The results show that all three methods have potential and attained broadly similar performance scores. Although both time and data were limited during model development, the models still yielded promising results, and since several parameters have not yet been tuned in all models, many possibilities for further research remain. The fact that a relatively simple RNN architecture performed so well, and requires no feature engineering, shows great promise for further applications in gene prediction, and possibly other fields in bioinformatics.
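    The thesis classifies ORFs as protein-coding or non-coding with, among other models, a Random Forest built on engineered features. A minimal sketch of that workflow, assuming scikit-learn, a made-up feature set (length, GC content, a crude codon-usage-bias score), and toy ORFs rather than the thesis's actual features or NCBI data:

    # Minimal sketch: classify ORFs as protein-coding or not with a Random Forest
    # (scikit-learn). Features and toy sequences are illustrative assumptions.
    from collections import Counter
    from sklearn.ensemble import RandomForestClassifier

    def orf_features(orf):
        """Simple hand-crafted features: length, GC content, and a crude codon-usage bias."""
        length = len(orf)
        gc = (orf.count("G") + orf.count("C")) / length
        codons = [orf[i:i + 3] for i in range(0, length - 2, 3)]
        bias = max(Counter(codons).values()) / len(codons)  # fraction of the most common codon
        return [length, gc, bias]

    # Toy training set: hypothetical ORFs labeled 1 (coding) or 0 (non-coding).
    orfs = [
        ("ATGGCTGCTAAAGGTGCTGCTGGTTAA", 1),
        ("ATGAAACGCATTAGCACCACCATTACC", 1),
        ("ATGTTTTTTTTTTTTTTTTTTTTTTAA", 0),
        ("ATGCATATATATATATATATATATTGA", 0),
    ]
    X = [orf_features(seq) for seq, _ in orfs]
    y = [label for _, label in orfs]

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    print(model.predict([orf_features("ATGGGTGCTGCTAAAGCTGGTGCTTAA")]))

    A real pipeline along these lines would extract candidate ORFs from whole genomes, compute a much richer feature set, and handle the class and length imbalances the abstract describes (for example by resampling or length-matched negative sets).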