713 research outputs found

Designing single guide RNAs for CRISPR/Cas9

Author: Bhagwat Neha Atul
Publication venue: SJSU ScholarWorks
Publication date: 24/05/2019

Researchers have been working to develop tools that facilitate the regular use of genome engineering techniques. In recent years, these efforts have focused on Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/CRISPR-associated (Cas) systems. These systems, while found naturally in bacteria and archaea as an immunity mechanism, can be used for genome engineering in eukaryotes. There are three major computational challenges associated with the use of CRISPR/Cas9 in genome engineering for mammals: identification of CRISPR arrays, single guide RNA design, and minimizing off-target effects. This project attempts to solve the problem of single guide RNA design using a novel approach. Researchers have previously approached the problem with various machine learning classification algorithms trained on the sequential and structural properties of single guide RNAs (sgRNAs). This project explores a neural network based approach to the sgRNA design problem. A form of the Recurrent Neural Network (RNN) called the Long Short-Term Memory (LSTM) model can be used as a feature-less classification model to differentiate between functional and non-functional sgRNAs. The project covers experiments conducted using Support Vector Machine and Random Forest classifiers with sequential and structural features to identify the most potent sgRNAs in a given set of input sgRNAs. It also summarizes the implementation of the LSTM model and its results, along with the cross-validation results for each of these models. These results show that LSTMs perform better than existing models such as Random Forest classifiers and Support Vector Machines and give results comparable to existing tools.
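The abstract does not say how sequences are presented to the LSTM. One common choice for a feature-less sequence model, assumed here purely for illustration, is to one-hot encode each nucleotide. The helper name `one_hot_encode` is hypothetical and not taken from the project.

```python
# Hypothetical helper: one-hot encode an sgRNA sequence so that a sequence
# model can consume it without hand-crafted features. The encoding scheme
# is an assumption; the abstract does not specify the input representation.
NUCLEOTIDES = "ACGT"

def one_hot_encode(sgrna):
    """Return a list of 4-element vectors, one per nucleotide."""
    seq = sgrna.upper().replace("U", "T")  # treat RNA uracil as T
    encoded = []
    for base in seq:
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base: {base!r}")
        vec = [0] * len(NUCLEOTIDES)
        vec[NUCLEOTIDES.index(base)] = 1
        encoded.append(vec)
    return encoded

example = one_hot_encode("GACU")
```

Each sgRNA then becomes a matrix with one row per position, which is the shape a recurrent model typically expects.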

SJSU ScholarWorks

Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction

Author: Zhong Wei
Publication venue: ScholarWorks @ Georgia State University
Publication date: 01/01/2006

Protein tertiary structure plays a very important role in determining a protein's possible functional sites and chemical interactions with other related proteins. Experimental methods to determine protein structure are time consuming and expensive. As a result, the gap between known protein sequences and known structures has widened substantially due to high-throughput sequencing techniques. The limitations of experimental methods motivate us to develop computational algorithms for protein structure prediction. In this work, a clustering system is used to predict local protein structure. First, recurring sequence clusters are explored with an improved K-means clustering algorithm. Carefully constructed sequence clusters are used to predict local protein structure. After obtaining the sequence clusters and motifs, we study how sequence variation within a cluster may influence its structural similarity. Analysis of the relationship between sequence variation and structural similarity shows that sequence clusters with tight sequence variation have high structural similarity and sequence clusters with wide sequence variation have poor structural similarity. Based on this knowledge, the established clustering system is used to predict the tertiary structure of local sequence segments. Test results indicate that the highest quality clusters can give highly reliable prediction results and high quality clusters can give reliable prediction results. In order to improve the performance of the clustering system for local protein structure prediction, a novel computational model called Clustering Support Vector Machines (CSVMs) is proposed. In our previous work, the sequence-to-structure relationship was explored with the conventional K-means algorithm. The K-means clustering algorithm may not capture the nonlinear sequence-to-structure relationship effectively. As a result, we consider using a Support Vector Machine (SVM) to capture the nonlinear sequence-to-structure relationship. However, SVMs are not well suited to huge datasets containing millions of samples. Therefore, we propose a novel computational model called CSVMs. Taking advantage of both the theory of granular computing and advanced statistical learning methodology, CSVMs are built specifically for each information granule partitioned intelligently by the clustering algorithm. Compared with the clustering system introduced previously, our experimental results show that accuracy for local structure prediction is improved noticeably when CSVMs are applied.
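The "cluster first, then train one model per information granule" idea can be sketched as follows. This is only an illustration of the routing scheme. Plain Lloyd's K-means and a nearest-class-mean classifier are used as stand-ins, whereas the thesis uses an improved K-means and SVMs, neither of which is reproduced here. All function names and the toy data are hypothetical.

```python
# Sketch of "partition with clustering, then fit a local model per cluster".
# Stand-ins (assumptions): plain K-means and a nearest-class-mean classifier
# replace the improved K-means and the per-cluster SVMs described above.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def nearest(point, centers):
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans(points, k, iterations=20):
    centers = [list(p) for p in points[:k]]  # deterministic init (assumption)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # keep the old center if a cluster ends up empty
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def train_local_models(points, labels, centers):
    """For each cluster, store one mean vector per class label."""
    by_cluster = [{} for _ in centers]
    for p, y in zip(points, labels):
        by_cluster[nearest(p, centers)].setdefault(y, []).append(p)
    return [{y: mean(ps) for y, ps in classes.items()} for classes in by_cluster]

def predict(point, centers, models):
    """Route the point to its cluster, then query that cluster's local model."""
    model = models[nearest(point, centers)]
    return min(model, key=lambda y: sq_dist(point, model[y]))

# Toy data: two well-separated groups, each containing two classes.
X = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [1.0, 0.0], [10.0, 11.0], [11.0, 10.0]]
y = ["a", "c", "b", "a", "d", "c"]
centers = kmeans(X, k=2)
models = train_local_models(X, y, centers)
```

Because each local model only sees the samples in its own cluster, training cost per model stays small even when the full dataset is very large, which is the motivation the abstract gives for the design.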

CiteSeerX

ScholarWorks @ Georgia State University

A Balanced Secondary Structure Predictor

Author: Islam Md Nasrul
Publication venue: ScholarWorks@UNO
Publication date: 15/05/2015

Secondary structure (SS) refers to the local spatial organization of the polypeptide backbone atoms of a protein. Accurate prediction of SS is a vital clue for resolving the 3D structure of a protein. SS has three components: helix (H), beta (E), and coil (C). Most SS predictors are imbalanced: their accuracy in predicting helix and coil is high, but significantly lower for beta. The objective of this thesis is to develop a balanced SS predictor that achieves good accuracy in all three SS components. We propose a novel approach to this problem that combines a genetic algorithm (GA) with a support vector machine. We prepared two test datasets (CB471 and N295) to compare the performance of our predictor with SPINE X. The overall accuracy of our predictor was 76.4% and 77.2% on the CB471 and N295 datasets, respectively, while SPINE X gave 76.5% overall accuracy on both test datasets.
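The notion of an imbalanced predictor can be made concrete by computing accuracy separately for each state alongside the overall figure. The sketch below assumes the usual per-residue definitions (overall accuracy is the fraction of residues predicted correctly; per-state accuracy is the fraction of residues observed in that state that are predicted correctly). The function name and the toy strings are hypothetical.

```python
# Overall and per-state accuracy for three-state secondary structure strings.
# Definitions are the standard per-residue ones and are an assumption here;
# the abstract does not spell out its exact metric.

def per_state_accuracy(observed, predicted, states="HEC"):
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    per_state = {}
    for s in states:
        total = sum(1 for o in observed if o == s)
        correct = sum(1 for o, p in zip(observed, predicted) if o == s and p == s)
        per_state[s] = correct / total if total else None  # None: state absent
    overall = sum(o == p for o, p in zip(observed, predicted)) / len(observed)
    return overall, per_state

observed = "HHHHEEEECC"
predicted = "HHHHCCEECC"
overall, per_state = per_state_accuracy(observed, predicted)
```

In this toy case the overall accuracy looks reasonable while the beta (E) accuracy is only half that of helix and coil, which is the kind of imbalance the thesis targets.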

University of New Orleans

Gene expression profiling of papillary thyroid carcinomas in Korean patients by oligonucleotide microarrays

Author: Aldred, Bashyam, Benjamini, Benjamini, Bergstrom, Carpi, Carvallo, Conacci-Sorrell, Davies, Giordano, Gorres, Griffith, Huang, Ki-Wook Chung, Kim, Krause, Mineo, Ministry of Health & Welfare, Nikolova, Nishiyama, Paik, Reiner, Rhodes, Rodrigues, Seok Won Kim, Sun Wook Kim, Teodoro, Thomassen, Turashvili, Wasenius, Yu, Yukinawa, Zhang
Publication venue: The Korean Surgical Society
Publication date: 01/01/2012

Crossref

PubMed Central

Integrated mining of feature spaces for bioinformatics domain discovery

Author: Chowriappa Pradeep
Publication venue: Louisiana Tech Digital Commons
Publication date: 01/10/2008

One of the major challenges in the field of bioinformatics is the elucidation of protein folding for the functional annotation of proteins. The factors that govern protein folding include the chemical, physical, and environmental conditions of the protein's surroundings, which can be measured and exploited for computational discovery purposes. These conditions enable the protein to transform from a sequence of amino acids to a globular three-dimensional structure. Information concerning the folded state of a protein has significant potential to explain biochemical pathways and their involvement in disorders and diseases. This information impacts the ways in which genetic diseases are characterized and cured and in which designer drugs are created. With the exponential growth of protein databases and the limitations of experimental protein structure determination, sophisticated computational methods have been developed and applied to search for, detect, and compare protein homology. Most computational tools developed for protein structure prediction are primarily based on sequence similarity searches. These approaches have improved the prediction accuracy for proteins with high sequence similarity but have failed to perform well on proteins with low sequence similarity. Data mining offers unique algorithmic computational approaches that have been used widely in the development of automatic protein structure classification and prediction. In this dissertation, we present a novel approach for the integration of physico-chemical properties and effective feature extraction techniques for the classification of proteins. Our approaches overcome one of the major obstacles of data mining in protein databases: the encapsulation of different hydrophobicity residue properties into a much reduced feature space that possesses high degrees of specificity and sensitivity in protein structure classification. We have developed three unique computational algorithms for coherent feature extraction on selected scale properties of the protein sequence. When faced with the problem of the unequal cardinality of proteins, our proposed integration scheme effectively handles the varied sizes of proteins and scales well with increasing dimensionality of these sequences. We also detail a two-fold methodology for protein functional annotation. First, we present an algorithm that integrates multiple physico-chemical properties in the form of a multi-layered abstract feature space, with each layer corresponding to a physico-chemical property. Second, we discuss a wavelet-based segmentation approach that efficiently detects regions of property conservation across all layers of the created feature space. Finally, we present a unique graph-theory based algorithmic framework for the identification of conserved hydrophobic residue interaction patterns using identified scales of hydrophobicity. We report that these discriminatory features are specific to a family of proteins, which consist of conserved hydrophobic residues that are then used for structural classification. We also present our rigorously tested validation schemes, which report significant degrees of accuracy and show that homologous proteins exhibit the conservation of physico-chemical properties along the protein backbone. We conclude our discussion by summarizing our results and contributions and by listing our goals for future research.

Louisiana Tech Digital Commons

Characterising and Predicting Amyloid Mutations in Proteins

Author: Gardner Allison
Publication venue
Publication date: 31/12/2016

The University of Manchester - Institutional Repository

USE OF IMAGE PROCESSING TECHNIQUES AND MACHINE LEARNING FOR BETTER UNDERSTANDING OF T GONDII BIOLOGY

Author: Asiri Amer
Publication venue: UKnowledge
Publication date: 01/01/2022

Almost one in every three people worldwide is infected with Toxoplasma gondii (T. gondii). The biology and growth of the parasite's bradyzoite form in host tissue cysts are not well understood. T. gondii's metabolic state influences the morphology of its single mitochondrion, which can be visualized using fluorescence microscopy with specific dyes. Hence, fluorescence microscopy images of cysts purified from infected mouse brains carry biological information about bradyzoites, the poorly understood form of the parasite within them. Using fluorescence microscopy techniques, previous studies extracted images of the mitochondrion, nucleus, and inner membrane complex (IMC), providing information on T. gondii's cysts and paving the way for image processing techniques and machine learning to analyze the bradyzoite form of the parasite. Previously, multivariate logistic regression (MLG) was used to classify mitochondrial shapes. In the present study, in addition to the previously used MLG model, two other machine learning models, Support Vector Machine (SVM) and K Nearest Neighbors (KNN), were used to explore the possibility of better model selection for mitochondrial classification. Minimal model error was used as the criterion for optimizing classification performance. Error in any machine learning model is driven by bias, variance, and noise. Through trial and error, the optimal hyperparameters for each model were selected to minimize error. The dataset consisted of 1940 labeled mitochondrial objects with 22 features, divided into five classes: Blob, Tadpole, Donut, Arc, and Other. 50% of the dataset was used for training, and the other 50% was used for testing. The overall accuracies of the MLG, SVM, and KNN models were 79.1%, 78.9%, and 80.3%, respectively. Overall classification performance varied little, but the F score for some classes, such as Tadpole and Donut, improved with the two newer models. One of the 22 features used was an application of the Histogram of Oriented Gradients (HOG). The HOG feature was replaced with a novel feature that used linear regression of object boundary segments to extract the HOG for only the object's boundary. The model that used Boundary HOG showed some improvement over the model that used the HOG feature. Finally, a new module including a graphical user interface was developed to process and extract shape and intensity information from TgIMC3 images, which facilitates further investigation of the parasite's biology.
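As a rough illustration of two ingredients named above, the sketch below shows a plain K Nearest Neighbors majority vote and a per-class F score. It is a minimal sketch under assumptions: Euclidean distance, unweighted votes, and the F1 form of the F score are choices made here, and the two-feature toy data is invented for the example rather than drawn from the 22-feature dataset.

```python
from collections import Counter

# Minimal KNN classifier: Euclidean distance and unweighted majority vote
# are assumptions; the study's actual hyperparameters are not given above.
def knn_predict(train_X, train_y, query, k=3):
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Per-class F1 score: 2*TP / (2*TP + FP + FN), or 0.0 if the class never occurs.
def f1_for_class(true_labels, predicted_labels, cls):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical two-feature toy data using two of the class names from the text.
train_X = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]]
train_y = ["Blob", "Blob", "Blob", "Donut", "Donut", "Donut"]
label = knn_predict(train_X, train_y, [1.1, 1.0])
```

Reporting the F score per class, rather than overall accuracy alone, is what lets differences on individual shapes such as Tadpole and Donut show up even when overall accuracy is nearly identical across models.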

University of Kentucky

SLIMSVM : a simple implementation of support vector machine for analysis of microarray data

Author: Karmaker Avik
Publication venue: Digital Commons @ NJIT
Publication date: 31/08/2004

Support Vector Machine (SVM) is a supervised machine learning technique widely used in multiple areas of biological analysis, including microarray data analysis. SlimSVM has been developed with the intention of replacing OSU SVM as the classification component of GenoIterSVM in order to make it independent of other SVM packages. GenoIterSVM, developed by Dr. Marc Ma, is an SVM implementation with an iterative refinement algorithm for improved accuracy in classifying genotype microarray data. SlimSVM is an object-oriented, modular, and easy-to-use implementation written in C++. It supports dot (linear) and polynomial (non-linear) kernels. The program has been tested with artificial non-biological data and with microarray data. Testing with microarray data was performed to observe how SlimSVM handles medium-sized data files (containing thousands of data points), since it would ultimately be used to analyze them. The results were compared to those of LIBSVM, a leading SVM software package, and the comparison demonstrates that SlimSVM was implemented accurately.
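The two kernel families mentioned above can be written down in a few lines. The sketch below is in Python rather than SlimSVM's C++ and is not taken from its source; the parameter names `degree`, `coef0`, and `gamma` and their defaults are assumptions, following the common form K(x, z) = (gamma * x·z + coef0)^degree.

```python
# Dot (linear) and polynomial kernels in their textbook form.
# Parameter names and defaults are assumptions, not SlimSVM's actual API.

def dot_kernel(x, z):
    """Linear kernel: the plain inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0, gamma=1.0):
    """Polynomial kernel: (gamma * <x, z> + coef0) ** degree."""
    return (gamma * dot_kernel(x, z) + coef0) ** degree

linear_value = dot_kernel([1, 2], [3, 4])
poly_value = polynomial_kernel([1, 2], [3, 4])
```

With `degree=1`, `coef0=0.0`, and `gamma=1.0` the polynomial kernel reduces to the dot kernel, which is a convenient sanity check when comparing two implementations.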

Digital Commons @ New Jersey Institute of Technology (NJIT)

Scoring functions for protein docking and drug design

Author: Viswanath Shruthi
Publication venue
Publication date: 26/06/2014

Predicting the structure of complexes formed by two interacting proteins is an important problem in computational structural biology. Proteins perform many of their functions by binding to other proteins. The structure of protein-protein complexes provides atomic details about protein function and biochemical pathways, and can help in designing drugs that inhibit binding. Docking computationally models the structure of protein-protein complexes, given three-dimensional structures of the individual chains. Protein docking methods have two phases. In the first phase, a comprehensive, coarse search is performed for optimally docked models. In the second phase, refinement and reranking, the models from the first phase are refined and reranked, with the expectation of extracting a small set of accurate models from the pool of thousands of models obtained from the first phase. In this thesis, new algorithms are developed for the refinement and reranking phase of docking. New scoring functions, or potentials, that rank models are developed. These potentials are learnt using large-scale machine learning methods based on mathematical programming. The procedure for learning these potentials involves examining hundreds of thousands of correct and incorrect models. In this thesis, hierarchical constraints were introduced into the learning algorithm. First, an atomic potential was developed using this learning procedure. A refinement procedure involving side-chain remodeling and conjugate gradient-based minimization was introduced. The refinement procedure combined with the atomic potential was shown to improve docking accuracy significantly. Second, a hydrogen bond potential was developed. Molecular dynamics-based sampling combined with the hydrogen bond potential improved docking predictions. Third, mathematical programming compared favorably to SVMs and neural networks in terms of accuracy, training time, and test time for the task of designing potentials to rank docking models. The methods described in this thesis are implemented in the docking package DOCK/PIERR. DOCK/PIERR was shown to be among the best automated docking methods in community-wide assessments. Finally, DOCK/PIERR was extended to predict membrane protein complexes. A membrane-based score was added to the reranking phase and shown to improve the accuracy of docking. This docking algorithm for membrane proteins was used to study the dimers of amyloid precursor protein, implicated in Alzheimer's disease.

Computer Science

Texas ScholarWorks