713 research outputs found

    Designing single guide RNAs for CRISPR/Cas9

    Researchers have been working to develop tools that facilitate the regular use of genome engineering techniques. In recent years, these efforts have focused on the Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/CRISPR-associated (Cas) systems. These systems, while found naturally in bacteria and archaea as an immunity mechanism, can be used for genome engineering in eukaryotes. There are three major computational challenges associated with the use of CRISPR/Cas9 in genome engineering for mammals: identification of CRISPR arrays, single guide RNA design, and minimizing off-target effects. This project attempts to solve the problem of single guide RNA design using a novel approach. Researchers have been trying to solve the problem with different machine learning classification algorithms trained on the sequential and structural properties of single guide RNAs (sgRNAs). This project explores a neural-network-based approach to the sgRNA design problem. A form of the Recurrent Neural Network (RNN) called the Long Short-Term Memory (LSTM) model can be used as a feature-less classification model to differentiate between functional and non-functional sgRNAs. The project covers experiments conducted with Support Vector Machine and Random Forest classifiers that use sequential and structural features to identify the most potent sgRNAs in a given set of input sgRNAs. It also summarizes the implementation of the LSTM model and its results, along with the cross-validation results for each of these models. These results show that LSTMs perform better than existing models such as Random Forest classifiers and Support Vector Machines and give results comparable to existing tools.
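
A "feature-less" sequence model such as an LSTM still needs each sgRNA turned into numbers. The abstract does not say which encoding was used; the sketch below assumes one-hot encoding over A, C, G, T, and the helper name `one_hot_encode` is hypothetical.

```python
# Assumption: one-hot encoding is used as the model input; the abstract
# does not specify the encoding. Names here are illustrative only.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sgrna):
    """Return one 4-element row per nucleotide, in A, C, G, T order."""
    rows = []
    for base in sgrna.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base: {base!r}")
        rows.append([1 if base == n else 0 for n in NUCLEOTIDES])
    return rows


# A made-up 5-nucleotide fragment, shortened for readability.
encoded = one_hot_encode("GACGT")
```

Each row could then be fed to a recurrent model one time step at a time, so no hand-crafted sequential or structural features are required.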

    Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction

    A protein's tertiary structure plays an important role in determining its possible functional sites and chemical interactions with other related proteins. Experimental methods to determine protein structure are time-consuming and expensive. As a result, the gap between the number of known protein sequences and known structures has widened substantially because of high-throughput sequencing techniques. These limitations of experimental methods motivate us to develop computational algorithms for protein structure prediction. In this work, a clustering system is used to predict local protein structure. First, recurring sequence clusters are explored with an improved K-means clustering algorithm. Carefully constructed sequence clusters are used to predict local protein structure. After obtaining the sequence clusters and motifs, we study how sequence variation within a sequence cluster may influence its structural similarity. Analysis of the relationship between sequence variation and structural similarity shows that sequence clusters with tight sequence variation have high structural similarity and sequence clusters with wide sequence variation have poor structural similarity. Based on this knowledge, the established clustering system is used to predict the tertiary structure of local sequence segments. Test results indicate that the highest-quality clusters can give highly reliable prediction results and high-quality clusters can give reliable prediction results. In order to improve the performance of the clustering system for local protein structure prediction, a novel computational model called Clustering Support Vector Machines (CSVMs) is proposed. In our previous work, the sequence-to-structure relationship was explored with the conventional K-means algorithm. The K-means clustering algorithm may not capture the nonlinear sequence-to-structure relationship effectively. As a result, we consider using a Support Vector Machine (SVM) to capture the nonlinear sequence-to-structure relationship. However, SVMs are not well suited to huge datasets that include millions of samples. Therefore, we propose a novel computational model called CSVMs. Taking advantage of both the theory of granular computing and advanced statistical learning methodology, CSVMs are built specifically for each information granule partitioned intelligently by the clustering algorithm. Compared with the clustering system introduced previously, our experimental results show that accuracy for local structure prediction improves noticeably when CSVMs are applied.
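
The cluster-then-classify idea can be sketched as a routing step: a sample is assigned to its nearest cluster, and only the classifier trained for that cluster is consulted. This is a minimal illustration, not the authors' implementation; the centroids, the class labels, and the lambda decision rules standing in for trained per-cluster SVMs are all assumptions.

```python
import math


def nearest_cluster(sample, centroids):
    """Index of the centroid closest to the sample (Euclidean distance)."""
    distances = [math.dist(sample, c) for c in centroids]
    return distances.index(min(distances))


def predict_with_cluster_models(sample, centroids, models):
    """Route the sample to its cluster and apply that cluster's classifier."""
    return models[nearest_cluster(sample, centroids)](sample)


# Toy setup: two clusters, each with a stand-in decision rule.
# In a real system each entry would be an SVM trained on one cluster.
centroids = [(0.0, 0.0), (10.0, 10.0)]
models = [
    lambda x: "helix" if x[0] + x[1] > 1.0 else "coil",
    lambda x: "strand" if x[0] > x[1] else "coil",
]

label_a = predict_with_cluster_models((0.9, 0.8), centroids, models)
label_b = predict_with_cluster_models((9.0, 11.0), centroids, models)
```

Because each classifier only ever sees the samples of one cluster, training cost is spread over many small problems instead of one problem with millions of samples.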

    A Balanced Secondary Structure Predictor

    Secondary structure (SS) refers to the local spatial organization of the polypeptide backbone atoms of a protein. Accurate prediction of SS is a vital clue for resolving the 3D structure of a protein. SS has three different components: helix (H), beta (E) and coil (C). Most SS predictors are imbalanced: their accuracy in predicting helix and coil is high, but significantly lower for beta. The objective of this thesis is to develop a balanced SS predictor that achieves good accuracy in all three SS components. We propose a novel approach to this problem that combines a genetic algorithm (GA) with a support vector machine. We prepared two test datasets (CB471 and N295) to compare the performance of our predictor with SPINE X. The overall accuracy of our predictor was 76.4% and 77.2% on the CB471 and N295 datasets, respectively, while SPINE X gave 76.5% overall accuracy on both test datasets.
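
To make the notion of a "balanced" predictor concrete, the sketch below computes accuracy separately for H, E and C alongside the overall accuracy. The function names and the toy sequences are illustrative assumptions, not data from the thesis.

```python
def per_class_accuracy(true_states, predicted_states, classes="HEC"):
    """Fraction of residues of each true class that were predicted correctly."""
    totals = {c: 0 for c in classes}
    correct = {c: 0 for c in classes}
    for t, p in zip(true_states, predicted_states):
        totals[t] += 1
        if t == p:
            correct[t] += 1
    # None marks a class that never occurs in the true labels.
    return {c: (correct[c] / totals[c]) if totals[c] else None for c in classes}


def overall_accuracy(true_states, predicted_states):
    """Fraction of all residues predicted correctly."""
    matches = sum(t == p for t, p in zip(true_states, predicted_states))
    return matches / len(true_states)


# Toy example: half of the beta residues are mislabeled as coil.
true_ss = "HHHHEEEECCCC"
pred_ss = "HHHHCCEECCCC"
by_class = per_class_accuracy(true_ss, pred_ss)
overall = overall_accuracy(true_ss, pred_ss)
```

In this toy case the overall accuracy stays above 80% even though only half of the beta residues are recovered, which is the kind of imbalance a single overall figure can hide.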

    Integrated mining of feature spaces for bioinformatics domain discovery

    One of the major challenges in the field of bioinformatics is the elucidation of protein folding for the functional annotation of proteins. The factors that govern protein folding include the chemical, physical, and environmental conditions of the protein's surroundings, which can be measured and exploited for computational discovery purposes. These conditions enable the protein to transform from a sequence of amino acids to a globular three-dimensional structure. Information concerning the folded state of a protein has significant potential to explain biochemical pathways and their involvement in disorders and diseases. This information impacts the ways in which genetic diseases are characterized and cured and in which designer drugs are created. With the exponential growth of protein databases and the limitations of experimental protein structure determination, sophisticated computational methods have been developed and applied to search for, detect, and compare protein homology. Most computational tools developed for protein structure prediction are primarily based on sequence similarity searches. These approaches have improved the prediction accuracy for proteins with high sequence similarity but have failed to perform well for proteins with low sequence similarity. Data mining offers unique algorithmic computational approaches that have been used widely in the development of automatic protein structure classification and prediction. In this dissertation, we present a novel approach for the integration of physico-chemical properties and effective feature extraction techniques for the classification of proteins. Our approaches overcome one of the major obstacles of data mining in protein databases: the encapsulation of different hydrophobicity residue properties into a much reduced feature space that possesses high degrees of specificity and sensitivity in protein structure classification. We have developed three unique computational algorithms for coherent feature extraction on selected scale properties of the protein sequence. When faced with the problem of the unequal cardinality of proteins, our proposed integration scheme effectively handles the varied sizes of proteins and scales well with increasing dimensionality of these sequences. We also detail a two-fold methodology for protein functional annotation. First, we present an algorithm that integrates multiple physico-chemical properties in the form of a multi-layered abstract feature space, with each layer corresponding to a physico-chemical property. Second, we discuss a wavelet-based segmentation approach that efficiently detects regions of property conservation across all layers of the created feature space. Finally, we present a unique graph-theory-based algorithmic framework for the identification of conserved hydrophobic residue interaction patterns using identified scales of hydrophobicity. We report that these discriminatory features are specific to a family of proteins, which consist of conserved hydrophobic residues that are then used for structural classification. We also present our rigorously tested validation schemes, which report significant degrees of accuracy to show that homologous proteins exhibit the conservation of physico-chemical properties along the protein backbone. We conclude our discussion by summarizing our results and contributions and by listing our goals for future research.

    USE OF IMAGE PROCESSING TECHNIQUES AND MACHINE LEARNING FOR BETTER UNDERSTANDING OF T GONDII BIOLOGY

    Almost one in every three people worldwide is infected with Toxoplasma gondii (T. gondii). The biology and growth of the parasite’s bradyzoite form in host tissue cysts are not well understood. T. gondii’s metabolic state influences the morphology of its single mitochondrion, which can be visualized using fluorescence microscopy with specific dyes. Hence, fluorescence microscopy images of cysts purified from infected mouse brains carry biological information about bradyzoites, the poorly understood form of the parasite within them. Using fluorescence microscopy techniques, previous studies extracted images of the mitochondrion, nucleus, and the inner membrane complex (IMC), providing information on T. gondii’s cysts and paving the way for image processing techniques and machine learning to analyze the bradyzoite form of the parasite. Previously, multivariate logistic regression (MLG) was used to classify mitochondrial shapes. In the present study, in addition to the previously used MLG model, two other machine learning models, Support Vector Machine (SVM) and K Nearest Neighbors (KNN), were used to explore the possibility of better model selection for mitochondrial classification. Minimal model error was used as the criterion for optimizing classification performance. Error in any machine learning model is driven by bias, variance, and noise. Through trial and error, the optimal hyperparameters for each model were selected to minimize error. The dataset consisted of 1940 labeled mitochondrial objects with 22 features, divided into five classes: Blob, Tadpole, Donut, Arc, and Other. 50% of the dataset was used for training, and the other 50% was used for testing. The overall accuracies of the MLG, SVM, and KNN models were 79.1%, 78.9%, and 80.3%, respectively. Overall classification performance did not vary much, but the F score for some classes, such as Tadpole and Donut, improved with the two newer models. One of the 22 features used was an application of the Histogram of Oriented Gradients (HOG). The HOG feature was replaced with a novel feature that used linear regression of object boundary segments to extract the HOG for only the object’s boundary. The model that used Boundary HOG showed some improvement over the one that used the HOG feature. Finally, a new module including a graphical user interface was developed to process and extract shape and intensity information from TgIMC3 images, which facilitates further investigation of the parasite's biology.
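
The per-class F score mentioned above can be computed from precision and recall for one class at a time. The sketch assumes the F score is the balanced F1 measure, which the abstract does not state; the function name and the toy labels are hypothetical and are not taken from the 1940-object dataset.

```python
def f_score(true_labels, predicted_labels, target):
    """F1 for a single class: harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels using the class names from the abstract.
true_shapes = ["Blob", "Blob", "Tadpole", "Tadpole", "Donut", "Arc"]
pred_shapes = ["Blob", "Tadpole", "Tadpole", "Tadpole", "Donut", "Other"]

tadpole_f = f_score(true_shapes, pred_shapes, "Tadpole")
blob_f = f_score(true_shapes, pred_shapes, "Blob")
```

Reporting this value per class shows differences between models that an overall accuracy figure averages away.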

    SLIMSVM : a simple implementation of support vector machine for analysis of microarray data

    Support Vector Machine (SVM) is a supervised machine learning technique widely used in multiple areas of biological analysis, including microarray data analysis. SlimSVM has been developed with the intention of replacing OSU SVM as the classification component of GenoIterSVM in order to make it independent of other SVM packages. GenoIterSVM, developed by Dr. Marc Ma, is an SVM implementation with an iterative refinement algorithm for improved accuracy of classification of genotype microarray data. SlimSVM is an object-oriented, modular, and easy-to-use implementation written in C++. It supports dot (linear) and polynomial (non-linear) kernels. The program has been tested with artificial non-biological data and with microarray data. Testing with microarray data was performed to observe how SlimSVM handles medium-sized data files (containing thousands of data points), since it would ultimately be used to analyze them. The results were compared to those of LIBSVM, a leading SVM package, and the comparison demonstrates that SlimSVM was implemented accurately.
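
The two kernel types named above can be written in a few lines. This is a Python sketch for illustration only, not SlimSVM's C++ code; the parameter names `gamma`, `coef0` and `degree` follow the convention LIBSVM uses for its polynomial kernel, and whether SlimSVM exposes the same parameters is an assumption.

```python
def dot_kernel(x, y):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, gamma=1.0, coef0=1.0):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    return (gamma * dot_kernel(x, y) + coef0) ** degree


x = [1.0, 2.0]
y = [3.0, 4.0]
linear_value = dot_kernel(x, y)       # 1*3 + 2*4
poly_value = polynomial_kernel(x, y)  # (11 + 1) ** 2
```

With `degree=1`, `gamma=1.0` and `coef0=0.0` the polynomial kernel reduces to the dot kernel, which is a quick consistency check when comparing implementations.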