33 research outputs found

    Development of New Bioinformatic Approaches for Human Genetic Studies

    The development of bioinformatics methods for human genetic studies draws on vast amounts of data to generate valuable new information. Machine learning and statistical coupling analysis can be used in the study of human diseases. These diseases include intellectual disabilities (ID), which affect 1-3% of the population and are caused primarily by genetic factors. Although many cases of ID are caused by mutations in protein-coding genes, the possible involvement of long non-coding RNAs (lncRNAs) in ID, due to their role in gene expression regulation, has also been explored. In this study, we used machine learning to develop a new expression-based model trained using ID genes encoded with the developing brain transcriptome. The model was fine-tuned using the class-balancing approach of synthetic over-sampling of the minority class, resulting in improved performance. We used the model to predict candidate ID-associated lncRNAs. Our model identified several candidates that overlapped with previously reported ID-associated lncRNAs, were enriched with neurodevelopmental functions, and were highly expressed in brain tissues. Machine learning was also used to predict protein stability changes caused by missense mutations, which can lead to disease conditions including ID. We tested Random Forests, Support Vector Machines (SVM) and Naïve Bayes to find the best-performing algorithm for developing a multi-class classifier. We developed an SVM model using relevant physico-chemical features after feature selection. Our work identified new features for predicting the effect of amino acid substitutions on protein stability and produced a well-performing multi-class classifier based solely on sequence information. Statistical approaches were used to analyze the association between mutations and phenotypes. In this study, we used statistical coupling analysis (SCA) to cluster disease-causing mutations and ID phenotypes. Using SCA, we identified groups of co-evolving residues, known as protein sectors, in ID protein families. Within each distinct sector, we identified mutations linked to different phenotypic manifestations of a syndromic ID. Our results suggest that protein sector analysis can be used to associate mutations with phenotypic manifestations in human diseases. The bioinformatic methods developed in this dissertation can be used in human genetic research to understand the role of new genes and proteins in human disease.

    Granular Support Vector Machines Based on Granular Computing, Soft Computing and Statistical Learning

    With the emergence of biomedical informatics, Web intelligence, and E-business, new challenges are arising for knowledge discovery and data mining modeling problems. In this dissertation work, a framework named Granular Support Vector Machines (GSVM) is proposed to systematically and formally combine statistical learning theory, granular computing theory and soft computing theory to address challenging predictive data modeling problems effectively and/or efficiently, with a specific focus on binary classification problems. In general, GSVM works in three steps. Step 1 is granulation, which builds a sequence of information granules from the original dataset or from the original feature space. Step 2 is modeling, which builds Support Vector Machines (SVM) in some of these information granules when necessary. Finally, step 3 is aggregation, which consolidates the information in these granules at a suitable level of abstraction. A good granulation method that finds suitable granules is crucial for modeling a good GSVM. Under this framework, many different granulation algorithms are designed for binary classification problems with different characteristics, including the GSVM-CMW (cumulative margin width) algorithm, the GSVM-AR (association rule mining) algorithm, a family of GSVM-RFE (recursive feature elimination) algorithms, the GSVM-DC (data cleaning) algorithm and the GSVM-RU (repetitive undersampling) algorithm. Empirical studies in the biomedical domain and many other application domains demonstrate that the framework is promising. As a preliminary step, this dissertation work will be extended in the future to build a Granular Computing based Predictive Data Modeling framework (GrC-PDM) with which we can create hybrid adaptive intelligent data mining systems for high-quality prediction.

    MACHINE LEARNING AND BIOINFORMATIC INSIGHTS INTO KEY ENZYMES FOR A BIO-BASED CIRCULAR ECONOMY

    The world is presently faced with a sustainability crisis; it is becoming increasingly difficult to meet the energy and material needs of a growing global population without depleting and polluting our planet. Greenhouse gases released from the continuous combustion of fossil fuels accelerate climate change, and plastic waste accumulates in the environment. There is a need for a circular economy, where energy and materials are renewably derived from waste items rather than by consuming limited resources. Deconstruction of the recalcitrant linkages in natural and synthetic polymers is crucial for a circular economy, as deconstructed monomers can be used to manufacture new products. In Nature, organisms utilize enzymes for the efficient depolymerization and conversion of macromolecules. Consequently, by employing enzymes industrially, biotechnology holds great promise for energy- and cost-efficient conversion of materials for a circular economy. However, an enhanced molecular-level understanding of enzymes is needed to enable economically viable technologies that can be applied on a global scale. This work is a computational study of key enzymes that catalyze important reactions that can be utilized for a bio-based circular economy. Specifically, bioinformatics and data-mining approaches were employed to study family 7 glycoside hydrolases (GH7s), which are the principal enzymes in Nature for deconstructing cellulose to simple sugars; a cytochrome P450 enzyme (GcoA) that catalyzes the demethylation of lignin subunits; and MHETase, a tannase-family enzyme utilized by the bacterium Ideonella sakaiensis in the degradation and assimilation of polyethylene terephthalate (PET). Since enzyme function is fundamentally dependent on the primary amino-acid sequence, we hypothesize that machine-learning algorithms can be trained on an ensemble of functionally related enzymes to reveal functional patterns in the enzyme family, and to map the primary sequence to enzyme function such that functional properties can be predicted for a new enzyme sequence with significant accuracy. We find that supervised machine learning identifies important residues for processivity and accurately predicts functional subtypes and domain architectures in GH7s. Bioinformatic analyses revealed conserved active-site residues in GcoA and informed protein engineering that enabled expanded enzyme specificity and improved activity. Similarly, bioinformatic studies and phylogenetic analysis provided evolutionary context and identified crucial residues for MHET-hydrolase activity in a tannase-family enzyme (MHETase). Lastly, we developed machine-learning models to predict enzyme thermostability, allowing for high-throughput screening of enzymes that can catalyze reactions at elevated temperatures. Altogether, this work provides a solid basis for a computational data-driven approach to understanding, identifying, and engineering enzymes for biotechnological applications towards a more sustainable world.

    ASPIRER: A new computational approach for identifying non-classical secreted proteins based on deep learning

    Protein secretion has a pivotal role in many biological processes and is particularly important for intercellular communication, from the cytoplasm to the host or external environment. Gram-positive bacteria can secrete proteins through multiple secretion pathways. Among these, the non-classical secretion pathway has recently received increasing attention, but its exact mechanism remains unclear. Non-classical secreted proteins (NCSPs) are a class of secreted proteins lacking signal peptides and motifs. Several NCSP predictors have been proposed, and most of them use the whole amino acid sequence of NCSPs to construct the model. However, sequence length varies greatly between proteins. In addition, not all regions of a protein are equally important, and some local regions are not relevant to secretion. The functional regions of the protein, particularly the N- and C-terminal regions, contain important determinants for secretion. In this study, we propose a new hybrid deep learning-based framework, referred to as ASPIRER, which improves the prediction of NCSPs from amino acid sequences. More specifically, it combines a whole sequence-based XGBoost model and an N-terminal sequence-based convolutional neural network model; 5-fold cross-validation and independent tests demonstrate that ASPIRER outperforms existing state-of-the-art approaches. The source code and curated datasets of ASPIRER are publicly available at https://github.com/yanwu20/ASPIRER/. ASPIRER is anticipated to be a useful tool for improved prediction of novel putative NCSPs from sequence information and for prioritization of candidate proteins for follow-up experimental validation.
    Xiaoyu Wang, Fuyi Li, Jing Xu, Jia Rong, Geoffrey I. Webb, Zongyuan Ge, Jian Li and Jiangning Son

    Implementation of machine learning for the evaluation of mastitis and antimicrobial resistance in dairy cows

    Bovine mastitis is one of the biggest concerns in the dairy industry, where it affects sustainable milk production, farm economy and animal health. Most mastitis pathogens are bacterial in origin, and diagnosing them accurately helps in understanding the epidemiology, preventing outbreaks and curing the disease rapidly. This thesis aimed to provide a diagnostic solution that couples Matrix-Assisted Laser Desorption/Ionization-Time of Flight (MALDI-TOF) mass spectrometry with machine learning (ML) to detect bovine mastitis pathogens at the subspecies level based on their phenotypic characteristics. In Chapter 3, MALDI-TOF coupled with ML was used to discriminate bovine mastitis-causing Streptococcus uberis based on transmission route: contagious or environmental. S. uberis isolates collected from dairy farms across England and Wales were compared within and between farms. The findings of this chapter suggested that the proposed methodology has the potential for successful classification at the farm level. In Chapter 4, MALDI-TOF coupled with ML was used to show proteomic differences between bovine mastitis-causing Escherichia coli isolates with different clinical outcomes (clinical and subclinical) and disease phenotypes (persistent and non-persistent). The findings of this chapter showed that the proposed methodology can detect phenotypic differences even between genotypically identical isolates. In Chapter 5, MALDI-TOF coupled with ML was used to differentiate the benzylpenicillin resistance signatures of bovine mastitis-causing Staphylococcus aureus isolates. The findings of this chapter showed that the proposed methodology enables a fast, affordable and effective diagnostic solution for targeting resistant bacteria in dairy cows. Having shown in Chapter 5 that this methodology successfully differentiates benzylpenicillin-resistant and susceptible S. aureus isolates, the same technique was applied in Chapter 6 to the other mastitis agents Enterococcus faecalis and Enterococcus faecium and to profiling antimicrobials other than benzylpenicillin. The findings of this chapter demonstrated that MALDI-TOF coupled with ML allows monitoring of the disease epidemiology and provides suggestions for adjusting farm management strategies. Taken together, this thesis highlights that MALDI-TOF coupled with ML is capable of discriminating bovine mastitis pathogens at the subspecies level based on transmission route, clinical outcome and antimicrobial resistance profile, and could be used as a diagnostic tool for bovine mastitis on dairy farms.

    Early Stopping of a Neural Network via the Receiver Operating Curve.

    This thesis presents the area under the ROC (Receiver Operating Characteristic) curve, abbreviated AUC, as an alternative measure for evaluating the predictive performance of ANN (Artificial Neural Network) classifiers. Conventionally, neural networks are trained until the total error converges to zero, which may give rise to over-fitting. To ensure that a network does not over-fit the training data and then fail to generalize to new data, it appears effective to integrate ROC/AUC analysis into the training process and stop training as soon as the AUC is sufficiently large. To reduce learning costs on imbalanced data sets with uneven class distributions, random sampling and k-means clustering are used to draw a smaller subset of representatives from the original training data set. Finally, the confidence interval for the AUC is estimated using a non-parametric approach.
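
The idea of stopping training once the AUC is large enough can be illustrated with a minimal sketch. This is not the thesis's implementation: the single-neuron (logistic) model, the `target_auc` threshold, the learning rate and all function names are assumptions chosen for illustration, and the AUC is computed by plain pairwise comparison of positive and negative scores.

```python
import math


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predict(w, b, x):
    """Output of a single sigmoid neuron (assumed toy model)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_with_auc_stopping(train, valid, target_auc=0.95,
                            max_epochs=200, lr=0.1):
    """Stochastic gradient descent that stops once validation AUC is reached.

    `train` and `valid` are lists of (feature_list, label) pairs with
    labels in {0, 1}. Returns (weights, bias, epochs_run, last_valid_auc).
    """
    w, b = [0.0] * len(train[0][0]), 0.0
    val_auc = 0.0
    for epoch in range(1, max_epochs + 1):
        for x, y in train:
            err = predict(w, b, x) - y  # gradient of log-loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
        val_auc = auc([predict(w, b, x) for x, _ in valid],
                      [y for _, y in valid])
        if val_auc >= target_auc:  # early stop: AUC is "sufficiently large"
            return w, b, epoch, val_auc
    return w, b, max_epochs, val_auc
```

Calling `train_with_auc_stopping(train, valid)` returns as soon as the validation AUC reaches the chosen threshold, rather than continuing to drive the training error towards zero.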

    Clasificación binaria en problemas desequilibrados mediante equivalencia del cociente de verosimilitudes (Binary classification in imbalanced problems via likelihood-ratio equivalence)

    Singular Problems are those whose characteristics can compromise the correct operation of conventional discriminative machines, leading to unsatisfactory results. Among them, imbalanced classification problems stand out: those in which there are large differences in the class populations and/or the cost policy penalizes the choice of certain classes more heavily, biasing the machine output in favor of the predominant classes. Specific methods that compensate for the imbalance are therefore required to allow the detection of the minority classes. Focusing on the binary case, a state-of-the-art survey of existing rebalancing methods is carried out. Most of the proposed techniques are purely empirical, without a complete analysis of the statistical implications of their application. Although their use may provide good results under certain conditions, any change in these conditions may degrade their performance. Therefore, a principled methodology based on Bayesian statistical theory is presented with the aim of constructing robust solutions. This methodology is based on the likelihood ratio invariance principle, for which two necessary and sufficient conditions are established: the use of Bregman divergences as a surrogate cost, and statistically neutral rebalancing methods. In addition, principled two-step classification procedures are proposed, and a rebalanced design process based on a combination of methods is described in detail. Several experiments support the methodology, studying its effects and limitations in real problems under different circumstances: larger or smaller numbers of available samples and the presence of noise. Finally, the SMOTE algorithm, one of the most common rebalancing methods, is studied in more depth. Because it generates samples in a filiform (thread-like) manner, by means of the nearest neighbors, SMOTE has difficulties with high-dimensional problems. Therefore, an alternative, VoluSMOTE, is proposed to correct or mitigate such effects through volumetric generation.
    Doctoral Program in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. Chair: Javier Martínez Moguerza. Secretary: Marcelino Lázaro Teja. Member: José Luis Sancho Góme
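
The "filiform" behaviour of SMOTE mentioned in the abstract above can be seen in a small sketch of the standard SMOTE interpolation step: each synthetic sample lies on the line segment between a minority sample and one of its nearest minority neighbors. The function name `smote`, its parameters and the use of Euclidean distance are assumptions for illustration. This is not VoluSMOTE, whose details are not given here.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by nearest-neighbor interpolation.

    `minority` is a list of equal-length numeric tuples. Each synthetic
    point is base + gap * (neighbor - base) with gap drawn from [0, 1),
    so it always lies on a segment between two existing minority points.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than points
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # k nearest minority neighbors of the chosen base point
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbor)))
    return synthetic
```

Because every new point sits on a one-dimensional segment, the synthetic samples occupy a vanishing share of the space as the number of dimensions grows, which is the difficulty the abstract attributes to SMOTE in high-dimensional problems.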

    Prediction of G Protein-Coupled Receptors with SVM-Prot Features and Random Forest

    Early Detection of Neurodegenerative Diseases from Bio-Signals: A Machine Learning Approach

    People, especially in advanced countries, are living longer thanks to advances in medical science, which has resulted in a higher prevalence of age-related diseases such as Alzheimer’s and dementia. The occurrence of such diseases continues to increase, and ultimately the cost of caring for these groups will become unsustainable. Addressing this issue has reached a critical point, and failing to provide a strategic way forward will negatively affect patients, national health services and society as a whole. Three distinctive development stages of neurodegenerative diseases (Retrogenesis, Cognitive Impairment and Gait Impairment) motivated me to divide this research work into two main parts. To achieve the purpose of early detection and diagnosis, I analysed gait signals and EEG signals separately, as both are severely affected by any neurological disease. The first part of this research work focuses on the discrimination analysis of gait signals from patients with different neurodegenerative diseases (Parkinson’s, Huntington’s, and Amyotrophic Lateral Sclerosis) and from control subjects. This involves relevant feature extraction, handling imbalanced datasets and missing entries, and finally classification of multiclass datasets. For the classification and discrimination of gait signals, eleven (11) classifiers were selected, representing linear, non-linear and Bayes normal classification techniques. Results revealed that three classifiers provided higher accuracy rates: UDC, LDC and PARZEN, with 65%, 62.5% and 60% accuracy, respectively. Further, I proposed and developed a new classifier fusion strategy that combines classification algorithms using combining rules (voting, product, mean, median, maximum and minimum). It generates better results and classifies subjects more accurately than base-level classifiers. The last part of this research work is based on the rectification and computation of EEG signals of mild Alzheimer’s disease patients and control subjects. To detect perturbations in the EEG signals of Alzheimer’s patients, three neural synchrony measurement techniques (phase synchrony, magnitude squared coherence and cross correlation) are applied to three different databases of mild Alzheimer’s disease (MiAD) patients and healthy subjects. I compared the right and left temporal parts of the brain with the rest of the brain area (frontal, central and occipital), as the temporal regions are among the first to be affected by Alzheimer’s. Two novel methods are proposed to compute the neural synchronization of the brain: an average synchrony measure and a PCA-based synchrony measure. These techniques are evaluated on three different datasets of MiAD patients and control subjects using the Wilcoxon rank-sum test (Mann-Whitney U test). Results demonstrated that the PCA-based method helped to find more significant features that can be used as biomarkers for the early diagnosis of Alzheimer’s.
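
The combining rules named in the abstract above (voting, product, mean, median, maximum and minimum) can be sketched as fixed rules applied to the per-class probability estimates of several base classifiers. This is a generic illustration, not the fusion strategy developed in the thesis; the function name `fuse` and the input layout are assumptions.

```python
import math
from collections import Counter
from statistics import mean, median

# Fixed combining rules applied per class across all classifiers.
RULES = {
    "product": math.prod,
    "mean": mean,
    "median": median,
    "maximum": max,
    "minimum": min,
}


def fuse(posteriors, rule):
    """Return the index of the class chosen by the given combining rule.

    `posteriors` holds one list of class probabilities per base classifier,
    e.g. [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]] for two classifiers, three classes.
    """
    n_classes = len(posteriors[0])
    if rule == "voting":
        # Each classifier votes for its own most probable class.
        votes = Counter(max(range(n_classes), key=p.__getitem__)
                        for p in posteriors)
        return votes.most_common(1)[0][0]
    combine = RULES[rule]
    combined = [combine([p[c] for p in posteriors]) for c in range(n_classes)]
    return max(range(n_classes), key=combined.__getitem__)


# Three hypothetical classifiers scoring one subject over three classes.
example = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]
decisions = {rule: fuse(example, rule) for rule in ["voting", *RULES]}
```

On this toy input the rules do not all agree: the maximum rule follows the single most confident classifier, while voting, mean, median and product follow the two classifiers that prefer class 0.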