
    Nuclear export signals (NESs) in Arabidopsis thaliana : development and experimental validation of a prediction tool

    Rubiano Castellanos CC. Nuclear export signals (NESs) in Arabidopsis thaliana : development and experimental validation of a prediction tool. Bielefeld (Germany): Bielefeld University; 2010. It is well established that nucleo-cytoplasmic shuttling regulates not only the localization but also the activity of many proteins, such as transcription factors, cell cycle regulators and tumor suppressor proteins. In plants, too, the nucleo-cytoplasmic partitioning of proteins is emerging as an important regulatory mechanism for many plant-specific processes. One requirement for a protein to shuttle between nucleus and cytoplasm is nuclear export activity. The widely used mechanism for exporting proteins from the nucleus involves the receptor Exportin 1 and the presence of a nuclear export signal (NES) in the cargo protein. Given the large amount of sequence data now available, a computational tool that predicts which proteins potentially contain an NES would facilitate the screening and experimental characterization of NES-containing proteins. However, the computational prediction of NESs is a challenging task. Currently there is only one NES prediction tool, and it is unfortunately not accurate for predicting these signals in plant proteins. This study therefore aimed mainly to develop a prediction method for identifying NESs in proteins from Arabidopsis and to validate its usefulness experimentally. It also examined how the protein context of the NES influences the nuclear export activity of specific Arabidopsis proteins. Three machine-learning algorithms (k-NN, SVM and Random Forests) were trained with experimentally validated NES sequences from proteins of Arabidopsis and other organisms.
Two kinds of features were included: the sequence of the NESs, expressed as the score obtained from an HMM profile constructed with the NES sequences of proteins from Arabidopsis, and physicochemical properties of the amino acid residues, expressed as amino acid index values. The Random Forest classifier was selected from the three classifiers after its performance was evaluated by different methods. It proved highly accurate (accuracy over 85 percent, classification error around 10 percent, MCC around 0.7 and area under the ROC curve around 0.90) and performed better than the other two trained classifiers. Using the Random Forest classifier, around 5000 of the Arabidopsis protein sequences were predicted to contain NESs. A group of these proteins was selected using Gene Ontology (GO), and from this group 13 proteins were experimentally tested for nuclear export activity. Eleven of those 13 proteins showed positive interaction with the receptor Exportin 1 (XPO1a) from Arabidopsis in yeast two-hybrid assays. The proteins showing nuclear export activity include 9 transcription factors and 2 DNA metabolism-related proteins. Furthermore, it was established that the amino acid residues located between the hydrophobic residues in the NES, as well as the protein structure of the regions around the NES, could modify the nuclear export activity of some proteins. In conclusion, this work presents a new prediction tool for NESs in proteins of Arabidopsis based on a Random Forest classifier. The experimental validation of nuclear export activity in a selected group of proteins indicates the usefulness of the tool. From a biological point of view, the nuclear export activity observed in those proteins strongly suggests that nucleo-cytoplasmic partitioning could be involved in the regulation of their functions.
For follow-up research, further characterization of the proteins showing positive nuclear export activity and validation of additional predicted NES-containing proteins are envisioned. In the near future, the developed tool will be made available as a web application to facilitate and promote its further use.
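The evaluation metrics quoted in the abstract above (accuracy, classification error, MCC) can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `classifier_metrics` and the example counts are hypothetical and are not taken from the thesis.

```python
import math


def classifier_metrics(tp, fp, tn, fn):
    """Return accuracy, classification error and MCC for a binary classifier.

    tp, fp, tn, fn are the counts of true positives, false positives,
    true negatives and false negatives.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    error = (fp + fn) / total
    # Matthews correlation coefficient; conventionally set to 0 when any
    # marginal sum of the confusion matrix is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "error": error, "mcc": mcc}


# Made-up counts for illustration only (not results from the thesis).
metrics = classifier_metrics(tp=45, fp=5, tn=45, fn=5)
print(metrics)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside accuracy when the positive and negative classes differ in size.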

    Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics.

    Sub-cellular localisation of proteins is an essential post-translational regulatory mechanism that can be assayed using high-throughput mass spectrometry (MS). These MS-based spatial proteomics experiments enable us to pinpoint the sub-cellular distribution of thousands of proteins in a specific system under controlled conditions. Recent advances in high-throughput MS methods have yielded a plethora of experimental spatial proteomics data for the cell biology community. Yet, there are many third-party data sources, such as immunofluorescence microscopy or protein annotations and sequences, which represent a rich and vast source of complementary information. We present a unique transfer learning classification framework that utilises a nearest-neighbour or support vector machine system to integrate heterogeneous data sources to considerably improve on the quantity and quality of sub-cellular protein assignment. We demonstrate the utility of our algorithms through evaluation of five experimental datasets, from four different species in conjunction with four different auxiliary data sources to classify proteins to tens of sub-cellular compartments with high generalisation accuracy. We further apply the method to an experiment on pluripotent mouse embryonic stem cells to classify a set of previously unknown proteins, and validate our findings against a recent high resolution map of the mouse stem cell proteome. The methodology is distributed as part of the open-source Bioconductor pRoloc suite for spatial proteomics data analysis. LMB was supported by a BBSRC Tools and Resources Development Fund (Award BB/K00137X/1) and a Wellcome Trust Technology Development Grant (108441/Z/15/Z). LG was supported by the European Union 7th Framework Program (PRIME-XS project, grant agreement number 262067) and a BBSRC Strategic Longer and Larger Award (Award BB/L002817/1).
DW and OK acknowledge funding from the European Union (PRIME-XS, GA 262067) and Deutsche Forschungsgemeinschaft (KO-2313/6-1). This is the final version of the article. It first appeared from PLOS via https://doi.org/10.1371/journal.pcbi.100492

    A new computational framework for the classification and function prediction of long non-coding RNAs

    Long non-coding RNAs (lncRNAs) are known to play a significant role in several biological processes. These RNAs are longer than 200 base pairs (bp) and so are often misclassified as protein-coding genes. Most Coding Potential Computation (CPC) tools fail to accurately identify, classify and predict the biological functions of lncRNAs in plant genomes, because previous research has been limited to mammalian genomes. In this thesis, various sequence and codon-bias features for identifying lncRNA sequences are investigated and extracted in order to develop a new CPC framework. To identify essential features, the framework implements regularisation-based selection. A novel classification algorithm is implemented, which removes the dependency on experimental datasets and provides a coordinate-based solution for sub-classification of lncRNAs. To impute lncRNA functions, lncRNA-protein interactions were first determined through co-expression of genes and then re-analysed with a sequence similarity-based approach to identify novel interactions and predict lncRNA functions in the genome. The framework integrates a D3-based application for visualisation of lncRNA sequences and their associated functions in the genome. Standard evaluation metrics such as accuracy, sensitivity, and specificity have been used for benchmarking the performance of the framework against leading CPC tools. Case study analyses were conducted with plant RNA-seq datasets to evaluate the effectiveness of the framework using a cross-validation approach. The tests show that the framework can provide significant improvements over existing CPC models for plant genomes: 20-40% greater accuracy. Function prediction analysis demonstrates that the results are consistent with experimentally published findings.

    Bioinformatics methods for metabolomics based biomarker detection in functional genomics studies

    The biochemical and physiological functions of a large proportion of the approximately 27,000 protein-encoding genes in the Arabidopsis genome remain experimentally undetermined and cannot be resolved using sequence homology techniques alone. This thesis presents a set of bioinformatics resources, including a software platform for data visualization and data analysis, that address the key issues in incorporating metabolomics data into functional genomics studies. Multiple mass spectrometry based metabolomics platforms are combined to obtain better coverage of the metabolome. Different strategies for integrating the metabolomics abundance data from multiple platforms are compared to find the ideal method for biomarker discovery. A new method of putatively identifying unknown metabolites by first order partial correlation networks is proposed that uses the existing data to incorporate structurally unknown metabolites. A comprehensive study of 70 single gene knock-out mutants vs. wild type samples is performed using the Random Forest machine learning algorithm, and a biomarker database for each of the 70 mutations is built with the key metabolites, including the putative identifications of unknown metabolites. A proof-of-concept analysis of the oxoprolinase (oxp1) and gamma-glutamyl transpeptidase (ggt1 and ggt2) single gene knock-out mutants in the glutathione (GSH) degradation pathway of Arabidopsis confirms the known biology that OXP1 is responsible for the conversion of 5-oxoproline (5-OP) to glutamic acid. In addition, the ggt1/ggt2 analysis supports the hypothesis that the GGT genes may not be major contributors to 5-OP production. Also, the lack of biochemical changes in the ggt2 mutant supports previous studies reporting its low-level expression in leaf tissues. The metabolomics database, the biomarker database and the data mining tools are implemented in a web based software suite at www.plantmetabolomics.org.
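The abstract above mentions first order partial correlation networks without giving details. As a rough illustration only, the sketch below computes the standard first order partial correlation between two variables while controlling for a third. It is the textbook formula, not the thesis's actual pipeline, and the names `pearson`, `partial_corr` and the toy abundance vectors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy abundances: x and y are each z plus a disturbance that is
# uncorrelated with z and with the other disturbance.
z = [1, 2, 3, 4]
x = [2, 1, 2, 5]
y = [0, 5, 0, 5]
print(pearson(x, y))          # plain correlation, driven by the shared z
print(partial_corr(x, y, z))  # association left after removing z
```

In a network built this way, an edge between two metabolites would typically be kept only if their correlation survives conditioning on each other measured metabolite in turn; how the thesis thresholds or tests these values is not stated in the abstract.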

    Machine Learning in clinical biology and medicine: from prediction of multidrug resistant infections in humans to pre-mRNA splicing control in Ciliates

    Machine Learning methods have broadly begun to infiltrate the clinical literature in such a way that the correct use of algorithms and tools can facilitate both diagnosis and therapies. The availability of large quantities of high-quality data could lead to an improved understanding of risk factors in community and healthcare-acquired infections. In the first part of my PhD program, I refined my skills in Machine Learning by developing, and evaluating with a real antibiotic stewardship dataset, a model to predict multidrug-resistant urinary tract infections after patient hospitalization. For this purpose, I created an online platform called DSaaS, specifically designed for healthcare operators to train ML models (supervised learning algorithms). These results are reported in Chapter 2. In the second part of the PhD thesis (Chapter 3) I used my new skills to study genomic variants, in particular the phenomenon of intron splicing. One of the important modes of pre-mRNA post-transcriptional modification is alternative intron splicing, which includes intron retention (unsplicing) and allows the creation of many distinct mature mRNA transcripts from a single gene. An accurate interpretation of genomic variants is the backbone of genomic medicine. Determining, for example, the causative variant in patients with Mendelian disorders facilitates both management and potential downstream treatment of the patient’s condition, as well as providing peace of mind and allowing more effective counselling for the wider family. Recent years have seen a surge in bioinformatics tools designed to predict variant impact on splicing, and these offer an opportunity to circumvent many limitations of RNA-seq based approaches. An increasing number of these tools rely on machine learning approaches that can identify patterns in data and use this knowledge to make predictions on new data.
I optimized a pipeline to extract introns from genomes and transcriptomes and classified them into retained introns (RIs) and constitutively spliced introns (CSIs). I used data from ciliates because of the peculiar organization of their genomes (enriched in coding sequences) and because they are unicellular organisms without cells differentiated into tissues. This made the identification and manipulation of introns easier. In collaboration with my PhD colleague Dr. Leonardo Vito, I analyzed these intronic sequences in order to identify “features” with which to predict and classify them using Machine Learning algorithms. We also developed a platform for manipulating the FASTA, gtf, BED, etc. files produced by the pipeline tools. I named the platform Biounicam (intron extraction tools); it is available at http://46.23.201.244:1880/ui. The major objective of this study was to develop an accurate machine-learning model that can predict whether an intron will be retained or not, to understand the key features involved in the intron retention (IR) mechanism, and to provide insight into the factors that drive IR. Once the model has been developed, the final step of my PhD work will be to expand the platform with different machine learning algorithms to better predict retention and to test new features that drive this phenomenon. These features will hopefully contribute to finding new mechanisms that control intron splicing. The additional papers and patents I published during my PhD program are in Appendix B and C. These works range from microbiology to classical statistics and have equipped me with many useful techniques for future work.

    G-Quadruplexes and Gene Expression in Arabidopsis thaliana

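    Regularisoitu riippuvuuksien mallintaminen geeniekpressio- ja metabolomiikkadatan välillä metabolian säätelyn tutkimuksessa [Regularized dependency modelling between gene expression and metabolomics data in the study of metabolic regulation]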

    Fusing different high-throughput data sources is an effective way to reveal functions of unknown genes, as well as regulatory relationships between biological components such as genes and metabolites. Dependencies between biological components functioning in the different layers of biological regulation can be investigated using canonical correlation analysis (CCA). However, the properties of high-throughput bioinformatics data pose many challenges for data analysis: the sample size is often insufficient compared to the dimensionality of the data, and the data exhibit multicollinearity due to, for example, co-expressed and co-regulated genes. Therefore, a regularized version of classical CCA has been adopted. An alternative way of introducing regularization to statistical models is to perform Bayesian data analysis with suitable priors. In this thesis, the performance of a new variant of Bayesian CCA called gsCCA is compared to that of a classical ridge regression regularized CCA (rrCCA) in revealing relevant information shared between two high-throughput data sets. The gsCCA produces a regularization effect partly similar to that of the rrCCA but, in addition, introduces a new type of regularization to the data covariance matrices. Both CCA methods are applied to gene expression and metabolite concentration measurements obtained from an oxidative-stress tolerant Arabidopsis thaliana ecotype, Col-0, and an oxidative-stress sensitive mutant, rcd1, as time series under ozone exposure and in a control condition. The aim of this work is to reveal new regulatory mechanisms in oxidative stress signalling in plants.
For both methods, rrCCA and gsCCA (group-wise sparse CCA), the thesis illustrates their potential to reveal both already known and new regulatory mechanisms in Arabidopsis thaliana oxidative stress signalling.

    Statistical Population Genomics

    This open access volume presents state-of-the-art inference methods in population genomics, focusing on data analysis based on rigorous statistical techniques. After introducing general concepts related to the biology of genomes and their evolution, the book covers state-of-the-art methods for the analysis of genomes in populations, including demography inference, population structure analysis and detection of selection, using both model-based inference and simulation procedures. Last but not least, it offers an overview of the current knowledge acquired by applying such methods to a large variety of eukaryotic organisms. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, pointers to the relevant literature, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Statistical Population Genomics aims to promote and ensure successful applications of population genomic methods to an increasing number of model systems and biological questions.