
    Surveying alignment-free features for Ortholog detection in related yeast proteomes by using supervised big data classifiers

    Abstract Background: The development of new ortholog detection algorithms and the improvement of existing ones are of major importance in functional genomics. We have previously introduced a successful supervised pairwise ortholog classification approach, implemented in a big data platform, that considered several pairwise protein features and the low ortholog pair ratios found between two annotated proteomes (Galpert, D et al., BioMed Research International, 2015). The supervised models were built and tested using a Saccharomycete yeast benchmark dataset proposed by Salichos and Rokas (2011). Although several pairwise protein features were combined in a supervised big data approach, they were all, to some extent, alignment-based, and the proposed algorithms were evaluated on a single test set. Here, we aim to evaluate the impact of alignment-free features on the performance of supervised models implemented in the Spark big data platform for pairwise ortholog detection in several related yeast proteomes. Results: The Spark Random Forest and Decision Trees with oversampling and undersampling techniques, built with only alignment-based similarity measures or combined with several alignment-free pairwise protein features, showed the highest classification performance for ortholog detection in three yeast proteome pairs. Although such supervised approaches outperformed traditional methods, there were no significant differences between the exclusive use of alignment-based similarity measures and their combination with alignment-free features, even within the twilight zone of the studied proteomes. Only when alignment-based and alignment-free features were combined in Spark Decision Trees with imbalance management could a higher success rate (98.71%) within the twilight zone be achieved for a yeast proteome pair that underwent a whole genome duplication.
The feature selection study showed that alignment-based features were top-ranked for the best classifiers, while the runners-up were alignment-free features related to amino acid composition. Conclusions: The incorporation of alignment-free features in supervised big data models did not significantly improve ortholog detection in yeast proteomes compared with the classification quality achieved with alignment-based similarity measures alone. However, the similarity of their classification performance to that of traditional ortholog detection methods encourages the evaluation of other alignment-free protein pair descriptors in future research. This work was supported by the following financial sources: Postdoc fellowship (SFRH/BPD/92978/2013) granted to GACh by the Portuguese Fundação para a Ciência e a Tecnologia (FCT). AA was supported by the MarInfo – Integrated Platform for Marine Data Acquisition and Analysis (reference NORTE-01-0145-FEDER-000031), a project supported by the North Portugal Regional Operational Program (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF)
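
The abstract above mentions alignment-free features related to amino acid composition. As a rough illustration only, the sketch below computes a 20-dimensional composition vector for each protein and a Euclidean distance between two such vectors. The function names and the choice of Euclidean distance are assumptions made for this example; the paper's actual feature definitions are not given in the abstract.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Return the relative frequency of each standard amino acid in seq.

    Non-standard symbols are ignored. Hypothetical helper, not the paper's code.
    """
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def composition_distance(seq_a, seq_b):
    """Euclidean distance between two composition vectors.

    No alignment is computed; 0.0 means identical composition.
    """
    va, vb = aa_composition(seq_a), aa_composition(seq_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
```

A pairwise value like this could serve as one column of a feature table next to alignment-based scores. Because it ignores residue order, two unrelated proteins with similar composition would also score as close.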

    Bioinformatics Analyses of Alternative Splicing: Prediction of alternative splicing events in animals and plants using Machine Learning and analysis of the extent and conservation of subtle alternative splicing

    Alternative splicing (AS) is a mechanism by which a multi-exon gene can express different transcripts and thus different proteins. AS contributes substantially to the complexity and diversity of eukaryotic transcriptomes and proteomes. Over the past ten years, bioinformatics has made decisive contributions to our understanding of AS with respect to the prevalence, extent, and conservation of its different classes, as well as its evolution, regulation, and biological function. Large-scale detection of AS has mostly relied on genome- and transcriptome-wide alignment of EST and mRNA data and on microarray analyses, all of which depend heavily on bioinformatics methods. These were complemented by computational methods for characterizing and predicting AS, which show how constitutive and alternative splice sites and exons differ. This dissertation deals with bioinformatic analyses of selected aspects of AS. In the first part, I developed methods for predicting AS without relying on expressed sequence data sets. In particular, I extended approaches for predicting cassette exons using Bayesian networks (BN) and established new discriminative features. These markedly improved the true positive rate from the published 50% to 61%, at a stringent false positive rate of only 0.5%. I showed that exons annotated as constitutive but assigned a high probability of being alternative by the BN were indeed confirmed as alternative by the latest expression data. Given the same data sets and features, the performance of a BN matches that of a published support vector machine (SVM), which indicates that reliable classification results depend more on the features than on the choice of classifier.
In the second part, I extended the BN approach to a large and evolutionarily widespread class of AS events known as NAGNAG tandem splice sites, in which the alternative splice sites are separated by only 3 nucleotides (nt). Careful compilation of the training and test data sets for NAGNAG AS prediction contributed to a balanced sensitivity and specificity of 92%. Predictions of a BN trained on the combined data set were experimentally confirmed in 81% (38/47) of cases. This study thereby generated one of the currently largest data sets for the experimental verification of AS predictions. A BN trained on human data achieves similarly good results on four other vertebrate genomes. Only slight losses in predictions for Drosophila melanogaster and Caenorhabditis elegans indicate that the underlying splicing mechanism appears to be conserved over large evolutionary distances. Finally, I used the prediction accuracy from the experimental validation to estimate the number of alternative NAGNAGs that remain undiscovered. The results suggest that the mechanism of NAGNAG AS is simple, stochastic, and conserved among vertebrates and beyond. Furthermore, I applied the BN approach to characterize and predict NAGNAG AS in Physcomitrella patens, a moss. This is one of the first studies to predict AS in plants without relying on expressed sequence data sets. We obtained results similar to those of our other work on NAGNAG AS prediction. An independent validation using 454 next-generation sequencing data showed true positive rates of 64%-79% for well-supported cases of NAGNAG AS. The mechanism of NAGNAG AS in plants thus appears to resemble that in animals
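
A NAGNAG tandem site, as described above, is a pair of candidate splice sites only 3 nt apart, where N stands for any nucleotide. The sketch below shows only the motif scan that would produce candidates in a DNA or RNA string. It is a minimal illustration with a hypothetical function name, not the Bayesian network predictor of the thesis: it finds every NAGNAG occurrence and cannot tell whether one is a real splice acceptor or is alternatively spliced.

```python
import re

# Zero-width lookahead so that overlapping motifs (e.g. in CAGCAGCAG) are all reported.
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")


def find_nagnag(seq):
    """Return (0-based start, motif) for every NAGNAG occurrence in seq.

    RNA input is accepted by mapping U to T. Hypothetical helper for illustration.
    """
    seq = seq.upper().replace("U", "T")
    return [(m.start(), m.group(1)) for m in NAGNAG.finditer(seq)]
```

In practice such a scan would be restricted to annotated intron-exon boundaries before any classifier is applied. That filtering step is assumed here and not shown.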

    Towards the understanding of transcriptional and translational regulatory complexity

    Because every cell carries the same genome, the observed phenotypic diversity can only arise from highly regulated mechanisms beyond the encoded DNA sequence. We investigated several mechanisms of protein biosynthesis and analyzed DNA methylation patterns, alternative translation sites, and genomic mutations. As chromatin states are determined by epigenetic modifications and nucleosome occupancy, we conducted a structural superimposition approach between DNA methyltransferase 1 (DNMT1) and the nucleosome, which suggests that DNA methylation depends on the accessibility of nucleosome-bound DNA to DNMT1. Regarding translation, alternative non-AUG translation initiation was observed. We developed reliable prediction models to detect these alternative start sites in a given mRNA sequence. Our tool PreTIS provides initiation confidences for all frame-independent non-cognate and AUG starts. Beyond these innate factors, specific sequence variations can additionally affect a phenotype. We conducted a genome-wide analysis with millions of mutations and found an accumulation of SNPs next to transcription starts that could relate to a gene-specific regulatory signal. We also report similar conservation of canonical and alternative translation sites, highlighting the relevance of alternative mechanisms. Finally, our tool MutaNET automates variation analysis by scoring the impact of individual mutations on cell function while also integrating a gene regulatory network.

    Análisis y diseño de técnicas de preprocesamiento de instancias escalables para problemas no balanceados en Big Data : Aplicaciones en situaciones de emergencias humanitarias

    In the current information age, the analysis associated with the Big Data scenario allows knowledge to be extracted from a vast source of information. One of the issues of interest for extracting and exploiting the value of data is adapting and simplifying the raw data that serve as input to the learning algorithm, which is known as "Smart Data". Despite its importance, and its application to standard problems, the analysis of data quality in Big Data sets is almost unexplored territory. In this sense, an exhaustive study of the characteristics of the data, together with the application of suitable preprocessing techniques, has become a mandatory step for all Data Science projects, both in industry and in academia, and especially those associated with Big Data analysis. Consequently, the main research line of this thesis addressed the distributed and scalable preprocessing of Big Data sets for binary classification, in order to obtain the aforementioned Smart Data. Taking into account the impact that the intrinsic characteristics of the data have on the performance of learning models, as well as the scarcity of existing solutions for Big Data scenarios, this thesis presented three proposals for identifying and/or treating the following characteristics: (a) imbalanced data; (b) redundancy; (c) high dimensionality; and (d) overlap. Regarding imbalanced data, SMOTE-BD was presented, a SMOTE for Big Data based on a study of the particularities required for its design to be fully scalable and for its behavior to match as faithfully as possible the state-of-the-art sequential technique (so popular in Small Data scenarios).
Likewise, a variant of SMOTE-BD called SMOTE-MR was introduced, whose design processes the data locally on each node. Since no single technique always yields the best results, several of them are usually applied when the classes of a problem have to be balanced. This is why our contribution gains relevance, since, up to the time of its development, only trivial solutions based on random sampling were available. Regarding redundancy and high dimensionality, FDR2-BD was presented, a scalable methodology for reducing (or condensing) a Big Data set in a dual vertical and horizontal manner, that is, reducing both attributes and instances, with the premise of maintaining predictive quality with respect to the original data. The proposal is based on a cross-validation scheme in which a hyperparameter tuning process is carried out that also supports the handling of imbalanced data sets. FDR2-BD makes it possible to determine whether a given data set is reducible while keeping the predictive power of the original data within a threshold that can be set by the expert in the problem domain. Consequently, our proposal reports which data attributes are most important and what percentage of uniform instance reduction can be carried out. The results showed the strength of FDR2-BD, which obtained very high reduction values for most of the data sets studied, in terms of both dimensionality and the proposed instance reduction percentages. In concrete terms, around 70% feature reduction and 98% instance reduction were achieved for a maximum accepted predictive loss threshold of 1%; in some cases, predictive quality remained equal to that of the original set.
This condensed information has the advantage that it can be used on simpler infrastructures than those dedicated to Big Data processing, and it enables its use with explainability/interpretability techniques such as LIME or SHAP, whose computational complexity is at least O(n^2 × d), where n and d are the number of instances and variables, respectively. Regarding overlap, GridOverlap-BD was presented, a methodology for the scalable characterization of Big Data classification problems. The proposal relies on grid-based partitioning of the feature space. GridOverlap-BD makes it possible to identify or characterize the areas of the problem as one of two types: pure and overlapped zones. In addition, a complexity metric derived from applying GridOverlap-BD was introduced, focused on quantifying the overlap present in the data. The experiments showed that both the characterization of a problem's zones and the quantification of the degree of overlap were carried out effectively for the data sets in the experimental environment. This constitutes a pioneering, scalable, and fully model-agnostic approach to characterizing the instances of a Big Data problem and estimating its complexity with a view to analyzing the results of subsequent modeling. All proposals were developed using the Apache Spark framework, since it has become a de facto standard for Big Data processing. In addition, the implementations are available in public repositories, in order to facilitate the reproducibility of the results as well as the possible extension, by any interested researcher, of the approaches designed in this doctoral thesis. Thesis under joint supervision with the Universidad de Granada (Spain). Facultad de Informátic
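
The grid-based idea described above (partition the feature space into cells, then separate pure zones from overlapped ones) can be sketched on a single machine as follows. This is an illustrative assumption, not the GridOverlap-BD implementation: the function names, the fixed `cell_size` parameter, and the overlap ratio used as a complexity measure are all hypothetical, and the thesis implementation runs on Apache Spark.

```python
from collections import defaultdict


def cell_of(point, cell_size):
    """Map a numeric feature vector to the integer index of its grid cell."""
    return tuple(int(v // cell_size) for v in point)


def grid_zones(points, labels, cell_size):
    """Label each occupied cell 'pure' (one class) or 'overlapped' (several classes)."""
    classes_in_cell = defaultdict(set)
    for p, y in zip(points, labels):
        classes_in_cell[cell_of(p, cell_size)].add(y)
    return {
        cell: ("pure" if len(classes) == 1 else "overlapped")
        for cell, classes in classes_in_cell.items()
    }


def overlap_ratio(points, labels, cell_size):
    """Fraction of instances that fall in overlapped cells (hypothetical metric)."""
    if not points:
        return 0.0
    zones = grid_zones(points, labels, cell_size)
    hits = sum(1 for p in points if zones[cell_of(p, cell_size)] == "overlapped")
    return hits / len(points)
```

The result depends strongly on `cell_size`: very small cells make almost every zone pure, while very large cells merge everything into one overlapped zone.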

    Understanding Neuromuscular Health and Disease: Advances in Genetics, Omics, and Molecular Function

    This compilation focuses on recent advances in the molecular and cellular understanding of neuromuscular biology, and the treatment of neuromuscular disease. These advances are at the forefront of modern molecular methodologies, often integrating across wet-lab cell and tissue models, dry-lab computational approaches, and clinical studies. The continuing development and application of multiomics methods offer particular challenges and opportunities in the field, not least in the potential for personalized medicine

    A systems biology approach to musculoskeletal tissue engineering: transcriptomic and proteomic analysis of cartilage and tendon cells

    Disorders of cartilage and tendon account for a high incidence of disability and are highly prevalent co-morbidities within the ageing population; therefore, musculoskeletal disorders represent a major public health policy issue. Despite considerable efforts to characterise biochemical and biomechanical cues that promote a stable differentiated cartilage or tendon phenotype in vitro, the benchmarks by which progress is measured are limited. Common regenerative interventions, such as autologous cartilage implantation, have a required period of monolayer expansion that induces a loss of the functional phenotype, termed dedifferentiation. Dedifferentiation has no definitive mechanism yet is widely described in both regenerative and degenerative contexts; in addition to stem cell transplantation and cell-seeding in three-dimensional scaffolds, dedifferentiation represents the third approach to the development of regenerative mechanisms for mammalian tissue repair. Cartilage and tendon show a number of common features in structure, development, disease, and repair. The extracellular matrix is a dynamic and complex structure that confers the functional mechanical properties of cartilage and tendon. Dysregulation of its production and degradation is critical to the pathophysiology of musculoskeletal disorders; therefore, reparative interventions require a stable, functional phenotype from the outset. Cartilage and tendon demonstrate a commonality in terms of function-defining structure, both being sparsely cellular with a preponderance of collagenous matrix. Parity of functionality with the pre-injury state after healing is rarely achieved for cartilage and tendon. Cartilage and tendon also share common embryological origins. Common mesenchymal progenitor cells differentiate into many musculoskeletal tissues with diverse functions. Specialist sub-populations of tendon and cartilage progenitors enable formation of transitional zones between these developing tissues.
The development of musculoskeletal structures does not occur in isolation; however, cartilage and tendon have not previously been considered together in a systems context. An integrated understanding of the differentiation of these tissues should inform regenerative therapies and tissue engineering strategies. Systems biology is a paradigm shift in scientific thinking where traditional reductionist strategies to complex biological problems have been superseded by a holistic philosophy seeking to understand the emergent behavior of a system by the integrative and predictive modeling of all elements of that system. Whole transcriptome and proteome profiling studies are used to collect quantitative data about a system, which may then be exploited by systems biology methodologies including the analysis of gene and protein networks. Gene-gene co-expression relationships, which are core regulatory mechanisms in biology, are often not part of a comprehensive gene expression analysis. Many biological networks are sparse and have a scale-free topology, which generally indicates that the majority of genes have very few connections, whilst certain key regulators, or ‘hubs’, are highly interconnected. Co-expression networks may be used to define regulatory sub-networks and ‘hubs’ that have phenotypic associations. This approach allows all quantitative data to be used and makes no a priori assumptions about relationships in the system and, therefore, can facilitate the exploration of emergent behavior in the system and the generation of novel hypotheses. The ultimate goal of tissue engineering is the replacement of lost or damaged cells, and in vitro, to develop biomimetic (organotypic) structures to serve as experimental models. Tissues, and the strategies to functionally replicate them ex vivo, are complex and require an integrated, multi-disciplinary approach.
Systems biology approaches, using data arising from multiple levels of the biological hierarchy, can facilitate the development of predictive models for bioengineered tissue. The iterative refinement, quantification, and perturbation of these models may expedite the translation of well-validated organotypic systems, through legal regulatory frameworks, into regenerative strategies for musculoskeletal disorders in humans. In this thesis the systems under consideration are the major cell populations of cartilage and tendon (chondrocytes and tenocytes, respectively). They are described in three environmental conditions: native tissue, monolayer (two-dimensional), or three-dimensional models. There has been no systematic investigation of the global gene and protein profiles of cartilage and tendon in their native state relative to monolayer or three-dimensional cultures. There is no clear mechanistic description of the impact of in vitro environmental perturbations on the system or indeed the adequacy of these models as proxies for cartilage and tendon. A discovery approach using transcriptomic and proteomic profiling is undertaken to define a robust and consistent gene and protein profile for each condition. Differentially expressed elements are functionally annotated and pathway topology approaches employed to predict major signalling pathways associated with the observed phenotype. This study defines dedifferentiated chondrocytes and tenocytes in monolayer culture as expressing markers of musculoskeletal development, including scleraxis (Scx) and Mohawk (Mkx). Furthermore, there is reproducible synthetic profile convergence in monolayer culture between cartilage and tendon cells. Standard three-dimensional culture systems for chondrocytes and tenocytes fail to replicate the gene expression profile of cartilage and tendon. The PI-3K/Akt signaling pathway is predicted to be the predominant canonical pathway associated with de- and re-differentiation in vitro.
Using novel and publicly available transcriptomic data sets, a meta-analysis of microarray gene expression profiles is performed using weighted gene co-expression network analysis. This is employed for transcriptome network decomposition to isolate highly correlated and interconnected gene-sets (modules) from gene expression profiles of cartilage and tendon cells in different environmental conditions. Sub-networks strongly associated with de- and re-differentiation phenotypes are defined. Comparison of global transcriptome network architecture was performed to define the conservation of network modules between a model species (rat) and human data. In addition to the annotation of an osteoarthritis-associated module in the rat, a class-prediction analysis defined a minimal gene signature for the prediction of three-dimensional cultures from standard monolayer culture. Finally, proteomic and transcriptomic data sets are integrated by defining common upstream regulators (TGFB and PDGF BB) and unified mechanistic networks are generated for de- and re-differentiation. The studies collected in this thesis contribute to a wider understanding of cartilage and tendon tissue engineering and organotypic culture development. A clear mechanistic understanding of the regulatory networks controlling differentiation of cartilage and tendon progenitor cells is required in order to develop improved in vitro models and bio-engineered tissue that are physiologically relevant. The findings presented here provide practical outputs and testable hypotheses to drive future evidence-based research in organotypic culture development for musculoskeletal tissues
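
The weighted co-expression idea used in this thesis can be illustrated with a small sketch. In an unsigned weighted network, the adjacency between two genes is the absolute Pearson correlation of their expression profiles raised to a soft-threshold power beta, and a gene's connectivity is the sum of its adjacencies; highly connected genes are hub candidates. The gene labels in the usage note, the function names, and the value beta = 6 are assumptions for this example. Module detection, which the full analysis also performs, is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def soft_adjacency(expr, beta=6):
    """expr maps gene -> expression values across samples.

    Returns {(gene_i, gene_j): |cor|**beta} for every ordered pair with i != j.
    """
    genes = list(expr)
    adj = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            a = abs(pearson(expr[g], expr[h])) ** beta
            adj[(g, h)] = adj[(h, g)] = a
    return adj


def connectivity(expr, beta=6):
    """Sum of adjacencies per gene; large values suggest hub candidates."""
    adj = soft_adjacency(expr, beta)
    return {g: sum(a for (i, _), a in adj.items() if i == g) for g in expr}
```

With four hypothetical profiles such as `{"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1], "D": [1, -1, 1, -1]}`, genes A, B and C are perfectly correlated or anti-correlated and receive high connectivity, while D is only weakly connected. Raising the correlation to a power suppresses weak correlations without imposing a hard cutoff.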

    Ultrasensitive detection of Toxocara canis excretory-secretory antigens by a nanobody electrochemical magnetosensor assay.

    Human Toxocariasis (HT) is a zoonotic disease caused by the migration of the larval stage of the roundworm Toxocara canis in the human host. Despite being the most cosmopolitan helminthiasis worldwide, its diagnosis is elusive. Currently, the detection of specific IgG immunoglobulins against the Toxocara Excretory-Secretory Antigens (TES), combined with clinical and epidemiological criteria, is the only strategy to diagnose HT. Cross-reactivity with other parasites and the inability to distinguish between past and active infections are the main limitations of this approach. Here, we present a sensitive and specific novel strategy to detect and quantify TES, aiming to identify active cases of HT. High specificity is achieved by making use of nanobodies (Nbs), recombinant single variable domain antibodies obtained from camelids, which due to their small molecular size (15 kDa) can recognize hidden epitopes not accessible to conventional antibodies. High sensitivity is attained by the design of an electrochemical magnetosensor with an amperometric readout, with all components of the assay mixed in one single step. Through this strategy, 10-fold higher sensitivity than a conventional sandwich ELISA was achieved. The assay reached a limit of detection of 2 and 15 pg/ml in PBST20 0.05% or serum, spiked with TES, respectively. These limits of detection are sufficient to detect clinically relevant toxocaral infections. Furthermore, our nanobodies showed no cross-reactivity with antigens from Ascaris lumbricoides or Ascaris suum. This is, to our knowledge, the most sensitive method to detect and quantify TES so far, and it has great potential to significantly improve diagnosis of HT. Moreover, the characteristics of our electrochemical assay are promising for the development of point-of-care diagnostic systems using nanobodies as a versatile and innovative alternative to antibodies.
The next step will be the validation of the assay in clinical and epidemiological contexts