14 research outputs found

    Surveying alignment-free features for Ortholog detection in related yeast proteomes by using supervised big data classifiers

    Background: The development of new ortholog detection algorithms and the improvement of existing ones are of major importance in functional genomics. We have previously introduced a successful supervised pairwise ortholog classification approach implemented in a big data platform that considered several pairwise protein features and the low ortholog pair ratios found between two annotated proteomes (Galpert, D et al., BioMed Research International, 2015). The supervised models were built and tested using a Saccharomycete yeast benchmark dataset proposed by Salichos and Rokas (2011). Although several pairwise protein features were combined in a supervised big data approach, they were all, to some extent, alignment-based, and the proposed algorithms were evaluated on a single test set. Here, we aim to evaluate the impact of alignment-free features on the performance of supervised models implemented in the Spark big data platform for pairwise ortholog detection in several related yeast proteomes. Results: Spark Random Forest and Decision Trees classifiers with oversampling and undersampling techniques, built either with alignment-based similarity measures alone or in combination with several alignment-free pairwise protein features, showed the highest classification performance for ortholog detection in three yeast proteome pairs. Although such supervised approaches outperformed traditional methods, there were no significant differences between the exclusive use of alignment-based similarity measures and their combination with alignment-free features, even within the twilight zone of the studied proteomes. Only when alignment-based and alignment-free features were combined in Spark Decision Trees with imbalance management could a higher success rate (98.71%) be achieved within the twilight zone for a yeast proteome pair that underwent a whole genome duplication. The feature selection study showed that alignment-based features were top-ranked for the best classifiers, while the runners-up were alignment-free features related to amino acid composition. Conclusions: The incorporation of alignment-free features in supervised big data models did not significantly improve ortholog detection in yeast proteomes compared with the classification quality achieved with alignment-based similarity measures alone. However, the similarity of their classification performance to that of traditional ortholog detection methods encourages the evaluation of other alignment-free protein pair descriptors in future research. Funding: This work was supported by a postdoctoral fellowship (SFRH/BPD/92978/2013) granted to GACh by the Portuguese Fundação para a Ciência e a Tecnologia (FCT). AA was supported by MarInfo – Integrated Platform for Marine Data Acquisition and Analysis (reference NORTE-01-0145-FEDER-000031), a project supported by the North Portugal Regional Operational Program (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF).
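
    As a rough illustration of what an alignment-free feature based on amino acid composition can look like, the sketch below turns each protein into a 20-dimensional composition vector and compares two proteins without aligning them. The function names and the choice of Euclidean distance are assumptions made for this example; the abstract does not specify how the paper's composition features were computed.

```python
import math
from collections import Counter

# The 20 standard amino acids, in a fixed order so vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Return the relative frequency of each standard amino acid in seq."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def composition_distance(seq_a, seq_b):
    """Euclidean distance between two composition vectors (hypothetical feature).

    No alignment is computed: residue order is ignored entirely.
    """
    comp_a = aa_composition(seq_a)
    comp_b = aa_composition(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(comp_a, comp_b)))
```

    Because residue order is ignored, two sequences that are permutations of each other get a distance of zero, which is one reason such features are usually combined with other descriptors rather than used alone.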

    Fragmentación primaria en la combustión en lecho fluidizado de pellets de serrín (Primary fragmentation in the fluidized bed combustion of sawdust pellets)

    Fluidized bed combustion appears to have good prospects among the technological options for generating energy from a fuel, given its flexibility regarding the fuels that can be used, its potential for clean and efficient operation, and the possibility of scale-up. This article presents the results obtained at pilot-plant scale from operating a fluidized bed reactor for the combustion of sawdust pellets, with a view to its application in making use of the solid residues from the lignocellulosic bioethanol production process.

    Big Data Supervised Pairwise Ortholog Detection in Yeasts

    Orthologs are genes in different species that evolved from a common ancestor. Ortholog detection is essential for studying phylogenies and for predicting the function of unknown genes. The scalability of gene (or protein) pairwise comparisons and of the classification process is a challenge because of the ever-increasing number of sequenced genomes. Ortholog detection algorithms based on sequence similarity alone tend to fail in classification, specifically in Saccharomycete yeasts with rampant paralogies and gene losses. In this book chapter, a new classification approach is proposed based on the combination of pairwise similarity measures in a decision system that considers the extreme imbalance between ortholog and non-ortholog pairs. Some new gene pair similarity measures are defined based on protein physicochemical profiles, gene pair membership to conserved regions in related genomes, and protein lengths. The efficiency and scalability of the calculation of these measures are analyzed in order to propose their implementation for big data. In conclusion, the evaluated supervised algorithms that manage big and imbalanced data showed high effectiveness in Saccharomycete yeast genomes.

    An Effective Big Data Supervised Imbalanced Classification Approach for Ortholog Detection in Related Yeast Species

    Orthology detection requires more effective scaling algorithms. In this paper, a set of gene pair features based on similarity measures (alignment scores, sequence length, gene membership to conserved regions, and physicochemical profiles) is combined in a supervised pairwise ortholog detection approach to improve effectiveness, considering the low ortholog ratios in relation to the possible pairwise comparisons between two genomes. In this scenario, big data supervised classifiers managing imbalance between ortholog and nonortholog pair classes allow for an effective scaling solution built from two genomes and extended to other genome pairs. The supervised approach was compared with the RBH, RSD, and OMA algorithms by using the following yeast genome pairs as benchmark datasets: Saccharomyces cerevisiae-Kluyveromyces lactis, Saccharomyces cerevisiae-Candida glabrata, and Saccharomyces cerevisiae-Schizosaccharomyces pombe. Because of the large amount of imbalanced data, the building and testing of the supervised model were only possible by using big data supervised classifiers managing imbalance. Evaluation metrics taking low ortholog ratios into account were applied. From the effectiveness perspective, MapReduce Random Oversampling combined with Spark SVM outperformed RBH, RSD, and OMA, probably because of the consideration of gene pair features beyond alignment similarities combined with the advances in big data supervised classification.
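
    To make the imbalance handling concrete, here is a minimal single-machine sketch of random oversampling together with one imbalance-aware metric. It is not the MapReduce or Spark implementation used in the paper; the function names are hypothetical, labels are assumed to be 1 (ortholog pair) and 0 (non-ortholog pair), and the geometric mean of the true-positive and true-negative rates is shown only as a common choice, since the abstract does not name the metrics that were applied.

```python
import math
import random


def random_oversample(pairs, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until classes balance."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes must be present")
    # Sample minority indices with replacement to fill the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [pairs[i] for i in idx], [labels[i] for i in idx]


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity.

    Unlike plain accuracy, this is 0 when every rare ortholog pair is missed.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)
```

    Oversampling should be applied to the training split only; evaluating on duplicated minority examples would inflate the reported scores.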

    Inclusión en Weka de filtros basados en conjuntos aproximados para bases desbalanceadas (Inclusion of filters in Weka based in rough sets for imbalanced bases)

    The class imbalance problem arises in datasets that contain a large number of examples of one type (majority class), while the number of examples of the opposite type is considerably smaller (minority class). This article briefly summarizes rough set theory based on similarity relations, for use in implementing three filters in Weka that address the class imbalance problem. The results on two datasets are then analyzed to validate the filters, with satisfactory results.

    Inclusión en Weka de filtros basados en conjuntos aproximados para bases desbalanceadas (Inclusion of filters in Weka based in rough sets for imbalanced bases)

    The class imbalance problem arises in datasets that contain a large number of examples of one type (majority class), while the number of examples of the opposite type is considerably smaller (minority class). In this paper, rough set theory based on similarity relations is briefly summarized for its use in implementing three Weka filters for class imbalance management. Finally, the results on two datasets are analyzed in order to validate the filters, with satisfactory results.

    An Effective Big Data Supervised Imbalanced Classification Approach for Ortholog Detection in Related Yeast Species

    Orthology detection requires more effective scaling algorithms. In this paper, a set of gene pair features based on similarity measures (alignment scores, sequence length, gene membership to conserved regions, and physicochemical profiles) is combined in a supervised pairwise ortholog detection approach to improve effectiveness, considering the low ortholog ratios in relation to the possible pairwise comparisons between two genomes. In this scenario, big data supervised classifiers managing imbalance between ortholog and nonortholog pair classes allow for an effective scaling solution built from two genomes and extended to other genome pairs. The supervised approach was compared with the RBH, RSD, and OMA algorithms by using the following yeast genome pairs as benchmark datasets: Saccharomyces cerevisiae-Kluyveromyces lactis, Saccharomyces cerevisiae-Candida glabrata, and Saccharomyces cerevisiae-Schizosaccharomyces pombe. Because of the large amount of imbalanced data, the building and testing of the supervised model were only possible by using big data supervised classifiers managing imbalance. Evaluation metrics taking low ortholog ratios into account were applied. From the effectiveness perspective, MapReduce Random Oversampling combined with Spark SVM outperformed RBH, RSD, and OMA, probably because of the consideration of gene pair features beyond alignment similarities combined with the advances in big data supervised classification. Funding: Portuguese Foundation for Science and Technology SFRH/BPD/92978/2013; European Union (EU); national funds through FCT PEst-C/MAR/LA0015/2013, PTDC/AAC-AMB/121301/2010, FCOMP-01-0124-FEDER-019490; Spanish Government TIN2014-57251-P; Regional Andalusian Research P11-TIC-7765, P10-TIC-685

    Algunas aplicaciones de la estructura booleana del código genético (Some applications of the Boolean structure of the genetic code)

    The Boolean structures of the genetic code are minimal, highly simplified mathematical models that help us better understand the underlying logic of the genetic code. More specifically, these structures reflect a strong connection between the orderings of the genetic code and the physicochemical properties of the amino acids. In this article we present two applications of this algebraic structure to typical bioinformatics problems. The first is the classification of the mutations of a given protein. The second is a particular case of the secondary structure prediction problem. We also use statistical and artificial intelligence techniques to solve them.

    Emerging Computational Approaches for Antimicrobial Peptide Discovery

    In the last two decades, many reports have addressed the application of artificial intelligence (AI) in the search for and design of antimicrobial peptides (AMPs). AI has been represented by machine learning (ML) algorithms that use sequence-based features for the discovery of new peptidic scaffolds with promising biological activity. From an AI perspective, evolutionary algorithms have also been applied to the rational generation of peptide libraries aimed at the optimization/design of AMPs. However, the literature has paid little attention to other emerging non-conventional in silico approaches for the search/design of such bioactive peptides. Thus, the first motivation here is to bring up some non-standard peptide features that have been used to build classical ML predictive models. Secondly, it is valuable to highlight emerging ML algorithms and alternative computational tools to predict/design AMPs as well as to explore their chemical space. Another point worth mentioning is the recent application of evolutionary algorithms that actually simulate sequence evolution to both the generation of diversity-oriented peptide libraries and the optimization of hit peptides. Last but not least, we include some new considerations in proteogenomic analyses currently incorporated into the computational workflow for unravelling AMPs in natural sources.