Surveying alignment-free features for Ortholog detection in related yeast proteomes by using supervised big data classifiers
Abstract
Background: The development of new ortholog detection algorithms and the improvement of existing ones are of
major importance in functional genomics. We have previously introduced a successful supervised pairwise ortholog
classification approach implemented in a big data platform that considered several pairwise protein features and the
low ortholog pair ratios found between two annotated proteomes (Galpert, D et al., BioMed Research International,
2015). The supervised models were built and tested using a Saccharomycete yeast benchmark dataset proposed by
Salichos and Rokas (2011). Although several pairwise protein features were combined in a supervised big data approach,
they were all, to some extent, alignment-based features, and the proposed algorithms were evaluated on a single test
set. Here, we aim to evaluate the impact of alignment-free features on the performance of supervised models
implemented in the Spark big data platform for pairwise ortholog detection in several related yeast proteomes.
Results: The Spark Random Forest and Decision Trees with oversampling and undersampling techniques, and built
with only alignment-based similarity measures or combined with several alignment-free pairwise protein features
showed the highest classification performance for ortholog detection in three yeast proteome pairs. Although such
supervised approaches outperformed traditional methods, there were no significant differences between the exclusive
use of alignment-based similarity measures and their combination with alignment-free features, even within the
twilight zone of the studied proteomes. Only when alignment-based and alignment-free features were combined in
Spark Decision Trees with imbalance management could a higher success rate (98.71%) within the twilight zone be
achieved for a yeast proteome pair that underwent a whole genome duplication. The feature selection study showed
that alignment-based features were top-ranked for the best classifiers while the runners-up were alignment-free
features related to amino acid composition.
Conclusions: The incorporation of alignment-free features in supervised big data models did not significantly improve
ortholog detection in yeast proteomes compared with the classification quality achieved with alignment-based
similarity measures alone. However, the similarity of their classification performance to that of traditional ortholog detection
methods encourages the evaluation of other alignment-free protein pair descriptors in future research.
Funding: This work was supported by the following financial sources: Postdoc
fellowship (SFRH/BPD/92978/2013) granted to GACh by the Portuguese
Fundação para a Ciência e a Tecnologia (FCT). AA was supported by the
MarInfo – Integrated Platform for Marine Data Acquisition and Analysis
(reference NORTE-01-0145-FEDER-000031), a project supported by the
North Portugal Regional Operational Program (NORTE 2020), under the
PORTUGAL 2020 Partnership Agreement, through the European Regional
Development Fund (ERDF).
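As an illustration of what an alignment-free feature related to amino acid composition can look like, the following is a minimal sketch, not the authors' implementation. The function names, the choice of the 20 standard residues, and the use of Euclidean distance are all assumptions made for this example; the abstract does not specify how the composition features were computed.

```python
from collections import Counter
import math

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq: str) -> list[float]:
    """Return the fraction of each standard amino acid in a protein sequence."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No recognizable residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def composition_distance(seq_a: str, seq_b: str) -> float:
    """Euclidean distance between two composition vectors (hypothetical pair feature)."""
    va, vb = composition(seq_a), composition(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
```

Because the feature depends only on residue counts, it needs no alignment: two sequences with the same residues in a different order get a distance of zero, which is also why such a feature may carry less signal than an alignment score.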
Fragmentación primaria en la combustión en lecho fluidizado de pellets de serrín (Primary fragmentation in the fluidized bed combustion of sawdust pellets)
Fluidized bed combustion appears to have good prospects among the technological options for generating energy from a fuel, given its flexibility regarding the fuels that can be used, its potential for clean and efficient operation, and the possibility of changing scale. This article presents the results obtained at pilot-plant scale from operating a fluidized bed reactor for the combustion of sawdust pellets, with a view to its application in making use of the solid residues from the production of lignocellulosic bioethanol
Big Data Supervised Pairwise Ortholog Detection in Yeasts
Orthologs are genes in different species that evolved from a common ancestor. Ortholog detection is essential to study phylogenies and to predict the function of unknown genes. The scalability of gene (or protein) pairwise comparisons and of the classification process is a challenge because of the ever-increasing number of sequenced genomes. Ortholog detection algorithms based only on sequence similarity tend to fail in classification, particularly in Saccharomycete yeasts with rampant paralogies and gene losses. In this book chapter, a new classification approach is proposed based on the combination of pairwise similarity measures in a decision system that considers the extreme imbalance between ortholog and non-ortholog pairs. Some new gene pair similarity measures are defined based on protein physicochemical profiles, gene pair membership in conserved regions of related genomes, and protein lengths. The efficiency and scalability of the calculation of these measures are analyzed to propose their implementation for big data. In conclusion, the evaluated supervised algorithms that manage big and imbalanced data showed high effectiveness in Saccharomycete yeast genomes
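To make the idea of a similarity measure based on protein lengths concrete, here is a minimal sketch under stated assumptions. The chapter's actual definition is not given in this abstract; the normalization by the longer sequence and the function name `length_similarity` are hypothetical choices for illustration only.

```python
def length_similarity(len_a: int, len_b: int) -> float:
    """Hypothetical length-based similarity in [0, 1] for a protein pair.

    Returns 1.0 for equal lengths and approaches 0.0 as the lengths diverge.
    This normalization is an assumption, not the measure defined in the chapter.
    """
    if len_a <= 0 or len_b <= 0:
        raise ValueError("protein lengths must be positive")
    return 1.0 - abs(len_a - len_b) / max(len_a, len_b)
```

A measure of this kind costs constant time per pair, so it scales easily to all pairwise comparisons between two proteomes, unlike alignment scores.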
Inclusión en Weka de filtros basados en conjuntos aproximados para bases desbalanceadas (Inclusion of filters in Weka based in rough sets for imbalanced bases)
The class imbalance problem arises in datasets that contain a large amount of data of one type (the majority class) and considerably less data of the opposite type (the minority class). This article briefly summarizes rough set theory based on similarity relations, which is used to implement three filters in Weka for handling class imbalance. The results on two datasets are then analyzed to validate the filters, with satisfactory results
An Effective Big Data Supervised Imbalanced Classification Approach for Ortholog Detection in Related Yeast Species
Orthology detection requires more effective scaling algorithms. In this paper, a set of gene pair features based on similarity
measures (alignment scores, sequence length, gene membership to conserved regions, and physicochemical profiles) are combined
in a supervised pairwise ortholog detection approach to improve effectiveness considering low ortholog ratios in relation to the
possible pairwise comparison between two genomes. In this scenario, big data supervised classifiers managing imbalance between
ortholog and nonortholog pair classes allow for an effective scaling solution built from two genomes and extended to other
genome pairs. The supervised approach was compared with RBH, RSD, and OMA algorithms by using the following yeast genome
pairs: Saccharomyces cerevisiae-Kluyveromyces lactis, Saccharomyces cerevisiae-Candida glabrata, and Saccharomyces cerevisiae-Schizosaccharomyces pombe as benchmark datasets. Because of the large amount of imbalanced data, the building and testing of the
supervised model were only possible by using big data supervised classifiers managing imbalance. Evaluation metrics taking low
ortholog ratios into account were applied. From the effectiveness perspective, MapReduce Random Oversampling combined with
Spark SVM outperformed RBH, RSD, and OMA, probably because of the consideration of gene pair features beyond alignment
similarities combined with the advances in big data supervised classification.
Funding:
Portuguese Foundation for Science and Technology: SFRH/BPD/92978/2013
European Union (EU); national funds through FCT: PEst-C/MAR/LA0015/2013, PTDC/AAC-AMB/121301/2010, FCOMP-01-0124-FEDER-019490
Spanish Government: TIN2014-57251-P
Regional Andalusian Research: P11-TIC-7765, P10-TIC-685
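The random oversampling idea named in the abstract above can be sketched on a single machine as follows. This is only an illustration of the general technique, not the MapReduce implementation used in the paper; the function name, the label convention (1 for ortholog pairs), and the target of a fully balanced class ratio are assumptions.

```python
import random


def random_oversample(pairs, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority examples until both classes are the same size.

    Assumption: binary labels, with `minority_label` marking the rare class
    (here, ortholog pairs). Returns shuffled copies of the inputs.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority examples to oversample")
    # Number of extra minority copies needed to match the majority class.
    deficit = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(deficit)]
    indices = list(range(len(labels))) + extra
    rng.shuffle(indices)
    return [pairs[i] for i in indices], [labels[i] for i in indices]
```

Oversampling is applied to the training set only; evaluating on resampled data would hide the low ortholog ratio that the imbalance-aware metrics are meant to account for.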
Algunas aplicaciones de la estructura booleana del código genético (Some applications of the Boolean structure of the genetic code)
The Boolean structures of the genetic code are minimal, highly simplified mathematical models that help us better understand the underlying logic of the genetic code. More specifically, these structures reflect a strong connection between the orders of the genetic code and the physicochemical properties of the amino acids. In this article we present two applications of this algebraic structure to typical bioinformatics problems. The first is the classification of the mutations of a given protein. The second is a particular case of the secondary structure prediction problem. We also use statistical and artificial intelligence techniques to solve them
Emerging Computational Approaches for Antimicrobial Peptide Discovery
In the last two decades, many reports have addressed the application of artificial intelligence (AI) in the search for and design of antimicrobial peptides (AMPs). AI has been represented by machine learning (ML) algorithms that use sequence-based features for the discovery of new peptidic scaffolds with promising biological activity. From the AI perspective, evolutionary algorithms have also been applied to the rational generation of peptide libraries aimed at the optimization/design of AMPs. However, the literature has scarcely addressed other emerging non-conventional in silico approaches for the search/design of such bioactive peptides. Thus, the first motivation here is to bring up some non-standard peptide features that have been used to build classical ML predictive models. Secondly, it is valuable to highlight emerging ML algorithms and alternative computational tools to predict/design AMPs as well as to explore their chemical space. Another point worth mentioning is the recent application of evolutionary algorithms that actually simulate sequence evolution to both the generation of diversity-oriented peptide libraries and the optimization of hit peptides. Last but not least, some new considerations in proteogenomic analyses currently incorporated into the computational workflow for unravelling AMPs in natural sources are included here