PhyloPat: an updated version of the phylogenetic pattern database contains gene neighborhood
Phylogenetic patterns show the presence or absence of certain genes in a set of full genomes derived from different species. They can also be used to determine sets of genes that occur only in certain evolutionary branches. Previously, we presented a database named PhyloPat which allows the complete Ensembl gene database to be queried using phylogenetic patterns. Here, we describe an updated version of PhyloPat which can be queried by an improved web server. We used a single linkage clustering algorithm to create 241 697 phylogenetic lineages, using all the orthologies provided by Ensembl v49. PhyloPat offers the possibility of querying with binary phylogenetic patterns or regular expressions, or through a phylogenetic tree of the 39 included species. Users can also input a list of Ensembl, EMBL, EntrezGene or HGNC IDs to check which phylogenetic lineage any gene belongs to. A link to the FatiGO web interface has been incorporated in the HTML output. For each gene, the surrounding genes on the chromosome, color coded according to their phylogenetic lineage, can be viewed, as well as FASTA files of the peptide sequences of each lineage. Furthermore, lists of omnipresent, polypresent, oligopresent and anticorrelating genes have been included. PhyloPat is freely available at http://www.cmbi.ru.nl/phylopat
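To make the idea of querying with binary phylogenetic patterns or regular expressions concrete, here is a minimal sketch. It assumes each lineage is stored as a string of 0/1 characters, one per species in a fixed order. The species list, lineage names and the `query_patterns` helper are hypothetical and are not taken from PhyloPat itself.

```python
import re

# Hypothetical species order and lineage patterns (illustrative only, not PhyloPat data).
SPECIES = ["yeast", "worm", "fly", "zebrafish", "mouse", "human"]

LINEAGE_PATTERNS = {
    "lineage_A": "111111",  # present in every species
    "lineage_B": "000111",  # present in zebrafish, mouse and human only
    "lineage_C": "000001",  # present in human only
    "lineage_D": "011000",  # present in worm and fly only
}


def query_patterns(patterns, regex):
    """Return the sorted lineage names whose whole pattern string matches the regex."""
    compiled = re.compile(regex)
    return sorted(name for name, pat in patterns.items() if compiled.fullmatch(pat))


# An exact binary pattern is just a regex without wildcards.
vertebrate_only = query_patterns(LINEAGE_PATTERNS, r"000111")
# Absent in the first species, anything in the remaining ones.
absent_in_yeast = query_patterns(LINEAGE_PATTERNS, r"0[01]*")
# Present in the last species, anything in the first five.
in_human = query_patterns(LINEAGE_PATTERNS, r"[01]{5}1")
```

Matching the full string (rather than searching for a substring) keeps each pattern position tied to one species, which is what makes a branch-specific query meaningful.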
Development and validation of a novel stemness-related prognostic model for neuroblastoma using integrated machine learning and bioinformatics analyses
© 2024 AME Publishing Company. All rights reserved. Background: Neuroblastoma (NB) is a common solid tumor in children, with a dismal prognosis in high-risk cases. Despite advancements in NB treatment, the clinical need for precise prognostic models remains critical, particularly to address the heterogeneity of cancer stemness which plays a pivotal role in tumor aggressiveness and patient outcomes. By utilizing machine learning (ML) techniques, we aimed to explore the cancer stemness features in NB and identify stemness-related hub genes for future investigation and potential targeted therapy. Methods: The public dataset GSE49710 was employed as the training set to acquire gene expression data and NB sample information, including age, stage, MYCN amplification status, and survival. The messenger RNA (mRNA) expression-based stemness index (mRNAsi) was calculated and patients were grouped according to their mRNAsi value. Stemness-related hub genes were identified from the differentially expressed genes (DEGs) to construct a gene signature. This was followed by evaluating the relationship between cancer stemness and the NB immune microenvironment, and the development of a predictive nomogram. We assessed the prognostic outcomes including overall survival (OS) and event-free survival, employing machine learning methods to measure predictive accuracy through concordance indices and validation in an independent cohort E-MTAB-8248. Results: Based on mRNAsi, we categorized NB patients into two groups to explore the association between varying levels of stemness and their clinical outcomes. High mRNAsi was linked to the advanced International Neuroblastoma Staging System (INSS) stage, amplified MYCN, and older age. High mRNAsi patients had a significantly poorer prognosis than low mRNAsi cases. According to the multivariate Cox analysis, the mRNAsi was an independent risk factor of prognosis in NB patients.
After least absolute shrinkage and selection operator (LASSO) regression analysis, four key genes (ERCC6L, DUXAP10, NCAN, DIRAS3) most related to mRNAsi scores were discovered and a risk model was built. Our model demonstrated a significant prognostic capacity with hazard ratios (HR) ranging from 18.96 to 41.20, P values below 0.0001, and an area under the receiver operating characteristic curve (AUC) of 0.918 in the training set, suggesting high predictive accuracy which was further confirmed by external verification. Individuals with a low four-gene signature score had a favorable outcome and better immune responses. Finally, a nomogram for clinical practice was constructed by integrating the four-gene signature and INSS stage. Conclusions: Our findings confirm the influence of CSC features in NB prognosis. The newly developed NB stemness-related four-gene prognostic signature could facilitate prognostic prediction, and the identified hub genes may serve as promising targets for individualized treatments
Automatically extracting functionally equivalent proteins from SwissProt
In summary, FOSTA provides an automated analysis of annotations in UniProtKB/Swiss-Prot to enable groups of proteins already annotated as functionally equivalent, to be extracted. Our results demonstrate that the vast majority of UniProtKB/Swiss-Prot functional annotations are of high quality, and that FOSTA can interpret annotations successfully. Where FOSTA is not successful, we are able to highlight inconsistencies in UniProtKB/Swiss-Prot annotation. Most of these would have presented equal difficulties for manual interpretation of annotations. We discuss limitations and possible future extensions to FOSTA, and recommend changes to the UniProtKB/Swiss-Prot format, which would facilitate text-mining of UniProtKB/Swiss-Prot
A mechanistic model for anaerobic phototrophs in domestic wastewater applications: photo-anaerobic model (PAnM)
Purple phototrophic bacteria (PPB) have been recently proposed as a key potential mechanism for accumulative biotechnologies for wastewater treatment with total nutrient recovery, low greenhouse gas emissions, and a neutral to positive energy balance. Purple phototrophic bacteria have a complex metabolism which can be regulated for process control and optimization. Since microbial processes governing PPB metabolism differ from traditional processes used for wastewater treatment (e.g., aerobic and anaerobic functional groups in ASM and ADM1), a model basis has to be developed to be used as a framework for further detailed modelling under specific situations. This work presents a mixed population phototrophic model for domestic wastewater treatment in anaerobic conditions. The model includes photoheterotrophy, which is divided into acetate consumption and other organics consumption, chemoheterotrophy (including simplified fermentation and anaerobic oxidation) and photoautotrophy (using hydrogen as an electron donor), as microbial processes, as well as hydrolysis and biomass decay as biochemical processes, and is single-biomass based. The main processes have been evaluated through targeted batch experiments, and the key kinetic and stoichiometric parameters have been determined. The process was assessed by analyzing a continuous reactor simulation scenario within a long-term wastewater treatment system in a photo-anaerobic membrane bioreactor
The ReIMAGINE Multimodal Warehouse: Using Artificial Intelligence for Accurate Risk Stratification of Prostate Cancer
Introduction. Prostate cancer (PCa) is the most frequent cancer diagnosis in men worldwide. Our ability to identify those men whose cancer will decrease their lifespan and/or quality of life remains poor. The ReIMAGINE Consortium has been established to improve PCa diagnosis.
Materials and methods. MRI will likely become the future cornerstone of the risk-stratification process for men at risk of early prostate cancer. We will, for the first time, be able to combine the underlying molecular changes in PCa with the state-of-the-art imaging. ReIMAGINE Screening invites men for MRI and PSA evaluation. ReIMAGINE Risk includes men at risk of prostate cancer based on MRI, and includes biomarker testing.
Results. Baseline clinical information, genomics, blood, urine, fresh prostate tissue samples, digital pathology and radiomics data will be analysed. Data will be de-identified, stored with correlated mpMRI disease endotypes and linked with long term follow-up outcomes in an instance of the Philips Clinical Data Lake, consisting of cloud-based software. The ReIMAGINE platform includes application programming interfaces and a user interface that allows users to browse data, select cohorts, manage users and access rights, query data, and more. Connection to analytics tools such as Python allows statistical and stratification method pipelines to run profiling regression analyses.
Discussion. The ReIMAGINE Multimodal Warehouse comprises a unique data source for PCa research, to improve risk stratification for PCa and inform clinical practice. The de-identified dataset characterized by clinical, imaging, genomics and digital pathology PCa patient phenotypes will be a valuable resource for the scientific and medical community
PhyloPat: phylogenetic pattern analysis of eukaryotic genes
BACKGROUND: Phylogenetic patterns show the presence or absence of certain genes or proteins in a set of species. They can also be used to determine sets of genes or proteins that occur only in certain evolutionary branches. Phylogenetic pattern analysis has routinely been applied to protein databases such as COG and OrthoMCL, but not to gene databases. Here we present a tool named PhyloPat which allows the complete Ensembl gene database to be queried using phylogenetic patterns. DESCRIPTION: PhyloPat is an easy-to-use webserver, which can be used to query the orthologies of all complete genomes within the EnsMart database using phylogenetic patterns. This enables the determination of sets of genes that occur only in certain evolutionary branches or even single species. In total, we found 446,825 genes and 3,164,088 orthologous relationships within the EnsMart v40 database. We used a single linkage clustering algorithm to create 147,922 phylogenetic lineages, using every one of the orthologies provided by Ensembl. PhyloPat provides the possibility of querying with either binary phylogenetic patterns (created by checkboxes) or regular expressions. Specific branches of a phylogenetic tree of the 21 included species can be selected to create a branch-specific phylogenetic pattern. Users can also input a list of Ensembl or EMBL IDs to check which phylogenetic lineage any gene belongs to. The output can be saved in HTML, Excel or plain text format for further analysis. A link to the FatiGO web interface has been incorporated in the HTML output, creating easy access to functional information. Finally, lists of omnipresent, polypresent and oligopresent genes have been included. CONCLUSION: PhyloPat is the first tool to combine complete genome information with phylogenetic pattern querying. Since we used the orthologies generated by the accurate pipeline of Ensembl, the obtained phylogenetic lineages are reliable.
The completeness and reliability of these phylogenetic lineages will further increase with the addition of newly found orthologous relationships within each new Ensembl release
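The single linkage clustering step described above can be sketched as follows: any two genes connected by a chain of orthology pairs end up in the same lineage, and each lineage is then summarised as a presence/absence string over the species. This is a sketch under stated assumptions, not the PhyloPat implementation. The union-find approach, the gene identifiers, the species mapping and both function names are hypothetical.

```python
from collections import defaultdict


def single_linkage_lineages(orthology_pairs):
    """Group genes so that any chain of orthology pairs joins them into one lineage."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving keeps trees shallow
            gene = parent[gene]
        return gene

    for a, b in orthology_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for gene in parent:
        clusters[find(gene)].add(gene)
    # Sort for a deterministic result: genes within a lineage, then lineages.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


def lineage_pattern(lineage, species_of, species_order):
    """Build the binary presence/absence pattern of one lineage."""
    present = {species_of[gene] for gene in lineage}
    return "".join("1" if s in present else "0" for s in species_order)


# Hypothetical toy data: gene IDs and their species.
pairs = [("hs_g1", "mm_g1"), ("mm_g1", "dr_g1"), ("hs_g2", "mm_g2")]
species_of = {
    "hs_g1": "human", "hs_g2": "human",
    "mm_g1": "mouse", "mm_g2": "mouse",
    "dr_g1": "zebrafish",
}
order = ["zebrafish", "mouse", "human"]

lineages = single_linkage_lineages(pairs)
patterns = [lineage_pattern(l, species_of, order) for l in lineages]
```

Here `hs_g1` and `dr_g1` land in the same lineage even though no direct pair links them, which is the defining behaviour of single linkage.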
InParanoid 7: new algorithms and tools for eukaryotic orthology analysis
The InParanoid project gathers proteomes of completely sequenced eukaryotic species plus Escherichia coli and calculates pairwise ortholog relationships among them. The new release 7.0 of the database has grown by an order of magnitude over the previous version and now includes 100 species and their collective 1.3 million proteins organized into 42.7 million pairwise ortholog groups. The InParanoid algorithm itself has been revised and is now both more specific and sensitive. Based on results from our recent benchmarking of low-complexity filters in homology assignment, a two-pass BLAST approach was developed that makes use of high-precision compositional score matrix adjustment, but avoids the alignment truncation that sometimes follows. We have also updated the InParanoid web site (http://InParanoid.sbc.su.se). Several features have been added, the response times have been improved and the site now sports a new, clearer look. As the number of ortholog databases has grown, it has become difficult to compare among these resources due to a lack of standardized source data and incompatible representations of ortholog relationships. To facilitate data exchange and comparisons among ortholog databases, we have developed and are making available two XML schemas: SeqXML for the input sequences and OrthoXML for the output ortholog clusters
Testing statistical significance scores of sequence comparison methods with structure similarity
BACKGROUND: In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is not only determined by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences. RESULTS: All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH especially has very high scores. CONCLUSION: The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations give generally better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons
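As an illustration of what a Monte-Carlo Z-score for an alignment involves, the sketch below scores a pair of sequences with a toy Smith-Waterman implementation, rescores against shuffled copies of one sequence, and reports how many standard deviations the real score lies above the shuffled mean. The scoring scheme (match 2, mismatch -1, linear gap -2), the number of shuffles and the choice to shuffle only the second sequence are assumptions made for this example. They are not the settings used by CluSTr, Protein World or the paper.

```python
import random
import statistics


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (toy scoring scheme)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def monte_carlo_z_score(a, b, n_shuffles=200, seed=0):
    """Z-score of the real alignment score against scores for shuffled copies of b."""
    rng = random.Random(seed)
    observed = smith_waterman_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        null_scores.append(smith_waterman_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    stdev = statistics.pstdev(null_scores)
    if stdev == 0:
        # Degenerate null distribution: no spread to normalise by.
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / stdev


# Hypothetical peptide used only to exercise the functions.
seq = "MKTAYIAKQRQISFVKSHFSRQ"
z_identical = monte_carlo_z_score(seq, seq)
```

Each Z-score requires one alignment per shuffle, so the cost grows linearly with `n_shuffles`. That is the sense in which the abstract calls the Z-score compute-intensive compared with an e-value.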