28 research outputs found
Merging 1D and 3D genomic information: Challenges in modelling and validation
Genome organization in eukaryotes during interphase stems from the delicate balance between non-random correlations present in the DNA polynucleotide linear sequence and the physico-chemical reactions which continuously shape the form and structure of DNA and chromatin inside the nucleus of the cell. It is now clear that these mechanisms have a key role in important processes like gene regulation, yet the detailed ways they act simultaneously and, eventually, come to influence each other even across very different length-scales remain largely unexplored. In this paper, we recapitulate some of the main results concerning gene regulatory and physical mechanisms, in relation to the information encoded in the 1D sequence and the 3D folding structure of DNA. In particular, we stress how reciprocal crossfeeding between 1D and 3D models may provide original insight into how these complex processes work and influence each other. This article is part of a Special Issue entitled: Transcriptional Profiles and Regulatory Gene Networks, edited by Dr. Federico Manuel Giorgi and Dr. Shaun Mahony.
Intraspecies characterization of bacteria via evolutionary modeling of protein domains
The ability to detect and characterize bacteria within a biological sample is crucial for the monitoring of infections and epidemics, as well as for the study of human health and its relationship with commensal microorganisms. To this end, a commonly used technique is 16S rRNA gene targeted sequencing. PCR-amplified 16S sequences derived from the sample of interest are usually clustered into so-called Operational Taxonomic Units (OTUs) based on pairwise similarities. Then, representative OTU sequences are compared with reference (human-made) databases to derive their phylogeny and taxonomic classification. Here, we propose a new reference-free approach to define the phylogenetic distance between bacteria based on protein domains, which are the evolving units of proteins. We extract the protein domain profiles of 3368 bacterial genomes and we use an ecological approach to model their Relative Species Abundance distribution. Based on the model parameters, we then derive a new measurement of phylogenetic distance. Finally, we show that such model-based distance is capable of detecting differences between bacteria in cases in which the 16S rRNA-based method fails, providing a possibly complementary approach, which is particularly promising for the analysis of bacterial populations measured by shotgun sequencing.
Automated Prediction of the Response to Neoadjuvant Chemoradiotherapy in Patients Affected by Rectal Cancer
Simple Summary: Colorectal cancer is the second leading cancer by number of deaths, after lung cancer, and the third by number of new cases, after breast and lung cancer. The correct and rapid identification of the cancer regions (i.e., their segmentation) is a fundamental task for correct patient diagnosis. In this study, we propose a novel automated pipeline for the segmentation of MRI scans of patients with locally advanced rectal cancer (LARC) in order to predict the response to neoadjuvant chemoradiotherapy (nCRT) using radiomic features. This study involved the retrospective analysis of T2-weighted MRI scans of 43 patients affected by LARC. The segmentation of tumor areas was on par with or better than the state-of-the-art results, but required smaller sample sizes. The analysis of radiomic features allowed us to predict the tumor regression grade (TRG) score, in agreement with the state-of-the-art results. Background: Rectal cancer is a malignant neoplasm of the large intestine resulting from the uncontrolled proliferation of cells in the rectal tract. Predicting the pathologic response to neoadjuvant chemoradiotherapy from an MRI primary staging scan in patients affected by locally advanced rectal cancer (LARC) could lead to significant improvement in the survival and quality of life of the patients. In this study, we evaluated the possibility of automating this estimation from a primary staging MRI scan, using a fully automated artificial intelligence-based model for the segmentation and consequent characterization of the tumor areas using radiomic features. The TRG score was used to evaluate the clinical outcome. Methods: Forty-three patients under treatment in the IRCCS Sant'Orsola-Malpighi Polyclinic were retrospectively selected for the study; a U-Net model was trained for the automated segmentation of the tumor areas; the radiomic features were collected and used to predict the tumor regression grade (TRG) score.
Results: The segmentation of tumor areas outperformed the state-of-the-art results in terms of the Dice score coefficient or was comparable to them, with the added advantage of considering mucinous cases. Analysis of the radiomic features extracted from the lesion areas allowed us to predict the TRG score, with the results agreeing with the state-of-the-art results. Conclusions: The results obtained regarding TRG prediction using the proposed fully automated pipeline support its possible use as a viable decision support system for radiologists in clinical practice.
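The Dice score coefficient mentioned in the Results measures the overlap between a predicted segmentation mask and a reference mask as 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that definition on small binary masks; the function name and the toy masks are illustrative assumptions and are not taken from the study's pipeline.

```python
def dice_coefficient(predicted, reference):
    """Dice overlap between two binary masks given as nested lists of 0/1."""
    pred_flat = [v for row in predicted for v in row]
    ref_flat = [v for row in reference for v in row]
    if len(pred_flat) != len(ref_flat):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, r in zip(pred_flat, ref_flat) if p and r)
    total = sum(1 for p in pred_flat if p) + sum(1 for r in ref_flat if r)
    # Convention (an assumption here): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 predicted pixels, 3 reference pixels, 2 shared.
predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
reference_mask = [[1, 0, 0],
                  [0, 1, 1]]
print(dice_coefficient(predicted_mask, reference_mask))
```

A value of 1 indicates identical masks and 0 indicates no overlap.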
Time-series sewage metagenomics distinguishes seasonal, human-derived and environmental microbial communities potentially allowing source-attributed surveillance
Sewage metagenomics has risen to prominence in urban population surveillance of pathogens and antimicrobial resistance (AMR). Unknown species with similarity to known genomes cause database bias in reference-based metagenomics. To improve surveillance, we seek to recover sewage genomes and develop a quantification and correlation workflow for these genomes and AMR over time. We use longitudinal sewage sampling in seven treatment plants from five major European cities to explore the utility of catch-all sequencing of these population-level samples. Using metagenomic assembly methods, we recover 2332 metagenome-assembled genomes (MAGs) from prokaryotic species, 1334 of which were previously undescribed. These genomes account for ~69% of sequenced DNA and provide insight into sewage microbial dynamics. Rotterdam (Netherlands) and Copenhagen (Denmark) show strong seasonal microbial community shifts, while Bologna and Rome (Italy) and Budapest (Hungary) have occasional blooms of Pseudomonas-dominated communities, accounting for up to ~95% of sample DNA. Seasonal shifts and blooms present challenges for effective sewage surveillance. We find that bacteria of known shared origin, like human gut microbiota, form communities, suggesting the potential for source-attributing novel species and their ARGs through network community analysis. This could significantly improve AMR tracking in urban environments.
Effectiveness of Radiomic ZOT Features in the Automated Discrimination of Oncocytoma from Clear Cell Renal Cancer
Background: Benign renal tumors, such as renal oncocytoma (RO), can be erroneously
diagnosed as malignant renal cell carcinomas (RCC), because of their similar imaging features.
Computer-aided systems leveraging radiomic features can be used to better discriminate benign renal
tumors from the malignant ones. The purpose of this work was to build a machine learning model to
distinguish RO from clear cell RCC (ccRCC). Method: We collected CT images of 77 patients, with
30 cases of RO (39%) and 47 cases of ccRCC (61%). Radiomic features were extracted both from the
tumor volumes identified by the clinicians and from the tumor’s zone of transition (ZOT). We used a
genetic algorithm to perform feature selection, identifying the most descriptive set of features for the
tumor classification. We built a decision tree classifier to distinguish between ROs and ccRCCs. We
proposed two versions of the pipeline: in the first one, the feature selection was performed before the
splitting of the data, while in the second one, the feature selection was performed after, i.e., on the
training data only. We evaluated the effectiveness of the two pipelines in cancer classification. Results:
The ZOT features were found to be the most predictive by the genetic algorithm. The pipeline
with the feature selection performed on the whole dataset obtained an average ROC AUC score of
0.87 ± 0.09. The second pipeline, in which the feature selection was performed on the training data
only, obtained an average ROC AUC score of 0.62 ± 0.17. Conclusions: The obtained results confirm
the effectiveness of ZOT radiomic features in capturing the renal tumor characteristics. We showed
that there is a significant difference in the performance of the two proposed pipelines, highlighting
how some already published radiomic analyses could be too optimistic about the real generalization
capabilities of the models.
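The gap between the two pipelines reflects a general effect: selecting features on the whole dataset lets information from the test samples leak into the model. The sketch below illustrates the mechanism on pure-noise data, where no feature is truly predictive. It is not the study's method: the genetic algorithm is replaced by a simple class-mean ranking, the decision tree by a nearest-centroid classifier, and ROC AUC by accuracy, all chosen only to keep the example dependency-free. Every name and parameter is hypothetical.

```python
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def select_top_features(X, y, k):
    """Rank features by the absolute difference between the two class means."""
    scores = []
    for j in range(len(X[0])):
        m0 = mean(row[j] for row, label in zip(X, y) if label == 0)
        m1 = mean(row[j] for row, label in zip(X, y) if label == 1)
        scores.append((abs(m1 - m0), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, features):
    """Fit class centroids on the training rows, score on the test rows."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(row[j] for row in rows) for j in features]
    correct = 0
    for row, t in zip(X_test, y_test):
        dist = {label: sum((row[j] - c) ** 2 for j, c in zip(features, centroid))
                for label, centroid in centroids.items()}
        correct += min(dist, key=dist.get) == t
    return correct / len(y_test)


def cross_validated_accuracy(X, y, k_features, n_folds, select_on_all_data):
    """K-fold accuracy with feature selection done before or inside the split."""
    n = len(y)
    global_features = select_top_features(X, y, k_features) if select_on_all_data else None
    scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))
        X_train = [X[i] for i in range(n) if i not in test_idx]
        y_train = [y[i] for i in range(n) if i not in test_idx]
        X_test = [X[i] for i in sorted(test_idx)]
        y_test = [y[i] for i in sorted(test_idx)]
        if select_on_all_data:
            features = global_features  # leaky: test rows influenced the choice
        else:
            features = select_top_features(X_train, y_train, k_features)
        scores.append(nearest_centroid_accuracy(X_train, y_train, X_test, y_test, features))
    return mean(scores)


# Pure-noise data: labels carry no information about the features.
rng = random.Random(0)
n_samples, n_features = 80, 500
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]

leaky = cross_validated_accuracy(X, y, 10, 5, select_on_all_data=True)
honest = cross_validated_accuracy(X, y, 10, 5, select_on_all_data=False)
print("selection on all data:", leaky)
print("selection on training folds only:", honest)
```

Because the labels are independent of the features, an unbiased estimate should sit near chance level; any score well above that in the first configuration comes from the selection step having seen the test rows.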
DNA sequence analysis: a statistical characterization of dinucleotides interdistances across multiple organisms
In this work we chose an interdistance-based approach to study the statistical properties of the different dinucleotides, since we believe there is a relationship between the position they occupy in the genome and the biological function they perform. We therefore studied 18 model organisms belonging to different classes, and the results revealed a clear difference between the CG distributions of mammals and those of the non-CG dinucleotides; in non-mammals, by contrast, the difference was milder and in some cases absent. In particular, the CG distributions of mammals turned out to be well described by a gamma distribution, whereas in non-mammals this behavior was found only in a few cases. We also observed that CG dinucleotides in mammals are fewer in number than non-CG ones, so we developed a null model that attempts to account for this discrepancy by attributing it to random mutations of a single nitrogenous base. The model was applied to the dinucleotides of Homo sapiens, and the results showed that only the AT and TA distributions are similar to that of CG and that the process is irreversible. Finally, plotting the scale parameter of the gamma distribution, obtained from the fit, as a function of the shape parameter made it possible to distinguish the different classes of organisms, both for the initial data and for a larger set; moreover, the results show a linear relationship between the scale parameter and the percentage of CG in the analyzed sequence when plotted on a log-log scale. The study thus confirmed the existence of a relationship between the position occupied by CG dinucleotides in genomes and the biological function they perform.
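As an illustration of what a dinucleotide interdistance is, the sketch below lists the positions of a given dinucleotide along a sequence and returns the gaps between consecutive occurrences. The function name and the toy sequence are assumptions made for this example; the exact distance convention used in the work (for instance, how overlapping occurrences are handled) is not stated in the abstract.

```python
from collections import Counter


def dinucleotide_interdistances(sequence, dinucleotide):
    """Distances between consecutive occurrences of a dinucleotide.

    Positions are 0-based start indices; overlapping occurrences are counted
    (an assumption, relevant only for repeats such as AA in AAAA).
    """
    seq = sequence.upper()
    target = dinucleotide.upper()
    positions = [i for i in range(len(seq) - 1) if seq[i:i + 2] == target]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


toy_sequence = "ACGTTCGACGCG"  # CG starts at positions 1, 5, 8 and 10
distances = dinucleotide_interdistances(toy_sequence, "CG")
print(distances)
print(Counter(distances))  # empirical distribution of the interdistances
```

On a real genome, the histogram of these distances is the empirical distribution to which a model such as the gamma distribution would then be fitted.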
Characterization of DNA sequence properties through network and statistical approaches
In this thesis we will see that the DNA sequence is constantly shaped by the interactions with its environment at multiple levels, showing footprints of DNA methylation, of its 3D organization and, in the case of bacteria, of the interaction with the host organisms.
In the first chapter, we will see that by analyzing the distribution of distances between consecutive dinucleotides of the same type along the sequence, we can detect epigenetic and structural footprints. In particular, we will see that the CG distance distribution makes it possible to distinguish among organisms of different biological complexity, depending on the extent to which CG sites are involved in DNA methylation. Moreover, we will see that CG and TA can be described by the same fitting function, suggesting a relationship between the two. We will also provide an interpretation of the observed trend, simulating a positioning process guided by the presence and absence of memory. Finally, we will focus on the TA distance distribution, characterizing deviations from the trend predicted by the best fitting function, and identifying specific patterns that might be related to peculiar mechanical properties of the DNA and also to epigenetic and structural processes.
In the second chapter, we will see how we can map the 3D structure of the DNA onto its sequence. In particular, we devised a network-based algorithm that produces a genome assembly starting from its 3D configuration, using Hi-C contact maps as input. Specifically, we will see how we can identify the different chromosomes and reconstruct their sequences by exploiting the spectral properties of the Laplacian operator of a network.
In the third chapter, we will see a novel method for source clustering and source attribution, based on a network approach, that makes it possible to identify host-bacteria interactions starting from the detection of Single-Nucleotide Polymorphisms along the sequence of bacterial genomes.
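To give a flavor of how the spectral properties of a network Laplacian (second chapter) can separate groups of strongly interacting loci, the sketch below computes the Fiedler vector, i.e. the eigenvector of the second-smallest Laplacian eigenvalue, of a small weighted contact matrix and splits the bins by its sign. This is a generic textbook technique shown under stated assumptions, not the thesis algorithm: the toy contact matrix, the function names and the use of shifted power iteration are all hypothetical choices made to keep the example dependency-free.

```python
import math


def laplacian(weights):
    """Graph Laplacian L = D - W for a symmetric weight matrix with zero diagonal."""
    n = len(weights)
    return [[sum(weights[i]) if i == j else -weights[i][j] for j in range(n)]
            for i in range(n)]


def fiedler_vector(weights, n_iter=500):
    """Approximate the Fiedler vector by power iteration on (shift*I - L).

    The constant vector (eigenvalue 0 of L) is projected out at every step,
    so the iteration converges to the eigenvector of the smallest nonzero
    eigenvalue of L for a connected graph.
    """
    n = len(weights)
    lap = laplacian(weights)
    # Twice the largest degree bounds the largest eigenvalue (Gershgorin).
    shift = 2.0 * max(lap[i][i] for i in range(n))
    v = [float(i) for i in range(n)]
    for _ in range(n_iter):
        avg = sum(v) / n
        v = [x - avg for x in v]
        v = [shift * v[i] - sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


# Toy "contact map": bins 0-2 and bins 3-5 interact strongly within each
# group, with a single weak contact between bins 2 and 3.
contacts = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
groups = [component > 0 for component in fiedler_vector(contacts)]
print(groups)
```

The sign pattern places bins 0-2 in one group and bins 3-5 in the other; which group is labeled `True` depends on the arbitrary sign of the eigenvector.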
Source Attribution of Human Campylobacteriosis Using Whole-Genome Sequencing Data and Network Analysis
Campylobacter spp. are a leading and increasing cause of gastrointestinal infections worldwide. Source attribution, which apportions human infection cases to different animal species and food reservoirs, has been instrumental in control- and evidence-based intervention efforts. The rapid increase in whole-genome sequencing data provides an opportunity for higher-resolution source attribution models. Important challenges, including the high dimension and complex structure of WGS data, have inspired concerted research efforts to develop new models. We propose network analysis models as an accurate, high-resolution source attribution approach for the sources of human campylobacteriosis. A weighted network analysis approach was used in this study for source attribution comparing different WGS data inputs. The compared model inputs consisted of cgMLST and wgMLST distance matrices from 717 human and 717 animal isolates from cattle, chickens, dogs, ducks, pigs and turkeys. SNP distance matrices from 720 human and 720 animal isolates were also used. The data were collected from 2015 to 2017 in Denmark, with the animal sources consisting of domestic and imports from 7 European countries. Clusters consisted of network nodes representing respective genomes and links representing distances between genomes. Based on the results, animal sources were the main driving factor for cluster formation, followed by type of species and sampling year. The coherence source clustering (CSC) values based on animal sources were [Formula: see text] , [Formula: see text] and [Formula: see text] for cgMLST, wgMLST and SNP, respectively. The CSC values based on Campylobacter species were [Formula: see text] , [Formula: see text] and [Formula: see text] for cgMLST, wgMLST and SNP, respectively. 
Including human isolates in the network resulted in [Formula: see text], [Formula: see text] and [Formula: see text] of the total human isolates being clustered with the different animal sources for cgMLST, wgMLST and SNP, respectively. Between [Formula: see text] and [Formula: see text] of human isolates were not attributed to any animal source. Most of the human genomes were attributed to chickens from Denmark, with an average attribution percentage of [Formula: see text], [Formula: see text] and [Formula: see text] for cgMLST, wgMLST and SNP distance matrices, respectively, while ducks from Denmark showed the least attribution of [Formula: see text] for all three distance matrices. The best-performing model was the one using the wgMLST distance matrix as input data, which had a CSC value of [Formula: see text]. Results from our study show that the weighted network-based approach for source attribution is reliable and can be used as an alternative method for source attribution considering the high performance of the model. The model is also robust across the different Campylobacter species, animal sources and WGS data types used as input.
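The general idea of attributing human isolates through a network built from a genomic distance matrix can be sketched as follows. This is a deliberately simplified stand-in and not the paper's weighted network model: links are kept when the distance falls under a hypothetical threshold, clusters are taken as connected components, and each human isolate is assigned the most common animal source in its cluster. The distances, labels, threshold and function names are all invented for illustration.

```python
from collections import Counter


def find_root(parent, i):
    """Union-find lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_by_threshold(distances, threshold):
    """Connected components of the graph linking isolates closer than threshold."""
    n = len(distances)
    parent = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] <= threshold:
                parent[find_root(parent, i)] = find_root(parent, j)
    return [find_root(parent, i) for i in range(n)]


def attribute_humans(distances, sources, threshold):
    """Map each human isolate index to the majority animal source of its cluster.

    Human isolates whose cluster contains no animal isolate map to None,
    i.e. they remain unattributed.
    """
    clusters = cluster_by_threshold(distances, threshold)
    attribution = {}
    for i, source in enumerate(sources):
        if source != "human":
            continue
        neighbours = Counter(
            sources[j] for j in range(len(sources))
            if clusters[j] == clusters[i] and sources[j] != "human"
        )
        attribution[i] = neighbours.most_common(1)[0][0] if neighbours else None
    return attribution


# Toy data: isolate 3 (human) is close to two chicken isolates,
# isolate 4 (human) is far from everything.
sources = ["chicken", "chicken", "cattle", "human", "human"]
distances = [
    [0, 3, 40, 2, 60],
    [3, 0, 40, 2, 60],
    [40, 40, 0, 50, 60],
    [2, 2, 50, 0, 60],
    [60, 60, 60, 60, 0],
]
attribution = attribute_humans(distances, sources, threshold=10)
print(attribution)
```

In this toy case one human isolate is attributed to chicken and the other stays unattributed, mirroring the two kinds of outcome described in the abstract.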
Comparison of Source Attribution Methodologies for Human Campylobacteriosis
Campylobacter spp. are the most common cause of bacterial gastrointestinal infection in humans both in Denmark and worldwide. Studies have found microbial subtyping to be a powerful tool for source attribution, but comparisons of different methodologies are limited. In this study, we compare three source attribution approaches (machine learning, network analysis, and Bayesian modeling) using three types of whole-genome sequencing (WGS) data inputs (cgMLST, 5-mers and 7-mers). We predicted and compared the sources of human campylobacteriosis cases in Denmark. Using 7-mers as the input feature provided the best model performance. The network analysis algorithm had a CSC value of 78.99% and an F1-score of 67%, while the machine-learning algorithm showed the highest accuracy (98%). The models attributed between 965 and all of the 1224 human cases to a source (network analysis with 5-mers and machine learning with 7-mers, respectively). Chicken from Denmark was the primary source of human campylobacteriosis, with an average attribution probability of 45.8% to 65.4% (Bayesian modeling with 7-mers and machine learning with cgMLST, respectively). Our results indicate that the different source attribution methodologies based on WGS have great potential for the surveillance and source tracking of Campylobacter. The results of such models may help decision makers prioritize and target interventions.
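The 5-mer and 7-mer inputs are counts of all length-k substrings of a genome sequence. A minimal sketch of building such a profile is given below, assuming a plain DNA string as input; the function names, the decision to skip windows containing ambiguous bases, and the normalization to relative frequencies are illustrative assumptions and are not described in the abstract.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_counts(sequence, k):
    """Count every overlapping k-mer, skipping windows with non-ACGT symbols."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= VALID_BASES:
            counts[window] += 1
    return counts


def kmer_profile(sequence, k):
    """Relative k-mer frequencies, usable as a feature vector for a model."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


toy_genome = "ACGTACGTA"
print(kmer_counts(toy_genome, 5))
print(kmer_profile(toy_genome, 5))
```

With k = 5 there are at most 4^5 = 1024 distinct features and with k = 7 at most 4^7 = 16384, so the choice of k trades resolution against dimensionality.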