Computational analysis suggests that virulence of Chromobacterium violaceum might be linked to biofilm formation and poly-NAG biosynthesis
Gene clusters that produce an exopolysaccharide with an N-acetyl-D-glucosamine monomer are present in the genomes of several pathogenic bacteria. Chromobacterium violaceum, an opportunistic pathogen, has the operon hmsHFR-CV2940, whose proteins can synthesize such a polysaccharide. In this work, multiple alignments among proteins from bacteria that synthesize such a polysaccharide were used to verify the existence of amino acids that might be critical for pathogen activity. Three-dimensional models were generated for spatial visualization of these amino acid residues. The analysis showed that the protein HmsR preserves the amino acids D135, D228, Q264 and R267, considered critical for the formation of biofilms, and that these amino acids are close to each other. The protein HmsF of C. violaceum preserves the residues D86, D87, H156 and W115, which are also close to each other in their spatial arrangement. For the proteins HmsH and CV2940 there is evidence of conservation of the residues R104 and W94, respectively. The conservation and favorable spatial location of these critical amino acids in the proteins of the operon indicate that they preserve the same enzymatic function in biofilm synthesis. This suggests that the operon hmsHFR-CV2940 is a possible target in C. violaceum pathogenicity
Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement
BACKGROUND: Biological information is commonly used to cluster or classify entities of interest such as genes, conditions, species or samples. However, different sources of data can be used to classify the same set of entities. Methods that compare the performance of two data sources, or that determine how well a given classification agrees with another, are frequently needed, especially in the absence of a universally accepted "gold standard" classification. RESULTS: Here, we describe a novel measure – the Ranked Adjusted Rand (RAR) index. RAR differs from existing methods by evaluating the extent of agreement between any two groupings while taking into account the intercluster distances. This characteristic is relevant for evaluating pairs of entities that are grouped in the same cluster by one method and separated by another. The latter method may assign them to close neighbour clusters or, on the contrary, to clusters that are far apart from each other. RAR is applicable even when intercluster distance information is absent for both or one of the groupings. In the first case, RAR is equal to its predecessor, the Adjusted Rand (HA) index. Artificially designed clusterings were used to demonstrate situations in which only RAR was able to detect differences in the grouping patterns. A study with larger simulated clusterings confirmed that, in realistic conditions, RAR effectively integrates distance and partition information. The new method was applied to biological examples to compare 1) two microbial typing methods, 2) two gene regulatory network distances and 3) microarray gene expression data with pathway information. In the first application, one of the methods does not provide intercluster distances while the other produced a hierarchical clustering. RAR proved to be more sensitive than HA in the choice of a threshold for defining clusters in the hierarchical method that maximizes agreement between the results of both methods.
CONCLUSION: The major advantage of RAR is that it combines cluster distance and partition information, while the previously available methods used only the latter. RAR should be used in the research problems where HA was previously used, because in the absence of intercluster distance effects it is an equally effective measure, and in the presence of distance effects it is a more complete one
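The abstract above names the Adjusted Rand (HA) index as the predecessor that RAR reduces to when no intercluster distances are available. The sketch below computes only that baseline, using the standard Hubert-Arabie pair-counting formula; it does not implement RAR itself, whose distance weighting is not specified in the abstract. The function name `adjusted_rand` and the handling of degenerate inputs are choices made for this illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand (Hubert-Arabie) index between two flat partitions.

    labels_a and labels_b give the cluster label of each entity under the
    two groupings, in the same entity order.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both groupings must label the same entities")
    n = len(labels_a)
    if n < 2:
        # No pairs to compare; treated as perfect agreement (assumption).
        return 1.0

    # Contingency counts: entities per (cluster in A, cluster in B) cell.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Number of entity pairs placed together under both, A only, B only.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_a = sum(comb(c, 2) for c in a_counts.values())
    together_b = sum(comb(c, 2) for c in b_counts.values())

    expected = together_a * together_b / comb(n, 2)
    maximum = 0.5 * (together_a + together_b)
    if maximum == expected:
        # Both partitions are trivial (all singletons or one cluster each).
        return 1.0
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Same grouping with relabelled clusters: agreement is perfect.
    print(adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0]))
    # Every pair joined by one grouping is split by the other.
    print(adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index counts pairs only as "together" or "apart", it cannot tell whether a split pair landed in neighbouring or distant clusters, which is the gap the abstract says RAR addresses.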
Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes
BACKGROUND: A cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see if the genes in the same clusters can be functionally correlated. While past successes of such analyses have been reported in a number of microarray studies (most of which used the standard hierarchical clustering, UPGMA, with one minus the Pearson's correlation coefficient as a measure of dissimilarity), such groupings can often be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing databases of gene ontologies (GO) for the annotated genes of the relevant species. RESULTS: In this paper, we introduce two performance measures for evaluating the results of a clustering algorithm in its ability to produce biologically meaningful clusters. The first measure is a biological homogeneity index (BHI). As the name suggests, it is a measure of how biologically homogeneous the clusters are. This can be used to quantify the performance of a given clustering algorithm such as UPGMA in grouping genes for a particular data set and also for comparing the performance of a number of competing clustering algorithms applied to the same data set. The second performance measure is called a biological stability index (BSI).
For a given clustering algorithm and an expression data set, it measures the consistency of the clustering algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. A good clustering algorithm should have high BHI and moderate to high BSI. We evaluated the performance of ten well-known clustering algorithms on two gene expression data sets and identified the optimal algorithm in each case. The first data set deals with SAGE profiles of differentially expressed tags between normal and ductal carcinoma in situ samples of breast cancer patients. The second data set contains the expression profiles over time of positively expressed genes (ORFs) during sporulation of budding yeast. Two separate choices of the functional classes were used for this data set and the results were compared for consistency. CONCLUSION: Functional information on annotated genes, available from various GO databases and mined using ontology tools, can be used to systematically judge the results of an unsupervised clustering algorithm applied to a gene expression data set. This information could be used to select the right algorithm from a class of clustering algorithms for the given data set
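The abstract describes the biological homogeneity index only in words, so the following is a minimal sketch under an assumed formalization: for each cluster, take the fraction of annotated gene pairs that share at least one functional class, then average over clusters. The function name, the input layout (a list of gene lists plus a gene-to-classes mapping), and the decision to skip clusters with fewer than two annotated genes are assumptions for illustration, not details taken from the paper.

```python
from itertools import combinations


def biological_homogeneity(clusters, annotations):
    """Average within-cluster fraction of gene pairs sharing a functional class.

    clusters:    list of clusters, each a list of gene identifiers.
    annotations: dict mapping a gene identifier to a set of functional classes;
                 genes that are missing or have an empty set count as unannotated.
    """
    per_cluster = []
    for genes in clusters:
        # Only annotated genes can be judged against the reference classes.
        annotated = [g for g in genes if annotations.get(g)]
        pairs = list(combinations(annotated, 2))
        if not pairs:
            # Assumption: clusters with fewer than two annotated genes are skipped.
            continue
        shared = sum(1 for x, y in pairs if annotations[x] & annotations[y])
        per_cluster.append(shared / len(pairs))
    return sum(per_cluster) / len(per_cluster) if per_cluster else 0.0


if __name__ == "__main__":
    # Hypothetical genes and classes, used only to exercise the function.
    clusters = [["g1", "g2", "g3"], ["g4", "g5"]]
    annotations = {
        "g1": {"A"}, "g2": {"A"}, "g3": {"B"},
        "g4": {"C"}, "g5": {"C"},
    }
    print(biological_homogeneity(clusters, annotations))
```

Running the same function on the output of several clustering algorithms for one data set gives the kind of side-by-side comparison the abstract describes, with higher values indicating more functionally coherent clusters.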
On the origin of spaces: morphometric foundations of urban form evolution
The modern discipline of urban morphology gives us a ground for the comparative analysis of cities, which increasingly includes specific quantitative elements. In this paper, we make a further step forward towards the definition of a general method for the classification of urban form. We draw from morphometrics and taxonomy in life sciences to propose such a method, which we name ‘urban morphometrics’. We then test it on a unit of the urban landscape named ‘Sanctuary Area’ (SA), explored in 45 cities whose origins span four historic time periods: Historic (medieval), Industrial (19th century), New Towns (post-WWII, high-rise) and Sprawl (post-WWII, low-rise). We describe each SA through 207 physical dimensions and then use these to discover features that discriminate them among the four temporal groups. Nine dimensions emerge as sufficient to correctly classify 90% of the urban settings by their historic origins. These nine attributes largely identify an area's ‘visible identity’ as reflected by three characteristics: (1) block perimeterness, or the way buildings define the street-edge; (2) building coverage, or the way buildings cover the land and (3) regular plot coverage, or the extent to which blocks are made of plots that have main access from a street. Hierarchical cluster analysis utilising only the nine key variables nearly perfectly clusters each SA according to its historic origin; moreover, the resulting dendrogram shows, just after WWII, the first ‘bifurcation’ of urban history, with the emergence of the modern city as a new ‘species’ of urban form. With ‘urban morphometrics’ we hope to extend urban morphological research and contribute to understanding the way cities evolve
Genomic and SNP Analyses Demonstrate a Distant Separation of the Hospital and Community-Associated Clades of Enterococcus faecium
Recent studies have pointed to the existence of two subpopulations of Enterococcus faecium, one containing primarily commensal/community-associated (CA) strains and one that contains most clinical or hospital-associated (HA) strains, including those classified by multi-locus sequence typing (MLST) as belonging to the CC17 group. The HA subpopulation more frequently has IS16, pathogenicity island(s), and plasmids or genes associated with antibiotic resistance, colonization, and/or virulence. Supporting the two clades concept, we previously found a 3–10% difference between four genes from HA-clade strains vs. CA-clade strains, including 5% difference between pbp5-R of ampicillin-resistant, HA strains and pbp5-S of ampicillin-sensitive, CA strains. To further investigate the core genome of these subpopulations, we studied 100 genes from 21 E. faecium genome sequences; our analyses of concatenated sequences, SNPs, and individual genes all identified two distinct groups. With the concatenated sequence, HA-clade strains differed by 0–1% from one another while CA clade strains differed from each other by 0–1.1%, with 3.5–4.2% difference between the two clades. While many strains had a few genes that grouped in one clade with most of their genes in the other clade, one strain had 28% of its genes in the CA clade and 72% in the HA clade, consistent with the predicted role of recombination in the evolution of E. faecium. Using estimates for Escherichia coli, molecular clock calculations using sSNP analysis indicate that these two clades may have diverged ≥1 million years ago or, using the higher mutation rate for Bacillus anthracis, ∼300,000 years ago. These data confirm the existence of two clades of E. faecium and show that the differences between the HA and CA clades occur at the core genomic level and long preceded the modern antibiotic era
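The molecular clock estimate mentioned above can be illustrated with the textbook relation T = d / (2r), where d is the per-site synonymous divergence between two lineages and r is the substitution rate per site per year; the factor of 2 reflects that both lineages accumulate changes after the split. This is a generic sketch, not the authors' calculation: the function name and the input values are hypothetical and are not the divergence or rate figures used in the study.

```python
def divergence_time_years(divergence_per_site, rate_per_site_per_year):
    """Estimate time since two lineages split, assuming a strict molecular clock.

    divergence_per_site:    synonymous substitutions per synonymous site
                            separating the two lineages.
    rate_per_site_per_year: substitution rate along a single lineage.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    if divergence_per_site < 0:
        raise ValueError("divergence cannot be negative")
    # Both lineages diverge from the ancestor, so the pairwise distance
    # accumulates at twice the single-lineage rate.
    return divergence_per_site / (2.0 * rate_per_site_per_year)


if __name__ == "__main__":
    # Hypothetical inputs chosen for round numbers, not taken from the paper.
    slow = divergence_time_years(0.02, 1e-8)
    fast = divergence_time_years(0.02, 4e-8)
    print(f"slower clock: {slow:,.0f} years; faster clock: {fast:,.0f} years")
```

The example shows why the choice of reference organism matters: for a fixed divergence, a higher assumed rate shortens the estimated time in inverse proportion.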
Complex Evolutionary History of the Aeromonas veronii Group Revealed by Host Interaction and DNA Sequence Data
Aeromonas veronii biovar sobria, Aeromonas veronii biovar veronii, and Aeromonas allosaccharophila are a closely related group of organisms, the Aeromonas veronii Group, that inhabit a wide range of host animals as a symbiont or pathogen. In this study, the ability of various strains to colonize the medicinal leech as a model for beneficial symbiosis and to kill wax worm larvae as a model for virulence was determined. Isolates cultured from the leech out-competed other strains in the leech model, while most strains were virulent in the wax worms. Three housekeeping genes, recA, dnaJ and gyrB, the gene encoding chitinase, chiA, and four loci associated with the type three secretion system, ascV, ascFG, aexT, and aexU were sequenced. The phylogenetic reconstruction failed to produce one consensus tree that was compatible with most of the individual genes. The Approximately Unbiased test and the Genetic Algorithm for Recombination Detection both provided further support for differing evolutionary histories among this group of genes. Two contrasting tests detected recombination within aexU, ascFG, ascV, dnaJ, and gyrB but not in aexT or chiA. Quartet decomposition analysis indicated a complex recent evolutionary history for these strains with a high frequency of horizontal gene transfer between several but not among all strains. In this study we demonstrate that at least for some strains, horizontal gene transfer occurs at a sufficient frequency to blur the signal from vertically inherited genes, despite strains being adapted to distinct niches. Simply increasing the number of genes included in the analysis is unlikely to overcome this challenge in organisms that occupy multiple niches and can exchange DNA between strains specialized to different niches. Instead, the detection of genes critical in the adaptation to specific niches may help to reveal the physiological specialization of these strains
Shifts in Species Composition Constrain Restoration of Overgrazed Grassland Using Nitrogen Fertilization in Inner Mongolian Steppe, China
Long-term livestock overgrazing causes nitrogen outputs to exceed inputs in Inner Mongolia, suggesting that low levels of nitrogen fertilization could help restore grasslands degraded by overgrazing. However, the effectiveness of such an approach depends on the response of production and species composition to the interactive drivers of nitrogen and water availability. We conducted a five-year experiment manipulating precipitation (NP: natural precipitation and SWP: simulated wet year precipitation) and nitrogen (0, 25 and 50 kg N ha⁻¹ yr⁻¹) addition in Inner Mongolia. We hypothesized that nitrogen fertilization would increase forage production when water availability was relatively high. However, the extent to which nitrogen would co-limit production under average or below average rainfall in these grasslands was unknown
- …