Maximal Extraction of Biological Information from Genetic Interaction Data
Targeted genetic perturbation is a powerful tool for inferring gene function in model organisms. Functional relationships between genes can be inferred by observing the effects of multiple genetic perturbations in a single strain. The study of these relationships, generally referred to as genetic interactions, is a classic technique for ordering genes in pathways, thereby revealing genetic organization and gene-to-gene information flow. Genetic interaction screens are now being carried out in high-throughput experiments involving tens or hundreds of genes. These data sets have the potential to reveal genetic organization on a large scale, and require computational techniques that best reveal this organization. In this paper, we use a complexity metric based on information theory to determine the maximally informative network given a set of genetic interaction data. We find that networks with high complexity scores yield the most biological information in terms of (i) specific associations between genes and biological functions, and (ii) mapping modules of co-functional genes. This information-based approach is an automated, unsupervised classification of the biological rules underlying observed genetic interactions. It might have particular potential in genetic studies in which interactions are complex and prior gene annotation data are sparse.
Coding limits on the number of transcription factors
Transcription factor proteins bind specific DNA sequences to control the
expression of genes. They contain DNA binding domains which belong to several
super-families, each with a specific mechanism of DNA binding. The total number
of transcription factors encoded in a genome increases with the number of genes
in the genome. Here, we examined the number of transcription factors from each
super-family in diverse organisms.
We find that the number of transcription factors from most super-families
appears to be bounded. For example, the number of winged helix factors does not
generally exceed 300, even in very large genomes. The magnitude of the maximal
number of transcription factors from each super-family seems to correlate with
the number of DNA bases effectively recognized by the binding mechanism of that
super-family. Coding theory predicts that such upper bounds on the number of
transcription factors should exist, in order to minimize cross-binding errors
between transcription factors. This theory further predicts that factors with
similar binding sequences should tend to have similar biological effect, so
that errors based on mis-recognition are minimal. We present evidence that
transcription factors with similar binding sequences tend to regulate genes
with similar biological functions, supporting this prediction.
The present study suggests limits on the transcription factor repertoire of
cells, and suggests coding constraints that might apply more generally to the
mapping between binding sites and biological function.
Comment: http://www.weizmann.ac.il/complex/tlusty/papers/BMCGenomics2006.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1590034/
http://www.biomedcentral.com/1471-2164/7/23
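The coding argument in the abstract above can be illustrated with a classical result from coding theory. The sketch below is not the paper's model; it simply evaluates the Hamming (sphere-packing) bound for words over the four-letter DNA alphabet. The parameters `n` (number of effectively recognized bases) and `d` (minimum number of mismatches required between any two binding sequences) are hypothetical names introduced here for illustration.

```python
from math import comb


def hamming_bound(n, d, q=4):
    """Upper bound on the number of length-n words over a q-letter alphabet
    whose pairwise Hamming distance is at least d (sphere-packing bound).

    Illustrative assumption: each binding mechanism reads n bases, and two
    factors are confusable if their sequences differ in fewer than d bases.
    """
    t = (d - 1) // 2  # number of mismatches that can be tolerated
    # Number of words within Hamming distance t of a given word.
    ball = sum(comb(n, k) * (q - 1) ** k for k in range(t + 1))
    return q ** n // ball


# More recognized bases allow a larger repertoire of distinguishable sequences.
for n in (4, 6, 8):
    print(n, hamming_bound(n, d=3))
```

Under these assumptions the bound grows with `n`, which is consistent in spirit with the reported correlation between the maximal family size and the number of bases recognized; the actual numbers should not be read as predictions for any super-family.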
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Chi-square-based scoring function for categorization of MEDLINE citations
Objectives: Text categorization has been used in biomedical informatics for
identifying documents containing relevant topics of interest. We developed a
simple method that uses a chi-square-based scoring function to determine the
likelihood of MEDLINE citations containing genetically relevant topics. Methods: Our
procedure requires construction of a genetic and a nongenetic domain document
corpus. We used MeSH descriptors assigned to MEDLINE citations for this
categorization task. We compared frequencies of MeSH descriptors between the two
corpora using the chi-square test. A MeSH descriptor was considered to be a
positive indicator if its relative observed frequency in the genetic domain
corpus was greater than its relative observed frequency in the nongenetic
domain corpus. The output of the proposed method is a list of scores for all
the citations, with the highest score given to those citations containing MeSH
descriptors typical for the genetic domain. Results: Validation was done on a
set of 734 manually annotated MEDLINE citations. It achieved predictive
accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method
by comparing it to three machine learning algorithms (support vector machines,
decision trees, naïve Bayes). Although the differences were not statistically
significant, the results showed that our chi-square scoring performs as
well as the compared machine learning algorithms. Conclusions: We suggest that the
chi-square scoring is an effective solution to help categorize MEDLINE
citations. The algorithm is implemented in the BITOLA literature-based
discovery support system as a preprocessor for the gene symbol disambiguation
process.
Comment: 34 pages, 2 figures
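The scoring procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the chi-square statistic is the plain Pearson form on a 2x2 table, and summing signed chi-square values over a citation's descriptors is an assumption, since the abstract does not say how per-descriptor values are combined into a citation score.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Signed chi-square weight per MeSH descriptor.

    Each corpus is a list of descriptor sets, one set per citation.
    The sign is positive when the descriptor is relatively more frequent
    in the genetic corpus, as described in the abstract.
    """
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    if n_gen == 0 or n_non == 0:
        raise ValueError("both corpora must contain at least one citation")
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a, c = gen_counts[desc], non_counts[desc]
        chi2 = chi_square_2x2(a, n_gen - a, c, n_non - c)
        weights[desc] = chi2 if a / n_gen > c / n_non else -chi2
    return weights


def score_citation(descriptors, weights):
    """Assumed aggregation: sum of signed weights; unseen descriptors count as 0."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))


# Toy corpora, invented purely for illustration.
genetic = [{"Genes", "Mutation"}, {"Genes", "Humans"}, {"Mutation", "Humans"}]
nongenetic = [{"Humans", "Surgery"}, {"Surgery"}, {"Humans", "Nursing"}]
weights = descriptor_weights(genetic, nongenetic)
print(score_citation({"Genes", "Mutation"}, weights))
print(score_citation({"Surgery", "Humans"}, weights))
```

Citations would then be ranked by score, with a threshold chosen on annotated data to trade recall against precision.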
Methods for protein complex prediction and their contributions towards understanding the organization, function and dynamics of complexes
Complexes of physically interacting proteins constitute fundamental
functional units responsible for driving biological processes within cells. A
faithful reconstruction of the entire set of complexes is therefore essential
to understand the functional organization of cells. In this review, we discuss
the key contributions of computational methods developed to date
(approximately between 2003 and 2015) for identifying complexes from the
network of interacting proteins (PPI network). We evaluate in depth the
performance of these methods on PPI datasets from yeast, and highlight
challenges faced by these methods, in particular detection of sparse and small
or sub-complexes and discerning overlapping complexes. We describe methods
for integrating diverse information including expression profiles and 3D
structures of proteins with PPI networks to understand the dynamics of complex
formation, for instance, of time-based assembly of complex subunits and
formation of fuzzy complexes from intrinsically disordered proteins. Finally,
we discuss methods for identifying dysfunctional complexes in human diseases,
an application that is proving invaluable to understand disease mechanisms and
to discover novel therapeutic targets. We hope this review aptly commemorates a
decade of research on computational prediction of complexes and constitutes a
valuable reference for further advancements in this exciting area.
Comment: 1 Table
Virus isolation studies suggest short-term variations in abundance in natural cyanophage populations of the Indian Ocean
Cyanophage abundance has been shown to fluctuate over long timescales and with depth, but little is known about how it varies over short timescales. Previous short-term studies have relied on counting total virus numbers and therefore could not distinguish the phages that infect cyanobacteria from the total count.
In this study, an isolation-based approach was used to determine cyanophage abundance from water samples collected over a depth profile for a 24 h period from the Indian Ocean. Samples were used to infect Synechococcus sp. WH7803 and the number of plaque forming units (pfu) at each time point and depth was counted. At 10 m phage numbers were similar for most time points, but there was a distinct peak in abundance at 0100 hours. Phage numbers were lower at 25 m and 50 m and did not show such strong temporal variation. No phages were found below this depth. Therefore, we conclude that only the abundance of phages in surface waters showed a clear temporal pattern over a short timescale. Fifty phages from a range of depths and time points were isolated and purified. The molecular diversity of these phages was estimated using a section of the phage-encoded psbD gene and the results from a phylogenetic analysis do not suggest that phages from the deeper waters form a distinct subgroup.
Combinatorial CRISPR-Cas9 screens for de novo mapping of genetic interactions.
We developed a systematic approach to map human genetic networks by combinatorial CRISPR-Cas9 perturbations coupled to robust analysis of growth kinetics. We targeted all pairs of 73 cancer genes with dual guide RNAs in three cell lines, comprising 141,912 tests of interaction. Numerous therapeutically relevant interactions were identified, and these patterns replicated with combinatorial drugs at 75% precision. From these results, we anticipate that cellular context will be critical to synthetic-lethal therapies.
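As a quick arithmetic check on the figures quoted in this abstract, the snippet below counts the unordered pairs of 73 genes and divides the reported number of tests by that count. It assumes that "all pairs" means unordered pairs of distinct genes; the abstract does not state how the tests per pair are distributed over guide combinations, replicates, and cell lines, so no such breakdown is attempted.

```python
from math import comb

n_genes = 73
n_cell_lines = 3
total_tests = 141_912

# Unordered pairs of distinct genes (assumption about what "all pairs" means).
gene_pairs = comb(n_genes, 2)

# Tests per gene pair overall, and per cell line if split evenly.
tests_per_pair, remainder = divmod(total_tests, gene_pairs)
tests_per_pair_per_line = tests_per_pair // n_cell_lines

print(gene_pairs)               # 2628
print(tests_per_pair)           # 54
print(remainder)                # 0
print(tests_per_pair_per_line)  # 18
```

The reported total divides evenly, which is consistent with a fixed number of tests for every gene pair.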