
    Semantically linking and browsing PubMed abstracts with gene ontology

    BACKGROUND: The technological advances of the past decade have led to massive progress in the field of biotechnology. This progress is documented in the form of research articles. PubMed is currently the most widely used repository for bio-literature. It consisted of about 17 million abstracts as of 2007, which calls for methods to efficiently retrieve and browse large volumes of relevant information. State-of-the-art technologies such as GOPubmed use simple keyword-based techniques for retrieving abstracts from PubMed and linking them to the Gene Ontology (GO). This paper changes the paradigm by introducing a semantics-enabled technique, called SEGOPubmed, that links PubMed to the Gene Ontology for ontology-based browsing. A Latent Semantic Analysis (LSA) framework is used to semantically interface PubMed abstracts to the Gene Ontology. RESULTS: An empirical analysis is performed to compare the performance of SEGOPubmed with that of GOPubmed. The analysis is initially performed using a few well-referenced query words. Further, a statistical analysis is performed using a GO-curated dataset as ground truth. The analysis suggests that SEGOPubmed performs better than the classic GOPubmed because it incorporates semantics. CONCLUSIONS: The LSA technique is applied to the PubMed abstracts obtained based on the user query and the semantic similarity between the query and the abstracts. The analyses using well-referenced keywords show that the proposed semantic-sensitive technique outperformed the string-comparison-based techniques in associating the relevant abstracts with the GO terms. SEGOPubmed also extracted abstracts in which the keywords do not appear in isolation (i.e. they appear in combination with other terms) and which could not be retrieved by simple term-matching techniques.

    REDHORSE-REcombination and Double crossover detection in Haploid Organisms using next-geneRation SEquencing data

    BACKGROUND: Next-generation sequencing technology provides a means to study genetic exchange at a higher resolution than was possible using earlier technologies. However, this improvement presents challenges, as the alignments of next-generation sequence data to a reference genome cannot be directly used as input to existing detection algorithms, which instead typically use multiple sequence alignments as input. We therefore designed a software suite called REDHORSE that uses genomic alignments, extracts genetic markers, and generates multiple sequence alignments that can be used as input to existing recombination detection algorithms. In addition, REDHORSE implements a custom recombination detection algorithm that makes use of sequence information and genomic positions to accurately detect crossovers. REDHORSE is a portable and platform-independent suite that provides efficient analysis of genetic crosses based on next-generation sequencing data. RESULTS: We demonstrated the utility of REDHORSE using simulated data and real next-generation sequencing (NGS) data. The simulated dataset mimicked recombination between two known haploid parental strains and allowed comparison of detected break points against known true break points to assess the performance of recombination detection algorithms. A newly generated NGS dataset from a genetic cross of Toxoplasma gondii allowed us to demonstrate our pipeline. REDHORSE successfully extracted the relevant genetic markers and was able to transform the alignments of NGS reads to the genome into multiple sequence alignments. The recombination detection algorithm in REDHORSE was able to detect conventional crossovers and double crossovers typically associated with gene conversions, whilst filtering out artifacts that might have been introduced during sequencing or alignment. REDHORSE outperformed other commonly used recombination detection algorithms in finding conventional crossovers. In addition, REDHORSE was the only algorithm that was able to detect double crossovers. CONCLUSION: REDHORSE is an efficient analytical pipeline that serves as a bridge between genomic alignments and existing recombination detection algorithms. Moreover, REDHORSE is equipped with a recombination detection algorithm specifically designed for next-generation sequencing data. REDHORSE is a portable, platform-independent, Java-based utility that provides efficient analysis of genetic crosses based on next-generation sequencing data. REDHORSE is available at http://redhorse.sourceforge.net/. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-015-1309-7) contains supplementary material, which is available to authorized users.
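
The abstract does not describe how REDHORSE's detection algorithm works internally, so the following is only a minimal sketch of the general idea of separating conventional crossovers from short double crossovers along a haploid chromosome. It assumes each marker has already been assigned to one of two parents; the function name `find_crossovers` and the length threshold `max_dco_span` are hypothetical and are not taken from REDHORSE, and no artifact filtering is modeled.

```python
def find_crossovers(genotypes, positions, max_dco_span):
    """Illustrative sketch only; not the REDHORSE algorithm.

    genotypes: parental call per marker, e.g. "AAABBAAA" (two parents).
    positions: genomic coordinate of each marker, in ascending order.
    max_dco_span: hypothetical cutoff; an interior block no longer than
        this is reported as a double crossover.
    Returns (single_crossovers, double_crossovers) as lists of
    (left_position, right_position) tuples.
    """
    if len(genotypes) != len(positions):
        raise ValueError("one position is required per marker")

    # Collapse the marker string into runs of identical parental calls.
    runs = []  # each run is [parent, first_index, last_index]
    for idx, call in enumerate(genotypes):
        if runs and runs[-1][0] == call:
            runs[-1][2] = idx
        else:
            runs.append([call, idx, idx])

    singles, doubles = [], []
    i = 1
    while i < len(runs):
        _, first, last = runs[i]
        is_interior = i + 1 < len(runs)  # flanked on both sides
        span = positions[last] - positions[first]
        if is_interior and span <= max_dco_span:
            # Short block embedded in the other parent's sequence.
            doubles.append((positions[first], positions[last]))
            i += 2  # the switch back belongs to the same event
        else:
            # Ordinary switch between the two flanking markers.
            singles.append((positions[first - 1], positions[first]))
            i += 1
    return singles, doubles


markers = "AAAABBAAAABBBB"
coords = [100 * i for i in range(len(markers))]
singles, doubles = find_crossovers(markers, coords, max_dco_span=200)
print(singles, doubles)
```

In this toy input the two-marker `BB` block is reported as a double crossover, while the final switch from `A` to `B` is reported as a conventional crossover. A real pipeline would also need to guard against miscalled markers, which this sketch ignores.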

    A unified framework for finding differentially expressed genes from microarray experiments

    BACKGROUND: This paper presents a unified framework for finding differentially expressed genes (DEGs) from microarray data. The proposed framework has three interrelated modules: (i) gene ranking, (ii) significance analysis of genes and (iii) validation. The first module uses two gene selection algorithms, namely (a) two-way clustering and (b) combined adaptive ranking, to rank the genes. The second module converts the gene ranks into p-values using an R-test and fuses the two sets of p-values using Fisher's omnibus criterion. The DEGs are selected using FDR analysis. The third module performs a three-fold validation of the obtained DEGs. The robustness of the proposed unified framework in gene selection is first illustrated using false discovery rate analysis. In addition, the clustering-based validation of the DEGs is performed by employing an adaptive subspace-based clustering algorithm on the training and the test datasets. Finally, a projection-based visualization is performed to validate the DEGs obtained using the unified framework. RESULTS: The performance of the unified framework is compared with well-known ranking algorithms such as t-statistics, Significance Analysis of Microarrays (SAM), Adaptive Ranking, Combined Adaptive Ranking and Two-way Clustering. The performance curves obtained using 50 simulated microarray datasets, each following two different distributions, indicate the superiority of the unified framework over the other reported algorithms. Further analyses on 3 real cancer datasets and 3 Parkinson's datasets show a similar improvement in performance. First, a three-fold validation process is provided for the two-sample cancer datasets. In addition, the analysis on 3 sets of Parkinson's data is performed to demonstrate the scalability of the proposed method to multi-sample microarray datasets. CONCLUSION: This paper presents a unified framework for the robust selection of genes from two-sample as well as multi-sample microarray experiments. The two different ranking methods used in module 1 bring diversity to the selection of genes. The conversion of ranks to p-values, the fusion of p-values and the FDR analysis aid in the identification of significant genes, which cannot be judged based on gene ranking alone. The three-fold validation, namely robustness in the selection of genes using FDR analysis, clustering, and visualization, demonstrates the relevance of the DEGs. Empirical analyses on 50 artificial datasets and 6 real microarray datasets illustrate the efficacy of the proposed approach. The analyses on 3 cancer datasets demonstrate the utility of the proposed approach on microarray datasets with two classes of samples. The scalability of the proposed unified approach to multi-sample (more than two sample classes) microarray datasets is addressed using three sets of Parkinson's data. Empirical analyses show that the unified framework outperformed other gene selection methods in selecting differentially expressed genes from microarray data.
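
The fusion step mentioned in this abstract, Fisher's omnibus criterion, can be illustrated with a short sketch. Fisher's method combines k independent p-values through the statistic X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The code below is a generic illustration of that textbook formula, not the authors' implementation; the function name `fisher_combine` and the example p-values are hypothetical, and the method's independence assumption may not hold exactly for two rankings of the same gene.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (sketch).

    Statistic: X = -2 * sum(ln p_i), chi-square with 2k degrees of
    freedom under the null, where k = len(p_values).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2

    # For an even number of degrees of freedom (2k), the chi-square
    # survival function has the closed form
    #   exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one gene from two ranking methods.
combined = fisher_combine([0.05, 0.05])
print(round(combined, 4))  # prints 0.0175
```

Two individually borderline p-values of 0.05 fuse to a smaller combined value, which is why a gene supported by both rankings can pass a threshold that neither ranking reaches alone. The fused p-values would then be passed to a separate FDR procedure, which is not shown here.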

    NextGen sequencing reveals short double crossovers contribute disproportionately to genetic diversity in Toxoplasma gondii

    BACKGROUND: Toxoplasma gondii is a widespread protozoan parasite of animals that causes zoonotic disease in humans. Three clonal variants predominate in North America and Europe, while South American strains are genetically diverse, and undergo more frequent recombination. All three northern clonal variants share a monomorphic version of chromosome Ia (ChrIa), which is also found in unrelated, but successful southern lineages. Although this pattern could reflect a selective advantage, it might also arise from non-Mendelian segregation during meiosis. To understand the inheritance of ChrIa, we performed a genetic cross between the northern clonal type 2 ME49 strain and a divergent southern type 10 strain called VAND, which harbors a divergent ChrIa. RESULTS: NextGen sequencing of haploid F1 progeny was used to generate a genetic map revealing a low level of conventional recombination, with an unexpectedly high frequency of short, double crossovers. Notably, both the monomorphic and divergent versions of ChrIa were isolated with equal frequency. As well, ChrIa showed no evidence of being a sex chromosome, of harboring an inversion, or distorting patterns of segregation. Although VAND was unable to self fertilize in the cat, it underwent successful out-crossing with ME49 and hybrid survival was strongly associated with inheritance of ChrIII from ME49 and ChrIb from VAND. CONCLUSIONS: Our findings suggest that the successful spread of the monomorphic ChrIa in the wild has not been driven by meiotic drive or related processes, but rather is due to a fitness advantage. As well, the high frequency of short double crossovers is expected to greatly increase genetic diversity among progeny from genetic crosses, thereby providing an unexpected and likely important source of diversity. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2164-15-1168) contains supplementary material, which is available to authorized users

    The mating competence of geographically diverse Leishmania major strains in their natural and unnatural sand fly vectors

    Invertebrate stages of Leishmania are capable of genetic exchange during their extracellular growth and development in the sand fly vector. Here we explore two variables: the ability of diverse L. major strains from across its natural range to undergo mating in pairwise tests; and the timing of the appearance of hybrids and their developmental stage associations within both natural (Phlebotomus duboscqi) and unnatural (Lutzomyia longipalpis) sand fly vectors. Following co-infection of flies with parental lines bearing independent drug markers, doubly-drug resistant hybrid progeny were selected, from which 96 clonal lines were analyzed for DNA content and genotyped for parent alleles at 4-6 unlinked nuclear loci as well as the maxicircle DNA. As seen previously, the majority of hybrids showed '2n' DNA contents, but with a significant number of '3n' and one '4n' offspring. In the natural vector, 97% of the nuclear loci showed both parental alleles; however, 3% (4/150) showed only one parental allele. In the unnatural vector, the frequency of uniparental inheritance rose to 10% (27/275). We attribute this to loss of heterozygosity after mating, most likely arising from aneuploidy which is both common and temporally variable in Leishmania. As seen previously, only uniparental inheritance of maxicircle kDNA was observed. Hybrids were recovered at similar efficiencies in all pairwise crosses tested, suggesting that L. major lacks detectable 'mating types' that limit free genetic exchange. In the natural vector, comparisons of the timing of hybrid formation with the presence of developmental stages suggest nectomonads as the most likely sexually competent stage, with hybrids emerging well before the first appearance of metacyclic promastigotes. These studies provide an important perspective on the prevalence of genetic exchange in natural populations of L. major and a guide for experimental studies to understand the biology of mating

    Global selective sweep of a highly inbred genome of the cattle parasite Neospora caninum

    Neospora caninum, a cyst-forming apicomplexan parasite, is a leading cause of neuromuscular diseases in dogs as well as fetal abortion in cattle worldwide. The importance of the domestic and sylvatic life cycles of Neospora, and the role of vertical transmission in the expansion and transmission of infection in cattle, is not sufficiently understood. To elucidate the population genomics of Neospora, we genotyped 50 isolates collected worldwide from a wide range of hosts using 19 linked and unlinked genetic markers. Phylogenetic analysis and genetic distance indices resolved a single genotype of N. caninum. Whole-genome sequencing of 7 isolates from 2 different continents identified high linkage disequilibrium, significant structural variation, but only limited polymorphism genome-wide, with only 5,766 biallelic single nucleotide polymorphisms (SNPs) total. Greater than half of these SNPs (∼3,000) clustered into 6 distinct haploblocks and each block possessed limited allelic diversity (with only 4 to 6 haplotypes resolved at each cluster). Importantly, the alleles at each haploblock had independently segregated across the strains sequenced, supporting a unisexual expansion model that is mosaic at 6 genomic blocks. Integrating seroprevalence data from African cattle, our data support a global selective sweep of a highly inbred livestock pathogen that originated within European dairy stock and expanded transcontinentally via unisexual mating and vertical transmission very recently, likely the result of human activities, including recurrent migration, domestication, and breed development of bovid and canid hosts within similar proximities

    A progressive framework for two-way clustering using adaptive subspace iteration for functionally classifying genes

    This paper presents an adaptive subspace-based two-way clustering of microarray data. To analyze the data at various scales, a progressive framework is introduced. The goals are to functionally classify genes and also to find differentially expressed genes in microarray expression profiles. Empirical analysis on the Colon Cancer dataset shows that adaptive subspace iteration (ASI) performs favorably in grouping genes with similar functions and finding genes that may have been involved in the formation of colon cancer. It was also observed that the proposed algorithm is robust against the ordering of samples and yields results consistent with ground truth information. © 2006 IEEE

    Performance evaluation of subspace-based algorithm in selecting differentially expressed genes and classification of tissue types from microarray data

    This paper presents the implementation and evaluation of a subspace-based clustering algorithm for the robust selection of differentially expressed genes as well as the classification of tissue types from microarray data. The performance of the proposed algorithm is compared against other well-known clustering algorithms, and the quality of the clusters is evaluated using a number of cluster validation indices. Empirical analyses on a number of synthetic and real microarray datasets suggest that the proposed subspace-based algorithm is robust in selecting differentially expressed genes and performs significantly better than popular clustering algorithms in selecting differentially expressed genes and classifying different tissue types. © 2006 IEEE