Detecting disease-associated genotype patterns
Background: In addition to single-locus (main) effects of disease variants, there is a growing consensus that gene-gene and gene-environment interactions may play important roles in disease etiology. However, for the very large numbers of genetic markers currently in use, it has proven difficult to develop suitable and efficient approaches for detecting effects other than main effects due to single variants. Results: We developed a method for jointly detecting disease-causing single-locus effects and gene-gene interactions. Our method is based on finding differences in genotype pattern frequencies between case and control individuals. The single-nucleotide polymorphism markers with the largest single-locus association test statistics are included in a pattern. For a logistic regression model comprising three disease variants exerting main and epistatic interaction effects, we demonstrate that our method is vastly superior to the traditional approach of looking for single-locus effects. In addition, our method is suitable for estimating the number of disease variants in a dataset. We successfully apply our approach to data on Parkinson Disease and heroin addiction. Conclusion: Our approach is suitable and powerful for detecting disease susceptibility variants with potentially small main effects and strong interaction effects. It can be applied to large numbers of genetic markers.
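The idea of comparing genotype pattern frequencies between cases and controls can be sketched as follows. This is only an illustration, not the authors' implementation: the function names, the 0/1/2 genotype coding, and the toy data are assumptions, and the sketch reports raw frequency differences without any significance testing.

```python
from collections import Counter

def pattern_frequencies(genotypes, snp_indices):
    """Relative frequency of each multi-locus genotype pattern.

    genotypes: list of per-individual genotype tuples coded 0/1/2 (assumed coding)
    snp_indices: positions of the SNPs that make up the pattern
    """
    counts = Counter(tuple(g[i] for i in snp_indices) for g in genotypes)
    total = len(genotypes)
    return {pattern: n / total for pattern, n in counts.items()}

def pattern_frequency_differences(cases, controls, snp_indices):
    """Case frequency minus control frequency for every observed pattern."""
    case_freq = pattern_frequencies(cases, snp_indices)
    control_freq = pattern_frequencies(controls, snp_indices)
    patterns = set(case_freq) | set(control_freq)
    return {p: case_freq.get(p, 0.0) - control_freq.get(p, 0.0) for p in patterns}

# Hypothetical toy data: four cases and four controls typed at three SNPs.
cases = [(2, 1, 0), (2, 1, 1), (2, 1, 0), (0, 0, 0)]
controls = [(0, 0, 0), (0, 1, 0), (2, 1, 0), (0, 0, 1)]

# Pattern built from the first two SNPs only.
diffs = pattern_frequency_differences(cases, controls, snp_indices=[0, 1])
for pattern, diff in sorted(diffs.items()):
    print(pattern, diff)
```

In this toy example the pattern (2, 1) is enriched among cases, which is the kind of signal a pattern-based test would then evaluate statistically.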
Stochastic Gradient Descent in the Viewpoint of Graduated Optimization
The stochastic gradient descent (SGD) method is popular for solving non-convex
optimization problems in machine learning. This work investigates SGD from the
viewpoint of graduated optimization, a widely applied approach to non-convex
optimization problems. In graduated optimization, a series of smoothed
optimization problems, which can be constructed in various ways, is solved
instead of the actual optimization problem. In this work, a formal
formulation of graduated optimization is provided based on the nonnegative
approximate identity, which generalizes the idea of Gaussian smoothing. An
asymptotic convergence result is also established using techniques from
variational analysis. We then show that the traditional SGD method can be
applied to solve the smoothed optimization problem. Monte Carlo integration is
used to estimate the gradient of the smoothed problem, which may be consistent
with distributed computing schemes in real-life applications. Given the
assumptions on the actual optimization problem, convergence results of SGD for
the smoothed problem can be derived straightforwardly. Numerical examples
provide evidence that the graduated optimization approach may yield more
accurate training results in certain cases.
Comment: 23 pages, 4 figures
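A minimal sketch of graduated optimization with a Monte Carlo gradient estimate is given below. It uses plain Gaussian smoothing, which the abstract describes as a special case of the more general formulation. The test objective, the smoothing schedule, the step size, and all function names are assumptions chosen for illustration and are not taken from the paper.

```python
import math
import random

def f(x):
    # Hypothetical non-convex test objective (not from the paper).
    return x * x + 2.0 * math.sin(5.0 * x)

def grad_f(x):
    # Exact derivative of the test objective.
    return 2.0 * x + 10.0 * math.cos(5.0 * x)

def smoothed_grad(x, sigma, rng, n_samples=20):
    # Monte Carlo estimate of the gradient of the Gaussian-smoothed
    # objective E[f(x + sigma * u)] with u ~ N(0, 1).
    total = 0.0
    for _ in range(n_samples):
        total += grad_f(x + sigma * rng.gauss(0.0, 1.0))
    return total / n_samples

def graduated_sgd(x0, sigmas=(1.0, 0.5, 0.1, 0.0), steps=300, lr=0.01, seed=0):
    # Solve a sequence of progressively less smoothed problems,
    # warm-starting each stage from the previous solution.
    rng = random.Random(seed)
    x = x0
    for sigma in sigmas:
        for _ in range(steps):
            x -= lr * smoothed_grad(x, sigma, rng)
    return x

x_start = 3.0
x_final = graduated_sgd(x_start)
print(x_final, f(x_final))
```

With heavy smoothing the oscillating term is averaged out and the iterate follows the underlying quadratic trend; later stages refine the solution on the less smoothed problem. Because each Monte Carlo sample is independent, the gradient estimate could in principle be computed in parallel.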
Universal primers for HBV genome DNA amplification across subtypes: a case study for designing more effective viral primers
Background: The high heterogeneity of viruses is the major obstacle to efficient DNA amplification. Taking advantage of the large number of virus DNA sequences in public databases to select conserved sites for primer design is an optimal way to tackle the difficulties in virus genome amplification. Results: Here we use hepatitis B virus (HBV) as an example to introduce a simple and efficient method for virus primer design. Based on an alignment of HBV sequences from public databases and BxB, a program written in Perl, our method selected several optimal sites for HBV primer design. Polymerase chain reaction showed that one set of primers for full-length genome amplification and four sets of walking primers achieved a significantly higher success rate than the most popular primers for whole-genome amplification of HBV. These newly designed primers are suitable for most subtypes of HBV. Conclusion: Researchers can extend the method described here to design universal or subtype-specific primers for various types of viruses. The BxB program, which is based on multiple sequence alignment, can be used as a separate tool or integrated into any open-source primer design software to select conserved regions for primer design.
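Selecting conserved sites from a multiple sequence alignment can be illustrated with a short sketch. This is not the BxB program: the function names, the scoring rule (fraction of sequences sharing the most common base, with gaps counted as mismatches), and the toy alignment are all assumptions made for the example.

```python
from collections import Counter

def column_conservation(alignment):
    """Per-column fraction of sequences sharing the most common base.

    alignment: list of equal-length aligned sequences; '-' marks a gap
    and is counted as a mismatch (an assumption of this sketch).
    """
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(base for base in column if base != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n)
    return scores

def conserved_windows(alignment, window, min_score=1.0):
    """Start positions of windows in which every column meets min_score."""
    scores = column_conservation(alignment)
    return [
        start
        for start in range(len(scores) - window + 1)
        if all(s >= min_score for s in scores[start:start + window])
    ]

# Hypothetical toy alignment of four sequences.
alignment = [
    "ACGTTGCAAT",
    "ACGTAGCAAT",
    "ACGTTGCAAC",
    "ACGTCGCAAT",
]

starts = conserved_windows(alignment, window=4)
print(starts)  # candidate primer-site start positions (0-based)
```

A real primer design workflow would use much longer windows and also check melting temperature, GC content, and secondary structure; lowering `min_score` below 1.0 tolerates a few variable positions.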
Stabilized COre Gene and Pathway Election Uncovers Pan-Cancer Shared Pathways and a Cancer-Specific Driver
Approaches that systematically characterize interactions via transcriptomic data usually follow one of two strategies: (i) coexpression network analyses focusing on correlations between genes and (ii) linear regressions (usually regularized) to select multiple genes jointly. Both suffer from the problem of stability: a slight change of parameterization or dataset can lead to marked alterations of outcomes. Here, we propose Stabilized COre gene and Pathway Election (SCOPE), a tool integrating the bootstrapped least absolute shrinkage and selection operator and coexpression analysis, leading to robust outcomes insensitive to variations in data. By applying SCOPE to six cancer expression datasets (BRCA, COAD, KIRC, LUAD, PRAD, and THCA) in The Cancer Genome Atlas, we identified core genes capturing interaction effects in crucial pan-cancer pathways related to genome instability and DNA damage response. Moreover, we highlighted the pivotal role of CD63 as an oncogenic driver and a potential therapeutic target in kidney cancer. SCOPE enables stabilized investigations of complex interactions using transcriptome data.
Comprehensive Analysis of CRP, CFH Y402H and Environmental Risk Factors on Risk of Neovascular Age-Related Macular Degeneration
Purpose: To examine whether the gene encoding C-reactive protein (CRP), a biomarker of inflammation, confers risk for neovascular age-related macular degeneration (AMD) in the presence of other modifiers of inflammation, including body mass index (BMI), diabetes, smoking, and complement factor H (CFH) Y402 genotype. Additionally, we examined the degree to which CRP common variation was in linkage disequilibrium (LD) within our cohort. Methods: We ascertained 244 individuals from 104 families in which at least one member had neovascular AMD and a sibling had normal maculae and was past the age of the index patient’s diagnosis of neovascular AMD. We employed a direct sequencing approach to analyze the 5′-promoter region as well as the entire coding region and the 3′-untranslated region of the CRP gene. CFH Y402 genotype data were available for all participants. Lifestyle and medical factors were obtained via administration of a standardized questionnaire. The family-based association test, haplotype analysis, McNemar’s test, and conditional logistic regression were used to determine significant associations and interactions. Haploview was used to calculate the degree of LD (r²) between all CRP variants identified. Results: Six single nucleotide polymorphisms (SNPs; rs3091244, rs1417938, rs1800947, rs1130864, rs1205, and rs3093068) comprised one haplotype block, of which only rs1130864 and rs1417938 were in high LD (r²=0.94). SNP rs3093068 was in somewhat weaker LD (r²=0.83) with rs3093059, which is not part of the haplotype block. The six SNPs made up six different haplotypes with ≥ 5% frequency, none of which were significantly associated with AMD risk. No statistically significant association was detected between any of the nine common variants in CRP and neovascular AMD when considering disease status alone or when controlling for smoking exposure, BMI, diabetes, or CFH genotype. Significant interactions were not found between CRP genotypes and any of the risk factors studied. No novel CRP variation was identified. Conclusions: We provide evidence that if elevated serum/plasma levels of CRP are associated with neovascular AMD, it is likely not due to genetic variation within CRP, but likely due to variations in some other genetic as well as epidemiological factors.
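The LD measure r² reported above has a simple closed form for two biallelic SNPs when phased haplotypes are available. The sketch below computes it directly from haplotype counts. It is an illustration only: the function name and toy data are hypothetical, and tools such as Haploview typically have to estimate haplotype frequencies from unphased genotypes first, which this sketch does not attempt.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs from phased haplotypes.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) pairs coded 0/1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom

# Hypothetical toy data: eight phased haplotypes in each example.
perfect = [(1, 1)] * 4 + [(0, 0)] * 4
partial = [(1, 1)] * 3 + [(1, 0)] + [(0, 1)] + [(0, 0)] * 3

print(ld_r_squared(perfect))
print(ld_r_squared(partial))
```

In the first example the two SNPs always co-occur, so they carry the same information; in the second, two recombinant haplotypes reduce the correlation.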
The Structural Characterization and Antigenicity of the S Protein of SARS-CoV
The corona-like spikes, or peplomers, seen on the surface of the virion under the electron microscope are the most striking feature of coronaviruses. The S (spike) protein, with 1,255 amino acids, is the largest structural protein encoded in the viral genome. Its structure can be divided into three regions: a long N-terminal region on the exterior, a characteristic transmembrane (TM) region, and a short C-terminus in the interior of the virion. By comparison with the seventeen published SARS-CoV genome sequences, we detected fifteen nucleotide substitutions, eight (53.3%) of which are non-synonymous mutations leading to amino acid alterations with predicted physicochemical changes. The possible antigenic determinants of the S protein are predicted, and the result is confirmed by ELISA (enzyme-linked immunosorbent assay) with synthesized peptides. Another notable finding is that three disulfide bonds are defined between the C-terminus of the S protein and the N-terminus of the E (envelope) protein, based on the typical sequence and positions; if confirmed, this would establish a structural connection between these two important structural proteins. Phylogenetic analysis reveals several conserved regions that might be potent drug targets.
PoolHap: Inferring Haplotype Frequencies from Pooled Samples by Next Generation Sequencing
With the advance of next-generation sequencing (NGS) technologies, increasingly ambitious applications are becoming feasible. A particularly powerful one is the sequencing of polymorphic, pooled samples. The pool can be naturally occurring, as in the case of multiple pathogen strains in a blood sample, multiple types of cells in a cancerous tissue sample, or multiple isoforms of mRNA in a cell. In these cases, it is difficult or impossible to partition the subtypes experimentally before sequencing, and the subtype frequencies must therefore be inferred. In addition, investigators may occasionally want to artificially pool samples from a large number of individuals for reasons of cost-efficiency, e.g., when carrying out genetic mapping using bulked segregant analysis. Here we describe PoolHap, a computational tool for inferring haplotype frequencies from pooled samples when haplotypes are known. The key insight into why PoolHap works is that the large number of SNPs that come with genome-wide coverage can compensate for the uneven coverage across the genome. The performance of PoolHap is illustrated and discussed using simulated and real data. We show that PoolHap is able to accurately estimate the proportions of haplotypes with less than 2% error for 34-strain mixtures with 2X total coverage Arabidopsis thaliana whole genome polymorphism data. This method should facilitate greater biological insight into heterogeneous samples that are difficult or impossible to isolate experimentally. Software and user's manual are freely available at http://arabidopsis.gmi.oeaw.ac.at/quan/poolhap/
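One simple way to pose the problem of inferring haplotype frequencies from a pool, when the haplotypes are known, is as constrained least squares: the pooled allele frequency at each SNP should equal the frequency-weighted sum of the haplotype alleles. The sketch below solves this with projected gradient descent. It is not PoolHap's actual algorithm; the function names, step size, and noiseless toy data are assumptions, and real read counts would add sampling noise that this sketch ignores.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]

def estimate_haplotype_frequencies(haplotypes, pooled_freqs, steps=200, lr=0.2):
    """Least-squares fit of haplotype frequencies to pooled allele frequencies.

    haplotypes: list of known haplotypes, each a list of 0/1 alleles per SNP
    pooled_freqs: observed alternative-allele frequency at each SNP in the pool
    lr: step size chosen for this toy example (assumption)
    """
    k = len(haplotypes)
    n_snps = len(pooled_freqs)
    freqs = [1.0 / k] * k  # start from a uniform mixture
    for _ in range(steps):
        predicted = [sum(freqs[h] * haplotypes[h][j] for h in range(k))
                     for j in range(n_snps)]
        residual = [p - y for p, y in zip(predicted, pooled_freqs)]
        grad = [sum(haplotypes[h][j] * residual[j] for j in range(n_snps))
                for h in range(k)]
        freqs = project_to_simplex([f - lr * g for f, g in zip(freqs, grad)])
    return freqs

# Hypothetical toy data: three known haplotypes over six SNPs,
# mixed at true frequencies 0.5, 0.3, and 0.2 with no noise.
haplotypes = [
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 1],
]
pooled = [0.5, 0.3, 0.2, 0.8, 0.5, 0.7]

estimate = estimate_haplotype_frequencies(haplotypes, pooled)
print([round(x, 4) for x in estimate])
```

The projection step keeps the estimate a valid frequency vector at every iteration. With many more SNPs than haplotypes, as in genome-wide data, the system is heavily overdetermined, which is consistent with the abstract's point that SNP count can compensate for uneven coverage.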