57 research outputs found
Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data
Background: As Next-Generation Sequencing data becomes available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data due to their enormous size. This is and will be a frequent problem that is encountered every day by researchers who are working on genetic data. There are some options available for compressing and storing such data, such as general-purpose compression software, PBAT/PLINK binary format, etc. However, these currently available methods either do not offer sufficient compression rates, or require a great amount of CPU time for decompression and loading every time the data is accessed. Results: Here, we propose a novel and simple algorithm for storing such sequencing data. We show that the compression factor of the algorithm ranges from 16 to several hundreds, which potentially allows SNP data of hundreds of gigabytes to be stored in hundreds of megabytes. We provide a C++ implementation of the algorithm, which supports direct loading and parallel loading of the compressed format without requiring extra time for decompression. By applying the algorithm to simulated and real datasets, we show that the algorithm gives a greater compression rate than the commonly used compression methods, and the data-loading process takes less time. Also, the C++ library provides direct-data-retrieving functions, which allow the compressed information to be easily accessed by other C++ programs. Conclusions: The SpeedGene algorithm enables the storage and the analysis of next generation sequencing data in current hardware environments, making system upgrades unnecessary
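The abstract does not describe SpeedGene's encoding, so the following is only a generic illustration of how compact genotype storage with direct access can work, not the published algorithm. It assumes genotypes are coded as 0, 1, or 2 minor-allele copies with 3 for missing, and packs four codes per byte; the function names are hypothetical.

```python
def pack_genotypes(genotypes):
    """Pack genotype codes (0, 1, 2 = minor-allele copies; 3 = missing)
    into bytes, four 2-bit codes per byte. Illustrative only."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2, 3):
            raise ValueError(f"invalid genotype code at index {i}: {g!r}")
        # Code i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def get_genotype(packed, i):
    """Read the i-th genotype directly from the packed bytes,
    without unpacking the rest of the data."""
    return (packed[i // 4] >> (2 * (i % 4))) & 0b11


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from the packed bytes."""
    return [get_genotype(packed, i) for i in range(n)]
```

For example, `pack_genotypes([0, 1, 2, 3, 2, 1, 0])` stores seven genotypes in two bytes, and `get_genotype` reads any single entry by index.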
Identifying causal rare variants of disease through family-based analysis of Genetics Analysis Workshop 17 data set
Linkage- and association-based methods have been proposed for mapping disease-causing rare variants. Based on the family information provided in the Genetic Analysis Workshop 17 data set, we formulate a two-pronged approach that combines both methods. Using the identity-by-descent information provided for eight extended pedigrees (n = 697) and the simulated quantitative trait Q1, we explore various traditional nonparametric linkage analysis methods; the best result is obtained by assuming between-family heterogeneity and applying the Haseman-Elston regression to each pedigree separately. We discover strong signals from two genes in two different families and weaker signals for a third gene from two other families. As an exploratory approach, we apply an association test based on a modified family-based association test statistic to all rare variants (frequency < 1% or < 3%) designated as causal for Q1. Family-based association tests correctly identified causal single-nucleotide polymorphisms for four genes (KDR, VEGFA, VEGFC, and FLT1). Our results suggest that both linkage and association tests with families show promise for identifying rare variants
Rare Variant Analysis for Family-Based Design
Genome-wide association studies have been able to identify disease associations with many common variants; however, most of the estimated genetic contribution explained by these variants appears to be very modest. Rare variants are thought to have larger effect sizes compared to common SNPs but effects of rare variants cannot be tested in the GWAS setting. Here we propose a novel method to test for association of rare variants obtained by sequencing in family-based samples by collapsing the standard family-based association test (FBAT) statistic over a region of interest. We also propose a suitable weighting scheme so that low frequency SNPs that may be enriched in functional variants can be upweighted compared to common variants. Using simulations we show that the family-based methods perform on par with the population-based methods under no population stratification. By construction, family-based tests are completely robust to population stratification; we show that our proposed methods remain valid even when population stratification is present
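As a rough sketch of what collapsing a family-based statistic over a region can look like, the code below sums weighted per-variant contributions within parent-offspring trios. Several choices here are assumptions and are not taken from the abstract: the trio design, additive genotype coding, the inverse-standard-deviation weights, and the empirical variance estimate. The function names are hypothetical, and this is not the authors' implementation.

```python
import math


def inverse_sd_weights(freqs, n):
    """Assumed weighting scheme: w_j = 1 / sqrt(n * p_j * (1 - p_j)),
    which gives rarer variants larger weights."""
    return [1.0 / math.sqrt(n * p * (1.0 - p)) for p in freqs]


def collapsed_fbat_z(offspring, fathers, mothers, traits, weights):
    """Region-level Z statistic for trios (illustrative sketch).

    offspring[i][j], fathers[i][j], mothers[i][j]: minor-allele counts
    (0, 1, 2) for trio i at variant j. traits[i]: the offspring trait,
    assumed to be centered already. weights[j]: per-variant weight.
    """
    total = 0.0
    variance = 0.0
    for child, dad, mom, t in zip(offspring, fathers, mothers, traits):
        # Under Mendelian transmission the expected offspring allele count
        # given the parents is the mean of the two parental counts.
        s = sum(
            w * (x - (f + m) / 2.0)
            for w, x, f, m in zip(weights, child, dad, mom)
        )
        total += t * s
        variance += (t * s) ** 2  # empirical variance (an assumption)
    if variance == 0.0:
        return 0.0  # no informative transmissions in the region
    return total / math.sqrt(variance)
```

Because each offspring genotype is compared with its expectation given the parents' genotypes, a statistic of this form does not rely on population allele frequencies, which is the intuition behind the robustness to stratification mentioned in the abstract.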
A general semi-parametric approach to the analysis of genetic association studies in population-based designs
Background: For genetic association studies in designs of unrelated individuals, current statistical methodology typically models the phenotype of interest as a function of the genotype and assumes a known statistical model for the phenotype. In the analysis of complex phenotypes, especially in the presence of ascertainment conditions, the specification of such model assumptions is not straightforward and is error-prone, potentially causing misleading results. Results: In this paper, we propose an alternative approach that treats the genotype as the random variable and conditions upon the phenotype. Thereby, the validity of the approach does not depend on the correctness of assumptions about the phenotypic model. Misspecification of the phenotypic model may lead to reduced statistical power. Theoretical derivations and simulation studies demonstrate both the validity and the advantages of the approach over existing methodology. In the COPDGene study (a GWAS for Chronic Obstructive Pulmonary Disease (COPD)), we apply the approach to a secondary, quantitative phenotype, the Fagerstrom nicotine dependence score, that is correlated with COPD affection status. The software package that implements this method is available. Conclusions: The flexibility of this approach enables the straightforward application to quantitative phenotypes and binary traits in ascertained and unascertained samples. In addition to its robustness features, our method provides the platform for the construction of complex statistical models for longitudinal data, multivariate data, multi-marker tests, rare-variant analysis, and others
A comparative analysis of family-based and population-based association tests using whole genome sequence data
The revolution in next-generation sequencing has made obtaining both common and rare high-quality sequence variants across the entire genome feasible. Because researchers are now faced with the analytical challenges of handling a massive amount of genetic variant information from sequencing studies, numerous methods have been developed to assess the impact of both common and rare variants on disease traits. In this report, whole genome sequencing data from Genetic Analysis Workshop 18 was used to compare the power of several methods, considering both family-based and population-based designs, to detect association with variants in the MAP4 gene region and on chromosome 3 with blood pressure. To prioritize variants across the genome for testing, variants were first functionally assessed using prediction algorithms and expression quantitative trait loci (eQTLs) data. Four set-based tests in the family-based association tests (FBAT) framework--FBAT-v, FBAT-lmm, FBAT-m, and FBAT-l--were used to analyze 20 pedigrees, and 2 variance component tests, sequence kernel association test (SKAT) and genome-wide complex trait analysis (GCTA), were used with 142 unrelated individuals in the sample. Both set-based and variance-component-based tests had high power and an adequate type I error rate. Of the various FBATs, FBAT-l demonstrated superior performance, indicating the potential for it to be used in rare-variant analysis. The updated FBAT package is available at: http://www.hsph.harvard.edu/fbat/
Subpopulation Treatment Effect Pattern Plot (STEPP) methods with R and Stata
We introduce the stepp packages for R and Stata that implement the subpopulation treatment effect pattern plot (STEPP) method. STEPP is a nonparametric graphical tool aimed at examining possible heterogeneous treatment effects in subpopulations defined on a continuous covariate or composite score. More specifically, STEPP considers overlapping subpopulations defined with respect to a continuous covariate (or risk index) and it estimates a treatment effect for each subpopulation. It also produces confidence regions and tests for treatment effect heterogeneity among the subpopulations. The original method has been extended in different directions such as different survival contexts, outcome types, or more efficient procedures for identifying the overlapping subpopulations. In this paper, we also introduce a novel method to determine the number of subjects within the subpopulations by minimizing the variability of the sizes of the subpopulations generated by a specific parameter combination. We illustrate the packages using both synthetic data and publicly available data sets. The most intensive computations in R are implemented in Fortran, while the Stata version exploits the powerful Mata language
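The idea of overlapping subpopulations defined on a continuous covariate can be sketched with a simple sliding window. The code is written in Python purely for illustration and does not use or reproduce the stepp packages. The parameter names `r1` (subjects shared by consecutive windows) and `r2` (subjects per window) are assumptions, the function name is hypothetical, and tied covariate values are not given any special treatment.

```python
def sliding_window_subpopulations(covariate, r1, r2):
    """Return overlapping subpopulations as lists of subject indices.

    Subjects are ordered by covariate value; each window holds r2
    subjects and shares r1 subjects with the next one. The final window
    takes whatever subjects remain, so it may be smaller than r2.
    """
    if not 0 <= r1 < r2:
        raise ValueError("require 0 <= r1 < r2")
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    step = r2 - r1  # how far the window advances each time
    windows = []
    start = 0
    while True:
        end = start + r2
        if end >= len(order):
            windows.append(order[start:])  # last window: remaining subjects
            break
        windows.append(order[start:end])
        start += step
    return windows
```

A treatment effect (for example, a difference in mean outcome between arms) would then be estimated separately within each returned window and plotted against the window's covariate range.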
The Use of Modified Mindfulness-Based Stress Reduction and Mindfulness-Based Cognitive Therapy Program for Family Caregivers of People Living with Dementia: A Feasibility Study
Purpose
The aim of this study was to investigate the feasibility and preliminary efficacy of a modified mindfulness-based stress reduction (MBSR) program and mindfulness-based cognitive therapy (MBCT) program for reducing the stress, depressive symptoms, and subjective burden of family caregivers of people with dementia (PWD).
Methods
A prospective, parallel-group, randomized controlled trial design was adopted. Fifty-seven participants were recruited from the community and randomized into either the modified MBSR group (n = 27) or modified MBCT group (n = 26), receiving seven face-to-face intervention sessions for more than 16 weeks. Various psychological outcomes were measured at baseline (T0), immediately after intervention (T1), and at the 3-month follow-up (T2).
Results
Both interventions were found to be feasible in view of the high attendance (more than 70.0%) and low attrition (3.8%) rates. The mixed analysis of variance (ANOVA) results showed positive within-group effects on perceived stress (p = .030, Cohen's d = 0.54), depressive symptoms (p = .002, Cohen's d = 0.77), and subjective caregiver burden (p < .001, Cohen's d = 1.12) in both interventions across the time points, whereas the modified MBCT had a larger effect on stress reduction, compared with the modified MBSR (p = .019).
Conclusion
Both the modified MBSR and MBCT are acceptable to family caregivers of PWD. Their preliminary effects were improvements in stress, depressive symptoms, and subjective burden. The modified MBCT may be more suitable for caregivers of PWD than the MBSR. A future clinical trial is needed to confirm their effectiveness in improving the psychological well-being of caregivers of PWD
Hepatocellular carcinoma surveillance after HBsAg seroclearance
Hepatitis B surface antigen (HBsAg) seroclearance is considered the functional cure and the optimal treatment endpoint for chronic hepatitis B (CHB). Patients with CHB who cleared HBsAg generally have a favorable clinical course with minimal risk of developing hepatocellular carcinoma (HCC) or cirrhotic complications. Nevertheless, a minority of patients still develop HCC despite HBsAg seroclearance. While patients with liver cirrhosis are still recommended for HCC surveillance, whether other non-cirrhotic patients who achieved HBsAg seroclearance should remain on HCC surveillance remains unclear. This review provides an overview of the incidence of HBsAg seroclearance, the factors associated with the occurrence of HBsAg seroclearance, the durability of HBsAg seroclearance, the risk of developing HCC after HBsAg seroclearance, the risk factors associated with HCC development after HBsAg seroclearance, the role of HCC risk scores, and the implications on HCC surveillance. Existing HCC risk scores have a reasonably good performance in patients after HBsAg seroclearance. In the era of artificial intelligence, future HCC risk prediction models based on artificial intelligence and longitudinal clinical data may further improve the prediction accuracy to establish a foundation of a risk score-based HCC surveillance strategy. As different novel hepatitis B virus (HBV) antiviral agents aiming at HBsAg seroclearance are under active development, new knowledge is anticipated on the natural history and HCC risk prediction of patients treated with new HBV drugs
Body mass index change in gastrointestinal cancer and chronic obstructive pulmonary disease is associated with Dedicator of Cytokinesis 1
Background: There have been a number of candidate gene association studies of cancer cachexia-related traits, but no genome-wide association study (GWAS) has been published to date. Cachexia presents in patients with a number of complex traits, including both cancer and COPD. The objective of the current investigation was to search for a shared genetic aetiology for change in body mass index (ΔBMI) among cancer and COPD by using GWAS data in the Framingham Heart Study. Methods: A linear mixed effects model accounting for age, sex, and change in smoking status was used to calculate ΔBMI in participants over 40 years of age with three consecutive BMI time points (n = 4162). Four GWAS of ΔBMI using generalized estimating equations were performed among 1085 participants with a cancer diagnosis, 204 with gastrointestinal (GI) cancer, 112 with lung cancer, and 237 with COPD to test for association with 418 365 single-nucleotide polymorphisms (SNPs). Results: Two SNPs reached a level of genome-wide significance (P < 5 × 10⁻⁸) with ΔBMI: (i) rs41526344 within the CNTN4 gene, among COPD cases (β = 0.13, P = 4.3 × 10⁻⁸); and (ii) rs4751240 in the gene Dedicator of Cytokinesis 1 (DOCK1) among GI cancer cases (β = 0.10, P = 1.9 × 10⁻⁸). The DOCK1 SNP association replicated in the ΔBMI GWAS among COPD cases (meta-analysis β = 0.10, meta-analysis P = 9.3 × 10⁻¹⁰). The DOCK1 gene codes for the dedicator of cytokinesis 1 protein, which has a role in myoblast fusion. Conclusions: In sum, one statistically significant common variant in the DOCK1 gene was associated with ΔBMI in GI cancer and COPD cases, providing support for at least partially shared aetiology of ΔBMI in complex diseases