30 research outputs found

    Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data

    Background: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. Selecting the genes that are important for distinguishing the sample classes being compared poses a challenging problem in high-dimensional data analysis. We describe a new procedure for selecting significant genes, recursive cluster elimination (RCE), as an alternative to recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE. Results: We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Using gene clusters, rather than individual genes, improves supervised classification accuracy on the same data compared with the accuracy obtained when either SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE) is used to remove genes based on their individual discriminant weights. Conclusion: SVM-RCE provides improved classification accuracy with complex microarray data sets compared with the classification accuracy of the same datasets using either SVM-RFE or PDA-RFE. SVM-RCE identifies clusters of correlated genes that, when considered together, provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups. Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.
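The elimination loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the gene clusters are taken as a given input (the abstract obtains them with K-means), and a leave-one-out nearest-centroid classifier stands in for the SVM as the cluster scorer so that the sketch needs only the standard library. The function names, the `keep` and `drop_fraction` parameters, and the toy data are all hypothetical.

```python
import math

def loo_nearest_centroid_accuracy(X, y, genes):
    """Stand-in scorer (NOT an SVM): leave-one-out accuracy of a
    nearest-centroid classifier restricted to the given gene indices."""
    correct = 0
    for held_out in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[i] for i in range(len(X)) if i != held_out and y[i] == label]
            if rows:
                centroids[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
        sample = [X[held_out][g] for g in genes]
        predicted = min(centroids, key=lambda lab: math.dist(sample, centroids[lab]))
        correct += predicted == y[held_out]
    return correct / len(X)

def recursive_cluster_elimination(X, y, clusters, score, keep=1, drop_fraction=0.5):
    """Repeatedly score every surviving gene cluster and drop the
    lowest-scoring ones until only `keep` clusters remain."""
    surviving = dict(clusters)
    while len(surviving) > keep:
        scores = {cid: score(X, y, genes) for cid, genes in surviving.items()}
        n_drop = max(1, int(len(surviving) * drop_fraction))
        n_drop = min(n_drop, len(surviving) - keep)
        for cid in sorted(scores, key=scores.get)[:n_drop]:
            del surviving[cid]
    return surviving

# Toy data (hypothetical): rows are samples, columns are genes.
# Genes 0 and 1 separate the two classes; genes 2 and 3 do not.
X = [
    [1.0, 1.2, 5.0, 3.0],
    [0.9, 1.0, 3.0, 5.0],
    [1.1, 0.8, 4.0, 4.0],
    [3.0, 3.1, 5.0, 3.0],
    [2.9, 3.3, 3.0, 5.0],
    [3.2, 2.9, 4.0, 4.0],
]
y = [0, 0, 0, 1, 1, 1]
clusters = {"informative": [0, 1], "noise": [2, 3]}

selected = recursive_cluster_elimination(X, y, clusters, loo_nearest_centroid_accuracy)
print(sorted(selected))
```

Passing the scorer as an argument keeps the elimination loop independent of the classifier, so a cross-validated SVM could be substituted without changing the loop.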

    Learning from positive examples when the negative class is undetermined- microRNA gene identification

    Background: The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We and others have described the use of two-class machine learning to identify novel miRNAs. These methods require the generation of an artificial negative class. However, designation of the negative class can be problematic and, if it is not properly done, can affect the performance of the classifier dramatically and/or yield a biased estimate of performance. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. These results are compared to published two-class miRNA prediction approaches. We also examine the ability of the one-class and two-class techniques to identify miRNAs in newly sequenced species. Results: Of all methods tested, we found that two-class naïve Bayes and Support Vector Machines gave the best accuracy using our selected features and optimally chosen negative examples. One-class methods showed average accuracies of 70–80% versus 90% for the two two-class methods on the same feature sets. However, some one-class methods outperform some recently published two-class approaches with different selected features. Using the EBV genome as an external validation of the method, we found one-class machine learning to work as well as or better than a two-class approach in identifying true miRNAs as well as predicting new miRNAs. Conclusion: One-class and two-class methods can both give useful classification accuracies when the negative class is well characterized. The advantage of one-class methods is that they eliminate guessing at the optimal features for the negative class when they are not well defined. In these cases one-class methods can be superior to two-class methods when the features chosen as representative of the positive class are well defined. Availability: The OneClassmiRNA program is available at: [1]
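To make the one-class idea concrete, here is a minimal sketch of a classifier trained on positive examples only, with no artificial negative class. It is not the OneClassmiRNA program and not one of the methods evaluated in the abstract: it assumes a simple per-feature Gaussian model with a distance threshold, and the class name, the `quantile` parameter, and the feature values are hypothetical.

```python
import statistics

class OneClassGaussian:
    """Illustrative one-class model: learns per-feature mean and spread
    from positive examples and accepts points close to that profile."""

    def fit(self, positives, quantile=0.9):
        columns = list(zip(*positives))
        self.means = [statistics.fmean(c) for c in columns]
        # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
        self.stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
        # The threshold is the training distance at roughly the given quantile.
        scores = sorted(self.distance(x) for x in positives)
        index = min(len(scores) - 1, int(quantile * len(scores)))
        self.threshold = scores[index]
        return self

    def distance(self, x):
        # Sum of squared z-scores across features.
        return sum(((v - m) / s) ** 2 for v, m, s in zip(x, self.means, self.stdevs))

    def predict(self, x):
        return self.distance(x) <= self.threshold

# Hypothetical two-feature descriptions of known positive examples.
positives = [[1.0, 10.0], [1.2, 9.5], [0.8, 10.5], [1.1, 10.2], [0.9, 9.8]]
model = OneClassGaussian().fit(positives)

print(model.predict([1.0, 10.0]))  # close to the positive profile
print(model.predict([5.0, 2.0]))   # far from the positive profile
```

The only tunable decision is how much of the positive training set the threshold should admit, which is the trade-off a one-class approach makes in place of choosing negative examples.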

    Reproducible big data science: A case study in continuous FAIRness.

    Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.

    Atlas of Transcription Factor Binding Sites from ENCODE DNase Hypersensitivity Data across 27 Tissue Types.

    Characterizing the tissue-specific binding sites of transcription factors (TFs) is essential to reconstruct gene regulatory networks and predict functions for non-coding genetic variation. DNase-seq footprinting enables the prediction of genome-wide binding sites for hundreds of TFs simultaneously. Despite the public availability of high-quality DNase-seq data from hundreds of samples, a comprehensive, up-to-date resource for the locations of genomic footprints is lacking. Here, we develop a scalable footprinting workflow using two state-of-the-art algorithms: Wellington and HINT. We apply our workflow to detect footprints in 192 ENCODE DNase-seq experiments and predict the genomic occupancy of 1,515 human TFs in 27 human tissues. We validate that these footprints overlap true-positive TF binding sites from ChIP-seq. We demonstrate that the locations, depth, and tissue specificity of footprints predict effects of genetic variants on gene expression and capture a substantial proportion of genetic risk for complex traits.

    Global age-sex-specific fertility, mortality, healthy life expectancy (HALE), and population estimates in 204 countries and territories, 1950-2019 : a comprehensive demographic analysis for the Global Burden of Disease Study 2019

    Background: Accurate and up-to-date assessment of demographic metrics is crucial for understanding a wide range of social, economic, and public health issues that affect populations worldwide. The Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2019 produced updated and comprehensive demographic assessments of the key indicators of fertility, mortality, migration, and population for 204 countries and territories and selected subnational locations from 1950 to 2019. Methods: 8078 country-years of vital registration and sample registration data, 938 surveys, 349 censuses, and 238 other sources were identified and used to estimate age-specific fertility. Spatiotemporal Gaussian process regression (ST-GPR) was used to generate age-specific fertility rates for 5-year age groups between ages 15 and 49 years. With extensions to age groups 10–14 and 50–54 years, the total fertility rate (TFR) was then aggregated using the estimated age-specific fertility between ages 10 and 54 years. 7417 sources were used for under-5 mortality estimation and 7355 for adult mortality. ST-GPR was used to synthesise data sources after correction for known biases. Adult mortality was measured as the probability of death between ages 15 and 60 years based on vital registration, sample registration, and sibling histories, and was also estimated using ST-GPR. HIV-free life tables were then estimated using estimates of under-5 and adult mortality rates using a relational model life table system created for GBD, which closely tracks observed age-specific mortality rates from complete vital registration when available. Independent estimates of HIV-specific mortality generated by an epidemiological analysis of HIV prevalence surveys and antenatal clinic serosurveillance and other sources were incorporated into the estimates in countries with large epidemics. 
Annual and single-year age estimates of net migration and population for each country and territory were generated using a Bayesian hierarchical cohort component model that analysed estimated age-specific fertility and mortality rates along with 1250 censuses and 747 population registry years. We classified location-years into seven categories on the basis of the natural rate of increase in population (calculated by subtracting the crude death rate from the crude birth rate) and the net migration rate. We computed healthy life expectancy (HALE) using years lived with disability (YLDs) per capita, life tables, and standard demographic methods. Uncertainty was propagated throughout the demographic estimation process, including fertility, mortality, and population, with 1000 draw-level estimates produced for each metric. Findings: The global TFR decreased from 2·72 (95% uncertainty interval [UI] 2·66–2·79) in 2000 to 2·31 (2·17–2·46) in 2019. Global annual livebirths increased from 134·5 million (131·5–137·8) in 2000 to a peak of 139·6 million (133·0–146·9) in 2016. Global livebirths then declined to 135·3 million (127·2–144·1) in 2019. Of the 204 countries and territories included in this study, in 2019, 102 had a TFR lower than 2·1, which is considered a good approximation of replacement-level fertility. All countries in sub-Saharan Africa had TFRs above replacement level in 2019 and accounted for 27·1% (95% UI 26·4–27·8) of global livebirths. Global life expectancy at birth increased from 67·2 years (95% UI 66·8–67·6) in 2000 to 73·5 years (72·8–74·3) in 2019. The total number of deaths increased from 50·7 million (49·5–51·9) in 2000 to 56·5 million (53·7–59·2) in 2019. Under-5 deaths declined from 9·6 million (9·1–10·3) in 2000 to 5·0 million (4·3–6·0) in 2019. Global population increased by 25·7%, from 6·2 billion (6·0–6·3) in 2000 to 7·7 billion (7·5–8·0) in 2019. 
In 2019, 34 countries had negative natural rates of increase; in 17 of these, the population declined because immigration was not sufficient to counteract the negative rate of decline. Globally, HALE increased from 58·6 years (56·1–60·8) in 2000 to 63·5 years (60·8–66·1) in 2019. HALE increased in 202 of 204 countries and territories between 2000 and 2019.
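Two of the quantities in this abstract reduce to simple arithmetic, sketched below under stated assumptions. The total fertility rate is taken as the sum of age-specific fertility rates over the nine 5-year age groups from 10–14 to 50–54, multiplied by the 5-year group width; the natural rate of increase is the crude birth rate minus the crude death rate, as the abstract defines it. The rates shown are hypothetical illustrative values, not GBD estimates, and the function names are invented for this sketch.

```python
# Hypothetical age-specific fertility rates (births per woman per year)
# for 5-year age groups; illustrative values only, not GBD estimates.
ASFR = {
    "10-14": 0.001, "15-19": 0.040, "20-24": 0.120,
    "25-29": 0.130, "30-34": 0.095, "35-39": 0.050,
    "40-44": 0.018, "45-49": 0.005, "50-54": 0.001,
}
AGE_GROUP_WIDTH_YEARS = 5

def total_fertility_rate(asfr):
    """Each rate applies for every year a woman spends in the age group,
    so the rates are summed and multiplied by the group width."""
    return AGE_GROUP_WIDTH_YEARS * sum(asfr.values())

def natural_rate_of_increase(crude_birth_rate, crude_death_rate):
    """Crude birth rate minus crude death rate (same units, e.g. per 1000)."""
    return crude_birth_rate - crude_death_rate

def population_declined(natural_rate, net_migration_rate):
    """Population falls when net migration does not offset a negative
    natural rate of increase (both in the same units)."""
    return natural_rate + net_migration_rate < 0

tfr = total_fertility_rate(ASFR)
natural = natural_rate_of_increase(10.0, 12.0)  # hypothetical rates per 1000
print(round(tfr, 2), natural, population_declined(natural, 1.0))
```

With these illustrative inputs, net immigration of 1 per 1000 does not offset a natural rate of -2 per 1000, which is the situation the abstract reports for 17 countries.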

    Spatial, temporal, and demographic patterns in prevalence of smoking tobacco use and attributable disease burden in 204 countries and territories, 1990-2019 : a systematic analysis from the Global Burden of Disease Study 2019

    Background: Ending the global tobacco epidemic is a defining challenge in global health. Timely and comprehensive estimates of the prevalence of smoking tobacco use and attributable disease burden are needed to guide tobacco control efforts nationally and globally. Methods: We estimated the prevalence of smoking tobacco use and attributable disease burden for 204 countries and territories, by age and sex, from 1990 to 2019 as part of the Global Burden of Diseases, Injuries, and Risk Factors Study. We modelled multiple smoking-related indicators from 3625 nationally representative surveys. We completed systematic reviews and did Bayesian meta-regressions for 36 causally linked health outcomes to estimate non-linear dose-response risk curves for current and former smokers. We used a direct estimation approach to estimate attributable burden, providing more comprehensive estimates of the health effects of smoking than previously available. Findings: Globally in 2019, 1.14 billion (95% uncertainty interval 1.13-1.16) individuals were current smokers, who consumed 7.41 trillion (7.11-7.74) cigarette-equivalents of tobacco in 2019. Although prevalence of smoking had decreased significantly since 1990 among both males (27.5% [26.5-28.5] reduction) and females (37.7% [35.4-39.9] reduction) aged 15 years and older, population growth has led to a significant increase in the total number of smokers from 0.99 billion (0.98-1.00) in 1990. Globally in 2019, smoking tobacco use accounted for 7.69 million (7.16-8.20) deaths and 200 million (185-214) disability-adjusted life-years, and was the leading risk factor for death among males (20.2% [19.3-21.1] of male deaths). 6.68 million [86.9%] of 7.69 million deaths attributable to smoking tobacco use were among current smokers. Interpretation: In the absence of intervention, the annual toll of 7.69 million deaths and 200 million disability-adjusted life-years attributable to smoking will increase over the coming decades. 
Substantial progress in reducing the prevalence of smoking tobacco use has been observed in countries from all regions and at all stages of development, but a large implementation gap remains for tobacco control. Countries have a clear and urgent opportunity to pass strong, evidence-based policies to accelerate reductions in the prevalence of smoking and reap massive health benefits for their citizens. Copyright (C) 2021 The Author(s). Published by Elsevier Ltd.

    The trans-ancestral genomic architecture of glycemic traits

    Glycemic traits are used to diagnose and monitor type 2 diabetes and cardiometabolic health. To date, most genetic studies of glycemic traits have focused on individuals of European ancestry. Here we aggregated genome-wide association studies comprising up to 281,416 individuals without diabetes (30% non-European ancestry) for whom fasting glucose, 2-h glucose after an oral glucose challenge, glycated hemoglobin and fasting insulin data were available. Trans-ancestry and single-ancestry meta-analyses identified 242 loci (99 novel; P < 5 × 10^-8), 80% of which had no significant evidence of between-ancestry heterogeneity. Analyses restricted to individuals of European ancestry with equivalent sample size would have led to 24 fewer new loci. Compared with single-ancestry analyses, equivalent-sized trans-ancestry fine-mapping reduced the number of estimated variants in 99% credible sets by a median of 37.5%. Genomic-feature, gene-expression and gene-set analyses revealed distinct biological signatures for each trait, highlighting different underlying biological pathways. Our results increase our understanding of diabetes pathophysiology by using trans-ancestry studies for improved power and resolution. A trans-ancestry meta-analysis of GWAS of glycemic traits in up to 281,416 individuals identifies 99 novel loci, of which one quarter was found due to the multi-ancestry approach, which also improves fine-mapping of credible variant sets.

    Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019

    Background: In an era of shifting global agendas and expanded emphasis on non-communicable diseases and injuries along with communicable diseases, sound evidence on trends by cause at the national level is essential. The Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) provides a systematic scientific assessment of published, publicly available, and contributed data on incidence, prevalence, and mortality for a mutually exclusive and collectively exhaustive list of diseases and injuries. Methods: GBD estimates incidence, prevalence, mortality, years of life lost (YLLs), years lived with disability (YLDs), and disability-adjusted life-years (DALYs) due to 369 diseases and injuries, for two sexes, and for 204 countries and territories. Input data were extracted from censuses, household surveys, civil registration and vital statistics, disease registries, health service use, air pollution monitors, satellite imaging, disease notifications, and other sources. Cause-specific death rates and cause fractions were calculated using the Cause of Death Ensemble model and spatiotemporal Gaussian process regression. Cause-specific deaths were adjusted to match the total all-cause deaths calculated as part of the GBD population, fertility, and mortality estimates. Deaths were multiplied by standard life expectancy at each age to calculate YLLs. A Bayesian meta-regression modelling tool, DisMod-MR 2.1, was used to ensure consistency between incidence, prevalence, remission, excess mortality, and cause-specific mortality for most causes. Prevalence estimates were multiplied by disability weights for mutually exclusive sequelae of diseases and injuries to calculate YLDs. We considered results in the context of the Socio-demographic Index (SDI), a composite indicator of income per capita, years of schooling, and fertility rate in females younger than 25 years. 
Uncertainty intervals (UIs) were generated for every metric using the 25th and 975th ordered 1000 draw values of the posterior distribution. Findings: Global health has steadily improved over the past 30 years as measured by age-standardised DALY rates. After taking into account population growth and ageing, the absolute number of DALYs has remained stable. Since 2010, the pace of decline in global age-standardised DALY rates has accelerated in age groups younger than 50 years compared with the 1990–2010 time period, with the greatest annualised rate of decline occurring in the 0–9-year age group. Six infectious diseases were among the top ten causes of DALYs in children younger than 10 years in 2019: lower respiratory infections (ranked second), diarrhoeal diseases (third), malaria (fifth), meningitis (sixth), whooping cough (ninth), and sexually transmitted infections (which, in this age group, is fully accounted for by congenital syphilis; ranked tenth). In adolescents aged 10–24 years, three injury causes were among the top causes of DALYs: road injuries (ranked first), self-harm (third), and interpersonal violence (fifth). Five of the causes that were in the top ten for ages 10–24 years were also in the top ten in the 25–49-year age group: road injuries (ranked first), HIV/AIDS (second), low back pain (fourth), headache disorders (fifth), and depressive disorders (sixth). In 2019, ischaemic heart disease and stroke were the top-ranked causes of DALYs in both the 50–74-year and 75-years-and-older age groups. Since 1990, there has been a marked shift towards a greater proportion of burden due to YLDs from non-communicable diseases and injuries. In 2019, there were 11 countries where non-communicable disease and injury YLDs constituted more than half of all disease burden. 
Decreases in age-standardised DALY rates have accelerated over the past decade in countries at the lower end of the SDI range, while improvements have started to stagnate or even reverse in countries with higher SDI. Interpretation: As disability becomes an increasingly large component of disease burden and a larger component of health expenditure, greater research and development investment is needed to identify new, more effective intervention strategies. With a rapidly ageing global population, the demands on health services to deal with disabling outcomes, which increase with age, will require policy makers to anticipate these changes. The mix of universal and more geographically specific influences on health reinforces the need for regular reporting on population health in detail and by underlying cause to help decision makers to identify success stories of disease control to emulate, as well as opportunities to improve. Funding: Bill & Melinda Gates Foundation. © 2020 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.
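The burden calculations named in the Methods of this abstract can be sketched as plain arithmetic. The sketch follows the abstract's descriptions: YLLs are deaths multiplied by standard life expectancy at each age, YLDs are prevalence multiplied by disability weights for each sequela, and a 95% uncertainty interval is read off the 25th and 975th ordered values of 1000 draws. DALYs are taken as YLLs plus YLDs, the standard definition. All input values and function names are hypothetical, and the sketch omits the modelling steps (CODEm, DisMod-MR 2.1) entirely.

```python
def years_of_life_lost(deaths_by_age, standard_life_expectancy):
    """YLLs: deaths at each age times the standard life expectancy at that age."""
    return sum(deaths * standard_life_expectancy[age]
               for age, deaths in deaths_by_age.items())

def years_lived_with_disability(prevalence_by_sequela, disability_weight):
    """YLDs: prevalent cases of each sequela times its disability weight."""
    return sum(cases * disability_weight[sequela]
               for sequela, cases in prevalence_by_sequela.items())

def uncertainty_interval(draws):
    """95% UI from 1000 draws: the 25th and 975th ordered values."""
    if len(draws) != 1000:
        raise ValueError("expected exactly 1000 draws")
    ordered = sorted(draws)
    return ordered[24], ordered[974]  # zero-based indices for 25th and 975th

# Hypothetical inputs, for illustration only.
ylls = years_of_life_lost({0: 10, 50: 4}, {0: 88.0, 50: 39.0})
ylds = years_lived_with_disability({"mild": 100, "severe": 10},
                                   {"mild": 0.25, "severe": 0.5})
dalys = ylls + ylds  # DALYs combine fatal and non-fatal burden

print(ylls, ylds, dalys)
print(uncertainty_interval(list(range(1000, 0, -1))))
```

Taking fixed ordered positions from the draws, rather than assuming a distribution, lets the interval reflect whatever shape the posterior draws have.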

    Spatial, temporal, and demographic patterns in prevalence of chewing tobacco use in 204 countries and territories, 1990-2019 : a systematic analysis from the Global Burden of Disease Study 2019

    Background: Chewing tobacco and other types of smokeless tobacco use have had less attention from the global health community than smoked tobacco use. However, the practice is popular in many parts of the world and has been linked to several adverse health outcomes. Understanding trends in prevalence with age, over time, and by location and sex is important for policy setting and in relation to monitoring and assessing commitment to the WHO Framework Convention on Tobacco Control. Methods: We estimated prevalence of chewing tobacco use as part of the Global Burden of Diseases, Injuries, and Risk Factors Study 2019 using a modelling strategy that used information on multiple types of smokeless tobacco products. We generated a time series of prevalence of chewing tobacco use among individuals aged 15 years and older from 1990 to 2019 in 204 countries and territories, including age-sex specific estimates. We also compared these trends to those of smoked tobacco over the same time period. Findings: In 2019, 273·9 million (95% uncertainty interval 258·5 to 290·9) people aged 15 years and older used chewing tobacco, and the global age-standardised prevalence of chewing tobacco use was 4·72% (4·46 to 5·01). 228·2 million (213·6 to 244·7; 83·29% [82·15 to 84·42]) chewing tobacco users lived in the south Asia region. Prevalence among young people aged 15-19 years was over 10% in seven locations in 2019. Although global age-standardised prevalence of smoking tobacco use decreased significantly between 1990 and 2019 (annualised rate of change: -1·21% [-1·26 to -1·16]), similar progress was not observed for chewing tobacco (0·46% [0·13 to 0·79]). Among the 12 highest prevalence countries (Bangladesh, Bhutan, Cambodia, India, Madagascar, Marshall Islands, Myanmar, Nepal, Pakistan, Palau, Sri Lanka, and Yemen), only Yemen had a significant decrease in the prevalence of chewing tobacco use, which was among males between 1990 and 2019 (-0·94% [-1·72 to -0·14]), compared with nine of 12 countries that had significant decreases in the prevalence of smoking tobacco. Among females, none of these 12 countries had significant decreases in prevalence of chewing tobacco use, whereas seven of 12 countries had a significant decrease in the prevalence of tobacco smoking use for the period. Interpretation: Chewing tobacco remains a substantial public health problem in several regions of the world, and predominantly in south Asia. We found little change in the prevalence of chewing tobacco use between 1990 and 2019, and that control efforts have had much larger effects on the prevalence of smoking tobacco use than on chewing tobacco use in some countries. Mitigating the health effects of chewing tobacco requires stronger regulations and policies that specifically target use of chewing tobacco, especially in countries with high prevalence. Copyright (c) 2021 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.