24 research outputs found
Global age-sex-specific mortality, life expectancy, and population estimates in 204 countries and territories and 811 subnational locations, 1950–2021, and the impact of the COVID-19 pandemic: a comprehensive demographic analysis for the Global Burden of Disease Study 2021
BACKGROUND: Estimates of demographic metrics are crucial to assess levels and trends of population health outcomes. The profound impact of the COVID-19 pandemic on populations worldwide has underscored the need for timely estimates to understand this unprecedented event within the context of long-term population health trends. The Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2021 provides new demographic estimates for 204 countries and territories and 811 additional subnational locations from 1950 to 2021, with a particular emphasis on changes in mortality and life expectancy that occurred during the 2020–21 COVID-19 pandemic period. METHODS: 22 223 data sources from vital registration, sample registration, surveys, censuses, and other sources were used to estimate mortality, with a subset of these sources used exclusively to estimate excess mortality due to the COVID-19 pandemic. 2026 data sources were used for population estimation. Additional sources were used to estimate migration; the effects of the HIV epidemic; and demographic discontinuities due to conflicts, famines, natural disasters, and pandemics, which are used as inputs for estimating mortality and population. Spatiotemporal Gaussian process regression (ST-GPR) was used to generate under-5 mortality rates, which synthesised 30 763 location-years of vital registration and sample registration data, 1365 surveys and censuses, and 80 other sources. ST-GPR was also used to estimate adult mortality (between ages 15 and 59 years) based on information from 31 642 location-years of vital registration and sample registration data, 355 surveys and censuses, and 24 other sources. Estimates of child and adult mortality rates were then used to generate life tables with a relational model life table system. 
For countries with large HIV epidemics, life tables were adjusted using independent estimates of HIV-specific mortality generated via an epidemiological analysis of HIV prevalence surveys, antenatal clinic serosurveillance, and other data sources. Excess mortality due to the COVID-19 pandemic in 2020 and 2021 was determined by subtracting observed all-cause mortality (adjusted for late registration and mortality anomalies) from the mortality expected in the absence of the pandemic. Expected mortality was calculated based on historical trends using an ensemble of models. In location-years where all-cause mortality data were unavailable, we estimated excess mortality rates using a regression model with covariates pertaining to the pandemic. Population size was computed using a Bayesian hierarchical cohort component model. Life expectancy was calculated using age-specific mortality rates and standard demographic methods. Uncertainty intervals (UIs) were calculated for every metric using the 25th and 975th ordered values from a 1000-draw posterior distribution. FINDINGS: Global all-cause mortality followed two distinct patterns over the study period: age-standardised mortality rates declined between 1950 and 2019 (a 62·8% [95% UI 60·5–65·1] decline), and increased during the COVID-19 pandemic period (2020–21; 5·1% [0·9–9·6] increase). In contrast with the overall reverse in mortality trends during the pandemic period, child mortality continued to decline, with 4·66 million (3·98–5·50) global deaths in children younger than 5 years in 2021 compared with 5·21 million (4·50–6·01) in 2019. An estimated 131 million (126–137) people died globally from all causes in 2020 and 2021 combined, of which 15·9 million (14·7–17·2) were due to the COVID-19 pandemic (measured by excess mortality, which includes deaths directly due to SARS-CoV-2 infection and those indirectly due to other social, economic, or behavioural changes associated with the pandemic). 
Excess mortality rates exceeded 150 deaths per 100 000 population during at least one year of the pandemic in 80 countries and territories, whereas 20 nations had a negative excess mortality rate in 2020 or 2021, indicating that all-cause mortality in these countries was lower during the pandemic than expected based on historical trends. Between 1950 and 2021, global life expectancy at birth increased by 22·7 years (20·8–24·8), from 49·0 years (46·7–51·3) to 71·7 years (70·9–72·5). Global life expectancy at birth declined by 1·6 years (1·0–2·2) between 2019 and 2021, reversing historical trends. An increase in life expectancy was only observed in 32 (15·7%) of 204 countries and territories between 2019 and 2021. The global population reached 7·89 billion (7·67–8·13) people in 2021, by which time 56 of 204 countries and territories had peaked and subsequently populations have declined. The largest proportion of population growth between 2020 and 2021 was in sub-Saharan Africa (39·5% [28·4–52·7]) and south Asia (26·3% [9·0–44·7]). From 2000 to 2021, the ratio of the population aged 65 years and older to the population aged younger than 15 years increased in 188 (92·2%) of 204 nations. INTERPRETATION: Global adult mortality rates markedly increased during the COVID-19 pandemic in 2020 and 2021, reversing past decreasing trends, while child mortality rates continued to decline, albeit more slowly than in earlier years. Although COVID-19 had a substantial impact on many demographic indicators during the first 2 years of the pandemic, overall global health progress over the 72 years evaluated has been profound, with considerable improvements in mortality and life expectancy. Additionally, we observed a deceleration of global population growth since 2017, despite steady or increasing growth in lower-income countries, combined with a continued global shift of population age structures towards older ages. 
These demographic changes will likely present future challenges to health systems, economies, and societies. The comprehensive demographic estimates reported here will enable researchers, policy makers, health practitioners, and other key stakeholders to better understand and address the profound changes that have occurred in the global health landscape following the first 2 years of the COVID-19 pandemic, and longer-term trends beyond the pandemic. FUNDING: Bill & Melinda Gates Foundation
Global burden of 288 causes of death and life expectancy decomposition in 204 countries and territories and 811 subnational locations, 1990–2021: a systematic analysis for the Global Burden of Disease Study 2021
BACKGROUND Regular, detailed reporting on population health by underlying cause of death is fundamental for public health decision making. Cause-specific estimates of mortality and the subsequent effects on life expectancy worldwide are valuable metrics to gauge progress in reducing mortality rates. These estimates are particularly important following large-scale mortality spikes, such as the COVID-19 pandemic. When systematically analysed, mortality rates and life expectancy allow comparisons of the consequences of causes of death globally and over time, providing a nuanced understanding of the effect of these causes on global populations. METHODS The Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2021 cause-of-death analysis estimated mortality and years of life lost (YLLs) from 288 causes of death by age-sex-location-year in 204 countries and territories and 811 subnational locations for each year from 1990 until 2021. The analysis used 56 604 data sources, including data from vital registration and verbal autopsy as well as surveys, censuses, surveillance systems, and cancer registries, among others. As with previous GBD rounds, cause-specific death rates for most causes were estimated using the Cause of Death Ensemble model (a modelling tool developed for GBD to assess the out-of-sample predictive validity of different statistical models and covariate permutations and combine those results to produce cause-specific mortality estimates), with alternative strategies adapted to model causes with insufficient data, substantial changes in reporting over the study period, or unusual epidemiology. YLLs were computed as the product of the number of deaths for each cause-age-sex-location-year and the standard life expectancy at each age. As part of the modelling process, uncertainty intervals (UIs) were generated using the 2·5th and 97·5th percentiles from a 1000-draw distribution for each metric. 
We decomposed life expectancy by cause of death, location, and year to show cause-specific effects on life expectancy from 1990 to 2021. We also used the coefficient of variation and the fraction of population affected by 90% of deaths to highlight concentrations of mortality. Findings are reported in counts and age-standardised rates. Methodological improvements for cause-of-death estimates in GBD 2021 include the expansion of under-5-years age group to include four new age groups, enhanced methods to account for stochastic variation of sparse data, and the inclusion of COVID-19 and other pandemic-related mortality, which includes excess mortality associated with the pandemic, excluding COVID-19, lower respiratory infections, measles, malaria, and pertussis. For this analysis, 199 new country-years of vital registration cause-of-death data, 5 country-years of surveillance data, 21 country-years of verbal autopsy data, and 94 country-years of other data types were added to those used in previous GBD rounds. FINDINGS The leading causes of age-standardised deaths globally were the same in 2019 as they were in 1990; in descending order, these were ischaemic heart disease, stroke, chronic obstructive pulmonary disease, and lower respiratory infections. In 2021, however, COVID-19 replaced stroke as the second-leading age-standardised cause of death, with 94·0 deaths (95% UI 89·2-100·0) per 100 000 population. The COVID-19 pandemic shifted the rankings of the leading five causes, lowering stroke to the third-leading and chronic obstructive pulmonary disease to the fourth-leading position. In 2021, the highest age-standardised death rates from COVID-19 occurred in sub-Saharan Africa (271·0 deaths [250·1-290·7] per 100 000 population) and Latin America and the Caribbean (195·4 deaths [182·1-211·4] per 100 000 population). 
The lowest age-standardised death rates from COVID-19 were in the high-income super-region (48·1 deaths [47·4-48·8] per 100 000 population) and southeast Asia, east Asia, and Oceania (23·2 deaths [16·3-37·2] per 100 000 population). Globally, life expectancy steadily improved between 1990 and 2019 for 18 of the 22 investigated causes. Decomposition of global and regional life expectancy showed that reductions in deaths from enteric infections, lower respiratory infections, stroke, and neonatal deaths, among others, have contributed to improved survival over the study period. However, a net reduction of 1·6 years occurred in global life expectancy between 2019 and 2021, primarily due to increased death rates from COVID-19 and other pandemic-related mortality. Life expectancy was highly variable between super-regions over the study period, with southeast Asia, east Asia, and Oceania gaining 8·3 years (6·7-9·9) overall, while having the smallest reduction in life expectancy due to COVID-19 (0·4 years). The largest reduction in life expectancy due to COVID-19 occurred in Latin America and the Caribbean (3·6 years). Additionally, 53 of the 288 causes of death were highly concentrated in locations with less than 50% of the global population as of 2021, and these causes of death became progressively more concentrated since 1990, when only 44 causes showed this pattern. The concentration phenomenon is discussed heuristically with respect to enteric and lower respiratory infections, malaria, HIV/AIDS, neonatal disorders, tuberculosis, and measles. INTERPRETATION Long-standing gains in life expectancy and reductions in many of the leading causes of death have been disrupted by the COVID-19 pandemic, the adverse effects of which were spread unevenly among populations. Despite the pandemic, there has been continued progress in combatting several notable causes of death, leading to improved global life expectancy over the study period. 
Each of the seven GBD super-regions showed an overall improvement from 1990 to 2021, obscuring the negative effect in the years of the pandemic. Additionally, our findings regarding regional variation in causes of death driving increases in life expectancy hold clear policy utility. Analyses of shifting mortality trends reveal that several causes, once widespread globally, are now increasingly concentrated geographically. These changes in mortality concentration, alongside further investigation of changing risks, interventions, and relevant policy, present an important opportunity to deepen our understanding of mortality-reduction strategies. Examining patterns in mortality concentration might reveal areas where successful public health interventions have been implemented. Translating these successes to locations where certain causes of death remain entrenched can inform policies that work to improve life expectancy for people everywhere. FUNDING Bill & Melinda Gates Foundation
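The two calculations named in the methods above (YLLs as deaths multiplied by the standard life expectancy at the age of death, and 95% UIs taken from a 1000-draw distribution) can be illustrated with a minimal Python sketch. It is not the GBD pipeline: the function names, the simulated death draws, and the life-expectancy value of 35 years are hypothetical and chosen only for illustration.

```python
import random


def yll(deaths, standard_life_expectancy):
    """Years of life lost for one cause-age-sex-location-year cell:
    number of deaths times the standard life expectancy at that age."""
    return deaths * standard_life_expectancy


def uncertainty_interval(draws):
    """95% UI from the 2.5th and 97.5th percentiles of the draws.

    With 1000 draws these are the 25th and 975th ordered values.
    Integer arithmetic avoids floating-point rank errors.
    """
    n = len(draws)
    if n < 40:
        raise ValueError("need at least 40 draws to locate the 2.5th percentile")
    ordered = sorted(draws)
    lower_rank = (25 * n) // 1000   # 25 when n == 1000
    upper_rank = (975 * n) // 1000  # 975 when n == 1000
    # Ranks are 1-based; Python lists are 0-based.
    return ordered[lower_rank - 1], ordered[upper_rank - 1]


# Hypothetical inputs: 1000 simulated draws of deaths in one cell,
# and an assumed standard life expectancy of 35 years at that age.
rng = random.Random(0)
death_draws = [rng.gauss(1200.0, 50.0) for _ in range(1000)]
life_expectancy_at_age = 35.0

yll_draws = [yll(d, life_expectancy_at_age) for d in death_draws]
mean_yll = sum(yll_draws) / len(yll_draws)
low, high = uncertainty_interval(yll_draws)
print(f"YLLs: {mean_yll:.0f} (95% UI {low:.0f} to {high:.0f})")
```

Carrying the draws through the YLL calculation before taking percentiles, rather than multiplying the interval bounds afterwards, is one common way to propagate uncertainty; whether GBD does exactly this at every step is not stated in the abstract.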
All scripts and data files
A Time and a Place for Everything: Phylogenetic history and geography as joint predictors of oak plastome phylogeny
Due to high rates of introgressive hybridization, the plastid genome is poorly suited to fine-scale DNA barcoding and phylogenetic studies of the oak genus (Quercus, Fagaceae). At the tips of the oak plastome phylogeny, recent gene migration and reticulation generally cause topology to reflect geographic structure, while deeper branches reflect lineage divergence. In this study, we quantify the simple and partial effects of geographic proximity and nucleome-inferred phylogenetic history on oak plastome phylogeny at different evolutionary scales. Our study compares pairwise phylogenetic distances based on complete plastome sequences, pairwise phylogenetic distances from nuclear restriction site-associated DNA sequences (RADseq), and pairwise geographic distances for 34 individuals of the white oak clade representing 24 North American and Eurasian species. Within the North American white oak clade alone, phylogenetic history has essentially no effect on plastome variation, while geography explains 11–21% of plastome phylogenetic variance. However, across multiple continents and clades, phylogeny predicts 30–41% of plastome variation, geography 3–41%. Tipwise attenuation of phylogenetic informativeness in the plastome means that in practical terms, plastome data has little use in solving phylogenetic questions, but can still be a useful barcoding / phylogenetic marker for resolving questions among major clades.
Text-fig. 4. Reconstruction of Quinquala obovata fruits; artwork by K. K. Pham. in Winged Fruits Of Rutaceous Affinity From The Eocene Of Western North America
Text-fig. 4. Reconstruction of Quinquala obovata fruits; artwork by K. K. Pham. Published as part of Manchester, Steven R., Disney, Kory A. & Pham, Kasey K., 2020, Winged Fruits Of Rutaceous Affinity From The Eocene Of Western North America, pp. 211-216 in Fossil Imprint 76 (2) on page 215, DOI: 10.37520/fi.2020.018, http://zenodo.org/record/538617
Quinquala obovata MANCHESTER et DISNEY 2020, sp. nov.
<i>Quinquala obovata</i> MANCHESTER et DISNEY sp. nov.
<p>Text-figs 2, 3a–g, i, j, 4</p>
<p>Holotype. UF 19376-60038 (Text-fig. 2a) housed in Florida Museum of Natural History, Gainesville, USA.</p>
<p>Paratypes. UF 19376-60023a (Text-fig. 2b), UF 19374-60339 (Text-fig. 2c), UF 19376-60050 (Text-fig. 2d), UF 19374-61756 (Text-fig. 2e), UF 19376-60067 (Text-fig. 2f, p), UF 19376-60069 (Text-figs 2i, 3e), UF 19376-60072 (Text-fig. 2j), UF 19376-60062 (Text-figs 2l, 3f), UF 19374-60403 (Text-fig. 2m), UF 19376-60023b (Text-fig. 3a), UF 19376-60023c (Text-fig. 3b), UF 262-17690 (Text-fig. 3c, g), UF 229-53091 (Text-fig. 3d) housed in Florida Museum of Natural History, Gainesville, USA.</p>
<p>Plant Fossil Names Registry Number. PFN001528 (for new species).</p>
<p>Etymology. The epithet <i>obovata</i> refers to the fruit shape.</p>
<p>Type locality. Kisinger Lakes, northwestern Wyoming, USA (UF 19376: N 43° 42.056′, W 109° 52.918′).</p>
<p>Type horizon and age. Tepee Trail Formation, Eocene.</p>
<p>Additional localities. Kisinger Lakes (UF 19374: N 43° 42′ 01.9″, W 109° 52′ 44.9″; UF 19375: N 43° 42′ 03.0″, W 109° 52′ 53.3″), West Branch Creek, northcentral Oregon, USA (UF 229: N 44° 34′ 53.40″, W 120° 15′ 57.31″; UF 230: N 44° 35′ 25.31″, W 120° 15′ 27.83″; Eocene Clarno Formation), White Cliffs, northcentral Oregon (UF 262: N 44° 44.302′, W 120° 28.376′; Eocene Clarno Formation).</p>
<p>Diagnosis. Fruits single, obovate, 1.1–1.7 times longer than wide. Fruit apex rounded, without a stylar protrusion. Base cuneate and rounded, margins entire. Locular area oblanceolate. Five thick longitudinal wings, veins obscure. Wing and fruit body dotted with circular glands. Fruit borne on pedicel with prominent perianth scar at the junction of pedicel and fruit base. Narrow disk scar located immediately below the perianth scar.</p>
<p>Description. Fruits single, obovate, 9–15 mm long and 6–11 mm wide with a length/width ratio of 1.1–1.7, avg. 1.4. Fruit apex rounded, without a stylar protrusion. Base cuneate and rounded, margins entire. Locular area oblanceolate. Five thick longitudinal wings, veins obscure. Wing and fruit body dotted with circular glands 80–110 μm, avg. 100 μm diameter. Fruit borne on pedicel 10–13 mm long and 0.6–1.3 mm thick, with prominent perianth scar at the junction of pedicel and fruit base. Narrow disk scar located immediately below the perianth scar.</p>
Published as part of <i>Manchester, Steven R., Disney, Kory A. & Pham, Kasey K., 2020, Winged Fruits Of Rutaceous Affinity From The Eocene Of Western North America, pp. 211-216 in Fossil Imprint 76 (2)</i> on page 213, DOI: 10.37520/fi.2020.018, <a href="http://zenodo.org/record/5386172">http://zenodo.org/record/5386172</a>
Phylogenomic inferences from reference-mapped and de novo assembled short-read sequence data using RADseq sequencing of California white oaks (Quercus subgenus Quercus)
The emergence of next generation sequencing has increased by several orders of magnitude the amount of data available for phylogenetics. Reduced representation approaches, such as restriction-site associated DNA sequencing (RADseq), have proven useful for phylogenetic studies of non-model species at a wide range of phylogenetic depths. However, analysis of these datasets is not uniform and we know little about the potential benefits and drawbacks of de novo assembly versus assembly by mapping to a reference genome. Using RADseq data for 83 oak samples representing 16 Quercus taxa, we identified variants via three pipelines: mapping sequence reads to a recently published draft genome of Quercus lobata, and de novo assembly under two sets of locus filters. For each pipeline, we inferred the maximum likelihood phylogeny. All pipelines produced similar trees, with minor shifts in relationships within well-supported clades, despite the fact that they yielded different numbers of loci (68K–111K loci) and different degrees of overlap with the reference genome. We conclude that both the reference-aligned and de novo assembly pipelines yield reliable results, and that advantages and disadvantages of these approaches pertain mainly to downstream uses of RADseq data, not to phylogenetic inference per se.
Specimens at the Center: An Informatics Workflow and Toolkit for Specimen-level analysis of Public DNA database data
Pham, Kasey K. [et al.]
Major public DNA databases, namely NCBI GenBank, the DNA DataBank of Japan (DDBJ), and the European Molecular Biology
Laboratory (EMBL), are invaluable biodiversity libraries. Systematists and other biodiversity scientists commonly mine these databases for
sequence data to use in phylogenetic studies, but such studies generally use only the taxonomic identity of the sequenced tissue, not the
specimen identity. Thus studies that use DNA supermatrices to construct phylogenetic trees with species at the tips typically do not take
advantage of the fact that for many individuals in the public DNA databases, several DNA regions have been sampled; and for many species,
two or more individuals have been sampled. Thus these studies typically do not make full use of the multigene datasets in public DNA
databases to test species coherence and select optimal sequences to represent a species. In this study, we introduce a set of tools developed
in the R programming language to construct individual-based trees from NCBI GenBank data and present a set of trees for the genus Carex
(Cyperaceae) constructed using these methods. For the more than 770 species for which we found sequence data, our approach recovered an
average of 1.85 gene regions per specimen, up to seven for some specimens, and more than 450 species represented by two or more specimens.
Depending on the subset of genes analyzed, we found up to 42% of species monophyletic. We introduce a simple tree statistic—the
Taxonomic Disparity Index (TDI)—to assist in curating specimen-level datasets and provide code for selecting maximally informative (or,
conversely, minimally misleading) sequences as species exemplars. While tailored to the Carex dataset, the approach and code presented in
this paper can readily be generalized to constructing individual-level trees from large amounts of data for any species group.
Funding for this work was provided by the National Science Foundation (Award #1255901 to ALH and MJW and Award #1256033 to EHR), including an REU supplement that supported KKP's work.
Peer reviewed