83 research outputs found

    A search-based geographic metadata curation pipeline to refine sequencing institution information and support public health

    Background: The National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) has amassed a vast reservoir of genetic data since its inception in 2007. These public data hold immense potential for supporting pathogen surveillance and control. However, the lack of standardized metadata and inconsistent submission practices in SRA may impede the data’s utility in public health. Methods: To address this issue, we introduce the Search-based Geographic Metadata Curation (SGMC) pipeline. SGMC utilized Python and web scraping to extract geographic data of sequencing institutions from NCBI SRA in the Cloud and its website. It then harnessed ChatGPT to refine the sequencing institution and location assignments. To illustrate the pipeline’s utility, we examined the geographic distribution of the sequencing institutions and their countries relevant to polio eradication and categorized them. Results: SGMC successfully identified 7,649 sequencing institutions and their global locations from a random selection of 2,321,044 SRA accessions. These institutions were distributed across 97 countries, with strong representation in the United States, the United Kingdom and China. However, there was a lack of data from African, Central Asian, and Central American countries, indicating potential disparities in sequencing capabilities. Comparison with manually curated data for U.S. institutions reveals SGMC’s accuracy rates of 94.8% for institutions, 93.1% for countries, and 74.5% for geographic coordinates. Conclusion: SGMC may represent a novel approach using a generative AI model to enhance geographic data (country and institution assignments) for large numbers of samples within SRA datasets. This information can be utilized to bolster public health endeavors

    Phylogeny of imported and reestablished wild polioviruses in the Democratic Republic of the Congo from 2006 to 2011

    BACKGROUND: The last case of polio associated with wild poliovirus (WPV) indigenous to the Democratic Republic of the Congo (DRC) was reported in 2001, marking a major milestone toward polio eradication in Africa. However, during 2006-2011, outbreaks associated with WPV type 1 (WPV1) were widespread in the DRC, with >250 reported cases. METHODS: WPV1 isolates obtained from patients with acute flaccid paralysis (AFP) were compared by nucleotide sequencing of the VP1 capsid region (906 nucleotides). VP1 sequence relationships among isolates from the DRC and other countries were visualized in phylogenetic trees, and isolates representing distinct lineage groups were mapped. RESULTS: Phylogenetic analysis indicated that WPV1 was imported twice in 2004-2005 and once in approximately 2006 from Uttar Pradesh, India (a major reservoir of endemicity for WPV1 and WPV3 until 2010-2011), into Angola. WPV1 from the first importation spread to the DRC in 2006, sparking a series of outbreaks that continued into 2011. WPV1 from the second importation was widely disseminated in the DRC and spread to the Congo in 2010-2011. VP1 sequence relationships revealed frequent transmission of WPV1 across the borders of Angola, the DRC, and the Congo. Long branches on the phylogenetic tree signaled prolonged gaps in AFP surveillance and a likely underreporting of polio cases. CONCLUSIONS: The reestablishment of widespread and protracted WPV1 transmission in the DRC and Angola following long-range importations highlights the continuing risks of WPV spread until global eradication is achieved, and it further underscores the need for all countries to maintain high levels of poliovirus vaccine coverage and sensitive surveillance to protect their polio-free status. Centers for Disease Control and Prevention. http://jid.oxfordjournals.org (2015-11-30)

    An Analysis of the Abstracts Presented at the Annual Meetings of the Society for Neuroscience from 2001 to 2006

    Annual meeting abstracts published by scientific societies often contain rich arrays of information that can be computationally mined and distilled to elucidate the state and dynamics of the subject field. We extracted and processed abstract data from the Society for Neuroscience (SFN) annual meeting abstracts during the period 2001–2006 in order to gain an objective view of contemporary neuroscience. An important first step in the process was the application of data cleaning and disambiguation methods to construct a unified database, since the data were too noisy to be of full utility in the raw form initially available. Using natural language processing, text mining, and other data analysis techniques, we then examined the demographics and structure of the scientific collaboration network, the dynamics of the field over time, major research trends, and the structure of the sources of research funding. Some interesting findings include a high geographical concentration of neuroscience research in the north eastern United States, a surprisingly large transient population (66% of the authors appear in only one out of the six studied years), the central role played by the study of neurodegenerative disorders in the neuroscience community, and an apparent growth of behavioral/systems neuroscience with a corresponding shrinkage of cellular/molecular neuroscience over the six year period. The results from this work will prove useful for scientists, policy makers, and funding agencies seeking to gain a complete and unbiased picture of the community structure and body of knowledge encapsulated by a specific scientific domain

    Nuclear Translocation of β-Catenin during Mesenchymal Stem Cells Differentiation into Hepatocytes Is Associated with a Tumoral Phenotype

    The Wnt/β-catenin pathway controls biochemical processes related to cell differentiation. In committed cells, alteration of this pathway has been associated with tumors such as hepatocellular carcinoma or hepatoblastoma. The present study evaluated the role of Wnt/β-catenin activation during the differentiation of human mesenchymal stem cells into hepatocytes. Differentiation into hepatocytes was achieved by the addition of two different conditioned media. With one of them, β-catenin nuclear translocation and up-regulation of genes related to the Wnt/β-catenin pathway, such as Lrp5 and Fzd3, as well as the oncogenes c-myc and p53, were observed. With the other protocol, by contrast, the Wnt/β-catenin pathway was inactivated. Hepatocytes with nuclear translocation of β-catenin also had abnormal cellular proliferation, and expressed membrane proteins involved in hepatocellular carcinoma, metastatic behavior and cancer stem cells. Further, these cells also had increased self-renewal capability, as shown in a spheroid formation assay. Comparison of both differentiation protocols by 2D-DIGE proteomic analysis revealed differential expression of 11 proteins with altered expression in hepatocellular carcinoma. Cathepsin B and D, adenine phosphoribosyltransferase, triosephosphate isomerase, inorganic pyrophosphatase, peptidyl-prolyl cis-trans isomerase A and lactate dehydrogenase β-chain were up-regulated only with the protocol associated with Wnt signaling activation, while other proteins involved in tumor suppression, such as transgelin or tropomyosin β-chain, were down-regulated with this protocol. In conclusion, our results suggest that activation of the Wnt/β-catenin pathway during the differentiation of human mesenchymal stem cells into hepatocytes is associated with a tumoral phenotype

    The Brain Atlas Concordance Problem: Quantitative Comparison of Anatomical Parcellations

    Many neuroscientific reports reference discrete macro-anatomical regions of the brain which were delineated according to a brain atlas or parcellation protocol. Currently, however, no widely accepted standards exist for partitioning the cortex and subcortical structures, or for assigning labels to the resulting regions, and many procedures are being actively used. Previous attempts to reconcile neuroanatomical nomenclatures have been largely qualitative, focusing on the development of thesauri or simple semantic mappings between terms. Here we take a fundamentally different approach, discounting the names of regions and instead comparing their definitions as spatial entities in an effort to provide more precise quantitative mappings between anatomical entities as defined by different atlases. We develop an analytical framework for studying this brain atlas concordance problem, and apply these methods in a comparison of eight diverse labeling methods used by the neuroimaging community. These analyses result in conditional probabilities that enable mapping between regions across atlases, which also form the input to graph-based methods for extracting higher-order relationships between sets of regions and to procedures for assessing the global similarity between different parcellations of the same brain. At a global scale, the overall results demonstrate a considerable lack of concordance between available parcellation schemes, falling within chance levels for some atlas pairs. At a finer level, this study reveals spatial relationships between sets of defined regions that are not obviously apparent; these are of high potential interest to researchers faced with the challenge of comparing results that were based on these different anatomical models, particularly when coordinate-based data are not available. 
The complexity of the spatial overlap patterns revealed points to problems for attempts to reconcile anatomical parcellations and nomenclatures using strictly qualitative and/or categorical methods. Detailed results from this study are made available via an interactive web site at http://obart.info

    Robust estimation of bacterial cell count from optical density

    Optical density (OD) is widely used to estimate the density of cells in liquid culture, but cannot be compared between instruments without a standardized calibration protocol and is challenging to relate to actual cell count. We address this with an interlaboratory study comparing three simple, low-cost, and highly accessible OD calibration protocols across 244 laboratories, applied to eight strains of constitutive GFP-expressing E. coli. Based on our results, we recommend calibrating OD to estimated cell count using serial dilution of silica microspheres, which produces highly precise calibration (95.5% of residuals <1.2-fold), is easily assessed for quality control, also assesses instrument effective linear range, and can be combined with fluorescence calibration to obtain units of Molecules of Equivalent Fluorescein (MEFL) per cell, allowing direct comparison and data fusion with flow cytometry measurements: in our study, fluorescence per cell measurements showed only a 1.07-fold mean difference between plate reader and flow cytometry data

    ALICE: Physics Performance Report, Volume I

    ALICE is a general-purpose heavy-ion experiment designed to study the physics of strongly interacting matter and the quark-gluon plasma in nucleus-nucleus collisions at the LHC. It currently includes more than 900 physicists and senior engineers, from both nuclear and high-energy physics, from about 80 institutions in 28 countries. The experiment was approved in February 1997. The detailed design of the different detector systems has been laid down in a number of Technical Design Reports issued between mid-1998 and the end of 2001 and construction has started for most detectors. Since the last comprehensive information on detector and physics performance was published in the ALICE Technical Proposal in 1996, the detector as well as simulation, reconstruction and analysis software have undergone significant development. The Physics Performance Report (PPR) will give an updated and comprehensive summary of the current status and performance of the various ALICE subsystems, including updates to the Technical Design Reports, where appropriate, as well as a description of systems which have not been published in a Technical Design Report. The PPR will be published in two volumes. The current Volume I contains: 1. a short theoretical overview and an extensive reference list concerning the physics topics of interest to ALICE, 2. relevant experimental conditions at the LHC, 3. a short summary and update of the subsystem designs, and 4. a description of the offline framework and Monte Carlo generators. Volume II, which will be published separately, will contain detailed simulations of combined detector performance, event reconstruction, and analysis of a representative sample of relevant physics observables from global event characteristics to hard processes

    The ALICE experiment at the CERN LHC

    ALICE (A Large Ion Collider Experiment) is a general-purpose, heavy-ion detector at the CERN LHC which focuses on QCD, the strong-interaction sector of the Standard Model. It is designed to address the physics of strongly interacting matter and the quark-gluon plasma at extreme values of energy density and temperature in nucleus-nucleus collisions. Besides running with Pb ions, the physics programme includes collisions with lighter ions, lower energy running and dedicated proton-nucleus runs. ALICE will also take data with proton beams at the top LHC energy to collect reference data for the heavy-ion programme and to address several QCD topics for which ALICE is complementary to the other LHC detectors. The ALICE detector has been built by a collaboration including currently over 1000 physicists and engineers from 105 Institutes in 30 countries. Its overall dimensions are 16 × 16 × 26 m³ with a total weight of approximately 10 000 t. The experiment consists of 18 different detector systems each with its own specific technology choice and design constraints, driven both by the physics requirements and the experimental conditions expected at LHC. The most stringent design constraint is to cope with the extreme particle multiplicity anticipated in central Pb-Pb collisions. The different subsystems were optimized to provide high-momentum resolution as well as excellent Particle Identification (PID) over a broad range in momentum, up to the highest multiplicities predicted for LHC. This will allow for comprehensive studies of hadrons, electrons, muons, and photons produced in the collision of heavy nuclei. Most detector systems are scheduled to be installed and ready for data taking by mid-2008 when the LHC is scheduled to start operation, with the exception of parts of the Photon Spectrometer (PHOS), Transition Radiation Detector (TRD) and Electro Magnetic Calorimeter (EMCal). These detectors will be completed for the high-luminosity ion run expected in 2010. 
This paper describes in detail the detector components as installed for the first data taking in the summer of 2008

    PoSE: visualization of patterns of sequence evolution using PAML and MATLAB

    Background: Determining patterns of nucleotide and amino acid substitution is the first step during sequence evolution analysis. However, it is not easy to visualize the different phylogenetic signatures imprinted in aligned nucleotide and amino acid sequences. Results: Here we present PoSE (Pattern of Sequence Evolution), a reliable resource for unveiling the evolutionary history of sequence alignments and for graphically displaying their contents. Substitutions are displayed by category (transitions and transversions), codon position, and phenotypic effect (synonymous and nonsynonymous). Visualization is accomplished using MATLAB scripts wrapped around PAML (Phylogenetic Analysis by Maximum Likelihood), implemented in an easy-to-use graphical user interface. The application displays inferred substitutions estimated by baseml or codeml, two programs included in the PAML software package. PoSE organizes patterns of substitution in eleven plots, including estimated non-synonymous/synonymous ratios (dN/dS) along the sequence alignment. In addition, PoSE provides visualization and annotation of patterns of amino acid substitutions along groups of related sequences that can be graphically inspected in a phylogenetic tree window. Conclusions: PoSE is a useful tool to help determine major patterns during sequence evolution of protein-coding sequences, hypervariable regions, or changes in dN/dS ratios. PoSE is publicly available at https://github.com/CDCgov/PoS