41 research outputs found

    Automating biomedical data science through tree-based pipeline optimization

    Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators, such as synthetic feature constructors, that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design. Comment: 16 pages, 5 figures, to appear in EvoBIO 2016 proceedings.

    Predicting the Difficulty of Pure, Strict, Epistatic Models: Metrics for Simulated Model Selection

    Background: Algorithms designed to detect complex genetic disease associations are initially evaluated using simulated datasets. Typical evaluations vary constraints that influence the correct detection of underlying models (i.e., number of loci, heritability, and minor allele frequency). Such studies neglect to account for model architecture (i.e., the unique specification and arrangement of penetrance values comprising the genetic model), which alone can influence the detectability of a model. To design a simulation study that efficiently takes architecture into account, a reliable metric is needed for model selection. Results: We evaluate three metrics as predictors of relative model detection difficulty derived from previous works: (1) penetrance table variance (PTV), (2) customized odds ratio (COR), and (3) our own Ease of Detection Measure (EDM), calculated from the penetrance values and respective genotype frequencies of each simulated genetic model. We evaluate the reliability of these metrics across three very different data search algorithms, each with the capacity to detect epistatic interactions. We find that a model’s EDM and COR are each stronger predictors of model detection success than heritability. Conclusions: This study formally identifies and evaluates metrics that quantify model detection difficulty. We use these metrics to intelligently select models from a population of potential architectures. This allows for an improved simulation study design that accounts for differences in detection difficulty attributed to model architecture. We implement the calculation and use of EDM and COR in GAMETES, an algorithm that rapidly and precisely generates pure, strict, n-locus epistatic models.

    A Classification and Characterization of Two-Locus, Pure, Strict, Epistatic Models for Simulation and Detection

    Background: The statistical genetics phenomenon of epistasis is widely acknowledged to confound disease etiology. In order to evaluate strategies for detecting these complex multi-locus disease associations, simulation studies are required. The development of the GAMETES software for the generation of complex genetic models has provided the means to randomly generate an architecturally diverse population of epistatic models that are both pure and strict, i.e., all n loci, but no fewer, are predictive of phenotype. Previous theoretical work characterizing complex genetic models has yet to examine pure, strict epistasis, which should be the most challenging to detect. This study addresses three goals: (1) classify and characterize pure, strict, two-locus epistatic models, (2) investigate the effect of model ‘architecture’ on detection difficulty, and (3) explore how adjusting GAMETES constraints influences diversity in the generated models.

    The Effect of the Achilles Tendon on Trabecular Structure in the Primate Calcaneus

    Humans possess the longest Achilles tendon relative to total muscle length of any primate, an anatomy that is beneficial for bipedal locomotion. Reconstructing the evolutionary history of the Achilles tendon has been challenging, in part because soft tissue does not fossilize. The only skeletal evidence for Achilles tendon anatomy in extinct taxa is the insertion site on the calcaneal tuber, which is rarely preserved in the fossil record and, when present, is equivocal for reconstructing tendon morphology. In this study, we used high‐resolution three‐dimensional microcomputed tomography (micro‐CT) to quantify the microstructure of the trabecular bone underlying the Achilles tendon insertion site in baboons, gibbons, chimpanzees, and humans to test the hypothesis that trabecular orientation differs among primates with different tendon morphologies. Surprisingly, despite their very different Achilles tendon lengths, we were unable to find differences between the trabecular properties of chimpanzee and human calcanei in this specific region. There were regional differences within the calcaneus in the degree of anisotropy (DA) in both chimpanzees and humans, though the patterns were similar between the two species (higher DA inferiorly in the calcaneal tuber). Our results suggest that while trabecular bone within the calcaneus varies, it does not respond to the variation of Achilles tendon morphology across taxa in the way we hypothesized. These results imply that internal bone architecture may not be informative for reconstructing Achilles tendon anatomy in early hominins. Anat Rec, 296:1509–1517, 2013. © 2013 Wiley Periodicals, Inc. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/100175/1/ar22739.pd

    Open Problems in Extracellular RNA Data Analysis: Insights From an ERCC Online Workshop.

    We now know RNA can survive the harsh environment of biofluids when encapsulated in vesicles or by associating with lipoproteins or RNA binding proteins. These extracellular RNAs (exRNAs) play a role in intercellular signaling, serve as biomarkers of disease, and form the basis of new strategies for disease treatment. The Extracellular RNA Communication Consortium (ERCC) hosted a two-day online workshop (April 19-20, 2021) on the unique challenges of exRNA data analysis. The goal was to foster an open dialog about best practices and discuss open problems in the field, focusing initially on small exRNA sequencing data. Video recordings of workshop presentations and discussions are available (https://exRNA.org/exRNAdata2021-videos/). There were three target audiences: experimentalists who generate exRNA sequencing data, computational and data scientists who work with those groups to analyze their data, and experimental and data scientists new to the field. Here we summarize issues explored during the workshop, including progress on an effort to develop an exRNA data analysis challenge to engage the community in solving some of these open problems.

    The human gut symbiont Ruminococcus gnavus shows specificity to blood group A antigen during mucin glycan foraging: Implication for niche colonisation in the gastrointestinal tract

    The human gut symbiont Ruminococcus gnavus displays strain-specific repertoires of glycoside hydrolases (GHs) contributing to its spatial location in the gut. Sequence similarity network analysis identified strain-specific differences in blood-group endo-β-1,4-galactosidase belonging to the GH98 family. We determined the substrate and linkage specificities of GH98 from R. gnavus ATCC 29149, RgGH98, against a range of defined oligosaccharides and glycoconjugates including mucin. We showed by HPAEC-PAD and LC-FD-MS/MS that RgGH98 is specific for blood group A tetrasaccharide type II (BgA II). Isothermal titration calorimetry (ITC) and saturation transfer difference (STD) NMR confirmed RgGH98 affinity for blood group A over blood group B and H antigens. The molecular basis of the strict specificity of RgGH98 was further investigated using a combination of glycan microarrays, site-directed mutagenesis, and X-ray crystallography. The crystal structures of RgGH98 in complex with BgA trisaccharide (BgAtri) and of RgGH98 E411A with BgA II revealed a dedicated hydrogen network of residues, which were shown by site-directed mutagenesis to be critical to the recognition of the BgA epitope. We demonstrated experimentally that RgGH98 is part of an operon of 10 genes that is overexpressed in vitro when R. gnavus ATCC 29149 is grown on mucin as sole carbon source, as shown by RNAseq analysis, and RT-qPCR confirmed RgGH98 expression on BgA II growth. Using MALDI-ToF MS, we showed that RgGH98 releases BgAtri from mucin and that pretreatment of mucin with RgGH98 conferred on R. gnavus E1 the ability to grow, by enabling the E1 strain to metabolise BgAtri and access the underlying mucin glycan chain. These data further support that the GH repertoire of R. gnavus strains enables them to colonise different nutritional niches in the human gut and has potential applications in diagnostics and therapeutics against infection.

    Rule-based machine learning classification and knowledge discovery for complex problems
