24 research outputs found

    Sequence-Based Classification Using Discriminatory Motif Feature Selection

    Get PDF
    Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters.
A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/
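The three-way partitioning and motif-based feature construction described above can be sketched in a few lines of Python. This is an illustrative outline only, not code from the published pipeline: the function names, partition sizes, and the toy motif list are our assumptions, and a real run would obtain the motifs from a discriminatory motif finder applied to the discovery partition.

```python
import random

def three_way_split(sequences, seed=0):
    """Split data indices into discovery, training, and validation partitions.
    Equal thirds are an illustrative choice, not the authors' published default."""
    idx = list(range(len(sequences)))
    random.Random(seed).shuffle(idx)
    n = len(idx)
    return idx[: n // 3], idx[n // 3 : 2 * n // 3], idx[2 * n // 3 :]

def motif_features(sequence, motifs):
    """Represent a sequence by the count of each discovered motif.
    Works for unaligned sequences of unequal length."""
    return [sequence.count(m) for m in motifs]

seqs = ["ACGTACGT", "TTTTACGA", "ACGAACGA", "GGGTACGT", "ACGTTTTT", "CCCCACGA"]
disc, train, valid = three_way_split(seqs)

# Pretend these motifs came from a motif finder run on the discovery partition.
motifs = ["ACGT", "ACGA"]
X_train = [motif_features(seqs[i], motifs) for i in train]
```

Any classifier could then be fit on `X_train` and assessed on features built from the validation partition, reflecting the modularity the abstract emphasizes.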

    Cleaning Genotype Data from Diversity Outbred Mice.

    Get PDF
    Data cleaning is an important first step in most statistical analyses, including efforts to map the genetic loci that contribute to variation in quantitative traits. Here we illustrate approaches to quality control and cleaning of array-based genotyping data for multiparent populations (experimental crosses derived from more than two founder strains), using MegaMUGA array data from a set of 291 Diversity Outbred (DO) mice. Our approach employs data visualizations that can reveal problems at the level of individual mice or with individual SNP markers. We find that the proportion of missing genotypes for each mouse is an effective indicator of sample quality. We use microarray probe intensities for SNPs on the X and Y chromosomes to confirm the sex of each mouse, and we use the proportion of matching SNP genotypes between pairs of mice to detect sample duplicates. We use a hidden Markov model (HMM) reconstruction of the founder haplotype mosaic across each mouse genome to estimate the number of crossovers and to identify potential genotyping errors. To evaluate marker quality, we find that missing data and genotyping error rates are the most effective diagnostics. We also examine the SNP genotype frequencies with markers grouped according to their minor allele frequency in the founder strains. For markers with high apparent error rates, a scatterplot of the allele-specific probe intensities can reveal the underlying cause of incorrect genotype calls. The decision to include or exclude low-quality samples can have a significant impact on the mapping results for a given study. We find that the impact of low-quality markers on a given study is often minimal, but reporting problematic markers can improve the utility of the genotyping array across many studies.
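Two of the diagnostics named above, the per-mouse missing-genotype proportion and the proportion of matching calls between pairs of mice, are simple enough to sketch directly. The snippet below is a minimal illustration on a toy genotype matrix; the 0/1/2 coding and the `None`-for-missing convention are our assumptions, not the authors' data format.

```python
# Toy genotype matrix: rows = mice, columns = SNP markers.
# Genotypes coded 0/1/2; None marks a missing call (coding is an assumption).
G = [
    [0, 1, 2, 1, None],
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0],              # deliberate duplicate of the mouse above
    [None, None, 2, None, None],  # low-quality sample
]

def missing_rate(row):
    """Proportion of missing genotypes for one mouse: a simple and, per the
    abstract, effective indicator of sample quality."""
    return sum(g is None for g in row) / len(row)

def prop_matching(a, b):
    """Proportion of matching genotype calls among markers typed in both
    mice; values near 1 flag likely sample duplicates."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum(x == y for x, y in pairs) / len(pairs)
```

In this toy example the last mouse stands out with an 80% missing rate, and the two identical rows match perfectly, illustrating how these two statistics flag bad samples and duplicates, respectively.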

    R/qtl2: Software for Mapping Quantitative Trait Loci with High-Dimensional Data and Multiparent Populations.

    Get PDF
    R/qtl2 is an interactive software environment for mapping quantitative trait loci (QTL) in experimental populations. The R/qtl2 software expands the scope of the widely used R/qtl software package to include multiparent populations derived from more than two founder strains, such as the Collaborative Cross and Diversity Outbred mice, heterogeneous stocks, and MAGIC plant populations. R/qtl2 is designed to handle modern high-density genotyping data and high-dimensional molecular phenotypes, including gene expression and proteomics. R/qtl2 includes the ability to perform genome scans using a linear mixed model to account for population structure, and also includes features to impute SNPs based on founder strain genomes and to carry out association mapping. The R/qtl2 software provides all of the basic features needed for QTL mapping, including graphical displays and summary reports, and it can be extended through the creation of add-on packages. R/qtl2, which is free and open source software written in the R and C++ programming languages, comes with a test framework.

    Information Content in B → VV Decays and the Angular Moments Method

    Get PDF
    The time-dependent angular distributions of decays of neutral B mesons into two vector mesons contain information about the lifetimes, mass differences, strong and weak phases, form factors, and CP-violating quantities. A statistical analysis of the information content is performed by giving the "information" a quantitative meaning. It is shown that for some parameters of interest, the information content in time and angular measurements combined may be orders of magnitude more than the information from time measurements alone, and hence the angular measurements are highly recommended. The method of angular moments is compared with the (maximum) likelihood method and is found to work almost as well in the region of interest for the one-angle distribution. For the complete three-angle distribution, an estimate of the possible statistical errors expected on the observables of interest is obtained. It indicates that the three-angle distribution, unraveled by the method of angular moments, would be able to nail down many quantities of interest and will help in pointing unambiguously to new physics. (Comment: LaTeX, 34 pages with 9 figures)

    Quantitative Trait Locus Study Design From an Information Perspective

    No full text
    We examine the efficiency of different genotyping and phenotyping strategies in inbred line crosses from an information perspective. This provides a mathematical framework for the statistical aspects of QTL experimental design, while guiding our intuition. Our central result is a simple formula that quantifies the fraction of missing information of any genotyping strategy in a backcross. It includes the special case of selectively genotyping only the phenotypic extreme individuals. The formula is a function of the square of the phenotype and the uncertainty in our knowledge of the genotypes at a locus. This result is used to answer a variety of questions. First, we examine the cost-information trade-off by varying the density of markers and the proportion of extreme phenotypic individuals genotyped. Then we evaluate the information content of selective phenotyping designs and the impact of measurement error in phenotyping. A simple formula quantifies the information content of any combined phenotyping and genotyping design. We extend our results to cover multigenotype crosses, such as the F(2) intercross, and multiple QTL models. We find that when the QTL effect is small, any contrast in a multigenotype cross benefits from selective genotyping in the same manner as in a backcross. The benefit remains in the presence of a second unlinked QTL with small effect (explaining <20% of the variance), but diminishes if the second QTL has a large effect. Software for performing power calculations for backcross and F(2) intercross incorporating selective genotyping and marker spacing is available from http://www.biostat.ucsf.edu/sen
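The abstract's central point, that the information in a selective-genotyping design depends on the squared phenotype, can be illustrated numerically. The toy calculation below measures the share of the total squared phenotype carried by the extreme individuals chosen for genotyping; this is our illustrative reading of the qualitative result, not the paper's exact formula.

```python
import random

random.seed(1)
phenotypes = [random.gauss(0, 1) for _ in range(100_000)]

def info_fraction(y, prop_genotyped):
    """Share of the total squared phenotype carried by the extreme
    individuals selected for genotyping (k from each phenotypic tail).
    Illustrative only: a stand-in for the paper's missing-information
    formula, which also involves genotype uncertainty at the locus."""
    n = len(y)
    k = int(n * prop_genotyped / 2)   # genotype k from each tail
    s = sorted(y)
    genotyped = s[:k] + s[n - k:]
    return sum(v * v for v in genotyped) / sum(v * v for v in y)
```

In this toy calculation, genotyping only the extreme half of a normally distributed phenotype retains roughly 90% of the squared-phenotype total, which mirrors the cost-information trade-off discussed above: most of the information comes from the phenotypic extremes.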

    Poor Performance of Bootstrap Confidence Intervals for the Location of a Quantitative Trait Locus

    Get PDF
    The aim of many genetic studies is to locate the genomic regions (called quantitative trait loci, QTL) that contribute to variation in a quantitative trait (such as body weight). Confidence intervals for the locations of QTL are particularly important for the design of further experiments to identify the gene or genes responsible for the effect. Likelihood support intervals are the most widely used method to obtain confidence intervals for QTL location, but the nonparametric bootstrap has also been recommended. Through extensive computer simulation, we show that bootstrap confidence intervals behave poorly and so should not be used in this context. The profile likelihood (or LOD curve) for QTL location has a tendency to peak at genetic markers, and so the distribution of the maximum-likelihood estimate (MLE) of QTL location has the unusual feature of point masses at genetic markers; this contributes to the poor behavior of the bootstrap. Likelihood support intervals and approximate Bayes credible intervals, on the other hand, are shown to behave appropriately.

    Significance Thresholds for Quantitative Trait Locus Mapping Under Selective Genotyping

    No full text
    In the case of selective genotyping, the usual permutation test to establish statistical significance for quantitative trait locus (QTL) mapping can give inappropriate significance thresholds, especially when the phenotype distribution is skewed. A stratified permutation test should be used, with phenotypes shuffled separately within the genotyped and ungenotyped individuals.
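The stratified shuffle described above is straightforward to implement. The sketch below is a minimal illustration, not the authors' code; the function and argument names are ours.

```python
import random

def stratified_permutation(phenotypes, genotyped, seed=0):
    """Shuffle phenotypes separately within the genotyped and the
    ungenotyped individuals, so each permuted dataset preserves which
    positions carry genotype data (the stratification described above)."""
    rng = random.Random(seed)
    permuted = list(phenotypes)
    for stratum in (True, False):
        idx = [i for i, g in enumerate(genotyped) if g == stratum]
        vals = [phenotypes[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            permuted[i] = v
    return permuted
```

Each permuted dataset would then be rescanned for QTL and the genome-wide maximum test statistic recorded; the empirical quantiles of those maxima give the significance threshold, exactly as in the usual permutation test but with the shuffling confined to each stratum.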

    Genetic modifiers interact with Cpe

    No full text