Predicting with sparse data
It is well known that effective prediction of project cost-related factors is an important aspect of software engineering. Unfortunately, despite extensive research over more than 30 years, this remains a significant problem for many practitioners. A major obstacle is the absence of reliable and systematic historic data, yet this is a sine qua non for almost all proposed methods: statistical, machine learning or calibration of existing models. In this paper we describe our sparse data method (SDM) based upon a pairwise comparison technique and Saaty's Analytic Hierarchy Process (AHP). Our minimum data requirement is a single known point. The technique is supported by a software tool known as DataSalvage. We show, for data from two companies, how our approach, itself based upon expert judgement, adds value to unaided expert judgement by producing significantly more accurate and less biased results. A sensitivity analysis shows that our approach is robust to pairwise comparison errors. We then describe the results of a small usability trial with a practising project manager. From this empirical work we conclude that the technique is promising and may help overcome some of the present barriers to effective project prediction.
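For readers unfamiliar with the underlying technique, the sketch below shows how a Saaty-style pairwise comparison matrix can be turned into relative weights and, given a single known anchor value, into absolute estimates. It is a generic AHP illustration with made-up judgements, not the authors' SDM or DataSalvage tool.

```python
# Generic AHP sketch (not the authors' SDM/DataSalvage implementation).
# Hypothetical expert judgements: how much "bigger" task i is than task j.
import numpy as np

def ahp_weights(pairwise):
    """Derive relative weights from a reciprocal pairwise-comparison matrix
    via its principal eigenvector, as in Saaty's AHP."""
    eigvals, eigvecs = np.linalg.eig(pairwise)
    principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return principal / principal.sum()

# Expert says: A is 3x B and 5x C; B is 2x C (reciprocals fill the rest).
M = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
w = ahp_weights(M)

# With a single known point (say task A actually cost 30 person-days),
# the relative weights become absolute estimates for the other tasks.
known_cost_A = 30.0
estimates = w * (known_cost_A / w[0])
print(dict(zip("ABC", estimates.round(1))))
```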
PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data.
Microbial diversity is typically characterized by clustering small subunit ribosomal RNA (SSU-rRNA) sequences into operational taxonomic units (OTUs). Targeted sequencing of environmental SSU-rRNA markers via PCR may fail to detect OTUs due to biases in priming and amplification. Analysis of shotgun sequenced environmental DNA, known as metagenomics, avoids amplification bias but generates fragmentary, non-overlapping sequence reads that cannot be clustered by existing OTU-finding methods. To circumvent these limitations, we developed PhylOTU, a computational workflow that identifies OTUs from metagenomic SSU-rRNA sequence data through the use of phylogenetic principles and probabilistic sequence profiles. Using simulated metagenomic data, we quantified the accuracy with which PhylOTU clusters reads into OTUs. Comparisons of PCR and shotgun sequenced SSU-rRNA markers derived from the global open ocean revealed that while PCR libraries identify more OTUs per sequenced residue, metagenomic libraries recover a greater taxonomic diversity of OTUs. In addition, we discover novel species, genera and families in the metagenomic libraries, including OTUs from phyla missed by analysis of PCR sequences. Taken together, these results suggest that PhylOTU enables characterization of part of the biosphere currently hidden from PCR-based surveys of diversity.
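To make the distance-then-cluster idea concrete, the toy sketch below groups reads into OTUs by cutting an average-linkage tree, built from a hypothetical patristic (tree) distance matrix, at a fixed distance threshold. It is only a schematic of phylogenetic-distance OTU clustering, not the PhylOTU workflow itself.

```python
# Toy sketch of tree-distance-based OTU clustering (not the PhylOTU pipeline).
# Assumes a precomputed, hypothetical patristic distance matrix between reads.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical pairwise tree distances between five reads.
D = np.array([[0.00, 0.01, 0.02, 0.20, 0.22],
              [0.01, 0.00, 0.02, 0.21, 0.23],
              [0.02, 0.02, 0.00, 0.19, 0.20],
              [0.20, 0.21, 0.19, 0.00, 0.02],
              [0.22, 0.23, 0.20, 0.02, 0.00]])

# Average-linkage clustering, cut at a species-like distance threshold.
Z = linkage(squareform(D), method="average")
otus = fcluster(Z, t=0.03, criterion="distance")
print(otus)  # reads sharing a label fall within the distance cutoff
```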
Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies
Existing sequence alignment algorithms use heuristic scoring schemes which cannot be used as objective distance metrics. Therefore one relies on measures like the p- or log-det distances, or makes explicit, and often simplistic, assumptions about sequence evolution. Information theory provides an alternative in the form of mutual information (MI), which is, in principle, an objective and model-independent similarity measure. MI can be estimated by concatenating and zipping sequences, yielding the "normalized compression distance". So far this has produced promising results, but with uncontrolled errors. We describe a simple approach to obtain robust estimates of MI from global pairwise alignments. Using standard alignment algorithms, this gives, for animal mitochondrial DNA, estimates that are strikingly close to those obtained from the alignment-free methods mentioned above. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. Because it is not additive, the normalized compression distance is not an optimal metric for phylogenetics, but we propose a simple modification that overcomes this issue. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances. Even a simplified version based on single-letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way: for example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments.

Comment: 19 pages + 16 pages of supplementary material
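The normalized compression distance referred to above has a simple standard form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed length. The sketch below computes it with a general-purpose compressor on toy sequences; it illustrates the compression-based estimate the paper improves upon, not the authors' alignment-based MI estimator.

```python
# Normalized compression distance with a general-purpose compressor.
# Toy sequences only; not the paper's alignment-based MI estimates.
import lzma

def clen(data: bytes) -> int:
    """Compressed length as a rough proxy for Kolmogorov complexity."""
    return len(lzma.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical DNA fragments (toy data, not real mitochondrial sequences).
seq_a = b"ATGCGTACGTTAGC" * 50
seq_b = b"ATGCGTACGTTGGC" * 50   # constructed to be close to seq_a
seq_c = b"TTAACCGGATATCG" * 50   # constructed to be more distant
print(ncd(seq_a, seq_b), ncd(seq_a, seq_c))
```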
Second-generation PLINK: rising to the challenge of larger and richer datasets
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format.

To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information.

The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Comment: 2 figures, 1 additional file
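As a rough illustration of what bit-level parallelism buys for genotype data, the toy sketch below packs genotypes two bits apiece and counts alternate alleles with whole-array operations instead of per-sample loops. The encoding is invented for the example; it is not PLINK's actual .bed layout or code.

```python
# Toy illustration of packed 2-bit genotypes counted without per-sample loops.
# The encoding is made up for this example; it is not PLINK's .bed format.
import numpy as np

# Hypothetical encoding: 0 = hom ref, 1 = het, 2 = hom alt.
genotypes = np.random.randint(0, 3, size=1_000_000).astype(np.uint8)

# Pack four genotypes per byte.
packed = (genotypes[0::4]
          | (genotypes[1::4] << 2)
          | (genotypes[2::4] << 4)
          | (genotypes[3::4] << 6))

# Count alt alleles by masking each 2-bit field and summing whole arrays.
alt_count = 0
for shift in (0, 2, 4, 6):
    field = (packed >> shift) & 0b11      # 0, 1, or 2 alt alleles per field
    alt_count += int(field.sum())
print(alt_count, int(genotypes.sum()))    # same answer either way
```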
Disentangling causal webs in the brain using functional Magnetic Resonance Imaging: A review of current approaches
In the past two decades, functional Magnetic Resonance Imaging has been used to relate neuronal network activity to cognitive processing and behaviour. Recently this approach has been augmented by algorithms that allow us to infer causal links between component populations of neuronal networks. Multiple inference procedures have been proposed to approach this research question, but so far each method has limitations when it comes to establishing whole-brain connectivity patterns. In this work, we discuss eight ways to infer causality in fMRI research: Bayesian Nets, Dynamical Causal Modelling, Granger Causality, Likelihood Ratios, LiNGAM, Patel's Tau, Structural Equation Modelling, and Transfer Entropy. We conclude by formulating recommendations for future directions in this area.
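To make one of the surveyed approaches concrete, the sketch below runs a minimal bivariate Granger-causality test on synthetic time series: it compares the residual variance of an autoregressive model of y with and without a lag of x. This is a generic textbook construction on made-up data, not any of the toolboxes reviewed in the paper.

```python
# Minimal bivariate Granger-causality check on synthetic "ROI" time series.
import numpy as np

rng = np.random.default_rng(0)
T = 500
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):                      # y is driven by past x
    y[t] = 0.6 * y[t - 1] + 0.5 * x[t - 1] + 0.1 * rng.standard_normal()

def rss(target, predictors):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(predictors, target, rcond=None)
    resid = target - predictors @ beta
    return float(resid @ resid)

# Restricted model: y_t from its own past; full model: add the past of x.
Y = y[1:]
own = np.column_stack([np.ones(T - 1), y[:-1]])
full = np.column_stack([own, x[:-1]])
rss_r, rss_f = rss(Y, own), rss(Y, full)

# F-statistic for the single added lag of x (full model has 3 parameters);
# large values suggest x "Granger-causes" y.
F = (rss_r - rss_f) / (rss_f / (T - 1 - 3))
print(F)
```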
Two-Locus Likelihoods under Variable Population Size and Fine-Scale Recombination Rate Estimation
Two-locus sampling probabilities have played a central role in devising an efficient composite likelihood method for estimating fine-scale recombination rates. Due to mathematical and computational challenges, these sampling probabilities are typically computed under the unrealistic assumption of a constant population size, and simulation studies have shown that the resulting recombination rate estimates can be severely biased in certain cases of historical population size changes. To alleviate this problem, we develop here new methods to compute the sampling probability for variable population size functions that are piecewise constant. Our main theoretical result, implemented in a new software package called LDpop, is a novel formula for the sampling probability that can be evaluated by numerically exponentiating a large but sparse matrix. This formula can handle moderate sample sizes and demographic size histories with a large number of epochs. In addition, LDpop implements an approximate formula for the sampling probability that is reasonably accurate and scales to hundreds of samples. Finally, LDpop includes an importance sampler for the posterior distribution of two-locus genealogies, based on a new result for the optimal proposal distribution in the variable-size setting. Using our methods, we study how a sharp population bottleneck followed by rapid growth affects the correlation between partially linked sites. Then, through an extensive simulation study, we show that accounting for population size changes under such a demographic model leads to substantial improvements in fine-scale recombination rate estimation. LDpop is freely available for download at https://github.com/popgenmethods/ldpop

Comment: 32 pages, 13 figures
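The core numerical step described above, exponentiating a large but sparse matrix, can be illustrated generically: the sketch below propagates a probability distribution through a toy sparse birth-death rate matrix using SciPy's action-of-the-matrix-exponential routine, without ever forming the dense exponential. The rate matrix is invented for the example and is not LDpop's two-locus generator.

```python
# Generic sketch: apply exp(Q t) to a distribution for a sparse rate matrix Q.
# Toy birth-death chain only; not LDpop's actual two-locus formula.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import expm_multiply

n = 200                                    # number of states
birth = np.full(n - 1, 1.0)                # rate i -> i+1
death = np.full(n - 1, 0.5)                # rate i+1 -> i
Q = diags([death, -(np.r_[birth, 0.0] + np.r_[0.0, death]), birth],
          offsets=[-1, 0, 1], format="csr")

p0 = np.zeros(n)
p0[0] = 1.0                                # start in state 0

# exp(Q^T t) applied to p0 gives the state distribution at time t = 2.0,
# computed as a matrix-vector action rather than a dense exponential.
pt = expm_multiply(Q.T.tocsr() * 2.0, p0)
print(pt[:5], pt.sum())                    # probabilities still sum to ~1
```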