Controlled exploration of chemical space by machine learning of coarse-grained representations
The size of chemical compound space is too large to be probed exhaustively.
This leads high-throughput protocols to drastically subsample and results in
sparse and non-uniform datasets. Rather than arbitrarily selecting compounds,
we systematically explore chemical space according to the target property of
interest. We first perform importance sampling by introducing a Markov chain
Monte Carlo scheme across compounds. We then train an ML model on the sampled
data to expand the region of chemical space probed. Our boosting procedure
enhances the number of compounds by a factor of 2 to 10, enabled by the ML model's
coarse-grained representation, which both simplifies the structure-property
relationship and reduces the size of chemical space. The ML model correctly
recovers linear relationships between transfer free energies. These linear
relationships correspond to features that are global to the dataset, marking
the region of chemical space up to which predictions are reliable---a more
robust alternative to the predictive variance. Bridging coarse-grained
simulations with ML gives rise to an unprecedented database of drug-membrane
insertion free energies for 1.3 million compounds. Comment: 9 pages, 5 figures
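The Markov chain Monte Carlo walk across compounds described above can be sketched as a Metropolis sampler over a discrete candidate set. The toy property function, the one-dimensional neighbour moves, and all parameter values below are illustrative assumptions, not the paper's actual coarse-grained setup:

```python
import math
import random

def property_of(compound):
    # Toy surrogate for a transfer free energy (hypothetical stand-in
    # for the coarse-grained target property); minimum at compound 7.
    return (compound - 7) ** 2 / 10.0

def mcmc_over_compounds(n_compounds=50, n_steps=5000, beta=1.0, seed=0):
    """Metropolis sampling across a discrete compound space, biasing
    visits toward compounds with a low target property."""
    rng = random.Random(seed)
    current = rng.randrange(n_compounds)
    visits = [0] * n_compounds
    for _ in range(n_steps):
        # Propose a neighbouring compound (simple symmetric move).
        proposal = (current + rng.choice([-1, 1])) % n_compounds
        delta = property_of(proposal) - property_of(current)
        # Metropolis acceptance: always downhill, uphill with Boltzmann weight.
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            current = proposal
        visits[current] += 1
    return visits

visits = mcmc_over_compounds()
# The chain concentrates its visits near the property optimum.
```

In the real workflow, `property_of` would be replaced by a coarse-grained simulation, and the sampled compounds would then serve as training data for the ML model that expands the explored region.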
Impact of variance components on reliability of absolute quantification using digital PCR
Background: Digital polymerase chain reaction (dPCR) is an increasingly popular technology for detecting and quantifying target nucleic acids. Its advertised strength is high precision absolute quantification without needing reference curves. The standard data analytic approach follows a seemingly straightforward theoretical framework but ignores sources of variation in the data generating process. These stem from both technical and biological factors, where we distinguish features that are 1) hard-wired in the equipment, 2) user-dependent and 3) provided by manufacturers but may be adapted by the user. The impact of the corresponding variance components on the accuracy and precision of target concentration estimators presented in the literature is studied through simulation.
Results: We reveal how system-specific technical factors influence accuracy as well as precision of concentration estimates. We find that a well-chosen sample dilution level and modifiable settings such as the fluorescence cut-off for target copy detection have a substantial impact on reliability and can be adapted to the sample analysed in ways that matter. User-dependent technical variation, including pipette inaccuracy and specific sources of sample heterogeneity, leads to a steep increase in uncertainty of estimated concentrations. Users can discover this through replicate experiments and derived variance estimation. Finally, the detection performance can be improved by optimizing the fluorescence intensity cut point as suboptimal thresholds reduce the accuracy of concentration estimates considerably.
Conclusions: Like any other technology, dPCR is subject to variation induced by natural perturbations, systematic settings as well as user-dependent protocols. Corresponding uncertainty may be controlled with an adapted experimental design. Our findings point to modifiable key sources of uncertainty that form an important starting point for the development of guidelines on dPCR design and data analysis with correct precision bounds. Besides clever choices of sample dilution levels, experiment-specific tuning of machine settings can greatly improve results. Well-chosen data-driven fluorescence intensity thresholds in particular result in major improvements in target presence detection. We call on manufacturers to provide sufficiently detailed output data that allows users to maximize the potential of the method in their setting and obtain high precision and accuracy for their experiments.
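The Poisson framework behind dPCR absolute quantification, plus one user-dependent variance component (pipetting error), can be sketched as below. The partition volume, copy numbers and error model are illustrative assumptions, not any specific instrument's values:

```python
import math
import random

def estimate_concentration(n_positive, n_total, partition_volume_ul):
    """Standard dPCR Poisson estimator: copies per microlitre.
    lambda = -ln(1 - p) is the mean copy number per partition."""
    p = n_positive / n_total
    return -math.log(1.0 - p) / partition_volume_ul

def simulate_dpcr(true_conc=50.0, v=0.00085, n=20000, pipette_cv=0.0, seed=1):
    """Simulate one dPCR run; pipette_cv adds a user-dependent volume
    error (an assumed model for one variance component)."""
    rng = random.Random(seed)
    # Pipetting error perturbs the effective concentration loaded.
    eff_conc = true_conc * (1.0 + rng.gauss(0.0, pipette_cv))
    # Each partition is positive if it receives at least one copy.
    p_pos = 1.0 - math.exp(-eff_conc * v)
    n_pos = sum(rng.random() < p_pos for _ in range(n))
    return estimate_concentration(n_pos, n, v)

est = simulate_dpcr()
```

Re-running `simulate_dpcr` with a nonzero `pipette_cv` over many seeds shows the steep increase in between-replicate spread that the abstract attributes to user-dependent technical variation.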
Penalized Clustering of Large Scale Functional Data with Multiple Covariates
In this article, we propose a penalized clustering method for large scale
data with multiple covariates through a functional data approach. In the
proposed method, responses and covariates are linked together through
nonparametric multivariate functions (fixed effects), which have great
flexibility in modeling a variety of function features, such as jump points,
branching, and periodicity. Functional ANOVA is employed to further decompose
multivariate functions in a reproducing kernel Hilbert space and provide
associated notions of main effect and interaction. Parsimonious random effects
are used to capture various correlation structures. The mixed-effect models are
nested under a general mixture model, in which the heterogeneity of functional
data is characterized. We propose a penalized Henderson's likelihood approach
for model-fitting and design a rejection-controlled EM algorithm for the
estimation. Our method selects smoothing parameters through generalized
cross-validation. Furthermore, the Bayesian confidence intervals are used to
measure the clustering uncertainty. Simulation studies and real-data examples
are presented to investigate the empirical performance of the proposed method.
Open-source code is available in the R package MFDA.
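The general mixture model under which the mixed-effect curves are nested can be illustrated, in drastically simplified form, by a two-component Gaussian-mixture EM on scalar data; the full method operates on functional data with penalized likelihood and rejection control, all of which this toy omits:

```python
import math
import random

def em_gmm_1d(data, n_iter=50):
    """Minimal two-component Gaussian-mixture EM: a toy stand-in for
    the general mixture model characterizing heterogeneity."""
    mu = [min(data), max(data)]
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per point.
        resp = []
        for x in data:
            w = [pi[k] / (sigma[k] * math.sqrt(2 * math.pi))
                 * math.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                 for k in (0, 1)]
            s = w[0] + w[1]
            resp.append([w[0] / s, w[1] / s])
        # M-step: responsibility-weighted updates of weights, means, sds.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)
    return mu, sigma, pi

rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(6.0, 1.0) for _ in range(200)])
mu, sigma, pi = em_gmm_1d(data)
```

In the article's setting, each "data point" is an entire fitted curve and the E-step is modified by rejection control to keep the computation feasible at large scale.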
Fast simulation of the leaky bucket algorithm
We use fast simulation methods, based on importance sampling, to efficiently estimate cell loss probability in queueing models of the Leaky Bucket algorithm. One of these models was introduced by Berger (1991), in which the rare event of a cell loss is related to the rare event of an empty finite buffer in an "overloaded" queue. In particular, we propose a heuristic change of measure for importance sampling to efficiently estimate the probability of the rare empty-buffer event in an asymptotically unstable GI/GI/1/k queue. This change of measure is, in a way, "dual" to that proposed by Parekh and Walrand (1989) to estimate the probability of a rare buffer overflow event. We present empirical results to demonstrate the effectiveness of our fast simulation method. Since we have not yet obtained a mathematical proof, we can only conjecture that our heuristic is asymptotically optimal as k → ∞.
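The kind of change of measure described here, swapping transition probabilities so the rare empty-buffer event becomes typical, can be sketched on the embedded random walk of an overloaded single-server queue. The parameters below are illustrative, and the walk abstracts away the GI/GI/1/k service-time details:

```python
import random

def is_empty_before_overflow(p_up=0.7, start=10, k=20, n_runs=20000, seed=2):
    """Importance-sampling estimate of the rare probability that an
    overloaded queue (up-probability p_up > 1/2) empties (hits 0)
    before overflowing (hits k). The change of measure swaps the up-
    and down-probabilities, the 'dual' of the overflow twist of
    Parekh and Walrand."""
    rng = random.Random(seed)
    q_up = 1.0 - p_up              # twisted up-probability (now small)
    total = 0.0
    for _ in range(n_runs):
        level, weight = start, 1.0
        while 0 < level < k:
            if rng.random() < q_up:      # up step under twisted measure
                level += 1
                weight *= p_up / q_up    # likelihood ratio for an up step
            else:                        # down step
                level -= 1
                weight *= (1.0 - p_up) / p_up
        if level == 0:                   # rare event: buffer empties
            total += weight
    return total / n_runs

est = is_empty_before_overflow()
```

For this birth-death walk the gambler's-ruin formula gives the exact answer, roughly 2.1e-4 for these parameters, which a naive simulation of 20,000 runs would rarely observe at all; under the twisted measure nearly every run hits the event and the likelihood-ratio weight corrects the bias.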
Probabilistic methods in the analysis of protein interaction networks
Statistical modelling of transcript profiles of differentially regulated genes
Background: The vast quantities of gene expression profiling data produced in microarray studies, and
the more precise quantitative PCR, are often not statistically analysed to their full potential. Previous
studies have summarised gene expression profiles using simple descriptive statistics, basic analysis of
variance (ANOVA) and the clustering of genes based on simple models fitted to their expression profiles
over time. We report the novel application of statistical non-linear regression modelling techniques to
describe the shapes of expression profiles for the fungus Agaricus bisporus, quantified by PCR, and for E.
coli and Rattus norvegicus, using microarray technology. The use of parametric non-linear regression models
provides a more precise description of expression profiles, reducing the "noise" of the raw data to
produce a clear "signal" given by the fitted curve, and describing each profile with a small number of
biologically interpretable parameters. This approach then allows the direct comparison and clustering of
the shapes of response patterns between genes and potentially enables a greater exploration and
interpretation of the biological processes driving gene expression.
Results: Quantitative reverse transcriptase PCR-derived time-course data of genes were modelled. "Splitline"
or "broken-stick" regression identified the initial time of gene up-regulation, enabling the classification
of genes into those with primary and secondary responses. Five-day profiles were modelled using the
biologically-oriented, critical exponential curve, y(t) = A + (B + Ct)R^t + ε. This non-linear regression
approach allowed the expression patterns for different genes to be compared in terms of curve shape,
time of maximal transcript level and the decline and asymptotic response levels. Three distinct regulatory
patterns were identified for the five genes studied. Applying the regression modelling approach to
microarray-derived time course data allowed 11% of the Escherichia coli features to be fitted by an
exponential function, and 25% of the Rattus norvegicus features could be described by the critical
exponential model, all with statistical significance of p < 0.05.
Conclusion: The statistical non-linear regression approaches presented in this study provide detailed
biologically oriented descriptions of individual gene expression profiles, using biologically variable data to
generate a set of defining parameters. These approaches have application to the modelling and greater
interpretation of profiles obtained across a wide range of platforms, such as microarrays. Through careful
choice of appropriate model forms, such statistical regression approaches allow an improved comparison
of gene expression profiles, and may provide an approach for the greater understanding of common
regulatory mechanisms between genes.
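Fitting the critical exponential curve can be sketched by profiling out R: for a fixed R the model y(t) = A + (B + Ct)R^t is linear in A, B and C, so a grid over R plus ordinary least squares recovers all four parameters. The synthetic noiseless five-day profile below is an illustrative example, not data from the study:

```python
def _lstsq_3(X, y):
    # Solve the 3x3 normal equations (X^T X) beta = X^T y by elimination.
    ata = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    m = [ata[i] + [aty[i]] for i in range(3)]
    for i in range(3):
        for j in range(i + 1, 3):
            f = m[j][i] / m[i][i]
            m[j] = [mj - f * mi for mj, mi in zip(m[j], m[i])]
    beta = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        beta[i] = (m[i][3] - sum(m[i][j] * beta[j]
                                 for j in range(i + 1, 3))) / m[i][i]
    return beta

def fit_critical_exponential(t, y, r_grid=None):
    """Fit y(t) = A + (B + Ct)R^t: for each trial R solve the linear
    least-squares problem in (A, B, C); keep the R with smallest RSS."""
    if r_grid is None:
        r_grid = [0.01 * i for i in range(1, 100)]
    best = None
    for r in r_grid:
        # Design matrix columns: 1, R^t, t*R^t.
        X = [[1.0, r ** ti, ti * r ** ti] for ti in t]
        coef = _lstsq_3(X, y)
        rss = sum((yi - sum(c * xj for c, xj in zip(coef, row))) ** 2
                  for row, yi in zip(X, y))
        if best is None or rss < best[0]:
            best = (rss, coef, r)
    rss, (a, b, c), r = best
    return a, b, c, r

# Synthetic profile with A=2, B=1, C=3, R=0.5 sampled over five days.
t = [0.25 * i for i in range(21)]
y = [2.0 + (1.0 + 3.0 * ti) * 0.5 ** ti for ti in t]
a, b, c, r = fit_critical_exponential(t, y)
```

The fitted parameters then carry the biological interpretation used in the article: time and height of the transcript peak, decline rate, and the asymptotic expression level A.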
Multi-variate time-series for time constraint adherence prediction in complex job shops
One of the most complex and agile production environments is semiconductor manufacturing, especially wafer fabrication, as products require more than several hundred operations and remain in Work-In-Progress for months, leading to complex job shops. Additionally, an increasingly competitive market environment, e.g. owing to Moore’s law, forces semiconductor companies to focus on operational excellence and resiliency, and hence makes product quality a decisive factor. Product-specific time constraints comprising two or more, not necessarily consecutive, operations ensure product quality at an operational level and, thus, are an industry-specific challenge. Time constraint adherence is of utmost importance, since violations typically lead to scrapping entire lots and a deteriorating yield. Dispatching decisions that determine time constraint adherence are, as the current state of the art, performed manually, which is stressful and error-prone. Therefore, this article presents a data-driven approach combining multi-variate time-series with centralized information to predict time constraint adherence probability in wafer fabrication and thereby facilitate dispatching. Real-world data is analyzed and different statistical and machine learning models are evaluated.
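A minimal version of such an adherence predictor might look like the logistic-regression sketch below; the two features (mean WIP level and queue-time trend) and the simulated lots are hypothetical stand-ins for the real-world multivariate time-series data:

```python
import math
import random

def train_logreg(X, y, lr=0.1, epochs=200):
    """Plain stochastic-gradient logistic regression: a minimal stand-in
    for the statistical and ML models compared in the article."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                       # gradient of the log-loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

rng = random.Random(3)
# Hypothetical per-lot features: mean WIP level and queue-time trend
# over the observation window; label 1 = time constraint met.
X, y = [], []
for _ in range(400):
    met = rng.random() < 0.5
    wip = rng.gauss(2.0 if met else 5.0, 1.0)
    trend = rng.gauss(-0.5 if met else 0.5, 1.0)
    X.append([wip, trend])
    y.append(1 if met else 0)
w, b = train_logreg(X, y)
acc = sum((1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
           > 0.5) == (yi == 1) for xi, yi in zip(X, y)) / len(X)
```

The predicted adherence probability, rather than the hard 0/1 label, is what would feed a dispatching decision: lots whose probability of meeting their time constraint falls below a threshold can be prioritized or held back.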