FamSeq: A Variant Calling Program for Family-Based Sequencing Data Using Graphics Processing Units
Various algorithms have been developed for variant calling using next-generation sequencing data, and various methods have been applied to reduce the associated false positive and false negative rates. Few variant calling programs, however, utilize the pedigree information when family-based sequencing data are available. Here, we present a program, FamSeq, which reduces both false positive and false negative rates by incorporating pedigree information from the Mendelian genetic model into variant calling. To accommodate variations in data complexity, FamSeq consists of four distinct implementations of the Mendelian genetic model: the Bayesian network algorithm, a graphics processing unit version of the Bayesian network algorithm, the Elston-Stewart algorithm and the Markov chain Monte Carlo algorithm. To make the software efficient and applicable to large families, we parallelized the Bayesian network algorithm, which copes with pedigrees containing inbreeding loops without losing calculation precision, on an NVIDIA graphics processing unit. To compare the four methods, we applied FamSeq to pedigree sequencing data with family sizes that varied from 7 to 12. When there is no inbreeding loop in the pedigree, the Elston-Stewart algorithm gives analytical results in a short time. If there are inbreeding loops in the pedigree, we recommend the Bayesian network method, which provides exact answers. To improve the computing speed of the Bayesian network method, we parallelized the computation on a graphics processing unit. This allowed the Bayesian network method to process the whole genome sequencing data of a family of 12 individuals within two days, a 10-fold time reduction compared to the time required for this computation on a central processing unit.
Illustration of GPU parallel computing in FamSeq.
The program can be divided into two parts: a serial part, processed on a CPU, and a parallel part, processed on a GPU. The program: 1. prepares the data for parallel computing on the CPU; 2. copies the data from CPU memory to GPU memory; 3. computes the 3^n jobs in parallel on the GPU, where n is the pedigree size; 4. copies the results from GPU memory back to CPU memory; and 5. summarizes the results on the CPU.
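As a rough illustration of this five-step pattern (not FamSeq's actual CUDA code), the sketch below uses Python with Numba's CUDA support; the pedigree size, the placeholder likelihood array, and the kernel body are hypothetical stand-ins for FamSeq's real data and per-configuration computation.

```python
import numpy as np
from numba import cuda

@cuda.jit
def config_joint_kernel(member_lik, out):
    # One thread per genotype configuration: decode the base-3 configuration
    # index into per-member genotypes and multiply the member likelihoods.
    # (A real pedigree kernel would also apply Mendelian transmission priors.)
    i = cuda.grid(1)
    if i < out.size:
        idx = i
        p = 1.0
        for m in range(member_lik.shape[0]):
            g = idx % 3                  # genotype (0/1/2) of member m in configuration i
            idx //= 3
            p *= member_lik[m, g]
        out[i] = p

n = 7                                    # hypothetical pedigree size
n_cfg = 3 ** n                           # 3^n genotype configurations
member_lik = np.random.rand(n, 3)        # 1. prepare data on the CPU (placeholder likelihoods)

d_lik = cuda.to_device(member_lik)       # 2. copy CPU memory -> GPU memory
d_out = cuda.device_array(n_cfg)

threads = 256
blocks = (n_cfg + threads - 1) // threads
config_joint_kernel[blocks, threads](d_lik, d_out)   # 3. run the 3^n jobs in parallel on the GPU

joint = d_out.copy_to_host()             # 4. copy GPU memory -> CPU memory
total = joint.sum()                      # 5. summarize on the CPU (e.g., normalizing constant)
```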
Workflow of FamSeq.
We use a pedigree file and a file that includes the genotype likelihoods as the input to estimate the posterior probability of each variant genotype. (E-S: Elston-Stewart algorithm; BN: Bayesian network method; BN-GPU: GPU version of the Bayesian network method, which requires a GPU card installed in the computer; MCMC: Markov chain Monte Carlo method; VCF: variant call format.)
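To make the likelihood-to-posterior step concrete, here is a minimal worked sketch for a single biallelic site in a father-mother-child trio, assuming Hardy-Weinberg founder priors and no de novo mutation. It illustrates the general idea only; it is not FamSeq's implementation, which handles arbitrary pedigrees via the algorithms listed above, and the allele frequency and genotype likelihoods below are hypothetical.

```python
import numpy as np
from itertools import product

def mendelian(child, father, mother):
    # P(child genotype | parental genotypes) for a biallelic site, genotypes coded
    # as the number of alternate alleles (0/1/2), assuming no de novo mutation.
    pf, pm = father / 2.0, mother / 2.0           # prob. each parent transmits the alt allele
    table = [(1 - pf) * (1 - pm),                 # child = 0
             pf * (1 - pm) + (1 - pf) * pm,       # child = 1
             pf * pm]                             # child = 2
    return table[child]

def trio_posteriors(lik, alt_freq=0.01):
    # lik[person][g]: likelihood of the observed reads given genotype g,
    # for person in (father, mother, child); alt_freq is a hypothetical population frequency.
    founder_prior = np.array([(1 - alt_freq) ** 2,
                              2 * alt_freq * (1 - alt_freq),
                              alt_freq ** 2])     # Hardy-Weinberg prior for founders
    post = np.zeros((3, 3))                       # marginal posteriors: rows = person, cols = genotype
    total = 0.0
    for gf, gm, gc in product(range(3), repeat=3):
        joint = (founder_prior[gf] * lik[0][gf] *
                 founder_prior[gm] * lik[1][gm] *
                 mendelian(gc, gf, gm) * lik[2][gc])
        total += joint
        for person, g in enumerate((gf, gm, gc)):
            post[person, g] += joint
    return post / total                           # normalize to posterior probabilities

# Hypothetical genotype likelihoods (e.g., derived from a VCF's genotype-likelihood fields).
lik = [[0.70, 0.29, 0.01],   # father
       [0.80, 0.19, 0.01],   # mother
       [0.10, 0.45, 0.45]]   # child
print(trio_posteriors(lik))
```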
Illustration of input files.
A.) Pedigree structure. B.) Pedigree structure file storing the pedigree structure shown in Fig. 2A. From the left-most column to the right-most column, the data are ID, mID (mother ID), fID (father ID), gender and sample name. C.) Part of the VCF file. From the VCF file, we can see that the genome of the grandfather (G-Father) was not sequenced. We nevertheless add him to the pedigree structure file to avoid ambiguity; for example, if we included only one parent of two siblings in the pedigree structure file, it would be unclear whether they are full or half siblings. The sample name in the pedigree structure file should be the same as the sample name in the VCF file. When an individual's genome was not sequenced, we set the corresponding sample name to NA in the pedigree structure file.
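A minimal reader for a pedigree structure file with the five columns described above (ID, mID, fID, gender, sample name) might look like the following sketch; the whitespace delimiter, comment handling, and file name are assumptions on our part, so the FamSeq documentation remains the authoritative reference for the exact format.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Individual:
    iid: str        # individual ID
    mother: str     # mother ID (founder convention as used by the file)
    father: str     # father ID
    gender: str
    sample: str     # sample name matching the VCF header, or "NA" if not sequenced

def read_pedigree(path: str) -> Dict[str, Individual]:
    # Assumes one whitespace-delimited record per line: ID mID fID gender sample
    pedigree: Dict[str, Individual] = {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if not fields or line.startswith("#"):
                continue
            iid, mid, fid, gender, sample = fields[:5]
            pedigree[iid] = Individual(iid, mid, fid, gender, sample)
    return pedigree

# Usage with a hypothetical file name; individuals with sample == "NA" were not
# sequenced but are kept so that sibling relationships remain unambiguous.
# ped = read_pedigree("family.ped")
```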
The total time (in seconds) needed for computation using FamSeq at one million positions.
PU: processing unit; E-S: Elston-Stewart algorithm; MCMC: Markov chain Monte Carlo algorithm; BN: Bayesian network algorithm; N: no, inbreeding loops are not considered; Y: yes, inbreeding loops are considered.
a. We called only 100,000 variants because of the excessive running time of the MCMC algorithm; the time shown here is 10× the time required to call 100,000 variants.
b. The time in parentheses is the GPU computing time.
Experimental Errors in QSAR Modeling Sets: What We Can Do and What We Cannot Do
Numerous chemical data sets have become available for quantitative structure–activity relationship (QSAR) modeling studies. However, the quality of different data sources may vary depending on the nature of the experimental protocols. Potential experimental errors in the modeling sets may therefore lead to poor QSAR models and further affect the predictions for new compounds. In this study, we explored the relationship between the ratio of questionable data in the modeling sets, obtained by simulating experimental errors, and QSAR modeling performance. To this end, we used eight data sets (four continuous endpoints and four categorical endpoints) that had been extensively curated both in-house and by our collaborators to create over 1800 QSAR models. Each data set was duplicated to create several new modeling sets with different ratios of simulated experimental errors (i.e., randomizing the activities of a portion of the compounds) in the modeling process. A fivefold cross-validation process was used to evaluate the modeling performance, which deteriorated as the ratio of experimental errors increased. All of the resulting models were also used to predict external sets of new compounds, which were excluded at the beginning of the modeling process. The modeling results showed that the compounds with relatively large prediction errors in the cross-validation process are likely to be those with simulated experimental errors. However, after removing a certain number of compounds with large prediction errors in the cross-validation process, the external predictions of new compounds did not improve. Our conclusion is that QSAR predictions, especially consensus predictions, can identify compounds with potential experimental errors, but removing those compounds based on the cross-validation procedure is not a reasonable means of improving model predictivity because of overfitting.
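The error-simulation and cross-validation loop described above can be sketched as follows. The synthetic descriptors and activities, the 20% error ratio, and the random forest model are placeholders standing in for the curated data sets and the QSAR methods actually used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                                   # placeholder descriptors
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)    # placeholder activities

# Simulate experimental errors: one simple way to "randomize" activities is to
# shuffle them among a chosen fraction of the compounds.
error_ratio = 0.2
err_idx = rng.choice(len(y), size=int(error_ratio * len(y)), replace=False)
y_noisy = y.copy()
y_noisy[err_idx] = rng.permutation(y_noisy[err_idx])

# Fivefold cross-validation on the perturbed modeling set.
model = RandomForestRegressor(n_estimators=200, random_state=0)
cv_pred = cross_val_predict(model, X, y_noisy,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0))
cv_abs_err = np.abs(cv_pred - y_noisy)

# Compounds with the largest cross-validation errors should be enriched
# for the ones whose activities were randomized.
flagged = np.argsort(cv_abs_err)[-len(err_idx):]
recovery = np.intersect1d(flagged, err_idx).size / len(err_idx)
print(f"fraction of perturbed compounds flagged: {recovery:.2f}")
```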
Surface Generating Matlab Scripts
Matlab scripts to import and create the defect surfaces for the model and the graphs for the journal article. Please read the ReadMe file, as it contains the information needed to run the scripts and view the data. All scripts are Matlab .m and .dat files, but the CT data is a CSV file so it can be read into any program.
Model Scripts and Support Files
The master code and setup files for the hybrid analytical-finite element defective bearing model are in this file. The model is a Simulink model that is set up and run from a MATLAB script. Please read the ReadMe file, as it contains all the information needed to run the model and to modify it to change loads, speeds, element density and much more.
Immobilization of Horseradish Peroxidase for Phenol Degradation
The use of enzymes to degrade environmental pollutants has received wide attention as an emerging green approach. Horseradish peroxidase (HRP) can efficiently catalyze the degradation of phenol in the environment; however, free HRP exhibits poor stability and temperature sensitivity and is easily deactivated, which limits its practical application. In this study, to improve its thermal stability, HRP was immobilized on mesoporous molecular sieves (Al-MCM-41). Specifically, Al-MCM-41(W) and Al-MCM-41(H) were prepared by modifying the mesoporous molecular sieve Al-MCM-41 with glutaraldehyde and epichlorohydrin, respectively, and used as carriers to immobilize HRP on their surfaces by covalent linkage, forming the immobilized enzymes HRP@Al-MCM-41(W) and HRP@Al-MCM-41(H). Notably, the maximum reaction rate of HRP@Al-MCM-41(H) was increased from 2.886 × 10^5 (free enzyme) to 5.896 × 10^5 U/min, and its half-life at 50 °C was increased from 745.17 to 1968.02 min; the thermal stability of the immobilized enzyme was thus significantly improved. In addition, we elucidated the mechanism of phenol degradation by HRP, which provides a basis for the application of this enzyme to phenol degradation.
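To put the reported half-lives in perspective, the short calculation below converts them to deactivation rate constants and residual activities, assuming simple first-order thermal deactivation; that kinetic model is an assumption on our part, as the abstract does not state how the half-lives were derived.

```python
import math

# Assumed first-order thermal deactivation: A(t) = A0 * exp(-kd * t), so kd = ln(2) / t_half.
half_life_free = 745.17          # min at 50 degC, free HRP (from the abstract)
half_life_immob = 1968.02        # min at 50 degC, HRP@Al-MCM-41(H) (from the abstract)

kd_free = math.log(2) / half_life_free
kd_immob = math.log(2) / half_life_immob

t = 24 * 60                      # residual activity after 24 h at 50 degC
res_free = math.exp(-kd_free * t)
res_immob = math.exp(-kd_immob * t)

print(f"kd: free {kd_free:.5f} 1/min, immobilized {kd_immob:.5f} 1/min")
print(f"Residual activity after 24 h: free {res_free:.1%}, immobilized {res_immob:.1%}")
```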