High dimensional biological data retrieval optimization with NoSQL technology.
Background High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries that retrieve gene expression records for hundreds of patients from a relational database perform poorly. Non-relational data models, such as the key-value model implemented in NoSQL databases, promise better performance. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. Results In this paper we introduce a new data model better suited to storing and querying high-dimensional data, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in query performance over the relational model implemented on MySQL Cluster, and an average 6.47-fold increase over MongoDB. Conclusions The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that used in the performance evaluation described in this paper.
We aim to use this new data model as the basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
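As a rough illustration of the key-value approach (the actual tranSMART/HBase schema is not given in the abstract, so the composite row-key layout below is hypothetical), encoding the patient and probe identifiers into a single sorted key turns per-patient retrieval into a contiguous prefix scan rather than a relational join:

```python
# Minimal sketch of a key-value layout for expression data, assuming a
# hypothetical composite row key "patient_id|probe_id" -> expression value.
# A plain dict stands in for an HBase table; the point is only that a
# key-prefix scan over a sorted key space replaces a relational join.
from typing import Dict

store: Dict[str, float] = {}

def put(patient: str, probe: str, value: float) -> None:
    store[f"{patient}|{probe}"] = value

def scan_patient(patient: str) -> Dict[str, float]:
    """Return all probe values for one patient via a key-prefix scan."""
    prefix = f"{patient}|"
    return {k.split("|")[1]: v for k, v in store.items() if k.startswith(prefix)}

# Invented example values.
put("P001", "211980_at", 7.82)
put("P001", "201290_at", 5.10)
put("P002", "211980_at", 6.45)
```

In a real HBase table the keys are byte-sorted, so such a prefix scan is a single range read.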
A Protocol for the Secure Linking of Registries for HPV Surveillance
In order to monitor the effectiveness of HPV vaccination in Canada, the linkage of multiple data registries may be required. These registries may not always be managed by the same organization and, furthermore, privacy legislation or practices may restrict the data linkages that can actually be done among registries. The objective of this study was to develop a secure protocol for linking data from different registries to allow on-going monitoring of HPV vaccine effectiveness. A secure linking protocol, using commutative hash functions and secure multi-party computation techniques, was developed. This protocol allows for the exact matching of records among registries and the computation of statistics on the linked data while meeting five practical requirements to ensure patient confidentiality and privacy. The statistics considered were: odds ratio and its confidence interval, chi-square test, and relative risk and its confidence interval. Additional statistics on contingency tables, such as other measures of association, can be added using the same principles presented. The computation time performance of this protocol was evaluated. The protocol has acceptable computation time and scales linearly with the size of the data set and the size of the contingency table. The worst-case computation time for up to 100,000 patients returned by each query and a 16-cell contingency table is less than 4 hours for basic statistics, and the best case is under 3 hours. A computationally practical protocol for the secure linking of data from multiple registries has been demonstrated in the context of HPV vaccine initiative impact assessment. The basic protocol can be generalized to the surveillance of other conditions, diseases, or vaccination programs.
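The protocol itself is not reproduced in the abstract, but the core property it relies on, a commutative hash, can be sketched with exponentiation modulo a prime: applying the two registries' secret keys in either order yields the same token, so both parties can match records without exchanging raw identifiers. All parameters below are toy values for illustration, not a secure instantiation:

```python
import hashlib

# Toy modulus for illustration only; a real deployment would use a large
# safe prime (or an elliptic-curve group) agreed between the registries.
P = 2**61 - 1  # a Mersenne prime, far too small for actual security

def to_element(identifier: str) -> int:
    """Map a record identifier into the group via a cryptographic hash."""
    return int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big") % P

def keyed_hash(x: int, secret: int) -> int:
    """Commutative keyed transform: x -> x^secret mod P."""
    return pow(x, secret, P)

# Each registry holds its own secret exponent (invented values).
key_a, key_b = 0x1D2F5, 0x2B713

x = to_element("patient-1970-01-01-ON")
# (x^a)^b = (x^b)^a = x^(a*b) mod P, so the order of application is irrelevant.
token_ab = keyed_hash(keyed_hash(x, key_a), key_b)
token_ba = keyed_hash(keyed_hash(x, key_b), key_a)
```

Two registries that each apply their own key to the other's singly-hashed identifiers can then compare the resulting double-hashed tokens for exact matches.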
Routes for breaching and protecting genetic privacy
We are entering the era of ubiquitous genetic information for research,
clinical care, and personal curiosity. Sharing these datasets is vital for
rapid progress in understanding the genetic basis of human diseases. However,
one growing concern is the ability to protect the genetic privacy of the data
originators. Here, we technically map threats to genetic privacy and discuss
potential mitigation strategies for privacy-preserving dissemination of genetic
data.
Comment: Draft for comment
Publishing data from electronic health records while preserving privacy: a survey of algorithms
The dissemination of Electronic Health Records (EHRs) can be highly beneficial for a range of medical studies, spanning from clinical trials to epidemic control studies, but it must be performed in a way that preserves patients’ privacy. This is not straightforward, because the disseminated data need to be protected against several privacy threats, while remaining useful for subsequent analysis tasks. In this work, we present a survey of algorithms that have been proposed for publishing structured patient data in a privacy-preserving way. We review more than 45 algorithms, derive insights on their operation, and highlight their advantages and disadvantages. We also provide a discussion of some promising directions for future research in this area.
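To make the privacy criteria these algorithms target concrete, here is a minimal check for k-anonymity, one of the most common models in this literature (the field names and rows are invented for illustration):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier value combination occurs in >= k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Invented example rows: "age" and "zip" are quasi-identifiers already
# generalized into ranges/prefixes; "diagnosis" is the sensitive attribute.
rows = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "cold"},
    {"age": "40-49", "zip": "022**", "diagnosis": "flu"},
]
```

Here the `("40-49", "022**")` group has only one record, so the table is not 2-anonymous; the surveyed algorithms differ mainly in how they generalize or suppress values to reach such thresholds while preserving utility.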
Energy and exergy analysis of chemical looping combustion technology and comparison with pre-combustion and oxy-fuel combustion technologies for CO2 capture
Carbon dioxide (CO2) emitted from conventional coal-based power plants is a growing concern for the environment. Chemical looping combustion (CLC), pre-combustion and oxy-fuel combustion are promising CO2 capture technologies which allow clean electricity generation from coal in an integrated gasification combined cycle (IGCC) power plant. This work compares the characteristics of the above three capture technologies to those of a conventional IGCC plant without CO2 capture. CLC technology is also investigated for two different process configurations—(i) an integrated gasification combined cycle coupled with chemical looping combustion (IGCC–CLC), and (ii) coal direct chemical looping combustion (CDCLC)—using exergy analysis to exploit the complete potential of CLC. Power output, net electrical efficiency and CO2 capture efficiency are the key parameters investigated for the assessment. Flowsheet models of five different types of IGCC power plants (four with and one without CO2 capture) were developed in the Aspen Plus simulation package. The results indicate that, with respect to the conventional IGCC power plant, IGCC–CLC exhibited an energy penalty of 4.5%, compared with 7.1% and 9.1% for pre-combustion and oxy-fuel combustion technologies, respectively. IGCC–CLC and oxy-fuel combustion technologies achieved an overall CO2 capture rate of ∼100% whereas pre-combustion technology could capture ∼94.8%. Modification of IGCC–CLC into CDCLC tends to increase the net electrical efficiency by 4.7% while maintaining a 100% CO2 capture rate. A detailed exergy analysis performed on the two CLC process configurations (IGCC–CLC and CDCLC) and the conventional IGCC process demonstrates that CLC technology can be thermodynamically as efficient as a conventional IGCC process.
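The energy-penalty comparison can be restated numerically. The penalty figures below are the ones reported in the text; the baseline net electrical efficiency of the reference IGCC plant is an assumed placeholder, since the abstract does not state it:

```python
# Energy penalties (percentage points) relative to the no-capture IGCC plant,
# as reported for the three capture routes.
penalties = {"IGCC-CLC": 4.5, "pre-combustion": 7.1, "oxy-fuel": 9.1}

BASELINE_EFF = 40.0  # assumed net electrical efficiency (%) of IGCC w/o capture

# Net efficiency retained by each capture route under that assumption.
net_eff = {tech: BASELINE_EFF - p for tech, p in penalties.items()}
best = max(net_eff, key=net_eff.get)  # the route with the smallest penalty
```

Whatever baseline is assumed, the ordering is fixed by the penalties: IGCC–CLC retains the most net efficiency among the three capture options.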
Dilepton mass spectra in p+p collisions at sqrt(s)= 200 GeV and the contribution from open charm
The PHENIX experiment has measured the electron-positron pair mass spectrum
from 0 to 8 GeV/c^2 in p+p collisions at sqrt(s)=200 GeV. The contributions
from light meson decays to e^+e^- pairs have been determined based on
measurements of hadron production cross sections by PHENIX. They account for
nearly all e^+e^- pairs in the mass region below 1 GeV/c^2. The e^+e^- pair
yield remaining after subtracting these contributions is dominated by
semileptonic decays of charmed hadrons correlated through flavor conservation.
Using the spectral shape predicted by PYTHIA, we estimate the charm production
cross section to be 544 +/- 39(stat) +/- 142(syst) +/- 200(model) \mu b, which
is consistent with QCD calculations and measurements of single leptons by
PHENIX.
Comment: 375 authors from 57 institutions, 18 pages, 4 figures, 2 tables. Submitted to Physics Letters B. v2 fixes technical errors in matching authors to institutions. Plain text data tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.htm
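When a single total uncertainty on the quoted charm cross section is needed, the three quoted error components are conventionally combined in quadrature, which assumes the statistical, systematic, and model errors are independent:

```python
import math

# sigma_cc = 544 +/- 39 (stat) +/- 142 (syst) +/- 200 (model) microbarns,
# as quoted in the abstract.
stat, syst, model = 39.0, 142.0, 200.0

# Quadrature sum; valid only under the independence assumption.
total = math.sqrt(stat**2 + syst**2 + model**2)  # ~248 microbarns
```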
Inclusive cross section and double helicity asymmetry for \pi^0 production in p+p collisions at sqrt(s)=200 GeV: Implications for the polarized gluon distribution in the proton
The PHENIX experiment presents results from the RHIC 2005 run with polarized
proton collisions at sqrt(s)=200 GeV, for inclusive \pi^0 production at
mid-rapidity. Unpolarized cross section results are given for transverse
momenta p_T=0.5 to 20 GeV/c, extending the range of published data to both
lower and higher p_T. The cross section is described well for p_T < 1 GeV/c by
an exponential in p_T, and, for p_T > 2 GeV/c, by perturbative QCD. Double
helicity asymmetries A_LL are presented based on a factor of five improvement
in uncertainties as compared to previously published results, due to both an
improved beam polarization of 50%, and to higher integrated luminosity. These
measurements are sensitive to the gluon polarization in the proton, and exclude
maximal values for the gluon polarization.
Comment: 375 authors, 7 pages, 3 figures. Submitted to Phys. Rev. D, Rapid Communications.
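For context, a double helicity asymmetry is conventionally extracted from same- and opposite-helicity yields with corrections for relative luminosity and the two beam polarizations. The estimator below is the standard form; the yield counts are invented for illustration, while the 0.5 polarizations match the ~50% beam polarization quoted above:

```python
def double_helicity_asymmetry(n_same, n_opp, rel_lumi, pol_blue, pol_yellow):
    """A_LL = (N_same - R*N_opp) / (N_same + R*N_opp) / (P_B * P_Y),
    with R the same/opposite-helicity relative luminosity and P_B, P_Y
    the polarizations of the two beams."""
    raw = (n_same - rel_lumi * n_opp) / (n_same + rel_lumi * n_opp)
    return raw / (pol_blue * pol_yellow)

# Illustrative yields only (not PHENIX data).
a_ll = double_helicity_asymmetry(10100, 10000, rel_lumi=1.0,
                                 pol_blue=0.5, pol_yellow=0.5)
```

The 1/(P_B P_Y) factor is why the improved 50% polarization directly shrinks the uncertainty on A_LL.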
Measurement of high-p_T Single Electrons from Heavy-Flavor Decays in p+p Collisions at sqrt(s) = 200 GeV
The momentum distribution of electrons from decays of heavy flavor (charm and
beauty) for midrapidity |y| < 0.35 in p+p collisions at sqrt(s) = 200 GeV has
been measured by the PHENIX experiment at the Relativistic Heavy Ion Collider
(RHIC) over the transverse momentum range 0.3 < p_T < 9 GeV/c. Two independent
methods have been used to determine the heavy flavor yields, and the results
are in good agreement with each other. A fixed-order-plus-next-to-leading-log
pQCD calculation agrees with the data within the theoretical and experimental
uncertainties, with the data/theory ratio of 1.72 +/- 0.02^stat +/- 0.19^sys
for 0.3 < p_T < 9 GeV/c. The total charm production cross section at this
energy has also been deduced to be sigma_(c c^bar) = 567 +/- 57^stat +/-
224^sys microbarns.
Comment: 375 authors from 57 institutions, 6 pages, 3 figures. Submitted to Physical Review Letters.
System Size and Energy Dependence of Jet-Induced Hadron Pair Correlation Shapes in Cu+Cu and Au+Au Collisions at sqrt(s_NN) = 200 and 62.4 GeV
We present azimuthal angle correlations of intermediate transverse momentum
(1-4 GeV/c) hadrons from dijets in Cu+Cu and Au+Au collisions at sqrt(s_NN) =
62.4 and 200 GeV. The away-side dijet induced azimuthal correlation is
broadened, non-Gaussian, and peaked away from \Delta\phi=\pi in central and
semi-central collisions in all the systems. The broadening and peak location
are found to depend upon the number of participants in the collision, but not
on the collision energy or beam nuclei. These results are consistent with sound
or shock wave models, but pose challenges to Cherenkov gluon radiation models.
Comment: 464 authors from 60 institutions, 6 pages, 3 figures, 2 tables. Submitted to Physical Review Letters.
- …