
10 research outputs found

Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data

Author: Anney Che (491890)
Brian T. Luke (171128)
F. Pascal Girard (491891)
Girish Venkataraman (491889)
Mohamad Khouja (491887)
Robert M. Stephens (17045)
Stephen Repetski (491888)
Uma S. Mudunuri (463683)
Publication venue
Publication date: 02/12/2013
Field of study

As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework.

Directory of Open Access Journals

PubMed Central

FigShare

Bubble chart of Cancer-Gene associations from literature.

Author: Anney Che (491890)
Brian T. Luke (171128)
F. Pascal Girard (491891)
Girish Venkataraman (491889)
Mohamad Khouja (491887)
Robert M. Stephens (17045)
Stephen Repetski (491888)
Uma S. Mudunuri (463683)

A bubble chart representation with cancer terms on the x-axis and genes on the y-axis. The size of the bubble is directly proportional to the number of literature articles where the cancer and gene terms co-occur.
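The co-occurrence count behind each bubble can be illustrated with a minimal sketch in a map/reduce style. This is not the authors' Hadoop implementation: the lexicons, the abstracts, and the function names `map_abstract` and `reduce_counts` are hypothetical, and simple whitespace tokenization stands in for real term matching.

```python
from collections import Counter
from itertools import product

# Hypothetical toy lexicons and abstracts; not data from the paper.
CANCER_TERMS = {"melanoma", "glioma"}
GENE_TERMS = {"BRAF", "TP53"}

ABSTRACTS = {
    "pmid1": "BRAF mutations are frequent in melanoma",
    "pmid2": "TP53 and BRAF were profiled in melanoma and glioma",
    "pmid3": "TP53 loss in glioma",
}


def map_abstract(text):
    """Map step: emit each (cancer, gene) pair at most once per article."""
    tokens = set(text.replace(",", " ").split())
    cancers = {t.lower() for t in tokens} & CANCER_TERMS
    genes = tokens & GENE_TERMS
    return list(product(sorted(cancers), sorted(genes)))


def reduce_counts(mapped):
    """Reduce step: sum the number of articles per (cancer, gene) pair."""
    counts = Counter()
    for pairs in mapped:
        counts.update(pairs)
    return counts


# Each count would set the size of one bubble in a chart like the one described.
cooccurrence = reduce_counts(map_abstract(t) for t in ABSTRACTS.values())
```

In this toy setup, a pair is counted once per article regardless of how often the terms repeat within it, which matches the caption's "number of literature articles" wording.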

FigShare

Cancer term occurrences in the literature.

Author: Anney Che (491890)
Brian T. Luke (171128)
F. Pascal Girard (491891)
Girish Venkataraman (491889)
Mohamad Khouja (491887)
Robert M. Stephens (17045)
Stephen Repetski (491888)
Uma S. Mudunuri (463683)

A bar chart representation with cancer terms on the y-axis and publication counts on the x-axis. Only the cancer terms with high literature occurrences are shown.

FigShare

Architecture for integrating structured and unstructured data in Hadoop.

Author: Anney Che (491890)
Brian T. Luke (171128)
F. Pascal Girard (491891)
Girish Venkataraman (491889)
Mohamad Khouja (491887)
Robert M. Stephens (17045)
Stephen Repetski (491888)
Uma S. Mudunuri (463683)

Architectural diagram detailing the steps in creating the categorical lexicons and using them to get the PMID counts from literature. DEG stands for Differentially Expressed Gene while DE miRNA stands for Differentially Expressed miRNA.

FigShare

Growth of articles in MEDLINE.

Author: Anney Che (491890)
Brian T. Luke (171128)
F. Pascal Girard (491891)
Girish Venkataraman (491889)
Mohamad Khouja (491887)
Robert M. Stephens (17045)
Stephen Repetski (491888)
Uma S. Mudunuri (463683)

A bar chart displaying the number of baseline records in NLM MEDLINE, from the 2001 baseline release to the 2012 baseline release (http://www.nlm.nih.gov/bsd/licensee/2012_stats/baseline_doc.html).

FigShare

Network of Cancer-Gene associations from literature.

Author: Anney Che (491890)
Brian T. Luke (171128)
F. Pascal Girard (491891)
Girish Venkataraman (491889)
Mohamad Khouja (491887)
Robert M. Stephens (17045)
Stephen Repetski (491888)
Uma S. Mudunuri (463683)

Network of Cancer/Gene associations displaying shared genes between cancers and genes specific to certain cancer types based on literature evidence. Cancer terms are represented as labeled nodes, genes are unlabeled pink nodes and the edges represent at least one publication with a co-occurrence of the cancer term and gene.
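The distinction between shared and cancer-specific genes in such a network can be sketched from an edge list. The edges and the helper `split_genes` below are hypothetical illustrations, not the authors' data or code; each edge stands for at least one publication in which the cancer term and the gene co-occur.

```python
from collections import defaultdict

# Hypothetical (cancer term, gene) edges; not data from the paper.
EDGES = [
    ("melanoma", "BRAF"),
    ("glioma", "BRAF"),
    ("glioma", "IDH1"),
    ("melanoma", "MITF"),
    ("glioma", "TP53"),
    ("breast cancer", "TP53"),
]


def split_genes(edges):
    """Group genes by the cancers they link to, then split shared vs. specific."""
    cancers_by_gene = defaultdict(set)
    for cancer, gene in edges:
        cancers_by_gene[gene].add(cancer)
    # A gene connected to more than one cancer node is shared.
    shared = {g: sorted(c) for g, c in cancers_by_gene.items() if len(c) > 1}
    # A gene connected to exactly one cancer node is specific to that cancer.
    specific = {g: next(iter(c)) for g, c in cancers_by_gene.items() if len(c) == 1}
    return shared, specific


shared, specific = split_genes(EDGES)
```

Under these assumptions, a gene's degree in the bipartite graph is enough to classify it, so no graph library is needed for this step.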

FigShare

Load and query times using simulated gene expression data.

Author: Anney Che (491890)
Brian T. Luke (171128)
F. Pascal Girard (491891)
Girish Venkataraman (491889)
Mohamad Khouja (491887)
Robert M. Stephens (17045)
Stephen Repetski (491888)
Uma S. Mudunuri (463683)

* Query to get the DEG list was not run on the 8 TB data due to time constraints.

FigShare

Germline mutations are affected by transcription.

Author: Aklank Jain (15856)
Albino Bacolla (17041)
Brian T. Luke (171128)
David N. Cooper (17051)
Duncan E. Donohue (171121)
Edward V. Ball (463682)
Guliang Wang (463684)
Jack R. Collins (17042)
Joseph Ivanic (43484)
Karen M. Vasquez (9306)
Ming Yi (15051)
Natalia Volfovsky (727)
Nuri A. Temiz (171116)
Regina Z. Cer (463681)
Robert M. Stephens (17045)
Uma S. Mudunuri (463683)

Panel A, HGMD dataset; y-axis, as in Figure 1C (http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1003816#pgen-1003816-g001); x-axis, ratio of mutated NGNN sequences in protein coding genes containing the P2-guanine base on the non-transcribed (NT) vs. transcribed (T) strand; solid circles, HGMD dataset (r² 0.32, P(α)0.05 0.991, P<0.001); open circles, 1000 Genomes Project dataset. Panel B, inherited splicing mutations dataset; top, cartoon of exon-intron boundaries showing the conserved GT and AG bases at the donor (ds) and acceptor (as) splice sites; bottom, graph of splicing mutations; y-axis, number of SBSs; x-axis, position of SBSs relative to +/−20 nt of splice junctions. Panel C, model for sequence-dependent SBSs in cancer and human inherited disease. In the first step, an electron is lost from within a tetranucleotide sequence, leaving a hole. In the second step, the hole migrates to and from various competing sites, including nearby bases and chromatin-associated amino acids (not shown), eventually being trapped by a guanine base. The resulting guanine radical cation either causes DNA-protein crosslinking or undergoes subsequent chemical modifications. If the modified base is not corrected by DNA repair, it may give rise to a mutation (X-Y base-pair) as a result of error-prone DNA polymerase synthesis during DNA replication (dashed arrow).

FigShare

SBSs and VIPs.

Author: Aklank Jain (15856)
Albino Bacolla (17041)
Brian T. Luke (171128)
David N. Cooper (17051)
Duncan E. Donohue (171121)
Edward V. Ball (463682)
Guliang Wang (463684)
Jack R. Collins (17042)
Joseph Ivanic (43484)
Karen M. Vasquez (9306)
Ming Yi (15051)
Natalia Volfovsky (727)
Nuri A. Temiz (171116)
Regina Z. Cer (463681)
Robert M. Stephens (17045)
Uma S. Mudunuri (463683)

Panel A, whisker plot of the fractions of SBSs at G•C bp for the EWS and GWS datasets computed using AgilentV2 and Duke35 mappability counts, respectively; red line, mean; black line, median; green lines, average GC-contents in the mappable AgilentV2 (EWS) and Duke35 (GWS) sets. Panel B, NGRA sequences are enriched in SBSs in melanoma. y-axis, for each 4-member sequence combination with matching P1–P3 bases, the fraction of mutations at P4-A was divided by the average fraction of mutations at P4-(C/T/G); x-axis, P3 base composition; R, purine; Y, pyrimidine; mean ± SD; P-value from two-tailed t-test. Panel C, the ln of normalized fractions of mutated DGNN (D = A/G/T) sequences, Fi, for the seven cancer datasets with −logP ≥ 3 for DGRN>DGYN (Table 1, http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1003816#pgen-1003816-t001) were combined and plotted as a function of the average absolute free energy of base stacking, ΔG(ν), for each of the 48 DGNN sequences. Panel D, three-dimensional model of the (5′-GGG-3′)•(5′-CCC-3′) trinucleotide showing the LUBMO (lowest unoccupied beta molecular orbital) of the ionized sequence. Panel E, plot of the normalized fractions (log fi × 10³) of mutated DGN sequences (Duke35 counts) for the Lung_nsc cancer dataset vs. VIPs; outer circle, 5′D base; inner circle, 3′N base; blue, adenine; green, guanine; red, thymine; yellow, cytosine. Panel F, agglomerative hierarchical clustering of 14 cancer genome datasets obtained from linear correlations with ln VIP values, as obtained from T_hg19 counts; colored boxes, elements found to be clustered at the 90% confidence interval.

FigShare

Vertical ionization potentials (VIPs) of guanine-centered DGN sequences.

Author: Aklank Jain (15856)
Albino Bacolla (17041)
Brian T. Luke (171128)
David N. Cooper (17051)
Duncan E. Donohue (171121)
Edward V. Ball (463682)
Guliang Wang (463684)
Jack R. Collins (17042)
Joseph Ivanic (43484)
Karen M. Vasquez (9306)
Ming Yi (15051)
Natalia Volfovsky (727)
Nuri A. Temiz (171116)
Regina Z. Cer (463681)
Robert M. Stephens (17045)
Uma S. Mudunuri (463683)

VIPs for the central (italicized) guanine computed at the M06-2X/6-31G(d) level of theory; a, VIP of free unalkylated guanine; b, from reference [57] (http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1003816#pgen.1003816-Zaytseva1).

FigShare