Probabilistic analysis of the human transcriptome with side information
Understanding functional organization of genetic information is a major
challenge in modern biology. Following the initial publication of the human
genome sequence in 2001, advances in high-throughput measurement technologies
and efficient sharing of research material through community databases have
opened up new perspectives on the study of living organisms and the structure of life.
In this thesis, novel computational strategies have been developed to
investigate a key functional layer of genetic information, the human
transcriptome, which regulates the function of living cells through protein
synthesis. The key contributions of the thesis are general exploratory tools
for high-throughput data analysis that have provided new insights into
cell-biological networks, cancer mechanisms and other aspects of genome
function.
A central challenge in functional genomics is that high-dimensional genomic
observations are associated with high levels of complex and largely unknown
sources of variation. By combining statistical evidence across multiple
measurement sources and the wealth of background information in genomic data
repositories, it has been possible to resolve some of the uncertainties associated
with individual observations and to identify functional mechanisms that could
not be detected based on individual measurement sources. Statistical learning
and probabilistic models provide a natural framework for such modeling tasks.
Open source implementations of the key methodological contributions have been
released to facilitate further adoption of the developed methods by the
research community.
Comment: Doctoral thesis. 103 pages, 11 figures.
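As a deliberately minimal illustration of combining statistical evidence across measurement sources, consider inverse-variance (precision-weighted) pooling of independent noisy measurements of one quantity; the thesis itself develops far richer latent-variable models, so this sketch only shows the underlying principle:

```python
import numpy as np

def combine_measurements(means, variances):
    """Inverse-variance (precision-weighted) pooling of independent noisy
    measurements of the same quantity. More precise sources get more weight,
    and the pooled variance is smaller than any individual variance."""
    means = np.asarray(means, float)
    w = 1.0 / np.asarray(variances, float)   # precision weights
    pooled_mean = np.sum(w * means) / np.sum(w)
    pooled_var = 1.0 / np.sum(w)
    return pooled_mean, pooled_var
```

For two equally precise sources the pool is their plain average; as one source's variance grows, its influence fades smoothly.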
Iterative Random Forests to detect predictive and stable high-order interactions
Genomics has revolutionized biology, enabling the interrogation of whole
transcriptomes, genome-wide binding sites for proteins, and many other
molecular processes. However, individual genomic assays measure elements that
interact in vivo as components of larger molecular machines. Understanding how
these high-order interactions drive gene expression presents a substantial
statistical challenge. Building on Random Forests (RF), Random Intersection
Trees (RITs), and through extensive, biologically inspired simulations, we
developed the iterative Random Forest algorithm (iRF). iRF trains a
feature-weighted ensemble of decision trees to detect stable, high-order
interactions with the same order of computational cost as RF. We demonstrate the
utility of iRF for high-order interaction discovery in two prediction problems:
enhancer activity in the early Drosophila embryo and alternative splicing of
primary transcripts in human-derived cell lines. In Drosophila, among the 20
pairwise transcription factor interactions iRF identifies as stable (returned
in more than half of bootstrap replicates), 80% have been previously reported
as physical interactions. Moreover, novel third-order interactions, e.g.
between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order
relationships that are candidates for follow-up experiments. In human-derived
cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated
splicing regulation, and identified novel 5th and 6th order interactions,
indicative of multi-valent nucleosomes with specific roles in splicing
regulation. By decoupling the order of interactions from the computational cost
of identification, iRF opens new avenues of inquiry into the molecular
mechanisms underlying genome biology.
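The two ingredients sketched in the abstract, iterative feature reweighting and stability scoring over bootstrap replicates, can be illustrated with a toy approximation. This is not the published iRF implementation: weighted feature sampling is emulated here by subsampling feature sets (scikit-learn's forests do not expose per-feature sampling weights), and only single-feature stability is scored, not high-order interactions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def toy_irf(X, y, n_iter=5, n_boot=20, top_k=2, seed=0):
    """Toy iRF-style procedure: iteratively concentrate feature weights on
    important features, then score how often each feature ranks in the
    top-k across bootstrap replicates (its 'stability')."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = max(2, p // 2)                  # features sampled per fit
    w = np.full(p, 1.0 / p)             # feature weights
    for it in range(n_iter):
        # emulate feature-weighted sampling: draw a subset w.p. proportional to w
        feats = rng.choice(p, size=k, replace=False, p=w)
        rf = RandomForestClassifier(n_estimators=100, random_state=it)
        rf.fit(X[:, feats], y)
        imp = np.zeros(p)
        imp[feats] = rf.feature_importances_
        if imp.sum() > 0:               # mix new importances into the weights
            w = 0.5 * w + 0.5 * imp / imp.sum()
    counts = np.zeros(p)
    for b in range(n_boot):             # stability over bootstrap replicates
        rows = rng.integers(0, n, n)
        feats = rng.choice(p, size=k, replace=False, p=w)
        rf = RandomForestClassifier(n_estimators=100, random_state=b)
        rf.fit(X[rows][:, feats], y[rows])
        top = feats[np.argsort(rf.feature_importances_)[-top_k:]]
        counts[top] += 1
    return counts / n_boot              # fraction of replicates in the top-k
```

On synthetic data where the label depends on two of six features, the two signal features should come back with stability near one while noise features stay near zero.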
You can't always sketch what you want: Understanding Sensemaking in Visual Query Systems
Visual query systems (VQSs) empower users to interactively search for line
charts with desired visual patterns, typically specified using intuitive
sketch-based interfaces. Despite decades of past work on VQSs, these efforts
have not translated to adoption in practice, possibly because VQSs are largely
evaluated in unrealistic lab-based settings. To remedy this gap in adoption, we
collaborated with experts from three diverse domains---astronomy, genetics, and
material science---via a year-long user-centered design process to develop a
VQS that supports their workflow and analytical needs, and evaluate how VQSs
can be used in practice. Our study results reveal that ad-hoc sketch-only
querying is not as commonly used as prior work suggests, since analysts are
often unable to precisely express their patterns of interest. In addition, we
characterize three essential sensemaking processes supported by our enhanced
VQS. We discover that participants employ all three processes, but in different
proportions, depending on the analytical needs in each domain. Our findings
suggest that all three sensemaking processes must be integrated in order to
make future VQSs useful for a wide range of analytical inquiries.
Comment: Accepted for presentation at IEEE VAST 2019, to be held October 20-25
in Vancouver, Canada; the paper will also be published in a special issue of
IEEE Transactions on Visualization and Computer Graphics (TVCG). IEEE VIS
(InfoVis/VAST/SciVis) 2019. ACM 2012 CCS: Human-centered computing,
Visualization, Visualization design and evaluation methods.
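As a toy illustration of the kind of primitive that sketch-based querying in a VQS rests on (not the system evaluated in this paper), one can slide a z-normalized sketched pattern along a series and return the best-matching offset:

```python
import numpy as np

def best_match(series, sketch):
    """Slide a z-normalized sketch along a series and return the offset
    with the smallest Euclidean distance -- a minimal shape-matching
    primitive of the sort a visual query system might use under the hood."""
    def znorm(x):
        s = x.std()
        return (x - x.mean()) / s if s > 0 else x - x.mean()
    q = znorm(np.asarray(sketch, float))
    w = len(q)
    best, best_i = np.inf, 0
    for i in range(len(series) - w + 1):
        d = np.linalg.norm(znorm(np.asarray(series[i:i + w], float)) - q)
        if d < best:
            best, best_i = d, i
    return best_i
```

Z-normalizing both the sketch and each window makes the match invariant to the absolute scale and offset of the data, which matters because users sketch shapes, not values.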
Exploration of human search behaviour: a multidisciplinary perspective
The following work presents an exploration of human search behaviour both from biological
and computational perspectives. Search behaviour is defined as the movements
made by an organism while attempting to find a resource. This work describes some of
the principal procedures used to record movement, methods for analysing the data and
possible ways of interpreting the data. In order to obtain a database of searching behaviour,
an experimental setup was built and tested to generate the search paths of human
participants. The test arena occupied part of a football field and the targets consisted of
an array of 20 golf balls. In the first set of experiments, random and regular
distributions of targets were tested. For each distribution, three distinct
conspicuity levels were
constructed: a cryptic level, in which targets were painted the same colour as the grass,
a semi-conspicuous level in which targets were left white and a conspicuous condition in
which the position of each target was marked by a red flag, protruding one metre from the
ground. The subjects tested were 9-11-year-old children, and their search paths
were collected using a GPS device. Subjects did not recognise the spatial cues
regarding how the targets were spatially distributed. A minimal decision model,
the bouncing search model, was built based on the characteristics of the
children's search paths. The model produced an outstanding fit to the
children's behavioural data. In the second set of experiments, a new group of
children was tested on two new distributions obtained by arranging the targets
in patches. Again, the children appeared unable to recognise spatial
information during the collection process. The children's behaviour once again
produced a good match with that of the bouncing search model. This work
introduces several new methodological aspects to be explored to further
understand the decision processes involved when humans search. It also
illustrates that integrating biology and computational science can result in
innovative research.
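The abstract does not spell out the rules of the bouncing search model, so the following is only a hypothetical sketch of what a minimal "bouncing" searcher could look like: an agent that moves in straight lines across a bounded arena and picks a fresh random heading whenever it hits a wall. The wall-bounce rule, arena size, and step length are all assumptions, not details from the thesis.

```python
import numpy as np

def bouncing_search(n_steps=1000, arena=50.0, step=1.0, seed=0):
    """Hypothetical minimal searcher: ballistic straight-line motion with a
    new random heading drawn at each wall contact ('bounce'). Returns the
    path as an (n_steps+1, 2) array of positions inside the arena."""
    rng = np.random.default_rng(seed)
    pos = np.array([arena / 2, arena / 2])   # start at the arena centre
    heading = rng.uniform(0, 2 * np.pi)
    path = [pos.copy()]
    for _ in range(n_steps):
        vel = step * np.array([np.cos(heading), np.sin(heading)])
        pos = pos + vel
        for d in range(2):                   # reflect at the walls
            if pos[d] < 0 or pos[d] > arena:
                pos[d] = np.clip(pos[d], 0, arena)
                heading = rng.uniform(0, 2 * np.pi)  # bounce: new heading
        path.append(pos.copy())
    return np.array(path)
```

A simulated path from such a model could then be compared against recorded GPS tracks, e.g. by matching step-length or turning-angle distributions.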
Spaceflight and the Differential Gene Expression of Human Stem Cell-Derived Cardiomyocytes
The National Aeronautics and Space Administration (NASA) has performed many
experiments on the International Space Station (ISS) to further understand how
conditions in space can affect life on Earth. This project analyzed GLDS-258, a
dataset from NASA's GeneLab repository that examines the impact of microgravity
on human induced pluripotent stem-cell-derived cardiomyocytes (hiPSC-CMs).
While many datasets have been run through NASA's RNA-Seq Consensus Pipeline
(RCP) to study differential gene expression in space, a Homo sapiens dataset
had yet to be analyzed using the RCP. The aim of this project was to run the
first Homo sapiens dataset, GLDS-258, through the RCP on the San Jose State
University College of Engineering High Performance Computing Cluster and to
investigate the biological significance of the results. In this study, a total
of 18 hiPSC-CM samples from ground control, flight, and post-flight groups were
run through the RCP. The resulting differential gene expression data were
further analyzed for biological significance using the Database for Annotation,
Visualization, and Integrated Discovery (DAVID) and Gene Set Enrichment
Analysis (GSEA). Results showed that most genes were differentially expressed
in ground control versus flight groups, while post-flight and ground control
groups did not have as many differentially expressed genes. Gene set analysis
showed significant expression of genes in mitochondrial pathways as well as
genes related to neurodegenerative diseases such as Alzheimer's, Huntington's,
and Parkinson's disease. These results indicate that exposure to microgravity
may play a role in altering the expression of genes related to
neurodegenerative pathways in cardiac cells. Our results demonstrate that it is
possible to process Homo sapiens data through the RCP, and suggest that
exposure to microgravity may exacerbate neurodegenerative disease progression
in cardiomyocytes.
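As a toy illustration of the differential-expression step (nothing like the full RCP, which is a complete RNA-seq pipeline), one can call differentially expressed genes between two groups with a per-gene Welch t-test on log-transformed counts and Benjamini-Hochberg FDR control:

```python
import numpy as np
from scipy import stats

def diff_expression(ground, flight, alpha=0.05):
    """Toy differential-expression call: per-gene Welch t-test on
    log2(count + 1) values, then Benjamini-Hochberg step-up to control
    the false discovery rate at level alpha. Rows are genes, columns
    are samples. Returns (p-values, boolean significance mask)."""
    g = np.log2(np.asarray(ground, float) + 1.0)
    f = np.log2(np.asarray(flight, float) + 1.0)
    t, p = stats.ttest_ind(g, f, axis=1, equal_var=False)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m        # BH step-up thresholds
    passed = p[order] <= thresh
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    sig = np.zeros(m, bool)
    sig[order[:k]] = True                            # all genes up to rank k
    return p, sig
```

Real pipelines model counts with negative-binomial distributions and estimate dispersion across genes; the simple t-test here is only to make the group-versus-group comparison concrete.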
Learning stable and predictive structures in kinetic systems: Benefits of a causal approach
Learning kinetic systems from data is one of the core challenges in many
fields. Identifying stable models is essential for the generalization
capabilities of data-driven inference. We introduce a computationally efficient
framework, called CausalKinetiX, that identifies structure from discrete time,
noisy observations, generated from heterogeneous experiments. The algorithm
assumes the existence of an underlying, invariant kinetic model, a key
criterion for reproducible research. Results on both simulated and real-world
examples suggest that learning the structure of kinetic systems benefits from a
causal perspective. The identified variables and models allow for a concise
description of the dynamics across multiple experimental settings and can be
used for prediction in unseen experiments. We observe significant improvements
compared to well-established approaches that focus solely on predictive
performance, especially for out-of-sample generalization.
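The invariance idea can be caricatured in a few lines: score a candidate predictor set by how well a single pooled fit transfers to each experimental environment. This is only a linear toy stand-in, not CausalKinetiX itself, which targets noisy ODE-based kinetic models.

```python
import numpy as np

def invariance_score(X, y, env, model_vars):
    """Fit one pooled linear model on the candidate variables, then score
    it by its worst per-environment mean squared error. Invariant (causal)
    predictor sets transfer across environments, so lower is better."""
    A = np.column_stack([X[:, model_vars], np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # pooled fit
    score = 0.0
    for e in np.unique(env):
        m = env == e
        resid = y[m] - A[m] @ beta
        score = max(score, np.mean(resid ** 2))      # worst-environment error
    return score
```

A variable whose relation to the response flips between environments may predict well in-sample but gets a poor (large) invariance score, which is exactly the failure mode a purely predictive criterion misses.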
Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?
The organization and mining of malaria genomic and post-genomic data is
highly motivated by the necessity to predict and characterize new biological
targets and new drugs. Biological targets are sought in a biological space
designed from the genomic data of Plasmodium falciparum, but also drawing on the
millions of genomic records available from other species. Drug candidates are sought in a
chemical space containing the millions of small molecules stored in public and
private chemolibraries. Data management should therefore be as reliable and
versatile as possible. In this context, we examined five aspects of the
organization and mining of malaria genomic and post-genomic data: 1) the
comparison of protein sequences including compositionally atypical malaria
sequences, 2) the high throughput reconstruction of molecular phylogenies, 3)
the representation of biological processes particularly metabolic pathways, 4)
the versatile methods to integrate genomic data, biological representations and
functional profiling obtained from X-omic experiments after drug treatments and
5) the determination and prediction of protein structures and their molecular
docking with drug candidate structures. Progress toward a grid-enabled
chemogenomic knowledge space is discussed.
Comment: 43 pages, 4 figures, to appear in Malaria Journal.
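As a small example of point 1 (handling compositionally atypical sequences), one way to flag a protein whose amino-acid composition deviates from a background distribution is the KL divergence of its composition from that background. The function and background model here are illustrative, not taken from the paper.

```python
import math
from collections import Counter

def composition_bias(seq, background):
    """KL divergence of a sequence's amino-acid composition from a
    background distribution. Higher values flag compositionally atypical
    (e.g. low-complexity) sequences, which are common in P. falciparum
    and can confound standard sequence-comparison scores."""
    counts = Counter(seq)
    n = len(seq)
    kl = 0.0
    for aa, p_bg in background.items():
        p = counts.get(aa, 0) / n
        if p > 0:
            kl += p * math.log(p / p_bg)
    return kl
```

In practice the background would be estimated from a large protein database, and sequences above a bias threshold would be masked or compared with composition-adjusted scoring.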