Interpretable Machine Learning: Applications in Biology and Genomics
Machine learning (ML) and deep learning (DL) models impact our daily lives with applications in natural language modeling, image analysis, healthcare, genomics, and bioinformatics. The exponential growth of biological sequence data necessitates accompanying advances in computational methods. Although deep learning is highly effective for detecting and classifying biological sequences, challenges remain in extracting meaningful patterns and information from the learned models. To realize the potential of deep learning in biology, we need to develop strategies for model interpretation to reveal or further clarify biological principles. In this thesis, we first present problems and methods to classify patterns in biological sequence data. Next, we describe a series of techniques we developed to understand the machine learning models and identify meaningful biological patterns. For each problem we created an interpretable, intelligent system without sacrificing performance. To test our approaches for model interpretation, we first focused our analysis on known biological patterns, and then extended the search beyond what is known. This work can be categorized into four different applications: I) The development of bpRNA, a novel annotation tool capable of parsing RNA secondary structures. bpRNA is a richly annotated database that contains over 100,000 structures from seven different sources along with base pairing information. II) The detection of pseudoknots from sequence data alone with a machine learning model, Pseudoknow. As one of the most common RNA structural motifs, pseudoknots are crucial for RNA regulation. Improving the prediction of RNA pseudoknot structure will allow for better understanding of how RNA structure informs regulation and metabolism. III) Classification from gene expression data using stacked denoising autoencoders (SDAE) to distinguish healthy cells from cancerous ones, and to predict post-mortem time-of-death.
These classification methods were developed with the goal of identifying the genes that are most informative for prediction and hence most biologically relevant. Our study suggests that the most influential genes from the dimensionality reduction performed by SDAE were highly predictive of cancerous vs non-cancerous cell type. IV) Interpretation of the rules learned by a deep convolutional neural network to recognize known and previously uncharacterized core promoter sequence motifs from human whole-genome sequences. We proposed and compared new training strategies to identify transcription start sites (TSS), located within core promoters, from biological sequences. The main goal of this application was to develop new strategies to interpret how the convolutional neural network learns biological patterns, and to understand the correlations between and within the convolutional layers. These new techniques could aid in deriving unknown patterns in biology and genomics and are applicable more broadly to other areas of data science.
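As a rough illustration of what parsing an RNA secondary structure and flagging a pseudoknot can involve, the sketch below reads a structure written in dot-bracket notation and checks for crossing base pairs. The dot-bracket input format, the function names `parse_dot_bracket` and `has_pseudoknot`, and the crossing-pair criterion are assumptions made for this example. It is not the bpRNA implementation, and it is unrelated to the machine learning model that predicts pseudoknots from sequence alone.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of (i, j) base pairs in a dot-bracket string.

    Assumption: four bracket families may be used, and any other
    character (for example '.') marks an unpaired position.
    """
    openers = {"(": ")", "[": "]", "{": "}", "<": ">"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    pairs = []
    for pos, ch in enumerate(structure):
        if ch in openers:
            stacks[ch].append(pos)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {pos}")
            pairs.append((stack.pop(), pos))
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unclosed {opener!r} at position {stack[-1]}")
    return sorted(pairs)


def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l."""
    ordered = sorted(pairs)
    for idx, (i, j) in enumerate(ordered):
        for k, l in ordered[idx + 1:]:
            if i < k < j < l:
                return True
    return False


nested = parse_dot_bracket("((..))")
knotted = parse_dot_bracket("((..[[..))..]]")
print(has_pseudoknot(nested), has_pseudoknot(knotted))
```

The quadratic pair comparison keeps the example short. A real annotation tool would likely need a faster check and a richer description of each structural element.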
Subseries Join and Compression of Time Series Data Based on Non-uniform Segmentation
A time series is composed of a sequence of data items that are measured at uniform intervals. Many application areas generate or manipulate time series, including finance, medicine, digital audio, and motion capture. Efficiently searching a large time series database is still a challenging problem, especially when partial or subseries matches are needed.
This thesis proposes a new definition of subseries join, a symmetric generalization of subseries matching, which finds similar subseries in two or more time series datasets. A solution is proposed to compute the subseries join based on a hierarchical feature representation. This hierarchical feature representation is generated by an anisotropic diffusion scale-space analysis and a non-uniform segmentation method. Each segment is represented by a minimal polynomial envelope in a reduced-dimensionality space. Based on the hierarchical feature representation, all features in a dataset are indexed in an R-tree, and candidate matching features of two datasets are found by an R-tree join operation. Given candidate matching features, a dynamic programming algorithm is developed to compute the final subseries join. To improve storage efficiency, a hierarchical compression scheme is proposed to compress features. The minimal polynomial envelope representation is transformed to a Bezier spline envelope representation. The control points of each Bezier spline are then hierarchically differenced, and arithmetic coding is used to compress these differences.
To empirically evaluate their effectiveness, the proposed subseries join and compression techniques are tested on various publicly available datasets. A large motion capture database is also used to verify the techniques in a real-world application. The experiments show that the proposed subseries join technique can better tolerate noise and local scaling than previous work, and the proposed compression technique can also achieve about 85% higher compression rates than previous work with the same distortion error.
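To make the idea of a subseries join concrete, the following brute-force sketch reports every pair of equal-length windows, one from each series, whose Euclidean distance falls within a threshold. It is only a naive baseline under stated assumptions: the fixed window length, the Euclidean distance, and the names `naive_subseries_join`, `window`, and `eps` are all hypothetical. It does not use the hierarchical features, R-tree join, or dynamic programming described in the abstract.

```python
from math import sqrt


def naive_subseries_join(a, b, window, eps):
    """Return (i, j, distance) for each pair of windows a[i:i+window] and
    b[j:j+window] whose Euclidean distance is at most eps.

    Brute force: every window of `a` is compared with every window of `b`.
    """
    matches = []
    for i in range(len(a) - window + 1):
        sub_a = a[i:i + window]
        for j in range(len(b) - window + 1):
            sub_b = b[j:j + window]
            dist = sqrt(sum((x - y) ** 2 for x, y in zip(sub_a, sub_b)))
            if dist <= eps:
                matches.append((i, j, dist))
    return matches


series_a = [0, 1, 2, 3, 0, 0]
series_b = [9, 9, 1, 2, 3, 9]
print(naive_subseries_join(series_a, series_b, window=3, eps=0.1))
```

Because the cost grows with the product of the two series lengths, a baseline like this mainly shows why indexing candidate features first is attractive for large databases.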
Improving Programming Support for Hardware Accelerators Through Automata Processing Abstractions
The adoption of hardware accelerators, such as Field-Programmable Gate Arrays,
into general-purpose computation pipelines continues to rise, driven by recent
trends in data collection and analysis as well as pressure from challenging
physical design constraints in hardware. The architectural designs of many of
these accelerators stand in stark contrast to the traditional von Neumann model
of CPUs. Consequently, existing programming languages, maintenance tools, and
techniques are not directly applicable to these devices, meaning that additional
architectural knowledge is required for effective programming and configuration.
Current programming models and techniques are akin to assembly-level programming
on a CPU, thus placing significant burden on developers tasked with using these
architectures. Because programming is currently performed at such low levels of
abstraction, the software development process is tedious and challenging and
hinders the adoption of hardware accelerators.
This dissertation explores the thesis that theoretical finite automata provide a
suitable abstraction for bridging the gap between high-level programming models
and maintenance tools familiar to developers and the low-level hardware
representations that enable high-performance execution on hardware accelerators.
We adopt a principled hardware/software co-design methodology to develop a
programming model providing the key properties that we observe are necessary for success,
namely performance and scalability, ease of use, expressive power, and legacy
support.
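As a small illustration of the finite-automata abstraction for pattern searching, the sketch below simulates a nondeterministic finite automaton over an input stream and reports the offsets at which a match ends. The function name `run_nfa`, the dictionary encoding of transitions, and the hand-built automaton for the pattern `ab*c` are assumptions made for this example. They are not the dissertation's language, compiler, or hardware architecture.

```python
def run_nfa(transitions, start, accepting, stream):
    """Simulate an NFA over `stream` and return the offsets where an
    accepting state is active.

    `transitions` maps (state, symbol) to a set of next states. The start
    state is re-activated at every offset so that a match may begin
    anywhere in the stream.
    """
    active = set()
    reports = []
    for offset, symbol in enumerate(stream):
        active.add(start)
        next_active = set()
        for state in active:
            next_active |= transitions.get((state, symbol), set())
        active = next_active
        if active & accepting:
            reports.append(offset)
    return reports


# Hypothetical hand-built automaton for the pattern "ab*c".
AB_STAR_C = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, "c"): {2},
}

print(run_nfa(AB_STAR_C, start=0, accepting={2}, stream="xabbcacab"))
```

Keeping a set of simultaneously active states mirrors how an automaton can follow many partial matches at once. Software does this with a loop over states, which is one reason parallel hardware is appealing for this workload.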
First, we develop a framework that allows developers to port existing, legacy
code to run on hardware accelerators by leveraging automata learning algorithms
in a novel composition with software verification, string solvers, and
high-performance automata architectures. Next, we design a domain-specific
programming language to aid programmers writing pattern-searching algorithms and
develop compilation algorithms to produce finite automata, which support
efficient execution on a wide variety of processing architectures. Then, we
develop an interactive debugger for our new language, which allows developers to
accurately identify the locations of bugs in software while maintaining support
for high-throughput data processing. Finally, we develop two new
automata-derived accelerator architectures to support additional applications,
including the detection of security attacks and the parsing of recursive and
tree-structured data. Using empirical studies, logical reasoning, and
statistical analyses, we demonstrate that our prototype artifacts scale to
real-world applications, maintain manageable overheads, and support developers'
use of hardware accelerators. Collectively, the research efforts detailed in
this dissertation help ease the adoption and use of hardware accelerators for
data analysis applications, while supporting high-performance computation.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/155224/1/angstadt_1.pd
Collective analog bioelectronic computation
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student submitted PDF version of thesis. Includes bibliographical references (p. 677-710).
In this thesis, I present two examples of fast-and-highly-parallel analog computation inspired by architectures in biology. The first example, an RF cochlea, maps the partial differential equations that describe fluid-membrane-hair-cell wave propagation in the biological cochlea to an equivalent inductor-capacitor-transistor integrated circuit. It allows ultra-broadband spectrum analysis of RF signals to be performed in a rapid low-power fashion, thus enabling applications for universal or software radio. The second example exploits detailed similarities between the equations that describe chemical-reaction dynamics and the equations that describe subthreshold current flow in transistors to create fast-and-highly-parallel integrated-circuit models of protein-protein and gene-protein networks inside a cell. Due to a natural mapping between the Poisson statistics of molecular flows in a chemical reaction and Poisson statistics of electronic current flow in a transistor, stochastic effects are automatically incorporated into the circuit architecture, allowing highly computationally intensive stochastic simulations of large-scale biochemical reaction networks to be performed rapidly. I show that the exponentially tapered transmission-line architecture of the mammalian cochlea performs constant-fractional-bandwidth spectrum analysis with O(N) expenditure of both analysis time and hardware, where N is the number of analyzed frequency bins.
This is the best known performance of any spectrum-analysis architecture, including the constant-resolution Fast Fourier Transform (FFT), which scales as O(N log N), or a constant-fractional-bandwidth filterbank, which scales as O(N^2). The RF cochlea uses this bio-inspired architecture to perform real-time, on-chip spectrum analysis at radio frequencies. I demonstrate two cochlea chips, implemented in standard 0.13 µm CMOS technology, that decompose the RF spectrum from 600 MHz to 8 GHz into 50 log-spaced channels, consume < 300 mW of power, and possess 70 dB of dynamic range. The real-time spectrum analysis capabilities of my chips make them uniquely suitable for ultra-broadband universal or software radio receivers of the future. I show that the protein-protein and gene-protein chips that I have built are particularly suitable for simulation, parameter discovery and sensitivity analysis of interaction networks in cell biology, such as signaling, metabolic, and gene regulation pathways. Importantly, the chips carry out massively parallel computations, resulting in simulation times that are independent of model complexity, i.e., O(1). They also automatically model stochastic effects, which are of importance in many biological systems, but are numerically stiff and simulate slowly on digital computers. Currently, non-fundamental data-acquisition limitations show that my proof-of-concept chips simulate small-scale biochemical reaction networks at least 100 times faster than modern desktop machines. It should be possible to get 10^3 to 10^6 simulation speedups of genome-scale and organ-scale intracellular and extracellular biochemical reaction networks with improved versions of my chips. Such chips could be important both as analysis tools in systems biology and design tools in synthetic biology.
by Soumyajit Mandal. Ph.D.
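The scaling comparison stated in the abstract above can be made concrete with a short calculation of the leading-order growth terms for a given number of frequency bins N. This is only a sketch under assumptions: the function name `scaling_terms` is hypothetical, a base-2 logarithm is assumed for the FFT term, and constant factors are ignored, so the numbers show relative growth and not measured cost.

```python
from math import log2


def scaling_terms(n):
    """Return leading-order growth terms for n frequency bins.

    Constants and units are ignored; only the asymptotic shape is shown.
    """
    return {
        "cochlea O(N)": n,
        "FFT O(N log N)": n * log2(n),
        "filterbank O(N^2)": n ** 2,
    }


for bins in (64, 1024):
    print(bins, scaling_terms(bins))
```

Even at a modest bin count the quadratic term dominates, which is the gap the abstract points to when comparing the cochlea architecture with a constant-fractional-bandwidth filterbank.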
Vegetable Crops
In ancient times, people benefited from ingesting different parts of various weeds (root, stem, shoot, leaf, flower, fruit, seed, etc.) to maintain a healthy life. The vegetables we grow today were obtained by successfully cultivating these weeds. This book explains the health benefits of vegetable crops, organic vegetable growing, greenhouse management, and principles of irrigation management for vegetable crops.
An investigation into the role and mechanism of action of small ubiquitin-like modifier interacting motifs in Arabidopsis thaliana proteins
SUMO is a small protein that is ligated to other proteins to regulate their function. Ligation occurs at lysine residues within a SUMO site motif. A wide range of proteins are targets of SUMOylation and in plants SUMO plays a diverse role in many important processes. Processes including development, stress tolerance, hormone regulation, DNA repair and chromatin remodelling are regulated by SUMOylation.
SUMO affects protein function primarily by establishing interactions through SUMO interacting motifs (SIMs) in interacting protein partners. SUMO can also alter protein function by blocking access to protein domains and by causing conformational changes to the target. The ability to predict SIMs in plant proteins would be useful for research into the poorly understood mechanisms behind SUMO regulation. Large arrays of synthetic peptides were screened with SUMO to identify SIM peptides.
These data were used to characterise the sequence composition of plant SIMs. The plant SIMs were compared and contrasted with human SIMs to highlight the functional differences between these two evolutionarily distinct species. The data were used to build a predictor for SIMs using random forest models. A new SUMO site predictor was built using random forest models as well. The SIM predictor was used to identify putative SIM-containing proteins in the Arabidopsis thaliana genome and the functional enrichment of these genes was analysed. The role of SUMO in the plant gibberellin (GA) pathway was also investigated. The DELLA protein RGA is a negative regulator of GA signalling and this protein was shown to be SUMOylated. RGA stability is regulated by the GA receptor GID1 and it was demonstrated that GID1a contains a SIM. It was proposed that SUMOylated RGA interacted with GID1a through its SIM, which inhibited its function. The model was tested by investigating the binding of SUMO to GID1a and by generating mutants of GID1a that had reduced SUMO affinity. The results demonstrate that GA signalling can be enhanced by introducing a mutation into the GID1a SIM.
Dipterocarps protected by Jering local wisdom in Jering Menduyung Nature Recreational Park, Bangka Island, Indonesia
Despite the expansion of oil palm plantations, the Jering Menduyung Nature Recreational Park has relatively diverse plants. The 3,538 ha park is located in the northwest of Bangka Island, Indonesia. The minimum area from the species-area curve was 0.82 ha, which is just below the 1.2 ha of the Dalil conservation forest but much higher than the 0.2 ha measured in several secondary forests on the island. The plot is inhabited by more than 50 plant species. Among the 22 tree species, there are 40 individual poles with an average diameter of 15.3 cm and 64 individual trees with an average diameter of 48.9 cm. The density of Dipterocarpus grandiflorus (Blanco) Blanco, or kruing, is 20.7 individuals/ha, with diameters ranging from 12.1 to 212.7 cm and an average diameter of 69.0 cm. The relatively intact park is supported by the local wisdom of the Jering tribe, one of the indigenous tribes on the island. The people regulate tree cutting, especially on the cape. The conservation agency designates the park as one of the sources of kruing propagules in the province. Growing oil palm plantations and declining adoption of local wisdom among the youth are challenges to forest conservation in the province, where tin mining activities have been the economic driver for decades. More outreach from the conservation agency and the involvement of university students in raising environmental awareness are needed.