Error Correction for DNA Sequencing via Disk Based Index and Box Queries
The vast increase in DNA sequencing capacity over the last decade has quickly turned biology into a data-intensive science. Nevertheless, current sequencers such as Illumina HiSeq have high random per-base error rates, which makes sequencing error correction an indispensable requirement for many sequence analysis applications. Most existing methods for error correction demand large, expensive memory space, which limits their scalability for handling large datasets. In this thesis, we introduce a new disk-based method, called DiskBQcor, for sequencing error correction. DiskBQcor stores k-mers for sequencing genome data along with their associated metadata in a disk-based index tree, called the BoND-tree, and uses the index to efficiently process specially designed box queries to obtain relevant k-mers and their occurrence frequencies. It takes an input read and locates the potential errors in the sequence. It then applies a comprehensive voting mechanism and, possibly, an efficient binary-encoding-based assembly technique to verify and correct an erroneous base in a genome sequence under various conditions. To overcome the drawback of an offline approach such as DiskBQcor, which wastes computing resources while DNA sequencing is in progress, we suggest an online approach to correcting sequencing errors. The online processing strategies and accuracy measures are discussed. An algorithm for deleting indexed k-mers from the BoND-tree, which is a stepping stone for online sequencing error correction, is also introduced. Our experiments demonstrate that the proposed methods are quite promising in error correction for sequencing genome data on disk. The resulting BoND-tree with correct k-mers can also be used for sequence analysis applications such as variant detection. Master of Science, Computer and Information Science, College of Engineering and Computer Science, University of Michigan-Dearborn. https://deepblue.lib.umich.edu/bitstream/2027.42/136615/3/Thesis_YarongGu_519_corrected.pd
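To make the voting mechanism concrete, here is a minimal in-memory sketch of frequency-based k-mer voting for substitution errors. A plain dictionary of k-mer counts stands in for the BoND-tree and its box queries, and the function names, thresholds, and overall structure are illustrative assumptions, not the thesis implementation.

from collections import Counter

BASES = "ACGT"

def correct_read(read, kmer_counts, k=15, solid_threshold=3):
    """Correct likely substitution errors in `read` by voting over k-mer counts."""
    read = list(read)
    for pos in range(len(read)):
        # k-mers overlapping this position (clipped to the read boundaries).
        window = range(max(0, pos - k + 1), min(pos, len(read) - k) + 1)
        # A position is suspicious only if none of its overlapping k-mers is "solid".
        if any(kmer_counts.get("".join(read[s:s + k]), 0) >= solid_threshold
               for s in window):
            continue
        # Vote: for each substitution, sum the frequencies of the k-mers it would create.
        votes = Counter()
        for b in BASES:
            candidate = read[:pos] + [b] + read[pos + 1:]
            votes[b] = sum(kmer_counts.get("".join(candidate[s:s + k]), 0)
                           for s in window)
        best, score = votes.most_common(1)[0]
        if score >= solid_threshold:
            read[pos] = best
    return "".join(read)

# Tiny usage: a single substitution at position 3 is recovered from the trusted k-mer.
print(correct_read("ACGAA", {"ACGTA": 5}, k=5))   # -> "ACGTA"

In the disk-based setting, the count lookups over the window of overlapping k-mers would presumably be batched into box queries against the index rather than answered one k-mer at a time.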
From multiscale modeling to metamodeling of geomechanics problems
In numerical simulations of geomechanics problems, a grand challenge consists of overcoming the difficulties in making accurate and robust predictions by revealing the true mechanisms in particle interactions, fluid flow inside pore spaces, and the hydromechanical coupling effect between the solid and fluid constituents, from microscale to mesoscale to macroscale. While simulation tools incorporating subscale physics can provide detailed insights and accurate material properties to macroscale simulations via computational homogenization, these numerical simulations are often too computationally demanding to be directly used across multiple scales. Recent breakthroughs in Artificial Intelligence (AI) via machine learning have great potential to overcome these barriers, as evidenced by their success in many applications such as image recognition, natural language processing, and strategy exploration in games. AI can achieve super-human performance in a large number of applications and accomplish tasks that were previously thought infeasible given the limitations of humans and earlier computer algorithms. Yet machine learning approaches can also suffer from overfitting, lack of interpretability, and lack of reliability. Thus, applying machine learning to the generation of accurate and reliable surrogate constitutive models for geomaterials with multiscale and multiphysics behavior is not trivial. For this purpose, we propose to establish an integrated modeling process for automatically designing, training, validating, and falsifying constitutive models, or "metamodeling". This dissertation focuses on our efforts in laying down, step by step, the necessary theoretical and technical foundations for the multiscale metamodeling framework.
The first step is to develop multiscale hydromechanical homogenization frameworks for both bulk granular materials and granular interfaces, with their behaviors homogenized from subscale microstructural simulations. For efficient simulations of field-scale geomechanics problems across more than two scales, we develop a hybrid data-driven method designed to capture the multiscale hydro-mechanical coupling effect of porous media with pores of various sizes. By using sub-scale simulations to generate databases for training material models, an offline homogenization procedure replaces the up-scaling procedure to generate path-dependent cohesive laws for localized physical discontinuities at both grain and specimen scales.
To enable AI to take over the trial-and-error tasks in the constitutive modeling process, we introduce a novel "metamodeling" framework that employs both graph theory and deep reinforcement learning (DRL) to generate accurate, physics-compatible, and interpretable surrogate machine learning models. The process of writing constitutive models is simplified as a sequence of forming graph edges with the goal of maximizing the model score (a function of accuracy, robustness, and forward prediction quality). By using neural networks to estimate policies and state values, the computer agent is able to efficiently self-improve the constitutive models generated through self-play.
To overcome the obstacle of limited information in geomechanics, we improve the efficiency of experimental data usage with a multi-agent cooperative metamodeling framework that provides guidance on database generation and constitutive modeling at the same time. The modeler agent in the framework focuses on evaluating all modeling options (from domain experts' knowledge or machine learning) in a directed multigraph of elasto-plasticity theory, and finding the optimal path that links the source of the directed graph (e.g., strain history) to the target (e.g., stress). Meanwhile, the data agent focuses on collecting data from real or virtual experiments, interacts with the modeler agent sequentially, and generates the database for model calibration to optimize the prediction accuracy. Finally, we design a non-cooperative metamodeling framework that focuses on automatically developing strategies that simultaneously generate experimental data to calibrate model parameters and probe weaknesses of a known constitutive model, until the strengths and weaknesses of the constitutive law over its application range can be identified through competition. These tasks are enabled by a zero-sum reward system of the metamodeling game and robust adversarial reinforcement learning techniques.
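A toy sketch of the directed-multigraph view of modeling options follows; the nodes, edge options, and scores are hypothetical illustrations (not the dissertation's code), and the exhaustive path search merely stands in for the reinforcement-learning agent.

import networkx as nx

G = nx.MultiDiGraph()
# Each edge is one modeling option; `score` is assumed to come from
# calibration against the data agent's experiments (higher is better).
G.add_edge("strain", "elastic_strain", option="additive_split", score=0.9)
G.add_edge("elastic_strain", "stress", option="linear_elasticity", score=0.8)
G.add_edge("elastic_strain", "stress", option="neural_hyperelastic", score=0.7)
G.add_edge("strain", "stress", option="black_box_lstm", score=0.6)

def best_path(G, source="strain", target="stress"):
    """Enumerate simple paths and keep the one with the highest total score."""
    best, best_score = None, float("-inf")
    for path in nx.all_simple_paths(G, source, target):
        # For a multigraph, pick the best-scoring parallel edge on each hop.
        score = sum(max(d["score"] for d in G[u][v].values())
                    for u, v in zip(path, path[1:]))
        if score > best_score:
            best, best_score = path, score
    return best, best_score

print(best_path(G))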
Threshold interval indexing techniques for complicated uncertain data
Uncertain data is an increasingly prevalent topic in database research, given the advance of instruments which inherently generate uncertainty in their data. In particular, the problem of indexing uncertain data for range queries has received considerable attention. To efficiently process range queries, existing approaches mainly focus on reducing the number of disk I/Os. However, due to the inherent complexity of uncertain data, processing a range query may incur high computational cost in addition to the I/O cost. In this paper, I present a novel indexing strategy for one-dimensional uncertain continuous data, called threshold interval indexing. Threshold interval indexing is able to balance I/O cost and computational cost to achieve optimal overall query performance. A key ingredient of the proposed indexing structure is a dynamic interval tree. The dynamic interval tree is much more resistant to skew than the R-trees widely used in other indexing structures. This interval tree optimizes pruning by storing x-bounds, or pre-calculated probability boundaries, at each node. In addition to the basic threshold interval index, I present two variants, called the strong threshold interval index and the hyper threshold interval index, which leverage x-bounds not only for pruning but also for accepting results. Furthermore, I present more efficient memory-loaded versions of these indexes, which reduce the storage size so that the primary interval tree can be loaded into memory. Each index description includes methods for querying, parallelizing, updating, bulk loading, and externalizing. I perform an extensive set of experiments to demonstrate the effectiveness and efficiency of the proposed indexing strategies.
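As a rough, in-memory illustration of a threshold range query over one-dimensional uncertain data, the sketch below assumes uniform pdfs on intervals and uses plain subtree extents in place of true probability x-bounds; the structure and names are ours, not the paper's.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UncertainValue:
    lo: float   # support of a uniform pdf on [lo, hi]
    hi: float

    def prob_in(self, a: float, b: float) -> float:
        """P(X in [a, b]) for the uniform distribution on [lo, hi]."""
        overlap = max(0.0, min(b, self.hi) - max(a, self.lo))
        return overlap / (self.hi - self.lo)

@dataclass
class Node:
    objects: List[UncertainValue]
    min_lo: float                  # simplest possible bounds: subtree extent
    max_hi: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def threshold_query(node, a, b, tau, out):
    """Collect objects whose probability of lying in [a, b] is at least tau."""
    if node is None:
        return
    # Prune: nothing in this subtree overlaps the query range at all.
    if node.max_hi < a or node.min_lo > b:
        return
    for obj in node.objects:
        if obj.prob_in(a, b) >= tau:
            out.append(obj)
    threshold_query(node.left, a, b, tau, out)
    threshold_query(node.right, a, b, tau, out)

root = Node(objects=[UncertainValue(0.0, 2.0), UncertainValue(5.0, 6.0)], min_lo=0.0, max_hi=6.0)
hits = []
threshold_query(root, 1.0, 3.0, 0.4, hits)   # UncertainValue(0.0, 2.0) qualifies with P = 0.5
print(hits)

A real x-bound would additionally store, per node, coordinates beyond which no object in the subtree can reach a given probability threshold, enabling both stronger pruning and the wholesale acceptance of subtrees described in the abstract.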
Master's Theses in Automatic Control 1981-1982
The report contains abstracts of Master's theses (examensarbeten) completed at the Department of Automatic Control, Lund, during the academic year 1981/82. During this year, 30 theses were completed by 40 students. Most of the theses are written in Swedish with an English abstract.
Design and Implementation of a Middleware for Uniform, Federated and Dynamic Event Processing
In recent years, real-time processing of massive event streams has become an important topic in the area of data analytics. It will become even more important in the future due to cheap sensors, a growing number of devices, and their ubiquitous inter-connection, also known as the Internet of Things (IoT). Academia, industry, and the open source community have developed several event processing (EP) systems that allow users to define, manage, and execute continuous queries over event streams. They achieve significantly better performance than the traditional "store-then-process" approach, in which events are first stored and indexed in a database. Because EP systems have different roots and because of the lack of standardization, the system landscape has become highly heterogeneous. Today's EP systems differ in APIs, execution behaviors, and query languages. This thesis presents the design and implementation of a novel middleware that abstracts from different EP systems and provides a uniform API, execution behavior, and query language to users and developers. As a consequence, the presented middleware overcomes the problem of vendor lock-in, and different EP systems are enabled to cooperate with each other. In practice, event streams differ dramatically in volume and velocity. We therefore show how the middleware can connect not only to different EP systems, but also to database systems and a native implementation. Emerging applications such as the IoT raise novel challenges and require EP to be more dynamic. We present extensions to the middleware that enable the self-adaptivity needed in context-sensitive applications and in those that deal with constantly varying sets of event producers and consumers. Lastly, we extend the middleware to fully support the processing of events containing spatial data and to run distributed in the form of a federation of heterogeneous EP systems.
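The adapter idea behind such a middleware can be sketched as follows; class and method names here are hypothetical illustrations of a uniform EP facade, not the middleware's actual API, and the "query language" is reduced to a Python expression for brevity.

from abc import ABC, abstractmethod
from typing import Callable, Dict, Any

Event = Dict[str, Any]

class EPAdapter(ABC):
    """Uniform facade that each engine-specific adapter implements."""
    @abstractmethod
    def register_query(self, query: str, on_result: Callable[[Event], None]) -> str: ...
    @abstractmethod
    def publish(self, stream: str, event: Event) -> None: ...
    @abstractmethod
    def remove_query(self, query_id: str) -> None: ...

class InMemoryAdapter(EPAdapter):
    """Trivial native implementation: one filter predicate per query."""
    def __init__(self):
        self._queries = {}

    def register_query(self, query, on_result):
        # `query` is a Python expression over the event, standing in for a real
        # continuous query language.
        qid = f"q{len(self._queries)}"
        self._queries[qid] = (compile(query, "<query>", "eval"), on_result)
        return qid

    def publish(self, stream, event):
        for predicate, on_result in self._queries.values():
            if eval(predicate, {}, {"stream": stream, "event": event}):
                on_result(event)

    def remove_query(self, query_id):
        self._queries.pop(query_id, None)

# Usage: the same client code would run unchanged against an adapter
# wrapping any other EP system or a database-backed engine.
mw = InMemoryAdapter()
mw.register_query("stream == 'temperature' and event['value'] > 30",
                  lambda e: print("alert:", e))
mw.publish("temperature", {"value": 31.5})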
Scalable domain decomposition methods for finite element approximations of transient and electromagnetic problems
The main object of study of this thesis is the development of scalable and robust solvers based on domain decomposition (DD) methods for the linear systems arising from the finite element (FE) discretization of transient and electromagnetic problems.
The thesis commences with a theoretical review of the curl-conforming edge (or Nédélec) FEs of the first kind and a comprehensive description of a general implementation strategy for h- and p-adaptive elements of arbitrary order on tetrahedral and hexahedral non-conforming meshes. Then, a novel balancing domain decomposition by constraints (BDDC) preconditioner that is robust for multi-material and/or heterogeneous problems posed in curl-conforming spaces is presented. The new method, in contrast to existing approaches, is based on the definition of the ingredients of the preconditioner according to the physical coefficients of the problem and does not require spectral information. The result is a robust and highly scalable preconditioner that preserves the simplicity of the original BDDC method.
When dealing with transient problems, the time direction itself offers an opportunity for further parallelization. Aiming to design scalable space-time solvers, first, parallel-in-time methods for linear and non-linear ordinary differential equations (ODEs) are proposed, based on efficient (non-linear) Schur complement solvers for a multilevel partition of the time interval. Then, these ideas are combined with DD concepts in order to design a two-level preconditioner as an extension of the BDDC method to space-time. The key ingredients of these new methods are defined such that they preserve time causality, i.e., information only travels from the past to the future. The proposed schemes are weakly scalable in time and space-time, i.e., one can efficiently exploit increasing computational resources to solve more time steps in (approximately) the same time-to-solution.
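To illustrate the time-interval Schur complement idea in the simplest linear setting (notation ours, not the thesis'), consider $\dot{u} = \mathcal{A}u + f$ discretized by backward Euler,

$(I - \Delta t\,\mathcal{A})\,u_{n+1} = u_n + \Delta t\,f_{n+1}, \qquad n = 0,\dots,N-1,$

whose all-at-once system is block lower bidiagonal. Grouping the steps into subintervals and reordering the unknowns into interior values $u_I$ and interface values $u_\Gamma$ (the last step of each subinterval) gives

$\begin{pmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix} \begin{pmatrix} u_I \\ u_\Gamma \end{pmatrix} = \begin{pmatrix} f_I \\ f_\Gamma \end{pmatrix}, \qquad S = A_{\Gamma\Gamma} - A_{\Gamma I}A_{II}^{-1}A_{I\Gamma}.$

Because $A_{II}$ is block diagonal across subintervals (one lower-bidiagonal block per subinterval), its inverse can be applied concurrently, and the Schur complement $S$ acting on the interface values remains lower bidiagonal, so information still flows only from past to future.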
All the developments presented herein are motivated by the driving application of the thesis, the 3D simulation of the low-frequency electromagnetic response of High Temperature Superconductors (HTS). Throughout the document, an exhaustive set of numerical experiments, which includes the simulation of a realistic 3D HTS problem, is performed in order to validate the suitability and assess the parallel performance of the High Performance Computing (HPC) implementation of the proposed algorithms.
Performance assessment of porous asphalt for stormwater treatment
The objective of this study is to assess the stormwater management capability of porous asphalt (PA) as a parking lot in Durham, New Hampshire. The site was constructed in 2004 and consists of a 4-inch PA pavement course overlying a porous media reservoir, including a fine-grained filter course. Cost per PA parking space ($2,000). Pavement durability has been adequate. The lot retained or infiltrated 25% of precipitation (18 months) and 0.03 lb/sf of chloride (1 year). There was high initial surface infiltration capacity (IC) at two locations (1,000–1,300 in/hr) and moderate IC at a third (330 in/hr). Two locations had a 50% decrease in IC. Frost penetration occurred to 1 foot. Storm-event hydraulic performance was exceptional. Storm-event water quality treatment for diesel, zinc, and TSS was excellent; for chloride and nitrate it was negative, and for phosphorus, limited.
Scaling kNN queries using statistical learning
The k-Nearest Neighbour (kNN) method is a fundamental building block for many sophisticated statistical learning models and has wide application in different fields, for instance in kNN regression, kNN classification, multi-dimensional item search, location-based services, spatial analytics, etc.
However, the unprecedented spread of data generated by computing and communication devices has resulted in a plethora of low-dimensional large-scale datasets and user communities, and the need for efficient and scalable kNN processing is now pressing. To this end, several parallel and distributed approaches and methodologies for processing exact kNN in low-dimensional large-scale datasets have been proposed, for example Hadoop-MapReduce-based kNN query processing approaches such as Spatial-Hadoop (SHadoop), and Spark-based approaches such as Simba. This thesis contributes a variety of methodologies for kNN query processing based on statistical and machine learning techniques over large-scale datasets.
This study investigates the exact kNN query performance behaviour of the well-known Big Data systems SHadoop and Simba, which build multi-dimensional Global and Local Indexes over low-dimensional large-scale datasets. The rationale behind such methods is that, when executing an exact kNN query, the Global and Local Indexes access only a small subset of a large-scale dataset stored in a distributed file system. The Global Index is used to prune out irrelevant subsets of the dataset, while the multiple distributed Local Indexes are used to prune out unnecessary data elements of a partition (subset).
The kNN execution algorithms of SHadoop and Simba involve loading the data elements that reside in the relevant partitions from disks/network points into memory. This leads to significantly high kNN query response times, so such methods are not suitable for low-latency applications and services. An extensive literature review showed that not enough attention has been given to retrieving only the relatively small amount of data that a kNN query actually needs. Based on this limitation, and departing from traditional kNN query processing methods, this thesis contributes two novel solutions: the Coordinator With Index (COWI) and Coordinator With No Index (CONI) approaches. The essence of both approaches rests on adopting a coordinator-based distributed processing algorithm and a way to structure computation and index the stored datasets that ensures that only a very small number of pieces of data are retrieved from the underlying data centres, communicated over the network, and processed by the coordinator for every kNN query. The expected outcome is that scalability is ensured and kNN queries can be processed in just tens of milliseconds. Both approaches are implemented using a NoSQL database (HBase), achieving up to three orders of magnitude of performance gain compared with the state-of-the-art methods SHadoop and Simba.
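A highly simplified, single-process sketch of the coordinator idea follows. The per-partition summaries, the pruning rule, and all names are assumptions made for illustration; they do not reproduce COWI/CONI or their exactness guarantees.

import heapq, math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class Coordinator:
    def __init__(self, partitions):
        # `partitions` maps a partition id to the points held by a data node.
        self.partitions = partitions
        self.summary = {pid: self._summarize(pts) for pid, pts in partitions.items()}

    @staticmethod
    def _summarize(pts):
        d = len(pts[0])
        centroid = tuple(sum(p[i] for p in pts) / len(pts) for i in range(d))
        radius = max(dist(centroid, p) for p in pts)
        return centroid, radius

    def knn(self, q, k, max_partitions=2):
        # Rank partitions by a lower bound on their distance to the query,
        # then fetch candidates only from the most promising ones.
        ranked = sorted(self.partitions,
                        key=lambda pid: max(0.0, dist(q, self.summary[pid][0]) - self.summary[pid][1]))
        candidates = []
        for pid in ranked[:max_partitions]:   # only these would cross the network
            candidates.extend(self.partitions[pid])
        return heapq.nsmallest(k, candidates, key=lambda p: dist(q, p))

parts = {"p0": [(0.1, 0.2), (0.3, 0.1)], "p1": [(5.0, 5.1), (5.2, 4.9)]}
print(Coordinator(parts).knn((0.0, 0.0), k=2))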
It is common practice that current state-of-the-art approaches for exact kNN query processing in low-dimensional space use tree-based multi-dimensional indexing methods to prune out irrelevant data during query processing. However, as data sizes continue to increase (nowadays it is not uncommon to reach several petabytes), the storage cost of tree-based index methods becomes exceptionally high, especially when a dataset is partitioned into smaller chunks. In this context, this thesis contributes a novel perspective on how to organise low-dimensional large-scale datasets based on data space transformations, deriving a Space Transformation Organisation Structure (STOS). STOS facilitates kNN query processing as if the underlying datasets were uniformly distributed in the space. Such an approach bears significant advantages: first, STOS enjoys a minute memory footprint that is many orders of magnitude smaller than the index-based approaches found in the literature. Second, the required memory for such metadata over large-scale datasets, unlike related work, increases very slowly with dataset size; hence, STOS enjoys significantly higher scalability. Third, STOS is relatively efficient to compute, outperforming traditional multivariate index building times, with comparable, if not better, query response times.
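One way to picture a distribution-aware space transformation (our guess at the general flavour, not the actual STOS construction): transform each coordinate by its empirical marginal CDF so the data become roughly uniform per dimension, after which a small regular grid over the unit cube yields balanced cells and the quantile tables are the only metadata the query processor keeps.

import bisect, math

class CDFGrid:
    """Hypothetical sketch of CDF-based organisation for kNN search."""
    def __init__(self, points, cells_per_dim=8, sample=1000):
        self.dims = len(points[0])
        self.cells = cells_per_dim
        # Per-dimension quantile tables: tiny metadata, largely independent of n.
        step = max(1, len(points) // sample)
        self.quantiles = [sorted(p[d] for p in points[::step]) for d in range(self.dims)]
        self.buckets = {}
        for p in points:
            self.buckets.setdefault(self.cell(p), []).append(p)

    def transform(self, p):
        # Empirical marginal CDF per coordinate -> roughly uniform in [0, 1].
        return tuple(bisect.bisect_left(self.quantiles[d], p[d]) / len(self.quantiles[d])
                     for d in range(self.dims))

    def cell(self, p):
        return tuple(min(self.cells - 1, int(x * self.cells)) for x in self.transform(p))

    def knn(self, q, k):
        # Probe the query's cell only; a real method would expand to
        # neighbouring cells until k results are guaranteed.
        cand = self.buckets.get(self.cell(q), [])
        return sorted(cand, key=lambda p: math.dist(p, q))[:k]

pts = [(i * 0.1, (i * 37 % 100) / 100) for i in range(100)]
print(CDFGrid(pts).knn((0.5, 0.5), k=3))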
In the literature, exact kNN queries over large-scale datasets have been limited to low-dimensional spaces, because the query response time and memory requirements of tree-based index methods grow with dimension. Unable to overcome this exponential dependency on the dimension, researchers assume that no efficient solution exists and propose approximate kNN in high-dimensional space. Unlike the approximate kNN query, which tries to retrieve approximate nearest neighbours from large-scale datasets, this thesis proposes a new type of kNN query referred to as the "estimated kNN query". The estimated kNN query processing methodology attempts to estimate the nearest neighbours based on the marginal cumulative distributions of the underlying data using statistical copulas. This thesis showcases the performance trade-off between exact kNN and estimated kNN queries in terms of estimation error and scalability. In contrast, kNN regression predicts the value of a target variable based on kNN; but, particularly in a high-dimensional large-scale dataset, the query response time of kNN regression can be significantly high due to the curse of dimensionality. In an effort to tackle this issue, a new probabilistic kNN regression method is proposed. The proposed method statistically predicts the values of the target variable of the kNN without computing distances.
In a different context, kNN as a missing-value imputation algorithm in high-dimensional space is investigated within Pythia, a distributed/parallel missing-value imputation framework. In Pythia, a different way of indexing a high-dimensional large-scale dataset was proposed by the group (not the work of the author of this thesis); using such indexing methods, scaling out kNN in high-dimensional space is ensured. Pythia uses Adaptive Resonance Theory (ART), a machine learning clustering algorithm, to build a data digest (aka signatures) of large-scale datasets distributed across several data nodes. The major idea is that, given an input vector, Pythia predicts the most relevant data centres to involve in processing, for example, kNN. Pythia does not retrieve exact kNN. To this end, instead of accessing the entire dataset that resides in a data node, this thesis proposes accessing only the relevant clusters that reside in the appropriate data nodes. As we shall see later, such a method has accuracy comparable to that of the original design of Pythia but lower imputation time. Moreover, the imputation time does not grow significantly with the size of the dataset that resides in a data node or with the number of data nodes in Pythia. Furthermore, as Pythia depends utterly on the data digest built by ART to predict relevant data centres, this thesis investigates the performance of Pythia by comparing it against signatures constructed by a different clustering algorithm, Self-Organising Maps.
In this thesis, the performance advantages of the proposed approaches are substantiated and quantified via extensive experimentation with multi-dimensional real and synthetic datasets of different sizes and contexts.
An Assessment to Benchmark the Seismic Performance of a Code-Conforming Reinforced-Concrete Moment-Frame Building
This report describes a state-of-the-art performance-based earthquake engineering methodology
that is used to assess the seismic performance of a four-story reinforced concrete (RC) office
building that is generally representative of low-rise office buildings constructed in highly seismic
regions of California. This "benchmark" building is considered to be located at a site in the Los
Angeles basin, and it was designed with a ductile RC special moment-resisting frame as its
seismic lateral system that was designed according to modern building codes and standards. The
building's performance is quantified in terms of structural behavior up to collapse, structural and
nonstructural damage and associated repair costs, and the risk of fatalities and their associated
economic costs. To account for different building configurations that may be designed in
practice to meet requirements of building size and use, eight structural design alternatives are
used in the performance assessments.
Our performance assessments account for important sources of uncertainty in the ground
motion hazard, the structural response, structural and nonstructural damage, repair costs, and
life-safety risk. The ground motion hazard characterization employs a site-specific probabilistic
seismic hazard analysis and the evaluation of controlling seismic sources (through
disaggregation) at seven ground motion levels (encompassing return periods ranging from 7 to
2475 years). Innovative procedures for ground motion selection and scaling are used to develop
acceleration time history suites corresponding to each of the seven ground motion levels.
Structural modeling utilizes both "fiber" models and "plastic hinge" models. Structural
modeling uncertainties are investigated through comparison of these two modeling approaches,
and through variations in structural component modeling parameters (stiffness, deformation
capacity, degradation, etc.). Structural and nonstructural damage (fragility) models are based on
a combination of test data, observations from post-earthquake reconnaissance, and expert
opinion. Structural damage and repair costs are modeled for the RC beams, columns, and slab-column connections. Damage and associated repair costs are considered for some nonstructural
building components, including wallboard partitions, interior paint, exterior glazing, ceilings,
sprinkler systems, and elevators. The risk of casualties and the associated economic costs are
evaluated based on the risk of structural collapse, combined with recent models on earthquake
fatalities in collapsed buildings and accepted economic modeling guidelines for the value of
human life in loss and cost-benefit studies.
The principal results of this work pertain to the building collapse risk, damage and repair
cost, and life-safety risk. These are discussed successively as follows.
When accounting for uncertainties in structural modeling and record-to-record variability
(i.e., conditional on a specified ground shaking intensity), the structural collapse probabilities of
the various designs range from 2% to 7% for earthquake ground motions that have a 2%
probability of exceedance in 50 years (2475 years return period). When integrated with the
ground motion hazard for the southern California site, the collapse probabilities result in mean
annual frequencies of collapse in the range of [0.4 to 1.4]×10^-4 for the various benchmark
building designs. In the development of these results, we made the following observations that
are expected to be broadly applicable:
(1) The ground motions selected for performance simulations must consider spectral
shape (e.g., through use of the epsilon parameter) and should appropriately account for
correlations between motions in both horizontal directions;
(2) Lower-bound component models, which are commonly used in performance-based
assessment procedures such as FEMA 356, can significantly bias collapse analysis results; it is
more appropriate to use median component behavior, including all aspects of the component
model (strength, stiffness, deformation capacity, cyclic deterioration, etc.);
(3) Structural modeling uncertainties related to component deformation capacity and
post-peak degrading stiffness can impact the variability of calculated collapse probabilities and
mean annual rates to a similar degree as record-to-record variability of ground motions.
Therefore, including the effects of such structural modeling uncertainties significantly increases
the mean annual collapse rates. We found this increase to be roughly four to eight times relative
to rates evaluated for the median structural model;
(4) Nonlinear response analyses revealed at least six distinct collapse mechanisms, the
most common of which was a story mechanism in the third story (differing from the multi-story
mechanism predicted by nonlinear static pushover analysis);
(5) Soil-foundation-structure interaction effects did not significantly affect the structural
response, which was expected given the relatively flexible superstructure and stiff soils.
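For context, the mean annual collapse frequencies quoted above follow from the standard PBEE hazard integration (notation ours): $\lambda_{\mathrm{collapse}} = \int_0^{\infty} P(\mathrm{Collapse}\mid IM = x)\,\bigl|\mathrm{d}\lambda_{IM}(x)\bigr|$, where $P(\mathrm{Collapse}\mid IM = x)$ is the collapse fragility obtained from the nonlinear response analyses and $\lambda_{IM}(x)$ is the mean annual frequency of exceeding ground motion intensity $x$ from the site-specific hazard analysis.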
The potential for financial loss is considerable. Overall, the calculated expected annual losses (EAL) are in the range of [...] $97,000 for the various code-conforming benchmark building designs, or roughly 1% of the replacement cost of the building ([...] $3.5M, the fatality rate translates to an EAL due to fatalities of [...] $5,600 for the code-conforming designs, and [...] $66,000, the monetary value associated with life loss is small, suggesting that the governing factor in this respect will be the maximum permissible life-safety risk deemed by the public (or its representative government) to be appropriate for buildings.
Although the focus of this report is on one specific building, it can be used as a reference
for other types of structures. This report is organized in such a way that the individual core
chapters (4, 5, and 6) can be read independently. Chapter 1 provides background on the
performance-based earthquake engineering (PBEE) approach. Chapter 2 presents the
implementation of the PBEE methodology of the PEER framework, as applied to the benchmark
building. Chapter 3 sets the stage for the choices of location and basic structural design. The subsequent core chapters focus on the hazard analysis (Chapter 4), the structural analysis
(Chapter 5), and the damage and loss analyses (Chapter 6). Although the report is self-contained,
readers interested in additional details can find them in the appendices