Efficient methods for read mapping
DNA sequencing is the mainstay of biological and medical research. Modern sequencing machines can read millions of DNA fragments, sampling the underlying genomes at high-throughput. Mapping the resulting reads to a reference genome is typically the first step in sequencing data analysis. The problem has many variants as the reads can be short or long with a low or high error rate for different sequencing technologies, and the reference can be a single genome or a graph representation of multiple genomes. Therefore, it is crucial to develop efficient computational methods for these different problem classes. Moreover, continually declining sequencing costs and increasing throughput pose challenges to the previously developed methods and tools that cannot handle the growing volume of sequencing data.
This dissertation seeks to advance the state of the art in the established field of read mapping by proposing more efficient and scalable read mapping methods and by tackling emerging problem areas. Specifically, we design ultra-fast methods to map two types of reads: short reads for high-throughput chromatin profiling and nanopore raw reads for real-time targeted sequencing. In tune with the characteristics of these types of reads, our methods scale to larger sequencing data sets or map more reads correctly compared with state-of-the-art mapping software. Furthermore, we propose two algorithms for aligning sequences to graphs, which is the foundation of mapping reads to graph-based reference genomes. One algorithm improves the time complexity of existing sequence-to-graph alignment algorithms for linear or affine gap penalties. The other provides good empirical performance under the edit distance metric. Finally, we mathematically formulate the problem of validating paired-end read constraints when mapping sequences to graphs, and propose an exact algorithm that is also fast enough for practical use.
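The sequence-to-graph alignment problem under the edit distance metric can be illustrated, for the restricted case of an acyclic character-labeled graph, by a small dynamic program over a topological order. This is only a sketch of the general idea, not the dissertation's algorithm; the vertex-label representation, predecessor lists, and free path endpoints are simplifying assumptions.

```python
def align_query_to_dag(query, labels, preds, order):
    """Minimum edit distance between `query` and any path in a labeled DAG.

    labels: vertex -> character
    preds:  vertex -> list of predecessor vertices
    order:  vertices in topological order
    A path may start and end at any vertex; the whole query must be aligned.
    """
    m = len(query)
    base = list(range(m + 1))      # empty path prefix: j query chars inserted
    D = {}                         # D[v][j]: best cost for a path ending at v
    for v in order:
        # a path through v either extends a predecessor's path or starts at v
        prev_rows = [D[u] for u in preds.get(v, [])] + [base]
        row = [min(p[0] for p in prev_rows) + 1] + [0] * m  # j = 0: delete v
        for j in range(1, m + 1):
            sub = min(p[j - 1] for p in prev_rows) \
                + (labels[v] != query[j - 1])        # match/mismatch on v
            dele = min(p[j] for p in prev_rows) + 1  # skip graph char v
            ins = row[j - 1] + 1                     # skip query char
            row[j] = min(sub, dele, ins)
        D[v] = row
    return min(D[v][m] for v in order)
```

Cyclic graphs (the general case treated in the dissertation) require additional machinery, since a fixed evaluation order of vertices no longer exists.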
The Nature of Volunteer Chinese Teaching Launched by the Hanban
With the increasing demand for Chinese language learning, the Hanban has been dispatching volunteer Chinese teachers worldwide for several years. Many studies on volunteer Chinese teaching have focused on the Hanban's soft-power projection, higher-education cooperation, and policies. However, little work has examined the nature of officially launched volunteer Chinese teaching through its teaching practices. Drawing mainly on Phillipson's theory of linguistic imperialism, my thesis aims to fill this gap in the literature by exploring whether volunteer Chinese teaching launched by the Hanban in three Southeast Asian countries exhibits features of linguistic imperialism. I conducted a qualitative case study of the teaching practices of volunteer teachers in the Philippines, Thailand, and Indonesia in order to gain a detailed understanding of the volunteer Chinese teaching situations. Alongside findings from interviews with volunteer Chinese teachers, the thesis also contains a content analysis of materials from the Hanban website and online newspapers. My analysis indicates that volunteer Chinese teaching in the three Southeast Asian countries shows some features of linguistic imperialism, but cannot be completely characterized as linguistic imperialism. This finding on the nature of volunteer Chinese teaching addresses some people's concerns about China's promotion of the Chinese language. It can also give Chinese language teachers and administrators a better idea of what Chinese teaching should look like and how it could be improved.
EpiAlign: an alignment-based bioinformatic tool for comparing chromatin state sequences.
The availability of genome-wide epigenomic datasets enables in-depth studies of epigenetic modifications and their relationships with chromatin structures and gene expression. Various alignment tools have been developed to align nucleotide or protein sequences in order to identify structurally similar regions. However, there are currently no alignment methods specifically designed for comparing multi-track epigenomic signals and detecting common patterns that may explain functional or evolutionary similarities. We propose a new local alignment algorithm, EpiAlign, designed to compare chromatin state sequences learned from multi-track epigenomic signals and to identify locally aligned chromatin regions. EpiAlign is a dynamic programming algorithm that incorporates, in a novel way, the varying lengths and frequencies of chromatin states. We demonstrate the efficacy of EpiAlign through extensive simulations and studies on real data from the NIH Roadmap Epigenomics project. EpiAlign is able to extract recurrent chromatin state patterns along a single epigenome, and many of these patterns carry cell-type-specific characteristics. EpiAlign can also detect common chromatin state patterns across multiple epigenomes, and it will serve as a useful tool to group and distinguish epigenomic samples based on genome-wide or local chromatin state patterns.
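The abstract does not reproduce EpiAlign's exact recurrence; as a baseline illustration of the kind of dynamic program it extends, a Smith-Waterman-style local alignment over chromatin state labels looks as follows. The scoring values are placeholder assumptions; EpiAlign additionally weights states by their lengths and frequencies.

```python
def local_align_states(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment over two chromatin state sequences,
    given as lists of state labels. Returns the best local alignment score."""
    n, m = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # a local alignment may restart anywhere, hence the floor at 0
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```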
Validating Paired-End Read Alignments in Sequence Graphs
Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is a commonly used sequencing platform in genomics, where the paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem, and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs, and use sparse matrix-matrix multiplications (SpGEMM) to build an index which can be queried efficiently by a mapping algorithm for validating the distance constraints. The effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph and a pan-genome de Bruijn graph built using genomes of 20 B. anthracis strains. While the one-time indexing time can vary from a few minutes to a few hours using our algorithm, answering a million distance queries takes less than a second.
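The SpGEMM-based indexing idea can be illustrated with Boolean sparse products over adjacency sets: the d-th product gives, for each vertex, the set of vertices reachable in exactly d edges, which a mapper could query to validate a paired-end distance constraint. This is a sketch of the general idea, not the paper's index; the adjacency-set representation and hop-count distances (rather than base-pair path lengths) are simplifying assumptions.

```python
def spgemm_bool(A, B):
    """Boolean sparse matrix product in adjacency-set form:
    C[i] = { k : exists j with j in A[i] and k in B[j] }."""
    C = {}
    for i, row in A.items():
        out = set()
        for j in row:
            out |= B.get(j, set())
        if out:
            C[i] = out
    return C

def distance_index(adj, d_max):
    """reach[d][u] = vertices reachable from u in exactly d edges, d = 1..d_max."""
    reach = {1: adj}
    for d in range(2, d_max + 1):
        reach[d] = spgemm_bool(reach[d - 1], adj)
    return reach

def within(reach, u, v, d_min, d_max):
    """True if some path from u to v has length in [d_min, d_max] edges."""
    return any(v in reach[d].get(u, set()) for d in range(d_min, d_max + 1))
```

Because reference graphs are highly sparse, each Boolean product touches only the nonzero entries, which is what makes the one-time indexing affordable.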
Measurement-Based Characterization of 39 GHz Millimeter-Wave Dual-Polarized Channel Under Foliage Loss Impact
This paper presents a measurement-based analysis of a wideband 39 GHz millimeter-wave (mm-wave) dual-polarized propagation channel under the impact of foliage presence between a transmitter (Tx) and a receiver (Rx). The measurements were conducted in a rich-vegetation area, and the so-called direction-scan-sounding (DSS) method, which rotates a horn antenna in the angular domains, was applied, aiming at investigating the direction-of-arrival (DoA)-dependent characteristics of polarimetric channels. Four Tx-to-Rx polarization configurations were considered, including co-polarization scenarios with vertical Tx-polarization to vertical Rx-polarization (VV) and horizontal to horizontal (HH), as well as cross-polarization with vertical to horizontal (VH) and horizontal to vertical (HV), which allow scrutinizing the differences in delay-direction dispersion for commonly encountered scenarios. A foliage loss model for various vegetation depths in the VV polarization configuration is also presented. The results show that the foliage-loss DoA spectra for VH and HV are similar, while the spectra exhibit less penetration loss in most directions for VV than for HH. Furthermore, the presence of vegetation between the Tx and the Rx leads to larger dispersion in delay compared to the clear line-of-sight (LoS) scenario, particularly for vertical polarization on the Tx side; in addition, the foliage presence also results in evident DoA dispersion, especially in the HV scenario. Selectivity in directions caused by foliage is more significant in vertically-polarized Tx scenarios than in horizontally-polarized Tx scenarios. A statistical model is established summarizing these comparison details.
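The abstract does not give the fitted coefficients of the paper's foliage loss model; as a generic illustration of how such depth-dependent models are expressed, Weissberger's modified exponential decay model (a widely used empirical model, not the one fitted in this paper) computes excess loss from carrier frequency and vegetation depth:

```python
def weissberger_loss_db(f_ghz, depth_m):
    """Weissberger's modified exponential decay model for foliage loss (dB).

    f_ghz:   carrier frequency in GHz
    depth_m: foliage depth along the Tx-Rx path in metres (valid up to 400 m)
    """
    if not 0 <= depth_m <= 400:
        raise ValueError("model is stated for foliage depths up to 400 m")
    if depth_m <= 14:
        return 0.45 * f_ghz ** 0.284 * depth_m
    return 1.33 * f_ghz ** 0.284 * depth_m ** 0.588
```

At 39 GHz the model predicts roughly 13 dB of excess loss for 10 m of foliage, illustrating why vegetation depth dominates the link budget in such measurements.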
Neuromorphic Online Learning for Spatiotemporal Patterns with a Forward-only Timeline
Spiking neural networks (SNNs) are bio-plausible computing models with high energy efficiency. The temporal dynamics of neurons and synapses enable them to detect temporal patterns and generate sequences. While Backpropagation Through Time (BPTT) is traditionally used to train SNNs, it is not suitable for online learning in embedded applications due to its high computation and memory cost as well as extended latency. Previous works have proposed online learning algorithms, but they often rely on highly simplified spiking neuron models without synaptic dynamics and reset feedback, resulting in subpar performance. In this work, we present Spatiotemporal Online Learning for Synaptic Adaptation (SOLSA), specifically designed for online learning of SNNs composed of Leaky Integrate-and-Fire (LIF) neurons with exponentially decaying synapses and soft reset. The algorithm learns not only the synaptic weights but also the temporal filters associated with the synapses. Compared to BPTT, SOLSA has a much lower memory requirement and achieves a more balanced temporal workload distribution. Moreover, SOLSA incorporates enhancement techniques such as scheduled weight updates, early stopping, and adaptive synapse filtering, which speed up convergence and enhance learning performance. Compared to other non-BPTT-based SNN learning algorithms, SOLSA demonstrates an average learning accuracy improvement of 14.2%. Furthermore, compared to BPTT, SOLSA achieves a 5% higher average learning accuracy with a 72% reduction in memory cost.
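The neuron model named above, an LIF neuron with an exponentially decaying synaptic current and soft reset, can be sketched in a few lines. All parameter values here are illustrative assumptions, and the sketch covers only the forward dynamics, not SOLSA's learning rule.

```python
import math

def simulate_lif(spikes_in, w=1.0, tau_syn=5.0, tau_mem=10.0,
                 v_th=1.0, dt=1.0):
    """Leaky Integrate-and-Fire neuron with an exponentially decaying
    synaptic current and soft reset (threshold subtracted on spike).

    spikes_in: list of 0/1 input spikes, one per time step.
    Returns the list of 0/1 output spikes.
    """
    alpha = math.exp(-dt / tau_syn)   # synaptic current decay per step
    beta = math.exp(-dt / tau_mem)    # membrane potential decay per step
    i_syn, v = 0.0, 0.0
    out = []
    for s in spikes_in:
        i_syn = alpha * i_syn + w * s   # exponential synapse filters input
        v = beta * v + i_syn            # leaky integration of the current
        if v >= v_th:
            out.append(1)
            v -= v_th                   # soft reset keeps the residual charge
        else:
            out.append(0)
    return out
```

The soft reset (subtracting the threshold instead of zeroing the potential) is what gives the reset feedback that simplified neuron models omit.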
Uplift Modeling based on Graph Neural Network Combined with Causal Knowledge
Uplift modeling is a fundamental component of marketing effect modeling, commonly employed to evaluate the effects of treatments on outcomes. Through uplift modeling, we can identify the treatment with the greatest benefit; conversely, we can identify the clients who are likely to respond favorably to a given treatment. Past uplift modeling approaches relied heavily on the difference-in-differences (DID) architecture, paired with a machine learning model as the estimation learner, while neglecting the relational and confounding information among features. We propose a framework based on graph neural networks that combines causal knowledge with the estimation of uplift values. First, we present a causal representation technique based on CATE (conditional average treatment effect) estimation and adjacency matrix structure learning. Second, we propose a more scalable uplift modeling framework based on graph convolutional networks that incorporates causal knowledge. Our findings demonstrate that this method works effectively for predicting uplift values, with small errors on typical simulated data, and its effectiveness has been verified on real industry marketing data.
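The baseline that such approaches build on, estimating uplift as the difference between outcome models for the treated and control groups (a "two-model" or T-learner scheme), can be sketched as follows. Per-stratum group means stand in for the machine-learning estimators; this is a simplifying assumption, not the paper's graph-based method.

```python
from collections import defaultdict

def t_learner_uplift(rows):
    """Per-stratum uplift: mean outcome under treatment minus mean outcome
    under control, a minimal two-model (T-learner) estimate.

    rows: iterable of (stratum, treated, outcome) with treated in {0, 1}.
    Returns {stratum: uplift} for strata seen under both arms.
    """
    stats = defaultdict(lambda: [0.0, 0, 0.0, 0])  # t_sum, t_n, c_sum, c_n
    for stratum, treated, outcome in rows:
        s = stats[stratum]
        if treated:
            s[0] += outcome
            s[1] += 1
        else:
            s[2] += outcome
            s[3] += 1
    return {stratum: s[0] / s[1] - s[2] / s[3]
            for stratum, s in stats.items() if s[1] and s[3]}
```

A positive uplift for a stratum indicates clients worth targeting; the graph-based framework described above refines such estimates by propagating causal and relational information across the feature graph.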