
    Data simulation of tumor phylogenetic trees and evaluation of phylogenetic reconstructing tools

    Tumor heterogeneity refers to the observation that a tumor usually contains more than one type of cell; these cell populations are called clones. Clones in a tumor have distinct morphological and physiological features, such as genetic variations. Different clones display different sensitivities to cytotoxic drugs, so tumor heterogeneity complicates the understanding of tumor composition and poses challenges for the development of successful therapies. Studying tumor heterogeneity can therefore guide tumor therapy for the individual patient and enhance our understanding of inter-clonal functional relationships during therapy, which could benefit personalized and efficient treatments. Heterogeneous tumor development is an evolutionary process: there exists an evolutionary relationship among the clones of a heterogeneous tumor, and this relationship can be described by a phylogenetic tree. Computational tools have become increasingly important for studying tumor heterogeneity because of their time and economic efficiency. Such tools usually take as input the genetic variability data produced by high-throughput sequencing technologies, then output the clonal composition of a tumor and reconstruct its phylogenetic tree. In this thesis, we simulated a large number of datasets consisting of tumor phylogenetic trees with varying properties and used them to evaluate five recent and popular computational tools for tumor phylogeny reconstruction. We found relatively large performance differences among these tools and identified their respective strengths and shortcomings. We leave as future work the improvement of the data simulation methods and the exploration of tool parameters for possibly better results.
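
    As a toy illustration of this kind of simulation (a minimal sketch under assumed conventions, not the thesis pipeline; all names and parameters are hypothetical), a random clonal tree can be grown by attaching each new clone to a uniformly chosen existing clone, with every clone inheriting its parent's mutations plus a private set:

    import random

    def simulate_tumor_tree(n_clones, muts_per_clone, seed=0):
        """Grow a random clonal tree; clone 0 is the founder."""
        rng = random.Random(seed)
        parent = {0: None}
        mutations = {0: set(range(muts_per_clone))}
        next_mut = muts_per_clone
        for clone in range(1, n_clones):
            p = rng.randrange(clone)  # attach to a uniformly chosen existing clone
            parent[clone] = p
            private = set(range(next_mut, next_mut + muts_per_clone))
            next_mut += muts_per_clone
            mutations[clone] = mutations[p] | private  # inherited plus private mutations
        return parent, mutations

    parent, muts = simulate_tumor_tree(n_clones=10, muts_per_clone=5)
    print(parent)  # e.g. {0: None, 1: 0, 2: 0, 3: 1, ...}

    A reconstruction tool would then be fed noisy variant frequencies derived from such mutation sets, and its output tree compared against the simulated ground truth.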

    Integration of Mission Control System, On-board Computer Core and Spacecraft Simulator for a Satellite Test Bench

    The satellite avionics platform has been developed in cooperation with Airbus and is called "Future Low-cost Platform" (FLP). It is based on an Onboard Computer (OBC) with redundant processor boards built around SPARC V8 microchips of type Cobham Aeroflex UT699. At the University of Stuttgart, a test bench with a real hardware OBC and a fully simulated satellite is available for testing real flight scenarios with the Onboard Software (OBSW) running on representative hardware. The test bench, like the later flying satellite "Flying Laptop", is commanded from a real Ground Control Centre (GCC). The main challenges in the FLP project were:
    - Onboard computer design,
    - Software design, and
    - Interfaces between platform and payloads.
    In the course of industrializing this FLP platform technology for later use in satellite constellations, Airbus has started to set up an in-house test bench where all these technologies shall be developed. The initial plan is to port the first core elements of the FLP OBSW to the new dual-core processor and the new SpaceWire (SpW) routing network. The plan also includes new Mission Control Software with which the OBC can be commanded. The new OBC has a dual-core processor, the Cobham Gaisler GR712; hence all payload-related functionality is to be implemented on the second core, which involves a great deal of low-level task distribution. The resulting SpW router network application and the dual-core platform/payload OBSW sharing are entirely new in the field of satellite engineering.

    Learning Graph Parameters from Linear Measurements: Fundamental Trade-offs and Application to Electric Grids

    We consider a specific graph learning task: reconstructing a symmetric matrix that represents an underlying graph using linear measurements. We study fundamental trade-offs between the number of measurements (sample complexity), the complexity of the graph class, and the probability of error by first deriving a necessary condition (fundamental limit) on the number of measurements. Then, by considering a two-stage recovery scheme, we give a sufficient condition for recovery. In the special cases of the uniform distribution on trees with n nodes and the Erdős-Rényi (n, p) class, the sample complexity derived from the fundamental trade-offs is tight up to multiplicative factors. In addition, we design and implement a polynomial-time (in n) algorithm based on the two-stage recovery scheme. Simulations for several canonical graph classes and IEEE power system test cases demonstrate the effectiveness of the proposed algorithm for accurate topology and parameter recovery.
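
    To make the measurement model concrete (a hedged numpy sketch, not the paper's two-stage scheme): the vectorized upper triangle of a sparse symmetric matrix is observed through random linear measurements and recovered by l1-regularized least squares, here via plain iterative soft thresholding (ISTA).

    import numpy as np

    def ista(A, y, lam=0.1, iters=1000):
        """Minimize 0.5*||Ax - y||^2 + lam*||x||_1 by iterative soft thresholding."""
        step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L with L the gradient's Lipschitz constant
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            x = x - step * (A.T @ (A @ x - y))  # gradient step on the smooth part
            x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)  # soft threshold
        return x

    rng = np.random.default_rng(0)
    d = 45  # free entries of a 10-node symmetric matrix (upper triangle, no diagonal)
    x_true = np.zeros(d)
    x_true[rng.choice(d, size=9, replace=False)] = 1.0  # a tree on 10 nodes has 9 edges
    A = rng.standard_normal((30, d))  # m = 30 < d = 45 linear measurements
    x_hat = ista(A, A @ x_true, lam=0.01, iters=3000)
    print(np.round(x_hat, 2))  # approximately recovers the sparse edge vector

    Sparsity is what makes recovery from m < d measurements possible here; the paper's trade-offs quantify how small m can be for a given graph class and error probability.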

    Investigation of HIV-TB co-infection through analysis of the potential impact of host genetic variation on host-pathogen protein interactions

    HIV and Mycobacterium tuberculosis (Mtb) co-infection causes treatment and diagnostic difficulties, which places a major burden on health care systems in settings with high prevalence of both infectious diseases, such as South Africa. Human genetic variation adds further complexity, with variants affecting disease susceptibility and response to treatment. The identification of variants in African populations is affected by reference mapping bias, especially in complex regions like the Major Histocompatibility Complex (MHC), which plays an important role in the immune response to HIV and Mtb infection. We used a graph-based approach to identify novel variants in the MHC region within African samples without mapping to the canonical reference genome. We generated a host-pathogen functional interaction network made up of inter- and intraspecies protein interactions, gene expression during co-infection, drug-target interactions, and human genetic variation. Differential expression and network centrality properties were used to prioritise proteins that may be important in co-infection. Using the interaction network we identified 28 human proteins that interact with both pathogens ("bridge" proteins). Network analysis showed that while MHC proteins did not have significantly higher centrality measures than non-MHC proteins, bridge proteins had significantly shorter distance to MHC proteins. Proteins that were significantly differentially expressed during co-infection or contained variants clinically associated with HIV or TB also had significantly stronger network properties. Finally, we identified common and consequential variants within prioritised proteins that may be clinically associated with HIV and TB. The integrated network was extensively annotated and stored in a graph database that enables rapid and high-throughput prioritisation of sets of genes or variants, facilitates detailed investigations and allows network-based visualisation.
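
    The two network computations described above can be illustrated with a toy networkx sketch (all protein names below are hypothetical placeholders, not the thesis data):

    import networkx as nx

    # Toy interaction network; node names are placeholders.
    G = nx.Graph()
    G.add_edges_from([
        ("bridge1", "hiv_p"), ("bridge1", "mtb_p"), ("bridge1", "mhc1"),
        ("other1", "other2"), ("other2", "mhc1"), ("hiv_p", "other1"),
    ])
    mhc = {"mhc1"}        # MHC proteins
    bridge = {"bridge1"}  # proteins interacting with both pathogens

    # Centrality measures for prioritisation.
    centrality = nx.betweenness_centrality(G)

    # Shortest distance from each bridge protein to the nearest MHC protein.
    for b in bridge:
        dist = nx.single_source_shortest_path_length(G, b)
        print(b, centrality[b], min(dist[m] for m in mhc if m in dist))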

    Learning Graphs from Linear Measurements: Fundamental Trade-offs and Applications

    We consider a specific graph learning task: reconstructing a symmetric matrix that represents an underlying graph using linear measurements. We present a sparsity characterization for distributions of random graphs (that are allowed to contain high-degree nodes), based on which we study fundamental trade-offs between the number of measurements, the complexity of the graph class, and the probability of error. We first derive a necessary condition on the number of measurements. Then, by considering a three-stage recovery scheme, we give a sufficient condition for recovery. Furthermore, assuming the measurements are Gaussian IID, we prove upper and lower bounds on the (worst-case) sample complexity for both noisy and noiseless recovery. In the special cases of the uniform distribution on trees with n nodes and the Erdős-Rényi (n, p) class, the fundamental trade-offs are tight up to multiplicative factors with noiseless measurements. In addition, for practical applications, we design and implement a polynomial-time (in n) algorithm based on the three-stage recovery scheme. Experiments show that the heuristic algorithm outperforms basis pursuit on star graphs. We apply the heuristic algorithm to learn admittance matrices in electric grids. Simulations for several canonical graph classes and IEEE power system test cases demonstrate the effectiveness and robustness of the proposed algorithm for parameter reconstruction.
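
    As a hedged illustration of the grid application (not the paper's three-stage algorithm): in a linearized grid model, current injections relate to voltages through the admittance matrix, i = Yv, so each (voltage, current) pair supplies n linear measurements of Y; with enough pairs, Y can be estimated by least squares.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5
    # Random weighted graph; its Laplacian stands in for an admittance matrix.
    W = np.triu(rng.uniform(0.5, 1.5, (n, n)) * (rng.random((n, n)) < 0.5), k=1)
    W = W + W.T
    Y_true = np.diag(W.sum(axis=1)) - W

    m = 20  # number of (voltage, current) measurement pairs
    V = rng.standard_normal((m, n))  # voltage profiles, one per row
    I = V @ Y_true.T  # i = Yv for each row, hence I = V Y^T

    # Least squares: solve V Y^T = I for Y.
    Y_hat = np.linalg.lstsq(V, I, rcond=None)[0].T
    print(np.allclose(Y_hat, Y_true, atol=1e-8))  # True when V has full column rank

    The interesting regime in the paper is m much smaller than needed for plain least squares, where the sparsity of Y must be exploited.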

    Holocentric plants of the genus Rhynchospora as a new model to study meiotic adaptations to chromosomal structural rearrangements

    Climate change, world hunger and overpopulation are some of the biggest challenges the world is currently facing. Moreover, they are part of a single multidimensional scenario: as climate change continues to modify our planet, we might see a decrease of arable land and an increase in extreme weather patterns, posing a threat to food security. This has a direct impact on regions with high population growth, where food security is already fragile. Considering additionally the unsustainability of intensive global food production and its contribution to greenhouse emissions and biodiversity loss, it is clear that all these factors are interconnected (Cardinale et al., 2012; Prosekov & Ivanova, 2018; Wiebe et al., 2019). Plants are the main source of staple food in the world and the main actors in carbon fixation; they are therefore key protagonists in controlling climate change. Plants are also an essential habitat-defining element balancing our ecosystem. Thus, how we grow plants and crops will, aside from the obvious implications for food security, also have a profound impact on the climate and biodiversity.
    The natural variability of species is an immense pool of genes and traits, and understanding it is key to generating new useful knowledge. For instance, natural populations can be more tolerant to abiotic and biotic stresses, or carry traits that, combined in hybrids, might achieve a higher seed number or faster growth. Classical breeding has exploited unrelated varieties to achieve traits of interest like dwarfism and higher grain production. However, only a limited number of crop species have been the focus of recent scientific and technological approaches, and they do not represent the extremely vast natural diversity of species that could generate useful knowledge for future applications (Castle et al., 2006; Pingali, 2012).
    The key to this natural variability is a process called meiotic recombination, the exchange of genomic material between homologous parental chromosomes. Meiotic recombination takes place during meiosis, a specialized cell division in which sexually reproducing organisms reduce the genomic complement of their gametes by half in preparation for fertilization. Meiotic recombination takes place at the beginning of meiosis, in a stage called prophase I. To exchange DNA sequences, the strands of two homologous chromosomes must be fragmented. This specific process of physiologically induced DNA fragmentation is conserved in the vast majority of eukaryotes (Keeney et al., 1997). After the formation of double-strand breaks, the remaining 3' ends are targeted by recombinases that help the strands search for and invade templates for repair. After invasion, the 3' end is extended by DNA synthesis, exposing sequences on the opposite strand that can anneal to the other 3' end of the original double-strand break. DNA synthesis at both ends generates a new structure called a double Holliday junction (dHJ), forming a physical link between homologous chromosomes named a chiasma (Wyatt & West, 2014). The resolution of these structures can produce crossovers (COs), the molecular events representing the outcome of meiotic recombination. Other outcomes are possible, such as noncrossovers (NCOs); in this case, the invading strand is ejected and anneals to the single-stranded 3' end of the original double-strand break (Allers & Lichten, 2001).
    Crossovers can be divided into two main groups, called class I and class II. COs of the first group are considered sensitive to interference, meaning that mechanisms exist that prevent two class I COs from occurring in proximity to each other; class II COs are insensitive to interference. Class I COs are the result of a pathway called ZMM, which involves a group of specialised proteins that are highly conserved among eukaryotes (Lambing et al., 2017; Mercier et al., 2015). Class I COs are the most common, most studied and most important type of CO. Centromeres are chromosomal structures that allow proper chromosome segregation during mitosis and meiosis. Centromeres have a profound effect on plant breeding and crop improvement, as meiotic recombination is known to be suppressed at centromeres in most eukaryotes. This represents a great limitation for crop improvement, as many potentially useful traits might lie in regions not subject to recombination and thus might not be available for breeding purposes. Additionally, the mechanisms by which recombination is regulated and prevented at centromeres are still unclear. In most model organisms, centromeres are single entities localized to specific regions of the chromosomes; this configuration is called monocentric. However, another, less studied configuration is found in nature: some organisms harbour multiple centromeric determinants distributed over their whole chromosomal length. This configuration is called holocentric.
    The Cyperaceae comprise a vast, diverse family of plants with a cosmopolitan distribution in all habitats (Spalink et al., 2016). Despite the worldwide presence of this family, knowledge about it is limited: few genomes are available and molecular insights are scarce. This family is also known to consist mainly of holocentric species (Melters et al., 2012). Understanding if and how meiotic recombination is achieved in holocentric plants will generate new knowledge that might in the future unlock new traits in elite crops, previously unavailable to breeding, that could help humanity face global climatic, economic and social challenges. Recent studies have reported important meiotic, chromosome and genome adaptations found in species of the Cyperaceae family and in particular the genus Rhynchospora (Marques et al., 2015, 2016a). With the recent publication of the first reference genomes for several Rhynchospora species, we could already perform a comprehensive analysis of their unique genome features and trace the evolutionary history of their karyotypes and how these have been shaped by chromosome fusions (Hofstatter et al., 2021, 2022). This new resource paves the way for future research using Rhynchospora as a model genus to study adaptations to holocentricity in plants. With this work, my intention is to shed light on the underexplored topic of holocentricity in plants. Using cutting-edge techniques, I examine the conservation of meiotic recombination together with other species-specific adaptations such as achiasmy and polyploidy in holocentrics. My results reveal new insights into how plant meiotic recombination is regulated when small centromere units are distributed chromosome-wide, challenging the classic dogma of suppression of recombination at centromeres.

    GOCE data analysis: optimized brute force solutions of large-scale linear equation systems on parallel computers

    The satellite mission GOCE (Gravity field and steady-state Ocean Circulation Explorer) was set up to determine the figure of the Earth with unprecedented accuracy. The sampling frequency is 1 Hz, which results in a massive amount of data over the one-year period during which the satellite is intended to be operational. From these data we can set up an overdetermined linear system of equations to estimate the geopotential coefficients, which are required for modelling the Earth's gravity field with spherical harmonics at the desired high resolution. The linear system of equations is solved "brute force", which means that the normal equation matrix has to be kept in memory as a whole. The normal equation matrix has a memory demand of up to 65 GByte; hence we need a computer providing a sufficient amount of memory and multiple fast processors for the computations to be done in a reasonable time. In this study, a program was written to compute the geopotential coefficients from simulated GOCE data, as real GOCE data were not yet available. As a first step, the program was optimized to make the computations more efficient. As a second step, the program was parallelized to speed up the computations using two different techniques: the first version was parallelized via OpenMP, which can be used on shared memory systems that usually have a small number of processors; the second version used MPI, which is suited to a distributed memory architecture and can hence incorporate many more processors in the computations. In summary, we gained a huge boost in the program's efficiency due to the optimization, and a further large speed-up due to the parallelization. The more processors are incorporated in the computation, the more the overall efficiency drops, because of increasing communication between the processors. We could show that for huge problems the MPI version runs more efficiently than the OpenMP version.
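
    A small numpy sketch of the brute-force normal-equations approach described above (illustrative, not the thesis code): N = A^T A and b = A^T y are accumulated block-wise over the observations, so the full design matrix never has to be held in memory at once, and N x = b is then solved by Cholesky factorization. The block loop is exactly what the OpenMP and MPI versions distribute across processors.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def accumulate_normals(blocks, n_params):
        """Accumulate N = A^T A and b = A^T y over row blocks of the design matrix."""
        N = np.zeros((n_params, n_params))
        b = np.zeros(n_params)
        for A_blk, y_blk in blocks:  # each block: a chunk of observations
            N += A_blk.T @ A_blk
            b += A_blk.T @ y_blk
        return N, b

    rng = np.random.default_rng(0)
    n_params, block_rows, n_blocks = 50, 200, 10
    x_true = rng.standard_normal(n_params)
    blocks = [(A, A @ x_true)
              for A in (rng.standard_normal((block_rows, n_params)) for _ in range(n_blocks))]

    N, b = accumulate_normals(blocks, n_params)
    x_hat = cho_solve(cho_factor(N), b)  # solve N x = b via Cholesky
    print(np.allclose(x_hat, x_true))  # True for this noise-free simulation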

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing (NGS) technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive number of short or long DNA sequence reads from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, and understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computing capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and only little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which implements statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments to disk. Metagenomic classification of NGS reads is another major topic studied in this thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve classification accuracy. We provide Seed-Kraken, a spaced-seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, which yields a much smaller and more informative index than Kraken's. We also provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
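
    To make the spaced-seed idea concrete (a minimal sketch; the pattern below is illustrative, not the Seed-Kraken seed): a spaced seed is a binary pattern whose 1-positions must match, while 0-positions are "don't care"; extracting the 1-positions from every window of a read yields the keys used for index look-up, tolerating mismatches at the ignored positions.

    def spaced_seed_keys(read, pattern="1101101"):
        """Yield the spaced-seed key of every window of the read."""
        keep = [i for i, c in enumerate(pattern) if c == "1"]  # match positions
        span = len(pattern)
        for start in range(len(read) - span + 1):
            window = read[start:start + span]
            yield "".join(window[i] for i in keep)

    for key in spaced_seed_keys("ACGTACGTAC"):
        print(key)  # first window ACGTACG yields ACTAG

    A classifier indexes such keys for all reference genomes and assigns each read to the taxonomic node consistent with where its keys occur.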