Computational methods for augmenting association-based gene mapping
The context and motivation for this thesis is gene mapping, the discovery of genetic variants that affect susceptibility to disease. The goals of gene mapping research include understanding of disease mechanisms, evaluating individual disease risks and ultimately developing new medicines and treatments.
Traditional genetic association mapping methods test each measured genetic variant independently for association with the disease. One way to improve the power of detecting disease-affecting variants is to base the tests on haplotypes, strings of adjacent variants that are inherited together, instead of individual variants. To enable haplotype analyses in large-scale association studies, this thesis introduces two novel statistical models and gives an efficient algorithm for haplotype reconstruction, jointly called HaploRec. HaploRec is based on modeling local regularities of variable length in the haplotypes of the studied population and using the obtained model to statistically reconstruct the most probable haplotypes for each studied individual. Our experiments demonstrate that HaploRec is especially well suited to data sets with a large number of markers and subjects, such as those typically used in currently popular genome-wide association studies.
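To illustrate why haplotype reconstruction needs a statistical model at all, the following minimal sketch enumerates every haplotype pair that is consistent with one unphased genotype. It is not HaploRec; the function name and the 0/1/2 genotype coding are assumptions made for illustration. A genotype with k heterozygous markers is compatible with 2^(k-1) distinct pairs, which is the ambiguity a reconstruction method has to resolve.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of 0, 1 or 2, the count of the alternate allele at each
    marker (hypothetical coding chosen for this sketch).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 -> allele 0 on both, 2 -> allele 1 on both.
    base = [g // 2 for g in genotype]
    # Pin the first heterozygous site so (h1, h2) and (h2, h1) are not both listed.
    free = het_sites[1:]
    pairs = []
    for bits in product((0, 1), repeat=len(free)):
        h1, h2 = list(base), list(base)
        if het_sites:
            h1[het_sites[0]], h2[het_sites[0]] = 0, 1
        for site, b in zip(free, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


if __name__ == "__main__":
    for h1, h2 in compatible_haplotype_pairs([1, 2, 1, 0]):
        print(h1, h2)
```

The number of candidates doubles with each additional heterozygous marker, so exhaustive enumeration is only practical for short windows; a population model is needed to rank the candidates.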
Public biological databases contain large amounts of data that can help in determining the relevance of putative associations. In this thesis, we introduce Biomine, a database and search engine that integrates data from several such databases under a uniform graph representation. The graph database is used to derive a general proximity measure for biological entities represented as graph nodes, based on a novel scheme of weighting individual graph edges based on their informativeness and type. The resulting proximity measure can be used as a basis for various data analysis tasks, such as ranking putative disease genes and visualization of gene relationships.
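As a rough illustration of a graph-based proximity measure, the sketch below scores two nodes by the best path between them, where each edge carries a weight in (0, 1] and a path's score is the product of its edge weights. This is an assumption made for the example, not necessarily the measure defined in the thesis; the node names and weights are hypothetical.

```python
import heapq
import math


def best_path_proximity(edges, source, target):
    """Return the maximum product of edge weights over all paths source -> target.

    edges: dict mapping node -> list of (neighbour, weight), with 0 < weight <= 1.
    Maximising a product of weights is the same as minimising the sum of
    -log(weight), so Dijkstra's algorithm applies.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return math.exp(-d)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nbr, weight in edges.get(node, []):
            nd = d - math.log(weight)
            if nd < dist.get(nbr, math.inf):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return 0.0  # target not reachable


if __name__ == "__main__":
    # Hypothetical toy graph: an indirect route via a protein beats the direct edge.
    graph = {
        "geneA": [("protein1", 0.9), ("disease", 0.5)],
        "protein1": [("disease", 0.8)],
    }
    print(best_path_proximity(graph, "geneA", "disease"))
```

Candidate genes could then be ranked by their proximity to a set of known disease genes, under whatever edge weighting scheme is in use.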
Our experiments show that relevant disease genes can be identified from among the putative ones with a reasonable accuracy using Biomine. Best accuracy is obtained when a pre-known reference set of disease genes is available, but experiments using a novel clustering-based method demonstrate that putative disease genes can also be ranked without a reference set under suitable conditions.
An important complementary use of Biomine is the search and visualization of indirect relationships between graph nodes, which can be used e.g. to characterize the relationship of putative disease genes to already known disease genes. We provide two methods for selecting subgraphs to be visualized: one based on weights of the edges on the paths connecting query nodes, and one based on using context free grammars to define the types of paths to be displayed. Both of these query interfaces to Biomine are available online.
The subject area of this dissertation is gene mapping, the localization of hereditary variants that affect susceptibility to disease. The practical goals of gene mapping are understanding disease mechanisms, assessing individual disease risks and developing new medications. This work develops computational methods that can be used to improve the power of existing gene mapping methods and to analyze the preliminary results they produce.
Gene mapping methods are based on so-called markers, which are sites in the genome that vary between individuals. Commonly used methods measure the association between the disease and the variants at each marker separately, without taking other markers into account. Mapping accuracy can be improved by using haplotypes, rather than single markers, as the unit of testing: haplotypes are regular sequences formed by the variants at nearby markers that are inherited together. Laboratory methods do not directly reveal how the variants measured from an individual's genome are divided between the haplotypes inherited from that individual's two parents. The first part of this dissertation presents a computational method with which haplotypes can be reconstructed statistically, based on their local regularities. The method is computationally efficient and is particularly suited to genome-wide studies, in which the numbers of both studied individuals and markers are large and the markers lie reasonably far apart.
The effects of individual variants on diseases are often relatively weak, and when a large set of markers is tested, the results usually also include, by chance, variants that have no real effect on the disease. Public biological databases contain a great deal of information that can help in assessing the significance of preliminary gene mapping results. The second part of the work introduces Biomine, a database that combines information from a number of such databases using a weighted graph model that describes known relationships between, among other things, genes, proteins and diseases. A new method is presented for measuring the strength of indirect connections between the nodes of the graph. This method can be used, for example, to prioritize candidate genes found by gene mapping, by measuring the strength of the connections between the candidate genes and previously known disease genes, or the strength of the connections among the candidate genes themselves.
The work also presents methods for visualizing indirect connections between nodes of the graph database, based on extracting from the database a small subgraph that best describes the connection between the nodes of interest. Two computationally efficient methods are presented for selecting the subgraph: one based on estimating the strength of the connections, and the other based on the types of the links contained in the connecting paths. These visualization methods are also available in a public web service in which queries can be made to the Biomine database.
Demographic and Population Separation History Inference Based on Whole Genome Sequences.
Patterns of DNA sequence variation among present day individuals contain rich information about past population history. The recent availability of whole genome sequences provides challenges and opportunities for developing computational methods to infer detailed models of population history. The goal of this thesis is to extend current methodology and apply available techniques to answer questions about population history in human, gorilla and canine species.
Recent methodologies based on the sequentially Markovian coalescent model permit the inference of population history using single or several whole genome sequences. However, these approaches fail to generate parametric estimates for split times, which are confounded by subsequent migration. Additionally, the effect on split time estimation of switch errors resulting from statistical phasing is largely unknown. We reconstructed phased haplotypes of nine individuals from diverse populations using fosmid pool sequencing. We analyzed population size and separation history using the Pairwise Sequentially Markovian Coalescent model (PSMC) and Multiple Sequentially Markovian Coalescent model (MSMC) and found that applying MSMC to statistically phased haplotypes yields more recent split time estimates than applying it to physically phased haplotypes, owing to switch errors. We further extended PSMC with Approximate Bayesian Computation to infer split time and migration rates under a standard isolation with migration model. We dated several key events in human separation history using these methods.
Gorillas are humans' closest living relatives other than chimpanzees. We analyzed whole genome sequencing data of thirteen gorilla individuals and applied GPhoCS, a Bayesian coalescent-based approach, to infer ancestral population sizes, divergence times and migration rates amongst three gorilla subspecies, shedding light on the evolutionary forces that have uniquely influenced patterns of gorilla genetic variation.
The origins and dynamics of dog domestication have been a controversial and intriguing problem. We analyzed two ancient dog genomes from the Neolithic and over 100 contemporary canine genomes. While both dogs show signatures of admixture, they predominantly share ancestry with modern European dogs, contradicting a late Neolithic population replacement suggested by mitochondrial studies. By calibrating the mutation rate using our oldest dog, we narrowed the timing of dog domestication to a window of 20-40 kyrs ago.
PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/133341/1/songsy_1.pd
Topics in Signal Processing: applications in genomics and genetics
The information in genomic or genetic data is influenced by various complex processes, and appropriate mathematical modeling is required for studying the underlying processes and the data. This dissertation focuses on the formulation of mathematical models for certain problems in genomics and genetics studies and on the development of algorithms that solve them efficiently. A Bayesian approach to transcription factor (TF) motif discovery is examined, and extensions are proposed to deal with the many interdependent parameters of TF-DNA binding. The problem is described in statistical terms and a sequential Monte Carlo sampling method is employed for the estimation of unknown parameters. In particular, a class-based resampling approach is applied for the accurate estimation of a set of intrinsic properties of the DNA binding sites. Through statistical analysis of gene expression, a motif-based computational approach is developed for the inference of novel regulatory networks in a given bacterial genome. To deal with high false-discovery rates in genome-wide TF binding predictions, discriminative learning approaches are examined in the context of sequence classification, and a novel mathematical model is introduced to the family of kernel-based Support Vector Machine classifiers. Furthermore, the problem of haplotype phasing is examined based on the genetic data obtained from cost-effective genotyping technologies. Based on the identification and augmentation of a small and relatively more informative genotype set, a sparse dictionary selection algorithm is developed to infer the haplotype pairs for the sampled population. In a related context, to detect redundant information in the single nucleotide polymorphism (SNP) sites, the problem of representative (tag) SNP selection is introduced. An information theoretic heuristic is designed for the accurate selection of tag SNPs that capture the genetic diversity in a large sample set from multiple populations. The method is based on a multi-locus mutual information measure that reflects linkage disequilibrium, a basic principle of population genetics.
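The idea of choosing tag SNPs by mutual information can be sketched as follows. This is a simplified greedy heuristic using pairwise mutual information, not the multi-locus measure described in the abstract; the function names, the coverage threshold and the toy haplotypes are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two allele columns."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def entropy(x):
    """Empirical entropy in bits; equals the mutual information of x with itself."""
    return mutual_information(x, x)


def covered_by(tag, snps, columns, threshold):
    """SNPs in `snps` whose normalised mutual information with `tag` meets the threshold."""
    out = set()
    for s in snps:
        h = entropy(columns[s])
        if h == 0 or mutual_information(columns[tag], columns[s]) / h >= threshold:
            out.add(s)
    return out


def greedy_tag_snps(columns, threshold=0.8):
    """Greedily pick tags until every SNP is covered by at least one tag.

    columns: dict mapping SNP name -> list of alleles, one per haplotype.
    threshold: hypothetical coverage cut-off in (0, 1].
    """
    uncovered = set(columns)
    tags = []
    while uncovered:
        best = max(
            sorted(columns),
            key=lambda t: len(covered_by(t, uncovered, columns, threshold)),
        )
        tags.append(best)
        uncovered -= covered_by(best, uncovered, columns, threshold)
    return tags


if __name__ == "__main__":
    toy = {
        "s1": [0, 0, 1, 1, 0, 1],
        "s2": [0, 0, 1, 1, 0, 1],  # identical to s1
        "s3": [1, 1, 0, 0, 1, 0],  # complement of s1, still fully informative
        "s4": [0, 1, 0, 1, 0, 1],  # nearly independent of s1
    }
    print(greedy_tag_snps(toy))
```

In this toy data the first three SNPs carry the same information, so one of them plus the fourth SNP suffices as a tag set.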
Deep learning in population genetics
KK is supported by a grant from the Deutsche Forschungsgemeinschaft (DFG) through the TUM International Graduate School of Science and Engineering (IGSSE), GSC 81, within the project GENOMIE QADOP. We acknowledge the support of Imperial College London - TUM Partnership award.
Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally unfeasible, machine learning algorithms, and in particular deep learning algorithms, are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations on the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data will provide further opportunities for technological advancements in the field.
Publisher PDF. Peer reviewed.
Identification of breed contributions in crossbred dogs
There has been a strong public interest recently in the interrogation of canine ancestries using direct-to-consumer (DTC) genetic ancestry inference tools. Our goal is to improve the accuracy of the associated computational tools by developing superior algorithms for identifying the breed composition of mixed-breed dogs. Genetic test data has been provided by Mars Veterinary, using SNP markers. We approach this ancestry inference problem from two main directions. The first approach is optimized for datasets composed of a small number of ancestry informative markers (AIM). Firstly, we compute haplotype frequencies from purebred ancestral panels, which characterize genetic variation within breeds and are utilized to predict breed compositions. Due to the large number of possible breed combinations in admixed dogs, we approximately sample this search space with a Metropolis-Hastings algorithm. As proposal density we either uniformly sample new breeds for the lineage, or we bias the Markov chain so that breeds in the lineage are more likely to be replaced by similar breeds. The second direction we explore is dominated by HMM approaches, which view genotypes as realizations of latent variable sequences corresponding to breeds. In this approach an admixed canine sample is viewed as a linear combination of segments from dogs in the ancestral panel. Results were evaluated using two different performance measures. Firstly, we looked at a generalization of binary ROC curves to multi-class classification problems. Secondly, to more accurately judge breed contribution approximations, we computed the difference between expected and predicted breed contributions. Experimental results on a synthetic, admixed test dataset using AIMs showed that the MCMC approach successfully predicts breed proportions for a variety of lineage complexities. Furthermore, due to exploration in the MCMC algorithm, true breed contributions are underestimated. The HMM approach performed less well, presumably because it uses less of the information in the dataset.
Evaluating Methods for Privacy-Preserving Data Sharing in Genomics
The availability of genomic data is often essential to progress in biomedical research, personalized medicine, drug development, etc. However, its extreme sensitivity makes it problematic, if not outright impossible, to publish or share it. In this dissertation, we study and build systems that are geared towards privacy-preserving genomic data sharing. We first look at the Matchmaker Exchange, a platform that connects multiple distributed databases through an API and allows researchers to query for genetic variants in other databases through the network. However, queries are broadcast to all researchers that made a similar query in any of the connected databases, which can lead to a reluctance to use the platform, due to loss of privacy or competitive advantage. In order to overcome this reluctance, we propose a framework to support anonymous querying on the platform. Since genomic data's sensitivity does not degrade over time, we analyze the real-world guarantees provided by the only tool available for long term genomic data storage. We find that the system offers low security when the adversary has access to side information, and we support our claims with empirical evidence. We also study the viability of synthetic data for privacy-preserving data sharing. Since for genomic data research the utility of the data provided is of the utmost importance, we first perform a utility evaluation on generative models for different types of datasets (i.e., financial data, images, and locations). Then, we propose a privacy evaluation framework for synthetic data. We then perform a measurement study assessing state-of-the-art generative models specifically geared for human genomic data, looking at both utility and privacy perspectives. Overall, we find that there is no single approach for generating synthetic data that performs well across the board from both utility and privacy perspectives.
Dissecting genetic interactions in complex traits
Of central importance in the dissection of the components that govern complex traits is understanding the architecture of natural genetic variation. Genetic interaction, or epistasis, constitutes one aspect of this, but epistatic analysis has been largely avoided in genome wide association studies because of statistical and computational difficulties. This thesis explores both issues in the context of two-locus interactions.
Initially, through simulation and deterministic calculations it was demonstrated that not only can epistasis maintain deleterious mutations at intermediate frequencies when under selection, but that it may also have a role in the maintenance of additive variance. Based on the epistatic patterns that are evolutionarily persistent, and the frequencies at which they are maintained, it was shown that exhaustive two dimensional search strategies are the most powerful approaches for uncovering both additive variance and the other genetic variance components that are co-precipitated.
However, while these simulations demonstrate encouraging statistical benefits, two dimensional searches are often computationally prohibitive, particularly with the marker densities and sample sizes that are typical of genome wide association studies. To address this issue, different software implementations were developed to parallelise the two dimensional triangular search grid across various types of high performance computing hardware. Of these, the most effective used the massively multi-core architecture of consumer level graphics cards. While the performance will continue to improve as hardware improves, at the time of testing the speed was 2-3 orders of magnitude faster than CPU based software solutions that are in current use.
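One common way to spread a triangular pairwise search over many workers is to number the marker pairs with a single linear index and hand each worker a contiguous slice of that index range. The sketch below shows such an index mapping; it is an illustrative assumption, not the scheme used by the software described in the thesis, and all function names are hypothetical.

```python
import math


def pair_count(n_markers):
    """Number of unordered marker pairs in the triangular grid."""
    return n_markers * (n_markers - 1) // 2


def index_to_pair(k, n_markers):
    """Map linear index k to a marker pair (i, j) with i < j.

    Pairs are ordered (0,1), (0,2), (1,2), (0,3), ... so that
    k = j * (j - 1) // 2 + i. Integer square root keeps the inverse exact.
    """
    if not 0 <= k < pair_count(n_markers):
        raise IndexError("pair index out of range")
    j = (1 + math.isqrt(8 * k + 1)) // 2
    i = k - j * (j - 1) // 2
    return i, j


def split_work(n_markers, n_workers):
    """Divide the linear pair indices into at most n_workers contiguous ranges."""
    total = pair_count(n_markers)
    chunk = max(1, -(-total // n_workers))  # ceiling division, at least 1
    return [range(s, min(s + chunk, total)) for s in range(0, total, chunk)]


if __name__ == "__main__":
    for block in split_work(5, 3):
        print([index_to_pair(k, 5) for k in block])
```

Because each worker only needs its start and end index, the mapping avoids materialising the full list of pairs, which matters when the number of markers runs into the hundreds of thousands.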
Not only does this software enable epistatic scans to be performed routinely at minimal cost, but it is now feasible to empirically explore the false discovery rates introduced by the high dimensionality of multiple testing. Through permutation analysis it was shown that the significance threshold for epistatic searches is a function of both marker density and population sample size, and that because of the correlation structure that exists between tests the threshold estimates currently used are overly stringent.
Although the relaxed threshold estimates constitute an improvement in the power of two dimensional searches, detection is still most likely limited to relatively large genetic effects. Through direct calculation it was shown that, in contrast to the additive case where the decay of estimated genetic variance was proportional to falling linkage disequilibrium between causal variants and observed markers, for epistasis this decay was exponential. One way to rescue poorly captured causal variants is to parameterise association tests using haplotypes rather than single markers. A novel statistical method that uses a regularised parameter selection procedure on two locus haplotypes was developed, and through extensive simulations it was shown that it delivers a substantial gain in power over single marker based tests.
Ultimately, this thesis seeks to demonstrate that many of the obstacles in epistatic analysis can be ameliorated, and with the current abundance of genomic data gathered by the scientific community direct search may be a viable method to qualify the importance of epistasis.
Genomic portrait of American populations
The electronic version of the dissertation does not include the publications.
Ameerika populatsioonide evolutsiooni on käsitlenud mitmed multidistsiplinaarsed uuringud. Meie teadmised Ameerika maailmajao geneetilise mitmekesisuse kujunemisest on endiselt ebatäielikud, ehkki geneetilised uuringud lisavad sel teemal pidevalt uusi detaile. Uute tehnoloogiate nagu järgmise põlvkonna sekveneerimine (NGS) väljaarendamine koos teiste tehniliste edasiminekutega avavad võimaluse eraldada ja analüüsida DNA-d iidsetest proovidest, tehes "iidsest genoomikast" (aDNA) ühe paljudest põhilistest tööriistadest meie esivanemate mineviku mõistmiseks. Veelgi enam, need tehnoloogiad on tohutult suurendanud genoomsete andmete hulka kogu maailmast, sealhulgas Ameerika mandritelt.
Ehkki Ameerika maailmajagu oli viimane, milleni meie sapiens’i esivanemad jõudsid, on selle geneetilise varieeruvuse protsessid olnud väga keerukad. Nende uuringud on rohkem kui kolme kümnendi jooksul olnud paljude geneetikaalaste teadustööde teemaks. Algul domineerisid Ameerika populatsioonide populatsioonigeneetilistes uuringutes uniparentaalsed geneetilised süsteemid, alustades mitokondriaalse DNA-ga (mtDNA) ja peagi kaasates Y-kromosoomi (chrY) analüüsi. Viimasest selgus, et põlisameeriklaste kaks chrY asutajahaplogruppi olid tõenäoliselt hg C ja hg Q, mida leiti vastavalt umbes 5% ja 75% põlisameerika meestest. Kuid nende haplogruppide resolutsioon ei paranenud oluliselt enne kui mõne aasta eest.
Selle doktoritöö esimese publikatsiooni (Ref I) eesmärgiks on uurida Ameerika maailmajao geneetilist ajalugu meeste perspektiivist, lahates suure täpsusastmega üleameerikalist haplogruppi Q, ning koostada kõikehõlmav ja detailne haplogrupp Q ja selle alamliinide fülogeograafia.
Uniparentaalseid geneetilisi süsteeme võib pidada kaheks lookuseks, mida kasutatakse inimese ajaloo nais- ja meesperspektiivi mõistmiseks. Nad saavad kirjeldada ainult kaht esivanemat neist tuhandetest, kes on seotud tänapäeva populatsioonide geneetilise pärandi kujundamisega. Olulisem arv esivanemaid on genoomis esindatud autosomaalsetes markerites. Seega on autosomaalsed markerid hädavajalikud Ameerika maailmajao populatsioonide liikumiste ajastuse ja dünaamika mõistmiseks. Tänu arheoloogilistele ja geneetilistele tõenditele tunnistatakse nüüdseks, et esimesed Põhja-Ameerikasse jõudnud inimesed tulid Siberist, ületades pärast hilist jääaega Beringi maakitsuse. Algsetele asulakohtadele järgnesid ulatuslikud inimeste ränded, mis jõudsid Lõuna-Ameerika lõunaossa suhteliselt kiiresti, juba ~15 000 aastat tagasi. Mitu hiljutist uuringut on selle teema kohta uut informatsiooni andnud, rekonstrueerides Ameerika maailmajao erinevate piirkondade põliselanike rühmade genoomset ajalugu, kuid Isthmo-Colombia piirkond on seni puudu.
Seega rakendab selle doktoritöö teine publikatsioon (Ref II) nii iidse kui ka tänapäeva DNA andmete analüüsi, et rekonstrueerida Isthmo-Colombia piirkonna genoomset ajalugu. Selle eesmärgiks on teha kindlaks Panama põlispopulatsioonide genoomne taust, et hinnata maakitsuse sisest varieeruvust ja selgitada Kolumbuse-eelsete ameeriklaste genoomset ajalugu, hinnates Isthmo-Colombia piirkonna sidemeid ülejäänud Ameerika maailmajaoga.
Lisaks esialgsetele rännetele pärinevad Ameerika populatsioonid mitmest segunemisest, alates koloniseerimisest ja Atlandi orjakaubandusest. Peale selle toimus viimase kahe sajandi jooksul palju rändelaineid, millele järgnes kohalik segunemine, ning nende mõju on suuresti uurimata.
Selle doktoritöö kolmas publikatsioon (Ref III) uurib, kuidas hilisemad ränded kujundasid segunenud Ameerika populatsioonide genoomset tausta. Täpsemalt on selle uuringu eesmärgiks rekonstrueerida kõrgel lahutusastmel põlvnemise komponendid, anda hinnang segunemise ajale, uurida erinevate mandrite põlvnemise demograafilist evolutsiooni pärast segunemist ning hinnata soost sõltuva geenivoolu dünaamika ulatust ja tugevust segunenud Ameerika populatsioonides.
Käesoleva doktoritöö peamised tulemused ja järeldused on järgmised:
• Tehti kindlaks ja dateeriti kõrge resolutsiooniga haplogrupp Q fülogeneesipuu, mis annab uut informatsiooni oma Euraasia ja Ameerika harude geograafilise jaotuse kohta tänapäeva ja iidsetes proovides.
• Esimest korda tuvastati kaks eristuvat Y-kromosoomi liini, mis peegeldavad hiljutistes genoomsetes uuringutes varem kirjeldatud kaht peamist põlvnemiskomponenti (SNA ja NNA). Nende liinide lahknemine toimus tõenäoliselt Beringi maakitsuse idaosas enne Ameerika maailmajakku sisenemist, milleks kasutati kaht teed: ranniku (SNA, Q-Z780/Q-M848) ja sisemaa teed (NNA, Q-Y4276). Sinna jõudnuna segunesid need kaks põlvnemiskomponenti Põhja-Ameerikas tõenäoliselt väga vara, millele viitab iidne Kennewicki mees, kelle tuumagenoom kuulub SNA komponenti (Q-M848), kuid mtDNA haplogrupp on NNA-st (X2a).
• Avastati SNA liinide kaks märkimisväärset ekspansiooni Meso- ja Lõuna-Ameerikas, üks umbes 15 000 aastat tagasi, kohe pärast esmaasustamist, ja teine 3000 aastat tagasi pärast klimaatilisi muutusi ja kohalikke kultuurilisi nihkeid.
• Panama sees tuvastati märkimisväärne geneetiline struktuur, mis kattus üldjoontes käesolevas uuringus analüüsitud mineviku ja praeguste põliselanike rühmadega. Need rühmad on ka tuhandeid aastaid suguluses olnud, eriti Kariibi mere piirkonnas Panama lääne- ja Costa Rica kaguosa piiril. Ida-Panama põliselanike rühmade vahel ning Emberá ja hispaanlaste-eelsete panamalaste vahel, kes elasid Vana Panamat ümbritsevas piirkonnas enne kontakti eurooplastega, leiti vähem geneetilisi sarnasusi.
• Ameerika maailmajao iidsete põliselanike seas avastati varem kirjeldamata põlvnemiskomponent. See komponent esineb ainult selles piirkonnas ning on tuvastatav iidsetes hispaanlaste-eelsetes indiviidides ja inimestes, kes ise identifitseerivad end tänapäeva põliselanike, Aafrika ja latiino-põliselanike rühmade järglastena. See jõudis Panama maakitsusele rohkem kui 10 000 aastat tagasi, levis varases Holotseenis lokaalselt ning jättis tänapäevani püsivaid genoomseid jälgi, eriti Guna rahva hulgas.
• Euroopa geneetiline panus Ameerika populatsioonidesse peegeldab kolonisatsiooni aegset geopoliitilist olukorda. Avastati mitu sekundaarset Euroopa allikat, mis panustasid arvestatavasse osasse Ameerika populatsioonidest, nt Itaalia Brasiilias ja Argentiinas, Kesk-Euroopa Brasiilias. Tuletati Aafrika allikate eristuv panus Ameerika populatsioonidesse.
• Segunemise ajad langevad kokku rändelainetega Euroopast ja peegeldavad ekspluateeritud Aafrika piirkondade muutumist ajas.
• Segunemise demograafilise mõju analüüsist selgub üldine languse ja taastumise muster mitmes uuritavas populatsioonis, mis vastab koloniaalajastu algusele ja lõpule. Kuid Peruud ja Mehhikot iseloomustavad erinevad demograafilised trajektoorid.
• Soost sõltuva segunemise dünaamika analüüs viitab sellele, et tänapäeva populatsioonidesse on panustanud rohkem Ameerika naisi kui mehi. Vastupidiselt oli Euroopa meeste panus olulisem kui samalt mandrilt pärinevate naiste oma. Sellele vastandlikult ilmnes mõnes populatsioonis, kuid mitte kõigis, tõendeid suuremast naiste panusest, mis on osaliselt vastuolus ajalooliste andmetega Aafrika päritolust.
The evolution of American populations has been the subject of several multidisciplinary studies. Our knowledge regarding the formation of the genetic diversity of the Americas is still incomplete, although genetic studies are constantly adding new details on this topic. The development of new technologies, such as Next Generation Sequencing (NGS), together with other technical improvements, has made it possible to extract and analyse DNA from ancient specimens, making "ancient genomics" (aDNA) one of the many fundamental tools for understanding our ancestors' past. Moreover, these technologies have enormously increased the amount of genomic data available worldwide, including data from the Americas.
Although the Americas were the last continents to be reached by our sapiens ancestors, their genetic variation processes have been extremely complex. Their studies have been the topic of many genetic surveys for more than three decades. In the beginning, uniparental systems dominated the population genetics research of American populations. It started with mitochondrial DNA (mtDNA) and soon included the Y chromosome (chrY) analysis. The latter revealed that the two founding Native American chrY haplogroups probably were Hg C and Hg Q, accounting for about 5% and 75% of Native American males, respectively. However, the resolution of these haplogroups did not undergo substantial improvements until a few years ago.
The first publication included in this dissertation (Ref I) aims to investigate from a male perspective the genetic history of the Americas through a fine dissection of the Pan-American haplogroup Q and to reconstruct a comprehensive and detailed haplogroup Q phylogeography and that of its sub-lineages.
The uniparental systems could be considered as two loci that are used to understand the female and male perspective of human history. They can describe only two ancestors of the thousands involved in shaping the genetic legacy of modern populations. The genomic representation of a more significant number of ancestors is encrypted in the autosomal markers. Therefore, autosomal markers are crucial to understanding the timing and the dynamics of population movements in the Americas. Thanks to archaeological and genetic evidence, it is now accepted that the first people arriving in North America came from Siberia, passing through Beringia after late Glacial times. Initial settlements were followed by widespread people movements that reached southern South America relatively fast, as early as ~15 thousand years ago. Several recent studies have provided new information about this subject, reconstructing the genomic history of indigenous groups from different regions of the Americas, but the Isthmo-Colombian area is still lacking.
Hence, the second publication of this thesis (Ref II) employed both ancient and modern DNA data analysis to reconstruct the genomic history of the Isthmo-Colombian area. It aims to define the genomic background of Panamanian indigenous populations to evaluate the intra-Isthmus variability and shed light on pre-Columbian Americans' genomic history assessing the connection between the Isthmo-Colombian area and the rest of the Americas.
Besides the first migrations, American populations result from several admixture events since the colonial era and the Atlantic slave trade. Moreover, many waves of migration followed by local admixture occurred in the last two centuries, the impact of which has been largely unexplored.
The third reference in this thesis (Ref III) explores how more recent migrations shaped the genomic background of admixed American populations. In particular, this study aims to reconstruct the fine-scale ancestry composition, estimate the time of admixture, examine the demographic evolution of different continental ancestries after the admixture and assess the extent and magnitude of sex-biased gene-flow dynamics in admixed American populations.
The main results and conclusions of this research thesis are the following:
• A high-resolution haplogroup Q phylogeny that presents new insights into its Eurasian and American branches' geographic distribution in modern and ancient samples was ascertained and dated.
• For the first time, two distinct Y chromosome lineages reflecting the two main ancestral components (SNA and NNA) earlier described by recent genomic studies were observed. The differentiation of these lineages probably occurred in eastern Beringia before entering the Americas through two routes: the coastal (SNA, Q-Z780/Q-M848) and the internal route (NNA, Q-Y4276). Once there, these two ancestral components probably admixed very early in North America, as suggested by the ancient Kennewick nuclear genome belonging to SNA (Q-M848) yet carrying an NNA mtDNA haplogroup (X2a).
• Two significant expansions of the SNA lineages in Meso- and South America, one around 15 kya, early after the first peopling, and another at 3 kya, following climatic changes and local cultural shifts, were revealed.
• A remarkable genomic structure within Panama was identified, mainly overlapping with past and present Indigenous groups analysed in this study. These groups also show relatedness, especially in the Caribbean region on the border between western Panama and southeastern Costa Rica over thousands of years. Fewer genetic similarities were identified between the Indigenous groups located in eastern Panama and between the Emberá and the pre-Hispanic Panamanians who lived in the area around Old Panama before European contact.
• A previously undescribed ancestry among ancient Indigenous peoples of the Americas was revealed. This ancestry is unique to the region and detectable in the ancient pre-Hispanic individuals and the self-identified descendants of current Indigenous, African and Hispano-Indigenous groups. It reached the Panama land bridge over 10 thousand years ago, expanded locally during the early Holocene, and left genomic traces up to the present day, especially among the Guna.
• The European genetic contribution in American populations mirrors the geopolitical situation during colonisation. Several European secondary sources contributing to a substantial proportion of American populations were revealed, e.g. Italy in Brazil and Argentina, Central Europe in Brazil. A differential contribution of African sources among American populations was inferred.
• Times of admixture are concordant with migration waves from Europe and reflect differences in African areas exploited through time.
• The investigation of the demographic impact of admixture reveals a general decline and recovery pattern in several populations under study corresponding to the beginning and the end of the Colonial Era. However, Peru and Mexico are characterised by different demographic trajectories.
• The analysis of sex-biased admixture dynamics suggests that more American females than males have contributed to the modern populations, whereas European males contributed more than European females. For African ancestry, some populations, but not all, showed evidence of a higher female contribution, partially conflicting with historical records.
https://www.ester.ee/record=b545015
Modelling the genomic structure, and antiviral susceptibility of Human Cytomegalovirus
Human Cytomegalovirus (HCMV) is found ubiquitously in humans worldwide, and once acquired, the infection persists within the host throughout their life. Although immunocompetent people are rarely affected by HCMV infections, the related diseases pose a major health problem worldwide for those with compromised or suppressed immune systems, such as transplant recipients. Additionally, congenital transmission of HCMV is the most common infectious cause of birth defects globally and is associated with a substantial economic burden.
This thesis explores the application of statistical modelling and genomics to unpick three key areas of interest in HCMV research. First, a comparative genomics analysis of global HCMV strains was undertaken to delineate the molecular population structure of this highly variable virus. By including in-house sequenced viruses of African origin and by developing a statistical framework to deconvolute highly variable regions of the genome, novel and important insights into the co-evolution of HCMV with its host were uncovered.
Second, a rich database relating mutations to drug sensitivity was curated for all the antiviral-treated herpesviruses. This structured information, along with the development of a mutation annotation pipeline, allowed the further development of statistical models that predict the phenotype of a virus from its sequence. The predictive power of these models was validated for HSV1 by using external unseen mutation data provided in collaboration with the UK Health Security Agency.
Finally, a nonlinear mixed effects model, expanded to account for Ganciclovir pharmacokinetics and pharmacodynamics, was developed by making use of rich temporal HCMV viral load data. This model allowed the estimation of the impact of immune-clearance versus antiviral inhibition in controlling HCMV lytic replication in already established infections post-haematopoietic stem cell transplant
- …