377 research outputs found

    MWS and FWS Codes for Coordinate-Wise Weight Functions

    A combinatorial problem concerning the maximum size of the (Hamming) weight set of an $[n,k]_q$ linear code was recently introduced. Codes attaining the established upper bound are the Maximum Weight Spectrum (MWS) codes. Those $[n,k]_q$ codes with the same weight set as $\mathbb{F}_q^n$ are called Full Weight Spectrum (FWS) codes. FWS codes are necessarily "short", whereas MWS codes are necessarily "long". For fixed $k,q$ the values of $n$ for which an $[n,k]_q$-FWS code exists are completely determined, but the determination of the minimum length $M(H,k,q)$ of an $[n,k]_q$-MWS code remains an open problem. The current work broadens discussion first to general coordinate-wise weight functions, and then specifically to the Lee weight and a Manhattan-like weight. In the general case we provide bounds on $n$ for which an FWS code exists, and bounds on $n$ for which an MWS code exists. When specializing to the Lee or to the Manhattan setting we are able to completely determine the parameters of FWS codes. As with the Hamming case, we are able to provide an upper bound on $M(\mathcal{L},k,q)$ (the minimum length of Lee MWS codes), and pose the determination of $M(\mathcal{L},k,q)$ as an open problem. On the other hand, with respect to the Manhattan weight we completely determine the parameters of MWS codes. Comment: 17 page
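
To make the notion of a weight set concrete, the following is a minimal brute-force sketch rather than anything taken from the paper. It assumes `q` is prime (so arithmetic mod `q` gives the field), takes the weight set to mean the set of nonzero weights, and uses hypothetical helper names (`weight_set`, `is_fws`). It is only practical for very small `n`, `k`, and `q`.

```python
from itertools import product

def hamming_weight(word, q):
    """Number of nonzero coordinates of the word (entries reduced mod q)."""
    return sum(1 for x in word if x % q != 0)

def lee_weight(word, q):
    """Sum over coordinates of min(x, q - x), with x reduced mod q."""
    return sum(min(x % q, q - x % q) for x in word)

def codewords(generator, q):
    """Enumerate all codewords spanned by the rows of `generator` over Z_q."""
    n = len(generator[0])
    for coeffs in product(range(q), repeat=len(generator)):
        yield tuple(
            sum(c * row[j] for c, row in zip(coeffs, generator)) % q
            for j in range(n)
        )

def weight_set(generator, q, weight):
    """Set of nonzero weights attained by the code (assumed convention)."""
    return {weight(c, q) for c in codewords(generator, q)} - {0}

def is_fws(generator, q, weight):
    """True if the code has the same weight set as the whole space Z_q^n."""
    n = len(generator[0])
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    return weight_set(generator, q, weight) == weight_set(identity, q, weight)

G = [[1, 0, 0], [0, 1, 1]]                    # a [3,2]_2 code
print(weight_set(G, 2, hamming_weight))       # {1, 2, 3}
print(is_fws(G, 2, hamming_weight))           # True
print(weight_set([[1, 2]], 5, lee_weight))    # {3}
```

The same `weight_set` helper accepts any coordinate-wise weight function with the signature `weight(word, q)`, which mirrors the generalisation the abstract describes.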

    Quasi-Perfect Lee Codes of Radius 2 and Arbitrarily Large Dimension

    A construction of two-quasi-perfect Lee codes is given over the space $\mathbb{Z}_p^n$ for p prime, p ≡ ±5 (mod 12), and n = 2[p/4]. It is known that there are infinitely many such primes. Golomb and Welch conjectured that perfect codes for the Lee metric do not exist for dimension n ≥ 3 and radius r ≥ 2. This conjecture was proved to be true for large radii as well as for low dimensions. The codes found are very close to being perfect, which illustrates the hardness of the conjecture. A series of computations shows that related graphs are Ramanujan, which could provide further connections between coding theory and graph theory

    Genomic selection in farm animals: accuracy of prediction and applications with imputed whole-genome sequencing data in chicken

    Genomic prediction has been successfully applied in many livestock breeding schemes, based on different densities of single nucleotide polymorphism (SNP) array data. With the availability of whole-genome sequencing (WGS) data, which may contain the causal mutations, a growing number of studies conduct genomic prediction with WGS data. The main objective of this thesis was to investigate the possibility of imputing SNP array data up to the whole-genome sequence level (Chapter 2) and then to perform genomic prediction based on the imputed WGS data and SNP array data with different genomic relationship matrices to account for genetic architecture (Chapter 3). To further understand the accuracy of genomic prediction, a simulation study was performed to determine the degree to which existing approaches overestimate this accuracy, and to propose a new method for approximating it (Chapter 4). The technical progress in the last decade has made it possible to sequence millions of DNA reads in a relatively short time frame. Several variant callers based on different algorithms have emerged and have made it possible to extract SNPs from the whole-genome sequence. Often, only a few individuals of a population are sequenced completely, and imputation is used to obtain genotypes at all sequence-based SNP loci for the other individuals, which have been genotyped for a subset of SNPs using a genotyping array. 
Thus, in Chapter 2 we first compared the sets of variants detected with different variant callers, namely GATK, freebayes and SAMtools, and checked the quality of genotypes of the called variants in a set of 50 fully sequenced white and brown layers. There were 1,741,573 SNPs detected by all three callers on the studied chromosomes 3, 6, and 28, which was 71.6% (81.6%, 88.0%) of SNPs detected by GATK (SAMtools, freebayes) in total. Genotype concordance (GC), defined as the proportion of individuals whose array-derived genotypes are the same as the sequence-derived genotypes over all non-missing SNPs on the array, was 0.98 with GATK, 0.98 with SAMtools, and 0.97 with freebayes, averaged over all SNPs on the studied chromosomes. Furthermore, for GATK (SAMtools, freebayes) 90 (88, 75) percent of variants had high values (>0.9) for other quality measures (non-reference sensitivity, non-reference genotype concordance and precision). Performance of all variant callers studied was very good in general, particularly for GATK and SAMtools. Second, we assessed the imputation accuracy (measured as the correlation between imputed and true genotypes per SNP and per individual, and as the number of genotype conflicts between father-progeny pairs) when imputing from high-density SNP array data to whole-genome sequence using data from approximately 1000 individuals from six generations. Three different imputation programs (Minimac, FImpute and IMPUTE2) were checked in different validation scenarios. Across all imputation programs, correlation between true and imputed genotypes was >0.95 on average with 1000 randomly masked SNPs from the SNP array and >0.85 for a leave-one-out cross-validation within sequenced individuals. FImpute performed slightly worse than Minimac and IMPUTE2 in terms of genotype correlation, especially for SNPs with low minor allele frequency; however, it did have the lowest number of Mendelian conflicts in available father-progeny pairs. 
Correlations of real and imputed genotypes remained consistently high even if the individuals to be imputed were several generations away from the sequenced individuals. In conclusion, GATK performed best among the three variant callers tested, and Minimac performed best among the three imputation programs tested. Based on the conclusions in Chapter 2, we applied genomic prediction with imputed WGS data in Chapter 3. A commercial brown layer line comprising 892 chickens from 6 generations was used in the study. These chickens were genotyped with a high-density array. Using the WGS data of 25 individuals, those array data were imputed up to the sequence level. The imputation was done with Minimac3, which needs pre-phased data (generated here with Beagle4). Accuracy of genomic prediction was measured as the correlation between de-regressed proofs and direct genomic breeding values for eggshell strength, feed intake and laying rate. In this study, besides the accuracy of genomic prediction based on array data and WGS data, accuracy based on different genomic relationship matrices to account for genetic architecture was investigated. The alternative weighting factors used were uniform, -(log_10 P) from a t-test in a genome-wide association study, and the square of estimated SNP effects from random regression BLUP. Best linear unbiased prediction given genetic architecture (BLUP|GA) was investigated as well. Prediction with uniform weights (the original GBLUP) was implemented with all SNPs or with only genic SNPs, both based on array and imputed whole-sequence data. Averaging over the studied traits, predictive ability with only genic SNPs in WGS data was 0.366 ± 0.075, which was the highest predictive ability observed in the current study. Genomic prediction with genic SNPs in high-density array data provided the second highest accuracy (0.361 ± 0.072). 
Prediction with -(log_10 P) or squares of SNP effects as weighting factors for building a genomic relationship matrix, or with BLUP|GA, did not lead to higher accuracy compared to that with uniform weights, regardless of the SNP set used. The results from this study showed that little or no benefit was gained when using all imputed WGS data to perform genomic prediction compared to using HD array data, regardless of the different SNP weightings tested. However, higher predictive ability was observed when using only genic SNPs extracted from the WGS data for genomic prediction. Decisions in genomic selection schemes are made based on the genomic breeding values (GBV) of selection candidates. Thus, the accuracy of GBV is a relevant parameter, as it reflects the stability of the prediction and the possibility that the GBV might change when more information becomes available. It is also one of the key factors in the expected response to selection, also known as the breeders' equation. Accuracy of genomic prediction, however, is difficult to assess, considering that true breeding values (TBV) of the candidates are not available in reality. In previous studies, several methods were proposed to assess the accuracy of GBV by using population and trait parameters (e.g. the effective population size, the reliability of quasi-phenotypes used, the number of independent chromosome segments) or parameters inferred from the mixed model equations. In practice, most approaches were found to overestimate the accuracy of genomic prediction. Thus, in Chapter 4 we tested several approaches used in previous studies based on simulated data under a variety of parameters mimicking different livestock breeding programs (i.e. a cattle-like and a pig-like as well as a basic scenario) and measured the magnitude of overestimation. Then we proposed a novel and computationally feasible method. Based on the comparison in Chapter 4, the new method provided a better prediction for the accuracy of GBV. 
The method still had one unknown parameter, for which we suggested an approach to approximate its value from a suitable data set reflecting two separate time points. In conclusion, the new approach provided a better assessment of the accuracy of GBVs in many cases
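
The two quality measures named in this abstract, per-SNP correlation between true and imputed genotypes and genotype concordance, can be illustrated with a small sketch. This is not the thesis's own code: it assumes genotypes are coded 0/1/2 (copies of one allele), uses `None` for missing calls, and the function names and toy matrices are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def per_snp_correlation(true_g, imputed_g):
    """Correlation between true and imputed genotypes, one value per SNP.

    Rows are individuals, columns are SNPs; pairs with a missing call are skipped.
    """
    result = []
    for j in range(len(true_g[0])):
        pairs = [(t[j], m[j]) for t, m in zip(true_g, imputed_g)
                 if t[j] is not None and m[j] is not None]
        result.append(pearson([p[0] for p in pairs], [p[1] for p in pairs]))
    return result

def per_snp_concordance(true_g, imputed_g):
    """Proportion of individuals with identical non-missing genotypes, per SNP."""
    result = []
    for j in range(len(true_g[0])):
        pairs = [(t[j], m[j]) for t, m in zip(true_g, imputed_g)
                 if t[j] is not None and m[j] is not None]
        result.append(sum(a == b for a, b in pairs) / len(pairs))
    return result

# Toy data: 4 individuals x 3 SNPs, with one imputation error at the last SNP.
true_g    = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
imputed_g = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 1]]
print(per_snp_concordance(true_g, imputed_g))  # [1.0, 1.0, 0.75]
```

Correlation is often preferred over raw concordance for rare variants, since concordance can be high simply because most individuals carry the common genotype; that motivation is general background rather than a claim made in the abstract.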

    Perfect codes in the Lee and Chebyshev metrics and iterating Rédei functions

    Advisors: Sueli Irene Rodrigues Costa, Daniel Nelson Panario Rodriguez. Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica. Abstract: The content of this thesis lies within two very active research areas: the theory of error-correcting codes and dynamical systems over finite fields. To approach problems in both topics we introduce a type of finite sequence called v-series. A metric is introduced on the set of such sequences, inducing a poset structure used to determine all possible abelian group structures represented by perfect codes in the Chebyshev metric. Moreover, each v-series is associated with a rooted tree, which plays an important role in results related to the cycle structure of iterating Rédei functions. Regarding the theory of error-correcting codes, we study perfect codes in the Lee metric and the Chebyshev metric (corresponding to the lp metric for p=1 and p=infinity, respectively). The main results here are related to the description of n-dimensional q-ary codes with packing radius e which are perfect in these metrics, obtaining their generator matrices and their classification up to isometry and up to isomorphism. Several constructions of perfect codes in the Chebyshev metric are given and interesting families of such codes are presented. Regarding dynamical systems over finite fields, we focus on iterating Rédei functions, where our main result is a structural theorem which allows us to extend several results on Rédei functions. This theorem can also be applied to other maps, allowing simpler proofs of some known results related to the number of connected components, the number of periodic points and the expected value of the period and preperiod for iterating exponentiations over finite fields. Doctorate. Applied Mathematics. Doctor in Applied Mathematics. 2012/10600-2, FAPESP, CAPE

    The discovery of novel recessive genetic disorders in dairy cattle : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Animal Science at AL Rae Centre of Genetics and Breeding, Massey University, Palmerston North, New Zealand

    The selection of desirable characteristics in livestock has resulted in the transmission of advantageous genetic variants for generations. The advent of artificial insemination has accelerated the propagation of these advantageous genetic variants and led to tremendous advances in animal productivity. However, this intensive selection has led to the rapid uptake of deleterious alleles as well. Recently, a recessive mutation in the GALNT2 gene was found to dramatically impair growth and production traits in dairy cattle, causing small calf syndrome. The research presented here seeks to further investigate the presence and impact of recessive mutations in dairy cattle. A primary aim of genetics is to identify causal variants and understand how they act to manipulate a phenotype. As datasets have expanded, larger analyses are now possible and statistical methods to discover causal mutations have become commonplace. One such method, the genome-wide association study (GWAS), presents considerable exploratory utility in identifying quantitative trait loci (QTL) and causal mutations. GWAS have predominantly focused on identifying additive genetic effects, assuming that each allele at a locus acts independently of the other, whereas non-additive effects including dominant, recessive, and epistatic effects have been neglected. Here, we developed a single-locus non-additive GWAS model intended for the detection of dominant and recessive genetic mechanisms. We applied our non-additive GWAS model to growth, developmental, and lactation phenotypes in dairy cattle. We identified several candidate causal mutations that are associated with moderate to large deleterious recessive disorders of animal welfare and production. These mutations included premature-stop (MUS81, ITGAL, LRCH4, RBM34), splice-disrupting (FGD4, GALNT2), and missense (PLCD4, MTRF1, DPF2, DOCK8, SLC25A4, KIAA0556, IL4R) variants, and these occur at surprisingly high frequencies in cattle. 
We further investigated these candidates for anatomical, molecular, and metabolic phenotypes to understand how these disorders might manifest. In some cases, these mutations were analogous to disorder-causing mutations in other species; these included Coffin-Siris syndrome (DPF2), Charcot-Marie-Tooth disease (FGD4), a congenital disorder of glycosylation (GALNT2), hyper-immunoglobulin E syndrome (DOCK8), Joubert syndrome (KIAA0556), and mitochondrial disease (SLC25A4). These discoveries demonstrate that deleterious recessive mutations exist in dairy cattle at remarkably high frequencies and that these disorders can be detected through modern genotyping and phenotyping capabilities. These are important findings that can be used to improve the health and productivity of dairy cattle in New Zealand and internationally

    2018 IMSAloquium, Student Investigation Showcase

    This is IMSA's 31st year of leading in educational innovation and the 30th year of the Student Inquiry and Research Program (SIR)! ... These studies have all happened during the past year in a variety of laboratories, real or virtual, on and off campus. Students were asked to not only learn a great deal about complex topics, but to contribute to them in meaningful ways. The presentations you hear today reflect the various stages of their work on a myriad of projects. https://digitalcommons.imsa.edu/archives_sir/1028/thumbnail.jp

    Statistical Methods and Analysis to Identify Disease-Related Variants from Genetics Studies

    Advances in genotyping and sequencing technologies have revolutionized the analytic methods in genetics research. Due to the dramatically decreasing per-genotype cost, millions of variants have been detected and genotyped from population-scale data. The findings provide new insight into the human genome and are continuously shaping our understanding of the genetic basis of disease. In this dissertation, I focus on three topics related to discovering disease-related variants in genetics studies, covering both method development and dataset analysis. In chapter 2, I develop a likelihood-based method, LIME, to detect and genotype mobile element insertions (MEIs), a specific type of large insertion, from sequencing data. The method generates genotype likelihoods for each MEI using simulation that mimics the distribution of reads in regions with and without MEIs. On both simulated and real sequence data, our method shows better sensitivity than existing methods, especially in low-coverage data. In chapter 3, I present genome-wide association studies and a whole-genome sequencing effort to discover potentially novel loci for colorectal cancer. Using an imputation-based meta-analysis strategy, I replicate many previous findings and provide a list of novel variants and genes for colorectal cancer. In collaboration with Fred Hutch Cancer Research Center, we additionally sequenced ~3,000 individuals and generated a variant call set. By incorporating gene annotation, sequence function prediction and an online gene expression database, I highlight potentially functional loci for colorectal cancer in the known region 12q12 and the novel region 6q21.31. Although it is difficult to obtain new significant variants in the absence of an extremely large dataset, our analysis provides some practical examples of how to incorporate functional genomics data into association analysis and to prioritize potentially functional candidates under limited sample size. 
Additionally, from the variant calling of whole-genome sequencing samples, we identified over 50 million variants, half of them novel to the dbSNP database. In chapter 4, I describe a major update to the meta-analysis software RAREMETAL that brings in software engineering improvements and several useful new methods for rare variant analysis. The engineering improvements make RAREMETAL more computationally efficient. In addition, the new methods preserve the ability to perform meta-analysis with unbalanced studies, multi-allelic sites and generalized linear mixed models. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/138617/1/saichen_1.pd