1,196 research outputs found

Computational Methods for Sequencing and Analysis of Heterogeneous RNA Populations

Author: Glebova Olga
Publication venue: ScholarWorks @ Georgia State University
Publication date: 15/12/2016

Next-generation sequencing (NGS) and mass spectrometry technologies bring unprecedented throughput, scalability and speed, facilitating the study of biological systems. These technologies make it possible to sequence and analyze heterogeneous RNA populations rather than single sequences. In particular, they provide the opportunity to implement massive viral surveillance and transcriptome quantification. However, to fully exploit the capabilities of NGS technology, we need computational methods that can analyze billions of reads for the assembly and characterization of sampled RNA populations. In this work we present novel computational methods for cost- and time-effective analysis of sequencing data from viral and RNA samples. In particular, we describe: i) computational methods for transcriptome reconstruction and quantification; ii) a method for mass spectrometry data analysis; iii) a combinatorial pooling method; iv) computational methods for the analysis of intra-host viral populations.

ScholarWorks @ Georgia State University

Optimization Techniques For Next-Generation Sequencing Data Analysis

Author: Caciula Adrian
Publication venue: ScholarWorks @ Georgia State University
Publication date: 12/08/2014

High-throughput RNA sequencing (RNA-Seq) is a popular, cost-efficient technology with many medical and biological applications. This technology, however, presents a number of computational challenges in reconstructing full-length transcripts and accurately estimating their abundances across all cell types. Our contributions include (1) transcript and gene expression level estimation methods, (2) methods for genome-guided and annotation-guided transcriptome reconstruction, and (3) de novo assembly and annotation of real data sets. Transcript expression level estimation, also referred to as transcriptome quantification, tackles the problem of estimating the expression level of each transcript. Transcriptome quantification is crucial for identifying similar transcripts and for unraveling gene functions and transcription regulation mechanisms. We propose a novel simulated-regression-based method for transcriptome frequency estimation from RNA-Seq reads. Transcriptome reconstruction refers to the problem of reconstructing the transcript sequences from the RNA-Seq data. We present genome-guided and annotation-guided transcriptome reconstruction methods. Empirical results on both synthetic and real RNA-Seq datasets show that the proposed methods improve transcriptome quantification and reconstruction accuracy compared with current state-of-the-art methods. We further present the assembly and annotation of the Bugula neritina transcriptome (a marine colonial animal) and the Tallapoosa darter genome (a freshwater fish from a species-rich radiation).

ScholarWorks @ Georgia State University

Sparse Linear Identifiable Multivariate Modeling

Author: Aapo Hyvärinen
Dtu Informatics
Ole Winther
Ricardo Henao
Richard Petersens Plads
Publication date: 01/01/2011

In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component delta-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and benchmarked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs as well as or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling), and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Comment: 45 pages, 17 figures
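
As a rough illustration of the two-component (delta-function plus continuous) slab and spike prior mentioned in the abstract, the following minimal sketch draws coefficients that are exactly zero with some probability and continuous otherwise. It is not the SLIM implementation: the function name, the Gaussian slab, and all parameter values are assumptions chosen for illustration.

```python
import random


def sample_spike_and_slab(n, inclusion_prob, slab_sd, seed=0):
    """Draw n coefficients from a two-component prior (illustrative only).

    Each coefficient is exactly 0.0 with probability 1 - inclusion_prob
    (the spike) and Gaussian with standard deviation slab_sd otherwise
    (the slab). The Gaussian slab is an assumption of this sketch.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        if rng.random() < inclusion_prob:
            draws.append(rng.gauss(0.0, slab_sd))  # slab: continuous component
        else:
            draws.append(0.0)  # spike: point mass at zero
    return draws


# Hypothetical settings: 10,000 coefficients, 20% expected to be nonzero.
coeffs = sample_spike_and_slab(n=10_000, inclusion_prob=0.2, slab_sd=1.0)
nonzero_fraction = sum(c != 0.0 for c in coeffs) / len(coeffs)
print(f"fraction of nonzero coefficients: {nonzero_fraction:.3f}")
```

The point mass at zero is what makes the resulting loading matrix sparse; the inclusion probability controls how sparse.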

arXiv.org e-Print Archive

CiteSeerX

Online Research Database In Technology

Extracting circadian clock information from a single time point assay

Author: Vlachou Denise F.

A working internal circadian clock allows a healthy organism to keep time in order to anticipate transitions between night and day, allowing the temporal optimisation and control of internal processes. The internal circadian clock is regulated by a set of core genes that form a tightly coupled oscillator system. These oscillators are autonomous and robust to noise, but can be slowly reset by external signals that are processed by the master clock in the brain. In this thesis we explore the robustness of a tightly coupled oscillator model of the circadian clock, and show that its deterministic and stochastic forms are both significantly robust to noise. Using a simple linear algebra approach to rhythmicity detection, we show that a small set of circadian clock genes are rhythmic and synchronised in mouse tissues, and rhythmic and synchronised in a group of human individuals. These sets of tightly regulated, robust oscillators are the genes that we use to define the expected behaviour of a healthy circadian clock. We use these “time fingerprints” to design a model, dubbed “Time-Teller”, that can be used to tell the time from single time point samples of mouse or human transcriptome. The dysfunction of the molecular circadian clock is implicated in several major diseases and there is significant evidence that disrupted circadian rhythm is a hallmark of many cancers. Convincing results showing the dysfunction of the circadian clock in solid tumours are lacking because of the difficulty of studying circadian rhythms in tumours within living mammals. Instead of developing biological assays to study this, we take advantage of the design of Time-Teller, using its underlying features to build a metric, ϴ, that indicates dysfunction of the circadian clock. We use Time-Teller to explore the clock function of samples from existing, publicly available tumour transcriptome data.
Although multiple algorithms have been published with the aim of “time-telling” using transcriptome data, none of them has been reported to be able to tell the time of single samples or to provide metrics of clock dysfunction in single samples. Time-Teller is presented in this thesis as an algorithm that both tells the time of a single time-point sample and provides a measure of clock function for that sample. In a case study, we use the clock function metric, ϴ, as a retrospective prognostic marker for breast cancer using data from a completed clinical trial. ϴ is shown to correlate with many prognostic markers of breast cancer, and we show how ϴ could also be a predictive marker for treatment efficacy and patient survival.

Warwick Research Archives Portal Repository

Transcript assembly, quantification and differential alternative splicing detection from RNA-Seq

Author: Liu Ruolin
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2017

This dissertation focuses on improving RNA-Seq processing in terms of transcript assembly, transcript quantification and detection of differential alternative splicing. There are two major challenges in solving these three problems. The first is accurately deriving transcript-level expression values from RNA-Seq reads that often align ambiguously to a set of overlapping isoforms. To make matters worse, gene annotation tends to misguide transcript quantification, as new transcripts are often discovered in new RNA-Seq experiments. The second challenge is accounting for intrinsic uncertainty or variability in RNA-Seq measurement when calling differential alternative splicing from multiple samples across two conditions. These sources of uncertainty include coverage bias and biological variation. Failing to account for them can lead to higher false positive rates. To address these challenges, I develop a series of novel algorithms, which are implemented in a software package called Strawberry. To tackle the read assignment uncertainty challenge, Strawberry assembles aligned RNA-Seq reads into transcripts using a constrained flow network algorithm. After the assembly, Strawberry uses a latent class model to assign reads to transcripts. These two steps use different optimization frameworks but share the same graph structure, which allows a highly efficient, expandable and accurate algorithm for dealing with large data. To infer differential alternative splicing, Strawberry extends the single-sample quantification model by imposing a generalized linear model on the relative transcript proportions. To account for count overdispersion, Strawberry uses an empirical Bayesian hierarchical model. For coverage bias, Strawberry performs a bias correction step that borrows information across samples and genes before fitting the differential analysis model. A series of simulated and real data sets are used to evaluate and benchmark Strawberry's results.
Strawberry outperforms Cufflinks and StringTie in terms of both assembly and quantification accuracy. In terms of detecting differential alternative splicing, Strawberry also outperforms several state-of-the-art methods, including DEXSeq, Cuffdiff 2 and DSGseq. Strawberry and its supporting code, e.g., simulation and validation, are freely available on my GitHub (https://github.com/ruolin).
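
To make the idea of assigning ambiguous reads with a latent class model concrete, here is a minimal expectation-maximization sketch. It is not Strawberry's implementation: the function name and the toy compatibility matrix are hypothetical, and the sketch ignores transcript lengths, coverage bias and everything else a real quantifier would model.

```python
def em_transcript_proportions(compat, n_iter=200):
    """Estimate relative transcript proportions with a plain EM loop.

    compat[r][t] is 1 if read r is compatible with transcript t, else 0.
    Illustrative simplification: every compatible transcript is treated
    as equally likely to generate the read, apart from its proportion.
    """
    n_tx = len(compat[0])
    theta = [1.0 / n_tx] * n_tx  # start from uniform proportions
    for _ in range(n_iter):
        counts = [0.0] * n_tx
        for row in compat:
            # E-step: split the read across its compatible transcripts
            weights = [c * p for c, p in zip(row, theta)]
            total = sum(weights)
            if total == 0.0:
                continue  # read is compatible with no transcript
            for t, w in enumerate(weights):
                counts[t] += w / total
        assigned = sum(counts)
        if assigned == 0.0:
            break  # nothing could be assigned; keep the current estimate
        # M-step: proportions are the normalised expected read counts
        theta = [c / assigned for c in counts]
    return theta


# Toy data: 3 reads unique to transcript A, 1 unique to B, 2 ambiguous.
reads = [[1, 0]] * 3 + [[0, 1]] * 1 + [[1, 1]] * 2
theta = em_transcript_proportions(reads)
print([round(p, 3) for p in theta])
```

In this toy case the ambiguous reads end up split in the same 3:1 ratio as the uniquely assigned reads, which is the behaviour one would expect from such a model.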

Digital Repository @ Iowa State University (ISU)

Bayesian Methods for Gene Expression Analysis from High-Throughput Sequencing data

Author: Glaus Peter
Publication date: 01/08/2014

The University of Manchester - Institutional Repository

Understanding the Organization and functional Control of Polysomes by integrative Approaches

Author: Lauria Fabio
Publication venue: University of Trento
Publication date: 19/12/2017

Background and rationale: Translation is a fundamental biological process occurring in cells, carried out by ribosomes simultaneously bound to an mRNA molecule (polyribosomes). It has been extensively demonstrated that dysregulation of translation is implicated in a wide range of pathologies, including tumours and neurological disorders. Recent findings reveal the existence of translational regulatory mechanisms that act in cis or trans with respect to the mRNAs and govern the movement and position of ribosomes along transcripts, or directly affect the ribosome's catalogue of constituent proteins. For this reason, translational controls also account for the widespread uncoupling between transcript and protein abundances in cells. To explain the poor correlation between transcript and protein levels, many computational models of translation have been developed. Usually, these approaches aim to predict protein abundances in cells starting from mRNA abundance. Despite the efforts of these modelling studies, a consensus model remains elusive, and the studies reach contradictory conclusions about the role of mRNA regulatory elements such as codon usage (codon usage bias) and a slowdown mechanism at the beginning of the coding sequence (ramp). More recently, following the rapid and widespread adoption of ribosome footprinting assays (RiboSeq), which enable the dissection of translation at single-nucleotide resolution, a number of computational pipelines dedicated to the analysis of RiboSeq data have been proposed. These tools are typically designed to extract gene expression alterations at the translational level, while the positional information describing the fluxes and positions of ribosomes along the transcript is still underutilized.
Therefore, polysome organization, in terms of the number and position of ribosomes along the transcript, and the translational controls involved in shaping cellular phenotypes are still open to breakthrough discoveries. Broad objectives: The aim of my thesis is the development of mathematical and computational tools, integrated with experimental data, for a comprehensive understanding of translation regulation and of the polysome organization rules governing the number of ribosomes per polysome and the ribosome position along transcripts. Project design and methods: To this end, I developed riboWaves, an integrated bioinformatics suite divided into two branches. The first branch includes two modelling modules: riboAbacus, which predicts the number of ribosomes per transcript, and riboSim, which predicts ribosome localization along mRNAs. The second branch provides two pipelines, riboWaltz and riboScan, for detailed analyses of ribosome profiling data aimed at providing meaningful and as yet unexplored ribosome positional information. The models and the pipelines are implemented in C and R, respectively. riboAbacus and riboWaltz are available on GitHub. Results: To predict the number of ribosomes per transcript and the position of ribosomes on mRNAs, I applied riboAbacus and riboSim, respectively, to the transcriptomes of different organisms (yeast, mouse, human) to understand the role of translational regulatory elements in tuning polysomes in different organisms. First, I trained riboAbacus and validated its performance using Atomic Force Microscopy images of polysomes, while the performance of riboSim was assessed using ribosome profiling data. The predictions provided by riboAbacus and riboSim were evaluated in parallel.
I showed that the average number of ribosomes translating an mRNA molecule can be well explained by the deterministic model riboAbacus, which includes as features the mRNA levels, the mRNA sequences, the codon usage bias and a slowdown mechanism at the beginning of the CDS (ramp hypothesis). The predictions of ribosome localization by riboSim, which used as features the mRNA sequence, the codon usage and the ramp, were run for yeast, mouse and human. I observed good similarity between the predicted and experimental positions of ribosomes along transcripts in yeast, but poor similarity in the two mammals, suggesting that more elaborate controls tune ribosome movement in higher eukaryotes than in simpler species. Having developed two tools for the analysis of RiboSeq data and the extraction of positional information on ribosome localization along transcripts, I applied both riboWaltz and riboScan in a case study. The aim was to dissect possible defects in ribosome localization in tissues of a mouse model of Spinal Muscular Atrophy (SMA). SMA is a neurodegenerative disorder caused by low levels of the Survival of Motor Neuron protein (SMN), in which translational impairments have recently emerged as a possible cause of the disease. I analysed ribosome profiling data obtained from three different RiboSeq variants in healthy and SMA-affected mouse brains at the early-symptomatic stage of the disease.
I observed i) a significant drop-off of translating ribosomes along the coding sequence in the SMA condition (using riboWaltz); ii) in SMA-affected mice, a possible accumulation of ribosomes along the 3' UTR in neuro-related mRNAs (using riboScan); iii) a close involvement of SMN-specialized ribosomes in the elongation stage of translation at the first codons of transcripts (riboWaltz); iv) the loss of ribosomes at the 3rd codon in SMA in transcripts bound by SMN-specialized ribosomes; and v) a remarkable connection between SMN and the down-regulation of genes in SMA-affected mice. Overall, these findings confirmed previous observations about possible SMN-related dysregulation of local protein synthesis in neurons. More importantly, they reveal a completely new role of SMN in tuning translation at multiple levels (initiation, elongation and the recycling of terminating ribosomes), opening new hypotheses and scenarios for explaining the most devastating genetic disease, the leading cause of infant mortality worldwide. Conclusions: The present work provides a new comprehensive and integrated scenario for better understanding translation and demonstrates that this approach is a very powerful strategy to pave the way for a new understanding of fine alterations in polysome organization and functional control in both physiological and pathological conditions.

Unitn-eprints PhD

Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits

Author: Crouse Wesley
He Xin
Luo Kaixuan
Qian Sheng
Stephens Matthew
Zhao Siming
Publication date: 27/01/2024

Many methods have been developed to leverage expression quantitative trait loci (eQTL) data to nominate candidate genes from genome-wide association studies. These methods, including colocalization, transcriptome-wide association studies (TWAS) and Mendelian randomization-based methods, all suffer from a key problem, however: when assessing the role of a gene in a trait using its eQTLs, nearby variants and genetic components of other genes’ expression may be correlated with these eQTLs and have direct effects on the trait, acting as potential confounders. Our extensive simulations showed that existing methods fail to account for these ‘genetic confounders’, resulting in severe inflation of false positives. Our new method, causal-TWAS (cTWAS), borrows ideas from statistical fine-mapping and allows us to adjust for all genetic confounders. cTWAS showed calibrated false discovery rates in simulations, and its application to several common traits discovered new candidate genes. In conclusion, cTWAS provides a robust statistical framework for gene discovery.

Knowledge UChicago

Accuracy of Genomic Prediction in Dairy Cattle

Author: Erbe Malena
Publication date: 16/05/2013

Genomic breeding value estimation has become a popular method in recent years, particularly in dairy cattle breeding, for obtaining reliable breeding values for animals without phenotypic information. The aim of this thesis was to examine in more detail various factors that influence the accuracy of genomic breeding value estimation in real cattle data sets. Chapter 2 contains a fundamental study on cross-validation in which the properties of different cross-validation strategies were investigated in real data sets. Cross-validation means that the available data are split into a training and a validation sample, with all observations for the individuals in the validation sample treated as missing. The values of the individuals in the validation sample are then predicted with a model fitted to the observations of the individuals in the training sample. In the context of genomic breeding value estimation, cross-validation strategies are used to represent the accuracy of genomic breeding value estimation with a given training population. The correlation between masked and predicted values of the animals in the validation sample reflects the accuracy of genomic breeding value estimation. The way in which the data set is divided into training and validation samples can influence the results of a cross-validation. The aim of this study was therefore to find optimal strategies for different purposes: describing the accuracy of genomic prediction for potential selection candidates with the available data set, or comparing two prediction methods.
A data set of about 2,300 Holstein Friesian bulls genotyped with the Illumina BovineSNP50 BeadChip (hereafter the 50K chip) was split in different ways, so that between 800 and 2,200 animals were in the training sample and the remaining animals in the validation sample. Two BLUP models were used for prediction, one with a random genomic effect and one with a random polygenic and a random genomic effect. The highest prediction accuracy was achieved with the largest training sample. With limited data, however, a large training sample also implies that the validation samples are small and the standard errors of the observed accuracies are therefore very high. If the aim of a study is to demonstrate significant differences between models, it is better to use larger validation samples. Five-fold cross-validation appears to be a good compromise in many cases. The relationship structure between the animals in the training and validation samples has a large effect on the accuracy of genomic breeding value estimation. At present, training samples still contain enough progeny-tested bulls to which the animals in the validation sample are highly related. If genomic selection is applied consistently, such individuals may become scarcer in the training sample. Chapter 3 therefore contains a study that investigates how relationship and age structure affect the accuracy of genomic breeding values of young bulls. This work was based on a data set of 5,698 Holstein Friesian bulls, all genotyped with the 50K chip and born between 1981 and 2005.
In all scenarios, the 500 youngest bulls in this data set were used as the validation sample. Various training samples of 1,500 individuals each were selected to predict the genomic breeding values of the young animals (selection candidates): a random selection of bulls, the oldest and youngest available animals, animals with relationship coefficients below 0.25 or 0.5 to all selection candidates, or the animals most closely related to the selection candidates. Compared with the random-selection scenario, reducing relatedness led to a visible decrease in the accuracy of genomic prediction. By contrast, accuracy was higher for the scenarios with highly related animals or the youngest animals in the training sample. For practical application this means that in highly related groups such as elite Holstein Friesian bulls, no further problems are to be expected for the prediction of young animals as long as sires, full sibs and half sibs are present in the training sample. New progeny-tested bulls should therefore be added to the training sample continuously; otherwise a clear decrease in accuracy will be seen after only one or two generations. Chapter 4 deals with two further factors that can influence the accuracy of genomic prediction: marker density and choice of method. Until now, 50K SNPs have normally been used for genomic breeding value estimation, but a new high-density SNP array with 777K SNPs has recently become available. This raises the question of whether the higher marker density can lead to an increase in accuracy. The more markers are available, the greater the need to develop methods that allow a proportion of the markers to be non-informative (i.e. without an effect on the trait under study).
A new and efficient Bayesian method (BayesR) was therefore developed, which assumes that the SNP effects come from a series of normal distributions with different variances. The number of SNPs per distribution is not fixed but is modelled using a Dirichlet distribution. Chapter 4 also addresses the question of how prediction accuracy behaves with multi-breed training samples at different marker densities. In dairy cattle breeds, large training samples are required to obtain robust estimates of SNP effects, but for small breeds in particular it can be difficult to build such large training samples. Training samples containing animals of several breeds may therefore be a way around this problem. With 50K SNPs the success of such multi-breed training samples was limited, which was attributed to the haplotype structure not being consistent across breeds at this marker density. The high-density SNP chip could, however, improve prediction across breeds. The changes in the accuracy of genomic breeding value estimation within a breed and across breeds were investigated with data from Australian Holstein Friesian and Jersey bulls, genotyped with the 50K chip and imputed to 777K SNPs, and two different methods (GBLUP, BayesR). The use of imputed high-density markers did not lead to a significant increase in accuracy within a breed and led to only a small improvement in accuracy in the smaller breed in the multi-breed scenario. BayesR gave accuracies equal to or, in many cases, higher than GBLUP. A further property of BayesR is that insights into the genetic architecture of the trait can be obtained from the results, e.g. by considering the average number of SNPs in the different distributions.
The accuracy of genomic breeding value estimation can be calculated with various validation procedures as soon as real data are available. In some situations, however, it can be advantageous to estimate the expected prediction accuracy before a study, e.g. to know how large the training sample should be or how high the marker density should be to reach a given accuracy. Various deterministic formulas for estimating the achievable accuracy are available in the literature, all based on more or less the same parameters. One of these parameters is the number of independently segregating chromosome segments (Me), which is normally determined deterministically from theoretical values such as the effective population size (Ne). Chapter 5 describes a maximum-likelihood approach that makes it possible to determine Me empirically on the basis of systematically designed cross-validation experiments. Building on this, various deterministic functions for predicting accuracy were compared and modified so that they best fitted the available data sets. With 5,698 Holstein Friesian bulls genotyped with the 50K chip and 1,333 Braunvieh bulls genotyped with the 50K chip and imputed to 777K SNPs, various k-fold cross-validations (k = 2, 3, …, 10, 15, 20) were carried out with GBLUP. In this way, genomic breeding value estimation could be replicated for different training sample sizes. In addition, all scenarios were run with different subsets of the available SNPs (10,000, 20,000, 30,000 and 42,551 SNPs for Holstein Friesian, and every, every second, every 4th, … every 256th SNP for Braunvieh) in order to capture the influence of marker density.
The maximum-likelihood approach was applied to obtain the best possible estimate of Me for the two available data sets. The highest likelihood was achieved when a modified form of the deterministic formula of Daetwyler et al. (2010, Genetics 185:1021-1031) was used as the basis for modelling the expected accuracy. The most likely values of Me when all available markers were used were 1,241 (412) and 1,046 (197) for the traits somatic cell count and milk yield in Holstein Friesian (Braunvieh). The Me values for Braunvieh and Holstein Friesian differed markedly, whereas Ne was very similar for the two populations (calculated from the pedigree or from the structure of linkage disequilibrium). The estimates of Me varied between traits within populations and across populations with similar population structures. This shows that Me is probably not a parameter that can be calculated from Ne and the length of the genome alone. The modification of the formula of Daetwyler et al. (2010) consisted of adding a weighting factor that takes into account that the maximum accuracy at a given marker density need not be 1 even with an infinitely large training sample. This is based on the assumption that the available SNPs cannot capture the entire genetic variance. This weighting factor was also determined empirically. The squared values, i.e. the percentage of genetic variance explained, ranged from 76% to 82% for 10,000 to 42,551 SNPs in Holstein Friesian and from 63% to 75% for 2,451 to 627,306 SNPs in Braunvieh. There was a linear relationship between the natural logarithm of marker density and the weighting factor up to a population-specific limit of marker density (~20,000 SNPs in Braunvieh).
Above this limit there was a plateau, which means that adding further markers no longer changes the proportion of genetic variance explained.
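
The abstract does not spell out the formula, so the following is only a sketch under stated assumptions. It uses the commonly cited form of the Daetwyler et al. (2010) expectation, r = sqrt(N h² / (N h² + Me)), multiplied by a weighting factor w as the abstract describes. The function name and the heritability value are hypothetical; Me = 1,241 and w² = 0.82 are taken from the figures reported above for Holstein Friesian.

```python
import math


def expected_accuracy(n_train, h2, me, w=1.0):
    """Expected accuracy of genomic prediction (illustrative sketch).

    Assumed form: r = w * sqrt(N*h2 / (N*h2 + Me)), where N is the
    training sample size, h2 the heritability, Me the number of
    independently segregating chromosome segments and w a weighting
    factor capping the accuracy reachable at a given marker density.
    """
    return w * math.sqrt(n_train * h2 / (n_train * h2 + me))


ME_HOLSTEIN = 1241        # Me reported in the abstract (somatic cell count)
W = math.sqrt(0.82)       # w^2 = 0.82, upper end of the reported range
H2_ASSUMED = 0.5          # hypothetical heritability, not from the abstract

for n in (800, 2200, 5698):
    r = expected_accuracy(n, H2_ASSUMED, ME_HOLSTEIN, W)
    print(f"N = {n:5d}: expected accuracy = {r:.3f}")
```

With this form, the accuracy rises with training sample size but approaches w rather than 1, which is the behaviour the weighting factor is meant to capture.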

Georg-August-University Göttingen