EXPLORING ANCIENT HYBRIDIZATION EVENT USING MACHINE LEARNING

Abstract

Hybridization is an important mechanism in evolution. It can be detected by examining the distribution of synonymous substitutions (Ks) in a genome. Traditional methods for examining these Ks plots, including visual inspection and univariate mixture models, can be difficult to use. Instead, I attempt to create a machine learning algorithm that examines Ks plots for evidence of hybridization and whole genome duplication (WGD). I trained and tested four machine learning classifiers: Support Vector Classification (SVC), Linear Support Vector Classification (Linear SVC), Stochastic Gradient Descent (SGD), and Gaussian Naïve Bayes (Naïve Bayes). I found that SVC was the most accurate classifier and that its accuracy increased with more samples and larger bin sizes. Refining this work will provide a framework with which to make further inferences about hybridizations.
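
The comparison of the four classifiers described above can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy, which the abstract does not name as the tools used. The training data here are synthetic: the exponential background, the Gaussian peak, and every parameter value (`n_pairs`, `n_bins`, `ks_max`, the peak position and width) are hypothetical choices made only for illustration, and the helper names are invented for this example. It is not the author's pipeline or data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)


def simulate_ks_histogram(has_peak, n_pairs=2000, n_bins=50, ks_max=3.0):
    """Return a normalized Ks histogram for one hypothetical genome."""
    # Assumed background: exponentially decaying counts of duplicate pairs.
    n_background = int(n_pairs * 0.7) if has_peak else n_pairs
    ks = rng.exponential(scale=0.5, size=n_background)
    if has_peak:
        # Assumed signal: a Gaussian peak standing in for a WGD or
        # hybridization event at a random position on the Ks axis.
        center = rng.uniform(0.5, 2.0)
        peak = rng.normal(loc=center, scale=0.15, size=n_pairs - n_background)
        ks = np.concatenate([ks, peak])
    ks = ks[(ks >= 0.0) & (ks <= ks_max)]
    hist, _ = np.histogram(ks, bins=n_bins, range=(0.0, ks_max))
    return hist / hist.sum()


def make_dataset(n_samples=400, n_bins=50):
    """Build a feature matrix of binned Ks plots and 0/1 labels (1 = peak)."""
    labels = rng.integers(0, 2, size=n_samples)
    features = np.array(
        [simulate_ks_histogram(bool(y), n_bins=n_bins) for y in labels]
    )
    return features, labels


def compare_classifiers(n_samples=400, n_bins=50):
    """Train the four classifiers and return their held-out accuracy."""
    X, y = make_dataset(n_samples, n_bins)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    classifiers = {
        "SVC": SVC(),
        "Linear SVC": LinearSVC(max_iter=10000),
        "SGD": SGDClassifier(random_state=0),
        "Naive Bayes": GaussianNB(),
    }
    return {
        name: clf.fit(X_train, y_train).score(X_test, y_test)
        for name, clf in classifiers.items()
    }


if __name__ == "__main__":
    for name, accuracy in compare_classifiers().items():
        print(f"{name}: {accuracy:.2f}")
```

Calling `compare_classifiers` with different `n_samples` and `n_bins` values is one way to probe how accuracy responds to sample count and bin size; for a fixed Ks range, fewer bins correspond to larger bin sizes. Results on this toy data say nothing about the accuracies reported in the work itself.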
