SweepNet:A Lightweight CNN Architecture for the Classification of Adaptive Genomic Regions

Abstract

The accurate identification of positive selection in genomes represents a challenge in the field of population genomics. Several recent approaches have cast this problem as an image classification task and employed Convolutional Neural Networks (CNNs). However, limited efforts have been placed on discovering a practical CNN architecture that can classify images visualizing raw genomic data in the presence of population bottlenecks, migration, and recombination hotspots, factors that typically confound the identification and localization of adaptive genomic regions. In this work, we present SweepNet, a new CNN architecture that resulted from a thorough hyper-parameter-based architecture exploration process. SweepNet has a higher training efficiency than existing CNNs and requires considerably less epochs to achieve high validation accuracy. Furthermore, it performs consistently better in the presence of confounding factors, generating models with higher validation accuracy and lower top-1 error rate for distinguishing between neutrality and a selective sweep. Unlike existing network architectures, the number of trainable parameters of SweepNet remains constant irrespective of the sample size and number of Single Nucleotide Polymorphisms, which reduces the risk of overfitting and leads to more efficient training for large datasets. Our SweepNet implementation is available for download at: https://github.com/Zhaohq96/SweepNet

    Similar works