
    Deep learning in classifying cancer subtypes, extracting relevant genes and identifying novel mutations

    Technological advances in high-throughput genomics, such as deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) sequencing, have significantly increased the size and complexity of datasets. In addition, a high number of features (genes) combined with a limited number of samples (patients) introduces a significant amount of noise into the data. Modern deep learning algorithms, in conjunction with high computational power, can handle these problems and can detect and diagnose diseases in a short time with a reduced chance of error. Developing such methods and pipelines is therefore of current research interest. The research first addressed the high dimensionality of the dataset by reducing the number of features using various feature extraction methods, and an adversarial autoencoder (AAE) based feature extraction model was introduced. The performance of the proposed model was then evaluated using classification and the weight matrix. First, the AAE model was tested using twelve different classifiers, and the results show a significant performance improvement in terms of precision and recall. Compared to all other methods, AAE with support vector machine (SVM), decision tree (DT), k-nearest neighbours (KNN), quadratic discriminant analysis (QDA) and XGBoost (XGB) classifiers shows a significant improvement in precision, with scores of 85.96%, 84.41%, 85.74%, 84.27% and 85.47% respectively. Most importantly, AAE provides consistent results in all performance metrics across the twelve classifiers, which makes this feature extraction model classifier independent. Then, by analysing the weight matrix of the AAE, important biomarkers such as OR2T27, OR2A25, OR8B8 and OR6V1 were identified, which belong to the molecular function of olfactory receptor activity. A recent study shows that olfactory receptor genes are highly expressed in breast carcinoma tissues, which validates our result.
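The abstract does not say how the AAE weight matrix was analysed to obtain candidate biomarkers. As a minimal sketch of one common heuristic, and not necessarily the method used in this work, input genes can be ranked by the total absolute weight connecting each gene to the first hidden layer of the encoder. The function name, the gene labels and the toy weights below are all hypothetical.

```python
from typing import List, Sequence


def rank_genes_by_weight(
    weights: Sequence[Sequence[float]],
    gene_names: Sequence[str],
    top_k: int,
) -> List[str]:
    """Rank input genes by the summed absolute weight of their outgoing connections.

    weights[i][j] is assumed to be the weight from input gene i to hidden unit j
    of the first encoder layer (an assumption, not taken from the source).
    """
    if len(weights) != len(gene_names):
        raise ValueError("weights must have one row per gene")
    scores = {
        gene: sum(abs(w) for w in row)
        for gene, row in zip(gene_names, weights)
    }
    # Highest total absolute weight first.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: three genes feeding two hidden units.
toy_genes = ["GENE_A", "GENE_B", "GENE_C"]
toy_weights = [
    [0.10, -0.20],
    [0.90, -0.80],
    [0.00, 0.05],
]
top_genes = rank_genes_by_weight(toy_weights, toy_genes, top_k=2)
```

With real data, the rows would come from the trained encoder's first-layer weights and the names from the expression matrix columns. Genes ranked this way would still need biological validation, for example through functional annotation.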
Later, a pipeline was developed for mutation identification using a methylated DNA dataset. For this, the raw data were analysed and three different variant callers were used to validate the results. The variants common to the callers were then annotated to extract biological information. Next, low-quality genes were filtered out and 22 mutated genes that are responsible for acute myeloid leukemia (AML) were identified. Among them, most of the mutated genes are missense, non-synonymous, and some of them are stop-gain and non-frameshift. The mutation frequencies of the mutated genes were validated using the DriverDB and IntOGen databases. Three genes, ZFHX3, DOK2 and PKHD1, have a mutation frequency of 0.51% in AML, whereas the rest of the mutated genes are novel for leukemia. To summarise, the experiments show that the feature extraction method is not only useful for diagnosing diseases but can also help identify biological markers, which can further assist in developing personalised medicine and selecting therapeutic targets. In addition, the variant analysis pipeline provides novel variant genes, which could enhance the understanding of leukemia and provide direction for further clinical research and drug development.
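The step of keeping only variants reported by all three callers can be illustrated with a small set intersection. This is a sketch under stated assumptions: variants are keyed by chromosome, position, reference allele and alternate allele, and the caller outputs are hypothetical toy lists rather than parsed VCF files. The abstract does not name the callers or the matching criteria actually used.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    """Minimal variant key (assumed): chromosome, 1-based position, ref and alt alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def common_variants(*call_sets: Iterable[Variant]) -> Set[Variant]:
    """Return the variants that appear in every caller's output."""
    sets = [set(calls) for calls in call_sets]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical toy outputs from three variant callers.
caller_a = [Variant("chr1", 100, "A", "G"), Variant("chr2", 200, "C", "T")]
caller_b = [Variant("chr1", 100, "A", "G"), Variant("chr3", 300, "G", "A")]
caller_c = [Variant("chr1", 100, "A", "G"), Variant("chr2", 200, "C", "T")]

consensus = common_variants(caller_a, caller_b, caller_c)
```

Requiring agreement across all callers trades sensitivity for specificity. A less strict variant of this idea would keep calls supported by at least two of the three callers.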