Improvement on KNN using genetic algorithm and combined feature extraction to identify COVID-19 sufferers based on CT scan image

Author: Nugraha Arie Sapta
Nugroho Radityo Adi
Rahayu Fenny Winda
Rasyid Aylwin Al
Publication venue: 'Universitas Ahmad Dahlan'
Publication date: 01/10/2021

Coronavirus disease 2019 (COVID-19) has spread throughout the world. The disease is usually detected with the reverse transcriptase polymerase chain reaction (RT-PCR) swab test. However, limited resources have been an obstacle to mass testing. Computerized tomography (CT) scan images offer one alternative for detecting sufferers. Researchers have used this technique, but mostly with classifiers that require substantial computing resources, such as convolutional neural networks (CNN). In this study, we propose classifying CT scan images with a more efficient classifier, k-nearest neighbors (KNN), applied to images processed with a combination of three feature extraction methods: Haralick, histogram, and local binary pattern. A genetic algorithm is also used for feature selection. The results showed that the proposed method improved KNN performance, with the best accuracy of 93.30% for the combination of Haralick and local binary pattern feature extraction, and the best area under the curve (AUC), 0.948, for the combination of Haralick, histogram, and local binary pattern. The best accuracy of our models also outperforms CNN by a 4.3% margin.
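The combination of KNN with genetic-algorithm feature selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the leave-one-out fitness, the population size, the mutation rate and the single-point crossover are all assumptions chosen for brevity, and the feature vectors are assumed to have already been extracted (for example by concatenating Haralick, histogram and local binary pattern descriptors).

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)


def loo_accuracy(X, y, mask, k=3):
    """Leave-one-out KNN accuracy using only the features where mask is 1."""
    if not any(mask):
        return 0.0  # an empty feature set cannot classify anything
    Xm = [[v for v, m in zip(row, mask) if m] for row in X]
    hits = 0
    for i in range(len(Xm)):
        rest_X = Xm[:i] + Xm[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, Xm[i], k) == y[i]
    return hits / len(Xm)


def ga_select(X, y, pop_size=12, generations=15, mutation_rate=0.1, seed=0):
    """Search for a binary feature mask with a small elitist genetic algorithm."""
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        return loo_accuracy(X, y, mask)

    # Start from the all-features mask plus random masks, so the result is
    # never worse than plain KNN on every feature (assumed design choice).
    pop = [[1] * n] + [
        [rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # elitism: the best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability mutation_rate.
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

Calling `ga_select(X, y)` on a list of feature vectors `X` and labels `y` returns a 0/1 mask; `loo_accuracy(X, y, mask)` then scores it. Because the best half of each generation is kept, the returned mask scores at least as well as the all-features baseline under this fitness.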

Journal of Education and Learning (EduLearn)

TELKOMNIKA (Telecommunication Computing Electronics and Control)

UAD Journal Management System

Machine learning and data-parallel processing for viral metagenomics

Author: Bzhalava Zurab
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 03/04/2020

More than 2 million cancer cases around the world each year are caused by viruses. In addition, there are epidemiological indications that other cancer-associated viruses may also exist. However, the identification of highly divergent and yet unknown viruses in human biospecimens is one of the biggest challenges in bioinformatics. Modern-day Next Generation Sequencing (NGS) technologies can be used to directly sequence biospecimens from clinical cohorts with unprecedented speed and depth. These technologies are able to generate billions of bases at rapidly decreasing cost, but current bioinformatics tools cannot process these massive datasets efficiently. Thus, the objective of this thesis was to facilitate both the detection of highly divergent viruses among generated sequences and the large-scale analysis of human metagenomic datasets. To re-analyze human sample-derived sequences that were classified as being of “unknown” origin by conventional alignment-based methods, we used a methodology based on profile Hidden Markov Models (HMM), which can capture evolutionary changes by using multiple sequence alignments. We thus identified 510 sequences that were classified as distantly related to viruses. Many of these sequences were homologs to large viruses such as Herpesviridae and Mimiviridae, but some of them were also related to small circular viruses such as Circoviridae. We found that bioinformatics analysis using viral profile HMM is capable of extending the classification of previously unknown sequences and consequently the detection of viruses in biospecimens from humans. Different organisms use synonymous codons differently to encode the same amino acids. To investigate whether codon usage bias could predict the presence of virus in metagenomic sequencing data originating from human samples, we trained Random Forest and Artificial Neural Networks based on Relative Synonymous Codon Usage (RSCU) frequency.
Our analysis showed that machine learning techniques based on RSCU could identify putative viral sequences with an area under the ROC curve of 0.79 and provide important information for taxonomic classification. For identification of viral genomes among raw metagenomic sequences, we developed the tool ViraMiner, a deep learning-based method which uses Convolutional Neural Networks with two convolutional branches. Using 300 base-pair length sequences, ViraMiner achieved an area under the ROC curve of 0.923, a considerable improvement over previous machine learning methods for virus sequence classification. The proposed architecture is, to the best of our knowledge, the first deep learning tool which can detect viral genomes in raw metagenomic sequences originating from a variety of human samples. To enable large-scale analysis of massive metagenomic sequencing data, we used Apache Hadoop and Apache Spark to develop ViraPipe, a scalable parallel bioinformatics pipeline for viral metagenomics. ViraPipe (executed on 23 nodes) was 11 times faster in the metagenome analysis than the sequential pipeline (executed on a single node). The new distributed workflow contains several standard bioinformatics tools and can scale to terabytes of data by accessing more computing power from the nodes. To analyze terabytes of RNA-seq data originating from head and neck squamous cell carcinoma samples, we used our parallel bioinformatics pipeline ViraPipe and the most recent version of the HPV sequence database. We detected transcription of HPV viral oncogenes in 92/500 cancers. HPV 16 was the most important HPV type, followed by HPV 33 as the second most common infection. If these cancers are indeed caused by HPV, we estimated that vaccination might prevent about 36 000 head and neck cancer cases in the United States every year.
In conclusion, the work in this thesis improves the prospects for biomedical researchers to classify the sequence contents of ultra-deep datasets, conduct large-scale analysis of metagenome studies, and detect the presence of viral genomes in human biospecimens. Hopefully, this work will contribute to our understanding of the biodiversity of viruses in humans, which in turn can help in exploring infectious causes of human disease.
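The Relative Synonymous Codon Usage (RSCU) feature mentioned in this abstract can be sketched as follows. The sketch uses the common definition (observed codon count divided by the count expected if all synonymous codons for that amino acid were used equally). It assumes the standard genetic code, reading frame 0, and exclusion of stop codons; the thesis itself may have made different choices, and the function name `rscu` is illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base slowest).
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def rscu(seq):
    """Return RSCU for each of the 61 sense codons of a DNA sequence.

    RSCU(codon) = count(codon) * n_synonyms / total count for that amino acid.
    Codons of amino acids absent from the sequence get 0.0 (assumed convention).
    """
    seq = seq.upper()
    # Count non-overlapping codons in frame 0; codons containing letters
    # outside TCAG are counted but never looked up, so they are ignored.
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

    # Group sense codons by the amino acid they encode.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values
```

The resulting 61-value dictionary can be flattened into a fixed-length feature vector per sequence, which is the kind of input a Random Forest or neural network classifier would take.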

Publications from Karolinska Institutet

Bioinformatics Methods For Studying Intra-Host and Inter-Host Evolution Of Highly Mutable Viruses

Author: Icer Pelin Burcak
Publication venue: ScholarWorks @ Georgia State University
Publication date: 04/05/2021

Reproducibility and robustness are two important factors in assessing the reliability of genomic tools used in bioinformatics analysis. Assessing tools against these criteria requires repeating experiments across lab facilities, which is usually costly and time consuming. In this study we propose methods that generate computational replicates, allowing the reproducibility of genomic tools to be assessed. We analyzed three groups of genomic tools: DNA-seq read alignment tools, structural variant (SV) detection tools and RNA-seq gene expression quantification tools. We tested these tools with different technical replicate data. We observed that some tools were affected by the technical replicate data while others remained robust. We also observed the importance of the choice of read alignment tool for SV detection. On the other hand, we found that the RNA-seq quantification tools we chose (Kallisto and Salmon) were not affected by the shuffled data but were affected by reverse complement data. Using these findings, our proposed method may help biomedical communities to advise on the robustness and reproducibility of genomic tools and to choose the most appropriate tools for their needs. Furthermore, this study will give genomic tool developers insight into the importance of a good balance between technical improvements and reliable results.
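The two kinds of computational replicate named in this abstract, shuffled reads and reverse-complemented reads, could be generated along the following lines. This is a sketch under assumptions: reads are plain strings rather than FASTQ records, and the function names and the fixed seed are illustrative, not taken from the study.

```python
import random

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(read):
    """Complement every base, then reverse the read."""
    return read.translate(_COMPLEMENT)[::-1]


def shuffled_replicate(reads, seed=0):
    """Return the same reads in a different order (the input is not modified)."""
    rng = random.Random(seed)
    replicate = list(reads)
    rng.shuffle(replicate)
    return replicate


def reverse_complement_replicate(reads):
    """Return every read replaced by its reverse complement, order preserved."""
    return [reverse_complement(read) for read in reads]
```

Both replicates carry the same biological information as the original reads, so a tool whose output changes between them is showing sensitivity to read order or strand rather than to the data.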

ScholarWorks @ Georgia State University

Computational Methods for the Analysis of Genomic Data and Biological Processes

Author
Publication venue: 'MDPI AG'
Publication date: 01/05/2021

In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-performance sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out an analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality

Directory of Open Access Books (DOAB)

Predicting HIV Status Using Neural Networks and Demographic Factors

Author: Tim Taryn Nicole Ho
Publication venue
Publication date: 15/02/2007

Student Number: 0006036T - MSc(Eng) project report - School of Electrical and Information Engineering - Faculty of Engineering and the Built Environment

Demographic and medical history information obtained from annual South African antenatal surveys is used to estimate the risk of acquiring HIV. The estimation system consists of a classifier: a neural network trained to perform binary classification, using supervised learning with the survey data. The survey information contains discrete variables such as age, gravidity and parity, as well as the quantitative variables race and location, making up the input to the neural network. HIV status is the output. A multilayer perceptron with a logistic function is trained with a cross entropy error function, providing a probabilistic interpretation of the output. Predictive and classification performance is measured, and the sensitivity and specificity are illustrated on the Receiver Operating Characteristic. An auto-associative neural network is trained on complete datasets, and when presented with partial data, global optimisation methods are used to approximate the missing entries. The effect of the imputed data on the network prediction is investigated.
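The link between a logistic output, the cross entropy error and the sensitivity and specificity plotted on a ROC curve can be made concrete with a short sketch. It covers only the output-side arithmetic, not the network training in the report; the function names, the clipping constant `eps` and the 0.5 threshold are assumptions.

```python
import math


def logistic(z):
    """Logistic (sigmoid) activation: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def sensitivity_specificity(y_true, p_pred, threshold=0.5):
    """One ROC operating point; assumes both classes occur in y_true."""
    tp = sum(1 for y, p in zip(y_true, p_pred) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, p_pred) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, p_pred) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, p_pred) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)
```

Sweeping `threshold` from 0 to 1 and plotting sensitivity against 1 - specificity traces the ROC curve; minimising the cross entropy is what lets the logistic output be read as a probability.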

Wits Institutional Repository on DSPACE

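Agrupamiento, predicción y clasificación ordinal para series temporales utilizando técnicas de machine learning: aplicaciones [Clustering, prediction and ordinal classification for time series using machine learning techniques: applications]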

Author: Guijo Rubio David
Publication venue: Universidad de Córdoba, UCOPress
Publication date: 01/01/2021

In recent years, a growing number of fields have improved their standard processes by using machine learning (ML) techniques. The main reason is that the vast amount of data generated by these processes is difficult for humans to process. The development of automatic methods to process these data and extract relevant information from them is therefore very necessary, given that these approaches could increase the economic benefit of enterprises or reduce the workload of some current jobs. Specifically, in this Thesis, ML approaches are applied to problems concerning time series data. A time series is a special kind of data in which data points are collected chronologically. Time series are present in a wide variety of fields, such as atmospheric events or engineering applications. In addition, depending on the main objective to be satisfied, different tasks are applied to time series in the literature. This Thesis focuses mainly on some of them: clustering, classification, prediction and, in general, analysis. Generally, the amount of data to be processed is huge, which creates the need for methods able to reduce the dimensionality of time series without decreasing the amount of information. In this sense, applying time series segmentation procedures that divide the time series into different subsequences is a good option, given that each segment defines a specific behaviour. Once the different segments are obtained, using statistical features to characterise them is an excellent way to maximise the information of the time series while considerably reducing their dimensionality. In the case of time series clustering, the objective is to find groups of similar time series with the idea of discovering interesting patterns in time series datasets. In this Thesis, we have developed a novel time series clustering technique.
The aim of this proposal is twofold: to reduce the dimensionality as much as possible and to develop a time series clustering approach able to outperform current state-of-the-art techniques. For the first objective, the time series are segmented in order to divide them into subsequences with different behaviours. These segments are then projected into a vector of statistical features, aiming to reduce the dimensionality of the time series. Once this preprocessing step is done, the clustering of the time series is carried out with a significantly lower computational load. This novel approach has been tested on all the time series datasets available in the University of East Anglia and University of California Riverside (UEA/UCR) time series classification (TSC) repository. Regarding time series classification, two main paths can be differentiated. The first is nominal TSC, a well-known field involving a wide variety of proposals and transformations applied to time series. Specifically, one of the most popular transformations is the shapelet transform (ST), which has been widely used in this field. The original method extracts shapelets from the original time series and uses them for classification purposes. Nevertheless, the full enumeration of all possible shapelets is very time consuming. Therefore, in this Thesis, we have developed a hybrid method that starts with the best shapelets extracted by using the original approach with a time constraint and then tunes these shapelets by using a convolutional neural network (CNN) model. The second is time series ordinal classification (TSOC), an unexplored field beginning with this Thesis. Here, we have adapted the original ST to the ordinal classification (OC) paradigm by proposing several shapelet quality measures that take advantage of the ordinal information of the time series. This methodology leads to better results than the state-of-the-art TSC techniques for those ordinal time series datasets.
All these proposals have been tested on all the time series datasets available in the UEA/UCR TSC repository. Time series prediction is based on estimating the next value or values of the time series by considering the previous ones. In this Thesis, several different approaches have been considered depending on the problem to be solved. Firstly, the prediction of low-visibility events produced by fog conditions is carried out by means of hybrid autoregressive models (ARs) combining fixed-size and dynamic windows, which adapt to the dynamics of the time series. Secondly, the prediction of convective cloud formation (a highly imbalanced problem, given that the number of convective cloud events is much lower than that of non-convective situations) is performed in two completely different ways: 1) tackling the problem as a multi-objective classification task by the use of multi-objective evolutionary artificial neural networks (MOEANNs), in which the two conflicting objectives are the accuracy of the minority class and the global accuracy, and 2) tackling the problem from the OC point of view, in which, in order to reduce the degree of imbalance, an oversampling approach is proposed along with the use of OC techniques. Thirdly, the prediction of solar radiation is carried out by means of evolutionary artificial neural networks (EANNs) with different combinations of basis functions in the hidden and output layers. Finally, the last challenging problem is the prediction of energy flux from waves and tides. For this, a multitask EANN has been proposed, aiming to predict the energy flux at several prediction time horizons (from 6h to 48h). All these proposals and techniques have been corroborated and discussed according to physical and atmospheric models. The work developed in this Thesis is supported by 11 JCR-indexed papers in international journals (7 Q1, 3 Q2, 1 Q3), 11 papers in international conferences, and 4 papers in national conferences.
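The preprocessing idea in this abstract, segmenting a time series and projecting each segment onto a few statistical features, can be sketched as below. The abstract does not say which segmentation algorithm or which statistics the thesis uses, so fixed-length segments and the three features (mean, population variance, least-squares slope) are assumptions made purely for illustration.

```python
from statistics import mean, pvariance


def segment(series, length):
    """Split a series into consecutive segments of at most `length` points."""
    return [series[i:i + length] for i in range(0, len(series), length)]


def slope(seg):
    """Least-squares slope of a segment against its time index 0..n-1."""
    n = len(seg)
    if n < 2:
        return 0.0
    t_mean = (n - 1) / 2
    y_mean = mean(seg)
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(seg))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def featurize(series, length):
    """Map a series to a flat vector of (mean, variance, slope) per segment."""
    features = []
    for seg in segment(series, length):
        features.extend([mean(seg), pvariance(seg), slope(seg)])
    return features
```

A series of 1,000 points cut into segments of 50 becomes a vector of 60 numbers, and a clustering algorithm can then run on these short vectors instead of the raw series.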

Repositorio Institucional de la Universidad de Córdoba


A Machine Learning Approach for Identifying Amino Acid Signatures in the HIV Env Gene Predictive of Dementia

Author: Gabuzda Dana Helga
Holman Alexander G.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 22/04/2013

The identification of nucleotide sequence variations in viral pathogens linked to disease and clinical outcomes is important for developing vaccines and therapies. However, identifying these genetic variations in rapidly evolving pathogens adapting to selection pressures unique to each host presents several challenges. Machine learning tools provide new opportunities to address these challenges. In HIV infection, virus replicating within the brain causes HIV-associated dementia (HAD) and milder forms of neurocognitive impairment in 20–30% of patients with unsuppressed viremia. HIV neurotropism is primarily determined by the viral envelope (env) gene. To identify amino acid signatures in the HIV env gene predictive of HAD, we developed a machine learning pipeline using the PART rule-learning algorithm and C4.5 decision tree inducer to train a classifier on a meta-dataset (n = 860 env sequences from 78 patients: 40 HAD, 38 non-HAD). To increase the flexibility and biological relevance of our analysis, we included 4 numeric factors describing amino acid hydrophobicity, polarity, bulkiness, and charge, in addition to amino acid identities. The classifier had 75% predictive accuracy in leave-one-out cross-validation, and identified 5 signatures associated with HAD diagnosis (p<0.05, Fisher’s exact test). These HAD signatures were found in the majority of brain sequences from 8 of 10 HAD patients from an independent cohort. Additionally, 2 HAD signatures were validated against env sequences from CSF of a second independent cohort. This analysis provides insight into viral genetic determinants associated with HAD, and develops novel methods for applying machine learning tools to analyze the genetics of rapidly evolving pathogens
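The abstract reports signature associations tested with Fisher's exact test. A minimal two-sided version for a 2x2 table is sketched below using only the hypergeometric distribution. The function name and the small tolerance used when comparing probabilities are assumptions, and any counts passed to it here are hypothetical rather than taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # The relative tolerance guards against floating-point ties.
    return sum(
        prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9)
    )
```

For a signature analysis, the rows would be HAD versus non-HAD patients and the columns the presence or absence of the signature; a p-value below 0.05 matches the threshold quoted in the abstract.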

Harvard University - DASH

Data based identification and prediction of nonlinear and complex dynamical systems

Author: Grebogi Celso
Lai Ying-Cheng
Wang Wen-Xu
Publication venue: 'Elsevier BV'
Publication date: 27/04/2017

We thank Dr. R. Yang (formerly at ASU), Dr. R.-Q. Su (formerly at ASU), and Mr. Zhesi Shen for their contributions to a number of original papers on which this Review is partly based. This work was supported by ARO under Grant No. W911NF-14-1-0504. W.-X. Wang was also supported by NSFC under Grants No. 61573064 and No. 61074116, as well as by the Fundamental Research Funds for the Central Universities, Beijing Nova Programme. Peer reviewed. Postprint.

arXiv.org e-Print Archive

Aberdeen University Research

Genetic and antigenic characterization of complete genomes of Type 1 Porcine Reproductive and Respiratory Syndrome viruses (PRRSV) isolated in Denmark over a period of 10 years

Author: Charlotte K. Hjulsager
Charlotte S. Kristensen
Klara T. Lauritsen
Lars E. Larsen
Lise K. Kvisgaard
Publication venue: 'Elsevier BV'
Publication date: 01/01/2013

Porcine Reproductive and Respiratory Syndrome (PRRS), caused by the PRRS virus (PRRSV), is considered one of the most devastating swine diseases worldwide. PRRS viruses are divided into two major genotypes, Type 1 and Type 2, with pronounced diversity between and within the genotypes. In Denmark more than 50% of the herds are infected with Type 1 and/or Type 2 PRRSV. The main objective of this study was to examine the genetic diversity and drift of Type 1 viruses in a population with limited introduction of new animals and semen. A total of 43 ORF5 and 42 ORF7 nucleotide sequences were obtained from viruses collected from 2003 to February 2013. Phylogenetic analysis of ORF5 nucleotide sequences showed that the Danish isolates formed two major clusters within subtype 1. The nucleotide identity to the subtype 1 protogenotype Lelystad virus (LV) spanned 84.9–98.8% for ORF5 and 90.7–100% for ORF7. Among the Danish viruses the pairwise nucleotide identities in ORF5 and ORF7 were 81.2–100% and 88.9–100%, respectively. Sequencing of the complete genomes, including the 5′- and 3′-end nucleotides, of 8 Danish PRRSV Type 1 viruses showed that the genome lengths ranged from 14,876 to 15,098 nucleotides. The pairwise nucleotide identity among the Danish viruses was 86.5–97.3% and the identity to LV was 88.7–97.9%. The study strongly indicated that there have been at least two independent introductions of Type 1 PRRSV in Denmark, and analysis of the full genomes revealed a significant drift in several regions of the virus.
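The pairwise nucleotide identities quoted in this abstract are percentages of matching positions between aligned sequences. The sketch below shows one common way to compute them. It assumes the sequences are already aligned to equal length and that columns with a gap in either sequence are excluded; the abstract does not state which convention the study used, and the function names are illustrative.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b, gap="-"):
    """Percent of matching positions between two aligned sequences.

    Columns where either sequence has a gap are skipped (assumed convention).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (x, y)
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x != gap and y != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def pairwise_identities(aligned):
    """Identity for every pair in a {name: aligned_sequence} dictionary."""
    return {
        (a, b): percent_identity(aligned[a], aligned[b])
        for a, b in combinations(sorted(aligned), 2)
    }
```

Taking the minimum and maximum of the values returned by `pairwise_identities` gives an identity range of the kind reported for ORF5 and ORF7.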

Elsevier - Publisher Connector

Crossref

Edinburgh Research Explorer

Online Research Database In Technology