
798 research outputs found

Deep Learning for Genomics: A Concise Overview

Authors: Wang Haohan; Yue Tianwei
Publication date: 08/05/2018

Advancements in genomic research such as high-throughput sequencing techniques have driven modern genomic studies into "big data" disciplines. This data explosion is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning since we are expecting from deep learning a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture, and remark on practical considerations of developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, as well as pointing out potential opportunities and obstacles for future genomics applications.

Comment: Invited chapter for Springer Book: Handbook of Deep Learning Application

arXiv.org e-Print Archive

Improvement in the prediction of the translation initiation site through balancing methods, inclusion of acquired knowledge and addition of features to sequences of mRNA

Authors: de Souza Teixeira Felipe Carvalho; Nobre Cristiane Neri; Ortega José Miguel; Silva Lívia Márcia; Zárate Luis Enrique
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The accurate prediction of the initiation of translation in sequences of mRNA is an important activity for genome annotation. However, obtaining an accurate prediction is not always a simple task and can be modeled as a problem of classification between positive sequences (protein codifiers) and negative sequences (non-codifiers). The problem is highly imbalanced because each molecule of mRNA has a unique translation initiation site and various others that are not initiators. Therefore, this study focuses on the problem from the perspective of balancing classes and we present an undersampling balancing method, M-clus, which is based on clustering. The method also adds features to sequences and improves the performance of the classifier through the inclusion of knowledge obtained by the model, called InAKnow.
Results: Through this methodology, the measures of performance used (accuracy, sensitivity, specificity and adjusted accuracy) are greater than 93% for the Mus musculus and Rattus norvegicus organisms, and varied between 72.97% and 97.43% for the other organisms evaluated: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Nasonia vitripennis. The precision increases significantly by 39% and 22.9% for Mus musculus and Rattus norvegicus, respectively, when the knowledge obtained by the model is included. For the other organisms, the precision increases by between 37.10% and 59.49%. The inclusion of certain features during training, for example, the presence of ATG in the upstream region of the Translation Initiation Site, improves the rate of sensitivity by approximately 7%. Using the M-clus balancing method generates a significant increase in the rate of sensitivity from 51.39% to 91.55% (Mus musculus) and from 47.45% to 88.09% (Rattus norvegicus).
Conclusions: In order to solve the problem of TIS prediction, the results indicate that the methodology proposed in this work is adequate, particularly when using the concept of acquired knowledge which increased the accuracy in all databases evaluated.
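The performance measures named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `tis_metrics` and the toy labels are hypothetical, and "adjusted accuracy" is assumed here to mean the average of sensitivity and specificity, which the abstract itself does not define.

```python
def tis_metrics(y_true, y_pred):
    """Compute binary classification measures from 0/1 labels.

    1 marks a positive (true initiation site), 0 a negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # share of true sites recovered
    specificity = ratio(tn, tn + fp)  # share of non-sites rejected
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": ratio(tp, tp + fp),
        # Assumption: adjusted accuracy = mean of sensitivity and specificity.
        "adjusted_accuracy": (sensitivity + specificity) / 2,
    }


# Toy imbalanced case: one true site among ten candidates,
# scored by a classifier that always predicts "not a site".
y_true = [1] + [0] * 9
y_pred = [0] * 10
metrics = tis_metrics(y_true, y_pred)
```

On this toy input the classifier that never predicts a site still scores high on plain accuracy while its sensitivity is zero, which is why measures such as sensitivity and a class-balanced accuracy are informative for highly imbalanced problems like the one described.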

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The advantage of intergenic regions as genomic features for machine-learning-based host attribution of Salmonella Typhimurium from the USA

Authors: Chalka Antonia; Dallman Tim J; Gally David L; Stevens Mark P; Vohra Prerna
Publication date: 01/10/2023

Salmonella enterica is a taxonomically diverse pathogen with over 2600 serovars associated with a wide variety of animal hosts including humans, other mammals, birds and reptiles. Some serovars are host-specific or host-restricted and cause disease in distinct host species, while others, such as serovar S. Typhimurium (STm), are generalists and have the potential to colonize a wide variety of species. However, even within generalist serovars such as STm it is becoming clear that pathovariants exist that differ in tropism and virulence. Identifying the genetic factors underlying host specificity is complex, but the availability of thousands of genome sequences and advances in machine learning have made it possible to build specific host prediction models to aid outbreak control and predict the human pathogenic potential of isolates from animals and other reservoirs. We have advanced this area by building host-association prediction models trained on a wide range of genomic features and compared them with predictions based on nearest-neighbour phylogeny. SNPs, protein variants (PVs), antimicrobial resistance (AMR) profiles and intergenic regions (IGRs) were extracted from 3883 high-quality STm assemblies collected from humans, swine, bovine and poultry in the USA, and used to construct Random Forest (RF) machine learning models. An additional 244 recent STm assemblies from farm animals were used as a test set for further validation. The models based on PVs and IGRs had the best performance in terms of predicting the host of origin of isolates and outperformed nearest-neighbour phylogenetic host prediction as well as models based on SNPs or AMR data. However, the models did not yield reliable predictions when tested with isolates that were phylogenetically distinct from the training set. The IGR and PV models were often able to differentiate human isolates in clusters where the majority of isolates were from a single animal source. 
Notably, IGRs were the feature with the best performance across multiple models which may be due to IGRs acting as both a representation of their flanking genes, equivalent to PVs, while also capturing genomic regulatory variation, such as altered promoter regions. The IGR and PV models predict that ~45 % of the human infections with STm in the USA originate from bovine, ~40 % from poultry and ~14.5 % from swine, although sequences of isolates from other sources were not used for training. In summary, the research demonstrates a significant gain in accuracy for models with IGRs and PVs as features compared to SNP-based and core genome phylogeny predictions when applied within the existing population structure. This article contains data hosted by Microreact

Utrecht University Repository

Fine-mapping inflammatory bowel disease loci to single-variant resolution

Publication date: 13/07/2017

King's Research Portal

Handling imbalance visualized pattern dataset for yield prediction

Authors: Jusoh Shaidah; Megat Mohamed Noor Megat Norulazmi
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2008

The yield outcome of a non-closed-loop manufacturing process can be predicted by visualizing the historical data pattern generated by the inspection machine, transforming that pattern, and mapping it into a machine learning algorithm for training, so that a prediction model is generated automatically without visual interpretation by a human. However, manufacturing process datasets are highly skewed with respect to bad yield outcomes: the majority class of good yield vastly outnumbers the minority class of bad yield. A comparison of undersampling, oversampling and SMOTE + VDM sampling techniques indicates that combining SMOTE + VDM with an undersampled dataset produces a robust classifier that copes better with different batches of prediction test data. Furthermore, a suitable distance function for SMOTE is needed to improve class recall and minimize overfitting, while a different approach to majority class sampling is required to improve class precision, which suffers from the information loss caused by undersampling.

UUM Repository

A machine learning-based investigation of cloud service attacks

Author: Intisar Al-Mandhari (1257222)
Publication date: 01/01/2019

In this thesis, the security challenges of cloud computing are investigated in the Infrastructure as a Service (IaaS) layer, as security is one of the major concerns related to Cloud services. As IaaS consists of different security terms, the research has been further narrowed down to focus on Network Layer Security. Review of existing research revealed that several types of attacks and threats can affect cloud security. Therefore, there is a need for intrusion defence implementations to protect cloud services. Intrusion Detection (ID) is one of the most effective solutions for reacting to cloud network attacks. [Continues.]

Loughborough University Institutional Repository

Population genomics of a critically endangered data-deficient elasmobranch, the blue skate Dipturus batis

Author: Delaval Aurelien Nicolas
Publication venue: Informa UK Limited
Publication date: 01/01/2021

Doctoral thesis (PhD) - Nord University, 2021

Brage Nord Open Research Archive

Comparative genomics and transcriptomics elucidate virulence mechanisms and host responses in infectious diseases

Author: Sae-Ong Tongta
Publication date: 01/01/2022

The main thematic area of the present thesis is the development and application of bioinformatics pipelines, namely whole-genome sequence (WGS) analysis and transcriptome profile analysis. These pipelines were applied to study the fungal pathogen Aspergillus fumigatus (Manuscripts I, III, and IV) and the early human immune mechanisms activated in response to different types of pathogens (bacteria, fungi, and co-infections) in sepsis patients (Manuscript II). The comparative genomic and transcriptomic analyses applied in my thesis have significantly improved our understanding of fungal pathogenicity as well as the pathogen-specific immune response mechanisms of the human host. Next to a number of novel insights, my work included in this thesis has generated a large number of new hypotheses based on big-data analysis, offering the scientific community the possibility to design exciting new research to confirm them in future experimental studies and bring us closer to actual precision medicine for infectious diseases

Digitale Bibliothek Thüringen