
    Rohlin Distance and the Evolution of Influenza A virus: Weak Attractors and Precursors

    The evolution of the hemagglutinin amino acid sequences of Influenza A virus is studied by a method based on an information metric originally introduced by Rohlin for partitions in abstract probability spaces. This metric does not require any prior functional or syntactic knowledge about the sequences, and it is sensitive to correlated variations in the arrangement of the characters. Its efficiency is improved by algorithmic tools designed to enhance the detection of novelty and to reduce the noise of irrelevant mutations. We focus on USA data from 1993/94 to 2010/2011 for A/H3N2 and from 2006/07 to 2010/2011 for A/H1N1. We show that clustering the distance matrix gives strong evidence of a structure of domains in sequence space, acting as weak attractors for the evolution, in very good agreement with the epidemiological history of the virus. The structure proves very robust to variations of the clustering parameters and extremely coherent when the observation window is restricted. The results suggest an efficient strategy for vaccine forecasting, based on the presence of "precursors" (or "buds") populating the most recent attractor. Comment: 13 pages, 5+4 figures
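    As a rough illustration of the Rohlin metric named in this abstract: for two partitions α and β of the same finite set, the distance is d(α, β) = H(α|β) + H(β|α) = 2H(α∨β) − H(α) − H(β), where H is the Shannon entropy. The sketch below assumes a uniform measure over sequence positions and, purely for illustration, treats each sequence as the partition of its positions by character. The paper's actual partition construction and its noise-reduction tools are not described in the abstract and are not reproduced here; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of the partition induced by equal labels,
    assuming a uniform measure over positions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def rohlin_distance(a, b):
    """Rohlin distance H(a|b) + H(b|a) between two label sequences.

    Assumption: each sequence defines a partition of its positions by
    character identity. This is an illustrative choice, not the paper's.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # The join (common refinement) groups positions by the pair of labels.
    h_joint = entropy(list(zip(a, b)))
    return 2 * h_joint - entropy(a) - entropy(b)


if __name__ == "__main__":
    # Same partition under different labels, then two crossing partitions.
    print(rohlin_distance("AABB", "CCDD"))
    print(rohlin_distance("AABB", "ABAB"))
```

    Because only the grouping of positions matters, relabelling the characters leaves the distance unchanged, which is consistent with the abstract's remark that no functional or syntactic knowledge of the sequences is needed.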

    PhyloMap: an algorithm for visualizing relationships of large sequence data sets and its application to the influenza A virus genome

    Background: Results of phylogenetic analysis are often visualized as phylogenetic trees. Such a tree can typically only include up to a few hundred sequences. When more than a few thousand sequences are to be included, analyzing the phylogenetic relationships among them becomes a challenging task. The recent frequent outbreaks of influenza A viruses have resulted in the rapid accumulation of corresponding genome sequences. Currently, there are more than 7500 influenza A virus genomes in the database. There are no efficient ways of representing this huge data set as a whole, thus preventing a further understanding of the diversity of the influenza A virus genome. Results: Here we present a new algorithm, "PhyloMap", which combines ordination, vector quantization, and phylogenetic tree construction to give an elegant representation of a large sequence data set. The use of PhyloMap on influenza A virus genome sequences reveals the phylogenetic relationships of the internal genes that cannot be seen when only a subset of sequences is analyzed. Conclusions: The application of PhyloMap to influenza A virus genome data shows that it is a robust algorithm for analyzing large sequence data sets. It utilizes the entire data set, minimizes bias, and provides intuitive visualization. PhyloMap is implemented in Java, and the source code is freely available at http://www.biochem.uni-luebeck.de/public/software/phylomap.html

    Prediction of Genomic Signature of NGS Sequences and Comparative Drug-Likeness

    Developing a drug or a specific immunotherapy for a worldwide epidemic illness caused by viruses (the current pandemic) requires comprehensive evaluation and annotation of metagenomic datasets so that nucleotide sequences can be filtered quickly and efficiently. Traditional sequence alignment procedures fall short here because of their reliance on homologous sequences and the space and time complexity of the analysis. This research therefore employs an alignment-free approach that addresses these issues. We propose a distance function that compresses performance metrics to automatically identify the short nucleotide sequences used by SARS coronavirus variants, and with them the critical features in genetic markers and genomic structure. This method provides easy recognition of compressed data by using a set of mathematical and computational tools. We also show that by applying the proposed technique to extremely short regions of nucleotide sequences, we can differentiate SARS-CoV-2 from SARS-CoV-1 viruses. Next, the Lipinski descriptor (rule of 5) was used to predict the drug-likeness of the target protein in SARS-CoV-2. A regression model using random forest was created to validate the machine learning model for computational analysis. The regressor was then compared with other machine learning models using lazypredict, allowing scientists to swiftly and accurately identify and describe SARS coronavirus strains.
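    For readers unfamiliar with the Lipinski rule of 5 mentioned in this abstract, a minimal sketch follows. It assumes the four molecular descriptors have already been computed by some other tool, and it uses the commonly stated thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). The function names and the example values are hypothetical, not taken from the paper, and how the authors handled boundary values is not stated.

```python
def rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count violations of Lipinski's rule of 5 for one compound.

    Inputs are precomputed descriptors; values exactly on a threshold
    are treated as passing, which is a convention chosen for this sketch.
    """
    checks = [
        mol_weight > 500,   # molecular weight in daltons
        logp > 5,           # octanol-water partition coefficient
        h_donors > 5,       # hydrogen-bond donors
        h_acceptors > 10,   # hydrogen-bond acceptors
    ]
    return sum(checks)


def is_drug_like(mol_weight, logp, h_donors, h_acceptors, max_violations=1):
    """A compound is commonly considered drug-like with at most one violation."""
    return rule_of_five_violations(
        mol_weight, logp, h_donors, h_acceptors
    ) <= max_violations


if __name__ == "__main__":
    # Made-up descriptor values, for illustration only.
    print(is_drug_like(350.4, 2.1, 2, 5))
    print(is_drug_like(720.9, 6.3, 7, 12))
```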

    Development of Self-Compressing BLSOM for Comprehensive Analysis of Big Sequence Data

    Quantum Hopfield neural network

    Quantum computing allows for the potential of significant advancements in both the speed and the capacity of widely used machine learning techniques. Here we employ quantum algorithms for the Hopfield network, which can be used for pattern recognition, reconstruction, and optimization as a realization of a content-addressable memory system. We show that an exponentially large network can be stored in a polynomial number of quantum bits by encoding the network into the amplitudes of quantum states. By introducing a classical technique for operating the Hopfield network, we can leverage quantum algorithms to obtain a quantum computational complexity that is logarithmic in the dimension of the data. We also present an application of our method as a genetic sequence recognizer. Comment: 13 pages, 3 figures, final version
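    To make the content-addressable memory idea concrete, here is a small classical Hopfield network. It is not the quantum algorithm of the paper: it uses the textbook Hebbian learning rule with asynchronous sign updates. The two-bit nucleotide encoding and all names are assumptions introduced only for this sketch.

```python
# Hypothetical two-bit encoding of nucleotides as +/-1 spins.
CODE = {"A": (-1, -1), "C": (-1, 1), "G": (1, -1), "T": (1, 1)}
DECODE = {bits: base for base, bits in CODE.items()}


def encode(seq):
    """Flatten a nucleotide string into a list of +/-1 values."""
    return [bit for base in seq for bit in CODE[base]]


def decode(state):
    """Turn a list of +/-1 values back into a nucleotide string."""
    return "".join(
        DECODE[(state[i], state[i + 1])] for i in range(0, len(state), 2)
    )


def train_hebbian(patterns):
    """Hebbian weights w_ij = (1/n) * sum_p x_i x_j, with a zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w


def recall(w, state, sweeps=5):
    """Asynchronous updates: each unit takes the sign of its local field."""
    s = list(state)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if field >= 0 else -1
    return s


if __name__ == "__main__":
    weights = train_hebbian([encode("GATTACA")])
    # Present a copy with one substituted base and let the network settle.
    print(decode(recall(weights, encode("GATCACA"))))
```

    With a single stored pattern, the network pulls the corrupted input back to the stored sequence; storing many patterns in a classical network of this kind is limited by its capacity, which is the kind of constraint the abstract's amplitude encoding is aimed at.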

    Recent advances in inferring viral diversity from high-throughput sequencing data

    Rapidly evolving RNA viruses prevail within a host as a collection of closely related variants, referred to as viral quasispecies. Advances in high-throughput sequencing (HTS) technologies have facilitated the assessment of the genetic diversity of such virus populations at an unprecedented level of detail. However, analysis of HTS data from virus populations is challenging due to short, error-prone reads. In order to account for uncertainties originating from these limitations, several computational and statistical methods have been developed for studying the genetic heterogeneity of virus populations. Here, we review methods for the analysis of HTS reads, including approaches to local diversity estimation and global haplotype reconstruction. Challenges posed by aligning reads, as well as the impact of reference biases on diversity estimates, are also discussed. In addition, we address some of the experimental approaches designed to improve the biological signal-to-noise ratio. In the future, computational methods for the analysis of heterogeneous virus populations are likely to continue being complemented by technological developments. ISSN:0168-170

    Detecting COVID-19 Outbreak with Anomalous Term Frequency

    Many previous studies have aimed to predict the trend of a disease through time series forecasting with machine learning methods. However, real-world data is often noisy, which poses numerous challenges for directly predicting the trend and leads to suboptimal prediction results. Furthermore, real-world data is usually very large, that is, it spans very long time periods. At such scale, trend forecasting becomes intractable even for state-of-the-art forecasting algorithms such as RNN-LSTM. Little research has applied anomaly detection to disease outbreak detection, including for the most recent COVID-19 pandemic. In this research, we therefore propose redefining the problem as outbreak detection, which aims to predict whether a future point is or is not a sign of a large-scale COVID-19 outbreak. By simplifying a complex regression problem into a binary classification problem, the requirements on the learning model may be reduced and the learning performance may therefore be enhanced.

    Building an automated platform for the classification of peptides/proteins using machine learning

    Master's dissertation in Bioinformatics. One of the challenging problems in bioinformatics is to computationally characterize the sequences, structures and functions of proteins. Sequence-derived structural and physicochemical properties of proteins have been used to develop machine learning (ML) models for protein-related problems. However, tools and platforms that calculate features and perform ML with proteins are scarce and limited in effectiveness, user-friendliness and capacity to adapt to different problems. Here, a generic, modular, automated platform is proposed for the classification of proteins based on their physicochemical properties using different ML algorithms. The tool, developed as a Python package, facilitates the major tasks of ML and includes modules to read and alter sequences, calculate protein features, preprocess datasets, execute feature reduction and selection, perform clustering, train and optimize ML models, and make predictions. Because it is modular, the user retains the power to alter the code to fit specific needs. The platform was tested on the prediction of membrane-active anticancer and antimicrobial peptides and was further used to explore viral fusion peptides. Membrane-interacting peptides play a crucial role in several biological processes. Fusion peptides are a subclass found in enveloped viruses that is particularly relevant for the fusion of the viral membrane with the host membrane. Determining which properties characterize fusion peptides and distinguish them from other proteins is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, four different feature extraction techniques and five different feature selection methods (24 datasets tested in total), seven ML models were trained and tested, using 10-fold cross-validation for error estimation and grid search for model selection. The different models, feature sets and feature selection techniques were compared. The best models obtained, with MCC scores between 0.7 and 0.8 and precision between 0.85 and 0.9, were then used to predict the location of a known fusion peptide in a fusion protein sequence from the Dengue virus. Feature importances were also analysed. The models obtained will be useful in future research and also provide biological insight into the distinctive physicochemical characteristics of fusion peptides. This work presents a freely available tool for ML-based protein classification and the first global analysis and prediction of viral fusion peptides using ML, reinforcing the usability and importance of ML in protein classification problems.

    Fast computational reconstruction of SARS-CoV-2 phylogenetic trees.

    NCD (Normalized Compression Distance) is a general-purpose, compression-based method for building phylogenies. We examine this method and use it to build phylogenetic trees for COVID-19, mitochondrial DNA and the Monkeypox virus. We explore the method with large datasets from the most relevant data sources, GISAID and Nextstrain, with particular emphasis on the latter and the software it provides for pathogen analysis. Phylogenetic trees of up to 500 sequences were built in order to check whether the method is suitable for large phylogenies. Comparison with current methods based on sequence alignment showed that those methods have advanced much further and achieve much shorter running times than NCD, probably because multiple alignment is fast when a reference sequence exists. Qualitatively, however, the results were fairly good and suitable for analysis.
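    The NCD named in this abstract has a compact standard definition: NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the compressed size in bytes. The sketch below uses Python's `zlib` as the compressor and synthetic random sequences as input. Both are assumptions made for illustration; the abstract does not say which compressor or preprocessing the authors used.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(rng, length):
    """Synthetic sequence, used only to exercise the function."""
    return "".join(rng.choice("ACGT") for _ in range(length)).encode()


rng = random.Random(0)
seq_a = random_dna(rng, 2000)
seq_b = random_dna(rng, 2000)          # unrelated to seq_a
mutated = bytearray(seq_a)
for pos in rng.sample(range(len(mutated)), 20):
    mutated[pos] = rng.choice(b"ACGT")  # a few point substitutions
seq_m = bytes(mutated)

if __name__ == "__main__":
    print("identical:", ncd(seq_a, seq_a))
    print("mutated:  ", ncd(seq_a, seq_m))
    print("unrelated:", ncd(seq_a, seq_b))
```

    A pairwise NCD matrix computed this way can be handed to any distance-based tree builder. Note that zlib's 32 KB window limits how much of a long genome it can match against, so a compressor with a larger window is usually a better fit for full viral genomes.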