
    Residue-residue contact driven protein structure prediction using optimization and machine learning

    Significant improvements in the prediction of protein residue-residue contacts have been observed in recent years. These contacts, predicted using a variety of coevolution-based and machine learning methods, are the key contributors to the recent progress in ab initio protein structure prediction, as demonstrated in the recent CASP experiments. Continued development of new methods to reliably predict contact maps, tools to assess the utility of predicted contacts, and methods to construct protein tertiary structures from predicted contacts is essential to further improve ab initio structure prediction. In this dissertation, three contributions are described: (a) DNCON2, a two-level convolutional neural network-based method for protein contact prediction, (b) ConEVA, a toolkit for contact assessment and evaluation, and (c) CONFOLD, a method of building protein 3D structures from predicted contacts and secondary structures. Additional related contributions on protein contact prediction and structure reconstruction are also described. DNCON2 and CONFOLD demonstrate state-of-the-art performance on contact prediction and structure reconstruction from scratch. All three methods are freely available to the scientific community as software or web servers. Includes bibliographical references.

    Implementing a webserver for managing and detecting viral fusion proteins

    Master's dissertation in Bioinformatics. Viral fusion proteins are essential to allow enveloped viruses (such as Influenza, Dengue, HIV and SARS-CoV-2) to enter their hosts’ cells, in a mechanism referred to as membrane fusion. This makes these proteins (and especially their fusion peptides, the component of the protein that can insert into the host’s membrane by itself) interesting potential therapeutic targets for preventing or treating some well-known diseases. However, there is no centralized data repository containing all the relevant information regarding viral fusion proteins. With that in mind, the main purpose of this work is to develop a CRUD (Create, Read, Update and Delete) web server that will allow researchers to find all the necessary data regarding viral fusion proteins through an easy-to-use web interface. The web application will also contain other bioinformatics functionalities, such as sequence alignment (through BLAST, Clustal and Weblogo) to allow researchers to retrieve key pieces of information regarding a fusion protein, as well as machine learning models capable of predicting the location of fusion peptides inside the viral fusion protein sequence. The server was implemented with Django as its back-end, retrieving the data from a MySQL database, and Angular as its front-end. The main result of the work is, therefore, a working web server, with a web interface available online at https://viralfp.bio.di.uminho.pt/. The web application allows users to explore the gathered data related to viral fusion proteins in a user-friendly way. This tool contains all the proposed functionalities and machine learning models. As expected in an application’s development, several aspects require future work to improve the usefulness of this tool to the scientific community.
    First and foremost, this dissertation is funded by COMPETE 2020, Portugal 2020 and FCT - Fundação para a Ciência e a Tecnologia, under the project "Using computational and experimental methods to provide a global characterization of viral fusion peptides", through the funding program "02/SAICT/2017 - Projetos de Investigação Científica e Desenvolvimento Tecnologico (IC&DT)", with the reference "NORTE-01-0145-FEDER-028200", who I would like to thank for their trust

    Bacteriophage-host determinants: identification of bacteriophage receptors through machine learning techniques

    Master's dissertation in Bioinformatics. Bacterial resistance to antibiotics is becoming a major concern. Several reports indicate that bacteria are developing resistance mechanisms to various antibiotics. Moreover, the processes involved in the development of new antibiotics are lengthy and expensive. Therefore, an alternative to antibiotics is needed. One promising alternative is bacteriophages, viruses that specifically infect bacteria, causing their lysis. Hence, it would be interesting to discover which bacteria a specific phage recognizes. Phage specificity is determined by the bacterial surface receptors that a phage can recognize: phages use tail spikes/fibres as receptor-binding proteins to detect carbohydrates or proteins on the bacterial surface. Studying interactions between phage tail spikes/fibres and bacterial receptors can allow the identification of interaction pairs. Machine learning algorithms can be used to find patterns in these interactions and build models to make predictions. In this work, PhageHost, a tool that predicts hosts at the strain level for three species (E. coli, K. pneumoniae and A. baumannii), was developed. Data were extracted from GenBank, retrieving general, protein and coding information for both phages and bacteria. The protein data were used to build an important phage protein function database that allowed the classification of protein functions, namely phage tail spikes/fibres. Finally, several machine learning models with relevant protein features were created to predict phage-host strain interactions. Compared with previous work, these models show better predictive power and the ability to perform strain-level predictions. For the best model, a Matthews correlation coefficient (MCC) of 96.6% and an F-score of 98.3% were obtained. The best predictive models were implemented online, on a server under the name PhageHost (https://galaxy.bio.di.uminho.pt).
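The phage-host entry above reports its best model in terms of a Matthews correlation coefficient and an F-score. As a reminder of what those two numbers measure, here is a minimal Python sketch that computes both from binary confusion-matrix counts. It assumes the F-score is the balanced F1 variant, which the abstract does not specify, and the counts in the usage lines are hypothetical, not taken from the dissertation.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f_score(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall (assumed variant)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical counts for illustration only.
tp, tn, fp, fn = 90, 85, 5, 10
print(round(mcc(tp, tn, fp, fn), 3), round(f_score(tp, fp, fn), 3))
```

MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes differ greatly in size, whereas F1 ignores true negatives.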

    Computational miRNA Target Prediction in Animals

    miRNAs are a class of small RNA molecules about 22 nucleotides long that regulate gene expression at the post-transcriptional level. The discovery of the second miRNA 10 years ago was as much a surprise in its own way as the very structure of DNA discovered a half century earlier[1]. How could these small molecules regulate so many genes? During the past decade the complex cascade of regulation has been investigated and reported in detail[2]. The regions of the genome called untranslated regions, or UTRs, proved true to their name: they were indeed untranslated, but certainly not unimportant, since they act as the origin and often the destination of miRNAs. miRBase[3] contains 1048 human miRNAs, with more undoubtedly on the way. But experimental identification of miRNA targets has proven dreadfully slow and difficult. Instead, scientists have turned to computational target prediction programs as the preferred method to quickly identify potential miRNA targets. Current prediction tools have produced a huge number of potential target sites, but determining whether they are correct, or which algorithms produce the most reliable predictions, remains an open question. This project examines one type of algorithm, a probabilistic model called a profile Hidden Markov Model (pHMM), and uses it to predict miRNA target sites. HMMs are known to be very effective in pattern recognition and have been successfully applied to various bioinformatic applications, such as gene finding, multiple sequence alignment and protein family classification[4]. We proposed to build a pHMM from known miRNA interactions and use this model to identify potential miRNA target sites in UTRs by abstracting the Watson-Crick base pairs into meta codes intended to more naturally describe important relationships in RNA folding. High-quality positive training data came from the best curated mRNA:miRNA databases we could find, while negative training data was generated using random sequences. The purpose of this project was to demonstrate the flexibility of the pHMM architecture to process many kinds of interesting data and, by doing so, improve miRNA target site prediction.
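The miRNA entry above mentions abstracting base pairs into meta codes before training the pHMM, but the abstract does not give the code alphabet. The Python sketch below illustrates the general idea with a purely hypothetical three-letter alphabet (M for a Watson-Crick pair, W for a G:U wobble, X for a mismatch) and an ungapped duplex; the function name and the codes are assumptions made for illustration, not the project's actual encoding.

```python
# Hypothetical meta-code alphabet; the project's real codes are not given in the abstract.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

def encode_duplex(mirna, target_site):
    """Encode an ungapped miRNA:target duplex as a string of meta codes.

    Both sequences are RNA strings written 5'->3' and must have equal length.
    The target is reversed so that each miRNA base faces its antiparallel partner.
    """
    if len(mirna) != len(target_site):
        raise ValueError("sequences must have equal length (no gaps in this sketch)")
    codes = []
    for m, t in zip(mirna.upper(), reversed(target_site.upper())):
        if (m, t) in WATSON_CRICK:
            codes.append("M")  # canonical Watson-Crick pair
        elif (m, t) in WOBBLE:
            codes.append("W")  # G:U wobble pair
        else:
            codes.append("X")  # mismatch
    return "".join(codes)

# A perfectly complementary 8-nt site encodes as all matches.
print(encode_duplex("UGAGGUAG", "CUACCUCA"))
```

Strings of this kind, rather than raw nucleotides, would then serve as the observation sequences for a profile HMM, so the model learns pairing patterns instead of specific bases.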

    Template Based Modeling and Structural Refinement of Protein-Protein Interactions.

    Determining protein structures from sequence is a fundamental problem in molecular biology, as protein structure is essential to understanding protein function. In this study, I developed one of the first fully automated pipelines for template-based quaternary structure prediction starting from sequence. Two critical steps for template-based modeling are identifying the correct homologous structures by threading, which generates sequence-to-structure alignments, and refining the initial threading template coordinates closer to the native conformation. I developed SPRING (single-chain-based prediction of interactions and geometries), a monomer threading to dimer template mapping program, which was compared to the dimer co-threading program, COTH, using 1838 non-homologous target complex structures. SPRING’s similarity score outperformed COTH in the first-place ranking of templates, correctly identifying 798 and 527 interfaces respectively. More importantly, the results were found to be complementary, and the programs could be combined in a consensus-based threading program showing a 5.1% improvement compared to SPRING. Template-based modeling requires a structural analog to be present in the PDB. A full search of the PDB, using threading and structural alignment, revealed that only 48.7% of the PDB has a suitable template, whereas only 39.4% of the PDB has templates that can be identified by threading. To circumvent this, I included intramolecular domain-domain interfaces in the PDB library to boost template recognition of protein dimers; merging the two classes of interfaces improved recognition of heterodimers by 40% using benchmark settings. Next, the template-based assembly of protein complexes pipeline, TACOS, was created. The pipeline combines threading templates and domain knowledge from the PDB into a knowledge-based energy score. The energy score is integrated into a Monte Carlo sampling simulation that drives the initial template closer to the native topology. The full pipeline was benchmarked using 350 non-homologous structures and compared to two state-of-the-art programs for dimeric structure prediction: ZDOCK and MODELLER. On average, the global and interface structures of TACOS models are of better quality than those of the models generated by MODELLER and ZDOCK.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/135847/1/bgovi_1.pd

    Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics

    In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalanced data learning is both important and challenging in many real applications. Dealing with a minority class normally needs new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation. We propose a new ensemble learning framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reverse data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalanced learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to the state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle: active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting the DECIDL framework is very robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem: protein methylation prediction. Extensive computational results show that the DECIDL method performs very well on the imbalanced data mining task. Importantly, the experimental results have confirmed our new contributions on this particular data learning problem.
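Under-bagging is one of the baseline ensemble methods named in the entry above. The Python sketch below shows one common variant of it, under stated assumptions: binary labels, a bootstrap sample of the minority class in each bag, and an equally sized random subset of the majority class. It is not the DECIDL framework, and the function names and the toy nearest-mean base learner are hypothetical, chosen only to keep the example self-contained.

```python
import random
from collections import Counter

def under_bagging_fit(X, y, fit_base, n_estimators=10, seed=0):
    """Train one base model per class-balanced bag (binary labels assumed).

    fit_base(X_bag, y_bag) must return a callable model(x) -> label.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    models = []
    for _ in range(n_estimators):
        # Bootstrap the minority class; undersample the majority to the same size.
        bag = rng.choices(min_idx, k=len(min_idx)) + rng.sample(maj_idx, len(min_idx))
        models.append(fit_base([X[i] for i in bag], [y[i] for i in bag]))
    return models

def predict_vote(models, x):
    """Majority vote across the ensemble."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

def fit_nearest_mean(X_bag, y_bag):
    """Toy base learner for 1-D inputs: predict the class with the nearest mean."""
    means = {}
    for label in set(y_bag):
        values = [x for x, l in zip(X_bag, y_bag) if l == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

# Toy data: 95 majority points near 0 and 5 minority points near 10.
X = [i * 0.01 for i in range(95)] + [10.0, 10.5, 11.0, 9.5, 10.2]
y = [0] * 95 + [1] * 5
models = under_bagging_fit(X, y, fit_nearest_mean)
print(predict_vote(models, 9.0), predict_vote(models, 0.3))
```

Because every bag is balanced, each base learner sees the minority class as often as the majority class, and the vote across bags recovers some of the majority-class information discarded by any single undersampled bag.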

    Awjedni: A Reverse-Image-Search Application

    The abundance of photos on the internet, along with smartphones that can implement computer vision technologies, allows for a unique way to browse the web. These technologies have potential uses in many widely accessible and globally available reverse-image search applications. One of these applications is the use of reverse-image search to help people find items they are interested in but cannot name. This is where Awjedni was born. Awjedni is a reverse-image search application compatible with iOS and Android smartphones, built to provide an efficient way to search millions of products on the internet using images only. Awjedni utilizes computer vision technology, implementing multiple libraries and frameworks to process images, recognize objects, and crawl the web. Users simply upload or take a photo of a desired item, and the application returns visually similar items and direct links to the websites that sell them.

    ChimeRScope: a novel alignment-free algorithm for fusion gene prediction using paired-end short reads

    Fusion genes are those that result from the fusion of two or more genes, and they are typically generated by perturbations in the genome structure of cancer cells. In turn, fusion genes can contribute to tumor formation and progression by promoting the expression of an oncogene, deregulating a tumor suppressor, or producing much more active abnormal proteins. More importantly, oncogenic fusion genes are specifically expressed in tumor cells, which provides enormous diagnostic and therapeutic advantages for cancer treatment. With the development of next-generation sequencing (NGS) technology, RNA-Seq has become increasingly popular for transcriptomic studies because of its high sensitivity and its capability of detecting novel transcripts, including fusion genes. To date, many fusion gene detection tools have been developed, most of which attempt to find reliable alignment evidence for chimeric transcripts from RNA-Seq data. It is well accepted that the alignment quality of sequencing reads against the reference genome is often limited when significant differences between the genomes exist, which is the case with cancer genomes that contain many genomic perturbations and structural variations. Hence, regions where fusion genes occur in the cancer genome tend to differ greatly from those in the reference genome, which prevents alignment-based fusion gene detection methods from achieving good accuracy. We developed a tool called ChimeRScope. ChimeRScope, being an alignment-free method, bypasses the sequence alignment step by assessing the gene fingerprint profiles (in the form of k-mers) from RNA-Seq paired-end reads for fusion gene prediction (Chapter Two). We also optimized the data structure and the ChimeRScope algorithms in order to overcome the limitations (memory utilization, low accuracy) that are commonly seen in alignment-free methods (Chapter Two). Results on simulated datasets, previously studied cancer RNA-Seq datasets, and experimental validations on in-house datasets have shown that ChimeRScope consistently performed better than other popular alignment-based methods irrespective of the read length and depth of sequencing coverage (Chapter Three). ChimeRScope also generates graphical outputs that illustrate the fusion patterns. Lastly, we developed downloadable software for ChimeRScope and implemented an online data analysis server using the Galaxy platform (Chapter Four). ChimeRScope is available at https://github.com/ChimeRScope/ChimeRScope/
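The entry above describes assessing k-mer gene fingerprints from paired-end reads instead of aligning reads to a reference. The Python sketch below illustrates that general idea only: each gene is reduced to its set of k-mers, each mate is assigned to the gene sharing the most k-mers with it, and a read pair whose mates are assigned to different genes is flagged as a fusion candidate. It is not ChimeRScope's algorithm or scoring scheme; the function names and toy sequences are hypothetical, and it ignores reverse-complement strands and sequencing errors for brevity.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_gene(read, gene_kmers, k):
    """Return the gene sharing the most k-mers with the read, or None if none are shared."""
    read_kmers = kmers(read, k)
    scores = {gene: len(read_kmers & ks) for gene, ks in gene_kmers.items()}
    gene = max(scores, key=scores.get)
    return gene if scores[gene] > 0 else None

def is_fusion_candidate(mate1, mate2, gene_kmers, k):
    """Flag a read pair whose two mates are assigned to different genes."""
    g1 = best_gene(mate1, gene_kmers, k)
    g2 = best_gene(mate2, gene_kmers, k)
    return g1 is not None and g2 is not None and g1 != g2

# Toy reference transcripts (hypothetical sequences).
K = 5
genes = {
    "GENE_A": "ACGTACGTTAGCCGATAGGCTA",
    "GENE_B": "TTGGCCAATTCCGGAAGTCTCA",
}
gene_kmers = {name: kmers(seq, K) for name, seq in genes.items()}

print(is_fusion_candidate("ACGTTAGCCG", "CCAATTCCGG", gene_kmers, K))  # mates from A and B
print(is_fusion_candidate("ACGTTAGCCG", "CGATAGGCTA", gene_kmers, K))  # both mates from A
```

Set lookups of this kind avoid the alignment step entirely, which is why fingerprint-based approaches can tolerate regions where the sample genome differs substantially from the reference; the cost is the memory needed to hold the k-mer index.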