11 research outputs found

    MetaBase--the wiki-database of biological databases.

    Biology is generating more data than ever. As a result, there is an ever-increasing number of publicly available databases that analyse, integrate and summarize the available data, providing an invaluable resource for the biological community. As this trend continues, there is a pressing need to organize, catalogue and rate these resources, so that the information they contain can be most effectively exploited. MetaBase (MB) (http://MetaDatabase.Org) is a community-curated database containing more than 2000 commonly used biological databases. Each entry is structured using templates and can carry various user comments and annotations. Entries can be searched, listed, browsed or queried. The database was created using the same MediaWiki technology that powers Wikipedia, allowing users to contribute on many different levels. The initial release of MB was derived from the content of the 2007 Nucleic Acids Research (NAR) Database Issue. Since then, approximately 100 databases have been manually collected from the literature, and users have added information for over 240 databases. MB is synchronized annually with the static Molecular Biology Database Collection provided by NAR. To date, there have been 19 significant contributors to the project; each one is listed as an author here to highlight the community aspect of the project.

    Knowledge Annotation within Research Data Management System for Oxygen-Free Production Technologies

    The comprehensive implementation of digital technologies in product manufacturing leads to changes in engineering processes and requires new approaches to data management. Organizing the collection, storage and reuse of research data obtained and used during the development of a product, system or technology, in line with the FAIR data principles, plays an important role. This article describes a Research Data Management System for organizing documentation and measurement requests in the research and development of new oxygen-free production technologies.

    Identifying a causal link between prolactin signaling pathways and COVID-19 vaccine-induced menstrual changes

    COVID-19 vaccines have been instrumental tools in the fight against SARS-CoV-2, helping to reduce disease severity and mortality. At the same time, just like any other therapeutic, COVID-19 vaccines were associated with adverse events. Women have reported menstrual cycle irregularity after receiving COVID-19 vaccines, and this led to renewed fears concerning COVID-19 vaccines and their effects on fertility. Herein we devised an informatics workflow to explore the causal drivers of menstrual cycle irregularity in response to vaccination with the mRNA COVID-19 vaccine BNT162b2. Our methods relied on gene expression analysis in response to vaccination, followed by network biology analysis to derive testable hypotheses regarding the causal links between BNT162b2 and menstrual cycle irregularity. Five high-confidence transcription factors were identified as causal drivers of BNT162b2-induced menstrual irregularity, namely: IRF1, STAT1, RelA (p65 NF-kB subunit), STAT2 and IRF3. Furthermore, some biomarkers of menstrual irregularity, including TNF, IL6R, IL6ST, LIF, BIRC3, FGF2, ARHGDIB, RPS3, RHOU and MIF, were identified as topological genes and predicted as causal drivers of menstrual irregularity. Our network-based mechanism reconstruction results indicated that BNT162b2 exerted biological effects similar to those resulting from prolactin signaling. However, these effects were short-lived and did not raise concerns about long-term infertility issues. This approach can be applied to interrogate the functional links between drugs/vaccines and other side effects.

    SNP based literature and data retrieval

    Magister Scientiae - MSc. Reference single nucleotide polymorphism (refSNP) identifiers are used to earmark SNPs in the human genome. These identifiers are often found in variant call format (VCF) files. RefSNPs can be useful to include as terms submitted to search engines when sourcing biomedical literature. In this thesis, the development of a bioinformatics software package is motivated, planned and implemented as a web application (http://sniphunter.sanbi.ac.za) with an application programming interface (API). The purpose is to allow scientists searching for relevant literature to query a database using refSNP identifiers and potential keywords assigned to scientific literature by the authors. Multiple queries can be launched simultaneously using either the web interface or the API. In addition, a VCF file parser was developed and packaged with the application to allow users to upload VCF files, extract information from them and write it to a file format that can be interpreted by the novel search engine created during this project. The parsing feature is seamlessly integrated with the web application's user interface, meaning the user is not expected to learn a scripting language. This multi-faceted software system, called SNiPhunter, aims to save researchers time during life sciences literature procurement by suggesting articles based on the number of times a reference SNP identifier has been mentioned in an article. This will allow the user to make a quantitative estimate of the relevance of an article. A second novel feature is the inclusion of the email address of a corresponding author in the results returned to the user, which promotes communication between scientists. Moreover, links to external functional information are provided to allow researchers to examine annotations associated with their reference SNP identifier of interest.
Standard information such as digital object identifiers and publishing dates, typically provided by other search engines, is also included in the results returned to the user. National Research Foundation (NRF) / The South African Research Chairs Initiative (SARChI)
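The two steps this abstract describes, pulling refSNP identifiers out of a VCF file and ranking articles by how often an identifier is mentioned, can be illustrated with a minimal Python sketch. This is not SNiPhunter's code or API: the function names `rsids_from_vcf` and `rank_articles`, the in-memory article dictionary and the plain substring counting are all assumptions made for illustration. The only format fact relied on is that the third tab-separated column of a VCF data line is the ID field, which may hold several identifiers separated by semicolons.

```python
import re
from collections import Counter

# refSNP identifiers have the form "rs" followed by digits.
RSID = re.compile(r"rs\d+")


def rsids_from_vcf(vcf_text):
    """Collect refSNP identifiers from the ID column of VCF data lines."""
    found = []
    for line in vcf_text.splitlines():
        # Skip blank lines and header lines (both '##' meta lines and '#CHROM').
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        # The ID column may list several identifiers separated by ';'.
        for ident in fields[2].split(";"):
            if RSID.fullmatch(ident):
                found.append(ident)
    return found


def rank_articles(articles, rsid):
    """Rank articles (hypothetical title -> text mapping) by mention count."""
    # Word boundaries stop 'rs123' from matching inside 'rs1234'.
    pattern = re.compile(r"\b" + re.escape(rsid) + r"\b")
    counts = Counter()
    for title, text in articles.items():
        mentions = len(pattern.findall(text))
        if mentions:
            counts[title] = mentions
    return counts.most_common()
```

A real service would query an indexed literature database rather than scan text in memory; the sketch only shows why a mention count gives a simple quantitative proxy for relevance.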

    Um algoritmo de alocação para bancos de dados biológicos distribuídos (An allocation algorithm for distributed biological databases)

    Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2014. Abstract: This work proposes a data allocation algorithm for distributed databases, based on data affinity and usage profiles, with a focus on biological relational databases. The proposal aims to advise database administrators (DBAs) on how to allocate data across the nodes of a cluster in order to obtain the best possible performance for queries and other user requests. The allocation scheme is verified through laboratory tests. The experiments are carried out on the InterMine data warehouse (DW) system (SMITH et al., 2012) using pgGrid, which adds replication and fragmentation functions to PostgreSQL, and HadoopDB (an implementation of the Map-Reduce model for relational databases). Finally, the algorithm is compared with other allocation proposals generated by algorithms developed in recent research.

    Bioinformatics assisted breeding, from QTL to candidate genes

    Over the last decade, the amount of data generated by a single run of an NGS sequencer has come to exceed that of days of work with Sanger sequencing. Metabolomics, proteomics and transcriptomics technologies have also evolved, producing more and more information at an ever faster rate. In addition, the number of databases available to biologists and breeders is increasing every year. The challenge for them becomes two-fold, namely: to cope with the increased amount of data produced by these new technologies and to cope with the distribution of the information across the Web. An example of a study with a lot of ~omics data is described in Chapter 2, where more than 600 peaks have been measured using liquid chromatography mass-spectrometry (LCMS) in peel and flesh of a segregating F1 apple population. In total, 669 mQTL were identified in this study. The number of mQTL identified is vast and almost overwhelming. Extracting meaningful information from such an experiment requires appropriate data filtering and data visualization techniques. The visualization of the distribution of the mQTL on the genetic map led to the discovery of QTL hotspots on linkage groups 1, 8, 13 and 16. The mQTL hotspot on linkage group 16 was further investigated and mainly contained compounds involved in the phenylpropanoid pathway. The apple genome sequence and its annotation were used to gain insight into genes potentially regulating this QTL hotspot. This led to the identification of the structural gene leucoanthocyanidin reductase (LAR1) as well as seven genes encoding transcription factors as putative candidates regulating the phenylpropanoid pathway, and thus candidates for the biosynthesis of health-beneficial compounds. However, this study also indicated bottlenecks in the availability of biologist-friendly tools to visualize large-scale QTL mapping results and smart ways to mine genes underlying QTL intervals.
In this thesis, we provide bioinformatics solutions to allow exploration of regions of interest on the genome more efficiently. In Chapter 3, we describe MQ2, a tool to visualize results of large-scale QTL mapping experiments. It allows biologists and breeders to use their favorite QTL mapping tool, such as MapQTL or R/qtl, and to visualize with MQ2 the distribution of these QTL along the genetic map used in the analysis. MQ2 provides the distribution of the QTL over the markers of the genetic map for a few hundred traits. MQ2 is accessible online via its web interface but can also be used locally via its command line interface. In Chapter 4, we describe Marker2sequence (M2S), a tool to filter out genes of interest from all the genes underlying a QTL. M2S returns the list of genes for a specific genome interval and provides a search function to filter out genes related to the provided keyword(s) by their annotation. Genome annotations often contain cross-references to resources such as the Gene Ontology (GO), or proteins of the UniProt database. Via these annotations, additional information can be gathered about each gene. By integrating information from different resources and offering a way to mine the list of genes present in a QTL interval, M2S provides a way to reduce a list of hundreds of genes to possibly tens of genes or fewer potentially related to the trait of interest. Using semantic web technologies, M2S integrates multiple resources and has the flexibility to extend this integration to more resources as they become available to these technologies. Besides the importance of efficient bioinformatics tools to analyze and visualize data, the work in Chapter 2 also revealed the importance of regulatory elements controlling key genes of pathways. The limitation of M2S is that it only considers genes within the interval.
In genome annotations, transcription factors are not linked to the trait (keyword) or to the genes they control, and these relationships will therefore not be considered. By integrating information about the gene regulatory network of the organism into Marker2sequence, the tool should be able to include in its list genes that lie outside of the QTL interval but are regulated by elements present within it. In tomato, the genome annotation already lists a number of transcription factors; however, it does not provide any information about their targets. In Chapter 5, we describe how we combined transcriptomics information with six genotypes from an Introgression Line (IL) population to find genes differentially expressed while being in a similar genomic background (i.e. outside of any introgression segments) as the reference genotype (with no introgression). These genes may be differentially expressed as a result of a regulatory element present in an introgression. The promoter regions of these genes have been analyzed for DNA motifs, and putative transcription factor binding sites have been found. The approaches taken in M2S (Chapter 4) are focused on a specific region of the genome, namely the QTL interval. In Chapter 6, we generalized this approach to develop Annotex. Annotex provides a simple way to browse the cross-references existing between biological databases (ChEBI, Rhea, UniProt, GO) and genome annotations. The main concept of Annotex is that, from any type of data present in the databases, one can navigate the cross-references to retrieve the desired type of information. This thesis has resulted in the production of three tools that biologists and breeders can use to speed up their research and build new hypotheses. This thesis also revealed the state of bioinformatics with regard to data integration.
It also reveals the need for integration into annotations (for example, genome annotations, protein annotations, and pathway annotations) of more ontologies than just the Gene Ontology (GO) currently used. Multiple platforms are arising to build these new ontologies, but the process of integrating them into existing resources remains to be done. It also confirms the state of the data in plants, where multiple resources may contain overlapping information. Finally, this thesis also shows what can be achieved when the data is made inter-operable, which should be an incentive to the community to work together and build inter-operable, non-overlapping resources, creating a bioinformatics Web for plant research.
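The interval-plus-keyword filtering that this abstract attributes to Marker2sequence can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `Gene` record, the function names and the substring matching on free-text annotation are hypothetical, whereas the abstract says the actual tool integrates its resources through semantic web technologies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; the field names are illustrative only."""
    gene_id: str
    chromosome: str
    start: int
    end: int
    annotation: str


def genes_in_interval(genes, chromosome, start, end):
    """Return genes overlapping the closed interval [start, end]."""
    return [
        g for g in genes
        if g.chromosome == chromosome and g.start <= end and g.end >= start
    ]


def filter_by_keywords(genes, keywords):
    """Keep genes whose annotation mentions any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return [
        g for g in genes
        if any(k in g.annotation.lower() for k in lowered)
    ]
```

Chaining the two calls, first restricting to the QTL interval and then searching annotations, is what reduces a long gene list to a short candidate list. Genes outside the interval are never considered, which is the limitation the abstract itself points out.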

    Hierarchical information representation and efficient classification of gene expression microarray data

    In the field of computational biology, microarrays are used to measure the activity of thousands of genes at once and create a global picture of cellular function. Microarrays allow scientists to analyze the expression of many genes in a single experiment quickly and efficiently. Even though microarrays are a consolidated research technology nowadays and the trends in high-throughput data analysis are shifting towards new technologies like Next Generation Sequencing (NGS), an optimum method for sample classification has not been found yet. Microarray classification is a complicated task, not only due to the high dimensionality of the feature set, but also due to an apparent lack of data structure. This characteristic limits the applicability of processing techniques, such as wavelet filtering or other filtering techniques that take advantage of known structural relations. On the other hand, it is well known that genes are not expressed independently of each other: genes have a high interdependence related to the regulating biological process involved. This thesis aims to improve the current state of the art in microarray classification and to contribute to understanding how signal processing techniques can be developed and applied to analyze microarray data. Building a classification framework requires exploratory work in which algorithms are constantly tried and adapted to the analyzed data. The algorithms and classification frameworks developed in this thesis tackle the problem with two essential building blocks. The first one deals with the lack of a priori structure by inferring a data-driven structure with unsupervised hierarchical clustering tools. The second key element is a proper feature selection tool to produce a precise classifier as an output and to reduce the overfitting risk. The main focus of this thesis is binary data classification, a field in which we obtained relevant improvements to the state of the art.
The first key element is the data-driven structure, obtained by modifying hierarchical clustering algorithms derived from the Treelets algorithm from the literature. Several alternatives to the original reference algorithm have been tested, changing either the similarity metric used to merge the features or the way two features are merged. Moreover, the possibility of including external sources of information from publicly available biological knowledge and ontologies to improve the structure generation has been studied too. Regarding feature selection, two alternative approaches have been studied: the first one is a modification of the IFFS algorithm as a wrapper feature selection, while the second approach involved an ensemble learning focus. To obtain good results, the IFFS algorithm has been adapted to the data characteristics by introducing new elements to the selection process, such as a reliability measure and a scoring system to better select the best feature at each iteration. The second feature selection approach is based on ensemble learning, taking advantage of the feature abundance of microarrays to implement a different selection scheme. New algorithms have been studied in this field, adapting state-of-the-art algorithms to the microarray data characteristics of small sample numbers and high feature numbers. In addition to the binary classification problem, the multiclass case has been addressed too. A new algorithm combining multiple binary classifiers has been evaluated, exploiting the redundancy offered by multiple classifiers to obtain better predictions. All the algorithms studied throughout this thesis have been evaluated using high-quality publicly available data, following established testing protocols from the literature to offer a proper benchmarking against the state of the art.
Whenever possible, multiple Monte Carlo simulations have been performed to increase the robustness of the obtained results.
In the field of computational biology, microarrays are used to measure the activity of thousands of genes at once and to produce a global picture of cellular function. Microarrays make it possible to analyze the expression of many genes in a single experiment, quickly and efficiently. Although microarrays are nowadays a consolidated research technology and the trend is towards new technologies such as Next Generation Sequencing (NGS), an optimal method for sample classification has not yet been found. Classifying microarray samples is a complicated task because of the high number of variables and the lack of structure in the data. This characteristic prevents the use of processing techniques that rely on structural relations, such as wavelet filtering or other filtering techniques. On the other hand, genes are not expressed independently of one another: genes are interrelated according to the biological process that regulates them. The aim of this thesis is to improve the state of the art in microarray classification and to contribute to understanding how signal processing techniques can be designed and applied to analyze microarrays. Building a classification algorithm requires a study in which existing algorithms are tested and adapted to the analyzed data. The algorithms developed in this thesis tackle the problem with two essential building blocks. The first addresses the lack of structure by deriving a binary tree with unsupervised clustering tools. The second element, fundamental for obtaining precise classifiers while reducing the risk of overfitting, is a variable selection component. The main task in this thesis is binary data classification, in which we obtained relevant improvements over the state of the art.
The first step is the generation of a structure, for which the Treelets algorithm available in the literature was used. Multiple alternatives to this original algorithm have been proposed and evaluated, changing the similarity metrics or the merging rules used during the process. In addition, the possibility of using external sources of information, such as ontologies of biological knowledge, to improve the structure inference has been studied. Two different approaches to variable selection have been studied: the first is a modification of the IFFS algorithm and the second uses an ensemble learning scheme. The IFFS algorithm has been adapted to the characteristics of microarrays to obtain better results, adding elements such as a reliability measure and a scoring system to select the best variable at each iteration. The ensemble-based method takes advantage of the abundance of features in microarrays to implement a different selection scheme. Different algorithms have been studied in this field, adapting existing alternatives to the small number of samples and high number of variables typical of microarrays. The classification problem with more than two classes has also been addressed by studying a new algorithm that combines multiple binary classifiers. The proposed algorithm exploits the redundancy offered by multiple classifiers to obtain more reliable predictions. All the algorithms proposed in this thesis have been evaluated with high-quality public data, following protocols established in the literature in order to offer a reliable comparison with the state of the art. Whenever possible, Monte Carlo simulations have been applied to improve the robustness of the results.

    Provenance, propagation and quality of biological annotation

    PhD Thesis. Biological databases have become an integral part of the life sciences, being used to store, organise and share ever-increasing quantities and types of data. Biological databases are typically centred around raw data, with individual entries being assigned to a single piece of biological data, such as a DNA sequence. Although the raw data is essential, a reader can obtain little information from it alone. Therefore, many databases aim to supplement their entries with annotation, allowing the current knowledge about the underlying data to be conveyed to a reader. Although annotations come in many different forms, most databases provide some form of free text annotation. Given that annotations can form the foundations of future work, it is important that a user is able to evaluate the quality and correctness of an annotation. However, this is rarely straightforward. The amount of annotation, and the way in which it is curated, varies between databases. For example, the production of an annotation in some databases is entirely automated, without any manual intervention. Further, sections of annotations may be reused, being propagated between entries and, potentially, external databases. This provenance and curation information is not always apparent to a user. The work described within this thesis explores issues relating to biological annotation quality. While the most valuable annotation is often contained within free text, its lack of structure makes it hard to assess. Initially, this work describes a generic approach that allows textual annotations to be quantitatively measured. This approach is based upon the application of Zipf's Law to words within textual annotation, resulting in a single value, α. The relationship between the α value and Zipf's principle of least effort provides an indication as to the annotation's quality, whilst also allowing annotations to be quantitatively compared.
Secondly, the thesis focuses on determining annotation provenance and tracking any subsequent propagation. This is achieved through the development of a visualisation framework, which exploits the reuse of sentences within annotations. Utilising this framework, a number of propagation patterns were identified, which on analysis appear to indicate low-quality and erroneous annotation. Together, these approaches increase our understanding of the textual characteristics of biological annotation, and suggest that this understanding can be used to increase the overall quality of these resources.
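To make the Zipf's Law measurement concrete, here is a minimal Python sketch that reduces a block of text to a single exponent. It rests on assumptions: the exponent is estimated with an ordinary least-squares fit of log frequency against log rank, which is a simple stand-in and not necessarily the estimator used in the thesis, and the function name `zipf_exponent` and the tokenisation rule are hypothetical.

```python
import math
import re
from collections import Counter


def zipf_exponent(text):
    """Estimate the Zipf exponent of a text from its word frequencies.

    Fits log(frequency) = c - exponent * log(rank) by least squares
    and returns the exponent as a positive number.
    """
    # Crude tokenisation: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying distribution.
    return -sxy / sxx
```

Because the result is a single number per corpus, two sets of annotations can be compared directly, which is the kind of quantitative comparison the abstract describes. Least-squares fits on log-log data are known to be sensitive to the long tail of rare words, so a more careful analysis would likely use a dedicated power-law fitting method.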

    Role of MIR-29B-1 and MIR-29A in endocrine-resistant breast cancer.

    Therapies targeting estrogen receptor α (ERα), including selective estrogen receptor modulators (SERMs), e.g., tamoxifen (TAM); selective estrogen receptor downregulators (SERDs), e.g., fulvestrant (ICI 182,780); and aromatase inhibitors (AI), e.g., letrozole, are successfully used in treating breast cancer patients whose initial tumor expresses ERα. Unfortunately, the effectiveness of endocrine therapies is limited, as ~40% of breast cancer patients will eventually acquire resistance to them. The role of miRNAs in the progression of endocrine-resistant breast cancer is of keen interest in developing biomarkers and therapies to counter metastatic disease. This dissertation begins with a review on miRNAs implicated in breast cancer, their bona fide gene targets, and associated pathways promoting endocrine resistance. Although microRNAs are dysregulated in breast cancer, their contribution to endocrine resistance is not yet fully understood. Previous microarray analysis identified miR-29a and miR-29b-1 as repressed by TAM in MCF-7 endocrine-sensitive breast cancer cells but stimulated by TAM in LY2 endocrine-resistant breast cancer cells. Here we examined the mechanism for the differential regulation of these miRs by TAM in MCF-7 versus TAM-resistant LY2 and LCC9 breast cancer cells and the functional role of these microRNAs in these cells. Knockdown studies revealed that ERα is responsible for TAM regulation of miR-29b-1/a transcription. Transient overexpression of miR-29b-1/a decreased MCF-7, LCC9, and LY2 proliferation and inhibited LY2 cell migration and colony formation but did not sensitize LCC9 or LY2 cells to TAM. Furthermore, in LY2 cells TAM reduced the mRNA and protein levels of DICER1, a known target of miR-29. Supporting this observation, anti-miR-29b-1 or anti-miR-29a inhibited the suppression of DICER by 4-OHT. These results suggest that miR-29b-1/a have tumor suppressor activity in TAM-resistant cells and do not appear to play a role in mediating TAM resistance.
The target genes mediating miR-29b-1/a tumor suppressor activity were unknown. Here, using RNA sequencing, we identify miR-29b-1 and miR-29a target transcripts in both MCF-7 and LCC9 cells. We find that miR-29b-1 and miR-29a regulate common and unique transcripts in each cell line. The cell-specific and common downregulated genes were characterized using the MetaCore Gene Ontology (GO) enrichment analysis algorithm. LCC9-specific miR-29b-1/a-regulated GO processes include oxidative phosphorylation, ATP metabolism, and apoptosis. Extracellular flux analysis of cells transfected with anti- or pre-miR-29a confirmed that miR-29a inhibits mitochondrial bioenergetics in LCC9 cells. qPCR and luciferase reporter assays also verified the ATP synthase subunit genes ATP5G1 and ATPIF1 as bona fide miR-29b-1/a targets. Our results suggest that miR-29 repression of TAM-resistant breast cancer cell proliferation is mediated in part through repression of genes important in mitochondrial bioenergetics. There is a critical need to develop sensitive circulating biomarkers that accurately identify signaling pathways altered in breast cancer patients resistant to endocrine therapies. Serum miRNAs have the potential to serve as biomarkers of the progression of endocrine-resistant breast cancer due to their cancer-specific expression and stability. Exosomal transfer of miRNAs has been implicated in metastasis and endocrine resistance. This dissertation ends with a review on miRNAs in breast tumors and in serum, including exosomes, from breast cancer patients that are associated with resistance to tamoxifen.

    An integrative polyomics investigation of bovine mastitis

    Bovine mastitis, inflammation of the mammary gland, is one of the most costly and prevalent diseases in the dairy industry. It is commonly caused by bacteria, and Streptococcus uberis is one of the most prevalent causative agents. With advancements in omics technologies, the analysis of system-wide changes in the expression of proteins and metabolites in milk has become possible, and such analyses have broadened the knowledge of molecular changes in bovine mastitis. The work presented in this thesis aims to understand the dynamics of molecular changes in bovine mastitis caused by Streptococcus uberis through system-wide profiling and integrated analysis of milk proteins and metabolites. To this end, archived milk samples collected at specific intervals during the course of an experimentally induced model of Streptococcus uberis mastitis were used. Label-free quantitative proteomics and untargeted metabolomics data were generated from the archived milk samples obtained from six cows at six time-points (0, 36, 42, 57, 81 & 312 hours post-challenge). A total of 570 bovine proteins and 690 putative metabolites were quantified. Hierarchical cluster analysis and principal component analysis showed clustering of samples by the stage of infection, with similarities between pre-infection and resolution stages (0 and 312 hours post-challenge), early infection stages (36 and 42 hours post-challenge) and late infection stages (57 and 81 hours post-challenge). The proteomics and metabolomics data were analysed at both individual omics-layer level and combined inter-layer-level. At individual omics layer-level, the temporal changes identified include changes in the expression of proteins in acute-phase response signalling, FXR/RXR activation, complement system, IL-6 and IL-10 pathways, and changes in the expression of metabolites related to amino acid, carbohydrate, lipid and nucleotide metabolisms. 
The combined inter-layer-level analyses revealed the functional relevance of proteins and metabolites enriched in the co-expression modules. For example, a possible immunomodulatory role of bile acids via the FXR/RXR activation pathways could be inferred. Similarly, the actin-binding proteins could be linked to endocytic trafficking of signalling receptors. Overall, the work presented in this thesis provides a deeper understanding of molecular changes in mastitis. On a secondary note, it also serves as a case study in the use of integrative polyomics analysis methods in the investigation of host-pathogen interactions.