110 research outputs found

    The Combinational Polymorphisms of ORAI1


    Haplotype estimation in polyploids using DNA sequence data

    Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, including in important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage over the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithms deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops.

    Structural Prediction of Protein–Protein Interactions by Docking: Application to Biomedical Problems

    A huge amount of genetic information is available thanks to the recent advances in sequencing technologies and the larger computational capabilities, but the interpretation of such genetic data at phenotypic level remains elusive. One of the reasons is that proteins are not acting alone, but are specifically interacting with other proteins and biomolecules, forming intricate interaction networks that are essential for the majority of cell processes and pathological conditions. Thus, characterizing such interaction networks is an important step in understanding how information flows from gene to phenotype. Indeed, structural characterization of protein–protein interactions at atomic resolution has many applications in biomedicine, from diagnosis and vaccine design, to drug discovery. However, despite the advances of experimental structural determination, the number of interactions for which there is available structural data is still very small. In this context, a complementary approach is computational modeling of protein interactions by docking, which is usually composed of two major phases: (i) sampling of the possible binding modes between the interacting molecules and (ii) scoring for the identification of the correct orientations. In addition, prediction of interface and hot-spot residues is very useful in order to guide and interpret mutagenesis experiments, as well as to understand functional and mechanistic aspects of the interaction. Computational docking is already being applied to specific biomedical problems within the context of personalized medicine, for instance, helping to interpret pathological mutations involved in protein–protein interactions, or providing modeled structural data for drug discovery targeting protein–protein interactions.
Spanish Ministry of Economy grant number BIO2016-79960-R; D.B.B. is supported by a predoctoral fellowship from CONACyT; M.R. is supported by an FPI fellowship from the Severo Ochoa program. We are grateful to the Joint BSC-CRG-IRB Programme in Computational Biology.
Peer reviewed. Postprint (author's final draft).

    Computational Approaches To Anti-Toxin Therapies And Biomarker Identification

    This work describes the fundamental study of two bacterial toxins with computational methods, the rational design of a potent inhibitor using molecular dynamics, as well as the development of two bioinformatic methods for mining genomic data. Clostridium difficile is an opportunistic bacillus which produces two large glucosylating toxins. These toxins, TcdA and TcdB, cause severe intestinal damage. As Clostridium difficile harbors considerable antibiotic resistance, one treatment strategy is to prevent the tissue damage that the toxins cause. The catalytic glucosyltransferase domain of TcdA and TcdB was studied using molecular dynamics in the presence of both a protein-protein binding partner and several substrates. These experiments were combined with lead optimization techniques to create a potent irreversible inhibitor which protects 95% of cells in vitro. Dynamics studies on a TcdB cysteine protease domain were performed to identify an allosteric communication pathway. Comparative analysis of the static and dynamic properties of the TcdA and TcdB glucosyltransferase domains was carried out to determine the basis for the differential lethality of these toxins. Large scale biological data is readily available in the post-genomic era, but it can be difficult to use that data effectively. Two bioinformatics methods were developed to process whole-genome data. Software was developed to return all genes containing a motif in a single genome. This provides a list of genes which may be within the same regulatory network or targeted by a specific DNA binding factor. A second bioinformatic method was created to link the data from genome-wide association studies (GWAS) to specific genes. GWAS results are frequently subjected to statistical analysis, but mutations are rarely investigated structurally. HyDn-SNP-S allows a researcher to find mutations in a gene that correlate with a GWAS-studied phenotype.
Across human DNA polymerases, this resulted in strongly predictive haplotypes for breast and prostate cancer. Molecular dynamics applied to DNA Polymerase Lambda suggested a structural explanation for the decrease in polymerase fidelity with that mutant. When applied to Histone Deacetylases, mutations were found that alter substrate binding and post-translational modification.
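The motif search mentioned in this abstract (returning all genes in a genome that contain a motif) can be illustrated with a minimal sketch. This is not the software described in the thesis: the function names, the dictionary input format, and the choice to search only the forward strand are assumptions made for illustration. IUPAC ambiguity codes are expanded into regular-expression character classes.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile an IUPAC motif string into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))

def genes_with_motif(genes, motif):
    """Return the names of genes whose forward-strand sequence contains the motif.

    `genes` is a hypothetical mapping of gene name -> DNA sequence.
    """
    pattern = motif_to_regex(motif)
    return [name for name, seq in genes.items() if pattern.search(seq.upper())]
```

A real tool would probably also scan the reverse complement and upstream regulatory regions; the sketch leaves those out to stay short.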

    Improvements on the bees algorithm for continuous optimisation problems

    This work focuses on improvements to the Bees Algorithm that enhance its performance, especially in terms of convergence rate. For the first enhancement, a pseudo-gradient Bees Algorithm (PG-BA) compares the fitness as well as the position of previous and current bees so that the best bees in each patch are appropriately guided towards a better search direction after each consecutive cycle. Unlike the typical gradient search method, this method eliminates the need to differentiate the objective function. The improved algorithm is subjected to several numerical benchmark test functions as well as the training of a neural network. The results from the experiments are then compared to the standard variant of the Bees Algorithm and other swarm intelligence procedures. The data analysis generally confirmed that the PG-BA is effective at speeding up convergence to the optimum. Next, an approach to avoid the formation of overlapping patches is proposed. The Patch Overlap Avoidance Bees Algorithm (POA-BA) is designed to avoid redundancy in the search area, especially if the site is deemed unprofitable. This method is quite similar to Tabu Search (TS) in that the POA-BA forbids the exact exploitation of previously visited solutions along with their corresponding neighbourhood. Patches are not allowed to intersect, not just in the next generation but also in the current cycle. This reduces the number of patches that materialise in the same peak (maximisation) or valley (minimisation), which ensures a thorough search of the problem landscape as bees are distributed around the scaled-down area. The same benchmark problems as for PG-BA were applied to this modified strategy with reasonable success. Finally, the Bees Algorithm is revised to have the capability of locating all of the global optima as well as the substantial local peaks in a single run.
These multiple solutions of comparable fitness offer some alternatives for decision makers to choose from. In this so-called Extended Bees Algorithm (EBA), patches are formed only if the bees are the fittest from different peaks, as determined by a hill-valley mechanism. This permits the maintenance of diversified solutions throughout the search process, in addition to minimising the chances of getting trapped. This version proved beneficial when tested with numerous multimodal optimisation problems.
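For orientation, the standard Bees Algorithm that these variants build on can be sketched as follows. This is a minimal illustration of the basic scheme only (scout bees, recruitment to the best sites, neighbourhood shrinking), not an implementation of PG-BA, POA-BA, or EBA. All parameter names and default values are assumptions, and the sphere function stands in for the benchmark functions the abstract mentions.

```python
import random

def sphere(x):
    """Benchmark objective: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)

def bees_algorithm(objective, dim=2, bounds=(-5.0, 5.0), n_scouts=20,
                   n_best=5, n_recruits=10, patch_size=0.5, shrink=0.9,
                   n_iter=100, seed=0):
    """Basic Bees Algorithm for minimisation (illustrative parameter names)."""
    rng = random.Random(seed)
    lo, hi = bounds

    def random_point():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    def neighbour(site, radius):
        # Sample inside the patch around `site`, clipped to the bounds.
        return [min(hi, max(lo, v + rng.uniform(-radius, radius))) for v in site]

    scouts = sorted((random_point() for _ in range(n_scouts)), key=objective)
    radius = patch_size
    for _ in range(n_iter):
        new_population = []
        # Local search: recruit bees to each best site, keep the fittest per patch.
        for site in scouts[:n_best]:
            candidates = [site] + [neighbour(site, radius) for _ in range(n_recruits)]
            new_population.append(min(candidates, key=objective))
        # Global search: the remaining scouts are scattered at random.
        new_population += [random_point() for _ in range(n_scouts - n_best)]
        scouts = sorted(new_population, key=objective)
        radius *= shrink  # neighbourhood shrinking
    return scouts[0], objective(scouts[0])
```

Calling `bees_algorithm(sphere)` returns the best position found and its objective value. Because each patch keeps its current site among the candidates, the best value never gets worse from one cycle to the next.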

    Biodiversity assessment of marine benthic communities with COI metabarcoding: methods and applications

    Ecosystem biomonitoring is crucial for proper management of natural communities during the Anthropocene era. With the advent of new sequencing technologies, DNA metabarcoding has been proposed as a game-changing tool for biomonitoring. In this Thesis we plead for the use of metabarcoding of a highly variable marker to infer not only the interspecies but also the intraspecies variability, in order to assess both biogeographic patterns, at the species level, and metaphylogeographic patterns, at the haplotype level. We focused on highly complex hard-substratum benthic littoral communities. The term "Metaphylogeography", coined in this Thesis, refers to the study of phylogeographic patterns of many species at the same time using metabarcoding data. However, as of the start of this Thesis, only a few studies had tested the metabarcoding method to directly characterize the whole eukaryotic community in highly diverse benthic ecosystems. This required setting up and calibrating methods for these communities as a prior step. We first evaluated both the sampling methods and the bioinformatic pipelines. We assessed the viability of detecting the environmental DNA released from the benthic community into the adjacent water layer using metabarcoding of COI with highly degenerate primers targeting the whole eukaryotic community. We sampled water from 0 to 20 m from shallow rocky benthic communities and compared the DNA signal with the results obtained from directly metabarcoding the benthic communities by traditional quadrat sampling. We also designed a pipeline combining clustering and denoising methods to treat metabarcoding data of COI. We considered the entropy of each codon position of this coding fragment both to improve the detection of spurious sequences and to calibrate the best performing parameters of the software used. In addition, we created our own denoising program, DnoisE, to incorporate information on the codon position.
This new code and parameter calibration were required as the commonly used bioinformatic pipelines had been designed and tested mostly for less variable ribosomal fragments and, particularly, in prokaryotes. Results showed that the DNA signal from the benthos decreased with distance and was too weak for a correct assessment of benthic biodiversity. The proportion of eukaryotic DNA sequenced was also very low in water samples due to the amplification of prokaryotic DNA. We thus concluded that the benthos must be sampled directly to properly assess its biodiversity composition. The new bioinformatic developments allowed us to propose new methods for processing metabarcoding reads, combining clustering and denoising steps, and to set optimal values for the parameters used at each step. These contributions effectively expanded the field to the novel analysis of inter- and intraspecies genetic variability with metabarcoding data. Finally, we applied this methodology to 12 localities of the Western Iberian Coast along two well studied fronts, the Almeria-Oran Front (AOF) and the Ibiza Channel (IC). We analysed the species and haplotypes using the COI barcode. From a biogeographical perspective, the AOF had a strong effect in separating regions, while the IC effect was less marked, but still half of the MOTUs were found on only one side of this divide. For the metaphylogeographic analysis, only 10% of the MOTUs could be used. However, they showed a good separation between populations of the three regions, with a strong effect of the AOF break. The IC, on the other hand, seemed to be more of a transitional zone than a fixed break. This Thesis laid the ground for the efficient use of metabarcoding in the biomonitoring of benthic reef habitats, allowing community composition, β-diversity, and biogeographic patterns to be analysed in a fast, repeatable, and cost-efficient way.
We also developed the metaphylogeography approach as a new tool to assess population genetic structure at the community-wide level.
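The per-codon-position entropy mentioned in this abstract can be illustrated with a short sketch. It is not DnoisE's code or interface. The function name is hypothetical, and the sketch assumes gap-free sequences of equal length that are aligned in frame, with the first base at codon position 1. It computes the Shannon entropy of every alignment column and averages it separately over codon positions 1, 2 and 3.

```python
import math
from collections import Counter

def codon_position_entropy(aligned_seqs):
    """Mean Shannon entropy (bits) for codon positions 1, 2 and 3.

    Assumes equal-length, in-frame sequences starting at codon position 1.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    totals = [0.0, 0.0, 0.0]
    counts = [0, 0, 0]
    for i in range(length):
        column = Counter(s[i] for s in aligned_seqs)
        n = sum(column.values())
        # Shannon entropy of the nucleotide frequencies in this column.
        h = -sum((c / n) * math.log2(c / n) for c in column.values())
        totals[i % 3] += h
        counts[i % 3] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]
```

In a protein-coding marker such as COI, the third codon position usually varies most, because many third-position changes are synonymous. That is why a per-position entropy profile can help separate plausible biological variation from spurious sequences.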

    Applications

    Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production and milling for quality control during manufacturing processes; in traffic and logistics for smart cities; and for mobile communications.


    Reconstrução e classificação de sequências de ADN desconhecidas (Reconstruction and classification of unknown DNA sequences)

    The continuous advances in DNA sequencing technologies and metagenomic techniques require reliable reconstruction and accurate classification methodologies to increase the diversity of the natural repository while contributing to the description and organization of organisms. However, after sequencing and de novo assembly, one of the most complex challenges comes from DNA sequences that do not match or resemble any biological sequence from the literature. Three main reasons contribute to this exception: the organism's sequence diverges strongly from the known organisms in the literature, an irregularity has been created in the reconstruction process, or a new organism has been sequenced. The inability to efficiently classify these unknown sequences increases the uncertainty about the sample's constitution and is a wasted opportunity to discover new species, since such sequences are often discarded. In this context, the main objective of this thesis is the development and validation of a tool that provides an efficient computational solution to these three challenges based on an ensemble of experts, namely compression-based predictors, the distribution of sequence content, and normalized sequence lengths. The method uses both DNA and amino acid sequences and provides efficient classification beyond standard referential comparisons. Unusually, it classifies DNA sequences without resorting directly to reference genomes, relying instead on features that the biological sequences of a species share. Specifically, it only makes use of features extracted individually from each genome, without using sequence comparisons. In addition, the pipeline is fully automatic and allows reference-free reconstruction of genomes from FASTQ reads, with the additional guarantee of secure storage of sensitive information. RFSC was then created as a machine learning classification pipeline that relies on an ensemble of experts to provide efficient classification in metagenomic contexts.
This pipeline was tested on synthetic and real data, in both cases achieving precise and accurate results that, at the time of the development of this thesis, had not been reported in the state of the art. Specifically, it achieved an accuracy of approximately 97% in the domain/type classification.
Master's in Computer and Telematics Engineering (Mestrado em Engenharia de Computadores e Telemática)
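To make the idea of reference-free, per-sequence features concrete, here is a minimal sketch. It is not the RFSC pipeline. The function names are hypothetical, and the general-purpose `zlib` compressor is used only as a stand-in for the specialised compression-based predictors the thesis refers to. Each feature is computed from a single sequence, with no comparison against a reference genome.

```python
import zlib

def compression_ratio(seq):
    """Compressed size divided by raw size; lower means more redundancy."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    raw = seq.upper().encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)

def gc_content(seq):
    """Fraction of G and C bases, a simple summary of sequence content."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

def sequence_features(seq):
    """Feature vector computed from one sequence alone (no reference needed)."""
    return {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "compression_ratio": compression_ratio(seq),
    }
```

Feature dictionaries like these could then be passed to any standard classifier. The choice of classifier, and of length normalisation, is left open here because the abstract does not specify them.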