110 research outputs found
Haplotype estimation in polyploids using DNA sequence data
Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, among others in important corps such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage over the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithm deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops. </p
Structural Prediction of Protein–Protein Interactions by Docking: Application to Biomedical Problems
A huge amount of genetic information is available thanks to the recent advances in sequencing technologies and the larger computational capabilities, but the interpretation of such genetic data at phenotypic level remains elusive. One of the reasons is that proteins are not acting alone, but are specifically interacting with other proteins and biomolecules, forming intricate interaction networks that are essential for the majority of cell processes and pathological conditions. Thus, characterizing such interaction networks is an important step in understanding how information flows from gene to phenotype. Indeed, structural characterization of protein–protein interactions at atomic resolution has many applications in biomedicine, from diagnosis and vaccine design, to drug discovery. However, despite the advances of experimental structural determination, the number of interactions for which there is available structural data is still very small. In this context, a complementary approach is computational modeling of protein interactions by docking, which is usually composed of two major phases: (i) sampling of the possible binding modes between the interacting molecules and (ii) scoring for the identification of the correct orientations. In addition, prediction of interface and hot-spot residues is very useful in order to guide and interpret mutagenesis experiments, as well as to understand functional and mechanistic aspects of the interaction. Computational docking is already being applied to specific biomedical problems within the context of personalized medicine, for instance, helping to interpret pathological mutations involved in protein–protein interactions, or providing modeled structural data for drug discovery targeting protein–protein interactions.Spanish Ministry of Economy grant number BIO2016-79960-R; D.B.B. is supported by a
predoctoral fellowship from CONACyT; M.R. is supported by an FPI fellowship from the
Severo Ochoa program. We are grateful to the Joint BSC-CRG-IRB Programme in
Computational Biology.Peer ReviewedPostprint (author's final draft
Computational Approaches To Anti-Toxin Therapies And Biomarker Identification
This work describes the fundamental study of two bacterial toxins with computational methods, the rational design of a potent inhibitor using molecular dynamics, as well as the development of two bioinformatic methods for mining genomic data.
Clostridium difficile is an opportunistic bacillus which produces two large glucosylating toxins. These toxins, TcdA and TcdB cause severe intestinal damage. As Clostridium difficile harbors considerable antibiotic resistance, one treatment strategy is to prevent the tissue damage that the toxins cause. The catalytic glucosyltransferase domain of TcdA and TcdB was studied using molecular dynamics in the presence of both a protein-protein binding partner and several substrates. These experiments were combined with lead optimization techniques to create a potent irreversible inhibitor which protects 95% of cells in vitro. Dynamics studies on a TcdB cysteine protease domain were performed to an allosteric communication pathway. Comparative analysis of the static and dynamic properties of the TcdA and TcdB glucosyltransferase domains were carried out to determine the basis for the differential lethality of these toxins.
Large scale biological data is readily available in the post-genomic era, but it can be difficult to effectively use that data. Two bioinformatics methods were developed to process whole-genome data. Software was developed to return all genes containing a motif in single genome. This provides a list of genes which may be within the same regulatory network or targeted by a specific DNA binding factor. A second bioinformatic method was created to link the data from genome-wide association studies (GWAS) to specific genes. GWAS studies are frequently subjected to statistical analysis, but mutations are rarely investigated structurally. HyDn-SNP-S allows a researcher to find mutations in a gene that correlate to a GWAS studied phenotype. Across human DNA polymerases, this resulted in strongly predictive haplotypes for breast and prostate cancer. Molecular dynamics applied to DNA Polymerase Lambda suggested a structural explanation for the decrease in polymerase fidelity with that mutant. When applied to Histone Deacetylases, mutations were found that alter substrate binding, and post-translational modification
Improvements on the bees algorithm for continuous optimisation problems
This work focuses on the improvements of the Bees Algorithm in order to enhance the algorithm’s performance especially in terms of convergence rate. For the first enhancement, a pseudo-gradient Bees Algorithm (PG-BA) compares the fitness as well as the position of previous and current bees so that the best bees in each patch are appropriately guided towards a better search direction after each consecutive cycle. This method eliminates the need to differentiate the objective function which is unlike the typical gradient search method. The improved algorithm is subjected to several numerical benchmark test functions as well as the training of neural network. The results from the experiments are then compared to the standard variant of the Bees Algorithm and other swarm intelligence procedures. The data analysis generally confirmed that the PG-BA is effective at speeding up the convergence time to optimum.
Next, an approach to avoid the formation of overlapping patches is proposed. The Patch Overlap Avoidance Bees Algorithm (POA-BA) is designed to avoid redundancy in search area especially if the site is deemed unprofitable. This method is quite similar to Tabu Search (TS) with the POA-BA forbids the exact exploitation of previously visited solutions along with their corresponding neighbourhood. Patches are not allowed to intersect not just in the next generation but also in the current cycle. This reduces the number of patches materialise in the same peak (maximisation) or valley (minimisation) which ensures a thorough search of the problem landscape as bees are distributed around the scaled down area. The same benchmark problems as PG-BA were applied against this modified strategy to a reasonable success.
Finally, the Bees Algorithm is revised to have the capability of locating all of the global optimum as well as the substantial local peaks in a single run. These multi-solutions of comparable fitness offers some alternatives for the decision makers to choose from. The patches are formed only if the bees are the fittest from different peaks by using a hill-valley mechanism in this so called Extended Bees Algorithm (EBA). This permits the maintenance of diversified solutions throughout the search process in addition to minimising the chances of getting trap. This version is proven beneficial when tested with numerous multimodal optimisation problems
Biodiversity assessment of marine benthic communities with COI metabarcoding: methods and applications
[eng] Ecosystem biomonitoring is crucial for proper management of natural communities during the Anthropocene era. With the advent of new sequencing technologies, DNA metabarcoding has been proposed as a game-changing tool for biomonitoring. In this Thesis we plead for the use of metabarcoding of a highly variable marker to infer not only the interspecies but also the intraspecies variability to assess both biogeographic, at the species level, and metaphylogeographic patterns, at the haplotype level. We focused on highly complex hard-substratum benthic littoral communities. The term "Metaphylogeography", coined in this Thesis, refers to the study of phylogeographic patterns of many species at the same time using metabarcoding data. However, as of the start of this Thesis, only a few studies had tested the metabarcoding method to directly characterize the whole eukaryotic community in highly diverse benthic ecosystems. This required to set up and calibrate methods for these communities as a prior step.
We first evaluated both the sampling methods and the bioinformatic pipelines. We assessed the viability of detecting the environmental DNA released from the benthic community into the adjacent water layer using metabarcoding of COI with highly degenerated primers targeting the whole eukaryotic community. We sampled water from 0 to 20m from shallow rocky benthic communities and compared the DNA signal with the results obtained from metabarcoding directly the benthic communities by traditional quadrat sampling. We also designed a pipeline combining clustering and denoising methods to treat metabarcoding data of COI. We considered the entropy of each codon position of this coding fragment both to improve the detection of spurious sequences and to calibrate the best performing parameters of the software used. In addition, we created our own denoising program, DnoisE, to incorporate information on the codon position. This new code and parameter calibration were required as the commonly used bioinformatic pipelines had been designed and tested mostly for less variable ribosomal
fragments and, particularly, in prokaryotes.
Results showed that the DNA signal from the benthos decreased with the distance but was too weak for a correct assessment of benthic biodiversity. The proportion of eukaryotic DNA sequenced was also very low in water samples due to the amplification of prokaryotic DNA. We thus concluded that the benthos must be sampled directly to properly assess its biodiversity composition. The new bioinformatic developments allowed us to propose new methods for processing metabarcoding reads, combining clustering and denoising steps, and to set optimal values for the parameters used at each step. These contributions effectively expanded the field to the novel analysis of inter- and intraspecies genetic variability with metabarcoding data.
Finally, we applied this methodology to 12 localities of the Western Iberian Coast along two well studied fronts, the Almeria-Oran Front (AOF) and the Ibiza Channel (IC). We analysed the species and haplotypes using the COI barcode. From a biogeographical perspective, the AOF had a strong effect in separating regions, while IC effect was less marked, but still half of the MOTUs were found in only one side of this divide. For the metaphylogeographic analysis, only 10% of the MOTUs could be used. However, they showed a good separation between populations of the three regions with a strong effect of the AOF break. The IC, on the other hand, seemed to be more a transitional zone than a fixed break.
This Thesis laid the ground for the efficient use of metabarcoding in the biomonitoring of benthic reef habitats, allowing community composition, β-diversity, and biogeographic patterns to be analysed in a fast, repeatable, and cost-efficient way. We also developed the metaphylogeography approach as a new tool to assess population genetic structure at the community-wide level
Applications
Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine for risk modelling, diagnosis, and treatment selection for diseases in electronics, steel production and milling for quality control during manufacturing processes in traffic, logistics for smart cities and for mobile communications
Applications
Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine for risk modelling, diagnosis, and treatment selection for diseases in electronics, steel production and milling for quality control during manufacturing processes in traffic, logistics for smart cities and for mobile communications
Reconstrução e classificação de sequências de ADN desconhecidas
The continuous advances in DNA sequencing technologies and techniques
in metagenomics require reliable reconstruction and accurate classification
methodologies for the diversity increase of the natural repository while contributing
to the organisms' description and organization. However, after
sequencing and de-novo assembly, one of the highest complex challenges
comes from the DNA sequences that do not match or resemble any biological
sequence from the literature. Three main reasons contribute to this
exception: the organism sequence presents high divergence according to the
known organisms from the literature, an irregularity has been created in the
reconstruction process, or a new organism has been sequenced. The inability
to efficiently classify these unknown sequences increases the sample
constitution's uncertainty and becomes a wasted opportunity to discover
new species since they are often discarded.
In this context, the main objective of this thesis is the development and
validation of a tool that provides an efficient computational solution to
solve these three challenges based on an ensemble of experts, namely
compression-based predictors, the distribution of sequence content, and
normalized sequence lengths. The method uses both DNA and amino acid
sequences and provides efficient classification beyond standard referential
comparisons. Unusually, it classifies DNA sequences without resorting directly
to the reference genomes but rather to features that the species biological
sequences share. Specifically, it only makes use of features extracted
individually from each genome without using sequence comparisons.
RFSC was then created as a machine learning classification pipeline that
relies on an ensemble of experts to provide efficient classification in metagenomic
contexts. This pipeline was tested in synthetic and real data, both
achieving precise and accurate results that, at the time of the development
of this thesis, have not been reported in the state-of-the-art. Specifically, it
has achieved an accuracy of approximately 97% in the domain/type classification.Os contÃnuos avanços em tecnologias de sequenciação de ADN e técnicas
em meta genómica requerem metodologias de reconstrução confiáveis e de
classificação precisas para o aumento da diversidade do repositório natural,
contribuindo, entretanto, para a descrição e organização dos organismos.
No entanto, após a sequenciação e a montagem de-novo, um dos desafios
mais complexos advém das sequências de ADN que não correspondem ou se
assemelham a qualquer sequencia biológica da literatura. São três as principais
razões que contribuem para essa exceção: uma irregularidade emergiu
no processo de reconstrução, a sequência do organismo é altamente dissimilar
dos organismos da literatura, ou um novo e diferente organismo foi
reconstruÃdo. A incapacidade de classificar com eficiência essas sequências
desconhecidas aumenta a incerteza da constituição da amostra e desperdiça
a oportunidade de descobrir novas espécies, uma vez que muitas vezes são
descartadas.
Neste contexto, o principal objetivo desta tese é fornecer uma solução computacional
eficiente para resolver este desafio com base em um conjunto
de especialistas, nomeadamente preditores baseados em compressão, a distribuição de conteúdo de sequência e comprimentos de sequência normalizados.
O método usa sequências de ADN e de aminoácidos e fornece classificação eficiente além das comparações referenciais padrão. Excecionalmente,
ele classifica as sequências de ADN sem recorrer diretamente a genomas
de referência, mas sim à s caracterÃsticas que as sequências biológicas da
espécie compartilham. Especificamente, ele usa apenas recursos extraÃdos
individualmente de cada genoma sem usar comparações de sequência. Além
disso, o pipeline é totalmente automático e permite a reconstrução sem referência de genomas a partir de reads FASTQ com a garantia adicional de
armazenamento seguro de informações sensÃveis.
O RFSC é então um pipeline de classificação de aprendizagem automática
que se baseia em um conjunto de especialistas para fornecer classificação
eficiente em contextos meta genómicos. Este pipeline foi aplicado em dados
sintéticos e reais, alcançando em ambos resultados precisos e exatos que,
no momento do desenvolvimento desta dissertação, não foram relatados na
literatura. Especificamente, esta ferramenta desenvolvida, alcançou uma
precisão de aproximadamente 97% na classificação de domÃnio/tipo.Mestrado em Engenharia de Computadores e Telemátic
- …