
    Parallelism and partitioning in large-scale GAs using spark

    Big Data promises new scientific discovery and economic value. Genetic algorithms (GAs) have proven their flexibility in many application areas, and substantial research effort has been dedicated to improving their performance through parallelisation. In contrast with most previous efforts, we reject approaches based on centralising data in the main memory of a single node or requiring remote access to shared/distributed memory. We focus instead on scenarios where data is partitioned across machines. In this partitioned scenario, we explore two parallelisation models: PDMS, inspired by the traditional master-slave model, and PDMD, based on island models; we compare their performance on large-scale classification problems. We implement two distributed versions of BioHEL, a popular large-scale single-node GA classifier, using the Spark distributed data processing platform. In contrast to existing GAs based on MapReduce, Spark allows a more efficient implementation of parallel GAs thanks to its simple, efficient iterative processing of partitioned datasets. We study the accuracy, efficiency and scalability of the proposed models. Our results show that PDMS provides the same accuracy as traditional BioHEL and exhibits good scalability up to 64 cores, while PDMD substantially reduces execution time at a minor loss of accuracy.
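The two models can be illustrated with a toy, framework-free sketch (plain Python stands in for Spark, and the threshold-rule individuals, `local_accuracy`, and all constants are invented for illustration): PDMS keeps one central population and distributes only fitness evaluation across data partitions, while PDMD evolves an independent island per partition with occasional migration.

```python
import random

random.seed(0)

# Toy partitioned dataset: each "partition" holds (x, label) pairs,
# mimicking data split across worker nodes.
def make_partitions(n_parts=4, n_rows=50):
    return [[(random.random(), int(random.random() > 0.5)) for _ in range(n_rows)]
            for _ in range(n_parts)]

# An individual is a single threshold t: predict 1 if x > t (purely illustrative).
def local_accuracy(ind, part):
    return sum((x > ind) == bool(y) for x, y in part) / len(part)

# PDMS-style step: one global population; each partition computes a partial
# fitness (a map), and the results are averaged (a reduce). The GA itself
# stays centralised; only evaluation is distributed.
def pdms_fitness(pop, parts):
    return [sum(local_accuracy(ind, p) for p in parts) / len(parts) for ind in pop]

# PDMD-style step: each partition evolves its own island against local data
# only; afterwards the best individuals migrate around a ring of islands.
def pdmd_step(islands, parts):
    for i, (pop, part) in enumerate(zip(islands, parts)):
        pop.sort(key=lambda ind: -local_accuracy(ind, part))
        # keep the two best, refill the rest with mutated copies of them
        islands[i] = pop[:2] + [
            min(1.0, max(0.0, random.choice(pop[:2]) + random.gauss(0, 0.1)))
            for _ in range(len(pop) - 2)
        ]
    best = [pop[0] for pop in islands]
    for i, pop in enumerate(islands):
        pop[-1] = best[(i - 1) % len(islands)]  # ring migration
    return islands
```

The sketch mirrors the trade-off reported in the abstract: PDMS evaluates every individual against all partitions (global fitness, more communication), while PDMD touches only local data per step (faster, but each island sees a biased sample).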

    Quantifying evolutionary constraints on B cell affinity maturation

    The antibody repertoire of each individual is continuously updated by the evolutionary process of B cell receptor mutation and selection. It has recently become possible to gain detailed information concerning this process through high-throughput sequencing. Here, we develop modern statistical molecular evolution methods for the analysis of B cell sequence data, and then apply them to a very deep short-read data set of B cell receptors. We find that the substitution process is conserved across individuals but varies significantly across gene segments. We investigate selection on B cell receptors using a novel method that side-steps the difficulties encountered by previous work in differentiating between selection and motif-driven mutation; this is done through stochastic mapping and empirical Bayes estimators that compare the evolution of in-frame and out-of-frame rearrangements. We use this new method to derive a per-residue map of selection, which provides a more nuanced view of the constraints on framework and variable regions. Comment: previously entitled "Substitution and site-specific selection driving B cell affinity maturation is consistent across individuals".

    A hybrid kidney algorithm strategy for combinatorial interaction testing problem

    Combinatorial Interaction Testing (CIT) generates a sampled test case set (the Final Test Suite (FTS)) instead of all possible test cases. Generating an FTS of optimum size is a computational optimization problem (COP) and is NP-hard. Recent studies have used hybrid metaheuristic algorithms as the basis for CIT strategies. However, although existing hybrid metaheuristic-based CIT strategies generate competitive FTS sizes, no single CIT strategy outperforms all others in all cases. In addition, hybrid metaheuristic-based CIT strategies require more execution time than strategies based on their original algorithms. The Kidney Algorithm (KA) is a recent metaheuristic with high efficiency and performance in solving different optimization problems compared with most state-of-the-art metaheuristic algorithms. However, KA has limitations in its exploitation and exploration processes, and its balancing control process needs improvement; these shortcomings cause KA to fall easily into local optima. This study proposes a low-level hybridization of KA with a mutation operator, together with an improved filtration process, to form the new Hybrid Kidney Algorithm (HKA). HKA addresses the limitations of KA: hybridizing KA with the mutation operator improves the algorithm's exploration and exploitation, and enhancing the filtration process improves the balancing control. HKA improves efficiency in terms of generating an optimum FTS size and performance in terms of execution time. HKA has been adopted into a CIT strategy, the HKA-based CIT Strategy (HKAS), to generate the most optimal FTS size. The results show that HKAS generates the optimum FTS size in more than 67% of the benchmarking experiments and contributes 34 new optimum FTS sizes. HKAS also has better efficiency and performance than the KA-based strategy (KAS).
    HKAS is the first hybrid metaheuristic-based CIT strategy to generate an optimum FTS size with less execution time than the original algorithm-based CIT strategy. Apart from supporting different CIT features (uniform/VS CIT, IOR CIT, and interaction strengths up to 6), this study also introduces two further variants of KA, Improved KA (IKA) and Mutation KA (MKA), along with the corresponding CIT strategies IKAS and MKAS.
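The core idea of the hybridization — move candidates toward the best solution, perturb them with a mutation operator to escape local optima, and use a fitness-threshold "filtration" step — can be sketched as follows. This is a loose illustration of the abstract's description, not the authors' implementation: the sphere objective, all constants, and the re-absorption rule are invented for the example.

```python
import random

random.seed(1)

def sphere(x):
    # Toy minimisation objective (not from the paper).
    return sum(v * v for v in x)

def hka_sketch(dim=5, pop_size=20, iters=100, mut_rate=0.2):
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    best = min(pop, key=sphere)
    for _ in range(iters):
        mean_fit = sum(sphere(s) for s in pop) / len(pop)  # filtration rate
        new_pop = []
        for s in pop:
            # KA-style movement of a solute toward the current best
            cand = [v + random.random() * (b - v) for v, b in zip(s, best)]
            # hybrid step: per-gene mutation, the added exploration operator
            cand = [v + random.gauss(0, 0.5) if random.random() < mut_rate else v
                    for v in cand]
            # filtration: keep candidates better than the population mean,
            # otherwise replace with a fresh random solute (re-absorption)
            if sphere(cand) < mean_fit:
                new_pop.append(cand)
            else:
                new_pop.append([random.uniform(-5, 5) for _ in range(dim)])
        pop = new_pop
        best = min(pop + [best], key=sphere)
    return best
```

The mutation step is what the abstract credits with preventing premature convergence; without it, every candidate contracts toward `best` and the search stalls in whichever basin it started in.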

    The detection of high-qualified indels in exomes and their effect on cognition

    Genetic insertions/deletions (indels) have been linked to many neurodevelopmental disorders (NDDs) such as autism spectrum disorder (ASD) and intellectual disability (ID). However, although they are the second most common type of genetic variant, they remain difficult to identify and verify, presenting a high number of false positives. We sought a method that would reliably identify high-quality indels likely to be true positives. We built an indel "truth set" using indels from two diagnosis-based family cohorts, filtered according to a set of threshold values and called by several variant calling tools, and used it to train three machine learning models to identify the highest-quality indels. The two best-performing models were then used to identify high-quality indels in a general population cohort that had been called using only one variant calling technology. The models were able to identify higher-quality indels that showed an association with IQ, although the effect size was small. The indels predicted by the models also affected a much smaller number of genes per individual than those selected using fixed minimum thresholds alone. The models tend to show an overall improvement in the quality of the indels, but further work is needed to determine whether they could detect indels with a noticeable and significant effect on IQ.
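The train-a-classifier-instead-of-fixed-thresholds idea can be sketched with a minimal logistic regression over per-indel quality features. The feature names (read depth, allele balance, number of agreeing callers), the synthetic data, and the plain-SGD trainer are all assumptions for illustration, not the thesis pipeline.

```python
import math
import random

random.seed(2)

# Synthetic indel calls: true positives tend to have higher depth, a more
# balanced allele fraction, and agreement between more callers. These feature
# distributions are invented, not taken from the thesis.
def make_toy_indels(n=400):
    data = []
    for _ in range(n):
        true_pos = random.random() < 0.5
        depth = random.gauss(30 if true_pos else 12, 5)
        balance = random.gauss(0.5 if true_pos else 0.2, 0.1)
        callers = random.randint(2, 3) if true_pos else random.randint(1, 2)
        data.append(([depth, balance, float(callers)], int(true_pos)))
    return data

# Plain stochastic-gradient logistic regression (stdlib only, standing in for
# whatever model family the thesis actually used).
def train_logreg(data, lr=0.01, epochs=200):
    w = [0.0] * (len(data[0][0]) + 1)  # bias followed by one weight per feature
    for _ in range(epochs):
        for x, y in data:
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            g = p - y  # gradient of the log-loss w.r.t. z
            w[0] -= lr * g
            for i, xi in enumerate(x):
                w[i + 1] -= lr * g * xi
    return w

def predict(w, x):
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return int(z > 0)
```

The contrast with fixed-threshold filtering is that the model weighs the features jointly, so a slightly low depth can be rescued by strong caller agreement — which is how a learned filter can pass fewer, higher-confidence indels per individual.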

    Big Data Optimization con algoritmos metaheurísticos utilizando frameworks de computación distribuida

    The scientific community has found in available technological resources an ally for tackling problems of great complexity once regarded as unsolvable. Such problems have been addressed with exact or heuristic techniques in order to solve them, or at least to obtain high-quality solutions when they are classified as NP-hard. Initially these problems were posed in static environments, but in recent years there have been efforts to solve them while reproducing the dynamic, high-dimensional characteristics that alter them. The optimization of these problems, known as Big Data Optimization, can be carried out by designing sequential and distributed metaheuristic algorithms (solvers) on high-level programming frameworks such as those incorporating the MapReduce paradigm for handling Big Data. These solvers will initially be designed and tested on academic problems, with the aim of analysing their behaviour in terms of efficiency and scalability. Our central objective is then to adapt these solvers to tackle problems of interest in the real-world contexts (scientific, industrial, among others) in which we are working, specifically scheduling problems and the design of water distribution networks and of sensor networks in industrial plants. Track: Intelligent agents and systems. Red de Universidades con Carreras en Informática.

    Teadusarvutuse algoritmide taandamine hajusarvutuse raamistikele

    Scientific computing uses computers and algorithms to solve problems in various sciences such as genetics, biology and chemistry. Often the goal is to model and simulate natural phenomena that would otherwise be very difficult to study in real environments. For example, it is possible to create a model of a solar storm or a meteor strike and run computer simulations to assess the impact of the disaster on the environment. The more sophisticated and accurate the simulations are, the more computing power is required. It is often necessary to use a large number of computers, all working simultaneously on a single problem; such computations are called parallel or distributed computing. However, creating distributed computing programs is complicated and requires much more time and resources, because the work done simultaneously on different computers must be synchronized. A number of software frameworks have been created to simplify this process by automating part of the distributed programming. The goal of this research was to assess the suitability of such distributed computing frameworks for complex scientific computing algorithms. The results showed that existing frameworks differ greatly from one another and that none of them is suitable for all types of algorithms: some are suitable only for simple algorithms, while others are unsuitable when the data does not fit into memory. Choosing the most appropriate distributed computing framework for an algorithm can therefore be a very complex task, because it requires studying and applying the existing frameworks.
    While searching for a solution to this problem, it was decided to create a Dynamic Algorithms Modelling Application (DAMA), which is able to simulate the implementation of an algorithm in different distributed computing frameworks. DAMA helps to estimate which distributed framework is the most appropriate for a given algorithm, without actually implementing it in any of the available frameworks. The main contribution of this study is simplifying the adoption of distributed computing frameworks for researchers who are not yet familiar with them; it should save significant time and resources, as it is no longer necessary to study and apply each of the available frameworks in detail.