    Many-Task Computing and Blue Waters

    This report discusses many-task computing (MTC) generically and in the context of the proposed Blue Waters system, which is planned to be the largest NSF-funded supercomputer when it begins production use in 2012. The aim of this report is to inform the Blue Waters (BW) project about MTC, including understanding aspects of MTC applications that can be used to characterize the domain and understanding the implications of these aspects for middleware and policies. Many MTC applications do not neatly fit the stereotypes of high-performance computing (HPC) or high-throughput computing (HTC) applications. Like HTC applications, MTC applications are by definition structured as graphs of discrete tasks, with explicit input and output dependencies forming the graph edges. However, MTC applications have significant features that distinguish them from typical HTC applications. In particular, different engineering constraints for hardware and software must be met in order to support these applications. HTC applications have traditionally run on platforms such as grids and clusters, through either workflow systems or parallel programming systems. MTC applications, in contrast, will often demand a short time to solution, may be communication intensive or data intensive, and may comprise very short tasks. Therefore, hardware and software for MTC must be engineered to support the additional communication and I/O and must minimize task dispatch overheads. The hardware of large-scale HPC systems, with its high degree of parallelism and support for intensive communication, is well suited for MTC applications. However, HPC systems often lack a dynamic resource-provisioning feature, are not ideal for task communication via the file system, and have an I/O system that is not optimized for MTC-style applications. Hence, additional software support is likely to be required to gain full benefit from the HPC hardware.
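
The abstract above describes MTC applications as graphs of discrete tasks whose edges are explicit input and output dependencies. As a rough illustration only (not taken from the report), the sketch below runs such a graph serially in dependency order using Python's standard `graphlib` module. The task names (`load`, `square`, `total`) and the helper `run_task_graph` are hypothetical, and a real MTC dispatcher would run independent tasks in parallel and keep per-task dispatch overhead low.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_task_graph(tasks, deps):
    """Run every task after the tasks it depends on.

    tasks: maps a task name to a callable that receives a dict
           {dependency name: dependency output}.
    deps:  maps a task name to the set of task names it depends on.
    Both structures are hypothetical, chosen only for this sketch.
    """
    # TopologicalSorter expects {node: set of predecessors}.
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    results = {}
    for name in sorter.static_order():  # predecessors always come first
        inputs = {d: results[d] for d in deps.get(name, ())}
        results[name] = tasks[name](inputs)
    return results


# A three-task chain: load -> square -> total.
tasks = {
    "load": lambda inp: [1, 2, 3],
    "square": lambda inp: [x * x for x in inp["load"]],
    "total": lambda inp: sum(inp["square"]),
}
deps = {"square": {"load"}, "total": {"square"}}

results = run_task_graph(tasks, deps)
print(results["total"])  # prints 14
```

Serial execution keeps the example short; the dependency structure is the part that corresponds to the description in the abstract.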

    Scientific data mining, integration, and visualization

    This report summarises the workshop on Scientific Data Mining, Integration and Visualization (SDMIV) held at the e-Science Institute (eSI), Edinburgh, on 24-25 October 2002, and presents a set of recommendations arising from the discussion that took place there. The aims of the workshop were threefold: (A) to inform researchers in the SDMIV communities of the infrastructural advances being made by computing initiatives, such as the Grid; (B) to feed back requirements from the SDMIV areas to those developing the computational infrastructure; and (C) to foster interaction among all these communities, since the coordinated efforts of all of them will be required to realise the potential for scientific knowledge extraction offered by e-science initiatives worldwide.

    Development of tools for the simulation of nanometric transistors using advanced computational architectures

    The aim of this thesis project is the study of nanoscale semiconductor devices, including new options based on new architectures and designs. To this end, multidimensional simulation tools based on Monte Carlo models will be developed, including quantum corrections obtained by solving the Schrödinger equation in the direction transverse to the propagation of carriers within the device. So far, our research group has developed several semiconductor device simulators using various simulation techniques. This work is carried out in collaboration with several national and international groups; in particular, the group maintains collaborations with the universities of Glasgow, Swansea and Granada, which gave rise to this thesis project. The ultimate goal is to use the simulator to study various optimized electronic devices, especially conventional, SOI-based and multigate silicon devices, at sizes under 10 nm.

    Real-time pathogen surveillance systems using DNA sequencing

    Microbiological research has uncovered the basis of fermentation, infectious disease, vaccination and antibiotics. Now, a technological revolution leveraging DNA, the code of life, has allowed us to unravel cellular and evolutionary processes in exquisite detail. Today our need for innovation is still great. The modern world is a challenging environment: over-population, climate change and highly mobile populations create a high risk of pandemic disease, especially from viruses, and many bacteria are now resistant to our life-saving antibiotic drugs due to overuse. In hospitals, the spread of pathogens can be rapid and life threatening. Whole-genome sequencing has the power to identify the source of infections and determine whether clusters of cases belong to an outbreak. Portable, real-time nanopore sequencing enables sequencing to be performed near the patient, even in resource-limited settings. Integrating with existing datasets allows digital surveillance that is able to detect outbreaks earlier, while they can still be contained. Early demonstrations of the power of whole-genome sequencing for outbreak surveillance have made it an area of intense interest, and further development in laboratory methods and infrastructure will make it an important tool that can be deployed in response to future outbreaks.

    Conception et analyse des biopuces à ADN en environnements parallèles et distribués

    Microorganisms represent the largest diversity of living beings. They play a crucial role in all biological processes, owing to their huge metabolic potential and their capacity to adapt to different ecological niches. The development of new genomic approaches allows a better knowledge of the microbial communities involved in the functioning of complex environments. In this context, DNA microarrays are high-throughput tools able to study the presence or the expression levels of several thousand genes, combining qualitative and quantitative aspects in a single experiment. However, the design and analysis of DNA microarrays, with their current high-density formats as well as the huge amount of data to process, are complex but crucial steps. To improve the quality and performance of these two steps, we have proposed new bioinformatics approaches for the design and analysis of DNA microarrays in parallel and distributed environments. These multipurpose approaches use high performance computing (HPC) and new software engineering approaches, especially model driven engineering (MDE), to overcome the current limitations. We first developed PhylGrid 2.0, a new distributed approach for the large-scale selection of explorative probes for phylogenetic DNA microarrays using computing grids. This software was used to build PhylOPDb, a comprehensive 16S rRNA oligonucleotide probe database for prokaryotic identification. MetaExploArrays, parallel software for oligonucleotide probe selection on different computing architectures (a PC, a multiprocessor, a cluster or a computing grid) using meta-programming and a model driven engineering approach, was then developed to give users flexibility according to their computing resources. We then developed PhylInterpret, new software for the analysis of hybridization results of DNA microarrays. PhylInterpret uses the concepts of propositional logic to determine the prokaryotic composition of metagenomic samples. Finally, a new parallelization method based on model driven engineering (MDE) has been proposed to compute a complete backtranslation of short peptides in order to select probes for functional microarrays.

    Data Enrichment for Data Mining Applied to Bioinformatics and Cheminformatics Domains

    Increasingly complex problems are being addressed in the life sciences. Acquiring all the data that may be related to the problem in question is paramount. Equally important is knowing how the data items are related to one another and to the problem itself. On the other hand, there are large amounts of data and information available on the Web. Researchers are already using Data Mining and Machine Learning as valuable tools in their research, although the usual procedure is to look for the information based on inductive models. So far, despite the great successes already achieved using Data Mining and Machine Learning, it is not easy to integrate this vast amount of available information into the inductive process with propositional algorithms. Our main motivation is to address the problem of integrating domain information into the inductive process of propositional Data Mining and Machine Learning techniques by enriching the training data to be used in inductive logic programming (ILP) systems. Propositional machine learning algorithms are very dependent on data attributes. It is still hard to identify which attributes are most suitable for a particular research task. It is also hard to extract relevant information from the enormous quantity of data available. We will gather the available data and derive features that ILP algorithms can use to induce descriptions that solve the problems. We are creating a web platform to obtain relevant information for Bioinformatics (particularly Genomics) and Cheminformatics problems. It fetches the data from public repositories of genomic, protein and chemical data. After the data enrichment, Prolog systems use inductive logic programming to induce rules and solve specific Bioinformatics and Cheminformatics case studies. To assess the impact of the data enrichment with ILP, we compare against the results obtained by solving the same cases using propositional algorithms.

    Sparse machine learning models in bioinformatics

    The meaning of parsimony is twofold in machine learning: the structure of a model, its parameters, or both can be sparse. Sparse models have many strengths. First, sparsity is an important regularization principle to reduce model complexity and therefore avoid overfitting. Second, in many fields, for example bioinformatics, high-dimensional data may be generated by a very small number of hidden factors, so it is more reasonable to use a proper sparse model than a dense model. Third, a sparse model is often easy to interpret. In this dissertation, we investigate sparse machine learning models and their applications in high-dimensional biological data analysis. We focus our research on five types of sparse models as follows. First, sparse representation is a parsimonious principle under which a sample can be approximated by a sparse linear combination of basis vectors. We explore existing sparse representation models and propose our own sparse representation methods for high-dimensional biological data analysis. We derive different sparse representation models from a Bayesian perspective. Two generic dictionary learning frameworks are proposed. Also, kernel and supervised dictionary learning approaches are devised. Furthermore, we propose fast active-set and decomposition methods for the optimization of sparse coding models. Second, gene-sample-time data are promising in clinical study, but challenging in computation. We propose sparse tensor decomposition methods and kernel methods for the dimensionality reduction and classification of such data. As extensions of matrix factorization, tensor decomposition techniques can reduce the dimensionality of the gene-sample-time data dramatically, and the kernel methods can run very efficiently on such data. Third, we explore two sparse regularized linear models for multi-class problems in bioinformatics. Our first method is called the nearest-border classification technique for data with many classes. 
Our second method is a hierarchical model. It can simultaneously select features and classify samples. Our experiment, on breast tumor subtyping, shows that this model outperforms the one-versus-all strategy in some cases. Fourth, we propose to use spectral clustering approaches for clustering microarray time-series data. The approaches are based on two transformations that have been recently introduced, especially for gene expression time-series data, namely, alignment-based and variation-based transformations. Both transformations have been devised in order to take into account temporal relationships in the data, and have been shown to increase the ability of a clustering method to detect co-expressed genes. We investigate the performance of these transformation methods when combined with spectral clustering on two microarray time-series datasets, and discuss their strengths and weaknesses. Our experiments on two well-known real-life datasets show the superiority of the alignment-based over the variation-based transformation for finding meaningful groups of co-expressed genes. Fifth, we propose the max-min high-order dynamic Bayesian network (MMHO-DBN) learning algorithm, in order to reconstruct time-delayed gene regulatory networks. Due to the small sample size of the training data and the power-law nature of gene regulatory networks, the structure of the network is restricted by sparsity. We also apply qualitative probabilistic networks (QPNs) to interpret the interactions learned. Our experiments on both synthetic and real gene expression time-series data show that MMHO-DBN can obtain better precision than some existing methods and runs very fast. The QPN analysis can accurately predict types of influences and synergies. Additionally, since many high-dimensional biological data are subject to missing values, we survey various strategies for learning models from incomplete data. 
We extend the existing imputation methods, originally for two-way data, to methods for gene-sample-time data. We also propose a pair-wise weighting method for computing kernel matrices from incomplete data. Computational evaluations show that both approaches work very robustly
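
The abstract above states that a sample can be approximated by a sparse linear combination of basis vectors. As a hedged illustration, the sketch below solves the standard l1-regularised sparse coding problem, minimising 0.5 * ||x - D a||^2 + lam * ||a||_1 over the coefficients a, with the textbook ISTA iteration (iterative soft-thresholding). ISTA is swapped in here for brevity; it is not the active-set or decomposition method proposed in the dissertation, and the names `ista`, `soft_threshold`, `D`, `x` and `lam` are assumptions made for this example.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink each entry of v towards zero by t (proximal map of t * ||.||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(D, x, lam, n_iter=200):
    """Approximately minimise 0.5 * ||x - D @ a||^2 + lam * ||a||_1 over a.

    D: dictionary whose columns are the basis vectors, shape (n_features, n_atoms).
    x: the sample to approximate, shape (n_features,).
    """
    # Lipschitz constant of the gradient of the smooth term:
    # the squared spectral norm of D.
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)  # gradient of 0.5 * ||x - D a||^2
        a = soft_threshold(a - grad / L, lam / L)
    return a


# With an identity dictionary the minimiser is simply soft_threshold(x, lam),
# so the small middle entry is driven exactly to zero.
D = np.eye(3)
x = np.array([3.0, 0.1, -2.0])
a = ista(D, x, lam=0.5)
print(a)  # approximately [2.5, 0.0, -1.5]
```

The identity dictionary is used only because its solution can be checked by hand; with an overcomplete dictionary the same loop applies unchanged.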

    Boreal Bird Ecology, Management and Conservation

    Northern forested landscapes are important habitats for many boreal birds. This Special Issue portrays the current state of knowledge on boreal bird diversity, ecology, management, and conservation. Humans have diverse impacts on boreal habitats worldwide, and knowledge of the avian community associated with these northern forests is key to conservation measures

    Genetic programming and cellular automata for fast flood modelling on multi-core CPU and many-core GPU computers

    Many complex systems in nature are governed by simple local interactions, although a number are also described by global interactions. For example, within the field of hydraulics the Navier-Stokes equations describe free-surface water flow, through means of the global preservation of water volume, momentum and energy. However, solving such partial differential equations (PDEs) is computationally expensive when applied to large 2D flow problems. An alternative which reduces the computational complexity, is to use a local derivative to approximate the PDEs, such as finite difference methods, or Cellular Automata (CA). The high speed processing of such simulations is important to modern scientific investigation especially within urban flood modelling, as urban expansion continues to increase the number of impervious areas that need to be modelled. Large numbers of model runs or large spatial or temporal resolution simulations are required in order to investigate, for example, climate change, early warning systems, and sewer design optimisation. The recent introduction of the Graphics Processor Unit (GPU) as a general purpose computing device (General Purpose Graphical Processor Unit, GPGPU) allows this hardware to be used for the accelerated processing of such locally driven simulations. A novel CA transformation for use with GPUs is proposed here to make maximum use of the GPU hardware. CA models are defined by the local state transition rules, which are used in every cell in parallel, and provide an excellent platform for a comparative study of possible alternative state transition rules. Writing local state transition rules for CA systems is a difficult task for humans due to the number and complexity of possible interactions, and is known as the ‘inverse problem’ for CA. Therefore, the use of Genetic Programming (GP) algorithms for the automatic development of state transition rules from example data is also investigated in this thesis. 
GP is investigated as it is capable of searching the intractably large space of possible state transition rules and producing near-optimal solutions. However, such population-based optimisation algorithms are limited by the cost of many repeated evaluations of the fitness function, which in this case requires the comparison of a CA simulation to given target data. Therefore, the use of GPGPU hardware for the accelerated learning of local rules is also developed. Speed-up factors of up to 50 times over serial Central Processing Unit (CPU) processing are achieved on simple CA, and speed-ups of 5-10 times over the fully parallel CPU are achieved for the learning of urban flood modelling rules. Furthermore, it is shown that GP can generate rules which perform competitively when compared with human-formulated rules. This is achieved with generalisation to unseen terrains using similar input conditions and different spatial/temporal resolutions in this important application domain.
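
The abstract above describes CA models as local state transition rules applied in every cell in parallel. The sketch below illustrates that idea with a deliberately simple, hypothetical water-sharing rule on a 2D grid, written with NumPy on the CPU. It is not the rule set, the GPU transformation, or the GP-evolved rules from the thesis; the function name `ca_step`, the `rate` parameter and the outflow limit of a quarter of a cell's depth per neighbour are assumptions chosen so that depths stay non-negative and total volume is conserved.

```python
import numpy as np


def ca_step(z, h, rate=0.2):
    """Apply one synchronous update of a toy flood CA.

    z: terrain elevation per cell (2D array).
    h: water depth per cell (2D array, non-negative).
    Every cell sends water to each lower von Neumann neighbour in proportion
    to the difference in water surface level, capped at a quarter of its own
    depth per neighbour. The grid edges are closed (no flow leaves the grid).
    """
    level = z + h          # water surface level, computed from the old state
    new_h = h.copy()
    for axis in (0, 1):
        # a and b select all pairs of cells adjacent along this axis.
        a = [slice(None), slice(None)]
        b = [slice(None), slice(None)]
        a[axis] = slice(None, -1)
        b[axis] = slice(1, None)
        a, b = tuple(a), tuple(b)
        diff = level[a] - level[b]
        flow_ab = np.minimum(rate * np.maximum(diff, 0.0), h[a] / 4)
        flow_ba = np.minimum(rate * np.maximum(-diff, 0.0), h[b] / 4)
        net = flow_ab - flow_ba
        new_h[a] -= net    # whatever leaves one cell ...
        new_h[b] += net    # ... enters its neighbour, conserving volume
    return new_h


# Flat terrain with one unit of water in the centre cell.
z = np.zeros((5, 5))
h = np.zeros((5, 5))
h[2, 2] = 1.0
h1 = ca_step(z, h)
print(h1[2, 2], h1[2, 1], h1.sum())
```

Because every flow is computed from the old state before any cell is modified, the update is synchronous, which is the property that makes such rules straightforward to evaluate for all cells in parallel.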

    A complex systems approach to education in Switzerland

    The insights gained from the study of complex systems in biological, social, and engineered systems enable us not only to observe and understand, but also to actively design systems which will be capable of successfully coping with complex and dynamically changing situations. The methods and mindset required for this approach have been applied to educational systems with their diverse levels of scale and complexity. Based on the general case made by Yaneer Bar-Yam, this paper applies the complex systems approach to the educational system in Switzerland. It confirms that the complex systems approach is valid. Indeed, many recommendations made for the general case have already been implemented in the Swiss education system. To address existing problems and difficulties, further steps are recommended. This paper contributes to the further establishment of the complex systems approach by shedding light on an area which concerns us all, which is a frequent topic of discussion and dispute among politicians and the public, where billions of dollars have been spent without achieving the desired results, and where it is difficult to directly derive consequences from actions taken. The analysis of the education system's different levels, their complexity and scale will clarify how such a dynamic system should be approached, and how it can be guided towards the desired performance.