
    Big Data processing with Hadoop

    Collana seminari interni 2012, Number 20120418. In this seminar, we explore the Hadoop MapReduce framework and its use to solve certain types of Big Data problems. These problems, characterized by their large data set sizes, are becoming more commonplace as data acquisition rates increase in many fields of study and business, luring people by the prospects of increased analysis sensitivity. However, by definition Big Data problems are not tractable when using commonly available software and computing systems, such as the desktop workstation. As a result, they require specialized solutions that are designed to handle large quantities of data and scale across large, possibly cheap, computing infrastructure. Hadoop provides relatively low cost access to such solutions by implementing distributed computation and robustness as integral features that, therefore, do not have to be reimplemented by the application developer. Moreover, in addition to its native Java API, it also provides a high-level Python API developed right here at CRS4. As a concrete example of a Big Data solution, we briefly look at the Seal suite of distributed tools for processing high-throughput DNA sequencing data, currently used by the CRS4 Sequencing and Genotyping Platform. Finally, we discuss how Hadoop may be applied to your own Big Data problems.
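To make the programming model concrete, here is a minimal sketch of MapReduce in plain Python: it mimics Hadoop's map, shuffle, and reduce phases with the classic word-count example. This is an illustration only — it does not use Hadoop's actual Java or Python APIs, which differ in detail.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group all values by key, as Hadoop does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# The canonical word-count job: emit (word, 1) per word, then sum per word.
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return sum(counts)

lines = ["big data big problems", "hadoop handles big data"]
counts = reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer)
print(counts["big"])  # 3
```

The key property Hadoop adds on top of this model is that the map and reduce calls run in parallel across a cluster, with the framework handling data distribution and fault tolerance.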

    MMsPred: a bioactivity and toxicology predictive system

    In the last decade, the development and use of new methods in combinatorial chemistry and high-throughput screening has dramatically increased the number of known biologically active compounds. Paradoxically, the number of drugs reaching the market has not followed the same trend, often because many of the candidate drugs present poor qualities in absorption, distribution, metabolism, excretion, and toxicological properties (ADME-Tox). The ability to recognize and discard bad candidates early in the drug discovery steps would save lost investments in time and money. Machine learning techniques could provide solutions to this problem.
The goal of my research is to develop classifiers that accurately discriminate between active and inactive molecules for a specific target. To this end, I am comparing the effectiveness of different machine learning techniques applied to this problem. As a source of data we have selected a set of PubChem's public BioAssays. In addition, with the objective of realizing a real-time query service with our predictors, we aim to keep the features describing the chemical compounds relatively simple.
At the end of this process, we should better understand how to build statistical models that can recognize molecules active in a specific bioassay, including how to select the most appropriate classification technique and how to describe compounds in a way that is not excessively resource-consuming to generate, yet contains sufficient information for classification. We see immediate applications of such technology in recognizing compounds with a high risk of toxicity, and in suggesting likely metabolic pathways that would process them.
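The abstract does not commit to a particular technique, so purely as an illustration of the classification task, the following sketch trains a nearest-centroid classifier on invented binary fingerprints. All compound data here is hypothetical; real work would use richer descriptors and stronger learners.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

class NearestCentroidClassifier:
    """Label a compound active/inactive by its distance to each class centroid."""
    def fit(self, fingerprints, labels):
        self.centroids = {
            label: centroid([f for f, l in zip(fingerprints, labels) if l == label])
            for label in set(labels)
        }
        return self

    def predict(self, fingerprint):
        return min(self.centroids,
                   key=lambda lbl: euclidean(self.centroids[lbl], fingerprint))

# Hypothetical 4-bit structural fingerprints (NOT real assay data).
train = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
labels = ["active", "active", "inactive", "inactive"]
clf = NearestCentroidClassifier().fit(train, labels)
print(clf.predict([1, 1, 1, 0]))  # active
```

The deliberately simple bit-vector features mirror the stated goal of keeping compound descriptors cheap enough to support a real-time query service.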

    Large-scale computing: from the analysis of genetic data to the analysis of the web

    2011-09-23, Parco di Monteclaro, Cagliari. La notte dei ricercatori.

    The Seal suite of distributed software for high-throughput sequencing

    pp. 23-23. Published.

    The support of geographic information systems in modeling regional transport systems: the collaboration between CRiMM and CRS4 (September - December 1998)

    This report summarizes the activities carried out at CRS4 as part of the collaboration with the Centro di Ricerca Modelli Mobilità (CRiMM) of the University of Cagliari for the preparatory study for the Piano Pluriennale di Protezione Civile Regionale (regional multi-year civil protection plan).

    Unlocking Large-Scale Genomics

    The dramatic progress in DNA sequencing technology over the last decade, with the revolutionary introduction of next-generation sequencing, has brought with it opportunities and difficulties. Indeed, the opportunity to study the genomes of any species at an unprecedented level of detail has come accompanied by the difficulty of scaling analysis to handle the tremendous data generation rates of the sequencing machinery, and of scaling operational procedures to handle the increasing sample sizes in ever larger sequencing studies. This dissertation presents work that strives to address both these problems. The first contribution, inspired by the success of data-driven industry, is the Seal suite of tools, which harnesses the scalability of the Hadoop framework to accelerate the analysis of sequencing data and keep up with the sustained throughput of the sequencing machines. The second contribution, addressing the second problem, is a system developed to automate the standard analysis procedures at a typical sequencing center. Additional work is presented to make the first two contributions compatible with each other, so as to provide a complete solution for a sequencing operation and to simplify their use. Finally, the work presented here has been integrated into the production operations at the CRS4 Sequencing Lab, helping it scale its operation while reducing personnel requirements.

    Automated and traceable processing for large-scale high-throughput sequencing facilities

    Scaling up production in medium and large high-throughput sequencing facilities presents a number of challenges. As the rate of samples to process increases, manually performing and tracking the center’s operations becomes increasingly difficult, costly and error-prone, while processing the massive amounts of data poses significant computational challenges. We present our ongoing work to automate and track all data-related procedures at the CRS4 Sequencing and Genotyping Platform, while integrating state-of-the-art processing technologies such as Hadoop, OMERO, iRODS, and Galaxy into our automated workflows. Currently, the core system is in its testing phase and is on schedule to be in production use at CRS4 by May 2013. The results thus far are encouraging and the authors are confident that the CRS4 Platform will increase its efficiency and capacity thanks to this system. In the near future, the integration components will be released as open source software.
    pp. 23-24. Published.

    An Update on the Seal Hadoop-based Sequencing Processing Toolbox

    Contributed presentation to the 14th Bioinformatics Open Source Conference, Berlin, July 2013.

    Scripting for large-scale sequencing based on Hadoop

    The large volumes of data generated by modern sequencing experiments present significant challenges in their manipulation and analysis. Traditional approaches often prove difficult to scale. We describe our ongoing work on SeqPig, a tool that facilitates the use of the Pig Latin distributed scripting language to manipulate, analyze and query sequencing data, applying advances motivated by the “big data revolution” in data-intensive computing. SeqPig provides access to popular data formats and implements a number of custom sequencing-specific functions. Most importantly, it grants users access to the scalable Hadoop platform from a high-level scripting language.
    pp. 84-85. Published.
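SeqPig queries are written in Pig Latin; as a rough illustration of the filter/group/count style of query this enables, here is an equivalent in plain Python, with the corresponding Pig Latin statements as comments. The record layout and quality threshold are invented for the example and are not SeqPig's actual schema.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical aligned-read records: (read name, chromosome, mapping quality).
reads = [
    ("r1", "chr1", 60), ("r2", "chr1", 0), ("r3", "chr2", 37),
    ("r4", "chr2", 60), ("r5", "chr2", 12), ("r6", "chr1", 45),
]

# Pig Latin: good = FILTER reads BY mapq >= 30;
good = [r for r in reads if r[2] >= 30]

# Pig Latin: grouped = GROUP good BY chrom;
#            counts  = FOREACH grouped GENERATE group, COUNT(good);
good.sort(key=itemgetter(1))  # groupby needs its input sorted by the group key
per_chrom = {chrom: len(list(grp)) for chrom, grp in groupby(good, key=itemgetter(1))}
print(per_chrom)  # {'chr1': 2, 'chr2': 2}
```

On Hadoop, Pig compiles such statements into MapReduce jobs, so the same few lines scale from a laptop-sized sample to a full sequencing run.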