
    Big Data processing with Hadoop

    Collana seminari interni 2012, Number 20120418. In this seminar, we explore the Hadoop MapReduce framework and its use to solve certain types of Big Data problems. These problems, characterized by their large data set sizes, are becoming more commonplace as data acquisition rates increase in many fields of study and business, luring people by the prospects of increased analysis sensitivity. However, by definition Big Data problems are not tractable when using commonly available software and computing systems, such as the desktop workstation. As a result, they require specialized solutions that are designed to handle large quantities of data and scale across large, possibly cheap, computing infrastructure. Hadoop provides relatively low cost access to such solutions by implementing distributed computation and robustness as integral features that, therefore, do not have to be reimplemented by the application developer. Moreover, in addition to its native Java API, it also provides a high-level Python API developed right here at CRS4. As a concrete example of a Big Data solution, we briefly look at the Seal suite of distributed tools for processing high-throughput DNA sequencing data, currently used by the CRS4 Sequencing and Genotyping Platform. Finally, we discuss how Hadoop may be applied to your own Big Data problems.
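To make the programming model concrete, here is a minimal sketch of MapReduce in plain Python: it mimics Hadoop's map, shuffle, and reduce phases with the classic word-count example. This is an illustration only — it does not use Hadoop's actual Java or Python APIs, which differ in detail.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group all values by key, as Hadoop does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# The canonical word-count job: emit (word, 1) per word, then sum per word.
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return sum(counts)

lines = ["big data big problems", "hadoop handles big data"]
counts = reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer)
print(counts["big"])  # 3
```

The key property Hadoop adds on top of this model is that the map and reduce calls run in parallel across a cluster, with the framework handling data distribution and fault tolerance.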

    MMsPred: a bioactivity and toxicology predictive system

    In the last decade, the development and use of new methods in combinatorial chemistry and high-throughput screening has dramatically increased the number of known biologically active compounds. Paradoxically, the number of drugs reaching the market has not followed the same trend, often because many of the candidate drugs present poor qualities in absorption, distribution, metabolism, excretion, and toxicological properties (ADME-Tox). The ability to recognize and discard bad candidates early in the drug discovery steps would save lost investments in time and money. Machine learning techniques could provide solutions to this problem.
The goal of my research is to develop classifiers that accurately discriminate between active and inactive molecules for a specific target. To this end, I am comparing the effectiveness of different machine learning techniques applied to this problem. As a source of data we have selected a set of PubChem's public BioAssays. In addition, with the objective of realizing a real-time query service with our predictors, we aim to keep the features describing the chemical compounds relatively simple.
At the end of this process, we should better understand how to build statistical models that can recognize molecules active in a specific bioassay, including how to select the most appropriate classification technique and how to describe compounds in a way that is not excessively resource-consuming to generate, yet contains sufficient information for classification. We see immediate applications of such technology in recognizing compounds with a high risk of toxicity, and in suggesting likely metabolic pathways that would process them.
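The abstract does not commit to a particular technique, so purely as an illustration of the classification task, the following sketch trains a nearest-centroid classifier on invented binary fingerprints. All compound data here is hypothetical; real work would use richer descriptors and stronger learners.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

class NearestCentroidClassifier:
    """Label a compound active/inactive by its distance to each class centroid."""
    def fit(self, fingerprints, labels):
        self.centroids = {
            label: centroid([f for f, l in zip(fingerprints, labels) if l == label])
            for label in set(labels)
        }
        return self

    def predict(self, fingerprint):
        return min(self.centroids,
                   key=lambda lbl: euclidean(self.centroids[lbl], fingerprint))

# Hypothetical 4-bit structural fingerprints (NOT real assay data).
train = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
labels = ["active", "active", "inactive", "inactive"]
clf = NearestCentroidClassifier().fit(train, labels)
print(clf.predict([1, 1, 1, 0]))  # active
```

The deliberately simple bit-vector features mirror the stated goal of keeping compound descriptors cheap enough to support a real-time query service.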

    Large-scale computing: from the analysis of genetic data to the analysis of the web

    2011-09-23, Parco di Monteclaro, Cagliari. La notte dei ricercatori.

    The Seal suite of distributed software for high-throughput sequencing

    pp. 23-23. Published.

    The support of geographic information systems in modeling regional transport systems: the collaboration between CRiMM and CRS4 (September - December 1998)

    This report summarizes the activities carried out at CRS4 as part of the collaboration with the Centro di Ricerca Modelli Mobilità (CRiMM) of the University of Cagliari for the preparatory study for the Piano Pluriennale di Protezione Civile Regionale (regional multi-year civil protection plan).

    Unlocking Large-Scale Genomics

    The dramatic progress in DNA sequencing technology over the last decade, with the revolutionary introduction of next-generation sequencing, has brought with it opportunities and difficulties. Indeed, the opportunity to study the genomes of any species at an unprecedented level of detail has come accompanied by the difficulty of scaling analysis to handle the tremendous data generation rates of the sequencing machinery, and of scaling operational procedures to handle the increasing sample sizes in ever larger sequencing studies. This dissertation presents work that strives to address both these problems. The first contribution, inspired by the success of data-driven industry, is the Seal suite of tools, which harnesses the scalability of the Hadoop framework to accelerate the analysis of sequencing data and keep up with the sustained throughput of the sequencing machines. The second contribution, addressing the second problem, is a system developed to automate the standard analysis procedures at a typical sequencing center. Additional work is presented to make the first two contributions compatible with each other, so as to provide a complete solution for a sequencing operation and to simplify their use. Finally, the work presented here has been integrated into the production operations at the CRS4 Sequencing Lab, helping it scale its operation while reducing personnel requirements.

    Automated and traceable processing for large-scale high-throughput sequencing facilities

    Scaling up production in medium and large high-throughput sequencing facilities presents a number of challenges. As the rate of samples to process increases, manually performing and tracking the center’s operations becomes increasingly difficult, costly and error-prone, while processing the massive amounts of data poses significant computational challenges. We present our ongoing work to automate and track all data-related procedures at the CRS4 Sequencing and Genotyping Platform, while integrating state-of-the-art processing technologies such as Hadoop, OMERO, iRODS, and Galaxy into our automated workflows. Currently, the core system is in its testing phase and is on schedule to be in production use at CRS4 by May 2013. The results thus far are encouraging and the authors are confident that the CRS4 Platform will increase its efficiency and capacity thanks to this system. In the near future, the integration components will be released as open source software.
    pp. 23-24. Published.

    An Update on the Seal Hadoop-based Sequencing Processing Toolbox

    Contributed presentation to the 14th Bioinformatics Open Source Conference, Berlin, July 2013.

    Scripting for large-scale sequencing based on Hadoop

    The large volumes of data generated by modern sequencing experiments present significant challenges in their manipulation and analysis. Traditional approaches often prove difficult to scale. We describe our ongoing work on SeqPig, a tool that facilitates the use of the Pig Latin distributed scripting language to manipulate, analyze and query sequencing data, applying advances motivated by the “big data revolution” in data-intensive computing. SeqPig provides access to popular data formats and implements a number of custom sequencing-specific functions. Most importantly, it grants users access to the scalable Hadoop platform from a high-level scripting language.
    pp. 84-85. Published.
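SeqPig queries are written in Pig Latin; as a rough illustration of the filter/group/count style of query this enables, here is an equivalent in plain Python, with the corresponding Pig Latin statements as comments. The record layout and quality threshold are invented for the example and are not SeqPig's actual schema.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical aligned-read records: (read name, chromosome, mapping quality).
reads = [
    ("r1", "chr1", 60), ("r2", "chr1", 0), ("r3", "chr2", 37),
    ("r4", "chr2", 60), ("r5", "chr2", 12), ("r6", "chr1", 45),
]

# Pig Latin: good = FILTER reads BY mapq >= 30;
good = [r for r in reads if r[2] >= 30]

# Pig Latin: grouped = GROUP good BY chrom;
#            counts  = FOREACH grouped GENERATE group, COUNT(good);
good.sort(key=itemgetter(1))  # groupby needs its input sorted by the group key
per_chrom = {chrom: len(list(grp)) for chrom, grp in groupby(good, key=itemgetter(1))}
print(per_chrom)  # {'chr1': 2, 'chr2': 2}
```

On Hadoop, Pig compiles such statements into MapReduce jobs, so the same few lines scale from a laptop-sized sample to a full sequencing run.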