
90 research outputs found

Alignment-free Genomic Analysis via a Big Data Spark Platform

Author: Cattaneo Giuseppe
Giancarlo Raffaele
Palini Francesco
Ferraro Petrillo Umberto
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2021

Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Because the applications are data-intensive, the computation of AF functions is a Big Data problem, and the recent literature indicates that developing fast and scalable algorithms for computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions coming out of a recent hallmark benchmarking study. FADE's development and potential impact comprise novel aspects of interest, namely: (a) a considerable effort in the design of distributed algorithms, the most tangible result being a much faster execution time for reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. Regarding the last point, we provide a novel and much-needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.

arXiv.org e-Print Archive

Crossref

Archivio della ricerca- Università di Roma La Sapienza

Using HTML5 to Prevent Detection of Drive-by-Download Web Malware

Author: De Maio Giancarlo
De Santis Alfredo
Ferraro Petrillo Umberto
Publication venue: 'Wiley'
Publication date: 01/01/2015

The web has experienced explosive growth in recent years. New technologies are introduced at a very fast pace with the aim of narrowing the gap between web-based applications and traditional desktop applications. The results are web applications that look and feel almost like desktop applications while retaining the advantages of originating from the web. However, these advancements come at a price. The same technologies used to build responsive, pleasant and fully featured web applications can also be used to write web malware able to escape detection systems. In this article we present new obfuscation techniques, based on some of the features of the upcoming HTML5 standard, which can be used to deceive malware detection systems. The proposed techniques have been tested on a reference set of obfuscated malware. Our results show that malware rewritten using our obfuscation techniques goes undetected when analyzed by a large number of detection systems. The same detection systems were able to correctly identify the same malware in its original unobfuscated form. We also provide some hints about how existing malware detection systems can be modified in order to cope with these new techniques. Comment: This is the pre-peer-reviewed version of the article "Using HTML5 to Prevent Detection of Drive-by-Download Web Malware", which has been published in final form at http://dx.doi.org/10.1002/sec.1077. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archivin

arXiv.org e-Print Archive

Archivio della ricerca- Università di Roma La Sapienza

FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy

Author: Cattaneo Giuseppe
Ferraro Petrillo Umberto
Giancarlo Raffaele
Palini Francesco
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021

Background: Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a state of the art is problematic. Results: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, with respect to the use of generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System. Conclusions: Our methods and the corresponding software contribute substantially to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Since our approach is general, it is very likely that it can also be applied to FASTA/Q compression methods that will appear in the future.

Archivio della ricerca- Università di Roma La Sapienza

DIAMIN: a software library for the distributed analysis of large-scale molecular interaction networks

Author: Di Rocco Lorenzo
Ferraro Petrillo Umberto
Rombo Simona E
Publication venue: BMC
Publication date: 01/01/2022

Background: Huge amounts of molecular interaction data are continuously produced and stored in public databases. Although many bioinformatics tools have been proposed in the literature for their analysis, based on modeling them through different types of biological networks, several problems remain unsolved when the analysis moves to a large scale. Results: We propose DIAMIN, a high-level software library to facilitate the development of applications for the efficient analysis of large-scale molecular interaction networks. DIAMIN relies on distributed computing, and it is implemented in Java on top of the Apache Spark framework. It delivers a set of functionalities implementing different tasks on an abstract representation of very large graphs, providing built-in support for methods and algorithms commonly used to analyze these networks. DIAMIN has been tested on data retrieved from two of the most used molecular interaction databases and proved to be highly efficient and scalable. As shown by the examples provided, DIAMIN can be used by users without any distributed programming experience to perform various types of data analysis and to implement new algorithms based on its primitives. Conclusions: DIAMIN has proved successful in allowing users to solve, through its functionalities, specific biological problems that can be modeled with biological networks. The software is freely available, which will hopefully allow its rapid diffusion through the scientific community for both specific data analyses and more complex tasks.

PubMed Central

Archivio della ricerca- Università di Roma La Sapienza

Archivio istituzionale della ricerca - Università di Palermo

Rank-Similarity Measures for Comparing Gene Prioritizations: A Case Study in Autism

Author: Ferraro Petrillo Umberto
Guerra Concettina
Joshi Sarang
Lu Yinquan
Palini Francesco
Rossignac Jarek
Publication venue: 'Mary Ann Liebert Inc'
Publication date: 01/01/2020

We discuss the challenge of comparing three gene prioritization methods: network propagation, integer linear programming rank aggregation (RA), and statistical RA. These methods are based on different biological categories and estimate disease-gene association. Previously proposed comparison schemes are based on three measures of performance: receiver operating curve, area under the curve, and median rank ratio. Although they may capture important aspects of gene prioritization performance, they may fail to capture important differences in the rankings of individual genes. We suggest that comparison schemes could be improved by also considering recently proposed measures of similarity between gene rankings. We tested this suggestion on comparison schemes for prioritizations of genes associated with autism that were obtained using brain- and tissue-specific data. Our results show the effectiveness of our measures of similarity in clustering brain regions based on their relevance to autism.

Archivio della ricerca- Università di Roma La Sapienza

Minimal Extrathyroidal Extension in Predicting 1-Year Outcomes: A Longitudinal Multicenter Study of Low-to-Intermediate-Risk Papillary Thyroid Carcinoma (ITCO#4)

Background: The role of minimal extrathyroidal extension (mETE) as a risk factor for persistent papillary thyroid carcinoma (PTC) is still debated. The aim of this study was to assess the clinical impact of mETE as a predictor of worse initial treatment response in PTC patients and to verify the impact of radioiodine therapy after surgery in patients with mETE. Methods: We reviewed all records in the Italian Thyroid Cancer Observatory (ITCO) database and selected 2237 consecutive patients with PTC who satisfied the inclusion criteria (PTC with no lymph node metastases and at least 1 year of follow-up). For each case, we considered initial surgery, histological variant of PTC, tumor diameter, recurrence risk class according to the American Thyroid Association (ATA) risk stratification system, use of radioiodine therapy, and initial therapy response, as suggested by ATA guidelines. Results: At 1-year follow-up, 1831 patients (81.8%) had an excellent response, 296 (13.2%) had an indeterminate response, 55 (2.5%) had a biochemical incomplete response, and 55 (2.5%) had a structural incomplete response. Statistical analysis suggested that mETE (odds ratio [OR] 1.16, p=0.65), tumor size >2 cm (OR 1.45, p=0.34), aggressive PTC histology (OR 0.55, p=0.15), and age at diagnosis (OR 0.90, p=0.32) were not significant risk factors for a worse initial therapy response. When evaluating the combination of mETE, tumor size, and aggressive PTC histology, the presence of mETE with a >2 cm tumor was significantly associated with a worse outcome (OR 5.27, 95% CI, p=0.014). The role of radioiodine ablation in patients with mETE was also evaluated. When considering radioiodine treatment, propensity score-based matching was performed, and no significant differences were found between treated and non-treated patients (p=0.24). 
Conclusions: This study failed to show the prognostic value of mETE in predicting initial therapy response in a large cohort of PTC patients without lymph node metastases. The study suggests that the combination of tumor diameter and mETE can be used as a reliable prognostic factor for persistence and could be easily applied in clinical practice to manage PTC patients with low-to-intermediate risk of recurrent/persistent disease.

Archivio istituzionale della Ricerca - Università degli Studi di Parma

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio della ricerca- Università di Roma La Sapienza

Using the audio of 8-bit video games to monitor web marketing campaigns

Author: Ferraro Petrillo Umberto
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Monitoring the performance of a web marketing campaign is usually a long-lasting, low-effort but distracting task, where a user repeatedly glances at some sort of visual analytics tool to check whether the campaign is going well. In this paper, we explore an alternative approach to this task, where the performance of a web marketing campaign is monitored through sonification, using the soundset of popular 8-bit arcade video games. On the one hand, sonification would allow a user to be constantly informed about the current state of the campaign without being distracted. On the other hand, the sound metaphors coming from popular 8-bit arcade video games would be able to convey information about the status of the campaign in a simple and effective way (i.e., if the sonification of a campaign resembles the audio of a successful game session, then the campaign is going well). We investigated this idea by developing a prototype system for the sonification of web server activity through a configurable set of sound metaphors. We then analyzed the effectiveness of our approach by conducting a simple experimental study. This was done, first, by sonifying the progress of a given web marketing campaign using the soundset of two popular 8-bit video games: Super Mario Bros and Bubble Bobble. The resulting soundtrack was then used in a controlled setting to assess the performance of a group of 20 participants listening to our soundtrack under different work conditions.

Archivio della ricerca- Università di Roma La Sapienza

Reliable accounting in grids

Author: Catuogno Luigi
Faruolo Pompeo
Ferraro Petrillo Umberto
Visconti Ivan
Publication venue: 'Inderscience Publishers'
Publication date: 01/01/2013

Grid computing is a distributed environment in which a remote service is provided by a resource owner to a client by means of a grid infrastructure. One of the major expectations for grid computing is the rise of a market where users pay to access the computational and storage capacity offered by a resource owner. In this scenario, all the steps of the economic transaction related to the fulfilment of a service are accomplished with the mediation of the grid infrastructure. Several economic models have been proposed for determining how to charge for the services offered through a grid. In this paper, we outline one important security issue that may arise in models where services are priced according to the amount of resources they consume. Our contribution is to propose a new security model where secure grid transactions are possible even when resource owners and clients are corrupted. Copyright © 2013 Inderscience Enterprises Ltd.

Archivio della ricerca- Università di Roma La Sapienza