90 research outputs found

    Alignment-free Genomic Analysis via a Big Data Spark Platform

    Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Because these applications are data-intensive, computing AF functions is a Big Data problem, and the recent literature indicates that developing fast and scalable algorithms for computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions coming out of a recent hallmark benchmarking study. FADE's development and potential impact comprise novel aspects of interest, namely: (a) a considerable effort in designing distributed algorithms, the most tangible result being a much faster execution time for reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. Regarding the latter, we provide a novel and much-needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
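
To make the notion of an AF function concrete, here is a minimal single-machine sketch of one simple example, the Jaccard distance between the k-mer sets of two sequences. This is the quantity that MinHash-based tools such as MASH estimate by sketching. It is an illustration only: the function names `kmers` and `jaccard_distance` and the default `k = 3` are assumptions made for this example, and nothing here reflects FADE's actual API or its distributed implementation.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B|, where A and B are the k-mer sets of a and b."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


if __name__ == "__main__":
    # Two toy sequences that share two of their three 3-mers.
    print(jaccard_distance("ACGTA", "ACGTT", k=3))
```

No alignment is computed at any point: the two sequences are compared only through the k-mers they contain, which is what makes this family of functions amenable to distributed counting.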

    Using HTML5 to Prevent Detection of Drive-by-Download Web Malware

    The web has experienced explosive growth in recent years. New technologies are introduced at a very fast pace with the aim of narrowing the gap between web-based applications and traditional desktop applications. The results are web applications that look and feel almost like desktop applications while retaining the advantages of originating from the web. However, these advancements come at a price. The same technologies used to build responsive, pleasant and fully featured web applications can also be used to write web malware able to escape detection systems. In this article we present new obfuscation techniques, based on some of the features of the upcoming HTML5 standard, which can be used to deceive malware detection systems. The proposed techniques have been tested on a reference set of obfuscated malware. Our results show that malware rewritten using our obfuscation techniques goes undetected when analyzed by a large number of detection systems. The same detection systems were able to correctly identify the same malware in its original unobfuscated form. We also provide some hints about how existing malware detection systems can be modified in order to cope with these new techniques. Comment: This is the pre-peer reviewed version of the article "Using HTML5 to Prevent Detection of Drive-by-Download Web Malware", which has been published in final form at http://dx.doi.org/10.1002/sec.1077. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.

    FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy

    Background: Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. Because of the same abundance in data production, Big Data technologies are seen as the future for genomic data storage and processing, with MapReduce-Hadoop as the leader. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a state of the art is problematic. Results: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, with respect to the use of the generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System. Conclusions: Our methods and the corresponding software substantially contribute to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Since our approach is general, it is very likely that it can also be applied to FASTA/Q compression methods that will appear in the future.

    DIAMIN: a software library for the distributed analysis of large-scale molecular interaction networks

    Background: Huge amounts of molecular interaction data are continuously produced and stored in public databases. Although many bioinformatics tools have been proposed in the literature for their analysis, based on modeling them through different types of biological networks, several problems remain unsolved when the analysis must be carried out at a large scale. Results: We propose DIAMIN, a high-level software library to facilitate the development of applications for the efficient analysis of large-scale molecular interaction networks. DIAMIN relies on distributed computing, and it is implemented in Java on top of the Apache Spark framework. It delivers a set of functionalities implementing different tasks on an abstract representation of very large graphs, providing built-in support for methods and algorithms commonly used to analyze these networks. DIAMIN has been tested on data retrieved from two of the most used molecular interaction databases, proving to be highly efficient and scalable. As shown by the examples provided, DIAMIN can be exploited by users without any distributed programming experience in order to perform various types of data analysis and to implement new algorithms based on its primitives. Conclusions: DIAMIN has proved successful in allowing users to solve specific biological problems that can be modeled through biological networks by using its functionalities. The software is freely available, and this will hopefully allow its rapid diffusion through the scientific community, to solve both specific data analysis and more complex tasks.

    Rank-Similarity Measures for Comparing Gene Prioritizations: A Case Study in Autism

    We discuss the challenge of comparing three gene prioritization methods: network propagation, integer linear programming rank aggregation (RA), and statistical RA. These methods are based on different biological categories and estimate disease-gene association. Previously proposed comparison schemes are based on three measures of performance: receiver operating curve, area under the curve, and median rank ratio. Although they may capture important aspects of gene prioritization performance, they may fail to capture important differences in the rankings of individual genes. We suggest that comparison schemes could be improved by also considering recently proposed measures of similarity between gene rankings. We tested this suggestion on comparison schemes for prioritizations of genes associated with autism that were obtained using brain- and tissue-specific data. Our results show the effectiveness of our measures of similarity in clustering brain regions based on their relevance to autism.
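
The abstract does not say which rank-similarity measures were used. As a generic illustration of what comparing two gene rankings gene by gene can look like, the sketch below computes the normalized Kendall tau distance, a standard measure chosen here purely as a stand-in. The function name `kendall_tau_distance` and the gene labels are hypothetical.

```python
from itertools import combinations


def kendall_tau_distance(rank_a: list[str], rank_b: list[str]) -> float:
    """Fraction of gene pairs that the two rankings order differently.

    Both rankings must contain the same genes, each exactly once.
    Returns 0.0 for identical rankings and 1.0 for fully reversed ones.
    """
    pos_a = {gene: i for i, gene in enumerate(rank_a)}
    pos_b = {gene: i for i, gene in enumerate(rank_b)}
    if len(pos_a) != len(rank_a) or len(pos_b) != len(rank_b):
        raise ValueError("rankings must not contain duplicate genes")
    if pos_a.keys() != pos_b.keys():
        raise ValueError("rankings must contain the same genes")

    pairs = list(combinations(rank_a, 2))
    if not pairs:
        return 0.0
    # A pair is discordant when the two rankings disagree on its relative order.
    discordant = sum(
        1
        for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)


if __name__ == "__main__":
    # Hypothetical gene labels; the two rankings swap the last two genes.
    print(kendall_tau_distance(["g1", "g2", "g3"], ["g1", "g3", "g2"]))
```

Unlike a single summary score such as area under the curve, a measure of this kind is sensitive to where individual genes land in each list, which is the kind of difference the abstract argues aggregate measures can miss.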

    Minimal Extrathyroidal Extension in Predicting 1-Year Outcomes: A Longitudinal Multicenter Study of Low-to-Intermediate-Risk Papillary Thyroid Carcinoma (ITCO#4)

    Background: The role of minimal extrathyroidal extension (mETE) as a risk factor for persistent papillary thyroid carcinoma (PTC) is still debated. The aim of this study was to assess the clinical impact of mETE as a predictor of worse initial treatment response in PTC patients and to verify the impact of radioiodine therapy after surgery in patients with mETE. Methods: We reviewed all records in the Italian Thyroid Cancer Observatory (ITCO) database and selected 2237 consecutive patients with PTC who satisfied the inclusion criteria (PTC with no lymph node metastases and at least 1 year of follow-up). For each case, we considered initial surgery, histological variant of PTC, tumor diameter, recurrence risk class according to the American Thyroid Association (ATA) risk stratification system, use of radioiodine therapy, and initial therapy response, as suggested by ATA guidelines. Results: At 1-year follow-up, 1831 patients (81.8%) had an excellent response, 296 (13.2%) had an indeterminate response, 55 (2.5%) had a biochemical incomplete response, and 55 (2.5%) had a structural incomplete response. Statistical analysis suggested that mETE (odds ratio [OR] 1.16, p=0.65), tumor size >2 cm (OR 1.45, p=0.34), aggressive PTC histology (OR 0.55, p=0.15), and age at diagnosis (OR 0.90, p=0.32) were not significant risk factors for a worse initial therapy response. When evaluating the combination of mETE, tumor size, and aggressive PTC histology, the presence of mETE with a >2 cm tumor was significantly associated with a worse outcome (OR 5.27, 95% CI, p=0.014). The role of radioiodine ablation in patients with mETE was also evaluated. When considering radioiodine treatment, propensity score-based matching was performed, and no significant differences were found between treated and non-treated patients (p=0.24). 
    Conclusions: This study failed to show the prognostic value of mETE in predicting initial therapy response in a large cohort of PTC patients without lymph node metastases. The study suggests that the combination of tumor diameter and mETE can be used as a reliable prognostic factor for persistence and could be easily applied in clinical practice to manage PTC patients with low-to-intermediate risk of recurrent/persistent disease.

    Using the audio of 8-bit video games to monitor web marketing campaigns

    Monitoring the performance of a web marketing campaign is usually a long-lasting, low-effort but distracting task, where a user repeatedly glances at some sort of visual analytics tool to check whether the campaign is going well. In this paper, we explore an alternative approach for this task, where the performance of a web marketing campaign is monitored through sonification, using the soundset of popular 8-bit arcade video games. On the one hand, sonification would allow a user to be constantly informed about the current state of the campaign without being distracted. On the other hand, the sound metaphors coming from popular 8-bit arcade video games would be able to convey information about the status of the campaign in a simple and effective way (i.e., if the sonification of a campaign resembles the audio of a successful game session, then the campaign is going well). We investigated this idea by developing a prototype system for the sonification of web server activity through a configurable set of sound metaphors. We then analyzed the effectiveness of our approach by conducting a simple experimental study. This was done, first, by sonifying the progress of a given web marketing campaign using the soundset of two popular 8-bit video games: Super Mario Bros and Bubble Bobble. The resulting soundtrack was then used in a controlled setting to assess the performance of a group of 20 participants listening to our soundtrack under different work conditions.

    Reliable accounting in grids

    Grid computing is a distributed environment in which a remote service is provided by a resource owner to a client by means of a grid infrastructure. One of the major expectations for grid computing is the rise of a market where users pay to access the computational and storage capacity offered by a resource owner. In this scenario, all the steps of the economic transaction related to the fulfilment of a service are accomplished with the mediation of the grid infrastructure. Several economic models have been proposed for determining how to charge for the services offered through a grid. In this paper, we outline one important security issue that may arise in models where services are priced according to the amount of resources they consume. Our contribution is to propose a new security model where secure grid transactions are possible even when resource owners and clients are corrupted. Copyright © 2013 Inderscience Enterprises Ltd.