Alignment-free Genomic Analysis via a Big Data Spark Platform
Motivation: Alignment-free distance and similarity functions (AF functions,
for short) are a well established alternative to two and multiple sequence
alignments for many genomic, metagenomic and epigenomic tasks. Due to
data-intensive applications, the computation of AF functions is a Big Data
problem, with the recent literature indicating that the development of fast and
scalable algorithms computing AF functions is a high-priority task. Somewhat
surprisingly, despite the increasing popularity of Big Data technologies in
Computational Biology, the development of a Big Data platform for those tasks
has not been pursued, possibly due to its complexity. Results: We fill this
important gap by introducing FADE, the first extensible, efficient and scalable
Spark platform for alignment-free genomic analysis. It natively supports
eighteen of the best-performing AF functions identified by a recent landmark
benchmarking study. FADE's development and potential impact involve several
novel aspects of interest, namely: (a) a considerable effort in the design of
distributed algorithms, the most tangible result being a much faster execution
time of reference methods like MASH and FSWM; (b) a software design that makes
FADE user-friendly and easily extendable by Spark non-specialists; (c) its
ability to support data- and compute-intensive tasks. In this regard, we provide a novel
and much needed analysis of how informative and robust AF functions are, in
terms of the statistical significance of their output. Our findings naturally
extend those of the highly regarded benchmarking study, since the functions
that can really be used are reduced to a handful of the eighteen included in
FADE
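As a rough illustration of what an alignment-free distance computes, the sketch below compares two sequences by the Jaccard distance between their k-mer sets. This is only an assumed, minimal single-machine example: it is not FADE's API, and it uses exact k-mer sets rather than the sketching or spaced-word strategies of tools such as MASH or FSWM. The function names and the default k are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """Alignment-free distance: 1 - |A & B| / |A | B| over k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k: treat them as identical.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


if __name__ == "__main__":
    # Toy sequences chosen only for illustration.
    print(jaccard_distance("ACGTACGT", "ACGTTCGT"))
```

No alignment is computed at any point: the two sequences are reduced to k-mer sets and compared directly, which is what makes this family of functions amenable to distributed, per-sequence processing.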
Using HTML5 to Prevent Detection of Drive-by-Download Web Malware
The web has experienced explosive growth in recent years. New
technologies are introduced at a very fast pace with the aim of narrowing the
gap between web-based applications and traditional desktop applications. The
results are web applications that look and feel almost like desktop
applications while retaining the advantages of originating from the web.
However, these advancements come at a price. The same technologies used to
build responsive, pleasant and fully-featured web applications, can also be
used to write web malware able to escape detection systems. In this article we
present new obfuscation techniques, based on some of the features of the
upcoming HTML5 standard, which can be used to deceive malware detection
systems. The proposed techniques have been tested on a reference set of
obfuscated malware. Our results show that the malware rewritten using our
obfuscation techniques goes undetected while being analyzed by a large number of
detection systems. The same detection systems were able to correctly identify
the same malware in its original unobfuscated form. We also provide some hints
about how the existing malware detection systems can be modified in order to
cope with these new techniques.
Comment: This is the pre-peer reviewed version of the article: Using
HTML5 to Prevent Detection of Drive-by-Download Web Malware, which has been
published in final form at http://dx.doi.org/10.1002/sec.1077. This
article may be used for non-commercial purposes in accordance with Wiley
Terms and Conditions for Self-Archiving
FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy
Background
Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as the leading platform. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is far from immediate. Such a state of the art is problematic.
Results
We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, with respect to the use of generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System.
Conclusions
Our methods and the corresponding software contribute substantially to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Since our approach is general, it is very likely that it can also be applied to FASTA/Q compression methods that will appear in the future
DIAMIN: a software library for the distributed analysis of large-scale molecular interaction networks
Background
Huge amounts of molecular interaction data are continuously produced and stored in public databases. Although many bioinformatics tools have been proposed in the literature for their analysis, based on their modeling through different types of biological networks, several problems remain unsolved when the analysis moves to a large scale.
Results
We propose DIAMIN, a high-level software library to facilitate the development of applications for the efficient analysis of large-scale molecular interaction networks. DIAMIN relies on distributed computing, and it is implemented in Java on top of the Apache Spark framework. It delivers a set of functionalities implementing different tasks on an abstract representation of very large graphs, providing built-in support for methods and algorithms commonly used to analyze these networks. DIAMIN has been tested on data retrieved from two of the most used molecular interaction databases, proving to be highly efficient and scalable. As shown by the examples provided, DIAMIN can be used by users without any distributed programming experience to perform various types of data analysis and to implement new algorithms based on its primitives.
Conclusions
DIAMIN has proved successful in allowing users to solve specific biological problems that can be modeled through biological networks, by using its functionalities. The software is freely available, and this will hopefully allow its rapid diffusion through the scientific community, to solve both specific data analysis and more complex tasks
Rank-Similarity Measures for Comparing Gene Prioritizations: A Case Study in Autism
We discuss the challenge of comparing three gene prioritization methods: network propagation, integer linear programming rank aggregation (RA), and statistical RA. These methods are based on different biological categories and estimate disease-gene association. Previously proposed comparison schemes are based on three measures of performance: receiver operating curve, area under the curve, and median rank ratio. Although they may capture important aspects of gene prioritization performance, they may fail to capture important differences in the rankings of individual genes. We suggest that comparison schemes could be improved by also considering recently proposed measures of similarity between gene rankings. We tested this suggestion on comparison schemes for prioritizations of genes associated with autism that were obtained using brain- and tissue-specific data. Our results show the effectiveness of our measures of similarity in clustering brain regions based on their relevance to autism
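To make the idea of a rank-similarity measure concrete, the sketch below computes the overlap between the top-k entries of two gene rankings. The abstract does not name the measures it uses, so this top-k Jaccard overlap is only an assumed stand-in for illustration; the function name and the placeholder gene labels are hypothetical.

```python
def top_k_overlap(rank_a, rank_b, k):
    """Jaccard overlap between the top-k items of two ranked lists.

    Returns 1.0 when the two top-k sets coincide and 0.0 when they
    share no item. Order within the top k is ignored.
    """
    top_a, top_b = set(rank_a[:k]), set(rank_b[:k])
    union = top_a | top_b
    if not union:
        # Two empty prefixes are considered identical.
        return 1.0
    return len(top_a & top_b) / len(union)


if __name__ == "__main__":
    # Placeholder gene labels, not real prioritization output.
    method_1 = ["G1", "G2", "G3", "G4", "G5"]
    method_2 = ["G3", "G2", "G1", "G5", "G4"]
    print(top_k_overlap(method_1, method_2, 2))
```

Unlike an aggregate score such as area under the curve, a measure of this kind reacts to which individual genes each method places near the top, which is the kind of difference the abstract argues can be missed.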
Minimal Extrathyroidal Extension in Predicting 1-Year Outcomes: A Longitudinal Multicenter Study of Low-to-Intermediate-Risk Papillary Thyroid Carcinoma (ITCO#4)
Background: The role of minimal extrathyroidal extension (mETE) as a risk factor for persistent papillary thyroid carcinoma (PTC) is still debated. The aim of this study was to assess the clinical impact of mETE as a predictor of worse initial treatment response in PTC patients and to verify the impact of radioiodine therapy after surgery in patients with mETE.
Methods: We reviewed all records in the Italian Thyroid Cancer Observatory (ITCO) database and selected 2237 consecutive patients with PTC who satisfied the inclusion criteria (PTC with no lymph node metastases and at least 1 year of follow-up). For each case, we considered initial surgery, histological variant of PTC, tumor diameter, recurrence risk class according to the American Thyroid Association (ATA) risk stratification system, use of radioiodine therapy, and initial therapy response, as suggested by ATA guidelines.
Results: At 1-year follow-up, 1831 patients (81.8%) had an excellent response, 296 (13.2%) had an indeterminate response, 55 (2.5%) had a biochemical incomplete response, and 55 (2.5%) had a structural incomplete response. Statistical analysis suggested that mETE (odds ratio [OR] 1.16, p=0.65), tumor size >2 cm (OR 1.45, p=0.34), aggressive PTC histology (OR 0.55, p=0.15), and age at diagnosis (OR 0.90, p=0.32) were not significant risk factors for a worse initial therapy response. When evaluating the combination of mETE, tumor size, and aggressive PTC histology, the presence of mETE with a >2 cm tumor was significantly associated with a worse outcome (OR 5.27, 95% CI, p=0.014). The role of radioiodine ablation in patients with mETE was also evaluated. When considering radioiodine treatment, propensity score-based matching was performed, and no significant differences were found between treated and non-treated patients (p=0.24).
Conclusions: This study failed to show the prognostic value of mETE in predicting initial therapy response in a large cohort of PTC patients without lymph node metastases. The study suggests that the combination of tumor diameter and mETE can be used as a reliable prognostic factor for persistence and could be easily applied in clinical practice to manage PTC patients with low-to-intermediate risk of recurrent/persistent disease
Using the audio of 8-bit video games to monitor web marketing campaigns
Monitoring the performance of a web marketing
campaign is usually a long-lasting, low-effort but
distracting task, where a user repeatedly glances at some
sort of visual analytics tools to check whether the campaign
is going well. In this paper, we explore an alternative
approach for this task, where the performance of a web
marketing campaign is monitored through sonification,
using the soundset of popular 8-bit arcade video games. On
one hand, sonification would allow a user to be constantly
informed about the current state of the campaign without
being distracted. On the other hand, the sound metaphors
coming from popular 8-bit arcade video games would be
able to convey information about the status of the campaign
in a simple and effective way (i.e., if the sonification of a
campaign resembles the audio of a successful game session,
then the campaign is going well). We investigated
this idea by developing a prototype system for the sonification
of web server activity through a
configurable set of sound metaphors. We then analyzed
the effectiveness of our approach by conducting a simple
experimental study. This was done, first, by sonifying the
progress of a given web marketing campaign using the
soundset of two popular 8-bit video games: Super Mario
Bros and Bubble Bobble. The resulting soundtrack was then used in a controlled setting to assess the performance
of a group of 20 participants listening to our soundtrack
under different work conditions
Reliable accounting in grids
Grid computing is a distributed environment in which a remote service is provided by a resource owner to a client by means of a grid infrastructure. One of the major expectations for grid computing is the rise of a market where users pay to access the computational and storage capacity offered by a resource owner. In this scenario, all the steps of the economic transaction related to the fulfilment of a service are accomplished with the mediation of the grid infrastructure. Several economic models have been proposed for determining how to charge for the services offered through a grid. In this paper, we outline one important security issue that may arise in models where services are priced according to the amount of resources they consume. Our contribution is to propose a new security model where secure grid transactions are possible even when resource owners and clients are corrupted. Copyright © 2013 Inderscience Enterprises Ltd