SourcererCC: Scaling Code Clone Detection to Big Code
Despite a decade of active research, there is a marked lack of clone
detectors that scale to very large repositories of source code, in particular
for detecting near-miss clones where significant editing activities may take
place in the cloned code. We present SourcererCC, a token-based clone detector
that targets three clone types, and exploits an index to achieve scalability to
large inter-project repositories using a standard workstation. SourcererCC uses
an optimized inverted-index to quickly query the potential clones of a given
code block. Filtering heuristics based on token ordering are used to
significantly reduce the size of the index, the number of code-block
comparisons needed to detect the clones, as well as the number of required
token-comparisons needed to judge a potential clone.
We evaluate the scalability, execution time, recall and precision of
SourcererCC, and compare it to four publicly available and state-of-the-art
tools. To measure recall, we use two recent benchmarks, (1) a large benchmark
of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of
thousands of fine-grained artificial clones. We find SourcererCC has both high
recall and precision, and is able to scale to a large inter-project repository
(250MLOC) using a standard workstation.
Comment: Accepted for publication at ICSE'16 (preprint, unrevised)
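The index-and-filter idea described in the SourcererCC abstract above can be illustrated with a small sketch. This is not SourcererCC's implementation: it assumes a prefix-filtering heuristic of the kind used in set-similarity joins, an overlap threshold `theta`, and a hypothetical function name `find_clone_pairs`. Tokens in each block are sorted by global rarity, and only a short prefix of each block is indexed and queried.

```python
from collections import Counter, defaultdict
from math import ceil


def find_clone_pairs(blocks, theta=0.8):
    """Return pairs of block names whose shared tokens reach theta * max size."""
    bags = {name: Counter(tokens) for name, tokens in blocks.items()}

    # One global ordering for every block: rarest tokens first, ties by token text.
    freq = Counter()
    for bag in bags.values():
        freq.update(bag)

    index = defaultdict(set)  # token -> blocks whose prefix contains that token
    pairs = []
    for name in sorted(bags):
        bag = bags[name]
        size = sum(bag.values())
        ordered = sorted(bag.elements(), key=lambda t: (freq[t], t))
        # Two blocks that meet the threshold must share a token in their prefixes,
        # so only the prefix is indexed and queried.
        prefix = ordered[: size - ceil(theta * size) + 1]

        candidates = set()
        for token in prefix:
            candidates |= index[token]

        for other in sorted(candidates):
            other_size = sum(bags[other].values())
            shared = sum((bag & bags[other]).values())  # multiset intersection
            if shared >= ceil(theta * max(size, other_size)):
                pairs.append((other, name))

        for token in prefix:
            index[token].add(name)
    return pairs


# Hypothetical toy input: two blocks that differ only in an identifier name.
blocks = {
    "read_x": "int x = read ( ) ; return x + 1 ;".split(),
    "read_y": "int y = read ( ) ; return y + 1 ;".split(),
    "loop": "while true : print ( msg )".split(),
}
clone_pairs = find_clone_pairs(blocks, theta=0.7)
```

In this toy input only the two renamed blocks are compared at all; the unrelated block never becomes a candidate because its prefix shares no token with theirs, which is the point of filtering before comparison.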
Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Research
The data-driven parallelization framework Hadoop MapReduce allows analysing large data sets in a scalable way. Since the development of MapReduce programs can be a time-intensive and challenging task, the application and usage of Hadoop in Biomedical Research is still limited. Here we present Cloudflow, a high-level framework to hide the implementation details of Hadoop and to provide a set of building blocks to create biomedical pipelines in a more intuitive way. We demonstrate the benefit of Cloudflow on three different genetic use cases. We show how the framework can be combined with the Hadoop workflow system Cloudgene and the cloud orchestration platform CloudMan to provide Hadoop pipelines as a service to everyone. The framework is open source and freely available at https://github.com/genepi/cloudflow.
Document type: Conference object
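To make the MapReduce model behind the Cloudflow abstract concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps. It does not use Hadoop or Cloudflow's API; the function names (`map_reduce`, `variant_mapper`, `count_reducer`) and the tab-separated record layout are assumptions made for illustration only.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, shuffle (group values by key), and reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def variant_mapper(line):
    """Emit (chromosome, 1) for each data line; header lines emit nothing."""
    if line.startswith("#"):
        return
    # Assumed layout: chromosome, position, reference allele, alternate allele.
    chrom, _pos, _ref, _alt = line.split("\t")[:4]
    yield chrom, 1


def count_reducer(_key, values):
    """Sum the counts collected for one key."""
    return sum(values)


# Hypothetical records, not data from the paper.
records = [
    "#chrom\tpos\tref\talt",
    "1\t100\tA\tG",
    "1\t250\tC\tT",
    "2\t75\tG\tA",
]
variants_per_chromosome = map_reduce(records, variant_mapper, count_reducer)
```

A high-level framework of the kind the abstract describes would supply the grouping and distribution logic, leaving only small functions like the mapper and reducer to the pipeline author.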
CapillaryX: A Software Design Pattern for Analyzing Medical Images in Real-time using Deep Learning
Recent advances in digital imaging, e.g., increased number of pixels
captured, have meant that the volume of data to be processed and analyzed from
these images has also increased. Deep learning algorithms are state-of-the-art
for analyzing such images, given their high accuracy when trained with a large
volume of data. Nevertheless, such analysis requires considerable
computational power, making such algorithms time- and resource-demanding. Such
high demands can be met by using third-party cloud service providers. However,
analyzing medical images using such services raises several legal and privacy
challenges and does not necessarily provide real-time results. This paper
presents a computing architecture that can analyze medical images locally,
in parallel, and in real time using deep learning, thus avoiding the legal and
privacy challenges stemming from uploading data to a third-party cloud
provider. To make local image processing efficient on modern multi-core
processors, we utilize parallel execution to offset the resource-intensive
demands of deep neural networks. We focus on a specific medical-industrial case
study, namely the quantification of blood vessels in microcirculation images, for
which we have developed a working system. It is currently used in an
industrial, clinical research setting as part of an e-health application. Our
results show that our system is approximately 78% faster than its serial system
counterpart and 12% faster than a master-slave parallel system architecture.
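The contrast between serial and parallel local processing mentioned in the CapillaryX abstract can be sketched as follows. This is not the paper's architecture: `analyze_frame` is a hypothetical stand-in for deep-learning inference, and a thread pool is assumed purely for illustration. Whether threads or processes suit a real workload depends on the inference library in use.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_frame(frame):
    """Placeholder analysis: fraction of pixels brighter than a threshold."""
    pixels = [p for row in frame for p in row]
    return sum(p > 128 for p in pixels) / len(pixels)


def analyze_serial(frames):
    """Process frames one after another."""
    return [analyze_frame(frame) for frame in frames]


def analyze_parallel(frames, workers=4):
    """Process frames concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze_frame, frames))


# Hypothetical 2x2 grayscale frames.
frames = [
    [[0, 200], [255, 10]],
    [[130, 140], [150, 160]],
    [[0, 0], [0, 0]],
]
serial_results = analyze_serial(frames)
parallel_results = analyze_parallel(frames)
```

Both paths return the same values in the same order, so a parallel variant can be swapped in for the serial one without changing downstream code.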
Laminar: A New Serverless Stream-based Framework with Semantic Code Search and Code Completion
This paper introduces Laminar, a novel serverless framework based on
dispel4py, a parallel stream-based dataflow library. Laminar efficiently
manages streaming workflows and components through a dedicated registry,
offering a seamless serverless experience. Leveraging large language models,
Laminar enhances the framework with semantic code search, code summarization,
and code completion. This contribution enhances serverless computing by
simplifying the execution of streaming computations, managing data streams more
efficiently, and offering a valuable tool for both researchers and
practitioners.
Comment: 13 pages, 10 Figures, 6 Tables
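The stream-based dataflow style that the Laminar abstract refers to can be illustrated with plain Python generators. This sketch does not use dispel4py or Laminar; the stage names and the `run` helper are hypothetical. Each stage consumes a stream and yields results one item at a time, so no stage waits for the whole input.

```python
def numbers(limit):
    """Source stage: emit integers 0 .. limit - 1."""
    for n in range(limit):
        yield n


def keep_even(stream):
    """Filter stage: pass through even values only."""
    for value in stream:
        if value % 2 == 0:
            yield value


def square(stream):
    """Transform stage: square each value."""
    for value in stream:
        yield value * value


def run(source, *stages):
    """Chain the stages onto the source and collect the output."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return list(stream)


squares_of_evens = run(numbers(6), keep_even, square)
```

In a framework with a component registry, stages like these would be stored and looked up by name or description rather than passed around as local functions.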