321 research outputs found
Technical support for Life Sciences communities on a production grid infrastructure
Production operation of large distributed computing infrastructures (DCI)
still requires a lot of human intervention to reach acceptable quality of
service. This may be achievable for scientific communities with solid IT
support, but it remains a show-stopper for others. Some application execution
environments are used to hide runtime technical issues from end users. But they
mostly aim at fault-tolerance rather than incident resolution, and their
operation still requires substantial manpower. A longer-term support activity
is thus needed to ensure sustained quality of service for Virtual Organisations
(VO). This paper describes how the biomed VO has addressed this challenge by
setting up a technical support team. Its organisation, tooling, daily tasks,
and procedures are described. Results are shown in terms of resource usage by
end users, number of reported incidents, and developed software tools. Based on
our experience, we suggest ways to measure the impact of the technical support,
and perspectives to decrease its human cost and make it more community-specific.
Comment: HealthGrid'12, Amsterdam, Netherlands (2012)
Implementation of Turing machines with the Scufl data-flow language
In this paper, the expressiveness of the simple Scufl data-flow language is studied by showing how it can be used to implement Turing machines. To do that, several non-trivial Scufl patterns such as self-looping or sub-workflows are required, and we describe them precisely. The main result of this work is to show how a complex workflow can be implemented using a very simple data-flow language. Beyond that, it shows that Scufl is a Turing-complete language, given some restrictions that we discuss.
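As a point of reference for what such a workflow has to emulate, here is a minimal sketch of a single-tape Turing machine simulator in Python. It is not the Scufl construction from the paper: the function name `run_turing_machine`, the transition-table layout, and the bit-flipping example machine are all assumptions made for illustration. The `for` loop plays the role that a self-looping pattern would play in a data-flow language.

```python
from collections import defaultdict


def run_turing_machine(transitions, tape, start, accept, blank="_", max_steps=10_000):
    """Run a single-tape Turing machine and return (final_state, tape_contents).

    transitions maps (state, symbol) -> (next_state, symbol_to_write, move),
    where move is "L" or "R". This layout is a hypothetical choice for the sketch.
    """
    cells = defaultdict(lambda: blank, enumerate(tape))
    state, head = start, 0
    for _ in range(max_steps):
        if state == accept:
            break
        key = (state, cells[head])
        if key not in transitions:
            break  # no applicable rule: the machine halts
        state, cells[head], move = transitions[key]
        head += 1 if move == "R" else -1
    if not cells:
        return state, ""
    lo, hi = min(cells), max(cells)
    contents = "".join(cells[i] for i in range(lo, hi + 1))
    return state, contents.strip(blank)


# Hypothetical example machine: flip every bit, then stop at the first blank.
FLIP = {
    ("flip", "0"): ("flip", "1", "R"),
    ("flip", "1"): ("flip", "0", "R"),
    ("flip", "_"): ("done", "_", "R"),
}

if __name__ == "__main__":
    print(run_turing_machine(FLIP, "1011", "flip", "done"))
```

The `max_steps` bound is only a safeguard for the sketch; a real Turing machine has no such limit.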
Predicting computational reproducibility of data analysis pipelines in large population studies using collaborative filtering
Evaluating the computational reproducibility of data analysis pipelines has
become a critical issue. It is, however, a cumbersome process for analyses that
involve data from large populations of subjects, due to their computational and
storage requirements. We present a method to predict the computational
reproducibility of data analysis pipelines in large population studies. We
formulate the problem as a collaborative filtering process, with constraints on
the construction of the training set. We propose 6 different strategies to
build the training set, which we evaluate on 2 datasets, a synthetic one
modeling a population with a growing number of subject types, and a real one
obtained with neuroinformatics pipelines. Results show that one sampling
method, "Random File Numbers (Uniform)", is able to predict computational
reproducibility with good accuracy. We also analyze the relevance of
including file and subject biases in the collaborative filtering model. We
conclude that the proposed method is able to speed up reproducibility
evaluations substantially, with a small loss of accuracy.
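To make the notion of file and subject biases concrete, the following is a minimal sketch of the bias part of a collaborative filtering model, where a prediction is the global mean plus one offset per file and one per subject. It is not the authors' implementation: the function names, the hyperparameters, and the encoding of an observation as `(file_id, subject_id, value)` with value 1 for a reproducible file and 0 otherwise are assumptions, and the latent-factor term of a full model is omitted.

```python
def fit_biases(observations, epochs=200, lr=0.05, reg=0.02):
    """Fit value ~ mu + b_file[f] + b_subj[s] by stochastic gradient descent.

    observations: list of (file_id, subject_id, value) triples (assumed encoding).
    Returns (mu, b_file, b_subj).
    """
    mu = sum(v for _, _, v in observations) / len(observations)
    b_file, b_subj = {}, {}
    for _ in range(epochs):
        for f, s, v in observations:
            bf = b_file.get(f, 0.0)
            bs = b_subj.get(s, 0.0)
            err = v - (mu + bf + bs)
            # Regularised gradient step on each bias term.
            b_file[f] = bf + lr * (err - reg * bf)
            b_subj[s] = bs + lr * (err - reg * bs)
    return mu, b_file, b_subj


def predict(model, f, s):
    """Predict the value for a (file, subject) pair; unseen ids get a zero bias."""
    mu, b_file, b_subj = model
    return mu + b_file.get(f, 0.0) + b_subj.get(s, 0.0)


if __name__ == "__main__":
    # Toy training set: file "a" is reproducible for every subject seen, "b" never is.
    train = [("a", "s1", 1), ("a", "s2", 1), ("b", "s1", 0), ("b", "s2", 0), ("a", "s3", 1)]
    model = fit_biases(train)
    print(predict(model, "a", "s3"), predict(model, "b", "s3"))
```

In this toy setting the unobserved pair `("b", "s3")` receives a lower score than `("a", "s3")` because the file bias learned for `"b"` is negative.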
High-Resolution Road Vehicle Collision Prediction for the City of Montreal
Road accidents are an important issue in modern societies, responsible
for millions of deaths and injuries every year worldwide. In Quebec alone,
road accidents were responsible in 2018 for 359 deaths and 33 thousand
injuries. In this paper, we show how one can leverage open datasets of a city
like Montreal, Canada, to create high-resolution accident prediction models,
using big data analytics. Compared to other studies in road accident
prediction, we have a much higher prediction resolution, i.e., our models
predict the occurrence of an accident within an hour, on road segments defined
by intersections. Such models could be used in the context of road accident
prevention, but also to identify key factors that can lead to a road accident,
and consequently, help elaborate new policies.
We tested various machine learning methods to deal with the severe class
imbalance inherent to accident prediction problems. In particular, we
implemented the Balanced Random Forest algorithm, a variant of the Random
Forest machine learning algorithm, in Apache Spark. Interestingly, we found that
in our case, Balanced Random Forest does not perform significantly better than
Random Forest.
Experimental results show that 85% of road vehicle collisions are detected by
our model with a false positive rate of 13%. The examples identified as
positive are likely to correspond to high-risk situations. In addition, we
identify the most important predictors of vehicle collisions for the area of
Montreal: the count of accidents on the same road segment during previous
years, the temperature, the day of the year, the hour and the visibility.
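The sampling step that distinguishes a Balanced Random Forest from a plain Random Forest can be sketched as follows: each tree is trained on a bootstrap sample of the minority class plus an equally sized sample, drawn with replacement, from the majority class. This is a plain-Python illustration, not the Apache Spark implementation described in the abstract. The function name `balanced_bootstrap` and the binary 0/1 label encoding are assumptions.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree, with equal draws from both classes.

    labels: sequence of 0/1 class labels (assumed binary encoding).
    rng: a random.Random instance, passed in so that runs are repeatable.
    """
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)
    n = len(minority)
    # Sample with replacement: n rows from the minority class, n from the majority.
    return rng.choices(minority, k=n) + rng.choices(majority, k=n)


if __name__ == "__main__":
    # Toy data with a 5% positive rate, mimicking severe class imbalance.
    labels = [1] * 5 + [0] * 95
    sample = balanced_bootstrap(labels, random.Random(0))
    print(len(sample), sum(labels[i] for i in sample))
```

Every tree therefore sees a 50/50 class mix regardless of the imbalance in the full dataset, at the cost of using only a small part of the majority class per tree.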
A Service-Oriented Architecture enabling dynamic services grouping for optimizing distributed workflows execution
In this paper, we describe a Service-Oriented Architecture allowing the optimization of the execution of service workflows. We discuss the advantages of the service-oriented approach with regard to the enactment of scientific applications on a grid infrastructure. Based on the development of a generic Web-Services wrapper, we show how the flexibility of our architecture enables dynamic service grouping for optimizing the application execution time. We demonstrate performance results on a real medical imaging application. On a production grid infrastructure, the proposed optimization yields a significant speed-up (from 1.2 to 2.9) compared to a traditional execution.
Flexible and efficient workflow deployment of data-intensive applications on grids with MOTEUR
Special issue on Workflow Systems in Grid Environments. Workflows offer a powerful way to describe and deploy applications on grid infrastructures. Many workflow management systems have been proposed, but there is still no system that allows both a simple description of the dataflow of the application and an efficient execution on a grid platform. In this paper, we study the requirements of such a system, underlining the need for well-defined data composition strategies on the one hand and for a fully parallel execution on the other. As combining those features is not straightforward, we then propose algorithms to do so, and we describe the design and implementation of MOTEUR, a workflow engine that fulfills those requirements. Performance results and overhead quantification are shown to evaluate MOTEUR with respect to existing comparable workflow systems on a production grid.
