223 research outputs found
Technical support for Life Sciences communities on a production grid infrastructure
Production operation of large distributed computing infrastructures (DCI)
still requires a lot of human intervention to reach acceptable quality of
service. This may be achievable for scientific communities with solid IT
support, but it remains a show-stopper for others. Some application execution
environments are used to hide runtime technical issues from end users. But they
mostly aim at fault-tolerance rather than incident resolution, and their
operation still requires substantial manpower. A longer-term support activity
is thus needed to ensure sustained quality of service for Virtual Organisations
(VO). This paper describes how the biomed VO has addressed this challenge by
setting up a technical support team. Its organisation, tooling, daily tasks,
and procedures are described. Results are shown in terms of resource usage by
end users, amount of reported incidents, and developed software tools. Based on
our experience, we suggest ways to measure the impact of the technical support,
perspectives to decrease its human cost and make it more community-specific.Comment: HealthGrid'12, Amsterdam : Netherlands (2012
Mondrian Forest for Data Stream Classification Under Memory Constraints
Supervised learning algorithms generally assume the availability of enough
memory to store their data model during the training and test phases. However,
in the Internet of Things, this assumption is unrealistic when data comes in
the form of infinite data streams, or when learning algorithms are deployed on
devices with reduced amounts of memory. In this paper, we adapt the online
Mondrian forest classification algorithm to work with memory constraints on
data streams. In particular, we design five out-of-memory strategies to update
Mondrian trees with new data points when the memory limit is reached. Moreover,
we design trimming mechanisms to make Mondrian trees more robust to concept
drifts under memory constraints. We evaluate our algorithms on a variety of
real and simulated datasets, and we conclude with recommendations on their use
in different situations: the Extend Node strategy appears as the best
out-of-memory strategy in all configurations, whereas different trimming
mechanisms should be adopted depending on whether a concept drift is expected.
All our methods are implemented in the OrpailleCC open-source library and are
ready to be used on embedded systems and connected objects
Implementation of Turing machines with the Scufl data-flow language
International audienceIn this paper, the expressiveness of the simple Scufl data-flow language is studied by showing how it can be used to implement Turing machines. To do that, several non trivial Scufl patterns such as self-looping or sub-workflows are required and we precisely explicit them. The main result of this work is to show how a complex workflow can be implemented using a very simple data-flow language. Beyond that, it shows that Scufl is a Turing complete language, given some restrictions that we discuss
High-Resolution Road Vehicle Collision Prediction for the City of Montreal
Road accidents are an important issue of our modern societies, responsible
for millions of deaths and injuries every year in the world. In Quebec only, in
2018, road accidents are responsible for 359 deaths and 33 thousands of
injuries. In this paper, we show how one can leverage open datasets of a city
like Montreal, Canada, to create high-resolution accident prediction models,
using big data analytics. Compared to other studies in road accident
prediction, we have a much higher prediction resolution, i.e., our models
predict the occurrence of an accident within an hour, on road segments defined
by intersections. Such models could be used in the context of road accident
prevention, but also to identify key factors that can lead to a road accident,
and consequently, help elaborate new policies.
We tested various machine learning methods to deal with the severe class
imbalance inherent to accident prediction problems. In particular, we
implemented the Balanced Random Forest algorithm, a variant of the Random
Forest machine learning algorithm in Apache Spark. Interestingly, we found that
in our case, Balanced Random Forest does not perform significantly better than
Random Forest.
Experimental results show that 85% of road vehicle collisions are detected by
our model with a false positive rate of 13%. The examples identified as
positive are likely to correspond to high-risk situations. In addition, we
identify the most important predictors of vehicle collisions for the area of
Montreal: the count of accidents on the same road segment during previous
years, the temperature, the day of the year, the hour and the visibility
Predicting computational reproducibility of data analysis pipelines in large population studies using collaborative filtering
Evaluating the computational reproducibility of data analysis pipelines has
become a critical issue. It is, however, a cumbersome process for analyses that
involve data from large populations of subjects, due to their computational and
storage requirements. We present a method to predict the computational
reproducibility of data analysis pipelines in large population studies. We
formulate the problem as a collaborative filtering process, with constraints on
the construction of the training set. We propose 6 different strategies to
build the training set, which we evaluate on 2 datasets, a synthetic one
modeling a population with a growing number of subject types, and a real one
obtained with neuroinformatics pipelines. Results show that one sampling
method, "Random File Numbers (Uniform)" is able to predict computational
reproducibility with a good accuracy. We also analyze the relevance of
including file and subject biases in the collaborative filtering model. We
conclude that the proposed method is able to speedup reproducibility
evaluations substantially, with a reduced accuracy loss
- …