dislib: large scale high performance machine learning in Python

Abstract

In recent years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. Researchers in areas such as genomics, earth sciences, and astrophysics can leverage it to gain valuable insight. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard for non-experts to use, and are difficult to set up on high-performance computing clusters. These limitations have prevented scientists from exploiting the full potential of machine learning in their research. In this work, we present dislib [1], a distributed machine learning library built on top of the PyCOMPSs programming model [2] that addresses the issues of existing libraries.
