8 research outputs found

    dislib: large scale high performance machine learning in Python

    In recent years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. This can be leveraged in numerous research areas, such as genomics, earth sciences, and astrophysics, to gain valuable insight. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard to use by non-experts, and are difficult to set up in high performance computing clusters. These limitations have prevented scientists from exploiting the full potential of machine learning in their research. In this work, we present dislib [1], a distributed machine learning library on top of the PyCOMPSs programming model [2] that addresses the issues of other similar existing libraries.
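The abstract mentions distributed arrays as the foundation of libraries like dislib but includes no code. As a rough in-memory sketch of that blocked-array design, the class and method names below are hypothetical illustrations, not dislib's actual API (the real library ships blocks to workers via PyCOMPSs; here they just live in a list):

```python
import numpy as np

class BlockedArray:
    """Row-blocked array sketch: data is split into row blocks, and
    operations are expressed as a map over blocks plus a reduce."""

    def __init__(self, data, block_rows):
        self._blocks = [data[i:i + block_rows]
                        for i in range(0, len(data), block_rows)]

    def mean(self):
        # Map: per-block partial sums and counts.
        partials = [(b.sum(axis=0), len(b)) for b in self._blocks]
        # Reduce: combine partials into a global column mean.
        total = sum(s for s, _ in partials)
        count = sum(n for _, n in partials)
        return total / count

data = np.arange(12, dtype=float).reshape(6, 2)
x = BlockedArray(data, block_rows=2)
print(x.mean())  # → [5. 6.], same as data.mean(axis=0)
```

The point of the pattern is that each map step touches only one block, so a task-based runtime can execute the blocks on different nodes without the user writing any communication code.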

    Adaptive anomalous behavior identification in large-scale distributed systems

    Distributed systems have become pervasive in current society. From laptops and mobile phones, to servers and data centers, most computers communicate and coordinate their activities through some kind of network. Moreover, many economic and commercial activities of today’s society rely on distributed systems. Examples range from widely used large-scale web services such as Google or Facebook, to enterprise networks and banking systems. However, as distributed systems become larger, more complex, and more pervasive, the probability of failures or malicious activities also increases, to the point that some system designers consider failures to be the norm rather than the exception. The negative effects of failures in distributed systems range from economic losses, to sensitive information leaks. As an example, reports show that the cost of downtime in industry ranges from $100K to $540K per hour on average. These undesired consequences can be avoided with better monitoring tools that can inform system administrators of the presence of anomalies in the system in a timely manner. However, key challenges remain, such as the difficulty in processing large amounts of information, the huge variety of anomalies that can appear, and the difficulty in characterizing these anomalies. This thesis contributes a novel framework for the online detection and identification of anomalies in large-scale distributed systems that addresses these challenges. Our framework periodically collects system performance metrics, and builds a behaviour characterization from these metrics in a way that maximizes the distance between normal and anomalous behaviors. Our framework then uses machine learning techniques to detect previously unseen anomalies, and to identify the type of known anomalies with high accuracy, while overcoming key limitations of existing works in the area.
Our framework does not require historical data, can be employed in a plug-and-play manner, adapts to changes in the system behavior, and allows for a flexible deployment that can be tailored to numerous scenarios with different architectures and requirements. In this thesis, we employ our framework in three anomaly detection application domains: distributed systems, large-scale systems, and malicious traffic detection. Extensive experimental studies in these three domains show that our framework is able to detect several types of anomalies with 0.80 recall on average, and 0.68 mean precision or 0.082 mean FPR depending on the domain. Moreover, our framework achieves over 0.80 accuracy in the identification of various types of complex anomalous behaviors. These results significantly improve on similar works in the three explored research areas. Most importantly, our approach achieves these detection and identification rates with significant advantages over existing works. Specifically, our framework does not rely on historical anomalous data or on assumptions on the characteristics of the anomalies that can make anomaly detection easier. Moreover, our framework provides a flexible and highly scalable design, and an adaptive method that can incorporate new system information at run time. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 2017
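The thesis framework itself cannot be reproduced from the abstract, but its central idea, an adaptive model of normal behaviour built online without historical data, can be sketched in a few lines. The window size and threshold `k` below are hypothetical knobs for illustration, not parameters from the thesis:

```python
from collections import deque
import math
import statistics

class OnlineAnomalyDetector:
    """Toy plug-and-play detector: a sliding window of recent metric
    values models 'normal' behaviour, and the model adapts as new
    observations arrive, with no historical training data."""

    def __init__(self, window=50, k=3.0):
        self.window = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        anomalous = False
        if len(self.window) >= 10:  # need a minimal baseline first
            mu = statistics.fmean(self.window)
            sigma = statistics.pstdev(self.window)
            anomalous = sigma > 0 and abs(value - mu) > self.k * sigma
        # Adapt: fold only normal points back into the behaviour model,
        # so a sustained anomaly does not poison the baseline.
        if not anomalous:
            self.window.append(value)
        return anomalous

det = OnlineAnomalyDetector()
stream = [10 + 0.5 * math.sin(i) for i in range(40)] + [10.2, 9.8, 55.0, 10.1]
flags = [det.observe(v) for v in stream]
print(flags[-2])  # → True: the 55.0 spike is flagged
```

A real system would replace the univariate z-score with the thesis's learned behaviour characterization, but the adapt-only-on-normal update captures why the approach needs no historical anomalous data.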

    Adaptive Performance Anomaly Detection in Distributed Systems Using Online SVMs

    No full text

    Efficient development of high performance data analytics in Python

    Our society is generating an increasing amount of data at an unprecedented scale, variety, and speed. This also applies to numerous research areas, such as genomics, high energy physics, and astronomy, for which large-scale data processing has become crucial. However, there is still a gap between the traditional scientific computing ecosystem and big data analytics tools and frameworks. On the one hand, high performance computing (HPC) programming models lack productivity, and do not provide means for processing large amounts of data in a simple manner. On the other hand, existing big data processing tools have performance issues in HPC environments, and are not general-purpose. In this paper, we propose and evaluate PyCOMPSs, a task-based programming model for Python, as an excellent solution for distributed big data processing in HPC infrastructures. Among other useful features, PyCOMPSs offers a highly productive general-purpose programming model, is infrastructure-agnostic, and provides transparent data management with support for distributed storage systems. We show how two machine learning algorithms (Cascade SVM and K-means) can be developed with PyCOMPSs, and evaluate PyCOMPSs’ productivity based on these algorithms. Additionally, we evaluate PyCOMPSs performance on an HPC cluster using up to 1,536 cores and 320 million input vectors. Our results show that PyCOMPSs achieves similar performance and scalability to MPI in HPC infrastructures, while providing a much more productive interface that allows the easy development of data analytics algorithms. This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement H2020-MSCA-COFUND2016-754433. This work has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by Generalitat de Catalunya, Spain (contract 2014-SGR-1051).
The research leading to these results has also received funding from the collaboration between Fujitsu and BSC (Script Language Platform). Peer Reviewed.
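The paper's K-means case study rests on a map-reduce decomposition that a task-based model like PyCOMPSs parallelizes transparently. Since the PyCOMPSs runtime is not available here, the sketch below runs the per-block tasks serially as plain functions; in the paper's setting each `partial_fit` call would be a `@task` executing on a different node:

```python
import numpy as np

def partial_fit(block, centers):
    """Map step: assign one data block to its nearest centers and
    return per-center partial sums and point counts."""
    dists = ((block[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = np.argmin(dists, axis=1)
    k, d = centers.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        sums[j] = block[labels == j].sum(axis=0)
        counts[j] = (labels == j).sum()
    return sums, counts

def reduce_centers(partials, old_centers):
    """Reduce step: merge partial results into updated centers."""
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    centers = old_centers.copy()
    nonempty = counts > 0
    centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centers

rng = np.random.default_rng(0)
blocks = [rng.normal(c, 0.1, size=(100, 2)) for c in (0.0, 5.0)]
centers = np.array([[1.0, 1.0], [4.0, 4.0]])
for _ in range(5):
    centers = reduce_centers([partial_fit(b, centers) for b in blocks],
                             centers)
print(np.round(centers))  # → roughly [[0, 0], [5, 5]]
```

Because the map tasks are independent and the reduce only touches small per-center aggregates, the same code structure scales to the block-distributed setting the paper evaluates.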

    A survey on the distributed computing stack

    No full text
    In this paper, we review the background and the state of the art of the Distributed Computing software stack. We aim to provide the readers with a comprehensive overview of this area by supplying a detailed big-picture of the latest technologies. First, we introduce the general background of Distributed Computing and propose a layered top–bottom classification of the latest available software. Next, we focus on each abstraction layer, i.e. Application Development (including Task-based Workflows, Dataflows, and Graph Processing), Platform (including Data Sharing and Resource Management), Communication (including Remote Invocation, Message Passing, and Message Queuing), and Infrastructure (including Batch and Interactive systems). For each layer, we give a general background, discuss its technical challenges, review the latest programming languages, programming models, frameworks, libraries, and tools, and provide a summary table comparing the features of each alternative. Finally, we conclude this survey with a discussion of open problems and future directions. This work is partly supported by the Spanish Ministry of Science, Innovation, and Universities through the Severo Ochoa Program (SEV-2015-0493) and the TIN2015-65316-P project. It is also supported by the Generalitat de Catalunya, Spain under contracts 2014-SGR-1051 and 2014-SGR-1272. Cristian Ramon-Cortes pre-doctoral contract is financed by the Spanish Ministry of Science, Innovation, and Universities under the contract BES-2016-076791. Peer Reviewed. Postprint (author's final draft).
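Of the layers the survey classifies, the Communication layer's message-queuing abstraction is easy to illustrate in miniature. The sketch below uses an in-process queue as a stand-in for a distributed broker (a real deployment puts a network and a broker such as a message-queue server between the two ends); it is an illustration of the pattern, not code from the survey:

```python
import queue
import threading

def producer(q, n):
    for i in range(n):
        q.put(i)     # publish a message to the queue
    q.put(None)      # sentinel: no more messages

def consumer(q, results):
    # Pull messages until the sentinel arrives; the queue decouples
    # the producer's rate from the consumer's.
    while (msg := q.get()) is not None:
        results.append(msg * 2)

q = queue.Queue()
results = []
t1 = threading.Thread(target=producer, args=(q, 5))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # → [0, 2, 4, 6, 8]
```

The decoupling shown here, where neither side invokes the other directly, is exactly what distinguishes message queuing from the remote-invocation style in the survey's classification.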

    Impact of age- and gender-specific cut-off values for the fecal immunochemical test for hemoglobin in colorectal cancer screening

    No full text

    Health-status outcomes with invasive or conservative care in coronary disease

    No full text
    BACKGROUND In the ISCHEMIA trial, an invasive strategy with angiographic assessment and revascularization did not reduce clinical events among patients with stable ischemic heart disease and moderate or severe ischemia. A secondary objective of the trial was to assess angina-related health status among these patients. METHODS We assessed angina-related symptoms, function, and quality of life with the Seattle Angina Questionnaire (SAQ) at randomization, at months 1.5, 3, and 6, and every 6 months thereafter in participants who had been randomly assigned to an invasive treatment strategy (2295 participants) or a conservative strategy (2322). Mixed-effects cumulative probability models within a Bayesian framework were used to estimate differences between the treatment groups. The primary outcome of this health-status analysis was the SAQ summary score (scores range from 0 to 100, with higher scores indicating better health status). All analyses were performed in the overall population and according to baseline angina frequency. RESULTS At baseline, 35% of patients reported having no angina in the previous month. SAQ summary scores increased in both treatment groups, with increases at 3, 12, and 36 months that were 4.1 points (95% credible interval, 3.2 to 5.0), 4.2 points (95% credible interval, 3.3 to 5.1), and 2.9 points (95% credible interval, 2.2 to 3.7) higher with the invasive strategy than with the conservative strategy. Differences were larger among participants who had more frequent angina at baseline (8.5 vs. 0.1 points at 3 months and 5.3 vs. 1.2 points at 36 months among participants with daily or weekly angina as compared with no angina). CONCLUSIONS In the overall trial population with moderate or severe ischemia, which included 35% of participants without angina at baseline, patients randomly assigned to the invasive strategy had greater improvement in angina-related health status than those assigned to the conservative strategy. 
The modest mean differences favoring the invasive strategy in the overall group reflected minimal differences among asymptomatic patients and larger differences among patients who had had angina at baseline.

    Initial invasive or conservative strategy for stable coronary disease

    No full text
    BACKGROUND Among patients with stable coronary disease and moderate or severe ischemia, whether clinical outcomes are better in those who receive an invasive intervention plus medical therapy than in those who receive medical therapy alone is uncertain. METHODS We randomly assigned 5179 patients with moderate or severe ischemia to an initial invasive strategy (angiography and revascularization when feasible) and medical therapy or to an initial conservative strategy of medical therapy alone and angiography if medical therapy failed. The primary outcome was a composite of death from cardiovascular causes, myocardial infarction, or hospitalization for unstable angina, heart failure, or resuscitated cardiac arrest. A key secondary outcome was death from cardiovascular causes or myocardial infarction. RESULTS Over a median of 3.2 years, 318 primary outcome events occurred in the invasive-strategy group and 352 occurred in the conservative-strategy group. At 6 months, the cumulative event rate was 5.3% in the invasive-strategy group and 3.4% in the conservative-strategy group (difference, 1.9 percentage points; 95% confidence interval [CI], 0.8 to 3.0); at 5 years, the cumulative event rate was 16.4% and 18.2%, respectively (difference, −1.8 percentage points; 95% CI, −4.7 to 1.0). Results were similar with respect to the key secondary outcome. The incidence of the primary outcome was sensitive to the definition of myocardial infarction; a secondary analysis yielded more procedural myocardial infarctions of uncertain clinical importance. There were 145 deaths in the invasive-strategy group and 144 deaths in the conservative-strategy group (hazard ratio, 1.05; 95% CI, 0.83 to 1.32).
    CONCLUSIONS Among patients with stable coronary disease and moderate or severe ischemia, we did not find evidence that an initial invasive strategy, as compared with an initial conservative strategy, reduced the risk of ischemic cardiovascular events or death from any cause over a median of 3.2 years. The trial findings were sensitive to the definition of myocardial infarction that was used.