PerfXplain: Debugging MapReduce Job Performance
While users today have access to many tools that assist in performing large
scale data analysis tasks, understanding the performance characteristics of
their parallel computations, such as MapReduce jobs, remains difficult. We
present PerfXplain, a system that enables users to ask questions about the
relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain
provides a new query language for articulating performance queries and an
algorithm for generating explanations from a log of past MapReduce job
executions. We formally define the notion of an explanation together with three
metrics, relevance, precision, and generality, that measure explanation
quality. We present the explanation-generation algorithm based on techniques
related to decision-tree building. We evaluate the approach on a log of past
executions on Amazon EC2, and show that our approach can generate quality
explanations, outperforming two naive explanation-generation methods.
Comment: VLDB201
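PerfXplain's query language and actual algorithm are not reproduced in this abstract; the following is only a minimal sketch, with an invented job log and a crude stand-in for the paper's precision metric, of the general idea of ranking candidate explanations for a pair of MapReduce jobs by how well they account for runtime differences across past executions.

```python
from itertools import combinations

# Hypothetical job log: each entry maps config parameters to values, plus runtime (s).
log = [
    {"num_reducers": 4,  "input_gb": 10, "compression": False, "runtime": 320},
    {"num_reducers": 16, "input_gb": 10, "compression": False, "runtime": 110},
    {"num_reducers": 4,  "input_gb": 50, "compression": False, "runtime": 900},
    {"num_reducers": 16, "input_gb": 50, "compression": True,  "runtime": 450},
]

def candidate_explanations(a, b):
    """Predicates of the form 'parameter X differs' that hold for the pair (a, b)."""
    return {k for k in a if k != "runtime" and a[k] != b[k]}

def explanation_precision(pred, pairs):
    """Fraction of historical pairs satisfying `pred` whose runtimes differ
    by more than 2x -- a crude stand-in for PerfXplain's precision metric."""
    covered = [(a, b) for a, b in pairs if pred in candidate_explanations(a, b)]
    if not covered:
        return 0.0
    discordant = sum(
        1 for a, b in covered
        if max(a["runtime"], b["runtime"]) > 2 * min(a["runtime"], b["runtime"])
    )
    return discordant / len(covered)

pairs = list(combinations(log, 2))
query_pair = (log[0], log[1])  # "why was job 1 so much faster than job 0?"
best = max(candidate_explanations(*query_pair),
           key=lambda p: explanation_precision(p, pairs))
print(best)  # num_reducers
```

The real system additionally scores explanations by relevance and generality and builds them with decision-tree-style splitting; this sketch only illustrates the pair-of-jobs framing.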
Sweep-Line Extensions to the Multiple Object Intersection Problem: Methods and Applications in Graph Mining
Identifying and quantifying the size of multiple overlapping axis-aligned geometric objects is an essential computational geometry problem. The ability to solve this problem can effectively inform a number of spatial data mining methods and can provide support in decision making for a variety of critical applications. The state-of-the-art approach for addressing such problems relies on an algorithmic paradigm collectively known as the sweep-line, or plane-sweep, algorithm. However, this approach has a number of limitations, including a lack of versatility and a lack of support for ad hoc intersection queries. With these limitations in mind, we design and implement a novel, exact, fast and scalable, yet versatile, sweep-line based algorithm named SLIG. The key idea of our algorithm lies in constructing an auxiliary data structure, an intersection graph, while the sweep-line algorithm is applied. This graph can effectively be used to provide connectivity properties among overlapping objects and to inform answers to ad hoc intersection queries. It can also be employed to find the location and size of the common area of multiple overlapping objects. SLIG performs significantly faster than classic sweep-line based algorithms, is more versatile, and provides a suite of powerful querying capabilities.
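SLIG itself handles general axis-aligned objects and richer queries; as a hedged illustration of the core idea only (building the intersection graph as a by-product of the sweep), here is a sketch for 1-D intervals. When an interval opens during the sweep it necessarily overlaps every currently active interval, so edges can be recorded at that moment.

```python
def interval_intersection_graph(intervals):
    """Sweep over the sorted endpoints of 1-D intervals [lo, hi]; when an
    interval opens, it overlaps every currently active interval, so add
    edges to all of them. Returns adjacency sets keyed by interval index.
    Ties sort opens before closes, so touching intervals count as overlapping."""
    events = []
    for i, (lo, hi) in enumerate(intervals):
        events.append((lo, 0, i))  # 0 = open
        events.append((hi, 1, i))  # 1 = close
    events.sort()
    active, graph = set(), {i: set() for i in range(len(intervals))}
    for _, kind, i in events:
        if kind == 0:
            for j in active:
                graph[i].add(j)
                graph[j].add(i)
            active.add(i)
        else:
            active.discard(i)
    return graph

g = interval_intersection_graph([(0, 5), (3, 9), (8, 12), (20, 25)])
print(g)  # intervals 0-1 and 1-2 overlap; interval 3 is isolated
```

Once built, the graph answers ad hoc connectivity queries (e.g. connected components of overlapping objects) without re-running the sweep.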
To demonstrate the versatility of our SLIG algorithm, we show how it can be utilized for evaluating the importance of nodes in a trajectory network - a type of dynamic network where the nodes are moving objects (cars, pedestrians, etc.) and the edges represent interactions (contacts) between objects as defined by a proximity threshold. The key observation is that the time intervals of these interactions can be represented as 1-dimensional axis-aligned geometric objects. Then, a variant of our SLIG algorithm, named SLOT, effectively computes the metrics of interest, including node degree, triangle membership and connected components for each node, over time.
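SLOT's actual machinery is not described in this abstract; the following is a minimal sketch, with an invented contact log, of just the key observation above: once contacts are 1-D time intervals, a temporal metric such as node degree at time t reduces to counting the intervals that cover t.

```python
from collections import defaultdict

# Hypothetical contact log: (node_u, node_v, t_start, t_end) proximity intervals.
contacts = [("a", "b", 0, 10), ("b", "c", 5, 15), ("a", "c", 12, 20)]

def degree_at(t, contacts):
    """Degree of each node at time t: count the contact intervals covering t."""
    deg = defaultdict(int)
    for u, v, lo, hi in contacts:
        if lo <= t <= hi:
            deg[u] += 1
            deg[v] += 1
    return dict(deg)

print(degree_at(7, contacts))  # {'a': 1, 'b': 2, 'c': 1}
```

A sweep-based variant would instead process interval endpoints once and emit the degree profile for all times, rather than rescanning the log per query as this sketch does.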
PARQ: A MEMORY-EFFICIENT APPROACH FOR QUERY-LEVEL PARALLELISM
In the era of big data, people not only enjoy what massive information brings, but also experience the problem of information overload. As the volumes of both data and users increase sharply, more and more studies focus on how to answer queries for interesting information from massive data. However, most memory-based query systems are designed and implemented to optimize the performance of a single query and do not support in-memory data sharing among query-processing jobs. When they are extended to process multiple concurrent queries, they suffer from inefficient memory use and wasted time.
This thesis aims to design and implement a memory-efficient system, ParQ, which can be adopted by memory-based query systems to realize query-level parallelism. The main idea is to construct a common memory block for maintaining sharable data. By sharing data, ParQ is able to process multiple queries concurrently while reducing memory usage and running time. We apply ParQ to several existing query systems. The experimental results show that ParQ improves performance in both job completion time and memory usage when executing multiple concurrent query jobs.
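ParQ's actual design is not detailed in this abstract; the following is a minimal sketch, with hypothetical names, of the common-memory-block idea: load each dataset into memory once and let concurrent query jobs share that single copy, instead of each job materializing its own.

```python
import threading

class SharedDataBlock:
    """Sketch of a shared in-memory block: datasets are loaded at most once
    and handed out to every concurrent query job (names are assumptions)."""
    def __init__(self, loader):
        self._loader = loader          # function: dataset_name -> data
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, name):
        with self._lock:               # serialize loads so each happens once
            if name not in self._cache:
                self._cache[name] = self._loader(name)
            return self._cache[name]

loads = []  # record how many times the "expensive" load actually runs
block = SharedDataBlock(lambda name: loads.append(name) or list(range(1000)))

def query_job(q):
    data = block.get("events")         # all jobs share one in-memory copy
    return sum(x for x in data if x % q == 0)

threads = [threading.Thread(target=query_job, args=(q,)) for q in (2, 3, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(loads))  # the dataset was loaded only once
```

The single lock keeps the sketch simple; a real system would also need eviction, per-dataset locking, and lifetime management for the shared block.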
Text mining and natural language processing for the early stages of space mission design
Final thesis submitted December 2021 - degree awarded in 2022
A considerable amount of data related to space mission design has been accumulated
since artificial satellites started to venture into space in the 1950s. This data has today
become an overwhelming volume of information, triggering a significant knowledge
reuse bottleneck at the early stages of space mission design. Meanwhile, virtual assistants,
text mining and Natural Language Processing techniques have become pervasive
to our daily life.
The work presented in this thesis is one of the first attempts to bridge the gap
between the worlds of space systems engineering and text mining. Several novel models
are thus developed and implemented here, targeting the structuring of accumulated
data through an ontology, but also tasks commonly performed by systems engineers
such as requirement management and heritage analysis. A first collection of documents
related to space systems is gathered for the training of these methods. Eventually, this
work aims to pave the way towards the development of a Design Engineering Assistant
(DEA) for the early stages of space mission design. It is also hoped that this work will
actively contribute to the integration of text mining and Natural Language Processing
methods in the field of space mission design, enhancing current design processes.
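The thesis's models are not reproduced in this abstract; as a rough illustration of one task it mentions, heritage analysis over a collection of space-systems documents, here is a plain TF-IDF similarity ranking built from scratch. The corpus, the query, and the choice of TF-IDF are all assumptions for illustration, not the thesis's actual method.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of space-systems documents; the task is to find
# past documents most similar to a new requirement (heritage analysis).
docs = {
    "D1": "the satellite shall provide attitude control via reaction wheels",
    "D2": "thermal control subsystem requirements for the orbiter",
    "D3": "reaction wheel sizing and attitude determination heritage",
}
query = "attitude control with reaction wheels"

def tokenize(text):
    return text.lower().split()

all_texts = dict(docs, Q=query)          # include the query for df counts
df = Counter(w for t in all_texts.values() for w in set(tokenize(t)))
n = len(all_texts)

def tfidf(text):
    """Term frequency times inverse document frequency, as a sparse dict."""
    toks = tokenize(text)
    return {w: (c / len(toks)) * math.log(n / df[w])
            for w, c in Counter(toks).items()}

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a if w in b)
    den = (math.sqrt(sum(x * x for x in a.values()))
           * math.sqrt(sum(x * x for x in b.values())))
    return num / den if den else 0.0

qv = tfidf(query)
ranked = sorted(docs, key=lambda d: cosine(tfidf(docs[d]), qv), reverse=True)
print(ranked[0])  # the attitude-control requirement document
```

A production assistant would use stemming, domain ontologies, or learned embeddings rather than raw whitespace tokens, but the ranking pipeline has this shape.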
Systematic Analysis of Engineering Change Request Data - Applying Data Mining Tools to Gain New Fact-Based Insights
Large, complex system development projects take several years to execute. Such projects involve hundreds of engineers who develop thousands of parts and millions of lines of code. During the course of a project, many design decisions often need to be changed due to the emergence of new information. These changes are often well documented in databases, but due to the complexity of the data, few companies analyze engineering change requests (ECRs) in a comprehensive and structured fashion. ECRs are important in the product development process to enhance a product. The opportunity at hand is that vast amounts of data on industrial changes are captured and stored, yet the present challenge is to systematically retrieve and use them in a purposeful way. This PhD thesis explores the growing need of product developers for data expertise and analysis. Product developers increasingly refer to analytics for improvement opportunities for business processes and products. For this reason, we examined the three components necessary to perform data mining and data analytics: exploring and collecting ECR data, collecting domain knowledge for ECR information needs, and applying mathematical tools for solution design and implementation. Results from extensive interviews generated a list of engineering information needs related to ECRs. When preparing for data mining, it is crucial to understand how the end user or the domain expert will and wants to use the extractable information. Results also show industrial case studies where complex product development processes are modeled using the Markov chain Design Structure Matrix to analyze and compare ECR sequences in four projects. In addition, the study investigates how advanced searches based on natural language processing techniques and clustering within engineering databases can help identify related content in documents.
This can help product developers conduct better pre-studies, as they can now evaluate a short list of the most relevant historical documents that might contain valuable knowledge. The main contribution is an application of data mining algorithms to a novel industrial domain; the state of the art is more advanced for the algorithms themselves than for their application in this domain. These proposed procedures and methods were evaluated using industrial data to show patterns for process improvements and to cluster similar information. New information derived with data mining and analytics can help product developers make better decisions for new designs or re-designs of processes and products to ensure robust and superior products.
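The thesis's Markov chain Design Structure Matrix analysis is not reproduced here; as a minimal sketch of the underlying step, with invented ECR sequences, here is how maximum-likelihood Markov transition probabilities can be estimated from observed sequences of change-request states (e.g. which subsystem each successive ECR touched).

```python
from collections import Counter, defaultdict

# Hypothetical ECR state sequences: which subsystem each successive
# change request in a project touched.
sequences = [
    ["engine", "chassis", "engine", "software"],
    ["engine", "software", "software"],
    ["chassis", "engine", "software"],
]

def transition_matrix(seqs):
    """Maximum-likelihood Markov transition probabilities: count observed
    state-to-state transitions, then normalize each row."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

P = transition_matrix(sequences)
print(P["engine"])  # {'chassis': 0.25, 'software': 0.75}
```

Arranging such a transition matrix with rows and columns in subsystem order gives a DSM-like view; comparing the matrices of different projects is one way to compare their ECR sequences, under the assumptions stated above.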