PerfXplain: Debugging MapReduce Job Performance
While users today have access to many tools that assist in performing large
scale data analysis tasks, understanding the performance characteristics of
their parallel computations, such as MapReduce jobs, remains difficult. We
present PerfXplain, a system that enables users to ask questions about the
relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain
provides a new query language for articulating performance queries and an
algorithm for generating explanations from a log of past MapReduce job
executions. We formally define the notion of an explanation together with three
metrics, relevance, precision, and generality, that measure explanation
quality. We present the explanation-generation algorithm based on techniques
related to decision-tree building. We evaluate the approach on a log of past
executions on Amazon EC2, and show that our approach can generate quality
explanations, outperforming two naive explanation-generation methods.
Comment: VLDB201
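PerfXplain's query language and actual algorithm are not reproduced in this abstract; the following is only a minimal sketch, with an invented job log and a crude stand-in for the paper's precision metric, of the general idea of ranking candidate explanations for a pair of MapReduce jobs by how well they account for runtime differences across past executions.

```python
from itertools import combinations

# Hypothetical job log: each entry maps config parameters to values, plus runtime (s).
log = [
    {"num_reducers": 4,  "input_gb": 10, "compression": False, "runtime": 320},
    {"num_reducers": 16, "input_gb": 10, "compression": False, "runtime": 110},
    {"num_reducers": 4,  "input_gb": 50, "compression": False, "runtime": 900},
    {"num_reducers": 16, "input_gb": 50, "compression": True,  "runtime": 450},
]

def candidate_explanations(a, b):
    """Predicates of the form 'parameter X differs' that hold for the pair (a, b)."""
    return {k for k in a if k != "runtime" and a[k] != b[k]}

def explanation_precision(pred, pairs):
    """Fraction of historical pairs satisfying `pred` whose runtimes differ
    by more than 2x -- a crude stand-in for PerfXplain's precision metric."""
    covered = [(a, b) for a, b in pairs if pred in candidate_explanations(a, b)]
    if not covered:
        return 0.0
    discordant = sum(
        1 for a, b in covered
        if max(a["runtime"], b["runtime"]) > 2 * min(a["runtime"], b["runtime"])
    )
    return discordant / len(covered)

pairs = list(combinations(log, 2))
query_pair = (log[0], log[1])  # "why was job 1 so much faster than job 0?"
best = max(candidate_explanations(*query_pair),
           key=lambda p: explanation_precision(p, pairs))
print(best)  # num_reducers
```

The real system additionally scores explanations by relevance and generality and builds them with decision-tree-style splitting; this sketch only illustrates the pair-of-jobs framing.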
Sweep-Line Extensions to the Multiple Object Intersection Problem: Methods and Applications in Graph Mining
Identifying and quantifying the size of multiple overlapping axis-aligned geometric objects is an essential computational geometry problem. The ability to solve this problem can effectively inform a number of spatial data mining methods and can provide support in decision making for a variety of critical applications. The state-of-the-art approach for addressing such problems relies on an algorithmic paradigm collectively known as the sweep-line, or plane-sweep, algorithm. However, this approach has a number of limitations, including a lack of versatility and a lack of support for ad hoc intersection queries. With these limitations in mind, we design and implement a novel, exact, fast and scalable, yet versatile, sweep-line based algorithm named SLIG. The key idea of our algorithm lies in constructing an auxiliary data structure, an intersection graph, while the sweep-line algorithm is applied. This graph can effectively be used to provide connectivity properties among overlapping objects and to inform answers to ad hoc intersection queries. It can also be employed to find the location and size of the common area of multiple overlapping objects. SLIG performs significantly faster than classic sweep-line based algorithms, is more versatile, and provides a suite of powerful querying capabilities.
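SLIG itself handles general axis-aligned objects and richer queries; as a hedged illustration of the core idea only (building the intersection graph as a by-product of the sweep), here is a sketch for 1-D intervals. When an interval opens during the sweep it necessarily overlaps every currently active interval, so edges can be recorded at that moment.

```python
def interval_intersection_graph(intervals):
    """Sweep over the sorted endpoints of 1-D intervals [lo, hi]; when an
    interval opens, it overlaps every currently active interval, so add
    edges to all of them. Returns adjacency sets keyed by interval index.
    Ties sort opens before closes, so touching intervals count as overlapping."""
    events = []
    for i, (lo, hi) in enumerate(intervals):
        events.append((lo, 0, i))  # 0 = open
        events.append((hi, 1, i))  # 1 = close
    events.sort()
    active, graph = set(), {i: set() for i in range(len(intervals))}
    for _, kind, i in events:
        if kind == 0:
            for j in active:
                graph[i].add(j)
                graph[j].add(i)
            active.add(i)
        else:
            active.discard(i)
    return graph

g = interval_intersection_graph([(0, 5), (3, 9), (8, 12), (20, 25)])
print(g)  # intervals 0-1 and 1-2 overlap; interval 3 is isolated
```

Once built, the graph answers ad hoc connectivity queries (e.g. connected components of overlapping objects) without re-running the sweep.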
To demonstrate the versatility of our SLIG algorithm, we show how it can be utilized for evaluating the importance of nodes in a trajectory network - a type of dynamic network where the nodes are moving objects (cars, pedestrians, etc.) and the edges represent interactions (contacts) between objects as defined by a proximity threshold. The key observation is that the time intervals of these interactions can be represented as 1-dimensional axis-aligned geometric objects. Then, a variant of our SLIG algorithm, named SLOT, effectively computes the metrics of interest, including node degree, triangle membership and connected components for each node, over time.
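SLOT's actual machinery is not described in this abstract; the following is a minimal sketch, with an invented contact log, of just the key observation above: once contacts are 1-D time intervals, a temporal metric such as node degree at time t reduces to counting the intervals that cover t.

```python
from collections import defaultdict

# Hypothetical contact log: (node_u, node_v, t_start, t_end) proximity intervals.
contacts = [("a", "b", 0, 10), ("b", "c", 5, 15), ("a", "c", 12, 20)]

def degree_at(t, contacts):
    """Degree of each node at time t: count the contact intervals covering t."""
    deg = defaultdict(int)
    for u, v, lo, hi in contacts:
        if lo <= t <= hi:
            deg[u] += 1
            deg[v] += 1
    return dict(deg)

print(degree_at(7, contacts))  # {'a': 1, 'b': 2, 'c': 1}
```

A sweep-based variant would instead process interval endpoints once and emit the degree profile for all times, rather than rescanning the log per query as this sketch does.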
PARQ: A MEMORY-EFFICIENT APPROACH FOR QUERY-LEVEL PARALLELISM
In the era of big data, people not only enjoy what massive information brings, but also experience the problem of information overload. As the volumes of both data and users increase sharply, more and more studies focus on how to answer queries for interesting information from massive data. However, most memory-based query systems are designed and implemented to optimize the performance of a single query and do not support in-memory data sharing among query-processing jobs. When they are extended to process multiple concurrent queries, they suffer from inefficient memory use and wasted time.
This thesis aims to design and implement a memory-efficient system, ParQ, which can be adopted by memory-based query systems to realize query-level parallelism. The main idea is to construct a common memory block for maintaining sharable data. By sharing data, ParQ is able to process multiple queries concurrently while reducing memory usage and running time. We apply ParQ to several existing query systems. The experimental results show that ParQ improves performance in both job completion time and memory usage when executing multiple concurrent query jobs.
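ParQ's actual design is not detailed in this abstract; the following is a minimal sketch, with hypothetical names, of the common-memory-block idea: load each dataset into memory once and let concurrent query jobs share that single copy, instead of each job materializing its own.

```python
import threading

class SharedDataBlock:
    """Sketch of a shared in-memory block: datasets are loaded at most once
    and handed out to every concurrent query job (names are assumptions)."""
    def __init__(self, loader):
        self._loader = loader          # function: dataset_name -> data
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, name):
        with self._lock:               # serialize loads so each happens once
            if name not in self._cache:
                self._cache[name] = self._loader(name)
            return self._cache[name]

loads = []  # record how many times the "expensive" load actually runs
block = SharedDataBlock(lambda name: loads.append(name) or list(range(1000)))

def query_job(q):
    data = block.get("events")         # all jobs share one in-memory copy
    return sum(x for x in data if x % q == 0)

threads = [threading.Thread(target=query_job, args=(q,)) for q in (2, 3, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(loads))  # the dataset was loaded only once
```

The single lock keeps the sketch simple; a real system would also need eviction, per-dataset locking, and lifetime management for the shared block.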
Text mining and natural language processing for the early stages of space mission design
Final thesis submitted December 2021 - degree awarded in 2022
A considerable amount of data related to space mission design has been accumulated
since artificial satellites started to venture into space in the 1950s. This data has today
become an overwhelming volume of information, triggering a significant knowledge
reuse bottleneck at the early stages of space mission design. Meanwhile, virtual assistants,
text mining and Natural Language Processing techniques have become pervasive
to our daily life.
The work presented in this thesis is one of the first attempts to bridge the gap
between the worlds of space systems engineering and text mining. Several novel models
are thus developed and implemented here, targeting the structuring of accumulated
data through an ontology, but also tasks commonly performed by systems engineers
such as requirement management and heritage analysis. A first collection of documents
related to space systems is gathered for the training of these methods. Eventually, this
work aims to pave the way towards the development of a Design Engineering Assistant
(DEA) for the early stages of space mission design. It is also hoped that this work will
actively contribute to the integration of text mining and Natural Language Processing
methods in the field of space mission design, enhancing current design processes.
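The thesis's models are not reproduced in this abstract; as a rough illustration of one task it mentions, heritage analysis over a collection of space-systems documents, here is a plain TF-IDF similarity ranking built from scratch. The corpus, the query, and the choice of TF-IDF are all assumptions for illustration, not the thesis's actual method.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of space-systems documents; the task is to find
# past documents most similar to a new requirement (heritage analysis).
docs = {
    "D1": "the satellite shall provide attitude control via reaction wheels",
    "D2": "thermal control subsystem requirements for the orbiter",
    "D3": "reaction wheel sizing and attitude determination heritage",
}
query = "attitude control with reaction wheels"

def tokenize(text):
    return text.lower().split()

all_texts = dict(docs, Q=query)          # include the query for df counts
df = Counter(w for t in all_texts.values() for w in set(tokenize(t)))
n = len(all_texts)

def tfidf(text):
    """Term frequency times inverse document frequency, as a sparse dict."""
    toks = tokenize(text)
    return {w: (c / len(toks)) * math.log(n / df[w])
            for w, c in Counter(toks).items()}

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a if w in b)
    den = (math.sqrt(sum(x * x for x in a.values()))
           * math.sqrt(sum(x * x for x in b.values())))
    return num / den if den else 0.0

qv = tfidf(query)
ranked = sorted(docs, key=lambda d: cosine(tfidf(docs[d]), qv), reverse=True)
print(ranked[0])  # the attitude-control requirement document
```

A production assistant would use stemming, domain ontologies, or learned embeddings rather than raw whitespace tokens, but the ranking pipeline has this shape.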
Systematic Analysis of Engineering Change Request Data - Applying Data Mining Tools to Gain New Fact-Based Insights
Large, complex system development projects take several years to execute. Such projects involve hundreds of engineers who develop thousands of parts and millions of lines of code. During the course of a project, many design decisions often need to be changed due to the emergence of new information. These changes are often well documented in databases, but due to the complexity of the data, few companies analyze engineering change requests (ECRs) in a comprehensive and structured fashion. ECRs are important in the product development process to enhance a product. The opportunity at hand is that vast amounts of data on industrial changes are captured and stored, yet the present challenge is to systematically retrieve and use them in a purposeful way. This PhD thesis explores the growing need of product developers for data expertise and analysis. Product developers increasingly refer to analytics for improvement opportunities for business processes and products. For this reason, we examined the three components necessary to perform data mining and data analytics: exploring and collecting ECR data, collecting domain knowledge for ECR information needs, and applying mathematical tools for solution design and implementation. Results from extensive interviews generated a list of engineering information needs related to ECRs. When preparing for data mining, it is crucial to understand how the end user or the domain expert will and wants to use the extractable information. Results also show industrial case studies where complex product development processes are modeled using the Markov chain Design Structure Matrix to analyze and compare ECR sequences in four projects. In addition, the study investigates how advanced searches based on natural language processing techniques and clustering within engineering databases can help identify related content in documents.
This can help product developers conduct better pre-studies, as they can now evaluate a short list of the most relevant historical documents that might contain valuable knowledge. The main contribution is an application of data mining algorithms to a novel industrial domain; the state of the art is more advanced for the algorithms themselves than for their application in this domain. These proposed procedures and methods were evaluated using industrial data to show patterns for process improvements and to cluster similar information. New information derived with data mining and analytics can help product developers make better decisions for new designs or re-designs of processes and products to ensure robust and superior products.
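The thesis's Markov chain Design Structure Matrix analysis is not reproduced here; as a minimal sketch of the underlying step, with invented ECR sequences, here is how maximum-likelihood Markov transition probabilities can be estimated from observed sequences of change-request states (e.g. which subsystem each successive ECR touched).

```python
from collections import Counter, defaultdict

# Hypothetical ECR state sequences: which subsystem each successive
# change request in a project touched.
sequences = [
    ["engine", "chassis", "engine", "software"],
    ["engine", "software", "software"],
    ["chassis", "engine", "software"],
]

def transition_matrix(seqs):
    """Maximum-likelihood Markov transition probabilities: count observed
    state-to-state transitions, then normalize each row."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

P = transition_matrix(sequences)
print(P["engine"])  # {'chassis': 0.25, 'software': 0.75}
```

Arranging such a transition matrix with rows and columns in subsystem order gives a DSM-like view; comparing the matrices of different projects is one way to compare their ECR sequences, under the assumptions stated above.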