Towards a query language for annotation graphs
The multidimensional, heterogeneous, and temporal nature of speech databases
raises interesting challenges for representation and query. Recently,
annotation graphs have been proposed as a general-purpose representational
framework for speech databases. Typical queries on annotation graphs require
path expressions similar to those used in semistructured query languages.
However, the underlying model is rather different from the customary graph
models for semistructured data: the graph is acyclic and unrooted, and both
temporal and inclusion relationships are important. We develop a query language
and describe optimization techniques for an underlying relational
representation.
Comment: 8 pages, 10 figures
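The path-and-inclusion queries the abstract describes can be sketched in a few lines. The sketch below is an illustrative toy, not the paper's query language: arcs carry a tier and a label, nodes carry time anchors, and "words inside a noun phrase" becomes a temporal-inclusion filter. All node ids, tiers, and labels are invented.

```python
# A minimal annotation graph: nodes carry time anchors, arcs carry a
# tier ("kind") and a label. Names are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    src: int      # start node id
    dst: int      # end node id
    kind: str     # annotation tier, e.g. "word" or "phrase"
    label: str

# Node id -> time offset in seconds. Annotation graphs allow unanchored
# nodes; here every node is anchored for simplicity.
times = {0: 0.0, 1: 0.4, 2: 0.9, 3: 1.5}

arcs = [
    Arc(0, 1, "word", "the"),
    Arc(1, 2, "word", "dog"),
    Arc(2, 3, "word", "barked"),
    Arc(0, 2, "phrase", "NP"),
]

def included_in(inner: Arc, outer: Arc) -> bool:
    """Temporal inclusion: inner's span lies within outer's span."""
    return times[outer.src] <= times[inner.src] and times[inner.dst] <= times[outer.dst]

# "Words inside an NP" -- the kind of path/inclusion query the language targets.
np = next(a for a in arcs if a.kind == "phrase" and a.label == "NP")
words_in_np = [a.label for a in arcs if a.kind == "word" and included_in(a, np)]
print(words_in_np)  # ['the', 'dog']
```

A relational back-end would store `arcs` and `times` as tables, with the inclusion test becoming a range join, which is where the paper's optimization techniques apply.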
Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management
Spreadsheet software is the tool of choice for interactive ad-hoc data
management, with adoption by billions of users. However, spreadsheets are not
scalable, unlike database systems. On the other hand, database systems, while
highly scalable, do not support interactivity as a first-class primitive. We
are developing DataSpread, to holistically integrate spreadsheets as a
front-end interface with databases as a back-end datastore, providing
scalability to spreadsheets, and interactivity to databases, an integration we
term presentational data management (PDM). In this paper, we make a first step
towards this vision: developing a storage engine for PDM, studying how to
flexibly represent spreadsheet data within a database and how to support and
maintain access by position. We first conduct an extensive survey of
spreadsheet use to motivate our functional requirements for a storage engine
for PDM. We develop a natural set of mechanisms for flexibly representing
spreadsheet data and demonstrate that identifying the optimal representation is
NP-Hard; however, we develop an efficient approach to identify the optimal
representation from an important and intuitive subclass of representations. We
extend these mechanisms with positional access schemes that do not suffer from
cascading update issues, yielding constant-time access and modification
performance. We evaluate these representations on a workload of typical
spreadsheets and spreadsheet operations, providing up to 20% reduction in
storage, and up to 50% reduction in formula evaluation time
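One standard way to get positional access without cascading updates, sketched below, is to store a sortable key per row and mint a midpoint key on insertion, so no existing row is renumbered. This is a generic illustration of the idea, not DataSpread's actual storage engine; real systems use variable-length keys or balanced trees rather than raw floats, which exhaust precision under repeated splitting.

```python
# Positional keys that avoid cascading updates: instead of a dense row
# number (which forces renumbering every later row on insert), store a
# sortable key and pick a midpoint key for insertions.
from bisect import insort

rows = []  # list of (key, value), kept sorted by key

def append_row(value, key):
    insort(rows, (key, value))

def insert_between(lo_key, hi_key, value):
    """Insert without touching any existing row's key."""
    mid = (lo_key + hi_key) / 2
    insort(rows, (mid, value))
    return mid

append_row("row A", 1.0)
append_row("row B", 2.0)
insert_between(1.0, 2.0, "row A.5")   # no other key changes

print([v for _, v in rows])  # ['row A', 'row A.5', 'row B']
```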
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
From social networks to language modeling, the growing scale and importance
of graph data has driven the development of numerous new graph-parallel systems
(e.g., Pregel, GraphLab). By restricting the computation that can be expressed
and introducing new techniques to partition and distribute the graph, these
systems can efficiently execute iterative graph algorithms orders of magnitude
faster than more general data-parallel systems. However, the same restrictions
that enable the performance gains also make it difficult to express many of the
important stages in a typical graph-analytics pipeline: constructing the graph,
modifying its structure, or expressing computation that spans multiple graphs.
As a consequence, existing graph analytics pipelines compose graph-parallel and
data-parallel systems using external storage systems, leading to extensive data
movement and a complicated programming model.
To address these challenges we introduce GraphX, a distributed graph
computation framework that unifies graph-parallel and data-parallel
computation. GraphX provides a small, core set of graph-parallel operators
expressive enough to implement the Pregel and PowerGraph abstractions, yet
simple enough to be cast in relational algebra. GraphX uses a collection of
query optimization techniques such as automatic join rewrites to efficiently
implement these graph-parallel operators. We evaluate GraphX on real-world
graphs and workloads and demonstrate that GraphX achieves performance
comparable to specialized graph computation systems, while outperforming them
in end-to-end graph pipelines. Moreover, GraphX achieves a balance between
expressiveness, performance, and ease of use
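The core GraphX move, casting graph-parallel operators in relational terms, can be shown in miniature: a graph is two "tables" (vertices, edges), and one Pregel-style superstep is a join of edges with source-vertex state followed by a group-by aggregation. Plain Python dicts stand in for the relational operators here; this is a sketch of the idea, not GraphX's API.

```python
# One PageRank superstep expressed relationally: join edges with source
# ranks, group messages by destination, aggregate, then update.
vertices = {"a": 1.0, "b": 1.0, "c": 1.0}          # id -> rank
edges = [("a", "b"), ("a", "c"), ("b", "c")]       # (src, dst)

def superstep(vertices, edges, damping=0.85):
    out_deg = {}
    for s, _ in edges:
        out_deg[s] = out_deg.get(s, 0) + 1
    # "Join": pair each edge with its source vertex's current rank.
    msgs = [(d, vertices[s] / out_deg[s]) for s, d in edges]
    # "Aggregate": group messages by destination and sum.
    summed = {}
    for d, m in msgs:
        summed[d] = summed.get(d, 0.0) + m
    return {v: (1 - damping) + damping * summed.get(v, 0.0) for v in vertices}

vertices = superstep(vertices, edges)
print(vertices["c"] > vertices["a"])  # True: c has in-links, a has none
```

GraphX's join rewrites then eliminate redundant work in exactly this join-aggregate pattern when the vertex set is unchanged across iterations.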
Data integration support for offshore decommissioning waste management
Offshore oil and gas platforms have a design life of about 25 years whereas the techniques and tools
used for managing their data are constantly evolving. Therefore, data captured about platforms during
their lifetimes will be in varying forms. Additionally, due to the many stakeholders involved with a facility
over its life cycle, information representation of its components varies. These challenges make data
integration difficult. Over the years, data integration technology application in the oil and gas industry
has focused on meeting the needs of asset life cycle stages other than decommissioning. This is the
case because most assets are just reaching the end of their design lives.
Currently, limited work has
been done on integrating life cycle data for offshore decommissioning purposes, and reports by industry
stakeholders underscore this need.
This thesis proposes a method for the integration of the common data types relevant in oil and gas
decommissioning. The key features of the method are that it (i) ensures semantic homogeneity using
knowledge representation languages (Semantic Web) and domain specific reference data (ISO 15926);
and (ii) allows stakeholders to continue to use their current applications. Prototypes of the framework
have been implemented using open source software applications and performance measures made.
The work of this thesis has been motivated by the business case of reusing offshore decommissioning
waste items. The framework developed is generic and can be applied whenever there is a need to
integrate and query disparate data involving oil and gas assets. The prototypes presented show how
the data management challenges associated with assessing the suitability of decommissioned offshore
facility items for reuse can be addressed. The performance of the prototypes show that significant time
and effort is saved compared to the state-of-the-art solution. The ability to do this effectively and
efficiently during decommissioning will advance the oil and gas industry's transition toward a
circular economy and help save on cost
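The semantic-homogeneity step the thesis describes can be illustrated with a toy mapping: two stakeholders record the same equipment under different schemas, and per-source mappings onto a shared reference vocabulary (in the spirit of ISO 15926, though the class and property names below are invented) produce uniform triples that a single query can cover.

```python
# Two stakeholder views of the same pump, integrated via a common
# (subject, predicate, object) vocabulary. All field names are invented.
operator_db = [{"tag": "P-101", "kind": "centrifugal pump", "mass_kg": 850}]
contractor_db = [{"item_id": "P-101", "category": "PUMP", "weight": 850}]

def operator_to_triples(rec):
    return [(rec["tag"], "rdf:type", "ref:Pump"),
            (rec["tag"], "ref:massKg", rec["mass_kg"])]

def contractor_to_triples(rec):
    return [(rec["item_id"], "rdf:type", "ref:Pump"),
            (rec["item_id"], "ref:massKg", rec["weight"])]

triples = set()
for r in operator_db:
    triples.update(operator_to_triples(r))
for r in contractor_db:
    triples.update(contractor_to_triples(r))

# One query over the integrated view: all pumps, regardless of source.
pumps = sorted(s for s, p, o in triples if p == "rdf:type" and o == "ref:Pump")
print(pumps)  # ['P-101'] -- both sources collapse onto one entity
```

In the thesis's framework the mappings live alongside the stakeholders' existing applications, which keep operating on their native schemas.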
Supporting feature-level software maintenance
Software maintenance is the process of modifying a software system to fix defects, improve performance, add new functionality, or adapt the system to a new environment. A maintenance task is often initiated by a bug report or a request for new functionality. Bug reports typically describe problems with incorrect behaviors or functionalities. These behaviors or functionalities are known as features. Even in very well-designed systems, the source code that implements features is often not completely modularized. The delocalized nature of features makes maintaining them challenging. Since maintenance tasks are expressed in terms of features, the goal of this dissertation is to support software maintenance at the feature-level. We focus on two tasks in particular: feature location and impact analysis via feature coupling.
Feature location is the process of identifying the source code that implements a feature, and it is an essential first step to any maintenance task. There are many existing techniques for feature location that incorporate various types of analyses such as static, dynamic, and textual. In this dissertation, we recognize the advantages of leveraging several types of analyses and introduce a new approach to feature location based on combining dynamic analysis, textual analysis, and web mining algorithms applied to software. The use of web mining for feature location is a novel contribution, and we show that our new techniques based on web mining are significantly more effective than the current state of the art.
After using feature location to identify a feature's source code, maintenance can be completed on that feature. Impact analysis should then be performed to revalidate the system and determine which other features may have been affected by the modifications. We define three feature coupling metrics that capture the relationship between features based on structural information, textual information, and their combination.
Our novel feature coupling metrics can be used for impact analysis to quantify the strength of coupling between pairs of features. We performed three empirical studies on open-source software systems to assess the feature coupling metrics and established three major results. First, there is a moderate to strong statistically significant correlation between feature coupling and faults. Second, feature coupling can be used to correctly determine about half of the other features that would be affected by a change to a given feature. Finally, we found that the metrics align with developers' opinions about pairs of features that are actually coupled
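The web-mining angle can be sketched concretely: treat the call graph like a web graph and rank methods with PageRank, so structurally central methods in a feature's execution surface first. The call graph and method names below are made up for illustration; the dissertation combines this structural signal with dynamic traces and textual similarity.

```python
# PageRank over a call graph, the classic web-mining algorithm applied
# to code. Graph edges point from caller to callee.
def pagerank(graph, iters=50, d=0.85):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if outs:
                share = d * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
        rank = new
    return rank

# Call graph from a trace of a hypothetical "save file" feature.
calls = {
    "ui.on_save": ["core.save_document"],
    "ui.on_autosave": ["core.save_document"],
    "core.save_document": ["io.write_bytes"],
    "io.write_bytes": [],
}
ranked = sorted(pagerank(calls).items(), key=lambda kv: -kv[1])
print([name for name, _ in ranked][:2])
```

The methods every path funnels through rank highest, which matches the intuition that they are central to the feature's implementation.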
Facilitating design learning through faceted classification of in-service information
The maintenance and service records collected and maintained by engineering companies are a useful
resource for the ongoing support of products. Such records are typically semi-structured and contain
key information such as a description of the issue and the product affected. It is suggested that further
value can be realised from the collection of these records for indicating recurrent and systemic issues
which may not have been apparent previously. This paper presents a faceted classification approach to
organise the information collection that might enhance retrieval and also facilitate learning from in-service
experiences. The faceted classification may help to expedite responses to urgent in-service issues as
well as to allow for patterns and trends in the records to be analysed, either automatically using suitable
data mining algorithms or by manually browsing the classification tree. The paper describes the application
of the approach to aerospace in-service records, where the potential for knowledge discovery is
demonstrated
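A faceted classification of the kind described reduces, at its simplest, to one inverted index per facet with queries intersecting facet value sets. The facet names and records below are invented; real aerospace in-service records would carry many more fields.

```python
# A minimal faceted index over semi-structured service records: each
# (facet, value) pair maps to the set of matching record ids.
from collections import defaultdict

records = [
    {"id": 1, "product": "engine-A", "component": "fuel pump", "symptom": "leak"},
    {"id": 2, "product": "engine-A", "component": "fuel pump", "symptom": "vibration"},
    {"id": 3, "product": "engine-B", "component": "fuel pump", "symptom": "leak"},
]

index = defaultdict(set)
for r in records:
    for facet in ("product", "component", "symptom"):
        index[(facet, r[facet])].add(r["id"])

def query(**facets):
    """Intersect the record sets of every requested facet value."""
    sets = [index[(f, v)] for f, v in facets.items()]
    return set.intersection(*sets) if sets else set()

# Recurrent-issue check: fuel-pump leaks across the whole fleet.
print(sorted(query(component="fuel pump", symptom="leak")))  # [1, 3]
```

Counting the sizes of these intersections over time is one simple way to surface the recurrent and systemic issues the paper targets.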
On-premise containerized, light-weight software solutions for Biomedicine
Bioinformatics software systems are critical tools for analysing large-scale biological
data, but their design and implementation can be challenging due to the need for reliability, scalability, and performance. This thesis investigates the impact of several
software approaches on the design and implementation of bioinformatics software
systems. These approaches include software patterns, microservices, distributed
computing, containerisation and container orchestration. The research focuses on
understanding how these techniques affect bioinformatics software systems' reliability, scalability, performance, and efficiency. Furthermore, this research highlights
the challenges and considerations involved in their implementation. This study also
examines potential solutions for implementing container orchestration in bioinformatics research teams with limited resources and the challenges of using container
orchestration. Additionally, the thesis considers microservices and distributed computing and how these can be optimised in the design and implementation process to
enhance the productivity and performance of bioinformatics software systems. The
research was conducted using a combination of software development, experimentation, and evaluation. The results show that implementing software patterns can
significantly improve the code accessibility and structure of bioinformatics software
systems. Microservices and containerisation likewise enhanced system reliability, scalability, and performance. Additionally, the study indicates that adopting
advanced software engineering practices, such as model-driven design and container
orchestration, can facilitate efficient and productive deployment and management of
bioinformatics software systems, even for researchers with limited resources. Overall, we develop a software system integrating all our findings. Our proposed system
demonstrated the ability to address challenges in bioinformatics. The thesis makes
several key contributions in addressing the research questions surrounding the design,
implementation, and optimisation of bioinformatics software systems using software
patterns, microservices, containerisation, and advanced software engineering principles and practices. Our findings suggest that incorporating these technologies can
significantly improve bioinformatics software systems' reliability, scalability, performance, efficiency, and productivity
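The microservice building block the thesis centres on can be reduced to a single-responsibility WSGI app; containerisation then wraps such a service in an image managed by an orchestrator. The endpoint, path, and GC-content task below are invented examples, not components of the thesis's actual system.

```python
# A single-responsibility bioinformatics microservice as a bare WSGI
# app: one endpoint computing GC content of a DNA sequence. In a
# containerised deployment this module would be the image's entrypoint.
import json

def app(environ, start_response):
    if environ["PATH_INFO"] == "/gc":
        seq = environ.get("QUERY_STRING", "").upper()
        gc = sum(c in "GC" for c in seq) / max(len(seq), 1)
        body = json.dumps({"sequence": seq, "gc": gc}).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]

# Exercise the app directly, without starting a server.
status = []
resp = app({"PATH_INFO": "/gc", "QUERY_STRING": "gattaca"},
           lambda s, h: status.append(s))
print(status[0], resp[0])
```

Keeping each service this small is what makes independent scaling and redeployment, the properties the thesis measures, possible.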
Bulkloading and Maintaining XML Documents
The popularity of XML as an exchange and storage format brings about massive amounts of documents to be stored, maintained and analyzed -- a challenge that traditionally has been tackled with Database Management Systems (DBMS). To open up the content of XML documents to analysis with declarative query languages, efficient bulk loading techniques are necessary.
Database technology has traditionally offered support for these tasks, yet it still falls short of providing efficient automation techniques for the challenges that large collections of XML data raise. As storage back-end, many applications rely on relational databases, which are designed for large data volumes. This paper studies bulk load and update algorithms for XML data stored in relational format and outlines opportunities and problems. We investigate both (1) bulk insertion and deletion and (2) updates in the form of edit scripts, which rely heavily on pointer-chasing techniques that are often considered orthogonal to the algebraic operations relational databases are optimized for. To get the most out of relational database systems, we show that one should make careful use of edit scripts and replace them with bulk operations if more than a very small portion of the database is updated.
We implemented our ideas on top of the Monet Database System and benchmarked their performance
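The relational storage of XML the paper assumes can be sketched with the classic pre/post-order encoding: shredding a document into (pre, post, tag) tuples turns "descendant of" into a range predicate, i.e. a bulk, set-oriented operation rather than pointer chasing. The helper below is a minimal illustration, not the paper's actual schema.

```python
# Shred an XML tree into (pre, post, tag) tuples; under this encoding,
# descendant(x, y) <=> x.pre < y.pre and y.post < x.post.
import xml.etree.ElementTree as ET

doc = ET.fromstring("<a><b><c/></b><d/></a>")

def shred(root):
    rows, pre, post = [], [0], [0]
    def walk(node):
        my_pre = pre[0]; pre[0] += 1
        for child in node:
            walk(child)
        rows.append((my_pre, post[0], node.tag)); post[0] += 1
    walk(root)
    return sorted(rows)

table = shred(doc)
a = next(r for r in table if r[2] == "a")
descendants = [t for p, q, t in table if a[0] < p and q < a[1]]
print(descendants)  # ['b', 'c', 'd']
```

A bulk insert appends freshly shredded tuples in one pass, whereas an edit script would navigate the encoding node by node, which is exactly the trade-off the paper quantifies.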