Physical Plan Instrumentation in Databases: Mechanisms and Applications
Database management systems (DBMSs) are designed to compile SQL queries into physical plans that, when executed, produce the query results. Building on this functionality, an ever-increasing number of application domains (e.g., provenance management, online query optimization, physical database design, interactive data profiling, monitoring, and interactive data visualization) seek to operate on how queries are executed by the DBMS, for purposes ranging from debugging and data explanation to optimization and monitoring. Unfortunately, DBMSs provide little, if any, support for developing this class of important applications. As a result, database application developers and database system architects either rewrite the database internals in ad-hoc ways; work around the SQL interface, if possible, with inevitable performance penalties; or even build new databases from scratch merely to express and optimize their domain-specific application logic over how queries are executed.
To address this problem in a principled manner, this dissertation introduces a prototype DBMS, namely Smoke, that exposes instrumentation mechanisms in the form of a framework that allows external applications to manipulate physical plans. Intuitively, a physical plan is the underlying representation that a DBMS uses to encode how a SQL query will be executed, and providing instrumentation mechanisms at this representation level allows applications to express and optimize their logic over how queries are executed.
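As a rough illustration of the idea (not Smoke's actual API; all class and function names here are hypothetical), a physical plan can be modeled as a tree of iterator-style operators, and instrumentation as wrapping any operator so that external application logic observes the tuples flowing through it:

```python
# Hypothetical sketch of physical-plan instrumentation: a plan is a tree of
# iterator-style operators, and an Instrumented wrapper exposes the tuples
# flowing through any point of the plan to external logic.

class PhysicalOp:
    def __iter__(self):
        raise NotImplementedError

class Scan(PhysicalOp):
    """Leaf operator: emits rows from a base table."""
    def __init__(self, rows):
        self.rows = rows
    def __iter__(self):
        yield from self.rows

class Filter(PhysicalOp):
    """Emits only the rows of its child that satisfy a predicate."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def __iter__(self):
        for row in self.child:
            if self.pred(row):
                yield row

class Instrumented(PhysicalOp):
    """Wraps any operator and invokes a callback on every tuple it emits."""
    def __init__(self, child, on_tuple):
        self.child, self.on_tuple = child, on_tuple
    def __iter__(self):
        for row in self.child:
            self.on_tuple(row)
            yield row

seen = []
plan = Instrumented(Filter(Scan([1, 2, 3, 4]), lambda r: r % 2 == 0), seen.append)
print(list(plan), seen)  # the callback captured exactly the tuples the filter emitted
```

An application such as provenance capture would register a callback like this at the operators it cares about, instead of re-implementing the executor or re-running rewritten SQL.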
Having such an instrumentation-enabled DBMS in place, we then consider how to express and optimize applications that base their logic on how queries are executed. To best demonstrate the expressive and optimization power of instrumentation-enabled DBMSs, we express and optimize applications across several important domains, including provenance management, interactive data visualization, interactive data profiling, physical database design, online query optimization, and query discovery. Expressivity-wise, we show that Smoke can express known techniques, introduce novel semantics on known techniques, and introduce new techniques across domains. Performance-wise, we show, case by case, that Smoke is on par with or up to several orders of magnitude faster than state-of-the-art imperative and declarative implementations of important applications across domains.
As such, we believe our contributions provide evidence for, and form the basis of, a class of instrumentation-enabled DBMSs designed to express and optimize applications across important domains whose core logic operates over how queries are executed by DBMSs.
Semantic Systems. In the Era of Knowledge Graphs
This open access book constitutes the refereed proceedings of the 16th International Conference on Semantic Systems, SEMANTiCS 2020, held in Amsterdam, The Netherlands, in September 2020. The conference was held virtually due to the COVID-19 pandemic.
Source associations for the virtual observatory
This thesis presents investigations into different methods of associating astronomical sources detected at different wavelengths, and describes the development of a tool for AstroGrid that enables users to associate sources in a fully automated manner.

At present, when associating sources at different wavelengths, it is common for astronomers to select IDs by eye, or at least to verify probabilistically determined counterparts by eye. With the new trend for large surveys this is no longer practical, as datasets may contain millions of objects. Previous work on association algorithms has focussed on case-specific techniques which typically only match a restricted number of objects with counterparts, and often only those with small positional errors. This thesis addresses the issue that these methods are not adequate in the general case, where datasets may be enormous and source error ellipses large. In such situations, matching based purely on spatial proximity is deficient, since there may be hundreds of candidate counterparts within a source error ellipse. We therefore investigate the likelihood ratio as an association technique, as this allows the incorporation of data such as object magnitudes as well as positions, and prove its applicability in the difficult association case of the FIRBACK survey. We also develop the application of a machine learning technique, the EM algorithm, and test it against the likelihood ratio method. We determine that it may be effectively applied to find IDs in surveys with a magnitude distribution of unrestricted shape. These different association methods are successfully developed into a tool for AstroGrid that enables users to associate sources in a fully automated manner.

We describe a detailed analysis of the likelihood ratio method through the association of a population of far-infrared sources from the FIRBACK survey with optical counterparts from the INT Wide Field Survey. This is a challenging association problem, since the far-infrared sources have large positional errors due to the poor resolution of the instrument and the relatively long wavelength. We compare two different variants of the likelihood ratio method in detail, and use the better one to derive optical counterparts for the far-infrared sources. This proves the applicability of the likelihood ratio method in the case of large source error ellipses, where there are numerous candidates to choose between.

The scientific benefits of associating multiwavelength data are illustrated by deducing, for the first time, the nature of the FIRBACK sources. These are identified with not only an optical counterpart but also with data at up to nine further wavelengths. Their properties are examined through the comparison of their observed spectral energy distributions with predictions from radiative transfer models which simulate the emission from both cirrus and starburst components. The far-infrared sources are found to be 80 per cent star-bursting galaxies, with their starburst component at a high optical depth.

It is a common situation in astronomy to wish to investigate a source population for which we have no prior knowledge of the properties of the source counterparts expected at another wavelength, for example through observations with a new instrument. In such a case it is necessary to estimate the counterpart magnitude distribution in order to use the likelihood ratio association method. Since little was known about the FIRBACK sources prior to our research, their optical magnitude distribution had to be estimated in order to assign them optical IDs. To alleviate this problem, we develop a new astronomical application of a machine learning technique known as the EM algorithm, which is used in the field of informatics. This is able to `learn' the source magnitude distribution iteratively. The algorithm is tested on the FIRBACK sources and also on radio sources from the HI Parkes All-Sky Survey (HIPASS) catalogue, and is found to be a very effective association method in the HIPASS case, where the background magnitude distribution is of unrestricted shape.

We use the FIRBACK survey far-infrared sources as a test-bed for several different association methods. The value of bringing together multiwavelength observations is illustrated through the insights that are gained into the nature of the sources. This work culminates in the development of an association tool for AstroGrid, the UK Virtual Observatory project, offering three different association methods: the Poisson method, the likelihood ratio method and the EM algorithm. This tool is able to return a user-specified number of possible counterparts along with a figure of merit for their match with a source. We also implement the AstroDAS system to store resulting object pairs in a database for future use, which prevents the same cross-association tasks being carried out numerous times by different users. The Virtual Observatory aims to link diverse datasets from across the globe, and the extra knowledge available from these datasets may only be extracted after establishing links between detections in them. Our AstroGrid association tool is therefore vital to the success of the Virtual Observatory.
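The likelihood ratio technique described above weighs positional offset against magnitude information. A minimal sketch of its standard form, LR(r, m) = q(m) f(r) / n(m), assuming a circular Gaussian positional error; the function name and the numerical values are illustrative, not taken from the thesis:

```python
import math

def likelihood_ratio(r, q_m, n_m, sigma):
    """LR = q(m) * f(r) / n(m) for one candidate counterpart.

    r     : positional offset from the source (arcsec)
    q_m   : probability that the true counterpart has the candidate's magnitude
    n_m   : surface density of background objects of that magnitude (arcsec^-2)
    sigma : combined positional error, assumed circular Gaussian (arcsec)
    """
    f_r = math.exp(-r**2 / (2 * sigma**2)) / (2 * math.pi * sigma**2)
    return q_m * f_r / n_m

# A nearby candidate with a plausible magnitude outscores a distant one:
near = likelihood_ratio(r=2.0, q_m=0.1, n_m=0.001, sigma=5.0)
far = likelihood_ratio(r=12.0, q_m=0.1, n_m=0.001, sigma=5.0)
```

With a large error ellipse (large sigma), many candidates fall inside it, so the magnitude-dependent terms q(m) and n(m), rather than distance alone, discriminate between them; this is the situation the FIRBACK association faces.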
Ranking for Scalable Information Extraction
Information extraction systems are complex software tools that discover structured information in natural language text. For instance, an information extraction system trained to extract tuples for an Occurs-in(Natural Disaster, Location) relation may extract the tuple ⟨tsunami, Hawaii⟩ from the sentence: "A tsunami swept the coast of Hawaii." Having information in structured form enables more sophisticated querying and data mining than is possible over the natural language text. Unfortunately, information extraction is a time-consuming task. For example, a state-of-the-art information extraction system to extract Occurs-in tuples may take up to two hours to process only 1,000 text documents. Since document collections routinely contain millions of documents or more, improving the efficiency and scalability of the information extraction process over these collections is critical. As a significant step towards this goal, this dissertation presents approaches for (i) enabling the deployment of efficient information extraction systems and (ii) scaling the information extraction process to large volumes of text.
To enable the deployment of efficient information extraction systems, we have developed two crucial building blocks for this task. As a first contribution, we have created REEL, a toolkit to easily implement, evaluate, and deploy full-fledged relation extraction systems. REEL, in contrast to existing toolkits, effectively modularizes the key components involved in relation extraction systems and can integrate other long-established text processing and machine learning toolkits. To define a relation extraction system for a new relation and text collection, users only need to specify the desired configuration, which makes REEL a powerful framework for both research and application building. As a second contribution, we have addressed the problem of building representative extraction task-specific document samples from collections, a step often required by approaches for efficient information extraction. Specifically, we devised fully automatic document sampling techniques for information extraction that can produce better-quality document samples than the state-of-the-art sampling strategies; furthermore, our techniques are substantially more efficient than the existing alternative approaches.
To scale the information extraction process to large volumes of text, we have developed approaches that address the efficiency and scalability of the extraction process by focusing the extraction effort on the collections, documents, and sentences worth processing for a given extraction task. For collections, we have studied both (adaptations of) state-of-the-art approaches for estimating the number of documents in a collection that lead to the extraction of tuples, as well as information extraction-specific approaches. Using these estimates, we can identify the collections worth processing and ignore the rest, for efficiency. For documents, we have developed an adaptive document ranking approach that relies on learning-to-rank techniques to prioritize the documents that are likely to produce tuples for an extraction task of choice. Our approach revises the (learned) ranking decisions periodically as the extraction process progresses and new characteristics of the useful documents are revealed. Finally, for sentences, we have developed an approach based on the sparse group selection problem that identifies sentences, modeled as groups of words, that best characterize the extraction task. Beyond identifying sentences worth processing, our approach aims at selecting sentences that lead to the extraction of unseen, novel tuples. Our approaches are lightweight and efficient, and dramatically improve the efficiency and scalability of the information extraction process. We can often complete the extraction task by focusing on just a very small fraction of the available text, namely, the text that contains relevant information for the extraction task at hand. Our approaches therefore constitute a substantial step towards efficient and scalable information extraction over large volumes of text.
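The adaptive document ranking idea, prioritizing documents likely to yield tuples and revising the ranking as extraction progresses, can be sketched roughly as follows. The word-overlap scorer stands in for the learning-to-rank model of the dissertation, and the extractor, seed vocabulary, and all names are illustrative:

```python
def score(doc, useful_words):
    """Toy scorer: overlap with vocabulary seen so far in useful documents."""
    return len(set(doc.lower().split()) & useful_words)

def extract_ranked(docs, extractor, seed_words, batch_size=2):
    """Process documents in ranked batches, re-ranking after each batch."""
    useful_words = set(seed_words)
    tuples, pending = [], list(docs)
    while pending:
        # Revise the ranking with everything learned so far.
        pending.sort(key=lambda d: score(d, useful_words), reverse=True)
        batch, pending = pending[:batch_size], pending[batch_size:]
        for doc in batch:
            found = extractor(doc)
            tuples.extend(found)
            if found:  # a useful document refines the model
                useful_words |= set(doc.lower().split())
    return tuples

def toy_extractor(doc):
    # Toy Occurs-in extractor: fires only on one hard-coded pattern.
    if "tsunami" in doc.lower() and "Hawaii" in doc:
        return [("tsunami", "Hawaii")]
    return []

docs = ["stock prices rose", "A tsunami swept the coast of Hawaii", "weather report"]
tuples = extract_ranked(docs, toy_extractor, {"tsunami", "flood"})
```

The useful documents surface first, so an extraction budget spent on only the top of the ranking already recovers most of the tuples; the real system replaces both the scorer and the extractor with learned components.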
Using interpretable machine learning for indoor CO₂ level prediction and occupancy estimation
Management and monitoring of rooms’ environmental conditions is a good step towards achieving energy efficiency and a healthy indoor environment. However, studies indicate that some of the current methods used in environmental room monitoring face challenges such as high cost and lack of privacy. As a result, there is a need for a method that is simpler, reliable, affordable and free of privacy issues. Therefore, the aims of this thesis were: (i) to predict future CO₂ levels using environmental sensor data, (ii) to determine room occupancy using environmental sensor data and (iii) to create a prototype dashboard for possible future room management based on the models developed for room occupancy and CO₂ prediction. The machine learning methods used included a Gradient Boosting ensemble model (GB), a Long Short-Term Memory recurrent neural network model (LSTM) and the Facebook Prophet model for time series (Prophet). The sensor data were recorded at three different office locations (two test sites at a university and a real-world commercial office in Glasgow, Scotland, UK). The results of the analysis show that with the LSTM method a Root Mean Square Error (RMSE, the absolute fit of the model results to the observed data) of 0.0682 could be achieved for two-hour-interval CO₂ prediction, and with GB an accuracy of 82% could be achieved for the proposed room occupancy estimation. Furthermore, as model understanding was raised as a key issue, interpretable machine learning methods (SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME)) were used to interpret the room occupancy results obtained by the GB model. In addition, a dashboard was designed and prototyped to show room environmental data, predicted CO₂ levels and estimated room occupancy, based on what the sensor data and models might provide for people managing rooms in different settings. The proposed dashboard was evaluated by interested participants, and their responses show that it could potentially offer inputs to building management towards the control of heating, ventilation and air-conditioning (HVAC) systems. This in turn could lead to improved energy efficiency, better planning of shared spaces in buildings, potentially reduced energy and operational costs, and improved environmental conditions for room occupants, potentially leading to improved health, reduced risks, enhanced comfort and improved productivity. It is advised that further studies be conducted at multiple locations to demonstrate the generalisation of the results of the proposed model. In addition, the end benefits of the model could be assessed by applying its outputs to enhance the control of HVAC systems, room management systems and safety systems. The health and productivity of the occupants could be monitored in detail to identify whether the resulting environmental improvements deliver improvements in health and productivity. The findings of this research contribute new knowledge that could be used to achieve
reliable results in room occupancy estimation using machine learning approaches.
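The RMSE figure quoted above measures the absolute fit of predictions to observations. As a minimal sketch of the metric itself (the readings below are made-up illustrative values, not the thesis data):

```python
import math

def rmse(predicted, observed):
    """Root Mean Square Error: absolute fit of predictions to observations."""
    assert len(predicted) == len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                     / len(predicted))

# Hypothetical normalised CO2 readings, on a scale similar to the reported 0.0682.
observed = [0.40, 0.42, 0.45, 0.43]
predicted = [0.41, 0.40, 0.47, 0.44]
print(round(rmse(predicted, observed), 4))  # → 0.0158
```

Because RMSE is computed on normalised readings here, its scale depends on the normalisation chosen; comparing models only makes sense when they are evaluated on the same scale.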