Physical Plan Instrumentation in Databases: Mechanisms and Applications
Database management systems (DBMSs) are designed to compile SQL queries into physical plans that, when executed, produce the query results. Building on this functionality, an ever-increasing number of application domains (e.g., provenance management, online query optimization, physical database design, interactive data profiling, monitoring, and interactive data visualization) seek to operate on how queries are executed by the DBMS, for purposes ranging from debugging and data explanation to optimization and monitoring. Unfortunately, DBMSs provide little, if any, support for developing this class of important applications. As a result, database application developers and database system architects either rewrite the database internals in ad-hoc ways; work around the SQL interface, if possible, with inevitable performance penalties; or even build new databases from scratch merely to express and optimize their domain-specific application logic over how queries are executed.
To address this problem in a principled manner, this dissertation introduces a prototype DBMS, namely Smoke, that exposes instrumentation mechanisms in the form of a framework that allows external applications to manipulate physical plans. Intuitively, a physical plan is the underlying representation that a DBMS uses to encode how a SQL query will be executed, and providing instrumentation mechanisms at this representation level allows applications to express and optimize their logic over how queries are executed.
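As a rough illustration of the idea (not Smoke's actual API; all class and function names here are hypothetical), a physical plan can be modeled as a tree of iterator-style operators, and instrumentation as wrapping any operator so that external application logic observes the tuples flowing through it:

```python
# Hypothetical sketch of physical-plan instrumentation: a plan is a tree of
# iterator-style operators, and an Instrumented wrapper exposes the tuples
# flowing through any point of the plan to external logic.

class PhysicalOp:
    def __iter__(self):
        raise NotImplementedError

class Scan(PhysicalOp):
    """Leaf operator: emits rows from a base table."""
    def __init__(self, rows):
        self.rows = rows
    def __iter__(self):
        yield from self.rows

class Filter(PhysicalOp):
    """Emits only the rows of its child that satisfy a predicate."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def __iter__(self):
        for row in self.child:
            if self.pred(row):
                yield row

class Instrumented(PhysicalOp):
    """Wraps any operator and invokes a callback on every tuple it emits."""
    def __init__(self, child, on_tuple):
        self.child, self.on_tuple = child, on_tuple
    def __iter__(self):
        for row in self.child:
            self.on_tuple(row)
            yield row

seen = []
plan = Instrumented(Filter(Scan([1, 2, 3, 4]), lambda r: r % 2 == 0), seen.append)
print(list(plan), seen)  # the callback captured exactly the tuples the filter emitted
```

An application such as provenance capture would register a callback like this at the operators it cares about, instead of re-implementing the executor or re-running rewritten SQL.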
Having such an instrumentation-enabled DBMS in place, we then consider how to express and optimize applications that base their logic on how queries are executed. To best demonstrate the expressive and optimization power of instrumentation-enabled DBMSs, we express and optimize applications across several important domains, including provenance management, interactive data visualization, interactive data profiling, physical database design, online query optimization, and query discovery. Expressivity-wise, we show that Smoke can express known techniques, introduce novel semantics on known techniques, and introduce new techniques across domains. Performance-wise, we show, case by case, that Smoke is on par with or up to several orders of magnitude faster than state-of-the-art imperative and declarative implementations of important applications across domains.
As such, we believe our contributions provide evidence for, and form the basis of, a class of instrumentation-enabled DBMSs designed to express and optimize applications across important domains whose core logic operates over how queries are executed by DBMSs.
Semantic Systems. In the Era of Knowledge Graphs
This open access book constitutes the refereed proceedings of the 16th International Conference on Semantic Systems, SEMANTiCS 2020, held in Amsterdam, The Netherlands, in September 2020. The conference was held virtually due to the COVID-19 pandemic.
Source associations for the virtual observatory
This thesis presents investigations into different methods of associating astronomical sources detected at different wavelengths, and describes the development of a tool for AstroGrid that enables users to associate sources in a fully automated manner.

At present, when associating sources at different wavelengths, it is common for astronomers to select IDs by eye, or at least to verify probabilistically determined counterparts by eye. With the new trend for large surveys this is no longer practical, as datasets may contain millions of objects. Previous work on association algorithms has focussed on case-specific techniques which typically only match a restricted number of objects with counterparts, and often only those with small positional errors. This thesis addresses the issue that these methods are not adequate in the general case, where datasets may be enormous and source error ellipses large. In such situations, matching based purely on spatial proximity is deficient, since there may be hundreds of candidate counterparts within a source error ellipse. We therefore investigate the likelihood ratio as an association technique, as this allows the incorporation of data such as object magnitudes as well as positions, and prove its applicability in the difficult association case of the FIRBACK survey. We also develop the application of a machine learning technique, the EM algorithm, and test it against the likelihood ratio method. We determine that it may be effectively applied to find IDs in surveys with a magnitude distribution of unrestricted shape. These different association methods are successfully developed into a tool for AstroGrid that enables users to associate sources in a fully automated manner.

We describe a detailed analysis of the likelihood ratio method through the association of a population of far-infrared sources from the FIRBACK survey with optical counterparts from the INT Wide Field Survey. This is a challenging association problem, since the far-infrared sources have large positional errors due to the poor resolution of the instrument and the relatively long wavelength. We compare two different variants of the likelihood ratio method in detail, and use the better one to derive optical counterparts for the far-infrared sources. This proves the applicability of the likelihood ratio method in the case of large source error ellipses, where there are numerous candidates to choose between.

The scientific benefits of associating multiwavelength data are illustrated by deducing, for the first time, the nature of the FIRBACK sources. These are identified with not only an optical counterpart but also with data at up to nine further wavelengths. Their properties are examined through the comparison of their observed spectral energy distributions with predictions from radiative transfer models which simulate the emission from both cirrus and starburst components. The far-infrared sources are found to be 80 per cent star-bursting galaxies, with their starburst component at a high optical depth.

It is a common situation in astronomy to wish to investigate a source population for which we have no prior knowledge of the properties of the source counterparts expected at another wavelength, for example through observations with a new instrument. In such a case it is necessary to estimate the counterpart magnitude distribution in order to use the likelihood ratio association method. Since little was known about the FIRBACK sources prior to our research, their optical magnitude distribution had to be estimated in order to assign them optical IDs. To alleviate this problem, we develop a new astronomical application of a machine learning technique known as the EM algorithm, which is used in the field of informatics. This is able to `learn' the source magnitude distribution iteratively. The algorithm is tested on the FIRBACK sources and also on radio sources from the HI Parkes All-Sky Survey (HIPASS) catalogue, and is found to be a very effective association method in the HIPASS case, where the background magnitude distribution is of unrestricted shape.

We use the FIRBACK survey far-infrared sources as a test-bed for several different association methods. The value of bringing together multiwavelength observations is illustrated through the insights that are gained into the nature of the sources. This work culminates in the development of an association tool for AstroGrid, the UK Virtual Observatory project, offering three different association methods: the Poisson method, the likelihood ratio method and the EM algorithm. This tool is able to return a user-specified number of possible counterparts along with a figure of merit for their match with a source. We also implement the AstroDAS system to store resulting object pairs in a database for future use, which prevents the same cross-association tasks being carried out numerous times by different users. The Virtual Observatory aims to link diverse datasets from across the globe, and the extra knowledge available from these datasets may only be extracted after establishing links between detections in them. Our AstroGrid association tool is therefore vital to the success of the Virtual Observatory.
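The likelihood ratio technique described above weighs positional offset against magnitude information. A minimal sketch of its standard form, LR(r, m) = q(m) f(r) / n(m), assuming a circular Gaussian positional error; the function name and the numerical values are illustrative, not taken from the thesis:

```python
import math

def likelihood_ratio(r, q_m, n_m, sigma):
    """LR = q(m) * f(r) / n(m) for one candidate counterpart.

    r     : positional offset from the source (arcsec)
    q_m   : probability that the true counterpart has the candidate's magnitude
    n_m   : surface density of background objects of that magnitude (arcsec^-2)
    sigma : combined positional error, assumed circular Gaussian (arcsec)
    """
    f_r = math.exp(-r**2 / (2 * sigma**2)) / (2 * math.pi * sigma**2)
    return q_m * f_r / n_m

# A nearby candidate with a plausible magnitude outscores a distant one:
near = likelihood_ratio(r=2.0, q_m=0.1, n_m=0.001, sigma=5.0)
far = likelihood_ratio(r=12.0, q_m=0.1, n_m=0.001, sigma=5.0)
```

With a large error ellipse (large sigma), many candidates fall inside it, so the magnitude-dependent terms q(m) and n(m), rather than distance alone, discriminate between them; this is the situation the FIRBACK association faces.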
Ranking for Scalable Information Extraction
Information extraction systems are complex software tools that discover structured information in natural language text. For instance, an information extraction system trained to extract tuples for an Occurs-in(Natural Disaster, Location) relation may extract the tuple ⟨tsunami, Hawaii⟩ from the sentence: "A tsunami swept the coast of Hawaii." Having information in structured form enables more sophisticated querying and data mining than is possible over the natural language text. Unfortunately, information extraction is a time-consuming task. For example, a state-of-the-art information extraction system to extract Occurs-in tuples may take up to two hours to process only 1,000 text documents. Since document collections routinely contain millions of documents or more, improving the efficiency and scalability of the information extraction process over these collections is critical. As a significant step towards this goal, this dissertation presents approaches for (i) enabling the deployment of efficient information extraction systems and (ii) scaling the information extraction process to large volumes of text.
To enable the deployment of efficient information extraction systems, we have developed two crucial building blocks for this task. As a first contribution, we have created REEL, a toolkit to easily implement, evaluate, and deploy full-fledged relation extraction systems. REEL, in contrast to existing toolkits, effectively modularizes the key components involved in relation extraction systems and can integrate other long-established text processing and machine learning toolkits. To define a relation extraction system for a new relation and text collection, users only need to specify the desired configuration, which makes REEL a powerful framework for both research and application building. As a second contribution, we have addressed the problem of building representative extraction task-specific document samples from collections, a step often required by approaches for efficient information extraction. Specifically, we devised fully automatic document sampling techniques for information extraction that can produce better-quality document samples than the state-of-the-art sampling strategies; furthermore, our techniques are substantially more efficient than the existing alternative approaches.
To scale the information extraction process to large volumes of text, we have developed approaches that address the efficiency and scalability of the extraction process by focusing the extraction effort on the collections, documents, and sentences worth processing for a given extraction task. For collections, we have studied both (adaptations of) state-of-the-art approaches for estimating the number of documents in a collection that lead to the extraction of tuples, as well as information extraction-specific approaches. Using these estimates, we can identify the collections worth processing and ignore the rest, for efficiency. For documents, we have developed an adaptive document ranking approach that relies on learning-to-rank techniques to prioritize the documents that are likely to produce tuples for an extraction task of choice. Our approach revises the (learned) ranking decisions periodically as the extraction process progresses and new characteristics of the useful documents are revealed. Finally, for sentences, we have developed an approach based on the sparse group selection problem that identifies sentences, modeled as groups of words, that best characterize the extraction task. Beyond identifying sentences worth processing, our approach aims at selecting sentences that lead to the extraction of unseen, novel tuples. Our approaches are lightweight and efficient, and dramatically improve the efficiency and scalability of the information extraction process. We can often complete the extraction task by focusing on just a very small fraction of the available text, namely, the text that contains relevant information for the extraction task at hand. Our approaches therefore constitute a substantial step towards efficient and scalable information extraction over large volumes of text.
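The adaptive document ranking idea, prioritizing documents likely to yield tuples and revising the ranking as extraction progresses, can be sketched roughly as follows. The word-overlap scorer stands in for the learning-to-rank model of the dissertation, and the extractor, seed vocabulary, and all names are illustrative:

```python
def score(doc, useful_words):
    """Toy scorer: overlap with vocabulary seen so far in useful documents."""
    return len(set(doc.lower().split()) & useful_words)

def extract_ranked(docs, extractor, seed_words, batch_size=2):
    """Process documents in ranked batches, re-ranking after each batch."""
    useful_words = set(seed_words)
    tuples, pending = [], list(docs)
    while pending:
        # Revise the ranking with everything learned so far.
        pending.sort(key=lambda d: score(d, useful_words), reverse=True)
        batch, pending = pending[:batch_size], pending[batch_size:]
        for doc in batch:
            found = extractor(doc)
            tuples.extend(found)
            if found:  # a useful document refines the model
                useful_words |= set(doc.lower().split())
    return tuples

def toy_extractor(doc):
    # Toy Occurs-in extractor: fires only on one hard-coded pattern.
    if "tsunami" in doc.lower() and "Hawaii" in doc:
        return [("tsunami", "Hawaii")]
    return []

docs = ["stock prices rose", "A tsunami swept the coast of Hawaii", "weather report"]
tuples = extract_ranked(docs, toy_extractor, {"tsunami", "flood"})
```

The useful documents surface first, so an extraction budget spent on only the top of the ranking already recovers most of the tuples; the real system replaces both the scorer and the extractor with learned components.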
Using interpretable machine learning for indoor CO₂ level prediction and occupancy estimation
Management and monitoring of rooms’ environmental conditions is a good step towards achieving energy efficiency and a healthy indoor environment. However, studies indicate that some of the current methods used in environmental room monitoring face challenges such as high cost and lack of privacy. As a result, there is a need for a method that is simpler, reliable, affordable and free of privacy issues. Therefore, the aims of this thesis were: (i) to predict future CO₂ levels using environmental sensor data, (ii) to determine room occupancy using environmental sensor data and (iii) to create a prototype dashboard for possible future room management based on the models developed for room occupancy and CO₂ prediction. The machine learning methods used included a Gradient Boosting ensemble model (GB), a Long Short-Term Memory recurrent neural network model (LSTM) and the Facebook Prophet model for time series (Prophet). The sensor data were recorded at three different office locations (two test sites at a university and a real-world commercial office in Glasgow, Scotland, UK). The results of the analysis show that with the LSTM method a Root Mean Square Error (RMSE, the absolute fit of the model results to the observed data) of 0.0682 could be achieved for two-hour-interval CO₂ prediction, and with GB an accuracy of 82% could be achieved for the proposed room occupancy estimation. Furthermore, as model understanding was raised as a key issue, interpretable machine learning methods (SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME)) were used to interpret the room occupancy results obtained by the GB model. In addition, a dashboard was designed and prototyped to show room environmental data, predicted CO₂ levels and estimated room occupancy, based on what the sensor data and models might provide for people managing rooms in different settings. The proposed dashboard was evaluated by interested participants, and their responses show that it could potentially offer inputs to building management towards the control of heating, ventilation and air-conditioning (HVAC) systems. This in turn could lead to improved energy efficiency, better planning of shared spaces in buildings, potentially reduced energy and operational costs, and improved environmental conditions for room occupants, potentially leading to improved health, reduced risks, enhanced comfort and improved productivity. It is advised that further studies be conducted at multiple locations to demonstrate the generalisation of the results of the proposed model. In addition, the end benefits of the model could be assessed by applying its outputs to enhance the control of HVAC systems, room management systems and safety systems. The health and productivity of the occupants could be monitored in detail to identify whether the resulting environmental improvements deliver improvements in health and productivity. The findings of this research contribute new knowledge that could be used to achieve
reliable results in room occupancy estimation using machine learning approaches.
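The RMSE figure quoted above measures the absolute fit of predictions to observations. As a minimal sketch of the metric itself (the readings below are made-up illustrative values, not the thesis data):

```python
import math

def rmse(predicted, observed):
    """Root Mean Square Error: absolute fit of predictions to observations."""
    assert len(predicted) == len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                     / len(predicted))

# Hypothetical normalised CO2 readings, on a scale similar to the reported 0.0682.
observed = [0.40, 0.42, 0.45, 0.43]
predicted = [0.41, 0.40, 0.47, 0.44]
print(round(rmse(predicted, observed), 4))  # → 0.0158
```

Because RMSE is computed on normalised readings here, its scale depends on the normalisation chosen; comparing models only makes sense when they are evaluated on the same scale.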