
    Advances in Learning and Understanding with Graphs through Machine Learning

    Graphs have increasingly become a crucial way of representing large, complex and disparate datasets from a range of domains, including many scientific disciplines. Graphs are particularly useful for capturing complex relationships or interdependencies within or even between datasets, and enable unique insights which are not possible with other data formats. Over recent years, significant improvements have been made in the ability of machine learning approaches to automatically learn from and identify patterns in datasets. However, due to the unique nature of graphs, and the data they are used to represent, employing machine learning with graphs has thus far proved challenging. A review of relevant literature has revealed that key challenges include issues arising with macro-scale graph learning, interpretability of machine-learned representations and a failure to incorporate the temporal dimension present in many datasets. Thus, the work and contributions presented in this thesis primarily investigate how modern machine learning techniques can be adapted to tackle key graph mining tasks, with a particular focus on optimal macro-level representation, interpretability and incorporating temporal dynamics into the learning process. The majority of methods employed are novel approaches centered around using artificial neural networks to learn from graph datasets. Firstly, a novel graph fingerprint technique is devised and shown to outperform established baselines on two different tasks, namely graph comparison and classification. Secondly, it is shown that a mapping can be found between certain topological features and graph embeddings. This, for perhaps the first time, suggests that it is possible that machines are learning something analogous to human knowledge acquisition, thus bringing interpretability to the graph embedding process. Thirdly, in exploring two new models for incorporating temporal information into the graph learning process, it is found that including such information is crucial to predictive performance in certain key tasks, such as link prediction, where state-of-the-art baselines are outperformed. The overall contribution of this work is to provide greater insight into and explanation of the ways in which machine learning with respect to graphs is emerging as a crucial set of techniques for understanding complex datasets. This is important as these techniques can potentially be applied to a broad range of scientific disciplines. The thesis concludes with an assessment of limitations and recommendations for future research.

    Using Hadoop to implement a semantic method for assessing the quality of medical data

    Recent technological advances in modern healthcare have led to a vast wealth of patient data being collected. This data is not only utilised for diagnosis but also has the potential to be used for medical research. However, there are often many errors in datasets used for medical research, with one study finding error rates ranging from 2.3% to 26.9% in a selection of medical research databases. Previous methods of automatically assessing data quality have often relied on threshold rules. These rules can miss errors that require complex domain knowledge to identify correctly. To combat this, a semantic framework has been developed to assess the quality of medical data expressed in the form of linked open data. Early work in this direction revealed that existing triplestores are unable to cope with the large amounts of medical data. In this thesis, a system for storing and querying medical RDF data using Hadoop is developed. This approach enables the creation of an inherently parallel framework that will scale the workload across a cluster. Unlike existing solutions, this framework uses highly optimised joining strategies to enable the completion of eight separate SPARQL queries, comprising over eighty distinct joins, in only two Map/Reduce iterations. Results are presented comparing both naïve and optimised versions of the solution against Jena TDB, demonstrating the superior performance of the Hadoop system and its viability for assessing the quality of medical data.
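    The abstract does not describe the joining strategies themselves. As a rough illustration only, the sketch below simulates in memory how a single Map/Reduce round can join several triple patterns that share a subject, which is one common way to answer a SPARQL-style query over RDF triples. The triples, predicate names and function names are all hypothetical and are not taken from the thesis; no Hadoop API is used.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not data from the thesis.
TRIPLES = [
    ("patient1", "hasHeartRate", "72"),
    ("patient1", "hasAge", "54"),
    ("patient2", "hasHeartRate", "310"),
    ("patient2", "hasAge", "41"),
    ("patient3", "hasAge", "67"),
]


def map_phase(triples, predicates):
    """Emit (subject, (predicate, object)) for triples matching a wanted predicate."""
    for s, p, o in triples:
        if p in predicates:
            yield s, (p, o)


def shuffle(pairs):
    """Group mapper output by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, predicates):
    """Join on the subject: keep subjects that have a value for every predicate."""
    for subject, values in groups.items():
        found = dict(values)
        if all(p in found for p in predicates):
            yield (subject,) + tuple(found[p] for p in predicates)


def join_on_subject(triples, predicates):
    """Run one simulated map/shuffle/reduce round and return sorted result rows."""
    mapped = map_phase(triples, predicates)
    return sorted(reduce_phase(shuffle(mapped), predicates))


if __name__ == "__main__":
    for row in join_on_subject(TRIPLES, ["hasHeartRate", "hasAge"]):
        print(row)
```

    Because every pattern is keyed on the same subject, all of them are resolved in one shuffle. Grouping joins by shared key in this way is one plausible route to answering many joins in few iterations, though whether the thesis does this is an assumption.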