236 research outputs found

    “Feeling Better Than Ever”: Are there any internally consistent responses to the challenge of ‘better than perfect’ human health enhancement technology in a health technology appraisal context?

    Most modern publicly funded national healthcare systems (NHSes) make decisions about which technologies to fund and which to reject through the principles of health economics, and specifically those of ‘Health Technology Appraisal’ (HTA). Current HTA methods implicitly assume that health is anchored between zero (worst possible health) and one (best possible health), but the mathematics underlying HTA does not require this – mathematically, the concept of ‘better than perfect’ health is entirely meaningful. However, to date there are no examples of technologies which actually create ‘better than perfect’ health, and so the problem has never really been considered by health economists. This PhD thesis proposes that human enhancements – technologies which can “modify basic parameters of the human condition, which were previously thought immutable” (Bostrom & Roache, 2008) – might be able to create ‘better than perfect’ health states, and traces some of the implications of this for NHSes under current HTA rules. The key observation is that there is no obvious practical limit to how much better than ‘perfect’ health could get, and therefore a risk that following HTA rules blindly could lead to an NHS becoming ‘Subverted’ – turned into a vehicle for prescribing such enhancements rather than for making sick people healthier. It is therefore critical that NHS regulators – and most specifically HTA agencies – adopt a systematic approach to ‘better than perfect’ healthcare to prevent this outcome if they believe it to be unjust. To begin to develop such a systematic approach, this thesis creates an economic theory of human enhancement and tests whether there is any approach which is consistent with all implications of this theory. The study draws heavily on interdisciplinary readings of the relatively developed bioethics literature on ‘better than perfect’ health and on the health economic methods of health technology appraisal. It is hoped that the results from this approach will inform the response of HTA agencies and other regulators to the emerging issue of ‘better than perfect’ healthcare technologies in a health technology appraisal context, as well as providing meaningful avenues for further research on the same topic.
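
    The anchoring point is easy to see once the standard cost-per-QALY arithmetic is written out. The sketch below is a minimal illustration only, with hypothetical utility weights, costs and durations rather than the thesis's model: nothing in the calculation caps the utility weight at 1, so a ‘better than perfect’ state simply yields more QALYs and looks more cost-effective under a fixed threshold.

```python
# Hedged sketch: standard QALY/ICER arithmetic, illustrating why 'better than
# perfect' health is mathematically meaningful under current HTA methods.
# The utility value 1.5 and all costs/durations are hypothetical.

def qalys(utility: float, years: float) -> float:
    """Quality-adjusted life years: utility weight times duration."""
    return utility * years

def icer(delta_cost: float, delta_qalys: float) -> float:
    """Incremental cost-effectiveness ratio (cost per QALY gained)."""
    return delta_cost / delta_qalys

# A conventional treatment restoring 'perfect' health (utility 1.0)...
baseline = qalys(utility=0.7, years=10)          # untreated health state
treated = qalys(utility=1.0, years=10)           # restored to 'perfect' health

# ...versus a hypothetical enhancement pushing utility above 1.0.
enhanced = qalys(utility=1.5, years=10)

print(icer(delta_cost=20_000, delta_qalys=treated - baseline))   # ~6,667 per QALY
print(icer(delta_cost=20_000, delta_qalys=enhanced - baseline))  # 2,500 per QALY
# Nothing in the arithmetic caps utility at 1.0, so the enhancement looks
# *more* cost-effective under a fixed cost-per-QALY threshold.
```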

    Organising and structuring a visual diary using visual interest point detectors

    As wearable cameras become more popular, researchers are increasingly focusing on novel applications to manage the large volume of data these devices produce. One such application is the construction of a Visual Diary from an individual’s photographs. Microsoft’s SenseCam, a device designed to passively record a Visual Diary covering a typical day of the user wearing the camera, is an example of one such device. The vast quantity of images generated by these devices means that the management and organisation of these collections is not a trivial matter. We believe wearable cameras such as SenseCam will become more popular in the future, and that managing the volume of data generated by these devices is a key issue. Although there is a significant volume of work in the literature in the object detection and recognition and scene classification fields, there is little work in the area of setting detection. Furthermore, few authors have examined the issues involved in analysing extremely large image collections (like a Visual Diary) gathered over a long period of time. An algorithm developed for setting detection should be capable of clustering images captured at the same real-world locations (e.g. in the dining room at home, in front of the computer in the office, in the park, etc.). This requires the selection and implementation of suitable methods to identify visually similar backgrounds in images using their visual features. We present a number of approaches to setting detection based on the extraction of visual interest points from the images, and we analyse the performance of two of the most popular descriptors – Scale Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF). We present an implementation of a Visual Diary application and evaluate its performance via a series of user experiments. Finally, we outline some techniques to allow the Visual Diary to automatically detect new settings, to scale as the image collection continues to grow substantially over time, and to allow the user to generate a personalised summary of their data.
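
    As a concrete illustration of the kind of interest-point signal used for setting detection, the sketch below matches SIFT keypoints between two images with OpenCV and counts the matches that survive Lowe's ratio test. The file names and the 0.75 threshold are illustrative assumptions; the thesis's own pipeline (including SURF and the clustering step) is not reproduced here.

```python
# Hedged sketch: comparing two SenseCam-style images by matching SIFT keypoints,
# the kind of visual-similarity signal a setting-detection clusterer could use.
# Not the thesis's pipeline; file names and the ratio threshold are illustrative.
import cv2

def sift_similarity(path_a: str, path_b: str) -> int:
    """Return the number of 'good' SIFT matches between two images."""
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    _, des_a = sift.detectAndCompute(img_a, None)
    _, des_b = sift.detectAndCompute(img_b, None)

    # Brute-force matcher with Lowe's ratio test to keep distinctive matches only.
    matcher = cv2.BFMatcher()
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    return len(good)

# Images taken in the same setting (e.g. the same office desk) would be expected
# to share many more good matches than images from different settings.
print(sift_similarity("diary_0001.jpg", "diary_0002.jpg"))
```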

    Machine Learning for Auditory Hierarchy

    Coleman, W. (2021). Machine Learning for Auditory Hierarchy. This dissertation is submitted for the degree of Doctor of Philosophy, Technological University Dublin. Audio content is predominantly delivered as a stereo audio file of a static, pre-formed mix. The content creator makes volume, position and effects decisions, generally for presentation over stereo speakers, but ultimately has no control over how the content will be consumed. This leads to a poor listener experience when, for example, a feature film is mixed such that the dialogue sits at a low level relative to the sound effects. Consumers complain that they must turn the volume up to hear the words, but then back down again because the effects are too loud. Addressing this problem requires a television mix optimised for the stereo speakers used in the vast majority of homes, which is not always available.
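
    The mix problem described above can be made concrete by measuring the level difference between a dialogue stem and an effects stem. The sketch below is purely illustrative and is not the thesis's method; it assumes the stems are available as separate WAV files, which is generally not the case for delivered stereo content.

```python
# Hedged sketch: measuring the level difference between a dialogue stem and an
# effects stem as RMS level in dB. Illustrative of the mix problem only;
# the stem file names are hypothetical.
import numpy as np
import soundfile as sf   # assumption: stems are available as separate WAV files

def rms_db(path: str) -> float:
    audio, _ = sf.read(path)
    if audio.ndim > 1:                 # mix stereo down to mono
        audio = audio.mean(axis=1)
    rms = np.sqrt(np.mean(audio ** 2))
    return 20 * np.log10(rms + 1e-12)  # small offset avoids log(0)

dialogue_db = rms_db("dialogue_stem.wav")
effects_db = rms_db("effects_stem.wav")
print(f"dialogue sits {effects_db - dialogue_db:.1f} dB below the effects")
```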

    Towards a human-centric data economy

    Spurred by the widespread adoption of artificial intelligence and machine learning, “data” is becoming a key production factor, comparable in importance to capital, land, or labour in an increasingly digital economy. In spite of an ever-growing demand for third-party data in the B2B market, firms are generally reluctant to share their information. This is due to the unique characteristics of “data” as an economic good (a freely replicable, non-depletable asset holding a highly combinatorial and context-specific value), which move digital companies to hoard and protect their “valuable” data assets, and to integrate across the whole value chain seeking to monopolise the provision of innovative services built upon them. As a result, most of these valuable assets still remain unexploited in corporate silos. This situation is shaping the so-called data economy around a handful of champions, and it is hampering the benefits of a global data exchange on a large scale. Some analysts have estimated the potential value of the data economy at US$2.5 trillion globally by 2025. Not surprisingly, unlocking the value of data has become a central policy of the European Union, which estimated the size of the data economy at €827 billion for the EU27 over the same period. Within the scope of the European Data Strategy, the European Commission is also steering initiatives aimed at identifying relevant cross-industry use cases involving different verticals, and at enabling sovereign data exchanges to realise them. Among individuals, the massive collection and exploitation of personal data by digital firms in exchange for services, often with little or no consent, has raised a general concern about privacy and data protection. Apart from spurring recent legislative developments in this direction, this concern has led some voices to warn against the unsustainability of the existing digital economics (few digital champions, potential negative impact on employment, growing inequality), some of which propose that people be paid for their data in a sort of worldwide data labour market as a potential solution to this dilemma [114, 115, 155]. From a technical perspective, we are far from having the technology and algorithms required to enable such a human-centric data economy. Even its scope is still blurry, and the question about the value of data is, at the least, controversial. Research works from different disciplines have studied the data value chain, different approaches to the value of data, how to price data assets, and novel data marketplace designs. At the same time, complex legal and ethical issues with respect to the data economy have arisen around privacy, data protection, and ethical AI practices. In this dissertation, we start by exploring the data value chain and how entities trade data assets over the Internet. We carry out what is, to the best of our understanding, the most thorough survey of commercial data marketplaces. In this work, we have catalogued and characterised ten different business models, including those of personal information management systems, companies born in the wake of recent data protection regulations and aiming at empowering end users to take control of their data. We have also identified the challenges faced by different types of entities, and what kind of solutions and technology they are using to provide their services. Then we present a first-of-its-kind measurement study that sheds light on the prices of data in the market using a novel methodology.
We study how ten commercial data marketplaces categorise and classify data assets, and which categories of data command higher prices. We also develop classifiers for comparing data products across different marketplaces, and we study the characteristics of the most valuable data assets and the features that specific vendors use to set the price of their data products. Based on this information, and adding data products offered by 33 other data providers, we develop a regression analysis to reveal the features that correlate with the prices of data products. As a result, we also implement the basic building blocks of a novel data pricing tool capable of providing a hint of the market price of a new data product using just its metadata as input. This tool would provide more transparency on the prices of data products in the market, which would help in pricing data assets and in avoiding the inherent price fluctuation of nascent markets. Next we turn to topics related to data marketplace design. In particular, we study how buyers can select and purchase suitable data for their tasks without requiring a priori access to such data in order to make a purchase decision, and how marketplaces can distribute the payoff for a data transaction that combines data from different sources among the corresponding providers, be they individuals or firms. The difficulty of both problems is further exacerbated in a human-centric data economy where buyers have to choose among the data of thousands of individuals, and where marketplaces have to distribute payoffs to thousands of people contributing personal data to a specific transaction. Regarding the selection process, we compare different purchase strategies depending on the level of information available to data buyers at the time of making decisions. A first methodological contribution of our work is proposing a data evaluation stage prior to datasets being selected and purchased by buyers in a marketplace. We show that buyers can significantly improve the performance of the purchasing process simply by being provided with a measurement of the performance of their models when trained by the marketplace on each individual eligible dataset. We design purchase strategies that exploit this functionality in an algorithm we call Try Before You Buy, and our work demonstrates over synthetic and real datasets that it can lead to near-optimal data purchasing with only O(N) evaluations instead of the exponential execution time, O(2^N), needed to calculate the optimal purchase.
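
    A simplified rendition of the ‘Try Before You Buy’ idea is sketched below: the marketplace reports how the buyer's model performs on each candidate dataset in isolation, and the buyer then adds datasets greedily while they keep improving the combined model, which costs O(N) evaluations rather than O(2^N). The function names, the ranking rule and the stopping threshold are illustrative assumptions, not the thesis's exact algorithm.

```python
# Hedged sketch: a simplified 'Try Before You Buy'-style purchase loop. The
# marketplace is assumed to report the buyer's model score when trained on each
# individual candidate dataset (one evaluation per dataset, hence O(N)); the
# buyer then adds datasets in that order while they keep improving the combined
# model. This illustrates the idea only, not the thesis's algorithm.
from typing import Callable, Dict, List

def try_before_you_buy(
    candidates: List[str],
    solo_score: Dict[str, float],              # marketplace-reported per-dataset scores
    evaluate: Callable[[List[str]], float],    # trains/evaluates the buyer's model on a bundle
    min_gain: float = 0.005,
) -> List[str]:
    purchased: List[str] = []
    best = evaluate(purchased)                 # baseline with no purchased data
    # Rank candidates once by their standalone usefulness.
    for name in sorted(candidates, key=lambda c: solo_score[c], reverse=True):
        score = evaluate(purchased + [name])
        if score - best >= min_gain:           # keep the dataset only if it still helps
            purchased.append(name)
            best = score
    return purchased
```

    A greedy pass of this kind performs on the order of N model evaluations after the per-dataset scores are known, versus the 2^N evaluations an exhaustive search over bundles would require.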
With regard to the payoff distribution problem, we focus on computing the relative value of spatio-temporal datasets combined in marketplaces for predicting transportation demand and travel time in metropolitan areas. Using large datasets of taxi rides from Chicago, Porto and New York, we show that the value of data is different for each individual and cannot be approximated by its volume. Our results reveal that even more complex approaches based on the “leave-one-out” value are inaccurate. Instead, more complex and acknowledged notions of value from economics and game theory, such as the Shapley value, need to be employed if one wishes to capture the complex effects of mixing different datasets on the accuracy of forecasting algorithms. However, the Shapley value entails serious computational challenges: its exact calculation requires repetitively training and evaluating every combination of data sources, and hence O(N!) or O(2^N) computational time, which is infeasible for complex models or thousands of individuals. Moreover, our work paves the way to new methods of measuring the value of spatio-temporal data. We identify heuristics, such as entropy or similarity to the average, that show a significant correlation with the Shapley value and can therefore be used to overcome the significant computational challenges posed by Shapley approximation algorithms in this specific context. We conclude with a number of open issues and propose further research directions that leverage the contributions and findings of this dissertation. These include monitoring data transactions to better measure data markets, and complementing market data with actual transaction prices to build a more accurate data pricing tool. A human-centric data economy would also require that the contributions of thousands of individuals to machine learning tasks are calculated daily. For that to be feasible, we need to further optimise the efficiency of the data purchasing and payoff calculation processes in data marketplaces. In that direction, we also point to alternatives to repetitively training and evaluating a model in order to select data based on Try Before You Buy and to approximate the Shapley value. Finally, we discuss the challenges and potential technologies that would help to build a federation of standardised data marketplaces. The data economy will develop fast in the upcoming years, and researchers from different disciplines will work together to unlock the value of data and make the most out of it. Maybe the proposal of getting paid for our data and our contribution to the data economy finally flies, or maybe other proposals such as the robot tax are finally used to balance the power between individuals and tech firms in the digital economy. Still, we hope our work sheds light on the value of data, and contributes to making the price of data more transparent and, eventually, to moving towards a human-centric data economy. This work has been supported by IMDEA Networks Institute. Doctoral Programme in Telematic Engineering, Universidad Carlos III de Madrid. Examination committee: Georgios Smaragdakis (president), Ángel Cuevas Rumín (secretary), Pablo Rodríguez Rodrígue (member).
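
    The permutation-sampling idea commonly used to approximate Shapley values can be sketched briefly. In the code below, the value function (training and scoring a forecasting model on a coalition of data sources) is assumed to be supplied by the caller, and nothing is specific to the taxi datasets; it only illustrates why the exact computation is exponential while sampling is tractable.

```python
# Hedged sketch: Monte Carlo (permutation-sampling) approximation of the Shapley
# value of each data source, where value(coalition) trains and scores a model on
# the union of those sources. Exact Shapley needs O(2^N) evaluations; sampling
# random permutations trades exactness for tractability. Illustrative only.
import random
from typing import Callable, Dict, FrozenSet, List

def shapley_montecarlo(
    sources: List[str],
    value: Callable[[FrozenSet[str]], float],   # e.g. model accuracy on the coalition's data
    n_permutations: int = 200,
) -> Dict[str, float]:
    shapley = {s: 0.0 for s in sources}
    cache: Dict[FrozenSet[str], float] = {}

    def v(coalition: FrozenSet[str]) -> float:
        if coalition not in cache:
            cache[coalition] = value(coalition)
        return cache[coalition]

    for _ in range(n_permutations):
        order = random.sample(sources, len(sources))   # one random arrival order
        coalition: FrozenSet[str] = frozenset()
        for s in order:
            with_s = coalition | {s}
            shapley[s] += v(with_s) - v(coalition)     # marginal contribution of s
            coalition = with_s

    return {s: total / n_permutations for s, total in shapley.items()}
```

    Caching avoids re-evaluating coalitions repeated across permutations; the heuristics mentioned above (entropy, similarity to the average) sidestep even this sampling cost.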

    Audio-coupled video content understanding of unconstrained video sequences

    Unconstrained video understanding is a difficult task. The main aim of this thesis is to recognise the nature of objects, activities and environment in a given video clip using both audio and video information. Traditionally, audio and video information have not been applied together to solve such a complex task, and for the first time we propose, develop, implement and test a new framework of multi-modal (audio and video) data analysis for context understanding and labelling of unconstrained videos. The framework relies on feature selection techniques and introduces a novel algorithm (PCFS) that is faster than the well-established SFFS algorithm. We use the framework to study the benefits of combining audio and video information in a number of different problems. We begin by developing two independent content recognition modules. The first is based on image sequence analysis alone, and uses a range of colour, shape, texture and statistical features from image regions with a trained classifier to recognise the identity of objects, activities and environment present. The second module uses audio information only, and recognises activities and environment. Both of these approaches are preceded by detailed pre-processing to ensure that correct video segments containing both audio and video content are present, and that the developed system can be made robust to changes in camera movement, illumination, random object behaviour, etc. For both audio and video analysis, we use a hierarchical approach of multi-stage classification so that difficult classification tasks can be decomposed into simpler and smaller ones. When combining both modalities, we compare fusion techniques at different levels of integration and propose a novel algorithm that combines the advantages of both feature- and decision-level fusion. The analysis is evaluated on a large amount of test data comprising unconstrained videos collected for this work. Finally, we propose a decision correction algorithm which shows that further steps towards effectively combining multi-modal classification information with semantic knowledge generate the best possible results.
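
    For orientation on fusion levels, the sketch below shows the textbook decision-level baseline: per-class posteriors from the audio-only and video-only modules are combined by a weighted average. The class labels and weights are hypothetical, and this is not the thesis's novel feature/decision-level algorithm or the PCFS selector.

```python
# Hedged sketch: weighted decision-level fusion of audio and video classifier
# outputs. Only the textbook baseline, with hypothetical class names and weights;
# the thesis proposes a novel combination of feature- and decision-level fusion.
import numpy as np

CLASSES = ["indoor", "outdoor", "vehicle"]          # illustrative environment labels

def fuse_decisions(p_audio: np.ndarray, p_video: np.ndarray, w_audio: float = 0.4) -> str:
    """Combine per-class posterior probabilities from the two modalities."""
    fused = w_audio * p_audio + (1.0 - w_audio) * p_video
    return CLASSES[int(np.argmax(fused))]

# Posteriors from the audio-only and video-only modules for one video segment.
p_audio = np.array([0.2, 0.7, 0.1])
p_video = np.array([0.5, 0.4, 0.1])
print(fuse_decisions(p_audio, p_video))             # -> "outdoor" with these weights
```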

    Biometrics Writer Recognition for Arabic language: Analysis and Classification techniques using Subwords Features

    Handwritten text in any language is believed to convey a great deal of information about the writer’s personality and identity. Indeed, the handwritten signature has long been accepted as an authentication of the writer’s physical stamp on financial and legal deals as well as official/personal documents and works of art, and handwritten documents are frequently used as evidence in forensic tasks. Handwriting skills are learnt and developed from the early schooling stages. Research interest in behavioural biometrics was the main driving force behind the growth in research into Writer Identification (WI) from handwritten text, but the recent rise in terrorism associated with extreme religious ideologies spreading primarily, but not exclusively, from the Middle East has led to a surge of interest in WI from handwritten text in Arabic and similar languages. This thesis is the main outcome of extensive research investigations conducted with the aim of developing an automatic identification of a person from handwritten Arabic text samples. My motivations and interests, as an Iraqi researcher, emanate from my multi-faceted desire to provide scientific support for my people in their fight against terrorism by providing forensic evidence, and to contribute to the ongoing digitisation of the Iraqi National Archive as well as the wealth of religious and historical archives in Iraq and the Middle East. Good knowledge of the underlying language is invaluable in this project. Despite the rising interest in this recognition modality worldwide, Arabic writer identification has not been addressed as extensively as Latin writer identification. However, in recent years some new Arabic writer identification approaches have been proposed, some of which are reviewed in this thesis. Arabic is a cursive language when handwritten, which means that every writer in this language develops unique features that demonstrate the writer’s habits and style. These habits and styles are considered as unique WI features and determining factors. The dominant existing approaches to WI are based on recognising handwriting habits/styles embedded in certain parts/components of the written text. Although the appearance of these components within long text contains rich information and clues to writer identity, the most common approaches to WI in Arabic in the literature are based on features extracted from paragraph(s), line(s), word(s), character(s), and/or a part of a character. Generally, Arabic words are made up of one or more subwords; at the end of each subword there is a connected stroke with a certain style, which seems to be most representative of a writer’s habits. Another feature of Arabic writing concerns the diacritics that are added to written words/subwords to add meaning and pronunciation. Subwords are more frequent in written Arabic text and appear as parts of several different words or as full individual words. Thus, we propose a new, innovative approach based on the seemingly plausible hypothesis that subword-based WI yields a significant increase in accuracy over existing approaches. The thesis’s most significant contributions can be summarised as follows:
    - A high-performing segmentation of scanned text images that combines threshold-based binarisation, morphological operations and an active shape model.
    - Digital measures defined to form 15-dimensional feature vector representations of subwords that implicitly cover their diacritics and strokes.
    - A pilot study that incrementally added features according to their writer-discriminating power, reducing the subword feature vector dimension to 8, two components of which were modelled as time series.
    - For the text-dependent 8-dimensional WI scheme, we identify the best-performing set of subwords (the best 22 out of 49, followed by the best 11 of these 22).
    - We establish the validity of our hypothesis for different versions of subword-based WI schemes by providing empirical evidence when testing on a number of existing text-dependent and text-independent databases, plus a simulated text-independent DB. The text-dependent scenario results exhibited the possible presence of the Doddington Zoo phenomenon.
    - The final optimal subword-based WI scheme not only removes the need to include diacritics as part of the subword but also demonstrates that including diacritics within subwords impairs their WI discriminating power. This should not be taken to discredit research based on diacritics-based WI. This subword-body (without diacritics) based WI scheme also eliminated the presence of the Doddington Zoo effect.
    - Finally, a significant but unintended consequence of using subwords for WI is that there is no difference between a text-independent scenario and a text-dependent one. In fact, we demonstrate that the text-dependent database of 27 words can be used to simulate testing of the scheme on a text-independent database without the need to record such a DB.
    Finally, we discuss ways of optimising the performance of our last scheme by complementing it with various image texture analysis features, such as LBP and Gabor filters, extracted from subwords, lines, paragraphs or the entire scanned image, and we suggest the possible addition of a few more features.
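
    As a rough sketch of the segmentation contribution listed above, the code below applies threshold-based binarisation and a morphological closing with OpenCV, then extracts connected components as candidate subword regions. The file name and kernel size are illustrative, and the active shape model stage is not reproduced.

```python
# Hedged sketch: threshold-based binarisation plus a morphological operation to
# expose candidate subword blobs in a scanned Arabic text image. Illustrative of
# the segmentation stage only; the thesis additionally uses an active shape model.
import cv2
import numpy as np

img = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)

# Otsu thresholding: ink becomes white (255) on a black background.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# A small closing merges a subword's strokes while keeping separate subwords apart.
kernel = np.ones((3, 3), np.uint8)
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

# Connected components give one bounding box per candidate subword (or diacritic).
n, _, stats, _ = cv2.connectedComponentsWithStats(closed, connectivity=8)
boxes = [tuple(stats[i][:4]) for i in range(1, n)]   # skip label 0 (background)
print(f"{len(boxes)} candidate components, e.g. x, y, w, h = {boxes[:3]}")
```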

    An Integrated Cybersecurity Risk Management (I-CSRM) Framework for Critical Infrastructure Protection

    Risk management plays a vital role in tackling cyber threats within the Cyber-Physical System (CPS) for overall system resilience. It enables identifying critical assets, vulnerabilities, and threats and determining suitable proactive control measures to tackle the risks. However, due to the increased complexity of the CPS, cyber-attacks nowadays are more sophisticated and less predictable, which makes the risk management task more challenging. This research aims for an effective Cyber Security Risk Management (CSRM) practice using asset criticality, prediction of risk types, and evaluation of the effectiveness of existing controls. We follow a number of techniques for the proposed unified approach, including fuzzy set theory for asset criticality, machine learning classifiers for risk prediction, and the Comprehensive Assessment Model (CAM) for evaluating the effectiveness of existing controls. The proposed approach considers relevant CSRM concepts such as threat actor, attack pattern, Tactic, Technique and Procedure (TTP), controls and assets, and maps these concepts to the VERIS Community Database (VCDB) features for the purpose of risk prediction. A tool (i-CSRMT) also serves as an additional component of the proposed framework, enabling asset criticality, risk and control effectiveness calculation for continuous risk assessment. Lastly, the thesis employs a case study to validate the proposed i-CSRM framework and i-CSRMT in terms of applicability. Stakeholder feedback is collected and evaluated using critical criteria such as ease of use, relevance, and usability. The analysis results illustrate the validity and acceptability of both the framework and the tool for effective risk management practice within a real-world environment. The experimental results reveal that using fuzzy set theory to assess assets' criticality supports stakeholders in effective risk management practice. Furthermore, the results demonstrate that the machine learning classifiers show exemplary performance in predicting different risk types, including denial of service, cyber espionage, and crimeware. Accurate prediction can help organisations model uncertainty with machine learning classifiers, detect frequent cyber-attacks, affected assets and risk types, and employ the necessary corrective actions for their mitigation. Lastly, to evaluate the effectiveness of the existing controls, the CAM approach is used, and the results show that some controls, such as network intrusion detection, authentication, and anti-virus, show high efficacy in controlling or reducing risks. Evaluating control effectiveness helps organisations know how effective the controls are in reducing or preventing any form of risk before an attack occurs, and enables them to implement new controls earlier. The main advantage of using the CAM approach is that the parameters used are objective, consistent and applicable to the CPS.
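
    As a rough illustration of how fuzzy set theory can express asset criticality, the sketch below scores an asset by its triangular membership in a ‘high impact’ fuzzy set over confidentiality, integrity and availability. The membership shapes, weights and example assets are hypothetical assumptions and are not the i-CSRM rules.

```python
# Hedged sketch: a fuzzy-set style asset-criticality score built from triangular
# membership functions over confidentiality, integrity and availability impact.
# Membership shapes, weights and asset values are hypothetical; this illustrates
# the idea of fuzzy criticality scoring, not the i-CSRM framework's rules.
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership of x in a triangular fuzzy set peaking at b over [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def criticality(conf: float, integ: float, avail: float) -> float:
    """Degree (0..1) to which an asset is 'highly critical'."""
    high = lambda x: triangular(x, 0.4, 1.0, 1.6)   # 'high impact' fuzzy set on a 0..1 scale
    # Weighted aggregation of the three memberships.
    return 0.4 * high(conf) + 0.3 * high(integ) + 0.3 * high(avail)

# A control-system historian with high confidentiality/availability impact scores
# higher than a public web brochure server.
print(criticality(conf=0.9, integ=0.8, avail=0.95))   # ~0.81
print(criticality(conf=0.2, integ=0.3, avail=0.4))    # 0.0 (low criticality)
```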

    Constructing a reference standard for sports science and clinical movement sets using IMU-based motion capture technology

    Motion analysis has improved greatly over the years through the development of low-cost inertial sensors. Such sensors have shown promising accuracy for both sport and medical applications, facilitating the possibility of constructing a new reference standard. Current gold standards within motion capture, such as high-speed camera-based systems and image processing, are not suitable for many movement sets within both sports science and clinical movement analysis due to restrictions introduced by the movement sets. These restrictions include cost, portability, local environment constraints (such as light level) and poor line-of-sight accessibility. This thesis focusses on developing a magnetometer-less IMU-based motion capture system to detect and classify two challenging movement sets: basic stances during a Shaolin Kung Fu dynamic form, and severity levels of the modified UPDRS (Unified Parkinson’s Disease Rating Scale) tapping exercise. This project has contributed three datasets. The Shaolin Kung Fu dataset comprises 5 dynamic movements repeated over 350 times by 8 experienced practitioners, and was labelled by a professional Shaolin Kung Fu master. Two modified UPDRS datasets were constructed, one for each of the two locations measured. The modified UPDRS datasets each comprise 5 severity levels, each with 100 self-emulated movement samples, and were labelled by a researcher in neuropsychological assessment. The errors associated with IMU systems have been reduced significantly through a combination of a complementary filter and the constraints imposed by the range of movements available in human joints. Novel features have been extracted from each dataset: a piecewise feature set based on a moving-window approach has been applied to the Shaolin Kung Fu dataset, while a combination of standard statistical features and a Durbin-Watson analysis has been extracted from the modified UPDRS measurements. The project has also contributed a comparison of 24 models on all 3 datasets, and the optimal model for each dataset has been determined. The resulting models were commensurate with current gold standards. The Shaolin Kung Fu dataset was classified with the computationally costly fine decision tree algorithm using 400 splits, resulting in an accuracy of 98.9%, a precision of 96.9%, a recall of 99.1%, and an F1-score of 98.0%. A novel approach using sequential forward feature analysis was used to determine the minimum number of IMU devices required as well as the optimal number of IMU devices. The modified UPDRS datasets were then classified using a support vector machine algorithm, requiring various kernels to achieve their highest accuracies. The measurements were repeated with a sensor located on the wrist and on the finger, with the wrist requiring a linear kernel and the finger a quadratic kernel. Both locations achieved an accuracy, precision, recall, and F1-score of 99.2%. Additionally, the project contributed an evaluation of the effect sensor location has on the proposed models. It was concluded that the IMU-based system has the potential to constitute a reference standard in both sports science and clinical movement analysis. Data protection security and communication speed were limitations of the constructed system, due to the measured data being transferred from the devices via Bluetooth Low Energy. These limitations are considered and evaluated in the future work of this project.
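
    The complementary filter mentioned above is a standard way to fuse gyroscope and accelerometer readings; a minimal one-axis version is sketched below. The sample rate, blend factor and test signal are illustrative, and the thesis's additional joint-range constraints are not reproduced.

```python
# Hedged sketch: a one-axis complementary filter fusing gyroscope and accelerometer
# readings into a drift-corrected angle estimate. Sample data, the 0.98 blend factor
# and the 100 Hz rate are illustrative only.
import math

def complementary_filter(gyro_rates, accels, dt=0.01, alpha=0.98):
    """gyro_rates: angular rates (rad/s); accels: (ax, az) pairs from the accelerometer."""
    angle = 0.0
    estimates = []
    for rate, (ax, az) in zip(gyro_rates, accels):
        accel_angle = math.atan2(ax, az)                       # gravity-referenced angle
        # Trust the integrated gyro short-term, the accelerometer long-term.
        angle = alpha * (angle + rate * dt) + (1.0 - alpha) * accel_angle
        estimates.append(angle)
    return estimates

# e.g. a stationary sensor tilted ~10 degrees: the estimate converges towards the
# accelerometer angle even though the gyro alone would drift.
tilt = math.radians(10)
print(complementary_filter([0.0] * 300, [(math.sin(tilt), math.cos(tilt))] * 300)[-1])
```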

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In the early parts of this research, the thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, model training, estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, the thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, and a novel imbalanced resampling approach, minority pattern reconstruction (MPR), guided by information theory. The thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that building predictive models with the methods guided by our new framework (Octopus) yields domain experts' approval of the new reliable models' performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performance, in line with experts' success criteria, than the traditional imbalanced-data resampling techniques. Finally, the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework and produce new reliable classifiers; in addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, several new methods were developed within the new framework. Finally, the newly developed and accepted predictive models help detect adverse health events, namely visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.
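
    The intuition behind choosing among several imputation methods can be illustrated generically: impute with each candidate method and keep the one that gives the best cross-validated downstream performance. The sketch below uses synthetic data and off-the-shelf scikit-learn imputers; it is only a generic illustration and is not the thesis's MMI process or the Octopus framework.

```python
# Hedged sketch: choosing between several imputation methods by comparing the
# cross-validated score of a downstream classifier - a generic illustration of
# the intuition behind multimethod imputation, not the thesis's MMI process.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)         # synthetic outcome
X[rng.random(X.shape) < 0.15] = np.nan          # inject 15% missingness

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}
for name, imputer in candidates.items():
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {score:.3f}")          # keep the best-performing imputer
```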

    The impact of domain knowledge-driven variable derivation on classifier performance for corporate data mining

    Technological progress, in terms of increasing computational power and growing virtual space for collecting data, offers great potential for businesses to benefit from data mining applications. Data mining can create a competitive advantage for corporations by discovering business-relevant information, such as patterns, relationships, and rules. The role of the human user within the data mining process is crucial, which is why the research area of domain knowledge is becoming increasingly important. This thesis investigates the impact of domain knowledge-driven variable derivation on classifier performance for corporate data mining. Domain knowledge is defined as methodological, data and business know-how. The thesis investigates the topic from a new perspective by shifting the focus from a one-sided approach, namely a purely analytic or purely theoretical approach, towards a target-group-oriented (researcher and practitioner) approach which puts the methodological aspect, by means of a scientific guideline, at the centre of the research. In order to ensure feasibility and practical relevance of the guideline, it is adapted and applied to the requirements of a practical business case. Thus, the thesis examines the topic from both a theoretical and a practical perspective, thereby overcoming the limitation of a one-sided approach, which mostly lacks practical relevance or generalisability of results. The primary objective of this thesis is to provide a scientific guideline which should enable both practitioners and researchers to advance domain knowledge-driven research on variable derivation in a corporate setting. In the theoretical part, a broad overview is given of the main aspects necessary to undertake the research, such as the concept of domain knowledge, the data mining task of classification, variable derivation as a subtask of data preparation, and evaluation techniques. This part of the thesis refers to the methodological aspect of domain knowledge. In the practical part, a research design is developed for testing six hypotheses related to domain knowledge-driven variable derivation. The major contribution of the empirical study is testing the impact of domain knowledge on a real business data set compared to the impact of a standard and a randomly derived data set. The business application of the research is a binary classification problem in the domain of an insurance business, which deals with the prediction of damages in legal expenses insurance. Domain knowledge is expressed by deriving the corporate variables by means of the business- and data-driven constructive induction strategy. Six variable derivation steps are investigated: normalisation, instance relation, discretisation, categorical encoding, ratio, and multivariate mathematical function. The impact of the domain knowledge is examined by pairwise (with and without derived variables) performance comparisons for five classification techniques (decision trees, naive Bayes, logistic regression, artificial neural networks, k-nearest neighbours). The impact is measured by two classifier performance criteria: sensitivity and area under the ROC curve (AUC). The McNemar significance test is used to verify the results. Based on the results, two hypotheses are clearly verified and accepted, three hypotheses are partly verified, and one hypothesis had to be rejected on the basis of the case study results.
The thesis reveals a significant positive impact of domain knowledge-driven variable derivation on classifier performance for options of all six tested steps. Furthermore, the findings indicate that the classification technique influences the impact of the variable derivation steps, and that bundling steps has a significantly higher performance impact if the variables are derived using domain knowledge (compared to a non-knowledge application). Finally, the research shows that an empirical examination of the domain knowledge impact is very complex, due to a high level of interaction between the selected research parameters (variable derivation step, classification technique, and performance criterion).
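
    To make the pairwise evaluation design concrete, here is a minimal hypothetical sketch in the same spirit: one domain-knowledge-driven ratio variable is derived, a classifier is trained with and without it, and the two prediction vectors are compared by AUC and a McNemar test. The synthetic insurance-like columns, thresholds and seeds are illustrative assumptions, not the case-study data or the thesis's full six-step design.

```python
# Hedged sketch: the 'with vs. without derived variable' comparison for a single
# ratio derivation and one classifier, evaluated by AUC and a McNemar test.
# The synthetic columns are stand-ins, not the legal-expenses case-study data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "claim_amount": rng.gamma(2.0, 500.0, 2000),
    "premium": rng.gamma(2.0, 300.0, 2000),
})
df["damage"] = (df["claim_amount"] / df["premium"] + rng.normal(0, 0.3, 2000) > 2.0).astype(int)
df["claim_premium_ratio"] = df["claim_amount"] / df["premium"]   # domain-knowledge-driven ratio

base_cols = ["claim_amount", "premium"]
derived_cols = base_cols + ["claim_premium_ratio"]
train, test = train_test_split(df, test_size=0.3, random_state=0)

preds = {}
for name, cols in {"without": base_cols, "with": derived_cols}.items():
    model = LogisticRegression(max_iter=1000).fit(train[cols], train["damage"])
    preds[name] = model.predict(test[cols])
    auc = roc_auc_score(test["damage"], model.predict_proba(test[cols])[:, 1])
    print(name, "AUC:", round(auc, 3))

# McNemar test on the disagreements between the two classifiers' correctness.
correct_without = preds["without"] == test["damage"]
correct_with = preds["with"] == test["damage"]
table = pd.crosstab(correct_without, correct_with).to_numpy()
print("McNemar p-value:", mcnemar(table, exact=False).pvalue)
```

    In the thesis's design, this kind of pairwise comparison is repeated for each of the six derivation steps and each of the five classification techniques.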
