2,116 research outputs found
Recommended from our members
Joint Multivariate Modelling and Prediction for Genetic and Biomedical Data
In the area of statistical genetics, classical genome-wide association studies (GWAS) assess the association between a biological characteristic and genetic variants, working with one variant at a time in a regression model, and reporting the most significant associations. These studies test genetic markers individually, even though the data may exhibit multivariate structure due to the way genes are transmitted together from the parents to the offspring. Despite considering covariates like age and sex in the model, the classical GWAS does not account for the joint effects of genetic variants. Moreover, when multiple genetic variants within a gene have small effects on a phenotype, testing them individually can lack statistical power, but testing them together in a joint model can be more useful in pooling together all the evidence. In this thesis, I reviewed different multivariate testing procedures in joint multivariate model settings, explored their properties, and demonstrated them in further real-life database applications, such as enhancing statistical power by conditioning on major variants.
I studied the mathematical properties of various multivariate test procedures, particularly within the context of multiple linear regression. Considering the theoretical aspect as well as their availability in literature, I adapt various multivariate test procedures for canonical correlation in multiple regression settings. These procedures have been demonstrated to asymptotically follow the chi-square distribution. Importantly, these test procedures exhibit asymptotic equivalence among themselves and with the Wald test statistic. This indicates that the Wald test statistic may be sufficient for future studies, given its equivalence to the multivariate test procedures.
In many cases, there are known databases of major genetic variants that have a substantial effect on the trait. In such situations, it makes sense statistically to condition on these major variants to improve power in detecting associations with new variants, but this is not a common practice in GWAS applications. In this study, we also showed theoretically and computationally how conducting a joint analysis of the genetic variants in a multiple regression model, where the estimated effect of a new variant is conditioned upon some major variants, can improve the performance of the model in terms of reducing the standard error and improving the power. The amount of gain of power will depend on the correlation between the response and the covariates, as well as the correlation
between the covariates. I further show that conditional results can sometimes
be obtained from publicly available summary statistics reported for univariate associations in published GWAS studies, even when the individual-level data are unavailable. A prominent example of such a trait is skin color, for which there are many studies consistently identifying a handful of major genes. I looked into a dataset of over 6,500 mixed-ethnicity Latin Americans to see how the conditioning process can improve the detection power of GWAS studies and identify new genetic variants in such a situation.
In practical applications, the statistical models I worked with for association testing can be carried forward for predictive purposes in new datasets. In this thesis, I have also demonstrated mathematical formulations of prediction errors in different linear models, including simple linear regression models, as well as shrinkage methods like ridge regression and lasso regression. These expressions for prediction errors show the inherent trade-off between bias and variance at both individual data points and across a set of observations. Moreover, these formulations have found the connections between prediction errors and genetic heritability that can enhance prediction performance in genetic association studies. Additionally, I reviewed various statistical and machine learning predictive models. Based on a dental morphology dataset, I compared their performance using classification metrics such as average error rate and maximum classification error rate per specimen
On the Generation of Realistic and Robust Counterfactual Explanations for Algorithmic Recourse
This recent widespread deployment of machine learning algorithms presents many new challenges. Machine learning algorithms are usually opaque and can be particularly difficult to interpret. When humans are involved, algorithmic and automated decisions can negatively impact people’s lives. Therefore, end users would like to be insured against potential harm. One popular way to achieve this is to provide end users access to algorithmic recourse, which gives end users negatively affected by algorithmic decisions the opportunity to reverse unfavorable decisions, e.g., from a loan denial to a loan acceptance. In this thesis, we design recourse algorithms to meet various end user needs. First, we propose methods for the generation of realistic recourses. We use generative models to suggest recourses likely to occur under the data distribution. To this end, we shift the recourse action from the input space to the generative model’s latent space, allowing to generate counterfactuals that lie in regions with data support. Second, we observe that small changes applied to the recourses prescribed to end users likely invalidate the suggested recourse after being nosily implemented in practice. Motivated by this observation, we design methods for the generation of robust recourses and for assessing the robustness of recourse algorithms to data deletion requests. Third, the lack of a commonly used code-base for counterfactual explanation and algorithmic recourse algorithms and the vast array of evaluation measures in literature make it difficult to compare the per formance of different algorithms. To solve this problem, we provide an open source benchmarking library that streamlines the evaluation process and can be used for benchmarking, rapidly developing new methods, and setting up new
experiments. In summary, our work contributes to a more reliable interaction of end users and machine learned models by covering fundamental aspects of the recourse process and suggests new solutions towards generating realistic and robust counterfactual explanations for algorithmic recourse
UMSL Bulletin 2023-2024
The 2023-2024 Bulletin and Course Catalog for the University of Missouri St. Louis.https://irl.umsl.edu/bulletin/1088/thumbnail.jp
Multidisciplinary perspectives on Artificial Intelligence and the law
This open access book presents an interdisciplinary, multi-authored, edited collection of chapters on Artificial Intelligence (‘AI’) and the Law. AI technology has come to play a central role in the modern data economy. Through a combination of increased computing power, the growing availability of data and the advancement of algorithms, AI has now become an umbrella term for some of the most transformational technological breakthroughs of this age. The importance of AI stems from both the opportunities that it offers and the challenges that it entails. While AI applications hold the promise of economic growth and efficiency gains, they also create significant risks and uncertainty. The potential and perils of AI have thus come to dominate modern discussions of technology and ethics – and although AI was initially allowed to largely develop without guidelines or rules, few would deny that the law is set to play a fundamental role in shaping the future of AI. As the debate over AI is far from over, the need for rigorous analysis has never been greater. This book thus brings together contributors from different fields and backgrounds to explore how the law might provide answers to some of the most pressing questions raised by AI. An outcome of the Católica Research Centre for the Future of Law and its interdisciplinary working group on Law and Artificial Intelligence, it includes contributions by leading scholars in the fields of technology, ethics and the law.info:eu-repo/semantics/publishedVersio
AI in medical diagnosis : AI prediction & human judgment
AI has long been regarded as a panacea for decision-making and many other aspects of knowledge work; as something that will help humans get rid of their shortcomings. We believe that AI can be a useful asset to support decision-makers, but not that it should replace decision-makers. Decision-making uses algorithmic analysis, but it is not solely algorithmic analysis; it also involves other factors, many of which are very human, such as creativity, intuition, emotions, feelings, and value judgments. We have conducted semi-structured open-ended research interviews with 17 dermatologists to understand what they expect from an AI application to deliver to medical diagnosis. We have found four aggregate dimensions along which the thinking of dermatologists can be described: the ways in which our participants chose to interact with AI, responsibility, 'explainability', and the new way of thinking (mindset) needed for working with AI. We believe that our findings will help physicians who might consider using AI in their diagnosis to understand how to use AI beneficially. It will also be useful for AI vendors in improving their understanding of how medics want to use AI in diagnosis. Further research will be needed to examine if our findings have relevance in the wider medical field and beyond
Predicting Paid Certification in Massive Open Online Courses
Massive open online courses (MOOCs) have been proliferating because of the free or low-cost offering of content for learners, attracting the attention of many stakeholders across the entire educational landscape. Since 2012, coined as “the Year of the MOOCs”, several platforms have gathered millions of learners in just a decade. Nevertheless, the certification rate of both free and paid courses has been low, and only about 4.5–13% and 1–3%, respectively, of the total number of enrolled learners obtain a certificate at the end of their courses. Still, most research concentrates on completion, ignoring the certification problem, and especially its financial aspects. Thus, the research described in the present thesis aimed to investigate paid certification in MOOCs, for the first time, in a comprehensive way, and as early as the first week of the course, by exploring its various levels. First, the latent correlation between learner activities and their paid certification decisions was examined by (1) statistically comparing the activities of non-paying learners with course purchasers and (2) predicting paid certification using different machine learning (ML) techniques. Our temporal (weekly) analysis showed statistical significance at various levels when comparing the activities of non-paying learners with those of the certificate purchasers across the five courses analysed. Furthermore, we used the learner’s activities (number of step accesses, attempts, correct and wrong answers, and time spent on learning steps) to build our paid certification predictor, which achieved promising balanced accuracies (BAs), ranging from 0.77 to 0.95. Having employed simple predictions based on a few clickstream variables, we then analysed more in-depth what other information can be extracted from MOOC interaction (namely discussion forums) for paid certification prediction. However, to better explore the learners’ discussion forums, we built, as an original contribution, MOOCSent, a cross- platform review-based sentiment classifier, using over 1.2 million MOOC sentiment-labelled reviews. MOOCSent addresses various limitations of the current sentiment classifiers including (1) using one single source of data (previous literature on sentiment classification in MOOCs was based on single platforms only, and hence less generalisable, with relatively low number of instances compared to our obtained dataset;) (2) lower model outputs, where most of the current models are based on 2-polar
iii
iv
classifier (positive or negative only); (3) disregarding important sentiment indicators, such as emojis and emoticons, during text embedding; and (4) reporting average performance metrics only, preventing the evaluation of model performance at the level of class (sentiment). Finally, and with the help of MOOCSent, we used the learners’ discussion forums to predict paid certification after annotating learners’ comments and replies with the sentiment using MOOCSent. This multi-input model contains raw data (learner textual inputs), sentiment classification generated by MOOCSent, computed features (number of likes received for each textual input), and several features extracted from the texts (character counts, word counts, and part of speech (POS) tags for each textual instance). This experiment adopted various deep predictive approaches – specifically that allow multi-input architecture - to early (i.e., weekly) investigate if data obtained from MOOC learners’ interaction in discussion forums can predict learners’ purchase decisions (certification). Considering the staggeringly low rate of paid certification in MOOCs, this present thesis contributes to the knowledge and field of MOOC learner analytics with predicting paid certification, for the first time, at such a comprehensive (with data from over 200 thousand learners from 5 different discipline courses), actionable (analysing learners decision from the first week of the course) and longitudinal (with 23 runs from 2013 to 2017) scale. The present thesis contributes with (1) investigating various conventional and deep ML approaches for predicting paid certification in MOOCs using learner clickstreams (Chapter 5) and course discussion forums (Chapter 7), (2) building the largest MOOC sentiment classifier (MOOCSent) based on learners’ reviews of the courses from the leading MOOC platforms, namely Coursera, FutureLearn and Udemy, and handles emojis and emoticons using dedicated lexicons that contain over three thousand corresponding explanatory words/phrases, (3) proposing and developing, for the first time, multi-input model for predicting certification based on the data from discussion forums which synchronously processes the textual (comments and replies) and numerical (number of likes posted and received, sentiments) data from the forums, adapting the suitable classifier for each type of data as explained in detail in Chapter 7
UMSL Bulletin 2022-2023
The 2022-2023 Bulletin and Course Catalog for the University of Missouri St. Louis.https://irl.umsl.edu/bulletin/1087/thumbnail.jp
Explainable Representations for Relation Prediction in Knowledge Graphs
Knowledge graphs represent real-world entities and their relations in a
semantically-rich structure supported by ontologies. Exploring this data with
machine learning methods often relies on knowledge graph embeddings, which
produce latent representations of entities that preserve structural and local
graph neighbourhood properties, but sacrifice explainability. However, in tasks
such as link or relation prediction, understanding which specific features
better explain a relation is crucial to support complex or critical
applications.
We propose SEEK, a novel approach for explainable representations to support
relation prediction in knowledge graphs. It is based on identifying relevant
shared semantic aspects (i.e., subgraphs) between entities and learning
representations for each subgraph, producing a multi-faceted and explainable
representation.
We evaluate SEEK on two real-world highly complex relation prediction tasks:
protein-protein interaction prediction and gene-disease association prediction.
Our extensive analysis using established benchmarks demonstrates that SEEK
achieves significantly better performance than standard learning representation
methods while identifying both sufficient and necessary explanations based on
shared semantic aspects.Comment: 16 pages, 3 figure
Decision trees to evaluate the risk of developing multiple sclerosis
IntroductionMultiple sclerosis (MS) is a persistent neurological condition impacting the central nervous system (CNS). The precise cause of multiple sclerosis is still uncertain; however, it is thought to arise from a blend of genetic and environmental factors. MS diagnosis includes assessing medical history, conducting neurological exams, performing magnetic resonance imaging (MRI) scans, and analyzing cerebrospinal fluid. While there is currently no cure for MS, numerous treatments exist to address symptoms, decelerate disease progression, and enhance the quality of life for individuals with MS.MethodsThis paper introduces a novel machine learning (ML) algorithm utilizing decision trees to address a key objective: creating a predictive tool for assessing the likelihood of MS development. It achieves this by combining prevalent demographic risk factors, specifically gender, with crucial immunogenetic risk markers, such as the alleles responsible for human leukocyte antigen (HLA) class I molecules and the killer immunoglobulin-like receptors (KIR) genes responsible for natural killer lymphocyte receptors.ResultsThe study included 619 healthy controls and 299 patients affected by MS, all of whom originated from Sardinia. The gender feature has been disregarded due to its substantial bias in influencing the classification outcomes. By solely considering immunogenetic risk markers, the algorithm demonstrates an ability to accurately identify 73.24% of MS patients and 66.07% of individuals without the disease.DiscussionGiven its notable performance, this system has the potential to support clinicians in monitoring the relatives of MS patients and identifying individuals who are at an increased risk of developing the disease
Data- og ekspertdreven variabelseleksjon for prediktive modeller i helsevesenet : mot økt tolkbarhet i underbestemte maskinlæringsproblemer
Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters to better understand the underlying biological processes that lead to health issues.
Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e. the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating results in a meta-model. This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capabilities to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice.
Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to positive effects on the use of human, technical, and financial resources if systematically integrated into the planning of patient treatment by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings.
This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection. Experimental evaluations demonstrate favorable properties of both predictive performance, stability, as well as interpretability of results, which carries a high potential for better data-driven decision support in clinical practice.Moderne datainnsamlingsteknikker i helsevesenet genererer store datamengder fra flere kilder, som for eksempel nye diagnose- og behandlingsmetoder. Noen konkrete eksempler er elektroniske helsejournalsystemer, genomikk og medisinske bilder. Slike pasientkohortdata er ofte ustrukturerte, høydimensjonale og heterogene og hvor klassiske statistiske metoder ikke er tilstrekkelige for optimal utnyttelse av dataene og god informasjonsbasert beslutningstaking. Derfor kan det være lovende å analysere slike datastrukturer ved bruk av moderne maskinlæringsteknikker for å øke forståelsen av pasientenes helseproblemer og for å gi klinikerne en bedre plattform for informasjonsbasert beslutningstaking. Sentrale krav til dette formålet inkluderer (a) tilstrekkelig nøyaktige prediksjoner og (b) modelltolkbarhet. Å oppnå begge aspektene samtidig er vanskelig, spesielt for datasett med få pasienter, noe som er vanlig for data i helsevesenet. I slike tilfeller må maskinlæringsmodeller håndtere matematisk underbestemte systemer og dette kan lett føre til at modellene overtilpasses treningsdataene. Variabelseleksjon er en viktig tilnærming for å håndtere dette ved å identifisere en undergruppe av informative variabler med hensyn til responsvariablen. Samtidig som variabelseleksjonsmetoder kan lede til økt prediktiv ytelse, fremmes modelltolkbarhet ved å identifisere et lavt antall relevante modellparametere. Dette kan gi bedre forståelse av de underliggende biologiske prosessene som fører til helseproblemer.
Tolkbarhet krever at variabelseleksjonen er stabil, dvs. at små endringer i datasettet ikke fører til endringer i hvilke variabler som velges. Et konsept for å adressere ustabilitet er ensemblevariableseleksjon, dvs. prosessen med å gjenta variabelseleksjon flere ganger på en delmengde av prøvene i det originale datasett og aggregere resultater i en metamodell. Denne avhandlingen presenterer to tilnærminger for ensemblevariabelseleksjon, som er skreddersydd for høydimensjonale data i helsevesenet: "Repeated Elastic Net Technique for feature selection" (RENT) og "User-Guided Bayesian Framework for feature selection" (UBayFS). Mens RENT er datadrevet og bygger på elastic net-regulariserte modeller, er UBayFS et generelt rammeverk for ensembler som muliggjør inkludering av ekspertkunnskap i variabelseleksjonsprosessen gjennom forhåndsbestemte vekter og sidebegrensninger. En case-studie som modellerer overlevelsen av kreftpasienter sammenligner disse nye variabelseleksjonsmetodene og demonstrerer deres potensiale i klinisk praksis.
Utover valg av enkelte variabler gjør UBayFS det også mulig å velge blokker eller grupper av variabler som representerer de ulike datakildene som ble nevnt over. Kvantifisering av viktigheten av variabelgrupper spiller en nøkkelrolle for forståelsen av hvorvidt datakildene er viktige for responsvariablen. Tilgang til slik informasjon kan føre til at bruken av menneskelige, tekniske og økonomiske ressurser kan forbedres dersom informasjonen integreres systematisk i planleggingen av pasientbehandlingen. Slik kan man redusere innsamling av ikke-informative variabler. Siden generaliseringen av viktighet av variabelgrupper ikke er triviell, undersøkes og sammenlignes også tilnærminger for rangering av viktigheten til disse variabelgruppene.
Denne avhandlingen viser at høydimensjonale datasett fra flere datakilder fra det medisinske domenet effektivt kan håndteres ved bruk av variabelseleksjonmetodene som er presentert i avhandlingen. Eksperimentene viser at disse kan ha positiv en effekt på både prediktiv ytelse, stabilitet og tolkbarhet av resultatene. Bruken av disse variabelseleksjonsmetodene bærer et stort potensiale for bedre datadrevet beslutningsstøtte i klinisk praksis
- …