Computational and experimental studies on the reaction mechanism of bio-oil components with additives for increased stability and fuel quality
As one of the world’s largest palm oil producers, Malaysia faces a major disposal problem because vast amounts of oil palm biomass waste are produced. To overcome this problem, the biomass waste can be liquefied into biofuel with fast pyrolysis technology. However, fast pyrolysis bio-oil requires further upgrading via direct solvent addition to overcome its undesirable attributes. In addition, the high production cost of biofuels often hinders their commercialisation. Thus, the designed solvent-oil blend needs to achieve both fuel functionality and economic targets to be competitive with conventional diesel fuel.
In this thesis, a multi-stage computer-aided molecular design (CAMD) framework was employed for bio-oil solvent design. In the design problem, molecular signature descriptors were applied to accommodate different classes of property prediction models. However, the complexity of the CAMD problem increases as the height of the signature increases, due to the combinatorial nature of higher-order signatures. Thus, a consistency rule was developed to reduce the size of the CAMD problem. The CAMD problem was then further extended to address the economic aspects via a fuzzy multi-objective optimisation approach.
Next, a rough-set-based machine learning (RSML) model was proposed to correlate the feedstock characterisation and pyrolysis conditions with the pyrolysis bio-oil properties by generating decision rules. The generated decision rules were analysed from a scientific standpoint to identify the underlying patterns, while ensuring the rules were logical. The decision rules generated can be used to select the optimal feedstock composition and pyrolysis conditions to produce pyrolysis bio-oil with targeted fuel properties.
Next, the results obtained from the computational approaches were verified through an experimental study. The generated pyrolysis bio-oils were blended with the identified solvents at various mixing ratios. In addition, emulsification of the solvent-oil blend in diesel was conducted with the help of surfactants. Lastly, potential extensions and prospective work for this study are discussed in the later part of this thesis. To conclude, this thesis presents a combination of computational and experimental approaches for upgrading the fuel properties of pyrolysis bio-oil. As a result, high-quality biofuel can be generated as a cleaner-burning replacement for conventional diesel fuel.
Hybrid Filter and Genetic Algorithm-Based Feature Selection for Improving Cancer Classification in High-Dimensional Microarray Data
The advancements in intelligent systems have contributed tremendously to the fields of bioinformatics, health, and medicine. Intelligent classification and prediction techniques have been used to study microarray datasets, which store gene expression information, and can greatly assist in diagnosing chronic diseases such as cancer at an early stage, which is important and challenging. However, the high dimensionality and noisy nature of microarray data lead to slow performance and low cancer classification accuracy when machine learning techniques are used. In this paper, a hybrid filter-genetic feature selection approach is proposed to address the high dimensionality of microarray datasets and thereby improve cancer classification performance. First, filter feature selection methods, including information gain, information gain ratio, and Chi-squared, are applied to select the most significant features of cancer microarray datasets. Then, a genetic algorithm is employed to further optimize the selected features in order to improve the proposed method’s capability for cancer classification. To test the proposed scheme, four cancer microarray datasets were used: breast, lung, central nervous system, and brain cancer datasets. The experimental results show that the proposed hybrid filter-genetic feature selection approach achieved better performance with several common machine learning methods in terms of Accuracy, Recall, Precision, and F-measure.
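The filter stage described in this abstract can be illustrated with a minimal sketch of information-gain ranking. It assumes the expression levels have already been discretized into categories; the function names and the toy data are hypothetical, and the genetic-algorithm stage is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Rank features (name -> list of discrete values) by information gain."""
    scores = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy example: gene_a separates the two classes perfectly, gene_b not at all.
labels = [0, 0, 1, 1]
columns = {
    "gene_a": ["lo", "lo", "hi", "hi"],
    "gene_b": ["lo", "hi", "lo", "hi"],
}
selected = top_k_features(columns, labels, k=1)
```

In a pipeline of this kind, the features kept by the filter would then form the search space for the second-stage optimizer.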
Fair Causal Feature Selection
Causal feature selection has recently received increasing attention in
machine learning. Existing causal feature selection algorithms select unique
causal features of a class variable as the optimal feature subset. However, a
class variable usually has multiple states, and it is unfair to select the same
causal features for different states of a class variable. To address this
problem, we employ the class-specific mutual information to evaluate the causal
information carried by each state of the class attribute, and theoretically
analyze the unique relationship between each state and the causal features.
Based on this, a Fair Causal Feature Selection algorithm (FairCFS) is proposed
to fairly identify the causal features for each state of the class variable.
Specifically, FairCFS uses the pairwise comparisons of class-specific mutual
information and the size of class-specific mutual information values from the
perspective of each state, and follows a divide-and-conquer framework to find
causal features. The correctness and application condition of FairCFS are
theoretically proved, and extensive experiments are conducted to demonstrate
the efficiency and superiority of FairCFS compared to the state-of-the-art
approaches
A method for estimating yield of maize inbred lines by assimilating WOFOST model with Sentinel-2 satellite data
Maize is the most widely planted food crop in China, and maize inbred lines, as the basis of maize genetic breeding and seed breeding, have a significant impact on China’s seed security and food safety. Satellite remote sensing technology has been widely used for growth monitoring and yield estimation of various crops, but it remains doubtful whether existing remote sensing methods can distinguish the growth difference between maize inbred lines and hybrids and accurately estimate the yield of maize inbred lines. This paper explores a method for estimating the yield of maize inbred lines based on the assimilation of crop models and remote sensing data, offering an initial solution to this problem. First, this paper analyzed the parameter sensitivity of the WOFOST (World Food Studies) model and used the MCMC (Markov Chain Monte Carlo) method to calibrate the sensitive parameters, obtaining a parameter set for maize inbred lines that differs from that of common hybrid maize. Then, vegetation indices were selected to establish an empirical model with the measured LAI (Leaf Area Index) at three key development stages to obtain the remotely sensed LAI estimates. Finally, the yield of maize inbred lines in the study area was estimated and mapped pixel by pixel using the EnKF (Ensemble Kalman Filter) data assimilation algorithm. This paper also compares an alternative assimilation method that sets a single parameter: instead of the WOFOST parameter optimization process, a parameter representing the growth weakness of the inbred lines was set in WOFOST to distinguish the inbred lines from the hybrids. The results showed that, compared with the field-measured yield data, the yields estimated by the two methods had R² values of 0.56 and 0.18 and RMSE values of 684.90 kg/ha and 949.95 kg/ha, respectively. This indicates that the crop growth model of maize inbred lines established in this study, combined with the data assimilation method, can provide initial growth monitoring and yield estimation of maize inbred lines.
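To make the EnKF step more concrete, the following is a minimal sketch of a single analysis update for one scalar state, assuming the state is LAI itself and the remotely sensed LAI observes it directly (identity observation operator). The function name, ensemble values, and variances are illustrative assumptions and are not taken from the paper.

```python
import random
import statistics


def enkf_update(ensemble, observation, obs_variance, rng):
    """One stochastic EnKF analysis step for a directly observed scalar state.

    ensemble      -- forecast values of the state (e.g. simulated LAI per member)
    observation   -- observed value of the same quantity
    obs_variance  -- assumed error variance of the observation
    rng           -- random.Random instance used to perturb the observation
    """
    forecast_variance = statistics.variance(ensemble)  # sample variance
    gain = forecast_variance / (forecast_variance + obs_variance)  # Kalman gain
    obs_sd = obs_variance ** 0.5
    # Each member is nudged towards its own perturbed copy of the observation.
    return [x + gain * (rng.gauss(observation, obs_sd) - x) for x in ensemble]


rng = random.Random(0)
forecast_lai = [2.0, 2.5, 3.0, 3.5]  # hypothetical model ensemble
analysis_lai = enkf_update(forecast_lai, observation=4.0, obs_variance=0.04, rng=rng)
```

The smaller the assumed observation variance relative to the ensemble spread, the closer the gain is to 1 and the more strongly the ensemble is pulled towards the observation.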
Machine Learning Approaches to Understanding and Predicting Cancer Screening Follow Through with Population and Health System Data
Introduction
Cancer is the second leading cause of death in the United States, and cancer screening is a primary tool to reduce mortality. However, not all who are recommended to be screened actually follow through. This study investigates whether electronic medical record and geographic data are suitable to predict which patients are at risk of missing recommended screenings. The goal of this investigation is to design a data-informed system that can automate the prediction of those at risk of missing screenings and provide insights into the underlying reasons. This will enable resources to be focused on increasing cancer screening adherence, with the overall goal of reducing mortality from cancer.
Methods
Data for this study were sourced from de-identified electronic medical records from the Medical University of South Carolina’s patient population and publicly available geographic datasets. These data were used to train a series of machine learning models to predict which patients would follow through with cancer screening tests, and to describe underlying associations with diagnosis data, cancer histories, and social determinants of health.
Results
This study found that it was possible to systematically identify small groups of female patients who are unlikely to follow through with mammogram screening. However, similar results were not found when predicting lung cancer screening follow-through. Additionally, patterns associating social determinants at the county level cannot be used to make accurate predictions about individual patient follow-through. It was also demonstrated that the core relationship between screening and mortality does not hold in areas with a high proportion of minority residents.
Conclusion
This study shows that an automated system for identifying small groups of patients unlikely to complete mammogram screening is achievable and sets forth a methodology for its development. It also provides valuable insights into the nature of social determinants associated with patients and their limits when geographically attributed.
Towards an Unsupervised Bayesian Network Pipeline for Explainable Prediction, Decision Making and Discovery
An unsupervised learning pipeline for discrete Bayesian networks is proposed to facilitate prediction, decision making, discovery of patterns, and transparency in challenging real-world AI applications, and contend with data limitations. We explore methods for discretizing data, and notably apply the pipeline to prediction and prevention of preterm birth
Timely Classification of Encrypted or Protocol-Obfuscated Internet Traffic Using Statistical Methods
Internet traffic classification aims to identify the type of application or protocol that generated
a particular packet or stream of packets on the network. Through traffic classification,
Internet Service Providers (ISPs), governments, and network administrators can
access basic functions and several solutions, including network management, advanced
network monitoring, network auditing, and anomaly detection. Traffic classification is
essential as it ensures the Quality of Service (QoS) of the network, as well as allowing
efficient resource planning.
With the increase of encrypted or obfuscated protocol traffic on the Internet and multilayer
data encapsulation, some classical classification methods have lost interest from the
scientific community. The limitations of traditional classification methods based on port
numbers and payload inspection to classify encrypted or obfuscated Internet traffic have
led to significant research efforts focused on Machine Learning (ML) based classification
approaches using statistical features from the transport layer. In an attempt to increase
classification performance, Machine Learning strategies have gained interest from the scientific
community and have shown promise for the future of traffic classification, especially
for recognizing encrypted traffic.
However, the ML approach also has its own limitations, as some of these methods have
high computational resource consumption, which limits their application when classifying
large traffic volumes or real-time flows. These limitations have led to the investigation
of alternative approaches, including feature-based procedures and statistical methods. In
this sense, statistical analysis methods, such as distances and divergences, have been used
to classify traffic in large flows and in real time.
The main objective of statistical distance is to differentiate flows and find a pattern in
traffic characteristics through statistical properties, which enable classification. Divergences
are functional expressions often related to information theory, which measure the
degree of discrepancy between any two distributions.
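As an illustration of the measures named in this abstract, the sketch below implements their standard textbook definitions for two discrete probability distributions given as equal-length lists. It is not the thesis implementation; the function names are illustrative, and natural logarithms are assumed.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Overlap between two distributions; clamped to 1.0 against rounding error."""
    return min(1.0, sum(math.sqrt(a * b) for a, b in zip(p, q)))


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hellinger(p, q):
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))


def bhattacharyya(p, q):
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0.0 else -math.log(bc)


def wootters(p, q):
    return math.acos(bhattacharyya_coefficient(p, q))


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); infinite if q is 0 where p is not."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0.0:
            if b == 0.0:
                return math.inf
            total += a * math.log(a / b)
    return total


def jensen_shannon(p, q):
    """Symmetric, bounded divergence built from KL against the mixture m."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

All six measures are zero for identical distributions. For distributions with no overlap, the Hellinger distance reaches 1, the Wootters distance reaches π/2, the Jensen-Shannon divergence reaches ln 2, and the Bhattacharyya distance and KL divergence become infinite.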
This thesis focuses on proposing a new methodological approach to classify encrypted
or obfuscated Internet traffic based on statistical methods that enable the evaluation of
network traffic classification performance, including the use of computational resources
in terms of CPU and memory. A set of traffic classifiers based on Kullback-Leibler and
Jensen-Shannon divergences, and Euclidean, Hellinger, Bhattacharyya, and Wootters distances
was proposed. The following are the four main contributions to the advancement
of scientific knowledge reported in this thesis.
First, an extensive literature review on the classification of encrypted and obfuscated
Internet traffic was conducted. The results suggest that port-based and payload-based
methods are becoming obsolete due to the increasing use of traffic encryption and multilayer
data encapsulation. ML-based methods are also becoming limited due to their computational
complexity. As an alternative, Support Vector Machine (SVM), which is also
an ML method, and the Kolmogorov-Smirnov and Chi-squared tests can be used as a reference
for statistical classification. In parallel, the possibility of using statistical methods
for Internet traffic classification has emerged in the literature, with the potential for good
classification results without the need for large computational resources. The potential
statistical methods are Euclidean Distance, Hellinger Distance, Bhattacharyya Distance,
and Wootters Distance, as well as Kullback-Leibler (KL) and Jensen-Shannon divergences.
Second, we present a proposal and implementation of a classifier based on SVM for P2P
multimedia traffic, comparing the results with Kolmogorov-Smirnov (KS) and Chi-square
tests. The results suggest that SVM classification with Linear kernel leads to a better classification
performance than KS and Chi-square tests, depending on the value assigned to
the Self C parameter. The SVM method with Linear kernel and suitable values for the Self
C parameter may be a good choice to identify encrypted P2P multimedia traffic on the
Internet.
Third, we present a proposal and implementation of two classifiers based on KL Divergence
and Euclidean Distance, which are compared to SVM with Linear kernel configured
with the standard Self C parameter. The SVM showed a reduced ability to classify flows based
solely on packet sizes compared to the KL and Euclidean Distance methods. The KL and Euclidean
methods were able to classify all tested applications, particularly streaming and P2P,
which in almost all cases they identified with high accuracy and with reduced
consumption of computational resources. Based on the obtained results, it can be
concluded that KL and Euclidean Distance methods are an alternative to SVM, as these
statistical approaches can operate in real time and do not require retraining every time a
new type of traffic emerges.
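One way to picture a divergence-based classifier of this kind is a nearest-reference rule: a flow's packet-size histogram is compared with one reference histogram per application, and the closest reference wins. The sketch below is an assumption-laden illustration, not the thesis method; the bin edges, the add-one smoothing, the application labels, and the sample packet sizes are all hypothetical.

```python
import math
from bisect import bisect_right

BIN_EDGES = [100, 500, 1000, 1400]  # hypothetical packet-size bin edges (bytes)


def packet_size_distribution(sizes, edges=BIN_EDGES, smoothing=1.0):
    """Smoothed histogram of packet sizes, normalized to sum to 1."""
    counts = [smoothing] * (len(edges) + 1)  # smoothing keeps every bin non-zero
    for size in sizes:
        counts[bisect_right(edges, size)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """D(p || q); q has no zero bins here because of the smoothing above."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def classify_flow(sizes, references):
    """Return the label whose reference distribution is closest in KL divergence."""
    observed = packet_size_distribution(sizes)
    return min(references, key=lambda label: kl_divergence(observed, references[label]))


# Hypothetical reference profiles built from previously labelled flows.
references = {
    "streaming": packet_size_distribution([1400, 1450, 1500, 1420]),
    "p2p": packet_size_distribution([60, 90, 300, 80]),
}
label = classify_flow([1450, 1460, 1500, 1410], references)
```

Under this design, supporting a new traffic type only requires adding another reference distribution to the dictionary, which is consistent with the abstract's remark that no retraining is needed.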
Fourth, we present a proposal and implementation of a set of classifiers for encrypted
Internet traffic, based on Jensen-Shannon Divergence and Hellinger, Bhattacharyya, and
Wootters Distances, with their respective results compared to those obtained with methods
based on Euclidean Distance, KL, KS, and Chi-square. Additionally, we present a comparative
qualitative analysis of the tested methods based on Kappa values and Receiver
Operating Characteristic (ROC) curves. The results suggest average accuracy values above
90% for all statistical methods, classified as “almost perfect reliability” in terms of Kappa
values, with the exception of KS. This result indicates that these methods are viable options
to classify encrypted Internet traffic, especially Hellinger Distance, which showed
the best Kappa values compared to other classifiers. We conclude that the considered
statistical methods can be accurate and cost-effective in terms of computational resource
consumption to classify network traffic.
Our approach was based on the classification of Internet network traffic, focusing on statistical
distances and divergences. We have shown that it is possible to classify and obtain
good results with statistical methods, balancing classification performance and the
use of computational resources in terms of CPU and memory. The validation of the proposal
supports the argument of this thesis, which proposes the implementation of statistical
methods as a viable alternative to Internet traffic classification compared to methods
based on port numbers, payload inspection, and ML.
Thesis prepared at Instituto de Telecomunicações, Delegação
da Covilhã and at the Department
of Computer Science of the University of Beira Interior, and submitted to the
University of Beira Interior for discussion in public session to obtain the Ph.D. Degree in
Computer Science and Engineering.
This work has been funded by Portuguese FCT/MCTES through national funds and, when
applicable, co-funded by EU funds under the project UIDB/50008/2020, and by operation
Centro-01-0145-FEDER-000019 - C4 - Centro de Competências em Cloud Computing,
co-funded by the European Regional Development Fund (ERDF/FEDER) through
the Programa Operacional Regional do Centro (Centro 2020). This work has also been
funded by CAPES (Brazilian Federal Agency for Support and Evaluation of Graduate Education)
within the Ministry of Education of Brazil under a scholarship supported by the
International Cooperation Program CAPES/COFECUB Project 9090134/2013 at the
University of Beira Interior.
A Survey on Causal Discovery: Theory and Practice
Understanding the laws that govern a phenomenon is the core of scientific
progress. This is especially true when the goal is to model the interplay
between different aspects in a causal fashion. Indeed, causal inference itself
is specifically designed to quantify the underlying relationships that connect
a cause to its effect. Causal discovery is a branch of the broader field of
causality in which causal graphs are recovered from data (whenever possible),
enabling the identification and estimation of causal effects. In this paper, we
explore recent advancements in a unified manner, provide a consistent overview
of existing algorithms developed under different settings, report useful tools
and data, and present real-world applications to understand why and how these
methods can be fruitfully exploited.
If interpretability is the answer, what is the question?
Due to the ability to model even complex dependencies, machine learning (ML) can be used to tackle a broad range of (high-stakes) prediction problems. The complexity of the resulting models comes at the cost of transparency, meaning that it is difficult to understand the model by inspecting its parameters.
This opacity is considered problematic since it hampers the transfer of knowledge from the model, undermines the agency of individuals affected by algorithmic decisions, and makes it more challenging to expose non-robust or unethical behaviour.
To tackle the opacity of ML models, the field of interpretable machine learning (IML) has emerged. The field is motivated by the idea that if we could understand the model's behaviour -- either by making the model itself interpretable or by inspecting post-hoc explanations -- we could also expose unethical and non-robust behaviour, learn about the data generating process, and restore the agency of affected individuals. IML is not only a highly active area of research, but the developed techniques are also widely applied in both industry and the sciences.
Despite the popularity of IML, the field faces fundamental criticism, questioning whether IML actually helps in tackling the aforementioned problems of ML and even whether it should be a field of research in the first place:
First and foremost, IML is criticised for lacking a clear goal and, thus, a clear definition of what it means for a model to be interpretable. On a similar note, the meaning of existing methods is often unclear, and thus they may be misunderstood or even misused to hide unethical behaviour. Moreover, estimating conditional-sampling-based techniques poses a significant computational challenge.
With the contributions included in this thesis, we tackle these three challenges for IML.
We join a range of work by arguing that the field struggles to define and evaluate "interpretability" because incoherent interpretation goals are conflated. However, the different goals can be disentangled such that coherent requirements can inform the derivation of the respective target estimands. We demonstrate this with the examples of two interpretation contexts: recourse and scientific inference.
To tackle the misinterpretation of IML methods, we suggest deriving formal interpretation rules that link explanations to aspects of the model and data. In our work, we specifically focus on interpreting feature importance. Furthermore, we collect interpretation pitfalls and communicate them to a broader audience.
To efficiently estimate conditional-sampling-based interpretation techniques, we propose two methods that leverage the dependence structure in the data to simplify the estimation problems for Conditional Feature Importance (CFI) and SAGE.
A causal perspective proved to be vital in tackling the challenges: First, since IML problems such as algorithmic recourse are inherently causal; Second, since causality helps to disentangle the different aspects of model and data and, therefore, to distinguish the insights that different methods provide; And third, algorithms developed for causal structure learning can be leveraged for the efficient estimation of conditional-sampling based IML methods.
- …