Active Discovery of Network Roles for Predicting the Classes of Network Nodes
Nodes in real world networks often have class labels, or underlying
attributes, that are related to the way in which they connect to other nodes.
Sometimes this relationship is simple; for instance, nodes of the same class
may be more likely to be connected. In other cases, however, this is not true,
and the way that nodes link in a network exhibits a different, more complex
relationship to their attributes. Here, we consider networks in which we know
how the nodes are connected, but we do not know the class labels of the nodes
or how class labels relate to the network links. We wish to identify the best
subset of nodes to label in order to learn this relationship between node
attributes and network links. We can then use this discovered relationship to
accurately predict the class labels of the rest of the network nodes.
We present a model that identifies groups of nodes with similar link
patterns, which we call network roles, using a generative blockmodel. The model
then predicts labels by learning the mapping from network roles to class labels
using a maximum margin classifier. We choose a subset of nodes to label
according to an iterative margin-based active learning strategy. By integrating
the discovery of network roles with the classifier optimisation, the active
learning process can adapt the network roles to better represent the network
for node classification. We demonstrate the model by exploring a selection of
real world networks, including a marine food web and a network of English
words. We show that, in contrast to other network classifiers, this model
achieves good classification accuracy for a range of networks with different
relationships between class labels and network links.
PRESISTANT: Learning based assistant for data pre-processing
Data pre-processing is one of the most time consuming and relevant steps in a
data analysis process (e.g., classification task). A given data pre-processing
operator (e.g., transformation) can have positive, negative or zero impact on
the final result of the analysis. Expert users have the required knowledge to
find the right pre-processing operators. Non-experts, however, are
overwhelmed by the number of pre-processing operators, and it is
challenging for them to find operators that would positively impact their
analysis (e.g., increase the predictive accuracy of a classifier). Existing
solutions either assume that users have expert knowledge, or they recommend
pre-processing operators that are only "syntactically" applicable to a dataset,
without taking into account their impact on the final analysis. In this work,
we aim at providing assistance to non-expert users by recommending data
pre-processing operators that are ranked according to their impact on the final
analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the
impact of pre-processing operators on the performance (e.g., predictive
accuracy) of five classification algorithms: J48, Naive Bayes,
PART, Logistic Regression, and Nearest Neighbor. Extensive evaluation of the
recommendations provided by our tool shows that PRESISTANT can effectively help
non-experts achieve improved results in their analytical tasks.
Mining time-series data using discriminative subsequences
Time-series data is abundant, and must be analysed to extract usable knowledge. Local-shape-based methods offer improved performance for many problems, and a
comprehensible method of understanding both data and models.
For time-series classification, we transform the data into a local-shape space using a shapelet transform. A shapelet is a time-series subsequence that is discriminative
of the class of the original series. We use a heterogeneous ensemble classifier on the transformed data. The accuracy of our method is significantly better than the time-series classification benchmark (1-nearest-neighbour with dynamic time-warping distance), and significantly better than the previous best shapelet-based classifiers.
We use two methods to increase interpretability: First, we cluster the shapelets using a novel, parameterless clustering method based on Minimum Description Length,
reducing dimensionality and removing duplicate shapelets. Second, we transform the shapelet data into binary data reflecting the presence or absence of particular
shapelets, a representation that is straightforward to interpret and understand.
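The shapelet transform and the binary presence/absence representation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common definition in which the distance from a shapelet to a series is the smallest Euclidean distance over all same-length windows, it skips the normalisation that shapelet methods often apply, and the per-shapelet `thresholds` used to decide "presence" are hypothetical, since the abstract does not say how presence is determined.

```python
import math


def subsequence_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any
    same-length window of the series (assumed definition)."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - s) ** 2 for w, s in zip(window, shapelet)))
        best = min(best, dist)
    return best


def shapelet_transform(dataset, shapelets):
    """Map each series to a vector of distances, one entry per shapelet."""
    return [[subsequence_distance(series, sh) for sh in shapelets]
            for series in dataset]


def binarise(transformed, thresholds):
    """Mark a shapelet as present (1) when its distance is within the
    hypothetical threshold for that shapelet, otherwise absent (0)."""
    return [[1 if d <= t else 0 for d, t in zip(row, thresholds)]
            for row in transformed]


dataset = [[0, 0, 1, 2, 1, 0], [0, 0, 0, 0, 0, 0]]
shapelets = [[1, 2, 1]]
distances = shapelet_transform(dataset, shapelets)
presence = binarise(distances, thresholds=[0.5])
```

Each row of `distances` can be passed to any standard classifier, while `presence` is the kind of binary table on which rule sets can be generated.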
We supplement the ensemble classifier with partial classification. We generate rule sets on the binary-shapelet data, improving performance on certain classes, and revealing the relationship between the shapelets and the class label. To aid interpretability, we use a novel algorithm, BruteSuppression, that can substantially reduce
the size of a rule set without negatively affecting performance, leading to a more compact, comprehensible model.
Finally, we propose three novel algorithms for unsupervised mining of approximately repeated patterns in time-series data, testing their performance in terms of
speed and accuracy on synthetic data, and on a real-world electricity-consumption device-disambiguation problem. We show that individual devices can be found automatically
and in an unsupervised manner using a local-shape-based approach.
Data Mining
Data mining is a branch of computer science that is used to automatically extract meaningful, useful knowledge and previously unknown, hidden, interesting patterns from a large amount of data to support the decision-making process. This book presents recent theoretical and practical advances in the field of data mining. It discusses a number of data mining methods, including classification, clustering, and association rule mining. This book brings together many different successful data mining studies in various areas such as health, banking, education, software engineering, animal science, and the environment.
SemGrAM - Integrating semantic graphs into association rule mining
To date, most association rule mining algorithms
have assumed that the domains of items are either
discrete or, in a limited number of cases, hierarchical,
categorical or linear. This constrains the search for
interesting rules to those that satisfy the specified
quality metrics as independent values or as higher
level concepts of those values. However, in many
cases the determination of a single hierarchy is not
practicable and, for many datasets, an item’s value
may be taken from a domain that is more conveniently
structured as a graph with weights indicating
semantic (or conceptual) distance. Research in the
development of algorithms that generate disjunctive
association rules has allowed the production of
rules such as Radios ∨ TVs → Cables. In many
cases there is little semantic relationship between
the disjunctive terms, and arguably less readable
rules such as Radios ∨ Tuesday → Cables can
result. This paper describes two association rule
mining algorithms, SemGrAMG and SemGrAMP,
that accommodate conceptual distance information
contained in a semantic graph. The SemGrAM
algorithms permit the discovery of rules that include
an association between sets of cognate groups of
item values. The paper discusses the algorithms, the
design decisions made during their development and
some experimental results. Sydney, NS
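To make the disjunctive rules mentioned above concrete, the sketch below computes the standard support and confidence of a rule whose antecedent is a disjunction of items, as in the Radios/TVs/Cables example. It does not implement either SemGrAM algorithm, and it ignores the semantic graph entirely; the function names and the toy transactions are assumptions made for illustration.

```python
def matches_any(transaction, items):
    """True when the transaction contains at least one of the items."""
    return any(item in transaction for item in items)


def disjunctive_rule_stats(transactions, antecedent_any, consequent):
    """Support and confidence of the rule (a1 OR a2 OR ...) -> consequent.

    support    = fraction of transactions matching antecedent and consequent
    confidence = that count divided by the count matching the antecedent
    """
    n = len(transactions)
    if n == 0:
        raise ValueError("at least one transaction is required")
    antecedent_count = 0
    both_count = 0
    for t in transactions:
        if matches_any(t, antecedent_any):
            antecedent_count += 1
            if consequent in t:
                both_count += 1
    support = both_count / n
    confidence = both_count / antecedent_count if antecedent_count else 0.0
    return support, confidence


# Toy transactions, invented for this example.
transactions = [
    {"Radios", "Cables"},
    {"TVs", "Cables"},
    {"TVs"},
    {"Bread"},
]
support, confidence = disjunctive_rule_stats(
    transactions, antecedent_any={"Radios", "TVs"}, consequent="Cables"
)
```

A rule that pairs unrelated terms, such as Radios with Tuesday, can score well on these metrics alone, which is the readability problem that conceptual distance information is meant to address.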
Biclustering electronic health records to unravel disease presentation patterns
Master's thesis, Data Science, Universidade de Lisboa, Faculdade de Ciências, 2019.
Amyotrophic Lateral Sclerosis (ALS) is a heterogeneous neurodegenerative disease with highly variable presentation patterns. Given the heterogeneous nature of ALS patients, and aiming at a better prognosis, clinicians usually estimate disease progression at diagnosis using the rate of decay computed from the Revised ALS Functional Rating Scale (ALSFRS-R). In this context, Machine Learning models able to unravel the complexity of disease presentation patterns are paramount for understanding the disease, improving patient care and lengthening survival times. Furthermore, explainable models are vital, since clinicians must be able to understand the reasoning behind a given model's result before making a decision that can impact a patient's life. We therefore aim to unravel disease presentation patterns by proposing a new Data Mining approach called Discriminative Meta-features Discovery (DMD), which uses a combination of Biclustering, Biclustering-based Classification and Class Association Rule Mining.
These patterns (called Meta-features) are composed of discriminative subsets of features together with their values, making it possible to distinguish and characterize subgroups of patients with similar disease presentation patterns. The Electronic Health Record (EHR) data used in this work comes from the JPND ONWebDUALS (ONTology-based Web Database for Understanding Amyotrophic Lateral Sclerosis) dataset, composed of standardized questionnaire answers regarding risk factors, genetic mutations, clinical features and survival information from a cohort of patients and controls followed by ENCALS (European Network to Cure ALS), a consortium of several European countries, including Portugal. In this work the proposed methodology was applied to the Portuguese ONWebDUALS EHR data to find disease presentation patterns that: 1) distinguish ALS patients from their controls and 2) characterize groups of ALS patients with different progression rates (categorized into Slow, Neutral and Fast groups). No clear pattern emerged from the experiments performed for the first task. In the second task, however, the patterns found for each of the three progression groups were recognized and validated by expert ALS clinicians as relevant characteristics of slow, neutral and fast progressing patients. These results suggest that our generic Biclustering approach is a promising way to unravel disease presentation patterns and could be applied to similar problems and other diseases.
Diagnosis and Prognosis of Occupational disorders based on Machine Learning Techniques applied to Occupational Profiles
Work-related disorders have a global influence on people's well-being and quality of life
and are a financial burden for organizations because they reduce productivity, increase
absenteeism, and promote early retirement. Work-related musculoskeletal disorders, in
particular, represent a significant fraction of the total in all occupational contexts. In
automotive and industrial settings where workers are exposed to risk factors for
work-related musculoskeletal disorders, occupational physicians are responsible for
monitoring workers' health protection profiles. Occupational technicians consult the
Occupational Health Protection Profiles database to understand which exposure to
work-related musculoskeletal disorder risk factors should be ensured for a given worker.
Occupational Health Protection Profiles databases describe the occupational physician's
statements, including which exposure the physician considers necessary to ensure the
worker's health protection in terms of their functional work ability. Human-Centered
explainable artificial intelligence can support decision making by going from a worker's
Functional Work Ability to explanations, integrating explainability into medical
(restriction) decisions and supporting two decision contexts: prognosis and diagnosis of
individual, work-related and organizational risk conditions. Although previous machine
learning approaches provided good predictions, their application in an actual occupational
setting is limited because their predictions are difficult to interpret and hence not
actionable. This thesis targets the injured body parts for which the ability changed in a
worker's functional work ability status. On the one hand, artificial intelligence algorithms
can help technical teams, occupational physicians, and ergonomists determine a worker's
workplace risk via the diagnosis and prognosis of body part injuries; on the other hand,
these approaches can help prevent work-related musculoskeletal disorders by identifying
which processes are lacking in working condition improvement and which workplaces
have a better match with the remaining functional work abilities. The thesis uses a
database of Occupational Health Protection Profiles based on Functional Work Ability
textual reports in the Portuguese language from an automotive industry factory (Auto
Europa). A sample of 2025 files was used for the prognosis part (from 2019 to 2020) and
a sample of 7857 files for the diagnosis part. Machine learning-based Natural Language
Processing methods were implemented to extract standardized information. The prognosis
and diagnosis of Occupational Health Protection Profiles factors were developed in a
reliable, trustworthy Human-Centered explainable artificial intelligence system (entitled
Industrial microErgo application). The most suitable regression models to predict the next
medical appointment for the injured body regions were those based on CatBoost regression,
with an R squared of 0.84 and an RMSLE of 1.23 weeks. In parallel, for predicting the next
injured body parts, CatBoost was the best regression model for most body parts according to
these two metrics. This information can help technical industrial teams understand potential
risk factors for Occupational Health Protection Profiles and identify warning signs of the
early stages of musculoskeletal disorders.
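For readers unfamiliar with the two metrics reported above, the following sketch computes R squared and RMSLE from their standard textbook definitions. It is an illustration only: the thesis's own evaluation code is not shown in the abstract, and the sample values below are invented.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R squared is undefined when all true values are equal")
    return 1 - ss_res / ss_tot


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) so that
    zero values are allowed; inputs must be non-negative."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Invented example: weeks until the next appointment, true vs predicted.
weeks_true = [1, 2, 3]
weeks_pred = [2, 2, 2]
r2 = r_squared(weeks_true, weeks_pred)
err = rmsle(weeks_true, weeks_pred)
```

Because RMSLE compares logarithms, it penalises relative rather than absolute differences, which suits targets such as waiting times that span a wide range.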
- …