1,181 research outputs found

    Active Discovery of Network Roles for Predicting the Classes of Network Nodes

    Full text link
    Nodes in real world networks often have class labels, or underlying attributes, that are related to the way in which they connect to other nodes. Sometimes this relationship is simple, for instance nodes of the same class are may be more likely to be connected. In other cases, however, this is not true, and the way that nodes link in a network exhibits a different, more complex relationship to their attributes. Here, we consider networks in which we know how the nodes are connected, but we do not know the class labels of the nodes or how class labels relate to the network links. We wish to identify the best subset of nodes to label in order to learn this relationship between node attributes and network links. We can then use this discovered relationship to accurately predict the class labels of the rest of the network nodes. We present a model that identifies groups of nodes with similar link patterns, which we call network roles, using a generative blockmodel. The model then predicts labels by learning the mapping from network roles to class labels using a maximum margin classifier. We choose a subset of nodes to label according to an iterative margin-based active learning strategy. By integrating the discovery of network roles with the classifier optimisation, the active learning process can adapt the network roles to better represent the network for node classification. We demonstrate the model by exploring a selection of real world networks, including a marine food web and a network of English words. We show that, in contrast to other network classifiers, this model achieves good classification accuracy for a range of networks with different relationships between class labels and network links

    PRESISTANT: Learning based assistant for data pre-processing

    Get PDF
    Data pre-processing is one of the most time consuming and relevant steps in a data analysis process (e.g., classification task). A given data pre-processing operator (e.g., transformation) can have positive, negative or zero impact on the final result of the analysis. Expert users have the required knowledge to find the right pre-processing operators. However, when it comes to non-experts, they are overwhelmed by the amount of pre-processing operators and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim at providing assistance to non-expert users by recommending data pre-processing operators that are ranked according to their impact on the final analysis. We developed a tool PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of 5 different classification algorithms, such as J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations on the recommendations provided by our tool, show that PRESISTANT can effectively help non-experts in order to achieve improved results in their analytical tasks

    Mining time-series data using discriminative subsequences

    Get PDF
    Time-series data is abundant, and must be analysed to extract usable knowledge. Local-shape-based methods offer improved performance for many problems, and a comprehensible method of understanding both data and models. For time-series classification, we transform the data into a local-shape space using a shapelet transform. A shapelet is a time-series subsequence that is discriminative of the class of the original series. We use a heterogeneous ensemble classifier on the transformed data. The accuracy of our method is significantly better than the time-series classification benchmark (1-nearest-neighbour with dynamic time-warping distance), and significantly better than the previous best shapelet-based classifiers. We use two methods to increase interpretability: First, we cluster the shapelets using a novel, parameterless clustering method based on Minimum Description Length, reducing dimensionality and removing duplicate shapelets. Second, we transform the shapelet data into binary data reflecting the presence or absence of particular shapelets, a representation that is straightforward to interpret and understand. We supplement the ensemble classifier with partial classifocation. We generate rule sets on the binary-shapelet data, improving performance on certain classes, and revealing the relationship between the shapelets and the class label. To aid interpretability, we use a novel algorithm, BruteSuppression, that can substantially reduce the size of a rule set without negatively affecting performance, leading to a more compact, comprehensible model. Finally, we propose three novel algorithms for unsupervised mining of approximately repeated patterns in time-series data, testing their performance in terms of speed and accuracy on synthetic data, and on a real-world electricity-consumption device-disambiguation problem. We show that individual devices can be found automatically and in an unsupervised manner using a local-shape-based approach

    Data Mining

    Get PDF
    Data mining is a branch of computer science that is used to automatically extract meaningful, useful knowledge and previously unknown, hidden, interesting patterns from a large amount of data to support the decision-making process. This book presents recent theoretical and practical advances in the field of data mining. It discusses a number of data mining methods, including classification, clustering, and association rule mining. This book brings together many different successful data mining studies in various areas such as health, banking, education, software engineering, animal science, and the environment

    SemGrAM - Integrating semantic graphs into association rule mining

    Get PDF
    To date, most association rule mining algorithms have assumed that the domains of items are either discrete or, in a limited number of cases, hierarchical, categorical or linear. This constrains the search for interesting rules to those that satisfy the specified quality metrics as independent values or as higher level concepts of those values. However, in many cases the determination of a single hierarchy is not practicable and, for many datasets, an item’s value may be taken from a domain that is more conveniently structured as a graph with weights indicating semantic (or conceptual) distance. Research in the development of algorithms that generate disjunctive association rules has allowed the production of rules such as Radios V TVs -> Cables. In many cases there is little semantic relationship between the disjunctive terms and arguably less readable rules such as Radios V Tuesday -> Cables can result. This paper describes two association rule mining algorithms, SemGrAMG and SemGrAMP, that accommodate conceptual distance information contained in a semantic graph. The SemGrAM algorithms permit the discovery of rules that include an association between sets of cognate groups of item values. The paper discusses the algorithms, the design decisions made during their development and some experimental results.Sydney, NS

    Biclustering electronic health records to unravel disease presentation patterns

    Get PDF
    Tese de mestrado, Ciência de Dados, Universidade de Lisboa, Faculdade de Ciências, 2019A Esclerose Lateral Amiotrófica (ELA) é uma doença neurodegenerativa heterogénea com padrões de apresentação altamente variáveis. Dada a natureza heterogénea dos doentes com ELA, aquando do diagnóstico os clínicos normalmente estimam a progressão da doença utilizando uma taxa de decaimento funcional, calculada com base na Escala Revista de Avaliação Funcional de ELA (ALSFRS-R). A utilização de modelos de Aprendizagem Automática que consigam lidar com este padrões complexos é necessária para compreender a doença, melhorar os cuidados aos doentes e a sua sobrevivência. Estes modelos devem ser explicáveis para que os clínicos possam tomar decisões informadas. Desta forma, o nosso objectivo é descobrir padrões de apresentação da doença, para isso propondo uma nova abordagem de Prospecção de Dados: Descoberta de Meta-atributos Discriminativos (DMD), que utiliza uma combinação de Biclustering, Classificação baseada em Biclustering e Prospecção de Regras de Associação para Classificação. Estes padrões (chamados de Meta-atributos) são compostos por subconjuntos de atributos discriminativos conjuntamente com os seus valores, permitindo assim distinguir e caracterizar subgrupos de doentes com padrões similares de apresentação da doença. Os Registos de Saúde Electrónicos (RSE) utilizados neste trabalho provêm do conjunto de dados JPND ONWebDUALS (ONTology-based Web Database for Understanding Amyotrophic Lateral Sclerosis), composto por questões standardizadas acerca de factores de risco, mutações genéticas, atributos clínicos ou informação de sobrevivência de uma coorte de doentes e controlos seguidos pelo consórcio ENCALS (European Network to Cure ALS), que inclui vários países europeus, incluindo Portugal. Nesta tese a metodologia proposta foi utilizada na parte portuguesa do conjunto de dados ONWebDUALS para encontrar padrões de apresentação da doença que: 1) distinguissem os doentes de ELA dos seus controlos e 2) caracterizassem grupos de doentes de ELA com diferentes taxas de progressão (categorizados em grupos Lentos, Neutros e Rápidos). Nenhum padrão coerente emergiu das experiências efectuadas para a primeira tarefa. Contudo, para a segunda tarefa os padrões encontrados para cada um dos três grupos de progressão foram reconhecidos e validados por clínicos especialistas em ELA, como sendo características relevantes de doentes com progressão Lenta, Neutra e Rápida. Estes resultados sugerem que a nossa abordagem genérica baseada em Biclustering tem potencial para identificar padrões de apresentação noutros problemas ou doenças semelhantes.Amyotrophic Lateral Sclerosis (ALS) is a heterogeneous neurodegenerative disease with a high variability of presentation patterns. Given the heterogeneous nature of ALS patients and targeting a better prognosis, clinicians usually estimate disease progression at diagnosis using the rate of decay computed from the Revised ALS Functional Rating Scale (ALSFRS-R). In this context, the use of Machine Learning models able to unravel the complexity of disease presentation patterns is paramount for disease understanding, targeting improved patient care and longer survival times. Furthermore, explainable models are vital, since clinicians must be able to understand the reasoning behind a given model’s result before making a decision that can impact a patient’s life. Therefore we aim at unravelling disease presentation patterns by proposing a new Data Mining approach called Discriminative Meta-features Discovery (DMD), which uses a combination of Biclustering, Biclustering-based Classification and Class Association Rule Mining. These patterns (called Metafeatures) are composed of discriminative subsets of features together with their values, allowing to distinguish and characterize subgroups of patients with similar disease presentation patterns. The Electronic Health Record (EHR) data used in this work comes from the JPND ONWebDUALS (ONTology-based Web Database for Understanding Amyotrophic Lateral Sclerosis) dataset, comprised of standardized questionnaire answers regarding risk factors, genetic mutations, clinical features and survival information from a cohort of patients and controls from ENCALS (European Network to Cure ALS), a consortium of diverse European countries, including Portugal. In this work the proposed methodology was used on the ONWebDUALS Portuguese EHR data to find disease presentation patterns that: 1) distinguish the ALS patients from their controls and 2) characterize groups of ALS patients with different progression rates (categorized into Slow, Neutral and Fast groups). No clear pattern emerged from the experiments performed for the first task. However, in the second task the patterns found for each of the three progression groups were recognized and validated by ALS expert clinicians, as being relevant characteristics of slow, neutral and fast progressing patients. These results suggest that our generic Biclustering approach is a promising way to unravel disease presentation patterns and could be applied to similar problems and other diseases

    Diagnosis and Prognosis of Occupational disorders based on Machine Learn- ing Techniques applied to Occupational Profiles

    Get PDF
    Work-related disorders have a global influence on people’s well-being and quality of life and are a financial burden for organizations because they reduce productivity, increase absenteeism, and promote early retirement. Work-related musculoskeletal disorders, in particular, represent a significant fraction of the total in all occupational contexts. In automotive and industrial settings where workers are exposed to work-related muscu- loskeletal disorders risk factors, occupational physicians are responsible for monitoring workers’ health protection profiles. Occupational technicians report in the Occupational Health Protection Profiles database to understand which exposure to occupational work- related musculoskeletal disorder risk factors should be ensured for a given worker. Occu- pational Health Protection Profiles databases describe the occupational physician states, and which exposure the physicians considers necessary to ensure the worker’s health protection in terms of their functional work ability. The application of Human-Centered explainable artificial intelligence can support the decision making to go from worker’s Functional Work Ability to explanations by integrating explainability into medical (re- striction) and supporting in two decision contexts: prognosis and diagnosis of individual, work related and organizational risk condition. Although previous machine learning ap- proaches provided good predictions, their application in an actual occupational setting is limited because their predictions are difficult to interpret and hence, not actionable. In this thesis, injured body parts in which the ability changed in a worker’s functional work ability status are targeted. On the one hand, artificial intelligence algorithms can help technical teams, occupational physicians, and ergonomists determine a worker’s workplace risk via the diagnosis and prognosis of body part(s) injuries; on the other hand, these approaches can help prevent work-related musculoskeletal disorders by identifying which processes are lacking in working condition improvement and which workplaces have a better match between the remaining functional work abilities. A sample of 2025 for the prognosis part (from the years of 2019 to 2020) and 7857 for the prognosis part of Occupational Health Protection Profiles based on Functional Work Ability textual re- ports in the Portuguese language in automotive industry factory. Machine learning-based Natural Language Processing methods were implemented to extract standardized infor- mation. The prognosis and diagnosis of Occupational Health Protection Profiles factors were developed in reliable Human-Centered explainable artificial intelligence system to promote a trustworthy Human-Centered explainable artificial intelligence system (enti- tled Industrial microErgo application). The most suitable regression models to predict the next medical appointment for the injured body regions were the models based on CatBoost regression, with R square and an RMSLE of 0.84 and 1.23 weeks, respectively. In parallel, CatBoost’s best regression model for most body parts is the prediction of the next injured body parts based on these two errors. This information can help tech- nical industrial teams understand potential risk factors for Occupational Health Protec- tion Profiles and identify warning signs of the early stages of musculoskeletal disorders.Os transtornos relacionados ao trabalho têm influência global no bem-estar e na quali- dade de vida das pessoas e são um ônus financeiro para as organizações, pois reduzem a produtividade, aumentam o absenteísmo e promovem a aposentadoria precoce. Os distúr- bios osteomusculares relacionados ao trabalho, em particular, representam uma fração significativa do total em todos os contextos ocupacionais. Em ambientes automotivos e industriais onde os trabalhadores estão expostos a fatores de risco de distúrbios osteomus- culares relacionados ao trabalho, os médicos do trabalho são responsáveis por monitorar os perfis de proteção à saúde dos trabalhadores. Os técnicos do trabalho reportam-se à base de dados dos Perfis de Proteção da Saúde Ocupacional para compreender quais os fatores de risco de exposição a perturbações músculo-esqueléticas relacionadas com o tra- balho que devem ser assegurados para um determinado trabalhador. As bases de dados de Perfis de Proteção à Saúde Ocupacional descrevem os estados do médico do trabalho e quais exposições os médicos consideram necessária para garantir a proteção da saúde do trabalhador em termos de sua capacidade funcional para o trabalho. A aplicação da inteligência artificial explicável centrada no ser humano pode apoiar a tomada de decisão para ir da capacidade funcional de trabalho do trabalhador às explicações, integrando a explicabilidade à médica (restrição) e apoiando em dois contextos de decisão: prognóstico e diagnóstico da condição de risco individual, relacionado ao trabalho e organizacional . Embora as abordagens anteriores de aprendizado de máquina tenham fornecido boas pre- visões, sua aplicação em um ambiente ocupacional real é limitada porque suas previsões são difíceis de interpretar e portanto, não acionável. Nesta tese, as partes do corpo lesiona- das nas quais a habilidade mudou no estado de capacidade funcional para o trabalho do trabalhador são visadas. Por um lado, os algoritmos de inteligência artificial podem aju- dar as equipes técnicas, médicos do trabalho e ergonomistas a determinar o risco no local de trabalho de um trabalhador por meio do diagnóstico e prognóstico de lesões em partes do corpo; por outro lado, essas abordagens podem ajudar a prevenir distúrbios muscu- loesqueléticos relacionados ao trabalho, identificando quais processos estão faltando na melhoria das condições de trabalho e quais locais de trabalho têm uma melhor correspon- dência entre as habilidades funcionais restantes do trabalho. Para esta tese, foi utilizada uma base de dados com Perfis de Proteção à Saúde Ocupacional, que se baseiam em relató- rios textuais de Aptidão para o Trabalho em língua portuguesa, de uma fábrica da indús- tria automóvel (Auto Europa). Uma amostra de 2025 ficheiros foi utilizada para a parte de prognóstico (de 2019 a 2020) e uma amostra de 7857 ficheiros foi utilizada para a parte de diagnóstico. . Aprendizado de máquina- métodos baseados em Processamento de Lingua- gem Natural foram implementados para extrair informações padronizadas. O prognóstico e diagnóstico dos fatores de Perfis de Proteção à Saúde Ocupacional foram desenvolvidos em um sistema confiável de inteligência artificial explicável centrado no ser humano (inti- tulado Industrial microErgo application). Os modelos de regressão mais adequados para prever a próxima consulta médica para as regiões do corpo lesionadas foram os modelos baseados na regressão CatBoost, com R quadrado e RMSLE de 0,84 e 1,23 semanas, res- pectivamente. Em paralelo, a previsão das próximas partes do corpo lesionadas com base nesses dois erros relatados pelo CatBoost como o melhor modelo de regressão para a mai- oria das partes do corpo. Essas informações podem ajudar as equipes técnicas industriais a entender os possíveis fatores de risco para os Perfis de Proteção à Saúde Ocupacio- nal e identificar sinais de alerta dos estágios iniciais de distúrbios musculoesqueléticos
    • …
    corecore