
    Subgroup discovery for structured target concepts

    The main object of study in this thesis is subgroup discovery, a theoretical framework for finding subgroups in data (i.e., named sub-populations) whose behaviour with respect to a specified target concept is exceptional when compared to the rest of the dataset. This is a powerful tool that conveys crucial information to a human audience, but despite past advances it has been limited to simple target concepts. In this work we propose algorithms that bring this framework to novel application domains. We introduce the concept of representative subgroups, which we use not only to ensure the fairness of a sub-population with regard to a sensitive trait, such as race or gender, but also to go beyond known trends in the data. For entities with additional relational information that can be encoded as a graph, we introduce a novel measure of robust connectedness which improves on established alternative measures of density; we then provide a method that uses this measure to discover which named sub-populations are better connected. Our contributions within subgroup discovery culminate in the introduction of kernelised subgroup discovery: a novel framework that enables the discovery of subgroups on i.i.d. target concepts with virtually any kind of structure. Importantly, our framework also provides a concrete and efficient tool that works out-of-the-box without any modification, apart from specifying the Gramian of a positive definite kernel. For use within kernelised subgroup discovery, but also in any other kind of kernel method, we additionally introduce a novel random walk graph kernel. Our kernel allows fine-tuning of the alignment between the vertices of the two compared graphs while counting the random walks, and we also propose meaningful structure-aware vertex labels to exploit this new capability.
With these contributions we thoroughly extend the applicability of subgroup discovery and ultimately redefine it as a kernel method.
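As an illustration of how a quality measure for structured targets can be computed purely from a kernel's Gramian, the following sketch scores a candidate subgroup by the squared Maximum Mean Discrepancy between its members and the rest of the data. This is a hypothetical stand-in, not the thesis's actual quality function; the linear kernel, the toy targets, and the `mmd_quality` name are assumptions for the example.

```python
import numpy as np

def mmd_quality(K, members):
    """Squared Maximum Mean Discrepancy between a subgroup and its
    complement, computed only from the Gramian K (illustrative measure)."""
    S = np.asarray(members)            # boolean membership mask
    C = ~S
    in_s = K[np.ix_(S, S)].mean()      # mean kernel value within the subgroup
    in_c = K[np.ix_(C, C)].mean()      # ... within the complement
    cross = K[np.ix_(S, C)].mean()     # ... across the two groups
    return in_s + in_c - 2 * cross

# Toy target values and the Gramian of a linear kernel on them.
y = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
K = np.outer(y, y)
subgroup = np.array([True, True, True, False, False, False])
print(mmd_quality(K, subgroup))        # high: the subgroup's targets deviate
```

Only the Gram matrix enters the computation, which is the point: swapping in the Gramian of any positive definite kernel (on graphs, sequences, etc.) leaves the search machinery untouched.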

    A Survey of Sequential Pattern Based E-Commerce Recommendation Systems

    E-commerce recommendation systems usually deal with massive customer sequential databases, such as historical purchase or click-stream sequences. Their accuracy can be improved if complex sequential patterns of user purchase behavior are learned by integrating sequential patterns of customer clicks and/or purchases into the user–item rating matrix input of collaborative filtering. This review focuses on the algorithms of existing sequential pattern-based e-commerce recommendation systems. It provides a comprehensive and comparative performance analysis of these systems, exposing their methodologies, achievements, limitations, and potential for solving more important problems in this domain. The review shows that integrating sequential pattern mining of historical purchase and/or click sequences into a user–item matrix for collaborative filtering can (i) improve recommendation accuracy, (ii) reduce user–item rating data sparsity, (iii) increase the novelty rate of recommendations, and (iv) improve the scalability of recommendation systems.
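To make the first stage of such integration concrete, here is a minimal sketch of mining frequent sequential patterns from click sequences, with support counted per sequence. The `frequent_pairs` helper and the toy click data are illustrative assumptions standing in for a full sequential pattern miner such as GSP or PrefixSpan; its output can then seed or densify the user–item matrix used by collaborative filtering.

```python
from collections import Counter

def frequent_pairs(sequences, min_support):
    """Count ordered pairs (a appears before b) across click sequences,
    with support measured per sequence, not per occurrence."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if (a, b) not in seen:     # count each pair once per sequence
                    seen.add((a, b))
                    counts[(a, b)] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

clicks = [["shoes", "socks", "laces"],
          ["shoes", "socks"],
          ["shirt", "shoes", "socks"]]
print(frequent_pairs(clicks, min_support=3))   # → {('shoes', 'socks'): 3}
```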

    2017 GREAT Day Program

    SUNY Geneseo’s Eleventh Annual GREAT Day.

    WhatsUp: An event resolution approach for co-occurring events in social media

    The rapid growth of social media networks has resulted in the generation of a vast amount of data, making it impractical to conduct manual analyses to extract newsworthy events. Thus, automated event detection mechanisms are invaluable to the community. However, a clear majority of the available approaches rely only on data statistics without considering linguistics. A few approaches involve linguistics, but only to extract textual event details without the corresponding temporal details. Since linguistics define the structure and meaning of words, severe information loss can occur when they are ignored. Targeting this limitation, we propose a novel method named WhatsUp to detect temporal and fine-grained textual event details, using linguistics captured by self-learned word embeddings and their hierarchical relationships, and statistics captured by frequency-based measures. We evaluate our approach on recent social media data from two diverse domains and compare its performance with several state-of-the-art methods. Evaluations cover temporal and textual event aspects, and the results show that WhatsUp notably outperforms state-of-the-art methods. We also analyse its efficiency, revealing that WhatsUp is fast enough for (near) real-time detection. Further, the use of unsupervised learning techniques, including self-learned embeddings, makes our approach extensible to any language, platform, and domain, and provides capabilities to understand data-specific linguistics.
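A minimal sketch of the frequency-based half of such a detector (the linguistic half, i.e. the learned embeddings, is omitted): flag terms whose count in the current time window deviates strongly from their historical mean. The `bursty_terms` helper and its threshold are illustrative assumptions, not WhatsUp's actual scoring.

```python
from collections import Counter

def bursty_terms(window, history, threshold=2.0):
    """Flag terms whose frequency in the current window far exceeds their
    historical mean, measured as a z-score over past windows."""
    hist = [Counter(w) for w in history]
    flagged = []
    for term, freq in Counter(window).items():
        past = [c[term] for c in hist]
        mean = sum(past) / len(past)
        var = sum((x - mean) ** 2 for x in past) / len(past)
        std = var ** 0.5 or 1.0            # guard against zero variance
        if (freq - mean) / std >= threshold:
            flagged.append(term)
    return flagged

history = [["rain", "game"], ["game", "rain"], ["game"]]
window = ["flood", "flood", "flood", "rain"]
print(bursty_terms(window, history))       # → ['flood']
```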

    Visual Analytics of Co-Occurrences to Discover Subspaces in Structured Data

    We present an approach that shows all relevant subspaces of categorical data condensed in a single picture. We model the categorical values of the attributes as co-occurrences with data partitions generated from structured data using pattern mining. We show that these co-occurrences satisfy the a-priori property, allowing us to greatly reduce the search space and effectively generate the condensed picture where conventional approaches filter out several subspaces as insignificant. The task of identifying interesting subspaces is common but difficult due to exponential search spaces and the curse of dimensionality. One application of such a task might be identifying a cohort of patients defined by attributes such as gender, age, and diabetes type that share a common patient history, modelled as event sequences. Filtering the data by these attributes is common but cumbersome and often does not allow a comparison of subspaces. We contribute a powerful multi-dimensional pattern exploration approach (MDPE-approach), agnostic to the structured data type, that models multiple attributes and their characteristics as co-occurrences, allowing the user to identify and compare thousands of subspaces of interest in a single picture. In our MDPE-approach, we introduce two methods to dramatically reduce the search space, outputting only the boundaries of the search space in the form of two tables. We implement the MDPE-approach in an interactive visual interface (MDPE-vis) that provides a scalable, pixel-based visualization design allowing the identification, comparison, and sense-making of subspaces in structured data. Our case studies using a gold-standard dataset and external domain experts confirm the applicability of our approach and its implementation. A third use case sheds light on the scalability of our approach, and a user study with 15 participants underlines its usefulness and power.
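The a-priori property is what makes level-wise pruning possible: if a set of attribute-value conditions is matched by too few rows, no superset of it can be frequent. The sketch below is plain Apriori-style enumeration under that assumption, not the MDPE-approach itself; the patient rows and the `frequent_subspaces` name are invented for illustration.

```python
def frequent_subspaces(rows, min_support):
    """Level-wise enumeration of attribute-value subspaces: supersets of
    an infrequent subspace are never generated (a-priori pruning)."""
    def support(cond):
        return sum(all(r.get(k) == v for k, v in cond) for r in rows)
    items = sorted({(k, v) for r in rows for k, v in r.items()})
    frequent, level = [], [(it,) for it in items]
    while level:
        kept = [c for c in level if support(c) >= min_support]
        frequent.extend(kept)
        # join step: merge frequent subspaces that differ in one condition
        level = sorted({tuple(sorted(set(a) | set(b)))
                        for a in kept for b in kept
                        if len(set(a) | set(b)) == len(a) + 1})
    return frequent

rows = [{"gender": "F", "diabetes": "type1"},
        {"gender": "F", "diabetes": "type1"},
        {"gender": "M", "diabetes": "type2"}]
print(frequent_subspaces(rows, min_support=2))
```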

    Measuring the impact of COVID-19 on hospital care pathways

    Care pathways in hospitals around the world reported significant disruption during the recent COVID-19 pandemic, but measuring the actual impact is more problematic. Process mining can be useful for hospital management to measure the conformance of real-life care to what might be considered normal operations. In this study, we aim to demonstrate that process mining can be used to investigate process changes associated with complex disruptive events. We studied perturbations to accident and emergency (A&E) and maternity pathways in a UK public hospital during the COVID-19 pandemic. Coincidentally, the hospital had implemented a Command Centre approach for patient-flow management, affording an opportunity to study both the planned improvement and the disruption due to the pandemic. Our study proposes and demonstrates a method for measuring and investigating the impact of such planned and unplanned disruptions affecting hospital care pathways. We found that during the pandemic, both A&E and maternity pathways had measurable reductions in the mean length of stay and a measurable drop in the percentage of pathways conforming to normative models. There were no distinctive patterns in the monthly mean values of length of stay or conformance throughout the phases of the installation of the hospital's new Command Centre approach. Due to a deficit in the available A&E data, the findings for A&E pathways could not be interpreted.
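Conformance of the kind measured here can be approximated very simply: treat a case as conformant when its trace follows the normative pathway in order. The sketch below uses subsequence matching as a toy proxy for real alignment-based conformance checking; the pathway and the cases are invented.

```python
def conformance_rate(cases, normative):
    """Fraction of cases whose trace follows the normative pathway in
    order (subsequence check, a toy proxy for alignment-based checking)."""
    def fits(trace):
        steps = iter(normative)
        return all(step in steps for step in trace)   # consumes the iterator
    return sum(fits(t) for t in cases) / len(cases)

pathway = ["arrival", "triage", "treatment", "discharge"]
cases = [["arrival", "triage", "discharge"],
         ["arrival", "treatment", "triage"],          # out of order
         ["arrival", "triage", "treatment", "discharge"]]
print(conformance_rate(cases, pathway))               # 2 of 3 cases conform
```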

    Development and application of new machine learning models for the study of colorectal cancer

    Today, in healthcare, there is growing interest in Artificial Intelligence techniques, specifically Machine Learning techniques, which have long been delivering good results in fields such as industry, e-commerce, and education. Healthcare, however, poses an even greater challenge: it requires both highly tested systems, since their results have a direct impact on people's health, and a good level of interpretability. This is very important because with black-box methods, which can be very precise, it is difficult to know what led the automatic system to take one decision rather than another. This can generate rejection among healthcare professionals due to the insecurity they may feel at being unable to explain a clinical decision taken on the basis of a decision-support system. In this context, from the very beginning we established that the interpretability of the results should be one of the premises governing all the work carried out in this doctoral thesis. Accordingly, all the developments presented generate either classification trees (which produce interpretable rules) or association rules that describe relationships among the existing data.
Colorectal cancer, in turn, is a malignant neoplasm with high morbidity and mortality in both men and women. It unquestionably requires multidisciplinary care in which different healthcare professionals (family doctors, gastroenterologists, radiologists, surgeons, oncologists, pharmacists, nursing staff, etc.) take a joint approach to the pathology in order to offer the best possible care to the patient. Going forward, it would also be very interesting to incorporate data scientists into this multidisciplinary team, as great value can be extracted from all the information generated daily about this pathology. This doctoral thesis studies a dataset of colorectal cancer patients with a set of artificial intelligence techniques and develops new machine learning models for it. The results are as follows: A literature review on the use of Machine Learning applied to colorectal cancer, from which a taxonomy of the works existing at the time of the state-of-the-art study was produced. This taxonomy classifies the works according to criteria such as the type of dataset used, the type of algorithm implemented, the size of the dataset and its public availability, the use or not of feature selection algorithms, and the use or not of feature extraction techniques. A class association rule extraction model intended to better understand why some patients might experience complications after surgery or recurrences of their cancer; this work has yielded a methodology for obtaining interpretable and manageable descriptions (it is important that the generated rules are small enough to be useful to practitioners). A feature and instance selection model to induce better classification trees. A Grammatical Evolution algorithm to induce a wide variety of classification trees as accurate as those obtained by the well-known C4.5 and CART methods.
In this case, the PonyGE2 Python library was used and, given its limited specificity for our problem, a series of operators was developed that induce more interpretable trees than those PonyGE2 produces out of the box. The results of each development have been compared with well-established methods from the literature, both in classification and in association rule mining, demonstrating a better fit of our models to the characteristics of the study dataset; these models may also be applicable to other cases. This demonstrates that it is possible to reach a good balance between precision and interpretability.
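A class association rule of the kind described maps a small set of attribute-value conditions to a class label and is kept only when its confidence is high. The sketch below is a minimal illustration under that definition, not the thesis's algorithm; the records, labels, and thresholds are invented.

```python
from itertools import combinations

def class_rules(records, labels, min_conf=0.8, max_len=2):
    """Keep antecedents of up to max_len attribute-value pairs that
    predict a class label with confidence >= min_conf."""
    items = sorted({(k, v) for r in records for k, v in r.items()})
    rules = []
    for size in range(1, max_len + 1):
        for ante in combinations(items, size):
            covered = [lab for r, lab in zip(records, labels)
                       if all(r.get(k) == v for k, v in ante)]
            for cls in set(covered):
                conf = covered.count(cls) / len(covered)
                if conf >= min_conf:
                    rules.append((ante, cls, conf))
    return rules

records = [{"stage": "III", "smoker": "yes"},
           {"stage": "III", "smoker": "no"},
           {"stage": "I", "smoker": "no"}]
labels = ["complication", "complication", "none"]
for ante, cls, conf in class_rules(records, labels):
    print(ante, "->", cls, round(conf, 2))
```

Keeping `max_len` small is one simple way to honour the requirement that rules stay short enough to be readable by practitioners.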

    Design and Evaluation of Parallel and Scalable Machine Learning Research in Biomedical Modelling Applications

    The use of Machine Learning (ML) techniques in the medical field is not a new occurrence, and several papers describing research in that direction have been published. This research has helped in analysing medical images, creating responsive cardiovascular models, and predicting outcomes for medical conditions, among many other applications. This Ph.D. aims to apply such ML techniques to the analysis of Acute Respiratory Distress Syndrome (ARDS), a severe condition that affects around 1 in 10,000 patients worldwide every year with life-threatening consequences. We employ previously developed mechanistic modelling approaches such as the “Nottingham Physiological Simulator,” through which better understanding of ARDS progression can be gleaned, and take advantage of the growing volume of medical datasets available for research (i.e., “big data”) and the advances in ML to develop, train, and optimise the modelling approaches. Additionally, the onset of the COVID-19 pandemic while this Ph.D. research was ongoing provided an application field similar to ARDS and made further ML research in medical diagnosis applications possible. Finally, we leverage the available Modular Supercomputing Architecture (MSA) developed as part of the Dynamical Exascale Entry Platform - Extreme Scale Technologies (DEEP-EST) EU Project to scale up and speed up the modelling processes. This Ph.D. project is one element of the Smart Medical Information Technology for Healthcare (SMITH) project, wherein the thesis research can be validated by clinical and medical experts (e.g. Uniklinik RWTH Aachen).

    FootApp: An AI-powered system for football match annotation

    In recent years, scientific and industrial research has experienced growing interest in acquiring large annotated data sets to train artificial intelligence algorithms for tackling problems in different domains. In this context, we have observed that even the market for football data has grown substantially. The analysis of football matches relies on the annotation of both individual players’ and team actions, as well as the athletic performance of players. Consequently, annotating football events at a fine-grained level is a very expensive and error-prone task. Most existing semi-automatic tools for football match annotation rely on cameras and computer vision. However, those tools fall short in capturing team dynamics and in extracting data on players who are not visible in the camera frame. To address these issues, in this manuscript we present FootApp, an AI-based system for football match annotation. First, our system relies on an advanced, mixed user interface that exploits both vocal and touch interaction. Second, the motor performance of players is captured and processed by applying machine learning algorithms to data collected from inertial sensors worn by players. Artificial intelligence techniques are then used to check the consistency of the generated labels, including those regarding the physical activity of players, to automatically recognize annotation errors. Notably, we implemented a full prototype of the proposed system and performed experiments showing its effectiveness in a real-world adoption scenario.

    Correlating contexts and NFR conflicts from event logs

    In the design of autonomous systems, it is important to consider the preferences of the interested parties to improve the user experience. These preferences are often associated with the contexts in which each system is likely to operate. The operational behavior of a system must also meet various non-functional requirements (NFRs), which can present different levels of conflict depending on the operational context. This work aims to model correlations between individual contexts and the consequent conflicts between NFRs. The proposed approach is based on analyzing the system event logs, tracing them back to the leaf elements at the specification level, and providing a contextual explanation of the system's behavior. The traced contexts and NFR conflicts are then mined to produce Context-Context and Context-NFR-conflict sequential rules. The proposed Contextual Explainability (ConE) framework uses BERT-based pre-trained language models and sequential rule mining libraries to derive these correlations. Extensive evaluations are performed against existing state-of-the-art approaches, and the best-fit solutions are chosen for integration within the ConE framework. Based on experiments, an accuracy of 80%, a precision of 90%, a recall of 97%, and an F1-score of 88% are recorded for the ConE framework on the mined sequential rules.
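A sequential rule X -> Y here means "when X occurs in a trace, Y occurs later in the same trace", scored by confidence P(Y after X | X). The following sketch derives such rules from toy event logs; the event names and the `sequential_rules` helper are illustrative assumptions, not ConE's actual mining library.

```python
from collections import Counter

def sequential_rules(sequences, min_conf=0.6):
    """Rules X -> Y meaning 'Y occurs after X in the same trace', with
    confidence = (#traces with Y after X) / (#traces with X)."""
    occurs, followed = Counter(), Counter()
    for seq in sequences:
        for x in set(seq):
            occurs[x] += 1
            after = set(seq[seq.index(x) + 1:]) - {x}
            for y in after:
                followed[(x, y)] += 1
    return {(x, y): n / occurs[x]
            for (x, y), n in followed.items() if n / occurs[x] >= min_conf}

logs = [["low_battery", "perf_vs_energy"],
        ["low_battery", "perf_vs_energy", "recovered"],
        ["low_battery", "recovered"]]
print(sequential_rules(logs))   # context -> NFR-conflict rules with confidence
```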