1,114 research outputs found

    Colossal Trajectory Mining: A unifying approach to mine behavioral mobility patterns

    Spatio-temporal mobility patterns are at the core of strategic applications such as urban planning and monitoring. Depending on the strength of spatio-temporal constraints, different mobility patterns can be defined. While existing approaches work well in the extraction of groups of objects sharing fine-grained paths, the huge volume of large-scale data calls for coarse-grained solutions. In this paper, we introduce Colossal Trajectory Mining (CTM) to efficiently extract heterogeneous mobility patterns out of a multidimensional space that, along with the space and time dimensions, can consider additional trajectory features (e.g., means of transport or activity) to characterize behavioral mobility patterns. The algorithm is natively designed in a distributed fashion, and the experimental evaluation shows its scalability with respect to the involved features and the cardinality of the trajectory dataset.
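
The abstract does not reveal CTM's internals, but the idea of coarse-grained patterns over a multidimensional space can be illustrated with a toy sketch: discretize each trajectory into coarse cells over space, time, and one extra feature (here, means of transport), then group objects that share enough cells. All names, parameters, and the grouping rule are illustrative assumptions, not CTM's actual algorithm or API.

```python
from collections import defaultdict

def coarse_cells(trajectory, cell_size=10.0, time_bin=3600):
    # Map each (x, y, t, mode) point to a coarse multidimensional cell.
    return {(int(x // cell_size), int(y // cell_size), int(t // time_bin), mode)
            for x, y, t, mode in trajectory}

def shared_pattern_groups(trajectories, min_shared=2):
    # Index object ids by the coarse cells they visit; objects sharing
    # at least `min_shared` cells form a candidate behavioral pattern.
    cell_index = defaultdict(set)
    for obj_id, traj in trajectories.items():
        for cell in coarse_cells(traj):
            cell_index[cell].add(obj_id)
    pair_counts = defaultdict(int)
    for objs in cell_index.values():
        ordered = sorted(objs)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pair_counts[(ordered[i], ordered[j])] += 1
    return {pair for pair, c in pair_counts.items() if c >= min_shared}

# Objects "a" and "b" move together by bus; "c" is elsewhere by car.
trajs = {
    "a": [(1, 1, 100, "bus"), (12, 1, 4000, "bus")],
    "b": [(2, 2, 200, "bus"), (13, 2, 4100, "bus")],
    "c": [(50, 50, 100, "car")],
}
print(shared_pattern_groups(trajs))  # {('a', 'b')}
```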

    Subgroup discovery for structured target concepts

    The main object of study in this thesis is subgroup discovery, a theoretical framework for finding subgroups in data (i.e., named sub-populations) whose behaviour with respect to a specified target concept is exceptional when compared to the rest of the dataset. This is a powerful tool that conveys crucial information to a human audience, but despite past advances it has been limited to simple target concepts. In this work we propose algorithms that bring this framework to novel application domains. We introduce the concept of representative subgroups, which we use not only to ensure the fairness of a sub-population with regard to a sensitive trait, such as race or gender, but also to go beyond known trends in the data. For entities with additional relational information that can be encoded as a graph, we introduce a novel measure of robust connectedness which improves on established alternative measures of density; we then provide a method that uses this measure to discover which named sub-populations are better connected. Our contributions within subgroup discovery culminate in the introduction of kernelised subgroup discovery: a novel framework that enables the discovery of subgroups on i.i.d. target concepts with virtually any kind of structure. Importantly, our framework additionally provides a concrete and efficient tool that works out of the box without any modification, apart from specifying the Gramian of a positive definite kernel. For use within kernelised subgroup discovery, but also in any other kind of kernel method, we additionally introduce a novel random walk graph kernel. Our kernel allows fine-tuning of the alignment between the vertices of the two compared graphs while counting the random walks, and we also propose meaningful structure-aware vertex labels to utilise this new capability.
With these contributions we thoroughly extend the applicability of subgroup discovery and ultimately re-define it as a kernel method.
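
The thesis's structured-target extensions are not reproduced here, but classical subgroup discovery ranks candidate subgroups by a quality measure over a simple target; a standard such measure is Weighted Relative Accuracy (WRAcc), sketched below for a binary target. This is the textbook measure, shown only to make the framework concrete.

```python
def wracc(subgroup_mask, target):
    # Weighted Relative Accuracy: coverage of the subgroup times the lift
    # of its target rate over the target rate of the whole dataset.
    n = len(target)
    covered = [t for m, t in zip(subgroup_mask, target) if m]
    if not covered:
        return 0.0
    coverage = len(covered) / n
    return coverage * (sum(covered) / len(covered) - sum(target) / n)

# Overall target rate is 4/6, but inside the subgroup (coverage 1/2) it is 1.0:
mask   = [1, 1, 1, 0, 0, 0]
target = [1, 1, 1, 0, 0, 1]
print(round(wracc(mask, target), 4))  # 0.1667
```

Exceptional subgroups score far from zero in either direction; a mask covering the whole dataset always scores exactly zero.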

    2017 GREAT Day Program

    SUNY Geneseo’s Eleventh Annual GREAT Day.
    https://knightscholar.geneseo.edu/program-2007/1011/thumbnail.jp

    MBA: Market Basket Analysis Using Frequent Pattern Mining Techniques

    Market Basket Analysis (MBA) is a data mining technique that uses frequent pattern mining algorithms to discover patterns of co-occurrence among items that are frequently purchased together. It is commonly used in retail and e-commerce businesses to generate association rules that describe the relationships between different items, and to make recommendations to customers based on their previous purchases. MBA is a powerful tool for identifying patterns of co-occurrence and generating insights that can improve sales and marketing strategies. Although numerous works have been carried out to reduce the computational cost of discovering frequent itemsets, the problem still needs further exploration and development. In this paper, we introduce an efficient Bitwise-Based data structure technique for mining frequent patterns in large-scale databases. The algorithm scans the original database only once, using Bitwise-Based data representations together with a vertical database layout, in contrast to the well-known Apriori and FP-Growth algorithms. The Bitwise-Based technique avoids multiple passes over the original database and hence minimizes execution time. Extensive experiments have been carried out to validate our technique, which outperforms Apriori, Eclat, FP-Growth, and H-mine in terms of execution time for Market Basket Analysis.
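
The paper's exact data structure is not given in the abstract, but the general idea it builds on can be sketched: an Eclat-style vertical layout where each item's transaction list is stored as a bitmap, so support counting is a popcount and extending an itemset is a single bitwise AND, all after one scan of the database. This is a generic illustration under those assumptions, not the authors' implementation.

```python
from itertools import combinations

def mine_frequent(transactions, min_support):
    # One scan builds the vertical layout: item -> bitmap of transaction ids.
    bitmaps = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    frequent = {frozenset([i]): b for i, b in bitmaps.items()
                if bin(b).count("1") >= min_support}
    result = dict(frequent)
    # Grow itemsets level by level; support of a union is a bitwise AND.
    while frequent:
        nxt = {}
        for (a, ba), (b, bb) in combinations(frequent.items(), 2):
            u, bu = a | b, ba & bb
            if len(u) == len(a) + 1 and bin(bu).count("1") >= min_support:
                nxt[u] = bu
        frequent = nxt
        result.update(nxt)
    return {s: bin(b).count("1") for s, b in result.items()}

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread"}, {"milk", "eggs"}]
supports = mine_frequent(baskets, min_support=2)
print(supports[frozenset({"milk", "bread"})])  # 2
```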

    Fraction-score: a generalized support measure for weighted and maximal co-location pattern mining

    Co-location patterns, which capture the phenomenon that objects with certain labels are often located in close geographic proximity, are defined based on a support measure which quantifies the prevalence of a pattern candidate in the form of a label set. Existing support measures share the idea of counting the number of instances of a given label set C as its support, where an instance of C is an object set whose objects collectively carry all labels in C and are located close to one another. However, they suffer from various weaknesses, e.g., they fail to capture all possible instances, or overlook the cases when multiple instances overlap. In this paper, we propose a new measure called Fraction-Score which counts instances fractionally if they overlap. Fraction-Score captures all possible instances, and handles the cases where instances overlap appropriately (so that the supports defined are more meaningful and anti-monotonic). We develop efficient algorithms to solve the co-location pattern mining problem defined with Fraction-Score. Furthermore, to obtain representative patterns, we develop an efficient algorithm for mining the maximal co-location patterns, which are those patterns without proper superset patterns. We conduct extensive experiments using real and synthetic datasets, which verify the superiority of our proposals.
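
The precise definition of Fraction-Score is in the paper; the toy sketch below only illustrates the fractional-counting idea it motivates: an object shared by k overlapping instances contributes 1/k, so overlapping instances no longer inflate the support. The discounting rule used here is a deliberate simplification, not the paper's measure.

```python
from itertools import product
from math import dist

def instances(objects, labels, radius):
    # Instances of label set `labels`: one object per label, pairwise close.
    pools = [[o for o in objects if o[0] == lab] for lab in labels]
    for combo in product(*pools):
        pts = [c[1] for c in combo]
        if all(dist(p, q) <= radius for p in pts for q in pts):
            yield combo

def fractional_support(objects, labels, radius):
    # Toy overlap-aware support: an object shared by k instances
    # contributes 1/k, and an instance scores its cheapest member.
    inst = list(instances(objects, labels, radius))
    count = {}
    for combo in inst:
        for obj in combo:
            count[obj] = count.get(obj, 0) + 1
    return sum(min(1 / count[obj] for obj in combo) for combo in inst)

# One A object shared by two overlapping (A, B) instances: support is 1.0
# rather than the 2.0 that plain instance counting would report.
objs = [("A", (0.0, 0.0)), ("B", (1.0, 0.0)), ("B", (0.0, 1.0))]
print(fractional_support(objs, ["A", "B"], radius=1.5))  # 1.0
```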

    Measuring the impact of COVID-19 on hospital care pathways

    Care pathways in hospitals around the world reported significant disruption during the recent COVID-19 pandemic but measuring the actual impact is more problematic. Process mining can be useful for hospital management to measure the conformance of real-life care to what might be considered normal operations. In this study, we aim to demonstrate that process mining can be used to investigate process changes associated with complex disruptive events. We studied perturbations to accident and emergency (A&E) and maternity pathways in a UK public hospital during the COVID-19 pandemic. Coincidentally, the hospital had implemented a Command Centre approach for patient-flow management, affording an opportunity to study both the planned improvement and the disruption due to the pandemic. Our study proposes and demonstrates a method for measuring and investigating the impact of such planned and unplanned disruptions affecting hospital care pathways. We found that during the pandemic, both A&E and maternity pathways had measurable reductions in the mean length of stay and a measurable drop in the percentage of pathways conforming to normative models. There were no distinctive patterns of monthly mean values of length of stay nor conformance throughout the phases of the installation of the hospital’s new Command Centre approach. Due to a deficit in the available A&E data, the findings for A&E pathways could not be interpreted.
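
The study's event logs and normative models are not included in the abstract, but the conformance-rate idea can be shown minimally, with the normative model reduced to a hypothetical set of allowed directly-follows transitions; the activity names and model below are invented for illustration only.

```python
def conforms(trace, allowed_follows, start, end):
    # A trace conforms if it starts and ends correctly and every step is
    # an allowed directly-follows transition in the normative model.
    if not trace or trace[0] != start or trace[-1] != end:
        return False
    return all((a, b) in allowed_follows for a, b in zip(trace, trace[1:]))

def conformance_rate(traces, allowed_follows, start, end):
    # Fraction of observed pathways that conform to the normative model.
    ok = sum(conforms(t, allowed_follows, start, end) for t in traces)
    return ok / len(traces)

# Hypothetical miniature A&E model: arrive -> triage -> treat -> discharge,
# with an optional triage -> discharge shortcut.
model = {("arrive", "triage"), ("triage", "treat"),
         ("treat", "discharge"), ("triage", "discharge")}
traces = [
    ["arrive", "triage", "treat", "discharge"],
    ["arrive", "triage", "discharge"],
    ["arrive", "treat", "discharge"],        # skips triage: non-conforming
]
print(round(conformance_rate(traces, model, "arrive", "discharge"), 3))  # 0.667
```

Tracking this rate month by month, as the study does, turns a qualitative sense of "disruption" into a measurable drop in conformance.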

    CFLCA: High Performance based Heart disease Prediction System using Fuzzy Learning with Neural Networks

    Human diseases are increasing rapidly in today’s generation, mainly due to lifestyle factors such as poor diet, lack of exercise, and drug and alcohol consumption. The most widespread of these is heart disease, which is directly or indirectly implicated in around 80% of deaths, and the number of people dying of heart disease is expected to keep rising over roughly the next ten years. For these reasons, many researchers have proposed remedies and data analysis technologies for diagnosing heart disease from the plentiful medical data related to it. The field of medicine regularly receives a very wide range of medical data in the form of text, images, audio, video, signal packets, etc. These databases contain raw data that are often inconsistent and redundant. The health care system is undoubtedly very rich in stored data but at the same time very poor at extracting knowledge from it. Data mining (DM) methods can help extract valuable knowledge by applying DM techniques such as clustering, regression, segmentation, and classification. As collected datasets become larger and more complex, data mining and clustering algorithms (decision trees, neural networks, k-means, etc.) are used. To improve accuracy and precision, we propose the Cognitive Fuzzy Learning based Clustering Algorithm (CFLCA). The CFLCA methodology creates advanced meta-indexing for n-dimensional unstructured data. The heart disease dataset from the UCI machine learning repository, after data enrichment and feature engineering, attains a high accuracy and prediction rate. The proposed CFLCA algorithm achieves high accuracy, precision, and recall in data analysis for heart disease detection.

    Design and Evaluation of Parallel and Scalable Machine Learning Research in Biomedical Modelling Applications

    The use of Machine Learning (ML) techniques in the medical field is not a new occurrence and several papers describing research in that direction have been published. This research has helped in analysing medical images, creating responsive cardiovascular models, and predicting outcomes for medical conditions among many other applications. This Ph.D. aims to apply such ML techniques for the analysis of Acute Respiratory Distress Syndrome (ARDS), which is a severe condition that affects around 1 in 10,000 patients worldwide every year with life-threatening consequences. We employ previously developed mechanistic modelling approaches such as the “Nottingham Physiological Simulator,” through which better understanding of ARDS progression can be gleaned, and take advantage of the growing volume of medical datasets available for research (i.e., “big data”) and the advances in ML to develop, train, and optimise the modelling approaches. Additionally, the onset of the COVID-19 pandemic while this Ph.D. research was ongoing provided a similar application field to ARDS, and made further ML research in medical diagnosis applications possible. Finally, we leverage the available Modular Supercomputing Architecture (MSA) developed as part of the Dynamical Exascale Entry Platform - Extreme Scale Technologies (DEEP-EST) EU Project to scale up and speed up the modelling processes. This Ph.D. project is one element of the Smart Medical Information Technology for Healthcare (SMITH) project, wherein the thesis research can be validated by clinical and medical experts (e.g. Uniklinik RWTH Aachen).

    Analysis of Student Behavior and Score Prediction in Assistments Online Learning

    Understanding and analyzing student behavior is paramount in enhancing online learning, and this thesis delves into the subject by presenting an in-depth analysis of student behavior and score prediction in the ASSISTments online learning platform. We used data from the EDM Cup 2023 Kaggle Competition to answer four key questions. First, we explored how students seeking hints and explanations affect their performance in assignments, shedding light on the role of guidance in learning. Second, we looked at the connection between students mastering specific skills and their performance in related assignments, giving insights into the effectiveness of curriculum alignment. Third, we identified important features from student activity data to improve grade prediction, helping identify at-risk students early and monitor their progress. Lastly, we used graph representation learning to understand complex relationships in the data, leading to more accurate predictive models. This research enhances our understanding of data mining in online learning, with implications for personalized learning and support mechanisms.
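
As a minimal illustration of the feature-screening step (not the competition pipeline), one can correlate a per-student activity aggregate, such as the number of hints requested, with assignment scores; the feature names and numbers below are made up for the sketch.

```python
def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length feature columns.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-student aggregates: hints requested vs. assignment score.
hints  = [0, 1, 2, 5, 8]
scores = [0.95, 0.90, 0.80, 0.60, 0.40]
print(round(pearson(hints, scores), 3))  # strongly negative in this toy data
```

Features with correlations far from zero are candidates to keep for the grade-prediction model; in practice one would also guard against confounders and multiple testing.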

    Learning high-order interactions for polygenic risk prediction

    Within the framework of precision medicine, the stratification of individual genetic susceptibility based on inherited DNA variation has paramount relevance. However, one of the most relevant pitfalls of traditional Polygenic Risk Score (PRS) approaches is their inability to model complex high-order non-linear SNP-SNP interactions and their effect on the phenotype (e.g. epistasis). Indeed, they incur a computational challenge as the number of possible interactions grows exponentially with the number of SNPs considered, which also affects the statistical reliability of the model parameters. In this work, we address this issue by proposing a novel PRS approach, called High-order Interactions-aware Polygenic Risk Score (hiPRS), that incorporates high-order interactions in modeling polygenic risk. hiPRS combines an interaction-search routine based on frequent itemset mining and a novel interaction-selection algorithm based on Mutual Information to construct a simple and interpretable weighted model of user-specified dimensionality that can predict a given binary phenotype. Compared to traditional PRS methods, hiPRS relies neither on GWAS summary statistics nor on any external information. Moreover, hiPRS differs from Machine Learning-based approaches that can include complex interactions in that it provides a readable and interpretable model and is able to control overfitting, even on small samples. In the present work we demonstrate through a comprehensive simulation study the superior performance of hiPRS w.r.t. state-of-the-art methods, both in terms of scoring performance and interpretability of the resulting model. We also test hiPRS against small sample size, class imbalance and the presence of noise, showcasing its robustness to extreme experimental settings. Finally, we apply hiPRS to a case study on real data from the DACHS cohort, defining an interaction-aware scoring model to predict mortality of stage II-III colorectal cancer patients treated with oxaliplatin.
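
hiPRS combines frequent-itemset search with Mutual Information based selection; the simplified sketch below shows only those two ingredients, scoring every SNP pair (order-2 interactions only, on binarized genotypes) by its MI with a binary phenotype and keeping the top candidates. It illustrates the idea, not the published algorithm, and the data are synthetic.

```python
from itertools import combinations
from math import log2

def mutual_information(feature, phenotype):
    # MI (in bits) between a binary interaction feature and a binary phenotype.
    n = len(feature)
    mi = 0.0
    for f in (0, 1):
        for p in (0, 1):
            joint = sum(1 for x, y in zip(feature, phenotype)
                        if x == f and y == p) / n
            pf = sum(1 for x in feature if x == f) / n
            pp = sum(1 for y in phenotype if y == p) / n
            if joint > 0:
                mi += joint * log2(joint / (pf * pp))
    return mi

def top_interactions(genotypes, phenotype, order=2, k=3):
    # Score every SNP combination of the given order by MI with the
    # phenotype; the interaction feature is the AND of the member SNPs.
    n_snps = len(genotypes[0])
    scored = []
    for combo in combinations(range(n_snps), order):
        feature = [int(all(row[i] for i in combo)) for row in genotypes]
        scored.append((mutual_information(feature, phenotype), combo))
    return sorted(scored, reverse=True)[:k]

# Synthetic epistasis: the phenotype is exactly SNP0 AND SNP1.
genotypes = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
phenotype = [1, 1, 0, 0, 0, 1]
print(top_interactions(genotypes, phenotype)[0][1])  # (0, 1)
```

The real method also controls the interaction search with frequent-itemset support thresholds and builds a weighted predictive score from the selected interactions.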