
    Computational Analyses of Metagenomic Data

    Metagenomics studies the collective microbial genomes extracted from a particular environment without requiring the culturing or isolation of individual genomes, addressing questions revolving around the composition, functionality, and dynamics of microbial communities. The intrinsic complexity of metagenomic data and the diversity of applications call for efficient and accurate computational methods for data handling. In this thesis, I present three primary projects that collectively focus on the computational analysis of metagenomic data, each addressing a distinct topic. In the first project, I designed and implemented an algorithm named Mapbin for reference-free genomic binning of metagenomic assemblies. Binning aims to group a mixture of genomic fragments according to their genomes of origin. Mapbin enhances binning results by building a multilayer network that combines the initial binning, the assembly graph, and read-pairing information from paired-end sequencing data. The network is then partitioned by the community-detection algorithm Infomap to yield a new binning result. Mapbin was tested on multiple simulated and real datasets, and the results indicated an overall improvement in common binning-quality metrics. The second and third projects are both derived from ImMiGeNe, a collaborative and multidisciplinary study investigating the interplay between gut microbiota, host genetics, and immunity in stem-cell transplantation (SCT) patients. In the second project, I conducted microbiome analyses of the metagenomic data. The workflow included the removal of contaminant reads and multiple taxonomic and functional profiling steps. The results revealed that the SCT recipients' samples yielded significantly fewer reads and were heavily contaminated with host DNA, and that their microbiomes displayed evident signs of dysbiosis. Finally, I discussed several inherent challenges posed by the extremely low levels of target DNA and high levels of contamination in the recipient samples, which cannot be rectified solely through bioinformatics approaches. The primary goal of the third project was to design a set of primers covering the bacterial flagellin genes present in the human gut microbiota. Considering the notable diversity of flagellins, I combined a method for selecting representative bacterial flagellin gene sequences, a heuristic approach based on established primer-design methods for generating a degenerate primer set, and a selection method for filtering out genes unlikely to occur in the human gut microbiome. As a result, I curated a reduced yet representative set of primers that is practical for experimental implementation.
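    To make the core idea concrete, here is a minimal sketch of multilayer-network binning refinement: three evidence layers are merged into a single weighted contig graph, which is then re-partitioned with community detection. This is an illustration only, not Mapbin's implementation; the contig names, edges, and layer weights are hypothetical, and networkx's greedy modularity communities stand in for the Infomap algorithm the thesis actually uses.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def _bump(g, u, v, w):
    """Accumulate edge weight across evidence layers."""
    old = g.get_edge_data(u, v, default={}).get("weight", 0.0)
    g.add_edge(u, v, weight=old + w)

def refine_bins(contigs, assembly_edges, readpair_edges, initial_bins,
                w_asm=1.0, w_pair=0.5, w_bin=2.0):
    g = nx.Graph()
    g.add_nodes_from(contigs)
    for u, v in assembly_edges:          # layer 1: assembly-graph adjacency
        _bump(g, u, v, w_asm)
    for u, v in readpair_edges:          # layer 2: read-pairing links
        _bump(g, u, v, w_pair)
    for u in contigs:                    # layer 3: shared initial bin
        for v in contigs:
            bu, bv = initial_bins.get(u), initial_bins.get(v)
            if u < v and bu is not None and bu == bv:
                _bump(g, u, v, w_bin)
    # Re-partition the combined network into refined bins.
    return [set(c) for c in greedy_modularity_communities(g, weight="weight")]

bins = refine_bins(contigs=["c1", "c2", "c3", "c4"],
                   assembly_edges=[("c1", "c2"), ("c3", "c4")],
                   readpair_edges=[("c1", "c2"), ("c2", "c3")],
                   initial_bins={"c1": 0, "c2": 0, "c3": 1, "c4": 1})
print(bins)
```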

    Configuration Management of Distributed Systems over Unreliable and Hostile Networks

    Economic incentives of large criminal profits and the threat of legal consequences have pushed criminals to continuously improve their malware, especially command and control channels. This thesis applied concepts from successful malware command and control to explore the survivability and resilience of benign configuration management systems. This work expands on existing stage models of the malware life cycle to contribute a new model for identifying malware concepts applicable to benign configuration management. The Hidden Master architecture is a contribution to master-agent network communication. In the Hidden Master architecture, communication between master and agent is asynchronous and can operate through intermediate nodes. This protects the master secret key, which grants full control of all computers participating in configuration management. Multiple improvements to idempotent configuration were proposed, including the definition of a minimal base resource dependency model, simplified resource revalidation, and the use of an imperative general-purpose language for defining idempotent configuration. Following the constructive research approach, the improvements to configuration management were designed into two prototypes. This allowed validation in laboratory testing, in two case studies, and in expert interviews. In laboratory testing, the Hidden Master prototype was more resilient than leading configuration management tools under high load and low memory conditions, and against packet loss and corruption. Owing to the asynchronous nature of the Hidden Master architecture, only the research prototype was adaptable to a network without stable topology. The main case study used the research prototype in a complex environment to deploy a multi-room, authenticated audiovisual system for a client of an organization deploying the configuration. The case studies indicated that an imperative general-purpose language can be used for idempotent configuration in real life, both for defining new configurations in unexpected situations using the base resources and for abstracting those using standard language features, and that such a system seems easy to learn. Potential business benefits were identified and evaluated using individual semi-structured expert interviews. Respondents agreed that the models and the Hidden Master architecture could reduce costs and risks, improve developer productivity, and allow faster time-to-market. Protection of master secret keys and the reduced need for incident response were seen as key drivers for improved security. Low-cost geographic scaling and leveraging the file-serving capabilities of commodity servers were seen to improve scaling and resiliency. Respondents identified jurisdictional legal limitations on encryption and requirements for cloud operator auditing as factors potentially limiting the full use of some concepts.
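    The essence of idempotent configuration in an imperative general-purpose language is easy to sketch: a resource reads the current system state and acts only when it differs from the desired state, so repeated runs converge and report no change. The snippet below is a minimal illustration of this idea, not the thesis prototype.

```python
import os

def file_resource(path, content, mode=0o644):
    """Idempotent 'file' resource: converge the file at `path` to the
    desired content and permissions; do nothing if already converged."""
    changed = False
    current = None
    if os.path.exists(path):
        with open(path) as f:
            current = f.read()
    if current != content:               # state differs: act
        with open(path, "w") as f:
            f.write(content)
        changed = True
    if os.stat(path).st_mode & 0o777 != mode:
        os.chmod(path, mode)
        changed = True
    return changed                       # a second run returns False

print(file_resource("/tmp/motd", "managed by config tool\n"))  # True
print(file_resource("/tmp/motd", "managed by config tool\n"))  # False
```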

    An Approach to the Specification and Generation of Production Processes Based on Model-Driven Engineering

    In this thesis, we present an approach to production process specification and generation based on the model-driven paradigm, with the goal of increasing the flexibility of factories and responding more efficiently to the challenges that emerged in the era of Industry 4.0. To formally specify production processes and their variations in the Industry 4.0 environment, we created a novel domain-specific modeling language whose models are machine-readable. The language can be used to model both production processes that are independent of any production system, enabling the process models to be reused across different production systems, and process models tailored to a specific production system. To automatically transform production process models that depend on a specific production system into instructions to be executed by production system resources, we created an instruction generator. We also created generators for manufacturing documentation, which automatically transform production process models into manufacturing documents of different types. The proposed approach, domain-specific modeling language, and software solution contribute to introducing factories into the digital transformation process. As factories must rapidly adapt to new products and their variations in the era of Industry 4.0, production must be dynamically managed and instructions must be automatically sent to factory resources, depending on the products to be created on the shop floor. The proposed approach contributes to the creation of such a dynamic environment in contemporary factories, as it makes it possible to automatically generate instructions from process models and send them to resources for execution. Additionally, as there are numerous different products and variations, keeping the required manufacturing documentation up to date becomes challenging; the proposed approach automates this task and thus significantly reduces process designers' time.
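    As a rough illustration of what an instruction generator does, the sketch below walks a machine-readable process model and emits one instruction per step for the resource that executes it. Every field name and the instruction format are hypothetical; the thesis's domain-specific language and generators are far richer.

```python
# A hypothetical, system-specific process model as plain data.
process_model = {
    "process": "assemble-gearbox",
    "steps": [
        {"operation": "pick", "part": "housing", "resource": "robot-1"},
        {"operation": "screw", "part": "cover", "resource": "robot-2",
         "params": {"torque_nm": 12}},
    ],
}

def generate_instructions(model):
    """Transform a process model into flat, per-resource instructions."""
    for i, step in enumerate(model["steps"], 1):
        params = ";".join(f"{k}={v}" for k, v in step.get("params", {}).items())
        text = f"{i:03d} {step['operation'].upper()} {step['part']} {params}"
        yield step["resource"], text.strip()

for resource, instruction in generate_instructions(process_model):
    print(f"{resource} <- {instruction}")   # send to the resource for execution
```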

    Structuring the State’s Voice of Contention in Harmonious Society: How Party Newspapers Cover Social Protests in China

    During the Chinese Communist Party’s (CCP) campaign of building a ‘harmonious society’, how do the official newspapers cover instances of social contention on the ground? Answering this question sheds light not only on how the party press works but also on how the state and society interact in today’s China. This thesis conceptualises the phenomenon with a multi-faceted and multi-levelled notion of a ‘state-initiated contentious public sphere’ to capture the complexity of the mediated relations between the state and social contention in the party press. Adopting a relational approach, the thesis analyses 1,758 news reports of ‘mass incidents’ in the People’s Daily and the Guangming Daily between 2004 and 2020, employing cluster analysis, qualitative comparative analysis, and social network analysis. The thesis finds significant differences in the patterns of contentious coverage in the party press at the event and province levels, and an uneven distribution of attention to social contention across incidents and regions. For ‘reported regions’, the thesis distinguishes four types of coverage and shows how the party press responds differently to social contention in different scenarios at the provincial level. For ‘identified incidents’, the thesis distinguishes a cumulative type of visibility, based on the quantity of coverage, from a relational visibility, based on the structure emerging from coverage, and explains how different news-making rationales determine whether incidents receive similar amounts of coverage or occupy similar positions within coverage. Eventually, by demonstrating how the Chinese state strategically uses the party press to respond to social contention and how social contention is journalistically placed in different positions in the state’s eyes, this thesis argues that what social contention leads to is the establishment of complex state-contention relations channelled through the party press.

    Online semi-supervised learning in non-stationary environments

    Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and balanced data, immediately or after some delay, to extract worthwhile knowledge from continuous and rapid data streams. However, in many real-world applications such as robotics, weather monitoring, fraud detection systems, cyber security, and computer network traffic flow, enormous amounts of high-speed data are generated by Internet of Things sensors and real-time Internet sources. Manually labelling these data streams is not practical because of the time it consumes and the domain expertise it requires. Another challenge is learning under Non-Stationary Environments (NSEs), which occur due to changes in the data distributions of a set of input variables and/or class labels. The problem of Extreme Verification Latency (EVL) under NSEs is referred to as an Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms have no direct access to the true class labels when the concept evolves. Several approaches exist that deal with NSE and EVL in isolation; however, few algorithms address both issues simultaneously. This research directly responds to the ILNSE challenge by proposing two novel algorithms: the “Predictor for Streaming Data with Scarce Labels” (PSDSL) and the Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label-scarcity issues in online machine learning. The key capabilities of PSDSL include learning from a small amount of labelled data in an incremental or online manner and being available to predict at any time. To achieve this, PSDSL utilises both labelled and unlabelled data to train its prediction models, meaning it continuously learns from incoming data and updates the model as new labelled or unlabelled data becomes available over time. Furthermore, it can predict under NSE conditions even when class labels are scarce. PSDSL is built on top of the HDWM classifier, which preserves the diversity of the classifiers; PSDSL and HDWM can intelligently switch and adapt to the conditions. PSDSL switches among self-learning, micro-clustering, and CGC learning states, whichever approach is beneficial, based on the characteristics of the data stream. HDWM makes use of “seed” learners of different types in an ensemble to maintain its diversity; an ensemble is simply a combination of predictive models grouped to improve on the predictive performance of a single classifier. PSDSL was empirically evaluated against COMPOSE, LEVELIW, SCARGC, and MClassification on benchmark NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than existing approaches on most real-time data streams, including randomised data instances. PSDSL also performed significantly better than a ‘Static’ baseline, i.e. a classifier that is not updated after being trained on the first examples in the data stream. When applied to MOA-generated data streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC, while SCARGC performed the same as the Static baseline. PSDSL achieved better average prediction accuracies in a shorter time than SCARGC. The HDWM algorithm was evaluated on artificial and real-world data streams against existing well-known approaches such as the heterogeneous Weighted Majority Algorithm (WMA) and the homogeneous Dynamic Weighted Majority (DWM) algorithm.
    The results showed that HDWM performed significantly better than WMA and DWM. Moreover, when recurring concept drifts were present, the predictive performance of HDWM showed an improvement over DWM. In both drift and real-world streams, significance tests and post hoc comparisons found significant differences between the algorithms: HDWM performed significantly better than DWM and WMA when applied to MOA data streams and four real-world datasets (Electric, Spam, Sensor, and Forest Cover). The seeding mechanism and dynamic inclusion of new base learners in the HDWM algorithm benefit from both forgetting and retaining models. The algorithm also independently selects the optimal base classifier in its ensemble depending on the problem. A new approach, Envelope-Clustering, is introduced to resolve cluster-overlap conflicts during the cluster-labelling process. In this process, PSDSL transforms the centroid information of micro-clusters into micro-instances and generates new clusters called Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and successfully guide the cluster-labelling process after concept drifts in the absence of true class labels. PSDSL was evaluated on the real-world ‘keystroke dynamics’ problem, achieving higher prediction accuracy (85.3%) than SCARGC (81.6%), while the Static baseline (49.0%) degraded significantly due to changes in users’ typing patterns. Furthermore, the predictive accuracy of SCARGC fluctuated widely (from 41.1% to 81.6%) depending on the value of the parameter ‘k’ (the number of clusters), whereas PSDSL automatically determines the best value for this parameter.
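    To make the weighted-majority mechanics concrete, here is a minimal sketch of a Dynamic-Weighted-Majority-style ensemble with heterogeneous seed learners, loosely in the spirit of HDWM. It is a simplified illustration, not the thesis algorithm: the parameter values are arbitrary, and scikit-learn's GaussianNB and SGDClassifier stand in for the "seed" learners.

```python
import numpy as np
from itertools import cycle
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier

class HeterogeneousDWM:
    def __init__(self, seeds, classes, beta=0.5, theta=0.01, period=10):
        self.seeds = cycle(seeds)        # rotate seed types for diversity
        self.classes = list(classes)
        self.beta, self.theta, self.period = beta, theta, period
        self.experts = []                # list of [model, weight] pairs
        self.t = 0

    def _add_expert(self, X, y):
        model = clone(next(self.seeds))
        model.partial_fit(X, y, classes=self.classes)
        self.experts.append([model, 1.0])

    def predict(self, X):
        votes = {c: 0.0 for c in self.classes}
        for model, w in self.experts:
            votes[model.predict(X)[0]] += w
        return max(votes, key=votes.get)

    def update(self, X, y):
        """Process one labelled example; X has shape (1, n_features)."""
        self.t += 1
        if not self.experts:
            self._add_expert(X, [y])
            return
        if self.t % self.period == 0:
            for e in self.experts:       # demote experts that misclassify
                if e[0].predict(X)[0] != y:
                    e[1] *= self.beta
            top = max(w for _, w in self.experts)
            self.experts = [[m, w / top] for m, w in self.experts
                            if w / top >= self.theta]  # prune weak experts
            if self.predict(X) != y:     # ensemble still wrong: add a seed
                self._add_expert(X, [y])
        for model, _ in self.experts:    # train every expert incrementally
            model.partial_fit(X, [y])

ens = HeterogeneousDWM([GaussianNB(), SGDClassifier()], classes=[0, 1])
rng = np.random.default_rng(0)
for _ in range(300):
    x = rng.normal(size=(1, 3))
    ens.update(x, int(x[0, 0] > 0))
print(ens.predict(np.array([[2.0, 0.0, 0.0]])))  # likely predicts 1
```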

    Fairness-aware Machine Learning in Educational Data Mining

    Fairness is an essential requirement of every educational system, and it is reflected in a variety of educational activities. With the extensive use of Artificial Intelligence (AI) and Machine Learning (ML) techniques in education, researchers and educators can analyze educational (big) data and propose new (technical) methods to support teachers, students, or administrators of (online) learning systems in the organization of teaching and learning. Educational data mining (EDM) is the result of applying and developing data mining (DM) and ML techniques to deal with educational problems, such as student performance prediction and student grouping. However, ML-based decisions in education can be based on protected attributes, such as race or gender, leading to discrimination against individual students or subgroups of students. Therefore, ensuring fairness in ML models also contributes to equity in educational systems. On the other hand, bias can also appear in the data obtained from learning environments. Hence, bias-aware exploratory educational data analysis is important to support unbiased decision-making in EDM. In this thesis, we address the aforementioned issues and propose methods that mitigate the discriminatory outcomes of ML algorithms in EDM tasks. Specifically, we make the following contributions. We perform bias-aware exploratory analysis of educational datasets using Bayesian networks to identify the relationships among attributes and understand bias in the datasets; we focus the exploratory data analysis on features having a direct or indirect relationship with the protected attributes w.r.t. prediction outcomes. We perform a comprehensive evaluation of the sufficiency of various group fairness measures in predictive models for student performance prediction problems; a variety of experiments on various educational datasets with different fairness measures provide users with a broad view of unfairness from diverse aspects. We deal with the student grouping problem in collaborative learning: we introduce the fair-capacitated clustering problem, which takes into account cluster fairness and cluster cardinalities, and propose two approaches, hierarchical clustering and partitioning-based clustering, to obtain fair-capacitated clusterings. We introduce the multi-fair capacitated (MFC) students-topics grouping problem, which satisfies students' preferences while ensuring balanced group cardinalities and maximizing the diversity of members with respect to the protected attribute; we propose three approaches: a greedy heuristic approach, a knapsack-based approach using a vanilla maximal 0-1 knapsack formulation, and an MFC knapsack approach based on a group fairness knapsack formulation. In short, the findings described in this thesis demonstrate the importance of fairness-aware ML in educational settings. We show that bias-aware data analysis, fairness measures, and fairness-aware ML models are essential to ensure fairness in EDM and the educational environment.
    Funding: Ministry of Science and Culture of Lower Saxony/LernMINT/51410078/E
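    As a small illustration of the kind of group fairness measure evaluated in this thesis, the sketch below computes the statistical (demographic) parity difference between two groups defined by a binary protected attribute. The data are fabricated for demonstration; the thesis evaluates a much broader set of measures on real educational datasets.

```python
import numpy as np

def statistical_parity_difference(y_pred, protected):
    """P(y_hat = 1 | protected = 0) - P(y_hat = 1 | protected = 1).
    Zero means both groups receive positive predictions at the same rate."""
    y_pred, protected = np.asarray(y_pred), np.asarray(protected)
    return y_pred[protected == 0].mean() - y_pred[protected == 1].mean()

# Toy predictions for 8 students: 1 = predicted to pass, 0 = to fail.
y_pred    = [1, 1, 1, 0, 1, 0, 0, 0]
protected = [0, 0, 0, 0, 1, 1, 1, 1]    # e.g. a binarised gender attribute
print(statistical_parity_difference(y_pred, protected))  # 0.75 - 0.25 = 0.5
```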

    Analysis in Web 3D Environments of Thematic Research Networks on Immersive Learning through Variation of Clustering Criteria

    This project aims to develop a Web-based 3D visualization tool that provides a global understanding of the field of Immersive Learning. Thematic networks are a well-established approach for addressing such challenges; therefore, a systematic literature review was conducted to extract clustering methods and criteria for thematic networks. The methodology employed in this study is Design Science Research, which involved iterative development and evaluation of the visualization tool. Expert interviews were conducted to identify requirements, and rigorous methods, including recording, analyzing, and transcribing the interviews, were applied to ascertain the research's relevance. The tool utilizes a node-link visualization approach to represent immersive learning strategies, practices, and associated papers. Additionally, it offers a range of filtering functionalities, allowing users to filter by strategies, practices, authors, institutions, and more. Furthermore, the tool incorporates various clustering functionalities, such as community detection using the Louvain algorithm, with variable clustering criteria such as theme and paper association, paper citation, co-citation, and others. Users can also control the network's structure by modifying cluster size, the number of communities, and community colors.
    The tool features thematic-network exploratory methods for navigating the environment. By combining thematic networks with clustering and filtering capabilities, the tool aims to provide a global understanding of the scientific field. Its unique integration of Web and 3D technologies, along with exploratory methods, distinguishes it from existing visualization tools. The tool's powerful clustering algorithms, offering diverse criteria for understanding concept relationships, have the potential to make a significant impact on the immersive learning community. It is designed to serve as an innovative artifact that enhances the analytical capabilities of researchers, educators, and students in the field.
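    To illustrate one of the clustering criteria the tool exposes, the snippet below runs Louvain community detection (networkx's implementation) on a toy co-citation graph and varies the resolution to change cluster granularity, mirroring the tool's adjustable clustering controls. The paper names and edge weights are invented for the example.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy co-citation graph: papers are nodes; an edge weight counts how
# often two papers are cited together.
g = nx.Graph()
g.add_weighted_edges_from([
    ("paper-A", "paper-B", 4), ("paper-A", "paper-C", 3),
    ("paper-B", "paper-C", 5), ("paper-D", "paper-E", 2),
    ("paper-C", "paper-D", 1),
])

# Higher resolution favours more, smaller communities.
for resolution in (0.5, 1.0, 2.0):
    communities = louvain_communities(g, weight="weight",
                                      resolution=resolution, seed=42)
    print(resolution, [sorted(c) for c in communities])
```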

    Conformance Checking-based Concept Drift Detection in Process Mining

    One of the main challenges of process mining is to obtain models that represent a process as simply and accurately as possible. Both characteristics can be greatly influenced by changes in the control flow of the process throughout its life cycle. In this thesis, we propose the use of conformance metrics to monitor such changes in a way that allows the log to be divided into sub-logs representing different versions of the process over time. The validity of the hypothesis has been formally demonstrated, showing that all kinds of changes in the process flow can be captured using these approaches, including sudden and gradual drifts, in both clean and noisy environments, where differentiating between anomalous executions and real changes can be tricky.
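    The monitoring idea can be sketched in a few lines: score each trace with a conformance fitness metric (computed beforehand by any conformance checker; the input here is a hypothetical list of scores), slide a window over the scores, and cut the log wherever the windowed mean shifts beyond a threshold. This is a simplified illustration of the approach, not the detection algorithm developed in the thesis.

```python
import numpy as np

def detect_drifts(fitness_per_trace, window=50, delta=0.2):
    """Return trace indices where the log can be split into sub-logs.

    fitness_per_trace: one conformance fitness score per trace, in log
    order (e.g. token-based replay fitness against the reference model).
    """
    scores = np.asarray(fitness_per_trace, dtype=float)
    change_points = []
    ref = scores[:window].mean()        # behaviour of the current version
    i = window
    while i <= len(scores) - window:
        cur = scores[i:i + window].mean()
        if abs(cur - ref) > delta:      # conformance shifted: a drift
            change_points.append(i)
            ref = cur                   # start tracking the new version
            i += window                 # move past the detected change
        else:
            i += 1
    return change_points

# Synthetic example: fitness drops after trace 200 (a sudden drift).
scores = np.r_[np.full(200, 0.95), np.full(200, 0.70)]
print(detect_drifts(scores))            # -> [191], just before trace 200
```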

    Advances and Applications of DSmT for Information Fusion. Collected Works, Volume 5

    This fifth volume on Advances and Applications of DSmT for Information Fusion collects theoretical and applied contributions of researchers working in different fields of application and in mathematics, and is available in open access. The collected contributions of this volume have either been published or presented in international conferences, seminars, workshops, and journals after the dissemination of the fourth volume in 2015, or they are new. The contributions in each part of this volume are chronologically ordered. The first part of this book presents theoretical advances on DSmT, dealing mainly with modified Proportional Conflict Redistribution (PCR) rules of combination with degree of intersection, coarsening techniques, interval calculus for PCR thanks to set inversion via interval analysis (SIVIA), rough set classifiers, canonical decomposition of dichotomous belief functions, fast PCR fusion, fast inter-criteria analysis with PCR, and improved PCR5 and PCR6 rules preserving the (quasi-)neutrality of (quasi-)vacuous belief assignments in the fusion of sources of evidence, with their Matlab codes. Because more applications of DSmT have emerged since the appearance of the fourth volume in 2015, the second part of this volume covers selected applications of DSmT, mainly in building change detection, object recognition, quality of data association in tracking, perception in robotics, risk assessment for torrent protection and multi-criteria decision-making, multi-modal image fusion, coarsening techniques, recommender systems, levee characterization and assessment, human heading perception, trust assessment, robotics, biometrics, failure detection, GPS systems, inter-criteria analysis, group decision-making, human activity recognition, storm prediction, data association for autonomous vehicles, identification of maritime vessels, fusion of support vector machines (SVM), the Silx-Furtif RUST code library for information fusion (including PCR rules), and networks for ship classification. Finally, the third part presents contributions related to belief functions in general, published or presented over the years since 2015. These contributions concern decision-making under uncertainty, belief approximations, probability transformations, new distances between belief functions, non-classical multi-criteria decision-making problems with belief functions, generalization of Bayes' theorem, image processing, data association, entropy and cross-entropy measures, fuzzy evidence numbers, negators of belief mass, human activity recognition, information fusion for breast cancer therapy, imbalanced data classification, and hybrid techniques mixing deep learning with belief functions.
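    As a small, self-contained taste of the PCR rules discussed throughout the volume, here is a sketch of the classical two-source PCR5 combination on a toy frame of discernment: the conjunctive consensus is computed first, and each piece of conflicting mass is then redistributed back to the two focal elements that produced it, proportionally to their masses. The example mass assignments are invented, and real applications (and the optimized Matlab codes in the book) go well beyond this.

```python
from itertools import product

def pcr5(m1, m2):
    """Two-source PCR5 rule; m1, m2 map frozenset focal elements to masses."""
    out = {}
    for (x, mx), (y, my) in product(m1.items(), m2.items()):
        inter = x & y
        if inter:                        # conjunctive consensus part
            out[inter] = out.get(inter, 0.0) + mx * my
        else:                            # redistribute the conflict mx * my
            out[x] = out.get(x, 0.0) + mx * mx * my / (mx + my)
            out[y] = out.get(y, 0.0) + mx * my * my / (mx + my)
    return out

A, B, AB = frozenset("A"), frozenset("B"), frozenset("AB")
m1 = {A: 0.6, AB: 0.4}                   # source 1: mostly believes A
m2 = {B: 0.7, AB: 0.3}                   # source 2: mostly believes B
combined = pcr5(m1, m2)                  # masses still sum to 1
for focal, mass in sorted(combined.items(), key=lambda kv: -kv[1]):
    print("".join(sorted(focal)), round(mass, 4))
```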