87 research outputs found

    Synchronization Inspired Data Mining

    Advances in modern technologies produce huge amounts of data in various fields, increasing the need for efficient and effective data mining tools to uncover the information contained implicitly in the data. This thesis mainly aims to propose innovative and solid algorithms for data mining from a novel perspective: synchronization. Synchronization is a prevalent phenomenon in nature in which a group of events spontaneously come into co-occurrence with a common rhythm through mutual interactions. The mechanism of synchronization allows complex processes to be controlled by simple operations based on interactions between objects. The first main part of this thesis focuses on developing innovative algorithms for data mining. Inspired by the concept of synchronization, this thesis presents Sync (Clustering by Synchronization), a novel approach to clustering. In combination with the Minimum Description Length (MDL) principle, it allows discovering the intrinsic clusters without any data distribution assumptions or parameter settings. In addition, relying on the different dynamic behaviors of objects during the process towards synchronization, the algorithm SOD (Synchronization-based Outlier Detection) is further proposed. Outlier objects can be naturally flagged by the definition of the Local Synchronization Factor (LSF). To cure the curse of dimensionality in clustering, a subspace clustering algorithm, ORSC, is introduced, which automatically detects clusters in subspaces of the original feature space. This approach proposes a weighted local interaction model to ensure that all objects in a common cluster, which reside in an arbitrarily oriented subspace, naturally move together. In order to reveal the underlying patterns in graphs, a graph partitioning approach, RSGC (Robust Synchronization-based Graph Clustering), is presented. The key philosophy of RSGC is to consider graph clustering as a dynamic process towards synchronization. 
Inheriting from the powerful concept of synchronization, RSGC shows several desirable properties that do not exist in other competitive methods. For all presented algorithms, their efficiency and effectiveness are thoroughly analyzed. The benefits over traditional approaches are further demonstrated by evaluating them on synthetic as well as real-world data sets. Beyond the theoretical research on novel data mining algorithms, the second main part of the thesis focuses on brain network analysis based on Diffusion Tensor Images (DTI). A new framework for automated white matter tract clustering is first proposed to identify the meaningful fiber bundles in the human brain by combining ideas from time series mining with density-based clustering. Subsequently, enhancements and variations of this approach are discussed, allowing for a more robust, efficient, or effective way to find hierarchies of fiber bundles. Based on the structural connectivity network, an automated prediction framework is proposed to analyze and understand the abnormal patterns in patients with Alzheimer's disease.
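
The abstract above describes clustering as a dynamic process in which objects interact locally until they synchronize. A minimal sketch of that idea follows, assuming a Kuramoto-style sine coupling within an eps-neighbourhood. The abstract does not spell out the update rule, so the functions `sync_step` and `sync_cluster`, the parameters `eps`, `tol` and `merge_tol`, and the choice to count a point as its own neighbour are all illustrative assumptions. The MDL-based parameter selection mentioned in the abstract is omitted.

```python
import numpy as np

def sync_step(points, eps):
    """Move every point toward its eps-neighbours with a sine coupling (assumed model)."""
    diffs = points[None, :, :] - points[:, None, :]   # diffs[i, j] = x_j - x_i
    near = np.linalg.norm(diffs, axis=2) <= eps       # neighbourhood mask, includes self
    pull = (np.sin(diffs) * near[:, :, None]).sum(axis=1)
    return points + pull / near.sum(axis=1, keepdims=True)

def sync_cluster(points, eps, max_iter=50, tol=1e-3, merge_tol=1e-2):
    """Iterate until points stop moving, then group points that ended up together."""
    pts = np.asarray(points, dtype=float)
    for _ in range(max_iter):
        new = sync_step(pts, eps)
        moved = np.max(np.abs(new - pts))
        pts = new
        if moved < tol:
            break
    labels = np.full(len(pts), -1)
    for i in range(len(pts)):
        if labels[i] < 0:
            same = np.linalg.norm(pts - pts[i], axis=1) < merge_tol
            labels[same & (labels < 0)] = labels.max() + 1
    return labels

# Two small, well-separated groups of points (toy data, not from the thesis).
group = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
demo = np.vstack([group, group + 1.2])
demo_labels = sync_cluster(demo, eps=0.3)
```

In this toy run each group collapses onto a single location after a couple of iterations, and the grouping step assigns one label per collapsed location.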

    Seeing Is Not Always Believing: Invisible Collision Attack and Defence on Pre-Trained Models

    Large-scale pre-trained models (PTMs) such as BERT and GPT have achieved great success in diverse fields. The typical paradigm is to pre-train a big deep learning model on large-scale data sets, and then fine-tune the model on small task-specific data sets for downstream tasks. Although PTMs have rapidly progressed with wide real-world applications, they also pose significant risks of potential attacks. Existing backdoor attacks or data poisoning methods often rest on the assumption that the attacker invades the computers of victims or accesses the target data, which is challenging in real-world scenarios. In this paper, we propose a novel framework for an invisible attack on PTMs with enhanced MD5 collision. The key idea is to generate two equal-size models with the same MD5 checksum by leveraging the MD5 chosen-prefix collision. Afterwards, the two "same" models will be deployed on public websites to induce victims to download the poisoned model. Unlike conventional attacks on deep learning models, this new attack is flexible, covert, and model-independent. Additionally, we propose a simple defensive strategy for recognizing the MD5 chosen-prefix collision and provide a theoretical justification for its feasibility. We extensively validate the effectiveness and stealthiness of our proposed attack and defensive method on different models and data sets.
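
The attack above relies on two different files sharing one MD5 checksum. The abstract does not describe the paper's own defence, so the sketch below is not that method. It is a generic cross-check, assuming the downloader can compare a second, collision-resistant digest (SHA-256 here). The helper names `digests`, `same_artifact` and `looks_like_md5_collision` are hypothetical.

```python
import hashlib

def digests(data: bytes) -> dict:
    """Return both an MD5 and a SHA-256 hex digest for the given bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }

def same_artifact(a: bytes, b: bytes) -> bool:
    """Treat two downloads as identical only if both digests agree."""
    return digests(a) == digests(b)

def looks_like_md5_collision(a: bytes, b: bytes) -> bool:
    """Flag the suspicious case: equal MD5 but different SHA-256."""
    da, db = digests(a), digests(b)
    return da["md5"] == db["md5"] and da["sha256"] != db["sha256"]
```

For two model files read as bytes, `looks_like_md5_collision(model_a, model_b)` returning `True` would indicate that an MD5-only integrity check is being fooled.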

    Fine-tuning Happens in Tiny Subspaces: Exploring Intrinsic Task-specific Subspaces of Pre-trained Language Models

    Pre-trained language models (PLMs) are known to be overly parameterized and have significant redundancy, indicating a small degree of freedom of the PLMs. Motivated by this observation, in this paper, we study the problem of re-parameterizing and fine-tuning PLMs from a new perspective: discovery of the intrinsic task-specific subspace. Specifically, by exploiting the dynamics of the fine-tuning process for a given task, the parameter optimization trajectory is learned to uncover its intrinsic task-specific subspace. A key finding is that PLMs can be effectively fine-tuned in the subspace with a small number of free parameters. Beyond this, we observe some outlier dimensions emerging during fine-tuning in the subspace. Disabling these dimensions degrades the model performance significantly. This suggests that these dimensions are crucial to induce task-specific knowledge to downstream tasks. Comment: ACL 2023 (main conference, long paper)
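
The idea of learning a subspace from the optimization trajectory and then fine-tuning inside it can be illustrated on a toy problem. The sketch below uses a small linear regression model in place of a PLM and takes the top right singular vectors of the recorded parameter trajectory as the subspace basis. Both choices are assumptions made for illustration, since the abstract does not state how the subspace is extracted. All names (`full_finetune`, `subspace_finetune`, `P`, `k`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 5                      # samples, parameters, subspace size
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)                # noiseless targets from a random linear model

def loss(theta):
    return float(np.mean((X @ theta - y) ** 2))

def grad(theta):
    return 2.0 * X.T @ (X @ theta - y) / n

def full_finetune(theta0, lr=0.05, steps=200):
    """Plain gradient descent; returns the trajectory of offsets from theta0."""
    theta, traj = theta0.copy(), []
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        traj.append(theta - theta0)
    return np.array(traj)

def subspace_finetune(theta0, P, lr=0.05, steps=200):
    """Optimize only the k coordinates z in theta = theta0 + P @ z."""
    z = np.zeros(P.shape[1])
    for _ in range(steps):
        z = z - lr * (P.T @ grad(theta0 + P @ z))
    return theta0 + P @ z

theta0 = np.zeros(d)
traj = full_finetune(theta0)
_, _, vt = np.linalg.svd(traj, full_matrices=False)
P = vt[:k].T                              # d x k orthonormal basis of the subspace
theta_sub = subspace_finetune(theta0, P)
init_loss, sub_loss = loss(theta0), loss(theta_sub)
```

Here only `k` of the `d` parameters are free during the second run, yet the loss falls far below its starting value, which mirrors the finding reported in the abstract at toy scale.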

    Large-scale Multi-view Subspace Clustering in Linear Time

    A plethora of multi-view subspace clustering (MVSC) methods have been proposed over the past few years. Researchers have managed to boost clustering accuracy from different points of view. However, many state-of-the-art MVSC algorithms, which typically have quadratic or even cubic complexity, are inefficient and inherently difficult to apply at large scales. In the era of big data, the computational issue becomes critical. To fill this gap, we propose a large-scale MVSC (LMVSC) algorithm with linear order complexity. Inspired by the idea of the anchor graph, we first learn a smaller graph for each view. Then, a novel approach is designed to integrate those graphs so that we can implement spectral clustering on a smaller graph. Interestingly, it turns out that our model also applies to the single-view scenario. Extensive experiments on various large-scale benchmark data sets validate the effectiveness and efficiency of our approach with respect to state-of-the-art clustering methods. Comment: Accepted by AAAI 202
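
The anchor-graph idea can be sketched for a single view: relate each of the n points to only m anchors, then obtain a spectral embedding from an SVD of the resulting n x m matrix, which costs time linear in n. This is a generic anchor-graph sketch under stated assumptions, not the LMVSC algorithm itself. The Gaussian similarity, the stride-based anchor choice, the tiny `kmeans` helper and the toy data are all illustrative, and the multi-view graph integration step is omitted.

```python
import numpy as np

def anchor_embedding(X, anchors, k, sigma=1.0):
    """Spectral embedding from an n x m anchor graph (single view, assumed Gaussian kernel)."""
    sq = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)   # squared distances, n x m
    Z = np.exp(-sq / (2.0 * sigma ** 2))
    Z /= Z.sum(axis=1, keepdims=True)          # each point's anchor weights sum to 1
    Z_hat = Z / np.sqrt(Z.sum(axis=0))         # scale columns by anchor degree
    U, _, _ = np.linalg.svd(Z_hat, full_matrices=False)
    return U[:, :k]                            # top-k left singular vectors

def kmeans(E, k, iters=20):
    """Small k-means with farthest-point initialisation, enough for this demo."""
    centers = [E[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(E - c, axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(E[:, None] - centers[None], axis=2), axis=1)
        centers = np.array([E[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (200, 2)), rng.normal(10.0, 0.3, (200, 2))])
anchors = X[::20]                              # 20 anchors taken by stride (simplification)
E = anchor_embedding(X, anchors, k=2)
labels = kmeans(E, 2)
```

The full n x n similarity graph is never formed. Only the n x m matrix is built and decomposed, which is where the linear scaling in n comes from.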

    Predicting multiple functions of sustainable flood retention basins under uncertainty via multi-instance multi-label learning

    The ambiguity of diverse functions of sustainable flood retention basins (SFRBs) may lead to conflict and risk in water resources planning and management. How can someone provide an intuitive yet efficient strategy to uncover and distinguish the multiple potential functions of SFRBs under uncertainty? In this study, by exploiting both input and output uncertainties of SFRBs, the authors developed a new data-driven framework to automatically predict the multiple functions of SFRBs by using multi-instance multi-label (MIML) learning. A total of 372 sustainable flood retention basins, characterized by 40 variables associated with confidence levels, were surveyed in Scotland, UK. A Gaussian model with Monte Carlo sampling was used to capture the variability of variables (i.e., input uncertainty), and the MIML-support vector machine (SVM) algorithm was subsequently applied to predict the potential functions of SFRBs that have not yet been assessed, allowing for one basin belonging to different types (i.e., output uncertainty). Experiments demonstrated that the proposed approach enables effective automatic prediction of the potential functions of SFRBs (e.g., accuracy >93%). The findings suggest that the functional uncertainty of SFRBs under investigation can be better assessed in a more comprehensive and cost-effective way, and the proposed data-driven approach provides a promising method of doing so for water resources management
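
The input-uncertainty step, a Gaussian model with Monte Carlo sampling, can be sketched as turning one basin's variables into a bag of sampled instances for multi-instance learning. The mapping from confidence level to standard deviation used here (`max_rel_std * (1 - confidence) * |value|`) is an assumption for illustration only, and `make_bag` is a hypothetical helper. The abstract does not give the actual mapping, and the MIML-SVM classifier is not shown.

```python
import numpy as np

def make_bag(values, confidence, n_instances=30, max_rel_std=0.2, rng=None):
    """Sample a bag of instances around the surveyed values.

    confidence is in [0, 1]; higher confidence gives a narrower Gaussian
    (assumed relationship, not taken from the study).
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    std = max_rel_std * (1.0 - confidence) * np.abs(values)
    return rng.normal(loc=values, scale=std, size=(n_instances, values.size))

# Three made-up variables with full, medium, and zero confidence.
bag = make_bag([10.0, 5.0, 2.0], [1.0, 0.5, 0.0],
               n_instances=500, rng=np.random.default_rng(0))
```

Each row of `bag` is one plausible version of the basin, so a multi-instance learner sees the spread of the uncertain variables rather than a single point estimate.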

    How accurate are approximate quantum chemical methods at modelling solute-solvent interactions in solvated clusters?

    In this paper, the performance of a wide range of DFT methods is assessed for the calculation of interaction energies of thermal clusters of a solute in water. Three different charge states (neutral, proton transfer transition state and zwitterion) of glycine were solvated by 1 to 40 water molecules as sampled from molecular dynamics simulations. While some ab initio composite methods that employ insufficiently large basis sets incurred significant errors even for a cluster containing only 5 water molecules relative to the W1X-2 benchmark, the DLPNO-CCSD(T)/CBS and DSD-PBEP86 (triple zeta basis set) levels of theory predicted very accurate interaction energies. These levels of theory were used to benchmark the performance of 16 density functionals from different rungs of Jacob's Ladder. Of the Rung 4 functionals examined, the ωB97M-V and ωB97X-V functionals stood out for predicting absolute interaction energies in 40-water clusters with mean absolute deviations (MAD) ~4 kJ mol⁻¹. The B3LYP-D3(BJ) functional performed exceptionally well with a MAD ~1.7 kJ mol⁻¹ and is the overall best performing method. Calculations of relative interaction energies allow for cancellation of systematic errors, including basis set truncation and superposition errors, and the ωB97M-V and B3LYP-D3(BJ) double zeta basis set calculations yielded relative interaction energies that are within ~3 kJ mol⁻¹ of the benchmark. The ONIOM approximation provides another strategy for accelerating the calculation of accurate absolute interaction energies provided that the calculations have converged with respect to the size of the "high-level-layer".
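
The point about relative interaction energies cancelling systematic errors can be shown with a small calculation. The numbers below are invented purely for illustration and are not results from the paper: the "method" energies are the benchmark values shifted by a constant offset, standing in for a systematic error. The helper names `mad` and `relative` are hypothetical.

```python
import numpy as np

def mad(method, benchmark):
    """Mean absolute deviation between two sets of energies (kJ/mol)."""
    return float(np.mean(np.abs(np.asarray(method) - np.asarray(benchmark))))

def relative(energies, ref_index=0):
    """Energies relative to one reference cluster."""
    energies = np.asarray(energies, dtype=float)
    return energies - energies[ref_index]

# Made-up absolute interaction energies for three clusters.
benchmark = np.array([-100.0, -110.0, -95.0])
method = benchmark + 5.0                  # constant systematic error of 5 kJ/mol

mad_absolute = mad(method, benchmark)
mad_relative = mad(relative(method), relative(benchmark))
```

With a purely constant offset the absolute MAD equals the offset while the relative MAD vanishes. Real systematic errors are only approximately constant, so the cancellation is partial in practice.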

    Ongoing Slow Fluctuations in V1 Impact on Visual Perception

    The human brain's ongoing activity is characterized by intrinsic networks of coherent fluctuations, measured for example with correlated functional magnetic resonance imaging signals. So far, however, the brain processes underlying this ongoing blood oxygenation level dependent (BOLD) signal orchestration and their direct relevance for human behavior are not sufficiently understood. In this study, we address the question of whether and how ongoing BOLD activity within intrinsic occipital networks impacts on conscious visual perception. To this end, backwardly masked targets were presented in participants' left visual field only, leaving the ipsi-lateral occipital areas entirely free from direct effects of task throughout the experiment. Signal time courses of ipsi-lateral BOLD fluctuations in visual areas V1 and V2 were then used as proxies for the ongoing contra-lateral BOLD activity within the bilateral networks. Magnitude and phase of these fluctuations were compared in trials with and without conscious visual perception, operationalized by means of subjective confidence ratings. Our results show that ipsilateral BOLD magnitudes in V1 were significantly higher at times of peak response when the target was perceived consciously. A significant difference between conscious and non-conscious perception with regard to the pre-target phase of an intrinsic-frequency regime suggests that ongoing V1 fluctuations exert a decisive impact on the access to consciousness already before stimulation. Both effects were absent in V2. These results thus support the notion that ongoing slow BOLD activity within intrinsic networks covering V1 represents localized processes that modulate the degree of readiness for the emergence of visual consciousness