
    A Bibliographic View on Constrained Clustering

    A keyword search on constrained clustering on Web of Science returned just under 3,000 documents. We ran automatic analyses of those and compiled our own bibliography of 183 papers, which we analysed in more detail based on their topic and experimental study, if any. This paper presents general trends of the area and its sub-topics by Pareto analysis, using citation count and year of publication. We list available software and analyse the experimental sections of our reference collection. We found a notable lack of large comparison experiments. Among the topics we reviewed, application studies were most abundant recently, alongside deep learning, active learning and ensemble learning.
    Comment: 18 pages, 11 figures, 177 references
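    As a minimal illustration of the Pareto analysis described above, the sketch below computes the Pareto front of a bibliography under the two criteria the paper names, citation count and year of publication. The dominance rule (at least as recent and at least as cited, strictly better in one) and the record fields are our assumptions for illustration, not necessarily the paper's exact procedure.

```python
# Sketch of a Pareto analysis over (year, citations); the entries are
# illustrative, real records would come from the Web of Science export.

def pareto_front(papers):
    """Return papers not dominated by any other paper: q dominates p if q is
    at least as recent AND at least as cited, and strictly better in one."""
    front = []
    for p in papers:
        dominated = any(
            q["year"] >= p["year"] and q["citations"] >= p["citations"]
            and (q["year"] > p["year"] or q["citations"] > p["citations"])
            for q in papers
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p["year"])

papers = [
    {"title": "A", "year": 2001, "citations": 1500},
    {"title": "B", "year": 2010, "citations": 900},
    {"title": "C", "year": 2018, "citations": 300},
    {"title": "D", "year": 2018, "citations": 120},
]
for p in pareto_front(papers):
    print(p["title"], p["year"], p["citations"])   # prints A, B, C; D is dominated
```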

    Differentiable Clustering with Perturbed Spanning Forests

    We introduce a differentiable clustering method based on minimum-weight spanning forests, a variant of spanning trees with several connected components. Our method relies on stochastic perturbations of solutions of linear programs, for smoothing and efficient gradient computations. This allows us to include clustering in end-to-end trainable pipelines. We show that our method performs well even in difficult settings, such as datasets with high noise and challenging geometries. We also formulate an ad hoc loss to efficiently learn from partial clustering data using this operation. We demonstrate its performance on several real-world datasets for supervised and semi-supervised tasks.
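    The smoothing mechanism can be pictured with a small sketch: perturb the pairwise distances with noise, solve the resulting minimum-weight spanning forest exactly (Kruskal's algorithm stopped at k components), and average the induced co-membership matrices. This is a simplification of the paper's construction, which perturbs and differentiates through linear program solutions; the Gaussian noise, its scale and the sample count below are illustrative assumptions, and only the smoothed forward pass is shown.

```python
import numpy as np

def spanning_forest_labels(dist, k):
    """Kruskal's algorithm stopped when k connected components remain;
    returns a cluster label per point."""
    n = dist.shape[0]
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    components = n
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
            if components == k:
                break
    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

def smoothed_comembership(X, k, sigma=0.1, n_samples=32, rng=None):
    """Monte Carlo estimate of the expected co-membership matrix under
    Gaussian perturbations of the pairwise distances (forward pass only)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    M = np.zeros((n, n))
    for _ in range(n_samples):
        noise = rng.normal(scale=sigma, size=d.shape)
        noise = (noise + noise.T) / 2           # keep the distance matrix symmetric
        labels = spanning_forest_labels(d + noise, k)
        M += (labels[:, None] == labels[None, :])
    return M / n_samples
```

    Averaging over perturbed solutions turns the otherwise piecewise-constant forest output into a smooth function of the distances, which is what makes end-to-end gradient-based training possible.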

    Unsupervised and semi-supervised fuzzy clustering with multiple kernels.

    For real-world clustering tasks, the input data is typically not easily separable, due to a highly complex data structure or to clusters that vary in size, density and shape. Recently, kernel-based clustering has been proposed to perform clustering in a higher-dimensional feature space spanned by embedding maps and corresponding kernel functions. Although good results were obtained using the Gaussian kernel function, its performance depends on the selection of the scaling parameter among an extensive range of possibilities. This step is often heavily influenced by prior knowledge about the data and by the patterns we expect to discover. Unfortunately, it is often unclear which kernels are more suitable for a particular task. The problem is aggravated for many real-world clustering applications, in which the distributions of the different clusters in the feature space exhibit large variations. Thus, in the absence of a priori knowledge, a single kernel selected from a predefined group is sometimes insufficient to represent the data.

    One way to learn optimal scaling parameters is through an exhaustive search for one optimal scaling parameter per cluster. However, this approach is not practical, since it is computationally expensive, especially when the data includes a large number of clusters and when the dynamic range of possible scaling-parameter values is large. Moreover, evaluating the resulting partition in order to select the optimal parameters is not an easy task.

    To overcome these drawbacks, we introduce two novel fuzzy clustering techniques that use Multiple Kernel Learning to provide an elegant solution for parameter selection. The Fuzzy C-Means with Multiple Kernels algorithm (FCMK) simultaneously finds the optimal partition and the cluster-dependent kernel combination weights that reflect the intrinsic structure of the data. The Relational Fuzzy Clustering with Multiple Kernels algorithm (RFCMK) learns the kernel combination weights by optimizing the relational dissimilarities. Consequently, the learned kernel combination weights reflect the relative density, size and position of each cluster with respect to the other clusters. We also extend FCMK and RFCMK to the semi-supervised paradigm. We show that incorporating prior knowledge into the unsupervised clustering task, in the form of a small set of constraints on which instances should or should not reside in the same cluster, guides the unsupervised approaches to a better partitioning of the data and helps avoid local minima, especially for high-dimensional real-world data.

    All of the proposed algorithms are optimized iteratively by dynamically updating the partition and the kernel combination weights in each iteration. This makes them simple and fast. Moreover, our algorithms are formulated to work on both vector and relational data, which makes them applicable when objects cannot be represented by vectors or when clusters of similar objects cannot be represented efficiently by a single prototype.

    We also introduce two relational fuzzy clustering with multiple kernels algorithms for large data, to address the scalability of RFCMK. The random sample and extend RFCMK (rseRFCMK) computes cluster prototypes from a smaller sample of randomly selected objects and then extends the partition to the remainder of the data. The single pass RFCMK (spRFCMK) sequentially loads manageable-sized chunks, clusters each chunk in a single pass, and then combines the results from the chunks.

    Our extensive experiments show that RFCMK and SS-RFCMK outperform existing algorithms. In particular, we show that when data include clusters with various intrinsic structures and densities, learning kernel weights that vary over clusters is crucial to obtaining a good partition.
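    As a rough illustration of the alternating optimization described above, the sketch below interleaves the standard fuzzy c-means membership fixed point with per-cluster kernel-weight updates over a convex combination of Gaussian kernels. The membership update is the classical one; the kernel-weight update shown is a simple compactness heuristic, not the closed-form rules derived in the thesis, and all sizes and scales are illustrative.

```python
import numpy as np

def gaussian_kernels(X, scales):
    """One Gaussian kernel matrix per candidate scaling parameter."""
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    return np.stack([np.exp(-d2 / (2 * s ** 2)) for s in scales])

def fcmk(X, n_clusters, scales, m=2.0, n_iter=50, rng=None):
    rng = np.random.default_rng(rng)
    Ks = gaussian_kernels(X, scales)                 # (n_kernels, n, n)
    n = len(X)
    U = rng.dirichlet(np.ones(n_clusters), size=n)   # fuzzy memberships, rows sum to 1
    W = np.full((n_clusters, len(scales)), 1.0 / len(scales))  # per-cluster kernel weights
    for _ in range(n_iter):
        D = np.empty((n, n_clusters))
        for c in range(n_clusters):
            Kc = np.tensordot(W[c], Ks, axes=1)      # combined kernel for cluster c
            um = U[:, c] ** m
            s = um.sum()
            # squared feature-space distance to the fuzzy centroid of cluster c
            D[:, c] = np.diag(Kc) - 2 * (Kc @ um) / s + um @ Kc @ um / s ** 2
        D = np.maximum(D, 1e-12)
        inv = D ** (-1.0 / (m - 1))                  # classical FCM membership update
        U = inv / inv.sum(axis=1, keepdims=True)
        for c in range(n_clusters):
            um = U[:, c] ** m
            s = um.sum()
            # heuristic: favour kernels under which cluster c is most compact
            comp = np.array([um @ K @ um / s ** 2 for K in Ks])
            W[c] = comp / comp.sum()
    return U, W
```

    The key point the abstract makes is visible in the structure of the loop: the kernel weights W are cluster-dependent, so each cluster can select its own kernel mixture rather than sharing a single global scaling parameter.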

    Unsupervised learning of relation detection patterns

    Information extraction is the area of natural language processing whose goal is to obtain structured data from the relevant information contained in textual fragments. Information extraction requires a significant amount of linguistic knowledge. The specificity of such knowledge is a drawback for the portability of these systems, as a change of language, domain or style demands a costly human effort. Machine learning techniques have been applied for decades to overcome this portability bottleneck, progressively reducing the amount of human supervision involved. However, as the availability of large document collections increases, completely unsupervised approaches become necessary in order to mine the knowledge contained in them. The proposal of this thesis is to incorporate clustering techniques into pattern learning for information extraction, in order to further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation detection. Achieving this goal required, first, considering the different strategies in which this combination could be carried out; second, developing or adapting clustering algorithms suited to our needs; and third, devising pattern learning procedures that incorporate clustering information. By the end of this thesis, we had developed and implemented an approach for learning relation detection patterns which, using clustering techniques and minimal human supervision, is competitive with and even outperforms other comparable approaches in the state of the art.
    Postprint (published version)

    An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering

    The minimum sum-of-squares clustering (MSSC), or k-means-type clustering, is traditionally considered an unsupervised learning task. In recent years, the use of background knowledge to improve cluster quality and promote interpretability of the clustering process has become a hot research topic at the intersection of mathematical optimization and machine learning. The problem of taking advantage of background information in data clustering is called semi-supervised or constrained clustering. In this paper, we present a branch-and-cut algorithm for semi-supervised MSSC, where background knowledge is incorporated as pairwise must-link and cannot-link constraints. For the lower bound, we solve the semidefinite programming relaxation of the MSSC discrete optimization model and use a cutting-plane procedure to strengthen the bound. For the upper bound, we instead use integer programming tools and an adaptation of the k-means algorithm to the constrained case. For the first time, the proposed global optimization algorithm efficiently manages to solve real-world instances of up to 800 data points with different combinations of must-link and cannot-link constraints and with a generic number of features. This problem size is about four times larger than that of the instances solved by state-of-the-art exact algorithms.
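    The constrained adaptation of k-means used for the upper bound can be pictured with a COP-k-means-style sketch in the spirit of Wagstaff et al. (2001): each point is assigned to the nearest centroid whose choice violates no must-link or cannot-link constraint. This is the classical heuristic, shown for illustration; it is not the paper's exact integer-programming-assisted procedure.

```python
import numpy as np

def violates(i, c, labels, must_link, cannot_link):
    """Check whether assigning point i to cluster c breaks a constraint,
    given the (possibly partial) current assignment `labels`."""
    for a, b in must_link:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] is not None and labels[j] != c:
            return True
    for a, b in cannot_link:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] == c:
            return True
    return False

def cop_kmeans(X, k, must_link, cannot_link, n_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = [None] * len(X)
        for i in range(len(X)):
            order = np.argsort(((X[i] - centers) ** 2).sum(axis=1))
            c = next((c for c in order
                      if not violates(i, c, labels, must_link, cannot_link)), None)
            if c is None:
                raise ValueError("no constraint-respecting assignment found")
            labels[i] = c
        labels = np.asarray(labels)
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

    Note that this greedy assignment can fail when constraints are dense (hence the exception), and it offers no optimality guarantee; the paper's branch-and-cut algorithm instead certifies global optimality.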

    Unsupervised deep learning of human brain diffusion magnetic resonance imaging tractography data

    Diffusion magnetic resonance imaging is a non-invasive technique providing insights into the organizational microstructure of biological tissues. The computational methods that exploit the orientational preference of diffusion in restricted structures to reveal the brain's white matter axonal pathways are called tractography. In recent years, a variety of tractography methods have been successfully used to uncover the brain's white matter architecture. Yet, these reconstruction techniques suffer from a number of shortcomings derived from fundamental ambiguities inherent to the orientation information. This has dramatic consequences, since current tractography-based white matter connectivity maps are dominated by false positive connections. Thus, the large proportion of invalid pathways recovered remains one of the main challenges tractography must solve to obtain a reliable anatomical description of the white matter, and innovative methodological approaches are required to help resolve these questions.

    Recent advances in computational power and data availability have made it possible to successfully apply modern machine learning approaches to a variety of problems, including computer vision and image analysis tasks. These methods model and learn the underlying patterns in the data and allow accurate predictions on new data. Similarly, they can yield compact representations of the intrinsic features of the data of interest. Modern data-driven approaches, grouped under the family of deep learning methods, are being adopted to solve medical imaging data analysis tasks, including tractography. In this context, such methods are less dependent on the constraints imposed by current tractography approaches. Hence, deep learning-inspired methods are suited to the required paradigm shift and may open new modeling possibilities, thus improving the state of the art in tractography.

    In this thesis, a new paradigm based on representation learning techniques is proposed to generate and to analyze tractography data. By harnessing autoencoder architectures, this work explores their ability to find an optimal code to represent the features of the white matter fiber pathways. The contributions exploit such representations for a variety of tractography-related tasks, including efficient (i) filtering and (ii) clustering of results generated by other methods, and (iii) the white matter pathway reconstruction itself using a generative method. The methods issued from this thesis have been named (i) FINTA (Filtering in Tractography using Autoencoders), (ii) CINTA (Clustering in Tractography using Autoencoders), and (iii) GESTA (Generative Sampling in Bundle Tractography using Autoencoders), respectively. The proposed methods' performance is assessed against current state-of-the-art methods on synthetic data and healthy adult human brain in vivo data.

    Results show that (i) the introduced filtering method has superior sensitivity and specificity over other state-of-the-art methods; (ii) the clustering method groups streamlines into anatomically coherent bundles with a high degree of consistency; and (iii) the generative streamline sampling technique successfully improves the white matter coverage in hard-to-track bundles. In summary, this thesis unlocks the potential of deep autoencoder-based models for white matter data analysis, and paves the way towards delivering more reliable tractography data.
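    The filtering idea behind FINTA can be sketched as follows: train an autoencoder on streamlines resampled to a fixed number of 3-D points, then keep a candidate streamline only if its latent code lies close to the latent codes of known plausible streamlines. The PyTorch sketch below uses a plain MLP autoencoder with illustrative sizes, omits the training loop (a standard MSE reconstruction loss would be used), and should not be read as the published architecture, whose design and thresholding details differ.

```python
import torch
import torch.nn as nn

N_POINTS = 64   # streamlines resampled to a fixed number of 3-D points (assumption)
LATENT = 32     # illustrative latent size, not the published one

class StreamlineAE(nn.Module):
    """Plain MLP autoencoder over flattened streamlines; a simplification of
    the convolutional encoders used in the published models."""
    def __init__(self):
        super().__init__()
        d = N_POINTS * 3
        self.encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(),
                                     nn.Linear(256, LATENT))
        self.decoder = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                     nn.Linear(256, d))

    def forward(self, x):                      # x: (batch, N_POINTS * 3)
        z = self.encoder(x)
        return self.decoder(z), z

def filter_streamlines(model, candidates, reference_z, radius):
    """FINTA-style filtering idea: keep a candidate streamline if its latent
    code lies within `radius` of some plausible (reference) latent code."""
    with torch.no_grad():
        _, z = model(candidates)               # candidates: (n, N_POINTS * 3)
        dists = torch.cdist(z, reference_z)    # (n_candidates, n_reference)
        return dists.min(dim=1).values <= radius
```

    CINTA's clustering task can reuse the same latent space by grouping the codes z with any standard clustering algorithm, which is why a single learned representation serves several downstream tasks.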

    Automatic text filtering using limited supervision learning for epidemic intelligence

    [no abstract]

    Macro-micro approach for mining public sociopolitical opinion from social media

    During the past decade, we have witnessed the emergence of social media, which has gained prominence as a means for the general public to exchange opinions on a broad range of topics. Furthermore, its social and temporal dimensions make it a rich resource for policy makers and organisations seeking to understand public opinion. In this thesis, we present our research on understanding public opinion on Twitter along three dimensions: sentiment, topics and summary.

    In the first line of our work, we study how to classify public sentiment on Twitter. We focus on the task of multi-target-specific sentiment recognition and propose an approach that utilises the syntactic information from the parse tree in conjunction with the left-right context of the target. We show state-of-the-art performance on two datasets, including a multi-target Twitter corpus on UK elections which we make publicly available for the research community. Additionally, we conduct two preliminary studies: cross-domain emotion classification on discourse around arts and cultural experiences, and social spam detection to improve the signal-to-noise ratio of our sentiment corpus.

    Our second line of work focuses on automatic topical clustering of tweets. Our aim is to group tweets into a number of clusters, with each cluster representing a meaningful topic, story, event or a reason behind a particular choice of sentiment. We explore various ways of tackling this challenge and propose a two-stage hierarchical topic modelling system that is efficient and effective in achieving our goal.

    Lastly, in our third line of work, we study the task of summarising tweets on common topics, with the goal of providing informative summaries of real-world events and stories, or explanations of the sentiment expressed towards an issue or entity. As most existing tweet summarisation approaches rely on extractive methods, we propose to apply a state-of-the-art neural abstractive summarisation model to tweets. We also tackle the challenge of cross-medium supervised summarisation with no target-medium training resources. To the best of our knowledge, there is no existing work studying neural abstractive summarisation on tweets. In addition, we present a system for interactive visualisation of topic-entity sentiments and the corresponding summaries in chronological order.

    Throughout the work presented in this thesis, we conduct experiments to evaluate and verify the effectiveness of our proposed models against relevant baseline methods. Most of our evaluations are quantitative, though we perform qualitative analyses where appropriate. This thesis provides insights and findings that can be used to better understand public opinion in social media.
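    The abstract leaves the two-stage hierarchical topic modelling system unspecified. Purely as a generic illustration of such a pipeline, and not the thesis's actual design, the sketch below first forms coarse clusters of tweets with k-means on TF-IDF vectors and then fits a small topic model within each coarse cluster using scikit-learn; all parameter values are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def two_stage_topics(tweets, n_coarse=10, n_fine=3):
    """Stage 1: coarse k-means clusters on TF-IDF vectors.
       Stage 2: a small LDA topic model within each coarse cluster."""
    tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
    X = tfidf.fit_transform(tweets)
    coarse = KMeans(n_clusters=n_coarse, n_init=10).fit_predict(X)
    topics = {}
    for c in range(n_coarse):
        docs = [t for t, lab in zip(tweets, coarse) if lab == c]
        if len(docs) < n_fine:
            continue                      # too few tweets to refine this cluster
        counts = CountVectorizer(max_features=2000, stop_words="english")
        lda = LatentDirichletAllocation(n_components=n_fine)
        lda.fit(counts.fit_transform(docs))
        vocab = counts.get_feature_names_out()
        # top 8 words per fine-grained topic, as a readable topic summary
        topics[c] = [[vocab[i] for i in comp.argsort()[-8:][::-1]]
                     for comp in lda.components_]
    return coarse, topics
```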