56 research outputs found

    Voting-Based Consensus of Data Partitions

    Over the past few years, there has been a renewed interest in the consensus problem for ensembles of partitions. Recent work is primarily motivated by the developments in the area of combining multiple supervised learners. Unlike the consensus of supervised classifications, the consensus of data partitions is a challenging problem due to the lack of globally defined cluster labels and to the inherent difficulty of data clustering as an unsupervised learning problem. Moreover, the true number of clusters may be unknown. A fundamental goal of consensus methods for partitions is to obtain an optimal summary of an ensemble and to discover a cluster structure with accuracy and robustness exceeding those of the individual ensemble partitions. The quality of the consensus partitions highly depends on the ensemble generation mechanism and on the suitability of the consensus method for combining the generated ensemble. Typically, consensus methods derive an ensemble representation that is used as the basis for extracting the consensus partition. Most ensemble representations circumvent the labeling problem. On the other hand, voting-based methods establish direct parallels with consensus methods for supervised classifications, by seeking an optimal relabeling of the ensemble partitions and deriving an ensemble representation consisting of a central aggregated partition. An important element of the voting-based aggregation problem is the pairwise relabeling of an ensemble partition with respect to a representative partition of the ensemble, which is referred to here as the voting problem. The voting problem is commonly formulated as a weighted bipartite matching problem. In this dissertation, a general theoretical framework for the voting problem as a multi-response regression problem is proposed.
The problem is formulated as seeking to estimate the uncertainties associated with the assignments of the objects to the representative clusters, given their assignments to the clusters of an ensemble partition. A new voting scheme, referred to as cumulative voting, is derived as a special instance of the proposed regression formulation corresponding to fitting a linear model by least squares estimation. The proposed formulation reveals the close relationships between the underlying loss functions of the cumulative voting and bipartite matching schemes. A useful feature of the proposed framework is that it can be applied to model substantial variability between partitions, such as a variable number of clusters. A general aggregation algorithm with variants corresponding to cumulative voting and bipartite matching is applied, and a simulation-based analysis is presented to compare the suitability of each scheme to different ensemble generation mechanisms. The bipartite matching is found to be more suitable than cumulative voting for a particular generation model, whereby each ensemble partition is generated as a noisy permutation of an underlying labeling, according to a probability of error. For ensembles with a variable number of clusters, it is proposed that the aggregated partition be viewed as an estimated distributional representation of the ensemble, on the basis of which a criterion may be defined to seek an optimally compressed consensus partition. The properties and features of the proposed cumulative voting scheme are studied. In particular, the relationship between cumulative voting and the well-known co-association matrix is highlighted. Furthermore, an adaptive aggregation algorithm that is suited for the cumulative voting scheme is proposed. The algorithm aims at selecting the initial reference partition and the aggregation sequence of the ensemble partitions such that the loss of mutual information associated with the aggregated partition is minimized.
In order to subsequently extract the final consensus partition, an efficient agglomerative algorithm is developed. The algorithm merges the aggregated clusters such that the maximum amount of information is preserved. Furthermore, it allows the optimal number of consensus clusters to be estimated. An empirical study using several artificial and real-world datasets demonstrates that the proposed cumulative voting scheme leads to discovering substantially more accurate consensus partitions compared to bipartite matching, in the case of ensembles with a relatively large or a variable number of clusters. Compared to other recent consensus methods, the proposed method is found to be comparable with or better than the best-performing methods. Moreover, accurate estimates of the true number of clusters are often achieved using cumulative voting, whereas consistently poor estimates are achieved based on bipartite matching. The empirical evidence demonstrates that the bipartite matching scheme is not suitable for these types of ensembles.
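The voting (relabeling) step described above can be made concrete with a small sketch. This is an illustration, not the dissertation's implementation: it builds the contingency matrix between a representative partition and an ensemble partition, then solves the weighted bipartite matching by brute force over label permutations (practical only for small numbers of clusters; the Hungarian algorithm solves the same matching in polynomial time).

```python
from itertools import permutations

def relabel(reference, partition, k):
    """Relabel `partition` to best match `reference` (the voting problem).

    Builds the k-by-k contingency matrix between the two labelings and
    searches for the label permutation maximizing total agreement, i.e.
    a brute-force solution of the weighted bipartite matching problem.
    """
    # contingency[i][j] = number of objects in reference cluster i
    # that lie in partition cluster j
    contingency = [[0] * k for _ in range(k)]
    for r, p in zip(reference, partition):
        contingency[r][p] += 1
    # perm[j] is the reference label assigned to ensemble label j
    best_perm, best_score = None, -1
    for perm in permutations(range(k)):
        score = sum(contingency[perm[j]][j] for j in range(k))
        if score > best_score:
            best_perm, best_score = perm, score
    return [best_perm[p] for p in partition]

reference = [0, 0, 1, 1, 2, 2]
partition = [2, 2, 0, 0, 1, 1]   # same clustering, permuted labels
print(relabel(reference, partition, 3))  # -> [0, 0, 1, 1, 2, 2]
```

After relabeling, cluster labels of different ensemble partitions are directly comparable, which is what enables vote accumulation.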

    Unsupervised learning of relation detection patterns

    Information extraction is the area of natural language processing whose goal is to obtain structured data from the relevant information contained in textual fragments. Information extraction requires a significant amount of linguistic knowledge. The specificity of such knowledge poses a drawback for the portability of the systems, as a change of language, domain or style demands a costly human effort. Machine learning techniques have been applied for decades to overcome this portability bottleneck, progressively reducing the amount of human supervision involved. However, as the availability of large document collections increases, completely unsupervised approaches become necessary in order to mine the knowledge contained in them. The proposal of this thesis is to incorporate clustering techniques into pattern learning for information extraction, in order to further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation detection. The achievement of this ultimate goal has required, first, considering the different strategies in which this combination could be carried out; second, developing or adapting clustering algorithms suitable to our needs; and third, devising pattern learning procedures which incorporate clustering information. By the end of this thesis, we had been able to develop and implement an approach for learning relation detection patterns which, using clustering techniques and minimal human supervision, is competitive with and even outperforms other comparable approaches in the state of the art.

    XII. Magyar Számítógépes Nyelvészeti Konferencia


    Spectral analysis of spatial processes


    Recherche d'images par le contenu, analyse multirésolution et modèles de régression logistique

    This thesis presents our contributions to content-based image retrieval using multiresolution analysis as well as linear and nonlinear classification. In the first part, we propose a simple and fast method for content-based image retrieval. To represent color images, we introduce new feature descriptors, namely histograms weighted by the multispectral gradient. To measure the degree of similarity between two images quickly and effectively, we use a weighted pseudo-metric that relies on the wavelet decomposition and compression of the histograms extracted from the images. The weights of the pseudo-metric are adjusted using the classical logistic regression model in order to improve its discriminating power and the retrieval precision. In the second part, we propose a new Bayesian logistic regression model based on a variational method. This new model is compared to the classical logistic regression model in the context of image retrieval. We then show that, relative to the classical model, the Bayesian model yields a notable improvement in the discriminating power of the pseudo-metric and in the retrieval precision. In the third part, we detail the derivation of the new variational Bayesian logistic regression model and compare it to the classical logistic regression model as well as to other linear classifiers in the literature. We then compare our retrieval method, which uses the Bayesian logistic regression model, to other published retrieval methods. In the fourth part, we introduce feature selection to improve our retrieval method based on the model introduced above. Feature selection automatically gives more weight to the most discriminative features and less weight to the least discriminative ones. Finally, in the fifth part, we propose a new Bayesian logistic discriminant analysis model built with kernels, thereby allowing flexible nonlinear classification.
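The weighted pseudo-metric idea can be sketched in a few lines. This is a simplified stand-in, not the thesis's method: the actual pseudo-metric operates on wavelet-compressed histograms, and the bin weights below are hypothetical placeholders for coefficients that would be fitted by logistic regression.

```python
def weighted_pseudo_metric(h1, h2, weights):
    """Weighted L1-style dissimilarity between two feature histograms.

    In the retrieval setting sketched above, `weights` would be fitted
    by (Bayesian) logistic regression so that discriminative bins
    contribute more to the distance; here they are simply given.
    """
    return sum(w * abs(a - b) for w, a, b in zip(weights, h1, h2))

# two normalized 3-bin histograms; the first and last bins differ,
# and the middle bin is weighted up (but happens to agree here)
d = weighted_pseudo_metric([0.5, 0.3, 0.2], [0.2, 0.3, 0.5], [1.0, 2.0, 1.0])
```

Because the weights need not come from a proper metric, symmetry holds but the triangle inequality need not, hence "pseudo-metric".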

    On the detection of latent structures in categorical data

    With the growing availability of huge amounts of data it is increasingly important to uncover the underlying data generating structures. The present work focuses on the detection of latent structures for categorical data, which have received less attention in the literature. In regression models categorical variables are either the responses or part of the covariates. Alternative strategies have to be used to detect the underlying structures. The first part of this thesis is dedicated to regression models with an excessive number of parameters. More concretely, we consider models with various categorical covariates and a potentially large number of categories. In addition, it is investigated how fixed effects models can be used to model the heterogeneity in longitudinal and cross-sectional data. One interesting aspect is to identify the categories or units that have to be distinguished with respect to their effect on the response. The objective is to detect ``latent groups'' that share the same effects on the response variable. A novel approach to the clustering of categorical predictors or fixed effects is introduced, which is based on recursive partitioning techniques. In contrast to competing methods that use specific penalties, the proposed algorithm also works in high-dimensional settings. The second part of this thesis deals with item response models, which can be considered as regression models that aim at measuring ``latent abilities'' of persons. In item response theory one uses indicators such as the answers of persons to a collection of items to infer the underlying abilities. When developing psychometric tests one has to be aware of the phenomenon of Differential Item Functioning (DIF). An item response model is affected by DIF if the difficulty of an item among equally able persons depends on characteristics of the persons, such as membership of a racial or ethnic subgroup.
A general tree-based method is proposed that simultaneously detects the items and subgroups of persons that carry DIF, taking into account a set of variables on different scales. Compared to classical approaches, a main advantage is that the proposed method automatically identifies the regions of the covariate space that are responsible for DIF, which therefore do not have to be prespecified. In addition, extensions to the detection of non-uniform DIF are developed. The last part of the thesis addresses regression models for rating scale data that are frequently used in behavioural research. Heterogeneity among respondents caused by ``latent response styles'' can lead to biased estimates and can affect the conclusions drawn from the observed ratings. The focus is on symmetric response categories and a specific form of response style, namely the tendency to the middle or extreme categories. In ordinal regression models a stronger or weaker concentration in the middle can also be interpreted as varying dispersion. The strength of the proposed models is that they can be embedded into the framework of generalized linear models, and therefore inference techniques and asymptotic results for this class of models are available. In addition, a visualization tool is developed that makes the interpretation of effects easily accessible.
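The category-clustering idea from the first part can be illustrated in miniature. The sketch below is not the proposed algorithm (which recurses and uses test-based stopping rules to decide how many groups to form); it performs a single partitioning step, splitting the categories of one covariate into two ``latent groups'' at the cut over their sorted mean responses that minimizes the within-group sum of squares.

```python
def split_categories(means):
    """One recursive-partitioning step over category effects.

    `means` maps each category to its (hypothetical) mean response.
    Categories are sorted by mean and split at the cut point that
    minimizes the total within-group sum of squares, yielding two
    groups of categories with similar effects on the response.
    """
    cats = sorted(means, key=means.get)

    def wss(group):
        vals = [means[c] for c in group]
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = min(range(1, len(cats)),
               key=lambda i: wss(cats[:i]) + wss(cats[i:]))
    return set(cats[:best]), set(cats[best:])

# categories A and B behave alike, as do C and D
print(split_categories({'A': 0.1, 'B': 0.2, 'C': 1.9, 'D': 2.0}))
# -> ({'A', 'B'}, {'C', 'D'})
```

Applying such splits recursively, with a stopping criterion, produces a tree whose leaves are the fused category groups.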

    Supervised ranking : from semantics to algorithms


    Models and Algorithms for Whole-Genome Evolution and their Use in Phylogenetic Inference

    The rapid accumulation of sequenced genomes offers the chance to resolve longstanding questions about the evolutionary histories, or phylogenies, of groups of organisms. The relatively rare occurrence of large-scale evolutionary events in a whole genome, events such as genome rearrangements, duplications and losses, enables us to extract a strong and robust phylogenetic signal from whole-genome data. The work presented in this dissertation focuses on models and algorithms for whole-genome evolution and their use in phylogenetic inference. We designed algorithms to estimate pairwise genomic distances from large-scale genomic changes. We refined evolutionary models of whole-genome evolution. We also made use of these results to provide fast and accurate methods for phylogenetic inference that scale up, in both speed and accuracy, to modern high-resolution whole-genome data. We designed algorithms to estimate the true evolutionary distance between two genomes under genome rearrangements, and also under rearrangements plus gains and losses. We refined the evolutionary model to be the first mathematical model to preserve the structural dichotomy in genomic organization between most prokaryotes and most eukaryotes. Those models and associated distance estimators provide a basis for studying facets of possible mechanisms of evolution through simulation and application to real genomes. Phylogenetic analyses from whole-genome data have been limited to small collections of genomes and low-resolution data; they have also lacked an effective assessment of robustness. We developed an approach that combines our distance estimator, any standard distance-based reconstruction algorithm, and a novel bootstrapping method based on resampling genomic adjacencies.
The resulting tool overcomes a serious and long-standing impediment to the use of whole-genome data in phylogenetic inference and provides results comparable in accuracy and robustness to distance-based methods for sequence data. Maximum-likelihood approaches have been successfully applied to phylogenetic inference for aligned sequences, but such applications remain primitive for whole-genome data. We developed a maximum-likelihood approach to phylogenetic analysis from whole-genome data. In combination with our bootstrap scheme, this new approach yields the first reliable phylogenetic tool for the analysis of whole-genome data at the level of syntenic blocks.
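For a flavor of the pairwise distance computations involved, here is the classical breakpoint distance between two signed gene orders. It is a far cruder statistic than the true-distance estimators developed in the dissertation and is shown only to fix ideas about how rearrangements disrupt gene adjacencies.

```python
def breakpoint_distance(g1, g2):
    """Count adjacencies of genome g1 that are not preserved in g2.

    Genomes are lists of signed integers (signed gene orders); an
    adjacency (a, b) of g1 is preserved in g2 if g2 contains a, b
    consecutively, or -b, -a consecutively (the reverse strand).
    """
    adj2 = set()
    for a, b in zip(g2, g2[1:]):
        adj2.add((a, b))
        adj2.add((-b, -a))   # same adjacency read on the other strand
    return sum((a, b) not in adj2 for a, b in zip(g1, g1[1:]))

# inverting the segment (2, 3) breaks exactly two adjacencies
print(breakpoint_distance([1, 2, 3, 4, 5], [1, -3, -2, 4, 5]))  # -> 2
```

Breakpoint counts underestimate the number of rearrangement events once operations start overlapping, which is precisely why statistical estimators of the true evolutionary distance, like those above, are needed.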