2,449 research outputs found

    Testing statistical hypothesis on random trees and applications to the protein classification problem

    Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test whether two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as realizations of Variable Length Markov Chains (VLMC) and use the context trees as a signature of each protein family. Our approach is based on a Kolmogorov--Smirnov-type goodness-of-fit test proposed by Balding et al. [Limit theorems for sequences of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is a supremum over the space of trees of a function of the two samples; its computation time grows, in principle, exponentially with the maximal number of nodes of the potential trees. We show how to transform this problem into a max-flow problem over a related graph, which can be solved with a Ford--Fulkerson algorithm in time polynomial in that number. We apply the test to 10 randomly chosen protein domain families from the seed of the Pfam-A database (high-quality, manually curated families). The test shows that the distributions of context trees coming from different families are significantly different. We emphasize that this is a novel mathematical approach to validate the automatic clustering of sequences in any context. We also study the performance of the test via simulations on Galton--Watson related processes. Comment: Published at http://dx.doi.org/10.1214/08-AOAS218 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
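
    The max-flow step mentioned in the abstract can be illustrated with a minimal sketch. It shows only the generic Ford--Fulkerson scheme with breadth-first augmenting paths (the Edmonds-Karp variant), which runs in polynomial time; it does not reproduce the paper's construction of the graph from the two samples of trees. The function name `max_flow`, the edge-dictionary input format, and the toy network are assumptions made for illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Maximum flow via Ford-Fulkerson with BFS augmenting paths (Edmonds-Karp).

    capacity: dict mapping a directed edge (u, v) to a non-negative capacity.
    This input format is an assumption of the sketch, not the paper's.
    """
    if source == sink:
        raise ValueError("source and sink must differ")

    # Residual capacities; every edge gets a reverse edge starting at 0.
    residual = {}
    adj = {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left: the flow is maximal

        # Find the bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, residual[(u, v)])
            v = u

        # Push the bottleneck along the path and update reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck


# Toy network (hypothetical data, unrelated to Pfam).
toy = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
print(max_flow(toy, "s", "t"))
```

    In the toy network the cut around the source has total capacity 5 and a flow of that value exists, so the maximum flow is 5.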

    Graph indexing and retrieval based on graph prototypes

    Retrieving an item as quickly as possible from a large volume of data stored in a database has been a recurring problem in computer science practically since its origins. On top of this, databases today store data types of the most diverse and often unexpected nature; we are no longer dealing with the early databases that contained only numbers or character strings. (...) The aim of this work, which I believe has been achieved as far as was possible, is to develop and present a methodology for carrying out this process. Metric Trees of prototypes build on a well-known strategy: grouping the data stored in a database as intelligently as possible, so that a search does not have to explore every stored instance. To this we add the concept of a graph prototype, a structure that gathers the information of a set of instances represented by graphs and that has so far been used for classification and recognition. In this thesis, graphs represent the elements to be queried in the database. Graphs can represent complex objects, which is why the number of graph databases is increasing. Because the literature offers several ways to build a prototype, this work includes a comparative study of the main methods. Combining the two concepts, the Metric Tree and the graph prototype, we propose building metric trees in which graph prototypes act as routing nodes that help decide which path to explore during a search. We use these Metric Trees both for classification and for finding all instances that lie within a given maximum distance. Because the data we work with are graphs, the methodology is versatile enough to be applied to any type of information that can be represented in this way. (...)
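
    The idea of routing nodes that decide which branch to explore can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's method: plain numbers with the absolute difference stand in for graphs and a graph distance, the prototype of each node is taken to be the set median (the stored item with the smallest total distance to the others), and the names `Node`, `build`, and `range_search` are hypothetical. The only property relied on is that the distance is a metric, so the triangle inequality allows whole subtrees to be skipped.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    prototype: object                              # routing object for this subtree
    radius: float                                  # max distance from prototype to any item below
    children: list = field(default_factory=list)   # sub-nodes (internal node)
    items: list = field(default_factory=list)      # stored instances (leaf node)


def build(items, dist, leaf_size=2):
    """Recursively build a tree whose routing nodes are set-median prototypes."""
    proto = min(items, key=lambda c: sum(dist(c, x) for x in items))
    radius = max(dist(proto, x) for x in items)
    node = Node(proto, radius)
    # Split around the prototype and the item farthest from it.
    far = max(items, key=lambda x: dist(proto, x))
    left = [x for x in items if dist(x, proto) <= dist(x, far)]
    right = [x for x in items if dist(x, proto) > dist(x, far)]
    if len(items) <= leaf_size or not right:
        node.items = list(items)
    else:
        node.children = [build(left, dist, leaf_size), build(right, dist, leaf_size)]
    return node


def range_search(node, query, max_dist, dist):
    """Return all stored items within max_dist of the query."""
    d = dist(query, node.prototype)
    # Triangle inequality: every item x below this node satisfies
    # dist(query, x) >= d - node.radius, so the subtree can be skipped.
    if d > max_dist + node.radius:
        return []
    found = [x for x in node.items if dist(query, x) <= max_dist]
    for child in node.children:
        found += range_search(child, query, max_dist, dist)
    return found


def abs_diff(a, b):
    # Stand-in metric; a graph distance would be used for graph data.
    return abs(a - b)


data = [1, 2, 3, 10, 11, 12, 50]
tree = build(data, abs_diff)
print(sorted(range_search(tree, 11, 1.5, abs_diff)))
```

    The pruning test never discards a valid answer, so the result matches an exhaustive scan; the benefit is that subtrees whose prototype is far from the query are skipped without computing distances to their items.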

    Some algorithms to solve a bi-objectives problem for team selection

    In real life, many problems are instances of combinatorial optimization, and cross-functional team selection is a typical example. The decision-maker has to select a solution among C(k, h) candidate solutions in the decision space (the binomial coefficient "k choose h"), where k is the number of all candidates and h is the number of members in the selected team. This paper continues our work since 2018; here, we introduce the completed version of the Min Distance to the Boundary model (MDSB), which gives access to both the "deep" and "wide" aspects of the selected team. The compromise programming approach enables decision-makers to ignore the parameters in the decision-making process. Instead, they point to the one scenario they expect. The model is constructed to find the solution that best matches this expectation. We develop two algorithms to find the optimal solution: one is a genetic algorithm and the other is based on the philosophy of DC programming (DC) and its algorithm (DCA). We also compare the introduced algorithms with the MIQP-CPLEX search algorithm to show their effectiveness.
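
    The size of the decision space is what makes exhaustive search impractical, as a minimal sketch can show. It counts the C(k, h) possible teams and runs a brute-force search on a tiny instance. The single per-candidate score used here is a hypothetical stand-in for illustration; it is not the bi-objective MDSB model, and the function names are assumptions.

```python
from itertools import combinations
from math import comb


def count_teams(k, h):
    """Number of distinct teams of h members drawn from k candidates."""
    return comb(k, h)


def best_team_brute_force(scores, h):
    """Enumerate every team and keep the one with the highest total score.

    scores: one hypothetical score per candidate (not the paper's objectives).
    Returns a tuple of candidate indices.
    """
    candidates = range(len(scores))
    return max(combinations(candidates, h),
               key=lambda team: sum(scores[i] for i in team))


print(count_teams(5, 2))
print(count_teams(20, 5))
print(best_team_brute_force([3, 1, 4, 1, 5], 2))
```

    Enumeration is only feasible for small k and h, since the number of teams grows combinatorially; this is the usual motivation for heuristic or optimization-based algorithms of the kind the abstract describes.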

    Migration, Marriage, and Social Mobility : Women in Sweden 1880-1900

    We study the social mobility of women by looking at the connection between migration and marriage outcomes using complete count census data for Sweden. The censuses 1880-1900 have been linked at the individual level, enabling us to follow 100,000 women from their parental home to their new marital household. Marriage market imbalances were not an important push factor for migration but we find a strong association between migration distance and marriage outcomes, both in terms of overall marriage probabilities and in terms of partner selection. These results highlight the importance of migration for women’s social mobility during industrialization

    IST Austria Thesis

    Hybrid zones represent evolutionary laboratories, where recombination brings together alleles in combinations that have not previously been tested by selection. This provides an excellent opportunity to test the effect of molecular variation on fitness, and how this variation is able to spread through populations in a natural context. The snapdragon Antirrhinum majus is polymorphic in the wild for two loci controlling the distribution of yellow and magenta floral pigments. Where the yellow A. m. striatum and the magenta A. m. pseudomajus meet along a valley in the Spanish Pyrenees, they form a stable hybrid zone. Alleles at these loci recombine to give striking transgressive variation for flower colour. The sharp transition in phenotype over ~1 km implies strong selection maintaining the hybrid zone. An indirect assay of pollinator visitation in the field found that pollinators forage in a positive frequency-dependent manner on Antirrhinum, matching previous data on fruit set. Experimental arrays and paternity analysis of wild-pollinated seeds demonstrated assortative mating for pigmentation alleles, and that pollinator behaviour alone is sufficient to explain this pattern. Selection by pollinators should be sufficiently strong to maintain the hybrid zone, although other mechanisms may be at work. At a broader scale I examined evolutionary transitions between yellow and anthocyanin pigmentation in the tribe Antirrhinae, and found that selection has acted (...) demonstrate that pollinators are a major determinant of reproductive success and mating patterns in wild Antirrhinum.