
    Mining Top-K Frequent Itemsets Through Progressive Sampling

    We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this end, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) makes it possible to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proved.
    Comment: 16 pages, 2 figures, accepted for presentation at ECML PKDD 2010 and publication in the ECML PKDD 2010 special issue of the Data Mining and Knowledge Discovery journal
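
The idea of mining top-K itemsets from a random sample rather than the full dataset can be sketched as follows. This is only a minimal illustration with a fixed sample size: the paper's sample-size bound, stopping conditions, and Bloom-filter variant are not given in the abstract and are not reproduced here. The function names, the `seed` parameter, and the tie-breaking rule are assumptions made for the example.

```python
import random
from collections import Counter
from itertools import combinations


def top_k_itemsets(transactions, k, w):
    """Return the k most frequent itemsets of size <= w with relative frequencies.

    Ties are broken by itemset order (an arbitrary choice for this sketch).
    """
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, w + 1):
            # Count every sub-itemset of this transaction up to cardinality w.
            counts.update(combinations(items, size))
    n = len(transactions)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(itemset, c / n) for itemset, c in ranked[:k]]


def sampled_top_k(transactions, k, w, sample_size, seed=0):
    """Estimate the top-k itemsets from a uniform sample drawn with replacement."""
    rng = random.Random(seed)
    sample = [rng.choice(transactions) for _ in range(sample_size)]
    return top_k_itemsets(sample, k, w)
```

Calling `sampled_top_k(transactions, k, w, sample_size)` with increasing values of `sample_size` and comparing the result against `top_k_itemsets` on the full data gives a rough feel for how quickly the sampled frequencies stabilise.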

    Acceleration of generalized hypergeometric functions through precise remainder asymptotics

    We express the asymptotics of the remainders of the partial sums {s_n} of the generalized hypergeometric function ${}_{q+1}F_q$ through an inverse power series $z^n n^l \sum_k c_k/n^k$, where the exponent l and the asymptotic coefficients {c_k} may be recursively computed to any desired order from the hypergeometric parameters and argument. From this we derive a new series acceleration technique that can be applied to any such function, even with complex parameters and at the branch point z=1. For moderate parameters (up to approximately ten), a C implementation at fixed precision is very effective at computing these functions; for larger parameters, an implementation in higher than machine precision would be needed. Even for larger parameters, however, our C implementation is able to correctly determine whether or not it has converged, and when it converges, its estimate of its error is accurate.
    Comment: 36 pages, 6 figures, LaTeX2e. Fixed sign error in Eq. (2.28), added several references, added comparison to other methods, and added discussion of recursion stability
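
To make the objects in this abstract concrete, the partial sums s_n of the series can be computed with the standard term-ratio recurrence. The sketch below shows only the unaccelerated partial sums; the paper's acceleration technique and its coefficients c_k are not specified in the abstract and are not implemented. The function name and argument layout are assumptions for illustration.

```python
def hyp_partial_sums(a, b, z, n_terms):
    """Partial sums of sum_k [prod_i (a_i)_k / prod_j (b_j)_k] * z**k / k!.

    `a` holds the q+1 upper parameters and `b` the q lower parameters.
    Returns a list whose entry n-1 is the sum of the first n terms.
    """
    sums = []
    term = 1.0   # k = 0 term
    total = 0.0
    for k in range(n_terms):
        total += term
        sums.append(total)
        # Ratio of consecutive terms: t_{k+1} / t_k.
        ratio = z / (k + 1)
        for ai in a:
            ratio *= ai + k
        for bj in b:
            ratio /= bj + k
        term *= ratio
    return sums
```

For a = (1, 1), b = (3,), z = 1 the terms are 2/((k+1)(k+2)), so the series telescopes to 2 and the remainder after n terms is exactly 2/(n+1). This algebraic decay at the branch point is the kind of slowly vanishing remainder that motivates acceleration.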

    Evolutionary distances in the twilight zone -- a rational kernel approach

    Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.
    Comment: to appear in PLoS ONE

    The PHF21B gene is associated with major depression and modulates the stress response

    Major depressive disorder (MDD) affects around 350 million people worldwide; however, the underlying genetic basis remains largely unknown. In this study, we took into account that MDD is a gene-environment disorder, in which stress is a critical component, and used whole-genome screening of functional variants to investigate the 'missing heritability' in MDD. Genome-wide association studies (GWAS) using single- and multi-locus linear mixed-effect models were performed in a Los Angeles Mexican-American cohort (196 controls, 203 MDD) and in a replication European-ancestry cohort (499 controls, 473 MDD). Our analyses took into consideration the stress levels in the control populations. The Mexican-American controls, composed primarily of recent immigrants, had high levels of stress due to acculturation issues, and the European-ancestry controls with high stress levels were given higher weights in our analysis. We identified 44 common and rare functional variants associated with mild to moderate MDD in the Mexican-American cohort (genome-wide false discovery rate, FDR, <0.05), and their pathway analysis revealed that the three top overrepresented Gene Ontology (GO) processes were innate immune response, glutamate receptor signaling and detection of chemical stimulus in smell sensory perception. Rare variant analysis replicated the association of the PHF21B gene in the ethnically unrelated European-ancestry cohort. The TRPM2 gene, previously implicated in mood disorders, may also be considered replicated by our analyses. Whole-genome sequencing analyses of a subset of the cohorts revealed that European-ancestry individuals have a significantly reduced (50%) number of single nucleotide variants compared with Mexican-American individuals, and for this reason the role of rare variants may vary across populations. PHF21B variants contribute significantly to differences in the levels of expression of this gene in several brain areas, including the hippocampus. Furthermore, using an animal model of stress, we found that Phf21b hippocampal gene expression is significantly decreased in animals resilient to chronic restraint stress when compared with non-chronically stressed animals. Together, our results reveal that including stress level data enables the identification of novel rare functional variants associated with MDD.
    M-L Wong, M Arcos-Burgos, S Liu, J I Vélez, C Yu, B T Baune, M C Jawahar, V Arolt, U Dannlowski, A Chuah, G A Huttley, R Fogarty, M D Lewis, S R Bornstein, and J Licini

    Imaging Electronic Correlations in Twisted Bilayer Graphene near the Magic Angle

    Twisted bilayer graphene with a twist angle of around 1.1° features a pair of isolated flat electronic bands and forms a strongly correlated electronic platform. Here, we use scanning tunneling microscopy to probe local properties of highly tunable twisted bilayer graphene devices and show that the flat bands strongly deform when aligned with the Fermi level. At half filling of the bands, we observe the development of gaps originating from correlated insulating states. Near charge neutrality, we find a previously unidentified correlated regime featuring a substantially enhanced flat band splitting that we describe within a microscopic model predicting a strong tendency towards nematic ordering. Our results provide insights into symmetry breaking correlation effects and highlight the importance of electronic interactions for all filling factors in twisted bilayer graphene.
    Comment: Main text 9 pages, 4 figures; Supplementary Information 25 pages

    Admixture Mapping Scans Identify a Locus Affecting Retinal Vascular Caliber in Hypertensive African Americans: the Atherosclerosis Risk in Communities (ARIC) Study

    Retinal vascular caliber provides information about the structure and health of the microvascular system and is associated with cardiovascular and cerebrovascular diseases. Compared to European Americans, African Americans tend to have wider retinal arteriolar and venular caliber, even after controlling for cardiovascular risk factors. This has suggested the hypothesis that differences in genetic background may contribute to racial/ethnic differences in retinal vascular caliber. Using 1,365 ancestry-informative SNPs, we estimated the percentage of African ancestry (PAA) and conducted genome-wide admixture mapping scans in 1,737 African Americans from the Atherosclerosis Risk in Communities (ARIC) study. Central retinal artery equivalent (CRAE) and central retinal vein equivalent (CRVE), representing summary measures of retinal arteriolar and venular caliber, respectively, were measured from retinal photographs. PAA was significantly correlated with CRVE (ρ = 0.071, P = 0.003), but not CRAE (ρ = 0.032, P = 0.182). Using admixture mapping, we did not detect significant admixture association with either CRAE (genome-wide score = −0.73) or CRVE (genome-wide score = −0.69). An a priori subgroup analysis among hypertensive individuals detected a genome-wide significant association of CRVE with greater African ancestry at chromosome 6p21.1 (genome-wide score = 2.31, locus-specific LOD = 5.47). Each additional copy of an African ancestral allele at the 6p21.1 peak was associated with an average increase in CRVE of 6.14 µm in the hypertensives, but had no significant effects in the non-hypertensives (P for heterogeneity <0.001). Further mapping in the 6p21.1 region may uncover novel genetic variants affecting retinal vascular caliber and further insights into the interaction between genetic effects of the microvascular system and hypertension.

    Genome-wide signatures of convergent evolution in echolocating mammals

    Evolution is typically thought to proceed through divergence of genes, proteins, and ultimately phenotypes (1-3). However, similar traits might also evolve convergently in unrelated taxa due to similar selection pressures (4,5). Adaptive phenotypic convergence is widespread in nature, and recent results from a handful of genes have suggested that this phenomenon is powerful enough to also drive recurrent evolution at the sequence level (6-9). Where homoplasious substitutions do occur, these have long been considered the result of neutral processes. However, recent studies have demonstrated that adaptive convergent sequence evolution can be detected in vertebrates using statistical methods that model parallel evolution (9,10), although the extent to which sequence convergence between genera occurs across genomes is unknown. Here we analyse genomic sequence data in mammals that have independently evolved echolocation and show for the first time that convergence is not a rare process restricted to a handful of loci but is instead widespread, continuously distributed and commonly driven by natural selection acting on a small number of sites per locus. Systematic analyses of convergent sequence evolution in 805,053 amino acids within 2,326 orthologous coding gene sequences compared across 22 mammals (including four new bat genomes) revealed signatures consistent with convergence in nearly 200 loci. Strong and significant support for convergence among bats and the dolphin was seen in numerous genes linked to hearing or deafness, consistent with an involvement in echolocation. Surprisingly, we also found convergence in many genes linked to vision: the convergent signal of many sensory genes was robustly correlated with the strength of natural selection. This first attempt to detect genome-wide convergent sequence evolution across divergent taxa reveals the phenomenon to be much more pervasive than previously recognised.

    Codominant scoring of AFLP in association panels

    A study on the codominant scoring of AFLP markers in association panels without prior knowledge of genotype probabilities is described. Bands are scored codominantly by fitting normal mixture models to band intensities, illustrating and optimizing existing methodology, which employs the EM algorithm. We study features that improve the performance of the algorithm, and the unmixing in general, such as parameter initialization, restrictions on parameters, data transformation, and outlier removal. Parameter restrictions include equal component variances, equal or nearly equal distances between component means, and mixing probabilities according to Hardy–Weinberg Equilibrium. Histogram visualization of band intensities with superimposed normal densities, and optional classification scores and other grouping information, assists further in the codominant scoring. We find empirical evidence favoring the square root transformation of the band intensity, as was found in segregating populations. Our approach provides posterior genotype probabilities for marker loci. These probabilities can form the basis for association mapping and are more useful than the standard scoring categories A, H, B, C, D. They can also be used to calculate predictors for additive and dominance effects. Diagnostics for data quality of AFLP markers are described: preference for three-component mixture model, good separation between component means, and lack of singletons for the component with highest mean. Software has been developed in R, containing the models for normal mixtures with facilitating features, and visualizations. The methods are applied to an association panel in tomato, comprising 1,175 polymorphic markers on 94 tomato hybrids, as part of a larger study within the Dutch Centre for BioSystems Genomics.
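
The mixture-fitting step described above can be sketched as an EM loop for a one-dimensional normal mixture with a shared variance, one of the parameter restrictions the abstract lists. This is not the authors' R software: it is a plain-Python illustration with free mixing proportions (no Hardy–Weinberg restriction), and the function name, the starting variance heuristic, and the small `1e-12` guard are assumptions. The returned responsibilities play the role of posterior genotype probabilities.

```python
import math


def fit_equal_variance_mixture(x, means, n_iter=100):
    """EM for a 1-D normal mixture whose components share one variance.

    x      -- band intensities (already transformed, e.g. by a square root)
    means  -- starting values for the component means
    Returns (mixing proportions, means, shared variance, responsibilities).
    """
    k, n = len(means), len(x)
    mu = list(means)
    pi = [1.0 / k] * k
    mean_x = sum(x) / n
    # Heuristic start: overall variance shrunk by k**2 (an assumption).
    var = sum((v - mean_x) ** 2 for v in x) / n / k ** 2
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        # The shared variance makes the normalising constant cancel.
        resp = []
        for v in x:
            logs = [math.log(pi[j]) - 0.5 * (v - mu[j]) ** 2 / var
                    for j in range(k)]
            top = max(logs)
            w = [math.exp(l - top) for l in logs]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: update proportions, means, and the shared variance.
        nk = [sum(r[j] for r in resp) + 1e-12 for j in range(k)]
        nk_total = sum(nk)
        pi = [c / nk_total for c in nk]
        mu = [sum(r[j] * v for r, v in zip(resp, x)) / nk[j]
              for j in range(k)]
        var = sum(r[j] * (v - mu[j]) ** 2
                  for r, v in zip(resp, x) for j in range(k)) / n
    return pi, mu, var, resp
```

A three-component fit with well-separated means and no near-empty component would correspond to the data-quality diagnostics mentioned in the abstract.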