Mining Top-K Frequent Itemsets Through Progressive Sampling
We study the use of sampling for efficiently mining the top-K frequent
itemsets of cardinality at most w. To this purpose, we define an approximation
to the top-K frequent itemsets to be a family of itemsets which includes
(resp., excludes) all very frequent (resp., very infrequent) itemsets, together
with an estimate of these itemsets' frequencies with a bounded error. Our first
result is an upper bound on the sample size which guarantees that the top-K
frequent itemsets mined from a random sample of that size approximate the
actual top-K frequent itemsets, with probability larger than a specified value.
We show that the upper bound is asymptotically tight when w is constant. Our
main algorithmic contribution is a progressive sampling approach, combined with
suitable stopping conditions, which on appropriate inputs is able to extract
approximate top-K frequent itemsets from samples whose sizes are smaller than
the general upper bound. In order to test the stopping conditions, this
approach maintains the frequency of all itemsets encountered, which is
practical only for small w. However, we show how this problem can be mitigated
by using a variation of Bloom filters. A number of experiments conducted on
both synthetic and real benchmark datasets show that using samples
substantially smaller than the original dataset (i.e., of size defined by the
upper bound or reached through the progressive sampling approach) makes it
possible to approximate the actual top-K frequent itemsets with accuracy much
higher than what is analytically proved. Comment: 16 pages, 2 figures, accepted
for presentation at ECML PKDD 2010 and publication in the ECML PKDD 2010
special issue of the Data Mining and Knowledge Discovery journal
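A minimal sketch of the basic idea, mining the top-K itemsets of cardinality at most w from a random sample rather than from the full dataset. It is not the authors' algorithm: the function names `top_k_itemsets` and `sample_top_k`, sampling with replacement, and the brute-force enumeration are assumptions made for illustration, and no stopping condition or sample-size bound from the paper is implemented.

```python
import random
from collections import Counter
from itertools import combinations


def top_k_itemsets(transactions, k, w):
    """Return the k most frequent itemsets of cardinality at most w.

    Each result is (itemset, frequency), where frequency is the fraction
    of transactions containing the itemset. Brute-force enumeration, so
    this is only practical for small w.
    """
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, w + 1):
            counts.update(combinations(items, size))
    n = len(transactions)
    # Descending count; ties broken by the itemset itself for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(itemset, c / n) for itemset, c in ranked[:k]]


def sample_top_k(transactions, sample_size, k, w, seed=0):
    """Estimate the top-k itemsets from a uniform sample drawn with replacement."""
    rng = random.Random(seed)
    sample = [rng.choice(transactions) for _ in range(sample_size)]
    return top_k_itemsets(sample, k, w)
```

Comparing `sample_top_k(data, m, k, w)` with `top_k_itemsets(data, k, w)` for increasing `m` gives a rough feel for how quickly the sampled frequencies approach the exact ones on a given dataset.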
Acceleration of generalized hypergeometric functions through precise remainder asymptotics
We express the asymptotics of the remainders of the partial sums $\{s_n\}$ of the
generalized hypergeometric function ${}_{q+1}F_q$ through an inverse power series
$z^n n^l \sum_k c_k/n^k$, where the exponent $l$ and the asymptotic coefficients $\{c_k\}$
may be recursively computed to any desired order from the hypergeometric
parameters and argument. From this we derive a new series acceleration
technique that can be applied to any such function, even with complex
parameters and at the branch point z=1. For moderate parameters (up to
approximately ten) a C implementation at fixed precision is very effective at
computing these functions; for larger parameters an implementation in higher
than machine precision would be needed. Even for larger parameters, however,
our C implementation is able to correctly determine whether or not it has
converged; and when it converges, its estimate of its error is accurate. Comment:
36 pages, 6 figures, LaTeX2e. Fixed sign error in Eq. (2.28), added
several references, added comparison to other methods, and added discussion
of recursion stability
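To make the objects in this abstract concrete, the sketch below computes the partial sums s_n of a generalized hypergeometric series by direct summation and checks them against a known closed form. It does not implement the acceleration technique described above; the function name `hyp_partial_sums` and the test case are assumptions chosen for illustration.

```python
import math


def hyp_partial_sums(a, b, z, n_terms):
    """Partial sums s_0, ..., s_{n_terms-1} of sum_n [prod (a_i)_n / prod (b_j)_n] z^n / n!.

    `a` and `b` are the lists of upper and lower parameters. Terms are
    built with the ratio
        t_{n+1} / t_n = prod(a_i + n) / prod(b_j + n) * z / (n + 1).
    """
    sums = []
    term = 1.0
    total = 0.0
    for n in range(n_terms):
        total += term
        sums.append(total)
        ratio = z / (n + 1)
        for ai in a:
            ratio *= ai + n
        for bi in b:
            ratio /= bi + n
        term *= ratio
    return sums


if __name__ == "__main__":
    # 2F1(1, 1; 2; z) = -ln(1 - z) / z, used here as a reference value.
    z = 0.5
    exact = -math.log(1.0 - z) / z
    for n, s in enumerate(hyp_partial_sums([1.0, 1.0], [2.0], z, 10)):
        print(n, s, exact - s)  # partial sum and its remainder
```

Direct summation like this converges slowly as z approaches 1, which is the regime where a remainder estimate of the kind described in the abstract would matter.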
Plasma high sensitivity troponin T levels in adult survivors of childhood leukaemias: determinants and associations with cardiac function
Evolutionary distances in the twilight zone -- a rational kernel approach
Phylogenetic tree reconstruction is traditionally based on multiple sequence
alignments (MSAs) and heavily depends on the validity of this information
bottleneck. With increasing sequence divergence, the quality of MSAs decays
quickly. Alignment-free methods, on the other hand, are based on abstract
string comparisons and avoid potential alignment problems. However, in general
they are not biologically motivated and ignore our knowledge about the
evolution of sequences. Thus, it is still a major open question how to define
an evolutionary distance metric between divergent sequences that makes use of
indel information and known substitution models without the need for a multiple
alignment. Here we propose a new evolutionary distance metric to close this
gap. It uses finite-state transducers to create a biologically motivated
similarity score which models substitutions and indels, and does not depend on
a multiple sequence alignment. The sequence similarity score is defined in
analogy to pairwise alignments and additionally has the positive semi-definite
property. We describe its derivation and show in simulation studies and
real-world examples that it is more accurate in reconstructing phylogenies than
competing methods. The result is a new and accurate way of determining
evolutionary distances in and beyond the twilight zone of sequence alignments
that is suitable for large datasets. Comment: to appear in PLoS ONE
The PHF21B gene is associated with major depression and modulates the stress response
Major depressive disorder (MDD) affects around 350 million people worldwide; however, the underlying genetic basis remains largely unknown. In this study, we took into account that MDD is a gene-environment disorder, in which stress is a critical component, and used whole-genome screening of functional variants to investigate the 'missing heritability' in MDD. Genome-wide association studies (GWAS) using single- and multi-locus linear mixed-effect models were performed in a Los Angeles Mexican-American cohort (196 controls, 203 MDD) and in a replication European-ancestry cohort (499 controls, 473 MDD). Our analyses took into consideration the stress levels in the control populations. The Mexican-American controls, comprised primarily of recent immigrants, had high levels of stress due to acculturation issues and the European-ancestry controls with high stress levels were given higher weights in our analysis. We identified 44 common and rare functional variants associated with mild to moderate MDD in the Mexican-American cohort (genome-wide false discovery rate, FDR, <0.05), and their pathway analysis revealed that the three top overrepresented Gene Ontology (GO) processes were innate immune response, glutamate receptor signaling and detection of chemical stimulus in smell sensory perception. Rare variant analysis replicated the association of the PHF21B gene in the ethnically unrelated European-ancestry cohort. The TRPM2 gene, previously implicated in mood disorders, may also be considered replicated by our analyses. Whole-genome sequencing analyses of a subset of the cohorts revealed that European-ancestry individuals have a significantly reduced (50%) number of single nucleotide variants compared with Mexican-American individuals, and for this reason the role of rare variants may vary across populations. PHF21b variants contribute significantly to differences in the levels of expression of this gene in several brain areas, including the hippocampus. 
Furthermore, using an animal model of stress, we found that Phf21b hippocampal gene expression is significantly decreased in animals resilient to chronic restraint stress when compared with non-chronically stressed animals. Together, our results reveal that including stress level data enables the identification of novel rare functional variants associated with MDD. M-L Wong, M Arcos-Burgos, S Liu, J I Vélez, C Yu, B T Baune, M C Jawahar, V Arolt, U Dannlowski, A Chuah, G A Huttley, R Fogarty, M D Lewis, S R Bornstein, and J Licini
Prevalence of Cataract Surgery and Visual Outcomes in Indian Immigrants in Singapore: The Singapore Indian Eye Study
DOI 10.1371/journal.pone.0075584, PLoS ONE 8(10)
Imaging Electronic Correlations in Twisted Bilayer Graphene near the Magic Angle
Twisted bilayer graphene with a twist angle of around 1.1° features a
pair of isolated flat electronic bands and forms a strongly correlated
electronic platform. Here, we use scanning tunneling microscopy to probe local
properties of highly tunable twisted bilayer graphene devices and show that the
flat bands strongly deform when aligned with the Fermi level. At half filling
of the bands, we observe the development of gaps originating from correlated
insulating states. Near charge neutrality, we find a previously unidentified
correlated regime featuring a substantially enhanced flat band splitting that
we describe within a microscopic model predicting a strong tendency towards
nematic ordering. Our results provide insights into symmetry breaking
correlation effects and highlight the importance of electronic interactions for
all filling factors in twisted bilayer graphene. Comment: Main text 9 pages, 4 figures; Supplementary Information 25 pages
Admixture Mapping Scans Identify a Locus Affecting Retinal Vascular Caliber in Hypertensive African Americans: the Atherosclerosis Risk in Communities (ARIC) Study
Retinal vascular caliber provides information about the structure and health of the microvascular system and is associated with cardiovascular and cerebrovascular diseases. Compared to European Americans, African Americans tend to have wider retinal arteriolar and venular caliber, even after controlling for cardiovascular risk factors. This has suggested the hypothesis that differences in genetic background may contribute to racial/ethnic differences in retinal vascular caliber. Using 1,365 ancestry-informative SNPs, we estimated the percentage of African ancestry (PAA) and conducted genome-wide admixture mapping scans in 1,737 African Americans from the Atherosclerosis Risk in Communities (ARIC) study. Central retinal artery equivalent (CRAE) and central retinal vein equivalent (CRVE) representing summary measures of retinal arteriolar and venular caliber, respectively, were measured from retinal photographs. PAA was significantly correlated with CRVE (ρ = 0.071, P = 0.003), but not CRAE (ρ = 0.032, P = 0.182). Using admixture mapping, we did not detect significant admixture association with either CRAE (genome-wide score = −0.73) or CRVE (genome-wide score = −0.69). An a priori subgroup analysis among hypertensive individuals detected a genome-wide significant association of CRVE with greater African ancestry at chromosome 6p21.1 (genome-wide score = 2.31, locus-specific LOD = 5.47). Each additional copy of an African ancestral allele at the 6p21.1 peak was associated with an average increase in CRVE of 6.14 µm in the hypertensives, but had no significant effects in the non-hypertensives (P for heterogeneity <0.001). Further mapping in the 6p21.1 region may uncover novel genetic variants affecting retinal vascular caliber and further insights into the interaction between genetic effects of the microvascular system and hypertension
Genome-wide signatures of convergent evolution in echolocating mammals
Evolution is typically thought to proceed through divergence of genes, proteins, and ultimately phenotypes(1-3). However, similar traits might also evolve convergently in unrelated taxa due to similar selection pressures(4,5). Adaptive phenotypic convergence is widespread in nature, and recent results from a handful of genes have suggested that this phenomenon is powerful enough to also drive recurrent evolution at the sequence level(6-9). Where homoplasious substitutions do occur these have long been considered the result of neutral processes. However, recent studies have demonstrated that adaptive convergent sequence evolution can be detected in vertebrates using statistical methods that model parallel evolution(9,10) although the extent to which sequence convergence between genera occurs across genomes is unknown. Here we analyse genomic sequence data in mammals that have independently evolved echolocation and show for the first time that convergence is not a rare process restricted to a handful of loci but is instead widespread, continuously distributed and commonly driven by natural selection acting on a small number of sites per locus. Systematic analyses of convergent sequence evolution in 805,053 amino acids within 2,326 orthologous coding gene sequences compared across 22 mammals (including four new bat genomes) revealed signatures consistent with convergence in nearly 200 loci. Strong and significant support for convergence among bats and the dolphin was seen in numerous genes linked to hearing or deafness, consistent with an involvement in echolocation. Surprisingly we also found convergence in many genes linked to vision: the convergent signal of many sensory genes was robustly correlated with the strength of natural selection. This first attempt to detect genome-wide convergent sequence evolution across divergent taxa reveals the phenomenon to be much more pervasive than previously recognised
Codominant scoring of AFLP in association panels
A study on the codominant scoring of AFLP markers in association panels without prior knowledge on genotype probabilities is described. Bands are scored codominantly by fitting normal mixture models to band intensities, illustrating and optimizing existing methodology, which employs the EM-algorithm. We study features that improve the performance of the algorithm, and the unmixing in general, like parameter initialization, restrictions on parameters, data transformation, and outlier removal. Parameter restrictions include equal component variances, equal or nearly equal distances between component means, and mixing probabilities according to Hardy–Weinberg Equilibrium. Histogram visualization of band intensities with superimposed normal densities, and optional classification scores and other grouping information, assists further in the codominant scoring. We find empirical evidence favoring the square root transformation of the band intensity, as was found in segregating populations. Our approach provides posterior genotype probabilities for marker loci. These probabilities can form the basis for association mapping and are more useful than the standard scoring categories A, H, B, C, D. They can also be used to calculate predictors for additive and dominance effects. Diagnostics for data quality of AFLP markers are described: preference for three-component mixture model, good separation between component means, and lack of singletons for the component with highest mean. Software has been developed in R, containing the models for normal mixtures with facilitating features, and visualizations. The methods are applied to an association panel in tomato, comprising 1,175 polymorphic markers on 94 tomato hybrids, as part of a larger study within the Dutch Centre for BioSystems Genomics
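The mixture-model fitting described in this abstract can be illustrated with a small EM routine for a normal mixture whose components share one variance, returning posterior component probabilities for each observation. This is a sketch under stated assumptions, not the authors' R software: the function name `em_equal_variance`, the initialization, and the fixed iteration count are hypothetical, and the Hardy-Weinberg and equal-distance restrictions mentioned above are not implemented.

```python
import math


def em_equal_variance(x, means, n_iter=200):
    """EM for a normal mixture with a shared variance.

    `x` holds the (already transformed) band intensities and `means` the
    initial component means. Returns (weights, means, sd, posteriors),
    where posteriors[i][j] is the probability, from the final E-step,
    that x[i] belongs to component j.
    Assumes no component becomes empty during fitting.
    """
    k = len(means)
    means = list(means)
    weights = [1.0 / k] * k
    mean_x = sum(x) / len(x)
    var = sum((xi - mean_x) ** 2 for xi in x) / len(x)
    post = []
    for _ in range(n_iter):
        # E-step: the normalizing constant cancels because the variance is shared.
        post = []
        for xi in x:
            dens = [w * math.exp(-(xi - m) ** 2 / (2 * var))
                    for w, m in zip(weights, means)]
            total = sum(dens)
            post.append([d / total for d in dens])
        # M-step: update mixing weights, component means, and the pooled variance.
        nk = [sum(p[j] for p in post) for j in range(k)]
        weights = [n / len(x) for n in nk]
        means = [sum(p[j] * xi for p, xi in zip(post, x)) / nk[j]
                 for j in range(k)]
        var = sum(p[j] * (xi - means[j]) ** 2
                  for p, xi in zip(post, x) for j in range(k)) / len(x)
    return weights, means, math.sqrt(var), post
```

Following the abstract's preference for the square root transformation, one would call it as `em_equal_variance([math.sqrt(v) for v in intensities], initial_means)`, with three initial means for a three-component model; `intensities` and `initial_means` are placeholders for the user's own data.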