59 research outputs found

    Machine Learning

    Fast Label Embeddings via Randomized Linear Algebra

    Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank-constrained estimation and low-dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state-of-the-art results. Comment: To appear in the proceedings of the ECML/PKDD 2015 conference. Reference implementation available at https://github.com/pmineiro/randembe

    Sparsest factor analysis for clustering variables: a matrix decomposition approach

    We propose a new procedure for sparse factor analysis (FA) in which each variable loads on only one common factor. Thus, the loading matrix has a single nonzero element in each row and zeros elsewhere. Such a loading matrix is the sparsest possible for a given number of variables and common factors. For this reason, the proposed method is named sparsest FA (SSFA). It may also be called FA-based variable clustering, since the variables loading on the same common factor can be classified into a cluster. In SSFA, all model parts of FA (common factors, their correlations, loadings, unique factors, and unique variances) are treated as fixed unknown parameter matrices, and their least squares function is minimized through a specific data matrix decomposition. A useful feature of the algorithm is that the matrix of common factor scores is re-parameterized using QR decomposition in order to estimate factor correlations efficiently. A simulation study shows that the proposed procedure can exactly identify the true sparsest models. Real data examples demonstrate the usefulness of the variable clustering performed by SSFA.

    Piecewise polynomial approximation of probability density functions with application to uncertainty quantification for stochastic PDEs

    The probability density function (PDF) associated with a given set of samples is approximated by a piecewise-linear polynomial constructed with respect to a binning of the sample space. The kernel functions are a compactly supported basis for the space of such polynomials, i.e., finite element hat functions, centered at the bin nodes rather than at the samples, as is the case for the standard kernel density estimation approach. This feature naturally provides an approximation that is scalable with respect to the sample size. On the other hand, unlike other strategies that use a finite element approach, the proposed approximation does not require the solution of a linear system. In addition, a simple rule that relates the bin size to the sample size eliminates the need for bandwidth selection procedures. The proposed density estimator has unit integral, does not require a constraint to enforce positivity, and is consistent. The proposed approach is validated through numerical examples in which samples are drawn from known PDFs. The approach is also used to determine approximations of (unknown) PDFs associated with outputs of interest that depend on the solution of a stochastic partial differential equation.
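
As a rough illustration of the idea in this abstract, the sketch below builds a piecewise-linear density from hat functions centered at uniform bin nodes. It is not the authors' estimator: the linear splitting of each sample's mass between its two neighboring nodes, the function names (`fit_hat_density`, `evaluate`, `integrate`), and the user-supplied `num_bins` are all assumptions made for this example, and the abstract's rule relating bin size to sample size is not reproduced here.

```python
def fit_hat_density(samples, a, b, num_bins):
    """Fit a piecewise-linear density on [a, b] with num_bins uniform bins.

    Hypothetical construction: each sample's unit mass is split linearly
    between the two nodes of the bin that contains it. No linear system
    is solved.
    """
    if num_bins < 1 or not samples:
        raise ValueError("need at least one bin and one sample")
    n = len(samples)
    h = (b - a) / num_bins
    nodes = [a + j * h for j in range(num_bins + 1)]
    weights = [0.0] * (num_bins + 1)
    for x in samples:
        if not a <= x <= b:
            raise ValueError("sample outside [a, b]")
        j = min(int((x - a) / h), num_bins - 1)   # index of the containing bin
        t = min(max((x - nodes[j]) / h, 0.0), 1.0)  # relative position in the bin
        weights[j] += 1.0 - t
        weights[j + 1] += t
    coeffs = []
    for j, w in enumerate(weights):
        # An interior hat function integrates to h, an end-node hat to h / 2.
        area = h / 2 if j in (0, num_bins) else h
        coeffs.append(w / (n * area))
    return nodes, coeffs


def evaluate(nodes, coeffs, x):
    """Evaluate the piecewise-linear density at x (zero outside the grid)."""
    if x < nodes[0] or x > nodes[-1]:
        return 0.0
    h = nodes[1] - nodes[0]
    j = min(int((x - nodes[0]) / h), len(nodes) - 2)
    t = (x - nodes[j]) / h
    return (1.0 - t) * coeffs[j] + t * coeffs[j + 1]


def integrate(nodes, coeffs):
    """Integrate the density; the trapezoid rule is exact for linear pieces."""
    h = nodes[1] - nodes[0]
    return sum(h * (coeffs[j] + coeffs[j + 1]) / 2 for j in range(len(nodes) - 1))


nodes, coeffs = fit_hat_density([0.1, 0.2, 0.5, 0.9], 0.0, 1.0, 4)
print(integrate(nodes, coeffs), evaluate(nodes, coeffs, 0.5))
```

Under these assumptions the nodal coefficients are nonnegative by construction and the density integrates to one, and the cost is one pass over the samples regardless of how many nodes are used.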

    Current measures of metabolic heterogeneity within cervical cancer do not predict disease outcome

    Background: A previous study evaluated the intra-tumoral heterogeneity observed in the uptake of F-18 fluorodeoxyglucose (FDG) in pre-treatment positron emission tomography (PET) scans of cancers of the uterine cervix as an indicator of disease outcome. This was done via a novel statistic which ostensibly measured the spatial variations in intra-tumoral metabolic activity. In this work, we argue that statistic is intrinsically non-spatial, and that the apparent delineation between unsuccessfully- and successfully-treated patient groups via that statistic is spurious. Methods: We first offer a straightforward mathematical demonstration of our argument. Next, we recapitulate an assiduous re-analysis of the originally published data which was derived from FDG-PET imagery. Finally, we present the results of a principal component analysis of FDG-PET images similar to those previously analyzed. Results: We find that the previously published measure of intra-tumoral heterogeneity is intrinsically non-spatial, and actually is only a surrogate for tumor volume. We also find that an optimized linear combination of more canonical heterogeneity quantifiers does not predict disease outcome. Conclusions: Current measures of intra-tumoral metabolic activity are not predictive of disease outcome as has been claimed previously. The implications of this finding are: clinical categorization of patients based upon these statistics is invalid; more sophisticated, and perhaps innately-geometric, quantifications of metabolic activity are required for predicting disease outcome.

    Manifold Learning for Human Population Structure Studies

    The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the “intrinsic dimensionality” of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate application of LLE to population genetic analysis, we systematically investigate several important properties of LLE and reveal the connection between LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions that could be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying the LLE-correlated or PCA-correlated structure-informative markers, we have developed a new statistic that integrates genomic information content in a genomic region for collectively studying its association with the population structure, and we use the LASSO algorithm to search for such regions across the genomes. We applied the developed methodologies to a low coverage pilot dataset in the 1000 Genomes Project and a PHASE III Mexico dataset of the HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. The preliminary results demonstrated that next-generation sequencing offers a rich resource and LLE provides a powerful tool for population structure analysis.

    Evaluation of extra-virgin olive oils shelf life using an electronic tongue-chemometric approach

    Physicochemical quality parameters and positive olfactory and gustatory-retronasal sensations of extra-virgin olive oils vary during storage, leading to a decrease in overall quality. This quality decline may prevent olive oil from complying with its labeling and significantly reduce shelf life, resulting in important economic losses and negatively affecting consumer confidence. The feasibility of applying an electronic tongue to assess the usual commercial light storage conditions and storage time of olive oils was evaluated and compared with the discrimination potential of physicochemical or positive olfactory/gustatory sensorial parameters. Linear discriminant models, based on subsets of 58 electronic tongue sensor signals selected by the meta-heuristic simulated annealing variable selection algorithm, allowed the correct classification of olive oils according to the light exposure conditions and/or storage time (sensitivities and specificities for leave-one-out cross-validation: 82–96%). The predictive performance of the E-tongue approach was further evaluated using an external independent dataset selected using the Kennard-Stone algorithm and, in general, better classification rates (sensitivities and specificities for the external dataset: 67–100%) were obtained compared to those achieved using physicochemical or sensorial data. The work carried out is thus a proof of principle that the proposed electrochemical device could be a practical and versatile tool to successfully discriminate, in a single and fast electrochemical assay, olive oils with different storage times and/or exposed to different light conditions.
    The authors acknowledge the financial support from the strategic funding of the UID/BIO/04469/2013 unit and from Project POCI-01-0145-FEDER-006984 (Associate Laboratory LSRE-LCM), funded by FEDER funds through COMPETE2020, Programa Operacional Competitividade e Internacionalização (POCI), and by national funds through FCT, Fundação para a Ciência e a Tecnologia. Nuno Rodrigues thanks FCT, POPH-QREN and FSE for the Ph.D. Grant (SFRH/BD/104038/2014).

    Gender, Obesity and Repeated Elevation of C-Reactive Protein: Data from the CARDIA Cohort

    C-reactive protein (CRP) measurements above 10 mg/L have been conventionally treated as acute inflammation and excluded from epidemiologic studies of chronic inflammation. However, recent evidence suggests that such CRP elevations can be seen even with chronic inflammation. The authors assessed 3,300 participants in the Coronary Artery Risk Development in Young Adults (CARDIA) study who had two or more CRP measurements between 1992/3 and 2005/6 to a) investigate characteristics associated with repeated CRP elevation above 10 mg/L; b) identify subgroups at high risk of repeated elevation; and c) investigate the effect of different CRP thresholds on the probability of an elevation being one-time rather than repeated. 225 participants (6.8%) had one-time and 103 (3.1%) had repeated CRP elevation above 10 mg/L. Repeated elevation was associated with obesity, female gender, low income, and sex hormone use. The probability of an elevation above 10 mg/L being one-time rather than repeated was lowest (51%) in women with body mass index above 31 kg/m², compared to 82% in others. These findings suggest that CRP elevations above 10 mg/L in obese women are likely to be from chronic rather than acute inflammation, and that CRP thresholds above 10 mg/L may be warranted to distinguish acute from chronic inflammation in obese women.
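
The proportions quoted in this abstract can be re-derived from its counts with a few lines of arithmetic. The sketch below does that, and also computes a pooled share of elevations that were one-time. That pooled share is only an illustrative derivation from the two counts; it is not a figure reported in the abstract, which gives subgroup values of 51% and 82%. The helper name `pct` is an assumption of this example.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


participants = 3300   # participants with two or more CRP measurements
one_time = 225        # one-time elevation above 10 mg/L
repeated = 103        # repeated elevation above 10 mg/L

one_time_pct = pct(one_time, participants)
repeated_pct = pct(repeated, participants)

# Illustrative only: share of all participants with any elevation whose
# elevation was one-time (pooled over subgroups, not reported in the abstract).
pooled_one_time_pct = pct(one_time, one_time + repeated)

print(one_time_pct, repeated_pct, pooled_one_time_pct)
```

The first two values reproduce the 6.8% and 3.1% stated in the abstract, and the pooled share falls between the two subgroup probabilities it reports.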

    Limitations of Gene Duplication Models: Evolution of Modules in Protein Interaction Networks

    It is generally acknowledged that the module structure of protein interaction networks plays a crucial role in the functional understanding of these networks. In this paper, we study evolutionary aspects of the module structure of protein interaction networks, which forms a mesoscopic level of description with respect to the architectural principles of networks. The purpose of this paper is to investigate limitations of well-known gene duplication models by showing that these models lack crucial structural features present in protein interaction networks on a mesoscopic scale. This observation reveals our incomplete understanding of the structural evolution of protein networks on the module level.

    Ensemble preconditioning for Markov chain Monte Carlo simulation

    We describe parallel Markov chain Monte Carlo methods that propagate a collective ensemble of paths, with local covariance information calculated from neighboring replicas. The use of collective dynamics eliminates multiplicative noise and stabilizes the dynamics, thus providing a practical approach to difficult anisotropic sampling problems in high dimensions. Numerical experiments with model problems demonstrate that dramatic speedups over various alternative schemes are attainable.