13 research outputs found

    Two to Five Truths in Non-Negative Matrix Factorization

    In this paper, we explore the role of matrix scaling on a matrix of counts when building a topic model using non-negative matrix factorization. We present a scaling inspired by the normalized Laplacian (NL) for graphs that can greatly improve the quality of a non-negative matrix factorization. The results parallel those in the spectral graph clustering work of \cite{Priebe:2019}, where the authors proved that adjacency spectral embedding (ASE) spectral clustering was more likely to discover core-periphery partitions and Laplacian spectral embedding (LSE) was more likely to discover affinity partitions. In text analysis, non-negative matrix factorization (NMF) is typically used on a matrix of co-occurrence counts of "contexts" and "terms". The matrix scaling inspired by LSE gives significant improvement for text topic models in a variety of datasets. We illustrate how dramatically matrix scaling in NMF can improve the quality of a topic model on three datasets where human annotation is available. Using the adjusted Rand index (ARI), a measure of cluster similarity, we see an increase of 50% for Twitter data and over 200% for a newsgroup dataset versus using counts, which is the analogue of ASE. For clean data, such as those from the Document Understanding Conference, NL gives over 40% improvement over ASE. We conclude with some analysis of this phenomenon and some connections of this scaling with other matrix scaling methods.
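The abstract does not spell out the scaling, so the following is only a minimal sketch under an assumption: that the NL-inspired scaling divides each count by the square roots of its row sum and column sum, i.e. D_r^{-1/2} A D_c^{-1/2}. The function names `nl_scale` and `nmf` are hypothetical, and the factorization shown is the standard multiplicative-update NMF for the Frobenius loss, swapped in for illustration rather than taken from the paper.

```python
import numpy as np


def nl_scale(counts, eps=1e-12):
    """Assumed NL-style scaling: D_r^{-1/2} A D_c^{-1/2} for a count matrix A."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1)
    col_sums = counts.sum(axis=0)
    # Clamp the sums so empty rows/columns do not divide by zero;
    # their entries are all zero, so they stay zero after scaling.
    r = 1.0 / np.sqrt(np.maximum(row_sums, eps))
    c = 1.0 / np.sqrt(np.maximum(col_sums, eps))
    return counts * r[:, None] * c[None, :]


def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF minimizing ||X - WH||_F (illustrative only)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(iters):
        # Each update multiplies by a non-negative ratio, so W and H stay >= 0.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy context-by-term count matrix (made-up numbers, not data from the paper).
A = np.array([[4, 2, 0, 0],
              [2, 6, 0, 1],
              [0, 0, 5, 3],
              [0, 1, 3, 7]])

W_counts, H_counts = nmf(A.astype(float), k=2)  # raw counts (the ASE analogue)
W_scaled, H_scaled = nmf(nl_scale(A), k=2)      # assumed NL-style scaling
```

A hard topic assignment for each row can then be read off with `W_scaled.argmax(axis=1)` and compared against human labels using a measure such as ARI.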

    Proportion and characteristics of secondary progressive multiple sclerosis in five European registries using objective classifiers

    Background: Assigning a course of secondary progressive multiple sclerosis (MS) (SPMS) may be difficult, and the proportion of persons with SPMS varies between reports. An objective method for disease course classification may give a better estimation of the relative proportions of relapsing-remitting MS (RRMS) and SPMS and may identify situations where SPMS is underreported. Materials and methods: Data were obtained for 61,900 MS patients from MS registries in the Czech Republic, Denmark, Germany, Sweden, and the United Kingdom (UK), including date of birth, sex, SP conversion year, visits with an Expanded Disability Status Scale (EDSS) score, MS onset and diagnosis date, relapses, and disease-modifying treatment (DMT) use. We included RRMS or SPMS patients with at least one visit between January 2017 and December 2019 if ≥ 18 years of age. We applied three objective methods: a set of SPMS clinical trial inclusion criteria ("EXPAND criteria") modified for a real-world evidence setting, a modified version of the MSBase algorithm, and a recently published decision tree-based algorithm. Results: The clinically assigned proportion of SPMS varied from 8.7% (Czechia) to 34.3% (UK). Objective classifiers estimated the proportion of SPMS from 15.1% (Germany by the EXPAND criteria) to 58.0% (UK by the decision tree method). Because the classifiers require different numbers of EDSS scores, they varied in the proportion of patients they were able to classify, from 18% (UK by the MSBase algorithm) to 100% (the decision tree algorithm for all registries). Objectively classified SPMS patients were older, converted to SPMS later, and had higher EDSS at index date and higher EDSS at conversion. More objectively classified SPMS patients were on DMTs than clinically assigned SPMS patients. Conclusion: SPMS appears to be systematically underdiagnosed in MS registries. Reclassified patients were more commonly on DMTs.

    SCRIB and PUF60 Are Primary Drivers of the Multisystemic Phenotypes of the 8q24.3 Copy-Number Variant.

    Copy-number variants (CNVs) represent a significant interpretative challenge, given that each CNV typically affects the dosage of multiple genes. Here we report on five individuals with coloboma, microcephaly, developmental delay, short stature, and craniofacial, cardiac, and renal defects who harbor overlapping microdeletions on 8q24.3. Fine mapping localized a commonly deleted 78 kb region that contains three genes: SCRIB, NRBP2, and PUF60. In vivo dissection of the CNV showed discrete contributions of the planar cell polarity effector SCRIB and the splicing factor PUF60 to the syndromic phenotype, and the combinatorial suppression of both genes exacerbated some, but not all, phenotypic components. Consistent with these findings, we identified an individual with microcephaly, short stature, intellectual disability, and heart defects with a de novo c.505C>T variant leading to a p.His169Tyr change in PUF60. Functional testing of this allele in vivo and in vitro showed that the mutation perturbs the relative dosage of two PUF60 isoforms and, subsequently, the splicing efficiency of downstream PUF60 targets. These data inform the functions of two genes not associated previously with human genetic disease and demonstrate how CNVs can exhibit complex genetic architecture, with the phenotype being the amalgam of both discrete dosage dysfunction of single transcripts and also of binary genetic interactions.