31 research outputs found

    Doctor of Philosophy in Computing

    Dissertation. In the last two decades, an increasingly large amount of data has become available. Massive collections of videos, astronomical observations, social networking posts, network routing information, mobile location history and so forth are examples of real world data requiring processing for applications ranging from classification to predictions. Computational resources grow at a far more constrained rate, and hence the need for efficient algorithms that scale well. Over the past twenty years high quality theoretical algorithms have been developed for two central problems: nearest neighbor search and dimensionality reduction over Euclidean distances in worst case distributions. These two tasks are interesting in their own right. Nearest neighbor corresponds to a database query lookup, while dimensionality reduction is a form of compression on massive data. Moreover, these are also subroutines in algorithms ranging from clustering to classification. However, many highly relevant settings and distance measures have not received similar attention to that of worst case point sets in Euclidean space. The Bregman divergences include the information theoretic distances, such as entropy, of most relevance in many machine learning applications and yet prior to this dissertation lacked efficient dimensionality reductions, nearest neighbor algorithms, or even lower bounds on what could be possible. Furthermore, even in the Euclidean setting, theoretical algorithms do not leverage that almost all real world datasets have significant low-dimensional substructure. In this dissertation, we explore different models and techniques for similarity search and dimensionality reduction. What upper bounds can be obtained for nearest neighbors for Bregman divergences? What upper bounds can be achieved for dimensionality reduction for information theoretic measures? Are these problems indeed intrinsically of harder computational complexity than in the Euclidean setting? Can we improve the state of the art nearest neighbor algorithms for real world datasets in Euclidean space? These are the questions we investigate in this dissertation, and that we shed some new insight on. In the first part of our dissertation, we focus on Bregman divergences. We exhibit nearest neighbor algorithms, contingent on a distributional constraint on the datasets. We next show lower bounds suggesting that this constraint is in some sense inherent to the problem complexity. After this we explore dimensionality reduction techniques for the Jensen-Shannon and Hellinger distances, two popular information theoretic measures. In the second part, we show that even for the more well-studied Euclidean case, worst case nearest neighbor algorithms can be improved upon sharply for real world datasets with spectral structure.

    Spectral Approaches to Nearest Neighbor Search

    We study spectral algorithms for the high-dimensional Nearest Neighbor Search problem (NNS). In particular, we consider a semi-random setting where a dataset $P$ in $\mathbb{R}^d$ is chosen arbitrarily from an unknown subspace of low dimension $k \ll d$, and then perturbed by fully $d$-dimensional Gaussian noise. We design spectral NNS algorithms whose query time depends polynomially on $d$ and $\log n$ (where $n = |P|$) for large ranges of $k$, $d$ and $n$. Our algorithms use a repeated computation of the top PCA vector/subspace, and are effective even when the random-noise magnitude is much larger than the interpoint distances in $P$. Our motivation is that in practice, a number of spectral NNS algorithms outperform the random-projection methods that seem otherwise theoretically optimal on worst case datasets. In this paper we aim to provide theoretical justification for this disparity. Comment: Accepted in the proceedings of FOCS 2014. 30 pages and 4 figures.
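
    The abstract above builds on computing the top PCA vector of the dataset. As a rough illustration only, the following Python sketch finds one leading principal direction by power iteration and answers a query by comparing one-dimensional projections. It is a single-pass simplification, not the paper's repeated-PCA algorithm; the function names, the fixed start vector, and the toy points are all assumptions made for this example.

```python
import math

def top_pca_direction(points, iters=200):
    """Power iteration for the leading principal direction of `points`."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = [1.0 / math.sqrt(d)] * d  # assumed fixed start vector
    for _ in range(iters):
        # w = X^T (X v): applies the unnormalised covariance matrix to v
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return mean, v

def nearest_by_projection(points, query):
    """Index of the point whose 1-D PCA coordinate is closest to the query's."""
    mean, v = top_pca_direction(points)

    def coord(p):
        return sum((p[j] - mean[j]) * v[j] for j in range(len(v)))

    q = coord(query)
    return min(range(len(points)), key=lambda i: abs(coord(points[i]) - q))

# Toy data: points near the line spanned by (1, 1, 0), with small noise in z.
points = [(0.0, 0.0, 0.1), (1.0, 1.0, -0.1), (2.0, 2.0, 0.1), (3.0, 3.0, -0.1)]
print(nearest_by_projection(points, (2.1, 1.9, 0.0)))
```

    Projecting onto a single direction discards most of the geometry, so this sketch only behaves sensibly when the data are close to one-dimensional, as in the toy example.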

    Approximate Bregman near neighbors in sublinear time: beyond the triangle inequality

    Pre-print. Bregman divergences are important distance measures that are used extensively in data-driven applications such as computer vision, text mining, and speech processing, and are a key focus of interest in machine learning. Answering nearest neighbor (NN) queries under these measures is very important in these applications and has been the subject of extensive study, but is problematic because these distance measures lack metric properties like symmetry and the triangle inequality. In this paper, we present the first provably approximate nearest-neighbor (ANN) algorithms. These process queries in $O(\log n)$ time for Bregman divergences in fixed dimensional spaces. We also obtain $\mathrm{polylog}\, n$ bounds for a more abstract class of distance measures (containing Bregman divergences) which satisfy certain structural properties. Both of these bounds apply to the regular asymmetric Bregman divergences as well as their symmetrized versions. To do so, we develop two geometric properties vital to our analysis: a reverse triangle inequality (RTI) and a relaxed triangle inequality called $m$-defectiveness, where $m$ is a domain-dependent parameter. Bregman divergences satisfy the RTI but not $m$-defectiveness. However, we show that the square root of a Bregman divergence does satisfy $m$-defectiveness. This allows us to then utilize both properties in an efficient search data structure that follows the general two-stage paradigm of a ring-tree decomposition followed by a quad tree search used in previous near-neighbor algorithms for Euclidean space and spaces of bounded doubling dimension. Our first algorithm resolves a query for a $d$-dimensional $(1+\varepsilon)$-ANN in $O\left(\left(\frac{\log n}{\varepsilon}\right)^{O(d)}\right)$ time and $O(n \log^{d-1} n)$ space and holds for generic $m$-defective distance measures satisfying an RTI. Our second algorithm is more specific in analysis to the Bregman divergences and uses a further structural constant, the maximum ratio of second derivatives over each dimension of our domain ($c_0$). This allows us to locate a $(1+\varepsilon)$-ANN in $O(\log n)$ time and $O(n)$ space, where there is a further $(c_0)^d$ factor in the big-Oh for the query time.
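
    To make concrete why Bregman divergences lack metric properties, the sketch below evaluates the standard definition $D_\phi(p, q) = \phi(p) - \phi(q) - \langle \nabla\phi(q), p - q \rangle$ for two well-known generators: the squared norm (giving squared Euclidean distance) and negative entropy (giving the KL divergence on probability vectors). The helper names and test points are assumptions chosen for illustration; the code does not implement the ring-tree or quad-tree structures from the abstract, nor does it check $m$-defectiveness in general.

```python
import math

def bregman(phi, grad_phi, p, q):
    """D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner

# Generator phi(x) = ||x||^2 yields the squared Euclidean distance.
def sq_norm(x):
    return sum(v * v for v in x)

def sq_norm_grad(x):
    return [2.0 * v for v in x]

# Generator phi(x) = sum_i x_i log x_i yields KL divergence on probability vectors.
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)

def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]

def sq_euclid(p, q):
    return bregman(sq_norm, sq_norm_grad, p, q)

def kl(p, q):
    return bregman(neg_entropy, neg_entropy_grad, p, q)

# Asymmetry: swapping the arguments changes the KL divergence.
p, q = [0.9, 0.1], [0.5, 0.5]
print("KL(p, q) =", kl(p, q), " KL(q, p) =", kl(q, p))

# Triangle inequality fails for squared Euclidean distance on collinear points.
a, b, c = [0.0], [1.0], [2.0]
print("D(a, c) =", sq_euclid(a, c), " D(a, b) + D(b, c) =", sq_euclid(a, b) + sq_euclid(b, c))

# Taking square roots restores it in this particular (Euclidean) case.
print(math.sqrt(sq_euclid(a, c)), "<=", math.sqrt(sq_euclid(a, b)) + math.sqrt(sq_euclid(b, c)))
```

    In the squared Euclidean case the square root is an ordinary metric, which is the simplest instance of the abstract's observation that the square root of a Bregman divergence behaves better than the divergence itself.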

    Streaming Verification of Graph Properties

    Streaming interactive proofs (SIPs) are a framework for outsourced computation. A computationally limited streaming client (the verifier) hands over a large data set to an untrusted server (the prover) in the cloud and the two parties run a protocol to confirm the correctness of the result with high probability. SIPs are particularly interesting for problems that are hard to solve (or even approximate) well in a streaming setting. The most notable of these problems is finding maximum matchings, which has received intense interest in recent years but has strong lower bounds even for constant factor approximations. In this paper, we present efficient streaming interactive proofs that can verify maximum matchings exactly. Our results cover all flavors of matchings (bipartite/non-bipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP. In particular, these are the first efficient results for weighted matchings and for metric TSP in any streaming verification model. Comment: 26 pages, 2 figures, 1 table.

    Global burden of chronic respiratory diseases and risk factors, 1990–2019: an update from the Global Burden of Disease Study 2019

    Background: Updated data on chronic respiratory diseases (CRDs) are vital in their prevention, control, and treatment in the path to achieving the third UN Sustainable Development Goal (SDG), a one-third reduction in premature mortality from non-communicable diseases by 2030. We provided global, regional, and national estimates of the burden of CRDs and their attributable risks from 1990 to 2019. Methods: Using data from the Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2019, we estimated mortality, years lived with disability, years of life lost, disability-adjusted life years (DALYs), prevalence, and incidence of CRDs, i.e. chronic obstructive pulmonary disease (COPD), asthma, pneumoconiosis, interstitial lung disease and pulmonary sarcoidosis, and other CRDs, from 1990 to 2019 by sex, age, region, and Socio-demographic Index (SDI) in 204 countries and territories. Deaths and DALYs from CRDs attributable to each risk factor were estimated according to relative risks, risk exposure, and the theoretical minimum risk exposure level input. Findings: In 2019, CRDs were the third leading cause of death, responsible for 4.0 million deaths (95% uncertainty interval 3.6–4.3), with a prevalence of 454.6 million cases (417.4–499.1) globally. While the total deaths and prevalence of CRDs have increased by 28.5% and 39.8%, the age-standardised rates have dropped by 41.7% and 16.9% from 1990 to 2019, respectively. COPD, with 212.3 million (200.4–225.1) prevalent cases, was the primary cause of deaths from CRDs, accounting for 3.3 million (2.9–3.6) deaths. With 262.4 million (224.1–309.5) prevalent cases, asthma had the highest prevalence among CRDs. The age-standardised rates of all burden measures of COPD, asthma, and pneumoconiosis have reduced globally from 1990 to 2019. Nevertheless, the age-standardised rates of incidence and prevalence of interstitial lung disease and pulmonary sarcoidosis have increased throughout this period. Low- and low-middle SDI countries had the highest age-standardised death and DALYs rates while the high SDI quintile had the highest prevalence rate of CRDs. The highest deaths and DALYs from CRDs were attributed to smoking globally, followed by air pollution and occupational risks. Non-optimal temperature and high body-mass index were additional risk factors for COPD and asthma, respectively. Interpretation: Although the age-standardised prevalence, death, and DALYs rates of CRDs have decreased, they still cause a substantial burden and deaths worldwide. The high death and DALYs rates in low and low-middle SDI countries highlight the urgent need for improved preventive, diagnostic, and therapeutic measures. Global strategies for tobacco control, enhancing air quality, reducing occupational hazards, and fostering clean cooking fuels are crucial steps in reducing the burden of CRDs, especially in low- and lower-middle income countries. Funding: Bill & Melinda Gates Foundation.

    Global, regional, and national burden of diabetes from 1990 to 2021, with projections of prevalence to 2050: a systematic analysis for the Global Burden of Disease Study 2021

    Background: Diabetes is one of the leading causes of death and disability worldwide, and affects people regardless of country, age group, or sex. Using the most recent evidentiary and analytical framework from the Global Burden of Diseases, Injuries, and Risk Factors Study (GBD), we produced location-specific, age-specific, and sex-specific estimates of diabetes prevalence and burden from 1990 to 2021, the proportion of type 1 and type 2 diabetes in 2021, the proportion of the type 2 diabetes burden attributable to selected risk factors, and projections of diabetes prevalence through 2050. Methods: Estimates of diabetes prevalence and burden were computed in 204 countries and territories, across 25 age groups, for males and females separately and combined; these estimates comprised lost years of healthy life, measured in disability-adjusted life-years (DALYs; defined as the sum of years of life lost [YLLs] and years lived with disability [YLDs]). We used the Cause of Death Ensemble model (CODEm) approach to estimate deaths due to diabetes, incorporating 25 666 location-years of data from vital registration and verbal autopsy reports in separate total (including both type 1 and type 2 diabetes) and type-specific models. Other forms of diabetes, including gestational and monogenic diabetes, were not explicitly modelled. Total and type 1 diabetes prevalence was estimated by use of a Bayesian meta-regression modelling tool, DisMod-MR 2.1, to analyse 1527 location-years of data from the scientific literature, survey microdata, and insurance claims; type 2 diabetes estimates were computed by subtracting type 1 diabetes from total estimates. Mortality and prevalence estimates, along with standard life expectancy and disability weights, were used to calculate YLLs, YLDs, and DALYs. When appropriate, we extrapolated estimates to a hypothetical population with a standardised age structure to allow comparison in populations with different age structures. We used the comparative risk assessment framework to estimate the risk-attributable type 2 diabetes burden for 16 risk factors falling under risk categories including environmental and occupational factors, tobacco use, high alcohol use, high body-mass index (BMI), dietary factors, and low physical activity. Using a regression framework, we forecast type 1 and type 2 diabetes prevalence through 2050 with Socio-demographic Index (SDI) and high BMI as predictors, respectively. Findings: In 2021, there were 529 million (95% uncertainty interval [UI] 500–564) people living with diabetes worldwide, and the global age-standardised total diabetes prevalence was 6·1% (5·8–6·5). At the super-region level, the highest age-standardised rates were observed in north Africa and the Middle East (9·3% [8·7–9·9]) and, at the regional level, in Oceania (12·3% [11·5–13·0]). Nationally, Qatar had the world's highest age-specific prevalence of diabetes, at 76·1% (73·1–79·5) in individuals aged 75–79 years. Total diabetes prevalence, especially among older adults, primarily reflects type 2 diabetes, which in 2021 accounted for 96·0% (95·1–96·8) of diabetes cases and 95·4% (94·9–95·9) of diabetes DALYs worldwide. In 2021, 52·2% (25·5–71·8) of global type 2 diabetes DALYs were attributable to high BMI. The contribution of high BMI to type 2 diabetes DALYs rose by 24·3% (18·5–30·4) worldwide between 1990 and 2021. By 2050, more than 1·31 billion (1·22–1·39) people are projected to have diabetes, with expected age-standardised total diabetes prevalence rates greater than 10% in two super-regions: 16·8% (16·1–17·6) in north Africa and the Middle East and 11·3% (10·8–11·9) in Latin America and Caribbean. By 2050, 89 (43·6%) of 204 countries and territories will have an age-standardised rate greater than 10%. Interpretation: Diabetes remains a substantial public health issue. Type 2 diabetes, which makes up the bulk of diabetes cases, is largely preventable and, in some cases, potentially reversible if identified and managed early in the disease course. However, all evidence indicates that diabetes prevalence is increasing worldwide, primarily due to a rise in obesity caused by multiple factors. Preventing and controlling type 2 diabetes remains an ongoing challenge. It is essential to better understand disparities in risk factor profiles and diabetes burden across populations, to inform strategies to successfully control diabetes risk factors within the context of multiple and complex drivers.
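
    The abstract above refers to extrapolating estimates to a population with a standardised age structure. One common way to do this is direct age standardisation: weight each age-specific rate by a fixed reference age distribution. The Python sketch below shows that calculation on invented numbers; the two age groups, case counts, and equal weights are assumptions for illustration, not GBD data, and the GBD's own procedure may differ in detail.

```python
def crude_rate(cases, population):
    """Overall rate: total cases divided by total population."""
    return sum(cases) / sum(population)

def age_standardised_rate(cases, population, standard_weights):
    """Direct standardisation: weighted mean of age-specific rates."""
    total_weight = sum(standard_weights)
    weighted = sum(w * c / p for c, p, w in zip(cases, population, standard_weights))
    return weighted / total_weight

# Hypothetical two-group example (younger, older); all numbers are invented.
weights = [0.5, 0.5]                      # assumed reference age structure
pop_early, cases_early = [900, 100], [9.0, 10.0]    # rates 0.010 and 0.100
pop_late, cases_late = [600, 400], [5.4, 36.0]      # rates 0.009 and 0.090

print("crude:", crude_rate(cases_early, pop_early), "->", crude_rate(cases_late, pop_late))
print("age-standardised:",
      age_standardised_rate(cases_early, pop_early, weights), "->",
      age_standardised_rate(cases_late, pop_late, weights))
```

    In this toy example every age-specific rate falls, yet the crude rate rises because the population has aged; the age-standardised rate removes that effect, which is why the two kinds of figures can move in opposite directions.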
