27 research outputs found
On the information theory of clustering, registration, and blockchains
Progress in data science depends on the collection and storage of large volumes of reliable data, efficient and consistent inference based on this data, and trusting such computations made by untrusted peers. Information theory provides the means to analyze statistical inference algorithms, inspires the design of statistically consistent learning algorithms, and informs the design of large-scale systems for information storage and sharing. In this thesis, we focus on the problems of reliability, universality, integrity, trust, and provenance in data storage, distributed computing, and information processing algorithms and develop technical solutions and mathematical insights using information-theoretic tools.
In unsupervised information processing we consider the problems of data clustering and image registration. In particular, we evaluate the performance of the max mutual information method for image registration by studying its error exponent and prove its universal asymptotic optimality. We further extend this to design the max multiinformation method for universal multi-image registration and prove its universal asymptotic optimality. We then evaluate the non-asymptotic performance of image registration to understand the effects of the properties of the image transformations and the channel noise on the algorithms.
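The max mutual information method mentioned above can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes a 1-D "image" of discrete pixel labels, a cyclic shift as the only transformation, and a plug-in (empirical) mutual information estimate; the function names and toy data are hypothetical.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Empirical (plug-in) mutual information, in nats, of two equal-length label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi

def register_by_max_mi(reference, observed, candidate_shifts):
    """Return the candidate cyclic shift whose inverse best aligns `observed` with `reference`."""
    def undo(shift):
        # Undo a cyclic right shift by `shift` positions.
        return observed[shift:] + observed[:shift]
    return max(candidate_shifts, key=lambda s: mutual_information(reference, undo(s)))

# Toy data (assumption): random 4-level pixels, shifted by 7 and intensity-inverted.
rng = random.Random(0)
reference = [rng.randrange(4) for _ in range(200)]
shifted = reference[-7:] + reference[:-7]   # cyclic right shift by 7
observed = [3 - v for v in shifted]         # unknown intensity remapping
estimated_shift = register_by_max_mi(reference, observed, range(20))
```

Because mutual information depends only on the joint statistics of the pixel labels, the estimate does not need to know the intensity remapping, which is the sense in which such a method can be called universal.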
In data clustering we study the problem of independence clustering of sources using multivariate information functionals. In particular, we define consistent image clustering algorithms using the cluster information, and define a new multivariate information functional called illum information that inspires other independence clustering methods. We also consider the problem of clustering objects based on labels provided by temporary and long-term workers in a crowdsourcing platform. Here we define budget-optimal universal clustering algorithms using distributional identicality and temporal dependence in the responses of workers.
For the problem of reliable data storage, we consider the use of blockchain systems, and design secure distributed storage codes to reduce the cost of cold storage of blockchain ledgers. Additionally, we use dynamic zone allocation strategies to enhance the integrity and confidentiality of these systems, and frame optimization problems for designing codes applicable for cloud storage and data insurance.
Finally, for the problem of establishing trust in computations over untrusting peer-to-peer networks, we develop a large-scale blockchain system by defining the validation protocols and compression scheme to facilitate an efficient audit of computations that can be shared in a trusted manner across peers over the immutable blockchain ledger. We evaluate the system over some simple synthetic computational experiments and highlight its capacity for identifying anomalous computations and enhancing computational integrity.
Beliefs and expertise in sequential decision making
This work explores a sequential decision making problem with agents having diverse expertise and mismatched beliefs. We consider an N-agent sequential binary hypothesis test in which each agent sequentially makes a decision based not only on a private observation, but also on previous agents’ decisions. In addition, the agents have their own beliefs instead of the true prior, and have varying expertise in terms of the noise variance in the private signal. We focus on the risk of the last-acting agent, where preceding agents are selfish. Thus, we call this advisor(s)-advisee sequential decision making. We first derive the optimal decision rule by recursive belief update and conclude, counterintuitively, that beliefs deviating from the true prior could be optimal in this setting. The impact of diverse noise levels (which means diverse expertise levels) in the two-agent case is also considered and the analytical properties of the optimal belief curves are given. These curves, for certain cases, resemble probability weighting functions from cumulative prospect theory, and so we also discuss the choice of Prelec weighting functions as an approximation for the optimal beliefs, and the possible psychophysical optimality of human beliefs. Next, we consider an advisor selection problem wherein the advisee of a certain belief chooses an advisor from a set of candidates with varying beliefs. We characterize the decision region for choosing such an advisor and argue that an advisee with beliefs varying from the true prior often ends up selecting a suboptimal advisor, indicating the need for a social planner. We close with a discussion on the implications of the study toward designing artificial intelligence systems for augmenting human intelligence.
https://arxiv.org/abs/1812.04419
First author draft
Beliefs in Decision-Making Cascades
This work explores a social learning problem with agents having nonidentical
noise variances and mismatched beliefs. We consider an N-agent binary
hypothesis test in which each agent sequentially makes a decision based not
only on a private observation, but also on preceding agents' decisions. In
addition, the agents have their own beliefs instead of the true prior, and have
nonidentical noise variances in the private signal. We focus on the Bayes risk
of the last agent, where preceding agents are selfish.
We first derive the optimal decision rule by recursive belief update and
conclude, counterintuitively, that beliefs deviating from the true prior could
be optimal in this setting. The effect of nonidentical noise levels in the
two-agent case is also considered and analytical properties of the optimal
belief curves are given. Next, we consider a predecessor selection problem
wherein the subsequent agent of a certain belief chooses a predecessor from a
set of candidates with varying beliefs. We characterize the decision region for
choosing such a predecessor and argue that a subsequent agent with beliefs
varying from the true prior often ends up selecting a suboptimal predecessor,
indicating the need for a social planner. Lastly, we discuss an augmented
intelligence design problem that uses a model of human behavior from cumulative
prospect theory and investigate its near-optimality and suboptimality.
Comment: final version, to appear in IEEE Transactions on Signal Processing
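The recursive belief update in the two-agent case can be sketched numerically. This is an illustration under stated assumptions, not the paper's derivation: hypotheses H in {0, 1}, private signals y = H + Gaussian noise, and a second agent that knows the first agent's actual decision threshold (the paper's model of what agents know about each other may differ). All function names and parameters are hypothetical.

```python
import math

def q_func(x):
    """Gaussian tail probability P(Z > x) for standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def threshold(belief, sigma, c_fa=1.0, c_md=1.0):
    """Decide H=1 when y >= threshold; `belief` is the agent's perceived P(H=1)."""
    return 0.5 + sigma ** 2 * math.log(c_fa * (1.0 - belief) / (c_md * belief))

def update_belief(belief, p_d_given_h1, p_d_given_h0):
    """Bayes update of a belief in H=1 after observing a predecessor's decision."""
    num = belief * p_d_given_h1
    return num / (num + (1.0 - belief) * p_d_given_h0)

def single_agent_risk(prior, belief, sigma, c_fa=1.0, c_md=1.0):
    """Bayes risk (under the true prior) of one agent acting on its own belief."""
    lam = threshold(belief, sigma, c_fa, c_md)
    p_fa = q_func(lam / sigma)
    p_md = 1.0 - q_func((lam - 1.0) / sigma)
    return c_fa * (1.0 - prior) * p_fa + c_md * prior * p_md

def last_agent_risk(prior, q1, q2, sigma1, sigma2, c_fa=1.0, c_md=1.0):
    """Bayes risk of agent 2, who sees agent 1's decision before its own signal."""
    lam1 = threshold(q1, sigma1, c_fa, c_md)
    p1_h0 = q_func(lam1 / sigma1)          # P(agent 1 decides 1 | H=0)
    p1_h1 = q_func((lam1 - 1.0) / sigma1)  # P(agent 1 decides 1 | H=1)
    p_fa = 0.0
    p_md = 0.0
    for d1 in (0, 1):
        pd_h0 = p1_h0 if d1 == 1 else 1.0 - p1_h0
        pd_h1 = p1_h1 if d1 == 1 else 1.0 - p1_h1
        # Agent 2 updates its own belief q2 using agent 1's decision.
        lam2 = threshold(update_belief(q2, pd_h1, pd_h0), sigma2, c_fa, c_md)
        p_fa += pd_h0 * q_func(lam2 / sigma2)
        p_md += pd_h1 * (1.0 - q_func((lam2 - 1.0) / sigma2))
    return c_fa * (1.0 - prior) * p_fa + c_md * prior * p_md
```

Sweeping `q1` over a grid for a fixed true prior and comparing `last_agent_risk` values is one way to explore whether a belief different from the true prior lowers the last agent's risk in this simplified setting.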
Large expert-curated database for benchmarking document similarity detection in biomedical literature search
Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.
Peer reviewed
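As a rough illustration of one of the baseline families named above, the sketch below ranks candidate documents against a seed by TF-IDF cosine similarity. It is not the consortium's pipeline: the whitespace tokenizer, the smoothed IDF variant, and the example titles are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a sparse {term: weight} TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (one common variant, chosen here as an assumption).
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(seed, candidates):
    """Return candidate indices sorted from most to least similar to the seed."""
    vectors = tfidf_vectors([seed] + list(candidates))
    seed_vec, cand_vecs = vectors[0], vectors[1:]
    scores = [cosine(seed_vec, v) for v in cand_vecs]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Hypothetical titles, for illustration only.
seed_title = "gene expression in cancer cells"
candidate_titles = ["stock market volatility", "cancer gene therapy trial"]
ranking = rank_candidates(seed_title, candidate_titles)
```

A benchmark of annotated seed/candidate pairs would then be used to score such rankings against expert relevance labels.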
Nations within a nation: variations in epidemiological transition across the states of India, 1990–2016 in the Global Burden of Disease Study
18% of the world's population lives in India, and many states of India have populations similar to those of large countries. Action to effectively improve population health in India requires availability of reliable and comprehensive state-level estimates of disease burden and risk factors over time. Such comprehensive estimates have not been available so far for all major diseases and risk factors. Thus, we aimed to estimate the disease burden and risk factors in every state of India as part of the Global Burden of Disease (GBD) Study 2016.
Global burden of 288 causes of death and life expectancy decomposition in 204 countries and territories and 811 subnational locations, 1990–2021: a systematic analysis for the Global Burden of Disease Study 2021
BACKGROUND Regular, detailed reporting on population health by underlying cause of death is fundamental for public health decision making. Cause-specific estimates of mortality and the subsequent effects on life expectancy worldwide are valuable metrics to gauge progress in reducing mortality rates. These estimates are particularly important following large-scale mortality spikes, such as the COVID-19 pandemic. When systematically analysed, mortality rates and life expectancy allow comparisons of the consequences of causes of death globally and over time, providing a nuanced understanding of the effect of these causes on global populations. METHODS The Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2021 cause-of-death analysis estimated mortality and years of life lost (YLLs) from 288 causes of death by age-sex-location-year in 204 countries and territories and 811 subnational locations for each year from 1990 until 2021. The analysis used 56 604 data sources, including data from vital registration and verbal autopsy as well as surveys, censuses, surveillance systems, and cancer registries, among others. As with previous GBD rounds, cause-specific death rates for most causes were estimated using the Cause of Death Ensemble model (a modelling tool developed for GBD to assess the out-of-sample predictive validity of different statistical models and covariate permutations and combine those results to produce cause-specific mortality estimates), with alternative strategies adapted to model causes with insufficient data, substantial changes in reporting over the study period, or unusual epidemiology. YLLs were computed as the product of the number of deaths for each cause-age-sex-location-year and the standard life expectancy at each age. As part of the modelling process, uncertainty intervals (UIs) were generated using the 2·5th and 97·5th percentiles from a 1000-draw distribution for each metric.
We decomposed life expectancy by cause of death, location, and year to show cause-specific effects on life expectancy from 1990 to 2021. We also used the coefficient of variation and the fraction of population affected by 90% of deaths to highlight concentrations of mortality. Findings are reported in counts and age-standardised rates. Methodological improvements for cause-of-death estimates in GBD 2021 include the expansion of under-5-years age group to include four new age groups, enhanced methods to account for stochastic variation of sparse data, and the inclusion of COVID-19 and other pandemic-related mortality, which includes excess mortality associated with the pandemic, excluding COVID-19, lower respiratory infections, measles, malaria, and pertussis. For this analysis, 199 new country-years of vital registration cause-of-death data, 5 country-years of surveillance data, 21 country-years of verbal autopsy data, and 94 country-years of other data types were added to those used in previous GBD rounds. FINDINGS The leading causes of age-standardised deaths globally were the same in 2019 as they were in 1990; in descending order, these were ischaemic heart disease, stroke, chronic obstructive pulmonary disease, and lower respiratory infections. In 2021, however, COVID-19 replaced stroke as the second-leading age-standardised cause of death, with 94·0 deaths (95% UI 89·2-100·0) per 100 000 population. The COVID-19 pandemic shifted the rankings of the leading five causes, lowering stroke to the third-leading and chronic obstructive pulmonary disease to the fourth-leading position. In 2021, the highest age-standardised death rates from COVID-19 occurred in sub-Saharan Africa (271·0 deaths [250·1-290·7] per 100 000 population) and Latin America and the Caribbean (195·4 deaths [182·1-211·4] per 100 000 population).
The lowest age-standardised death rates from COVID-19 were in the high-income super-region (48·1 deaths [47·4-48·8] per 100 000 population) and southeast Asia, east Asia, and Oceania (23·2 deaths [16·3-37·2] per 100 000 population). Globally, life expectancy steadily improved between 1990 and 2019 for 18 of the 22 investigated causes. Decomposition of global and regional life expectancy showed the positive effect that reductions in deaths from enteric infections, lower respiratory infections, stroke, and neonatal deaths, among others have contributed to improved survival over the study period. However, a net reduction of 1·6 years occurred in global life expectancy between 2019 and 2021, primarily due to increased death rates from COVID-19 and other pandemic-related mortality. Life expectancy was highly variable between super-regions over the study period, with southeast Asia, east Asia, and Oceania gaining 8·3 years (6·7-9·9) overall, while having the smallest reduction in life expectancy due to COVID-19 (0·4 years). The largest reduction in life expectancy due to COVID-19 occurred in Latin America and the Caribbean (3·6 years). Additionally, 53 of the 288 causes of death were highly concentrated in locations with less than 50% of the global population as of 2021, and these causes of death became progressively more concentrated since 1990, when only 44 causes showed this pattern. The concentration phenomenon is discussed heuristically with respect to enteric and lower respiratory infections, malaria, HIV/AIDS, neonatal disorders, tuberculosis, and measles. INTERPRETATION Long-standing gains in life expectancy and reductions in many of the leading causes of death have been disrupted by the COVID-19 pandemic, the adverse effects of which were spread unevenly among populations. Despite the pandemic, there has been continued progress in combatting several notable causes of death, leading to improved global life expectancy over the study period. 
Each of the seven GBD super-regions showed an overall improvement from 1990 to 2021, obscuring the negative effect in the years of the pandemic. Additionally, our findings regarding regional variation in causes of death driving increases in life expectancy hold clear policy utility. Analyses of shifting mortality trends reveal that several causes, once widespread globally, are now increasingly concentrated geographically. These changes in mortality concentration, alongside further investigation of changing risks, interventions, and relevant policy, present an important opportunity to deepen our understanding of mortality-reduction strategies. Examining patterns in mortality concentration might reveal areas where successful public health interventions have been implemented. Translating these successes to locations where certain causes of death remain entrenched can inform policies that work to improve life expectancy for people everywhere. FUNDING Bill & Melinda Gates Foundation
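The two computations stated in the METHODS of this abstract (YLLs as deaths times standard life expectancy, and uncertainty intervals from the 2·5th and 97·5th percentiles of 1000 draws) can be sketched as follows. This is not GBD code. The linear-interpolation percentile rule, the simulated draws, and the life-expectancy value are assumptions chosen only to make the example runnable.

```python
import math
import random

def years_of_life_lost(deaths, standard_life_expectancy):
    """YLLs for one cause-age-sex-location-year cell."""
    return deaths * standard_life_expectancy

def percentile(sorted_values, pct):
    """Percentile by linear interpolation between order statistics (assumed convention)."""
    if not sorted_values:
        raise ValueError("need at least one draw")
    pos = (pct / 100.0) * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])

def uncertainty_interval(draws):
    """Return the (2.5th, 97.5th) percentile pair of a collection of draws."""
    ordered = sorted(draws)
    return percentile(ordered, 2.5), percentile(ordered, 97.5)

# Hypothetical illustration: 1000 simulated death-count draws for a single cell,
# with an assumed standard life expectancy of 30 years at that age.
rng = random.Random(2021)
death_draws = [rng.gauss(1000.0, 50.0) for _ in range(1000)]
yll_draws = [years_of_life_lost(d, 30.0) for d in death_draws]
yll_lower, yll_upper = uncertainty_interval(yll_draws)
```

Computing the interval on the per-draw YLLs, rather than on a single point estimate, is what lets the uncertainty in the death estimates carry through to the derived metric.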
Decision Making in Star Networks with Incorrect Beliefs
Consider a Bayesian binary decision-making problem in star networks, where
local agents make selfish decisions independently, and a fusion agent makes a
final decision based on aggregated decisions and its own private signal. In
particular, we assume all agents have private beliefs for the true prior
probability, based on which they perform Bayesian decision making. We focus on
the Bayes risk of the fusion agent and counterintuitively find that incorrect
beliefs could achieve a smaller risk than that when agents know the true prior.
It is of independent interest for sociotechnical system design that the optimal
beliefs of local agents resemble human probability reweighting models from
cumulative prospect theory.
We also consider asymptotic characterization of the optimal beliefs and
fusion agent's risk in the number of local agents. We find that the optimal
risk of the fusion agent converges to zero exponentially fast as the number of
local agents grows. Furthermore, having an identical constant belief is
asymptotically optimal in the sense of the risk exponent. For additive Gaussian
noise, the optimal belief turns out to be a simple function of only error costs
and the risk exponent can be explicitly characterized.
Comment: final version, to appear in IEEE Transactions on Signal Processing