
    Resampling approaches in biometrical applications


    Addressing the challenges of uncertainty in regression models for high dimensional and heterogeneous data from observational studies

    The lack of replicability of research findings across scientific disciplines has gained wide attention in the last few years and led to extensive discussions. In this 'replication crisis', different types of uncertainty play an important role, arising at different points of data collection and statistical analysis. Nevertheless, their consequences are often ignored in current research practice, at the risk of low credibility and reliability of research findings. For the analysis of this problem and the development of solutions, we define measurement uncertainty, sampling uncertainty, data pre-processing uncertainty, method uncertainty, and model uncertainty, and investigate them in particular in the context of regression analyses. To this end, we consider data from observational studies, focusing on high dimensionality and heterogeneous variables, two characteristics of growing importance. High-dimensional data, i.e., data with more variables than observations, play an important role in medical research, where large amounts of molecular data (omics data) can be collected with ever-decreasing expense and effort. Where several types of omics data are available, we are additionally faced with heterogeneity. Moreover, heterogeneous data can be found in many observational studies where data originate from different sources or where variables of different types are collected. This work comprises four contributions with different approaches to this topic and different focuses of investigation.
Contribution 1 can be considered a practical example illustrating data pre-processing and method uncertainty in the context of prediction and variable selection from high-dimensional and heterogeneous data. In the first part of this paper, we introduce the development of priority-Lasso, a hierarchical method for prediction using multi-omics data. Priority-Lasso is based on the standard Lasso and assumes a pre-specified priority order of blocks of data. The idea is to successively fit Lasso models on these blocks and to take the linear predictor from every fit as an offset in the fit of the block with the next-lowest priority. In the second part, we apply this method in a current study of acute myeloid leukemia (AML) and compare its performance to the standard Lasso. We illustrate data pre-processing and method uncertainty caused by different choices of variable definitions and specifications of settings in the application of the method; these choices result in different effect estimates and thus in different prediction performances and selected variables. In the second contribution, we compare method uncertainty with sampling uncertainty in the context of variable selection and ranking of omics biomarkers. For this purpose, we develop a user-friendly and versatile framework, which we apply to data from AML patients with high-dimensional and heterogeneous characteristics, exploring three different scenarios: first, variable selection in multivariable regression based on multi-omics data; second, variable ranking based on variable importance measures from random forests; and, third, identification of genes based on differential gene expression analysis.
In contributions 3 and 4, we apply the vibration of effects framework, which was initially used to analyze model uncertainty in a large epidemiological study (NHANES), to assess and compare different types of uncertainty. Both contributions address in depth the methodological extension of this framework to further types of uncertainty. In contribution 3, we describe its extension to sampling and data pre-processing uncertainty. As a practical illustration, we take a large data set from psychological research with a heterogeneous variable structure (the SAPA project) and examine sampling, model and data pre-processing uncertainty in the context of logistic regression for varying sample sizes. Beyond the comparison of single types of uncertainty, we introduce a strategy for quantifying cumulative model and data pre-processing uncertainty and for analyzing their relative contributions to the total uncertainty with a variance decomposition. Finally, in contribution 4 we extend the vibration of effects framework to measurement uncertainty. In a practical example, we compare sampling, model and measurement uncertainty on the NHANES data set in the context of survival analysis, focusing on different scenarios of measurement uncertainty which differ in the choice of variables considered to be measured with error. Moreover, we analyze the behavior of the different types of uncertainty with increasing sample sizes in a large simulation study.
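
The offset mechanism of priority-Lasso described above can be illustrated with a small sketch. This is not the authors' priorityLasso R package but a conceptual re-implementation for a continuous outcome using scikit-learn; the block order, the fixed penalty `alpha` and the input data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def priority_lasso(blocks, y, alpha=0.1):
    """Conceptual sketch of the priority-Lasso idea for a continuous outcome.

    blocks: list of (n_samples, p_block) arrays, ordered from highest to
            lowest priority; alpha is an assumed, fixed penalty.
    """
    offset = np.zeros(len(y))
    fits = []
    for X in blocks:
        model = Lasso(alpha=alpha)
        # Fit the current block on what the higher-priority blocks left unexplained.
        model.fit(X, y - offset)
        fits.append(model)
        # The linear predictor of this block becomes part of the offset for the next one.
        offset += model.intercept_ + X @ model.coef_
    return fits, offset  # 'offset' is now the combined linear predictor

```
In the actual method the penalty of each block is typically tuned by cross-validation, and for binary or survival outcomes the offset enters the corresponding model's likelihood rather than being subtracted from the response.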

    Prediction of Medical Outcomes with Modern Modelling Techniques

    The aim of this research is to investigate under which circumstances and conditions relatively modern modelling techniques, such as support vector machines, neural networks and random forests, could offer advantages in medical-scientific research and in medical practice compared with more traditional modelling techniques such as linear regression, logistic regression and Cox regression.
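
As a rough illustration of the kind of head-to-head comparison addressed here, one can contrast the cross-validated discrimination of a traditional and a modern model. This is a minimal sketch on synthetic data with default settings, not the analyses performed in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a clinical data set with a binary outcome.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

models = {
    "logistic regression (traditional)": LogisticRegression(max_iter=5000),
    "random forest (modern)": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: cross-validated AUC = {auc:.3f}")
```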

    A COMPREHENSIVE PIPELINE FOR CLASS COMPARISON AND CLASS PREDICTION IN CANCER RESEARCH

    Personalized medicine is an emerging field that promises to bring radical changes in healthcare and may be defined as “a medical model using molecular profiling technologies for tailoring the right therapeutic strategy for the right person at the right time, and determine the predisposition to disease at the population level and to deliver timely and stratified prevention”. The sequencing of the human genome, together with the development and implementation of new high-throughput technologies, has provided access to large ‘omics’ (e.g. genomics, proteomics) data, bringing a better understanding of cancer biology and enabling new approaches to diagnosis, drug development, and individualized therapy. ‘Omics’ data have potential as cancer biomarkers, but no consolidated guidelines have been established for discovery analyses. In the context of the EDERA project, funded by the Italian Association for Cancer Research, a structured pipeline was developed with innovative applications of existing bioinformatics methods, including: 1) the combination of the results of two statistical tests (t and Anderson-Darling) to detect features with a significant fold change or general distributional differences in class comparison; 2) the application of a bootstrap selection procedure together with machine learning techniques to guarantee result generalizability and to study the interconnections among the selected features in class prediction. This pipeline was successfully applied to plasmatic microRNAs, identifying five hemolysis-related microRNAs, and to Secondary ElectroSpray Ionization-Mass Spectrometry data, in which case eight mass spectrometry signals were found to be able to discriminate the exhaled breath of breast cancer patients from that of healthy individuals.
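
A minimal sketch of the first ingredient, combining a t-test with an Anderson-Darling k-sample test to flag features that differ between classes, is given below. The input matrices, significance threshold and simple union rule are assumptions for illustration; the pipeline's actual decision rule and multiplicity handling are not reproduced here.

```python
from scipy import stats

def flag_features(X_case, X_ctrl, alpha=0.05):
    """Flag features (columns of NumPy arrays) that differ between two classes
    according to either a Welch t-test (mean shift) or an Anderson-Darling
    k-sample test (general distributional difference)."""
    selected = []
    for j in range(X_case.shape[1]):
        _, p_t = stats.ttest_ind(X_case[:, j], X_ctrl[:, j], equal_var=False)
        ad = stats.anderson_ksamp([X_case[:, j], X_ctrl[:, j]])
        if p_t < alpha or ad.significance_level < alpha:
            selected.append(j)
    return selected
```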

    Hyperspectral Imaging for Fine to Medium Scale Applications in Environmental Sciences

    The aim of the Special Issue “Hyperspectral Imaging for Fine to Medium Scale Applications in Environmental Sciences” was to present a selection of innovative studies using hyperspectral imaging (HSI) in different thematic fields. This intention reflects the technical developments of the last three decades, which have given HSI the capacity to provide spectrally, spatially and temporally detailed data, favoured by, e.g., hyperspectral snapshot technologies, miniaturized hyperspectral sensors and hyperspectral microscopy imaging. The present book comprises a suite of papers in various fields of environmental sciences—geology/mineral exploration, digital soil mapping, mapping and characterization of vegetation, and sensing of water bodies (including under-ice and underwater applications). In addition, there are two more methodologically/technically oriented contributions dealing with the optimized processing of UAV data and with the design and testing of a multi-channel optical receiver for ground-based applications. All in all, this compilation documents that HSI is a multi-faceted research topic and will remain so in the future.

    IDENTIFICATION OF CIRCULATING BIOMARKERS FOR THE EARLY DIAGNOSIS OF COLORECTAL CANCER: METHODOLOGICAL ASPECTS

    The present PhD research project stems from the need to investigate in depth some methodological aspects related to the identification, validation and application, in a routine clinical setting, of new non-invasive biomarkers for the (early) detection of cancer. Specifically, I investigated the statistical-methodological issues related to the identification and validation of molecular-based signatures detected with qPCR-based platforms, using colorectal cancer as the disease model (CRC-INT study). Colorectal cancer (CRC) is one of the major causes of cancer death in western countries1,2. More than 90% of CRC cases occur after the age of 50 years2 and, on the basis of the natural history of CRC progression and of the long time interval of progression from normal mucosa to invasive cancer, many efforts have been focused on the implementation of screening programs for CRC prevention and detection at an early stage, especially in this cohort of subjects. Currently adopted screening programs are based on colonoscopy (invasive, but still considered the gold standard for both detection and removal of lesions) or on tests that detect human haemoglobin in stool (i.e. the faecal occult blood test, FOBT/FIT). The latter are less invasive and easier to carry out, but show low sensitivity for polyp identification. In Italy, screening programs are based on FOBT/FIT, offered every 2 years to residents aged 50-69/74 years, or on flexible sigmoidoscopy (FS), offered to a single age cohort, generally at 58-60 years of age. In FIT programmes, quantitative haemoglobin analysis is performed in a centralized reference laboratory using a threshold of 100 ng/ml of faecal haemoglobin as the cut-off value for test positivity. People with a negative test are invited to repeat the test after 2 years. Subjects with a positive test (FIT+) are contacted in order to perform a total colonoscopy (TC) in referral centres during dedicated sessions. According to 2011-2012 data, colonoscopy is performed by 81% of FIT+ subjects; a diagnosis of carcinoma is formulated in 5% of FIT+ subjects and one of advanced adenoma in a further 25%. Subjects with non-cancerous lesions are enrolled in follow-up programs according to the colonoscopy output, whereas subjects with a screen-detected CRC undergo surgery (www.giscor.it).
MicroRNAs have been studied intensively in oncological research, and several studies have highlighted their easy detectability in plasma and serum3, suggesting a possible role as non-invasive biomarkers for the diagnosis and monitoring of human cancers. However, their detection and quantification can be influenced by haemolysis, i.e. the pink discoloration of serum or plasma due to the release of red blood cell content into these fluids4,5,6,7. In addition, several authors have highlighted the need for shared workflows for the entire process of miRNA identification (pre-analytical and analytical) as well as for the statistical analysis. Accordingly, we recently proposed a workflow that schematizes all the key phases involved in biomarker studies, from biomarker discovery to analytical and clinical validation, including the issues related to the development of operative procedures for their analysis8. The process usually begins with a discovery phase, followed by a validation phase and ultimately by the clinical application of the identified biomarker signature.
Two additional assay-oriented steps (assay optimization and assay development) can be introduced in the workflow, before and/or after the validation phase. From a statistical-methodological point of view, the main issues involved in this workflow are those related to (i) the data normalization of high-throughput qPCR data and (ii) the building and validation of miRNA-based signatures. Data normalization represents a crucial pre-processing step aimed both at removing experimentally induced variation and at distinguishing true biological changes. Inappropriate normalization strategies can affect the results of the subsequent analysis and, as a consequence, the conclusions drawn from them. In the miRNA context, there is as yet no verified and shared reference RNA in serum and plasma that can be used for data normalization. Several methods have been proposed to solve the problem of reference RNA selection. Currently, the most accepted and widely used method for data normalization of circulating miRNAs is the one proposed by Mestdagh9, based on the computation of the global mean of the expressed miRNAs. This method is valid if a large number of miRNAs are profiled, but it is almost never applicable in validation studies focused on a limited number of miRNAs. To overcome this issue, the same authors proposed searching for the set of reference miRNAs that best resembles the mean expression value of all miRNAs and using that set for data normalization. Starting from this, we developed a comprehensive data-driven normalization method for high-throughput qPCR data that identifies a small set of miRNAs to be used as references for data normalization in the subsequent validation studies10,11, using the results obtained with the global mean method as the reference. The algorithm was also implemented as an R function (the NqA algorithm). As concerns biomarker studies based on high-throughput assays, it should be considered that in such a setting a wide range of weak biomarkers is constantly identified, and a single molecular biomarker alone may not achieve satisfactory performance for patient classification; the linear combination of these biomarkers into a more powerful composite score can therefore represent a suitable approach to achieve higher diagnostic performance. As reported by Yan12, several methods are available, such as those based on the search for the alpha value that maximizes the AUC of the linear combination of p biomarkers (grid-search methods), or the conventional logistic regression model when the purpose is to guide healthcare professionals in their decision-making. Prediction model studies are usually organized into a model development phase and a model validation phase, or a combination of both. In the first phase, the aim is to derive a multivariable prediction model by selecting the most relevant predictors and combining them into a multivariable model, whereas model validation consists in evaluating the performance of the model on other data, not used for model development. As far as miRNAs are concerned, the process starts with the identification of the candidate miRNAs to be included in the initial multivariable model, as reported in (8). Once these candidates have been identified, the model development phase can start with the fitting of the initial multivariable model, taking into due consideration the number of events per variable (EPV).
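
Going back to the normalization step described above, a minimal sketch of the global-mean reference method is shown here. It assumes a samples-by-miRNA matrix of raw Cq values and a simple detection threshold; the NqA algorithm additionally performs the data-driven selection of a small reference set, which is not reproduced.

```python
import pandas as pd

def global_mean_normalize(cq: pd.DataFrame, detection_limit: float = 35.0) -> pd.DataFrame:
    """Global-mean normalization of qPCR data (rows = samples, columns = miRNAs).

    Cq values at or above the assumed detection limit are treated as undetected
    and excluded from the per-sample global mean.
    """
    expressed = cq.where(cq < detection_limit)   # undetected miRNAs become NaN
    global_mean = expressed.mean(axis=1)         # per-sample mean of expressed miRNAs
    return cq.sub(global_mean, axis=0)           # delta-Cq relative to the global mean

```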
Penalized regression models (PMLE)13, in which the beta values are obtained by maximising the penalized log-likelihood, can be used to prevent overfitting when a large number of covariates is present in a model with respect to the number of outcome events. Depending on the penalty term (i.e. the functional form of the constraints) and the tuning parameter (i.e. the amount of shrinkage applied to the coefficients), different penalized regression models can be fitted13. Once the initial model has been defined and fitted with the proper procedures, another important issue is the definition of the final model, which could correspond to the full initial model or to a reduced one. For standard regression models, backward elimination or forward selection can be used for this purpose, even if the latter does not provide a simultaneous assessment of the effects of all the candidates in the model14, whereas for PMLE a reduced model can be obtained using the R-square method15. An alternative to the standard stepwise/backward methods is all-subsets regression, which can discover combinations of variables that explain more variation in patient outcome than those obtained by the standard stepwise/backward algorithms8. This approach has several potential advantages, but also some drawbacks, including the possibility of selecting models which omit important predictors (i.e. pre-existing evidence). The performance of a developed model should then be assessed by evaluating discrimination and calibration. Discrimination refers to the ability of the model to distinguish individuals with the disease from those without the disease (measured with the c-index or the equivalent area under the ROC curve), whereas calibration refers to the agreement between the probability of developing/having the outcome of interest as estimated by the model and the observed outcome. In addition, it is important to consider that the performance of the developed model could be too optimistic because the same data are used for developing and testing the model14,17. In fact, when applied to new subjects, the performance of the model is generally lower than that observed in the development phase. Therefore, it is necessary to evaluate the performance of a developed model in new individuals before its implementation and application in clinical practice. Two different types of validation can be adopted, according to the design of the study and the data available: internal and external validation. Internal validation can be adopted when only one sample is available: approaches vary from a single split of the study data into a training and a testing set to repeated splitting of the data a large number of times (i.e. leave-one-out, k-fold and repeated random-split cross-validation). Alternatively, bootstrapping can be adopted when the development sample is relatively small and/or a large number of candidate predictors is under investigation14. The bootstrapping procedure allows the use of all the data for model development, and it also provides information about the level of model overfitting and optimism as well as about what can be expected when the model is applied to new individuals from the same theoretical source population. Even if these internal validation methods can correctly control overfitting and optimism, they cannot substitute for external validation, which consists in testing the model on new subjects17.
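
The bootstrap-based internal validation just described can be sketched as a Harrell-type optimism correction of the apparent AUC for a penalized (here ridge) logistic model. The data, fixed penalty and number of resamples are assumptions for illustration; this is not the exact procedure used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(X, y, n_boot=200, C=1.0, seed=0):
    """Bootstrap estimate of optimism for a ridge-penalized logistic model,
    returning an optimism-corrected AUC (X, y: NumPy arrays)."""
    rng = np.random.default_rng(seed)
    apparent = roc_auc_score(
        y, LogisticRegression(penalty="l2", C=C, max_iter=5000).fit(X, y).predict_proba(X)[:, 1]
    )
    optimism = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # bootstrap resample of the development data
        Xb, yb = X[idx], y[idx]
        if len(np.unique(yb)) < 2:
            continue                             # skip resamples without both classes
        fit = LogisticRegression(penalty="l2", C=C, max_iter=5000).fit(Xb, yb)
        auc_boot = roc_auc_score(yb, fit.predict_proba(Xb)[:, 1])  # apparent AUC in the resample
        auc_orig = roc_auc_score(y, fit.predict_proba(X)[:, 1])    # same model tested on the original data
        optimism.append(auc_boot - auc_orig)
    return apparent - np.mean(optimism)

```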
In external validation, the objective is to apply the original model to new data and to quantify the model's predictive performance, without any re-estimation of the parameters included in the model. The new set of individuals may come from the same institution but be recruited in a different period (temporal validation), or they may come from other institutions/contexts (geographical validation)17. When a poorer performance of a prediction model is obtained on new individuals, instead of developing a new model (sometimes by repeating the entire selection of predictors), a valid alternative is to update the existing model by adjusting (or recalibrating) it to the local circumstances or setting of the validation sample at hand17,18. In this way, the updated model combines the information captured in the original model on the development dataset with information from the new individuals, theoretically improving transportability to other individuals. Moreover, to improve the performance of a clinical prediction model, new marker(s) can be incorporated into the existing model19. The methodology described above was applied to the CRC-INT study, aimed at identifying plasma circulating miRNAs to be used as biomarkers for the early detection of CRC lesions, using FIT-positive individuals as the target population. The study included three cohorts of FIT+ subjects: discovery (DC), internal validation (IVC) and external validation (EVC). Blood was collected before colonoscopy, and circulating miRNAs extracted from plasma were analyzed using PCR assays. The principal aim of the discovery phase was to investigate the suitability of searching for miRNAs in plasma from FIT+ individuals as well as to identify a set of reference and candidate miRNAs to be investigated in depth in the subsequent phases focused on prospectively enrolled subjects. During the discovery phase, the expression levels of human miRNAs were measured on a cohort of already available plasma samples from FIT+ individuals who had undergone a screening colonoscopy at INT. As output, a subset of reference miRNAs was identified using the NqA R function, and a set of candidate miRNAs was identified as showing significantly different expression in subjects with proliferative lesions vs subjects without lesions, or in subjects with a specific proliferative lesion. Based on these results, a custom microfluidic card including the candidate and reference miRNAs was designed to be used in the following internal validation phase. Before moving to the custom-made assay, we performed a technical validation phase (on the same DC samples) aimed at evaluating the level of reproducibility between the involved assays (i.e. the high-throughput assay used in the discovery phase and the custom-made one to be used in the IVC/EVC20). In addition, an ad-hoc in-vitro controlled haemolysis experiment was implemented by artificially introducing different percentages of red blood cells (RBCs) into a haemolysis-free plasma sample21. Results showed that miRNAs known in the literature to be haemolysis-related were confirmed in our experiment as influenced by haemolysis, whereas all our reference miRNAs and 70% of candidate miRNAs were not influenced by haemolysis. Candidate miRNAs showing relevant changes with respect to the haemolysis-free plasma sample were not considered in the subsequent statistical analysis.
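
Model updating by recalibration, as described above, can be sketched as refitting only an intercept and a calibration slope on the original model's linear predictor in the new cohort. This is an illustrative sketch with assumed inputs (using statsmodels), not the updating strategy adopted in the study.

```python
import numpy as np
import statsmodels.api as sm

def recalibrate(lp_new: np.ndarray, y_new: np.ndarray):
    """Logistic recalibration: keep the original linear predictor (lp_new) fixed
    and re-estimate only the intercept and calibration slope on the new cohort."""
    X = sm.add_constant(lp_new)                  # columns: intercept, linear predictor
    fit = sm.GLM(y_new, X, family=sm.families.Binomial()).fit()
    intercept, slope = fit.params                # slope < 1 suggests the original model was overfitted
    return intercept, slope

```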
In addition, by taking advantage of the availability, for each contaminated tube of the haemolysis experiment, of the spectrophotometrically measured haemolysis indexes and of the known RBC concentration, a calibration curve was generated with the aim of estimating the unknown percentage of RBCs in new plasma samples. For the analysis of the IVC data, we adopted an approach based on all-subsets analysis and the PMLE method to estimate the miRNA-based signatures, in order to take into consideration the peculiarities of the scenario under investigation, such as many weak biomarkers measured in plasma using platforms developed for research purposes only. We then performed a signature selection (i.e. EPV > 3, significant AUC and finite shrinkage value) in order to select on the IVC only a few signatures to be tested on the EVC. The latter includes FIT+ subjects enrolled at our Institute and also at other hospitals joining the CRC-screening program of the Local Health Authority of Milan. Statistical analysis of the EVC cohort is ongoing: preliminary results confirmed the predictive capability of some of the identified signatures, albeit with lower performance with respect to that obtained on the IVC. Further analyses will be performed to evaluate a possible effect of the variable “Hospital” as well as to apply different model validation and model updating approaches. The added value of the developed miRNA-based signatures to pre-existing prediction models, and the gain brought by the introduction of the miR-test into the existing CRC screening diagnostic workflow, will eventually be evaluated.
References
1 Jemal, A., Bray, F., Center, M. M., Ferlay, J., Ward, E., & Forman, D. (2011). Global cancer statistics. CA: A Cancer Journal for Clinicians, 61(2), 69-90.
2 Mazeh, H., Mizrahi, I., Ilyayev, N., Halle, D., Brucher, B., Bilchik, A., et al. (2013). The diagnostic and prognostic role of microRNA in colorectal cancer - a comprehensive review. Journal of Cancer, 4(3), 281-295.
3 Chen, X., Ba, Y., Ma, L., Cai, X., Yin, Y., Wang, K., et al. (2008). Characterization of microRNAs in serum: A novel class of biomarkers for diagnosis of cancer and other diseases. Cell Research, 18(10), 997-1006.
4 Kirschner, M. B., Kao, S. C., Edelman, J. J., Armstrong, N. J., Vallely, M. P., van Zandwijk, N., et al. (2011). Haemolysis during sample preparation alters microRNA content of plasma. PloS One, 6(9), e24145.
5 Pritchard, C. C., Kroh, E., Wood, B., Arroyo, J. D., Dougherty, K. J., Miyaji, M. M., et al. (2012). Blood cell origin of circulating microRNAs: A cautionary note for cancer biomarker studies. Cancer Prevention Research (Philadelphia, Pa.), 5(3), 492-497.
6 Kirschner, M. B., Edelman, J. J., Kao, S. C., Vallely, M. P., van Zandwijk, N., & Reid, G. (2013). The impact of hemolysis on cell-free microRNA biomarkers. Frontiers in Genetics, 4, 94.
7 Yamada, A., Cox, M. A., Gaffney, K. A., Moreland, A., Boland, C. R., & Goel, A. (2014). Technical factors involved in the measurement of circulating microRNA biomarkers for the detection of colorectal neoplasia. PloS One, 9(11), e112481.
8 Verderio, P., Bottelli, S., Pizzamiglio, S., & Ciniselli, C. M. (2016). Developing miRNA signatures: A multivariate prospective. British Journal of Cancer, 115(1), 1-4.
9 Mestdagh, P., Van Vlierberghe, P., De Weer, A., Muth, D., Westermann, F., Speleman, F., et al. (2009). A novel and universal method for microRNA RT-qPCR data normalization. Genome Biology, 10(6), R64.
10 Pizzamiglio, S., Bottelli, S., Ciniselli, C. M., Zanutto, S., Bertan, C., Gariboldi, M., et al. (2014). A normalization strategy for the analysis of plasma microRNA qPCR data in colorectal cancer. International Journal of Cancer, 134(8), 2016-2018.
11 Verderio, P., Bottelli, S., Ciniselli, C. M., Pierotti, M. A., Gariboldi, M., & Pizzamiglio, S. (2014). NqA: An R-based algorithm for the normalization and analysis of microRNA quantitative real-time polymerase chain reaction data. Analytical Biochemistry, 461, 7-9.
12 Yan, L., Tian, L., & Liu, S. (2015). Combining large number of weak biomarkers based on AUC. Statistics in Medicine, 34(29), 3811-3830.
13 Pavlou, M., Ambler, G., Seaman, S., De Iorio, M., & Omar, R. Z. (2016). Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events. Statistics in Medicine, 35(7), 1159-1177.
14 Moons, K. G., Kengne, A. P., Woodward, M., Royston, P., Vergouwe, Y., Altman, D. G., et al. (2012). Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart (British Cardiac Society), 98(9), 683-690.
15 Moons, K. G., Donders, A. R., Steyerberg, E. W., & Harrell, F. E. (2004). Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: A clinical example. Journal of Clinical Epidemiology, 57(12), 1262-1270.
16 Altman, D. G., & Royston, P. (2000). What do we mean by validating a prognostic model? Statistics in Medicine, 19(4), 453-473.
17 Moons, K. G., Kengne, A. P., Grobbee, D. E., Royston, P., Vergouwe, Y., Altman, D. G., et al. (2012). Risk prediction models: II. External validation, model updating, and impact assessment. Heart (British Cardiac Society), 98(9), 691-698.
18 Vergouwe, Y., Nieboer, D., Oostenbrink, R., Debray, T. P., Murray, G. D., Kattan, M. W., et al. (2016). A closed testing procedure to select an appropriate method for updating prediction models. Statistics in Medicine.
19 Nieboer, D., Vergouwe, Y., Ankerst, D. P., Roobol, M. J., & Steyerberg, E. W. (2016). Improving prediction models with new markers: A comparison of updating strategies. BMC Medical Research Methodology, 16(1), 128.
20 Verderio, P., Bottelli, S., Ciniselli, C. M., Pierotti, M. A., Zanutto, S., Gariboldi, M., et al. (2015). Moving from discovery to validation in circulating microRNA research. The International Journal of Biological Markers, 30(2), e258-e261.
21 Pizzamiglio, S., Zanutto, S., Ciniselli, C. M., Belfiore, A., Bottelli, S., Gariboldi, M., et al. (2017). A methodological procedure for evaluating the impact of hemolysis on circulating microRNAs. Oncology Letters, 13(1), 315-320.

    Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK): Explanation and Elaboration

    The REMARK “explanation and elaboration” guideline, by Doug Altman and colleagues, provides a detailed reference for authors on important issues to consider when designing, conducting, and analyzing tumor marker prognostic studies.

    Gaining Insight into Determinants of Physical Activity using Bayesian Network Learning


    On the Formation and Economic Implications of Subjective Beliefs and Individual Preferences

    The conceptual framework of neoclassical economics posits that individual decision-making processes can be represented as maximization of some objective function. In this framework, people's goals and desires are expressed through the means of preferences over outcomes; in addition, in choosing according to these objectives, people employ subjective beliefs about the likelihood of unknown states of the world. For instance, in the subjective expected utility paradigm, people linearly combine their probabilistic beliefs and preferences over outcomes to form an expected utility function. Much of the parsimony and power of theoretical economic analysis stems from the striking generality and simplicity of this framework. At the same time, the crucial importance of preferences and beliefs in our conceptual apparatus in combination with the heterogeneity in choice behavior that is observed across many economic contexts raises a number of empirical questions. For example, how much heterogeneity do we observe in core preference or belief dimensions that are relevant for a broad range of economic behaviors? If such preferences and beliefs exhibit heterogeneity, then what are the origins of this heterogeneity? How do beliefs and preferences form to begin with? And how does variation in beliefs and preferences translate into economically important heterogeneity in choice behavior? This thesis is organized around these broad questions and hence seeks to contribute to the goal of providing an improved empirical understanding of the foundations and economic implications of individual decision-making processes. The content of this work reflects the deep belief that understanding and conceptualizing decision-making requires economists to embrace ideas from a broad range of fields. Accordingly, this thesis draws insights and techniques from the literatures on behavioral and experimental economics, cultural economics, household finance, comparative development, cognitive psychology, and anthropology. Chapters 1 through 3 combine methods from experimental economics, household finance, and cognitive psychology to investigate the effects of bounded rationality on the formation and explanatory power of subjective beliefs. Chapters 4 through 6 use tools from cultural economics, anthropology, and comparative development to study the cross-country variation in economic preferences as well as its origins and implications. The formation of beliefs about payoff-relevant states of the world crucially hinges on an adequate processing of incoming information. However, oftentimes, the information people receive is rather complex in nature. Chapters 1 and 2 investigate how boundedly rational people form beliefs when their information is subject to sampling biases, i.e., when the information pieces people receive are either not mutually independent or systematically selected. Chapter 1 is motivated by Akerlof and Shiller's popular narrative that from time to time some individuals or even entire markets undergo excessive belief swings, which refers to the idea that sometimes people are overly optimistic and sometimes overly pessimistic over, say, the future development of the stock market. In particular, Akerlof and Shiller argue that such "exuberance" or excessive pessimism might be driven by the pervasive "telling and re-telling of stories". 
In fact, many real information structures such as the news media generate correlated rather than mutually independent signals, and hence give rise to severe double-counting problems. However, clean evidence on how people form beliefs in correlated information environments is missing. Chapter 1, which is joint work with Florian Zimmermann, provides clean experimental evidence that many people neglect such double-counting problems in the updating process, so that beliefs are excessively sensitive to well-connected information sources and follow an overshooting pattern. In addition, in an experimental asset market, correlation neglect not only drives overoptimism and overpessimism at the individual level, but also gives rise to a predictable pattern of over- and underpricing. Finally, investigating the mechanisms underlying the strong heterogeneity in the presence of the bias, a series of treatment manipulations reveals that many people struggle with identifying double-counting problems in the first place, so that exogenous shifts in subjects' focus have large effects on beliefs. Chapter 2 takes as its starting point the public debate about increased political polarization in the United States, which refers to the fact that political beliefs tend to drift apart over time across social and political groups. Popular narratives by, e.g., Sunstein, Bishop, and Pariser posit that such polarization is driven by people selecting into environments in which they are predominantly exposed to information that confirms their prior beliefs. This pattern introduces a selection problem into the belief formation process, which may result in polarization if people fail to take the non-representativeness of their signals into account. However, again, we do not have meaningful evidence on how people actually form beliefs in such "homophilous" environments. Thus, Chapter 2 shows experimentally that many people do not take into account how their own prior decisions shape their informational environment, but rather largely base their views on their local information sample. In consequence, beliefs excessively depend on people's priors and tend to be too extreme, akin to the concerns about "echo chambers" driving irrational belief polarization across social groups. Strikingly, the distribution of individuals' naivete follows a pronounced bimodal structure - people either fully account for the selection problem or do not adjust for it at all. Allowing for interaction between these heterogeneous updating types induces little learning: neither the endogenous acquisition of advice nor exogenously induced dissent leads to a convergence of beliefs across types, suggesting that the belief heterogeneity induced by selected information may persist over time. Finally, the paper provides evidence that selection neglect is conceptually closely related to correlation neglect in that both cognitive biases appear to be driven by selective attentional patterns. Taken together, chapters 1 and 2 show that many people struggle with processing information that is subject to sampling issues. What is more, the chapters show that these biases might share common cognitive foundations, hence providing hope for a unified attention-based theory of boundedly rational belief formation. While laboratory experimental techniques are a great tool to study the formation of beliefs, they cannot shed light on the relationship between beliefs and economically important choices.
In essentially all economic models, beliefs mechanically map into choice behavior. However, it is not evident that people's beliefs play the same role in generating observed behavior across heterogeneous individuals: while some people's decision process might be well-approximated by the belief and preference-driven choice rules envisioned by economic models, other people might use, e.g., simple rules of thumb instead, implying that their beliefs should be largely irrelevant for their choices. That is, bounded rationality might not only affect the formation of beliefs, but also the mapping from beliefs to choices. In Chapter 3, Tilman Drerup, Hans-Martin von Gaudecker, and I take up this conjecture in the context of measurement error problems in household finance: while subjective expectations are important primitives in models of portfolio choice, their direct measurement often yields imprecise and inconsistent measures, which is typically treated as a pure measurement error problem. In contrast to this perspective, we argue that individual-level variation in the precision of subjective expectations measures can actually be productively exploited to gain insights into whether economic models of portfolio choice provide an adequate representation of individual decision processes. Using a novel dataset on experimentally measured subjective stock market expectations and real stock market decisions collected from a large probability sample of the Dutch population, we estimate a semiparametric double index model to explore this conjecture. Our results show that investment decisions exhibit little variation in economic model primitives when individuals provide error-ridden belief statements. In contrast, they predict strong variation in investment decisions for individuals who report precise expectation measures. These findings indicate that the degree of precision in expectations data provides useful information to uncover heterogeneity in choice behavior, and that boundedly rational beliefs need not necessarily map into irrational choices. In the standard neoclassical framework, people's beliefs only serve the purpose of achieving a given set of goals. In many applications of economic interest, these goals are well-characterized by a small set of preferences, i.e., risk aversion, patience, and social preferences. Prior research has shown that these preferences vary systematically in the population, and that they are broadly predictive of those behaviors that economic theory supposes them to predict. At the same time, this empirical evidence often stems from fairly special samples in a given country, hence precluding an analysis of how general the variation and predictive power in preferences is across cultural, economic, and institutional backgrounds. In addition, it is conceivable that preferences vary not just at an individual level, but also across entire populations - if so, what are the deep historical or cultural origins of this variation, and what are its (aggregate) economic implications? Chapters 4 through 6 take up these questions by presenting and analyzing the Global Preference Survey (GPS), a novel globally representative dataset on risk and time preferences, positive and negative reciprocity, altruism, and trust for 80,000 individuals, drawn as representative samples from 76 countries around the world, representing 90 percent of both the world's population and global income.
In joint work with Armin Falk, Anke Becker, Thomas Dohmen, David Huffman, and Uwe Sunde, Chapter 4 presents the GPS data and shows that the global distribution of preferences exhibits substantial variation across countries, which is partly systematic: certain preferences appear in combination, and follow distinct economic, institutional, and geographic patterns. The heterogeneity in preferences across individuals is even more pronounced and varies systematically with age, gender, and cognitive ability. Around the world, the preference measures are predictive of a wide range of individual-level behaviors including savings and schooling decisions, labor market and health choices, prosocial behaviors, and family structure. We also shed light on the cultural origins of preference variation around the globe using data on language structure. The magnitude of the cross-country variation in preferences is striking and raises the immediate question of what brought it about. Chapter 5 presents joint work with Anke Becker and Armin Falk in which we use the GPS to show that the migratory movements of our early ancestors thousands of years ago have left a footprint in the contemporary cross-country distributions of preferences over risk and social interactions. Across a wide range of regression specifications, differences in preferences between populations are significantly increasing in the length of time elapsed since the respective groups shared common ancestors. This result obtains for risk aversion, altruism, positive reciprocity, and trust, and holds for various proxies for the structure and timing of historical population breakups, including genetic and linguistic data or predicted measures of migratory distance. In addition, country-level preference endowments are non-linearly associated with migratory distance from East Africa, i.e., genetic diversity. In combination with the relationships between language structure and preferences established in Chapter 4, these results point to the importance of very long-run events for understanding the global distribution of some of the key economic traits. Given these findings on the very deep roots of the cross-country variation in preferences, an interesting - and conceptually different - question is whether such country-level preference profiles might have systematic aggregate economic implications. Indeed, according to standard dynamic choice theories, patience is a key driving factor behind the accumulation of productive resources and hence ultimately of income not just at an individual, but also at a macroeconomic level. Using the GPS data on patience, Chapter 6 (joint work with Thomas Dohmen, Armin Falk, David Huffman, and Uwe Sunde) investigates the empirical relevance of this hypothesis in the context of a micro-founded development framework. Around the world, patient people invest more into human and physical capital and have higher incomes. At the macroeconomic level, we establish a significant reduced-form relationship between patience and contemporary income as well as medium- and long-run growth rates, with patience explaining a substantial fraction of development differences across countries and subnational regions. In line with a conceptual framework in which patience drives income through the accumulation of productive resources, average patience also strongly correlates with aggregate human and physical capital accumulation as well as investments into productivity. Taken together, this thesis has a number of unifying themes and insights. 
First, consistent with the vast heterogeneity in observed choices, people exhibit a large amount of variation in beliefs and preferences, and in how they combine these into choice rules. Second, at least part of this heterogeneity is systematic and has identifiable sources: preferences over risk, time, and social interactions appear to have very deep historical or cultural origins, but also systematically vary with individual characteristics; belief heterogeneity, on the other hand, is partly driven by bounded rationality and its systematic, predictable effects on information processing. Third, and finally, this heterogeneity in beliefs and preferences is likely to have real economic implications: across cultural and institutional backgrounds, preferences correlate with the types of behaviors that economic models envision them to, not just across individuals, but also at the macroeconomic level; subjective beliefs are predictive of behavior, too, albeit with the twist that certain subgroups of the population do not appear to entertain stable belief distributions to begin with. In sum, (I believe that) much insight is to be gained from further exploring these fascinating topics.