    On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law

    Out-of-distribution (OOD) testing is increasingly popular for evaluating a machine learning system's ability to generalize beyond the biases of a training set. OOD benchmarks are designed to present a different joint distribution of data and labels between training and test time. VQA-CP has become the standard OOD benchmark for visual question answering, but we discovered three troubling practices in its current use. First, most published methods rely on explicit knowledge of the construction of the OOD splits. They often rely on "inverting" the distribution of labels, e.g. answering mostly 'yes' when the common training answer is 'no'. Second, the OOD test set is used for model selection. Third, a model's in-domain performance is assessed after retraining it on in-domain splits (VQA v2) that exhibit a more balanced distribution of labels. These three practices defeat the objective of evaluating generalization and call into question the value of methods specifically designed for this dataset. We show that embarrassingly simple methods, including one that generates answers at random, surpass the state of the art on some question types. We provide short- and long-term solutions to avoid these pitfalls and realize the benefits of OOD evaluation.
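
    The "inverting" shortcut described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the answer counts are invented, the function names are hypothetical, and the code is not the paper's method or the real VQA-CP data. It only shows why a baseline that knows the test labels are skewed opposite to training can score well without any generalization.

```python
from collections import Counter

def majority_answer(train_answers):
    """Baseline that always predicts the most common training answer."""
    return Counter(train_answers).most_common(1)[0][0]

def inverted_answer(train_answers):
    """Shortcut that always predicts the LEAST common training answer,
    exploiting the assumption that the test split inverts the label prior."""
    counts = Counter(train_answers)
    return min(counts, key=counts.get)

def accuracy(constant_answer, test_answers):
    """Fraction of test items matched by a single constant answer."""
    return sum(a == constant_answer for a in test_answers) / len(test_answers)

# Invented toy data for one question type (not real benchmark counts):
# training is mostly 'no', the OOD test split is mostly 'yes'.
train = ["no"] * 80 + ["yes"] * 20
test = ["no"] * 20 + ["yes"] * 80

acc_majority = accuracy(majority_answer(train), test)
acc_inverted = accuracy(inverted_answer(train), test)
print(f"majority-prior baseline: {acc_majority:.2f}")
print(f"inverted-prior shortcut: {acc_inverted:.2f}")
```

    On this toy split the inverted-prior shortcut outscores the majority baseline without looking at any image or question, which is the kind of gain the abstract argues says nothing about generalization.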

    Results readiness in social protection and labor operations

    The main focus of the social protection and labor portfolio is on strengthening clients' institutional capacity in the design and implementation of programs, but projects are not well equipped to track progress in this area. Correspondingly, there is a need to strengthen approaches to measuring and monitoring a 'missing middle' of service delivery, precisely those areas for which counterpart institutions are responsible during the course of a project. In particular, better measures of the primary functions of social protection and labor agencies are needed, such as identifying and enrolling beneficiaries, targeting, payment systems, fraud and error control, performance monitoring of service delivery providers, responsiveness to citizens, transparency, efficiency, management information systems, and monitoring and evaluation systems. New World Bank initiatives, particularly standard core indicators by sector and the introduction of results-based investment lending, call for substantial improvements in the use of monitoring and evaluation (M&E). Impact evaluations are included in about half of projects and should continue to be used selectively and strategically, particularly when the program is innovative, replicable and/or scalable to reach a broader set of beneficiaries, addresses a knowledge gap, and is likely to have a substantial policy impact. Structuring evaluations around core themes with common outcome measures is fundamental to building a global knowledge base on development effectiveness.
    Keywords: Poverty Monitoring & Analysis, Poverty and Social Impact Analysis, E-Business, Safety Nets and Transfers, Housing & Human Habitats

    The role of macroeconomic policies in the global crisis

    This paper argues that the lack of timely and decisive policy action to correct domestic and external imbalances contributed crucially to the build-up of financial excesses that led to the financial crisis and the Great Recession. We focus on 2002-07 and perform a number of counterfactual simulations to investigate two central elements of the story, namely: (a) an over-expansionary US monetary policy and the absence of effective macro-prudential supervision, which permitted a prolonged expansion of debt-financed consumer spending; (b) the decision of China and other emerging countries to pursue an export-led growth strategy supported by pegging their currencies to the US dollar, resulting in a huge build-up of their official reserves, in conjunction with sluggish domestic demand in surplus advanced economies characterized by low potential output growth. The results of the simulations lend support to the view that if substantial, globally coordinated demand rebalancing had been undertaken in a timely manner, the macroeconomic and financial imbalances would not have accumulated to the extent that they did and the financial turmoil might have had less drastic global consequences.
    Keywords: global imbalances, financial crisis, monetary policy, macroprudential regulation, structural reforms

    Evaluation in the practice of development

    Knowledge about development effectiveness is constrained by two factors. First, the project staff in governments and international agencies who decide how much to invest in research on specific interventions are often not well informed about the returns to rigorous evaluation and (even when they are) cannot be expected to take full account of the external benefits to others from new knowledge. This leads to under-investment in evaluative research. Second, while standard methods of impact evaluation are useful, they often leave many questions about development effectiveness unanswered. The paper proposes ten steps for making evaluations more relevant to the needs of practitioners. It is argued that more attention needs to be given to identifying policy-relevant questions (including the case for intervention); that a broader approach should be taken to the problems of internal validity; and that the problems of external validity (including scaling up) merit more attention.
    Keywords: Poverty Monitoring & Analysis, Science Education, Scientific Research & Science Parks, Population Policies, Tertiary Education

    Reviving project appraisal at the World Bank

    The authors focus on two broad questions: 1) what is the proper role for project evaluation in today's world, where countries have reduced major economic distortions and are reconsidering the role of the state? and 2) besides project evaluation, how else can economic analysis ensure high-quality projects? The authors argue for a shift in the emphasis of project evaluation away from a concern with precise rate of return calculations to a broader examination of the rationale for public provision. In this context, three areas critical for proper project appraisal are the counterfactual private sector supply response, the fiscal impact, and the fungibility of lending. (1) Counterfactual private sector supply response. Any type of cost-benefit analysis - be it in the public or the private sector - requires the project evaluator to specify the counterfactual: what would the world have looked like in the absence of the project? Since World Bank projects are public sector projects, the relevant counterfactual involves assessing what the private sector would have otherwise provided, and the relevant magnitude for evaluation purposes is the net contribution of the public project. Failure to consider explicitly the private sector counterfactual during evaluation biases the lending mix of the Bank away from projects with strong public good characteristics toward projects with private good characteristics. (2) Fiscal impact. Applying the private sector counterfactual would lead the Bank to undertake projects with a reasonable case for public intervention, such as basic infrastructure, primary education, and rural health. These projects typically share the characteristics that costs are borne by the public sector while benefits are enjoyed by the private sector. But in the absence of nondistortionary, lump sum taxes, there is likely to be a positive marginal cost of taxation and a premium on public income. Since the Bank has not used such a premium and treats public costs and private benefits equally, it has systematically overestimated the net benefits of these projects. (3) Fungibility of lending. Project-specific appraisal can at best assess only the rate of return and the acceptability of the project being appraised. This limitation is problematic because the project might have been undertaken even without Bank financing. If that is the case, the Bank is actually financing some other project - one not subject to appraisal by the Bank - that would not have been in the investment program without Bank financing. This problem arises because financial resources are fungible to some extent. One way to alleviate this concern is to conduct public expenditure reviews before embarking on the appraisal and financing of specific projects. Furthermore, financing a portion of the government's sectoral investment program may be more effective than project-specific lending.
    Keywords: Decentralization, Health Economics & Finance, Public Health Promotion, Poverty Monitoring & Analysis, Environmental Economics & Policies, Economic Theory & Research, Health Monitoring & Evaluation