
    What's the Difference? How Foundation Trustees View Evaluation

    Trustee Evaluation Toolkit. Trustees care deeply about impact. Understanding results is part of their fiduciary duty. As foundations strive to improve performance, advance accountability and share knowledge, their desire for evaluation -- reliable data on organizational effectiveness -- grows. Based on discussions with trustees, we've heard that current evaluation approaches don't always generate useful information. In too many cases, foundation evaluation practices don't align with trustee needs. Trustees across the United States believe there are ways to improve how we determine the effectiveness of social investments. FSG Social Impact Advisors, with funding from the James Irvine Foundation, interviewed dozens of foundation trustees, CEOs and evaluation experts to uncover critical issues and exciting ideas related to evaluation. This "toolkit" shares highlights from these interviews and explores innovative approaches.

    DiPerF: an automated DIstributed PERformance testing Framework

    We present DiPerF, a distributed performance testing framework aimed at simplifying and automating service performance evaluation. DiPerF coordinates a pool of machines that test a target service, collects and aggregates performance metrics, and generates performance statistics. The aggregated data provide information on service throughput, on service "fairness" when serving multiple clients concurrently, and on the impact of network latency on service performance. Furthermore, using these data, it is possible to build predictive models that estimate service performance for a given service load. We have tested DiPerF on 100+ machines on two testbeds, Grid3 and PlanetLab, and explored the performance of job submission services (pre-WS GRAM and WS GRAM) included with Globus Toolkit 3.2.

    Comment: 8 pages, 8 figures, will appear in IEEE/ACM Grid2004, November 2004
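
    The abstract names the statistics DiPerF derives from the collected data (throughput, fairness across concurrent clients, and a load-to-performance model) without showing how they are computed. The following is a minimal illustrative sketch of producing such statistics from per-client samples; it is not DiPerF's code, and the function names (throughput, jain_fairness, fit_load_model) and the sample format are assumptions for illustration.

    # Minimal sketch (not DiPerF itself): turning per-client test results into the
    # kinds of aggregate statistics the paper describes -- throughput, a fairness
    # measure, and a simple load -> response-time model. All names are illustrative.
    from statistics import mean

    def throughput(samples):
        """Completed requests per second over the measured interval.
        samples: list of (start_time, response_time) tuples from all clients."""
        if not samples:
            return 0.0
        start = min(t for t, _ in samples)
        end = max(t + rt for t, rt in samples)
        return len(samples) / (end - start) if end > start else 0.0

    def jain_fairness(per_client_counts):
        """Jain's fairness index over the number of requests each client completed:
        1.0 means perfectly even service, 1/n means one client got everything."""
        counts = list(per_client_counts.values())
        n, total = len(counts), sum(counts)
        return (total * total) / (n * sum(c * c for c in counts)) if total else 0.0

    def fit_load_model(load_to_response):
        """Least-squares line response_time ~ a * load + b, usable as a crude
        predictor of service performance at a given load."""
        xs, ys = zip(*load_to_response.items())
        mx, my = mean(xs), mean(ys)
        a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
        return a, my - a * mx

    # Example: samples gathered by a coordinator from three clients.
    samples = [(0.0, 0.5), (0.2, 0.6), (1.0, 0.4), (1.1, 0.7), (2.0, 0.5)]
    per_client = {"client1": 2, "client2": 2, "client3": 1}
    print(throughput(samples), jain_fairness(per_client), fit_load_model({1: 0.4, 5: 0.6, 10: 0.9}))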

    MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models

    We introduce MultiMedEval, an open-source toolkit for fair and reproducible evaluation of large medical vision-language models (VLMs). MultiMedEval comprehensively assesses the models' performance on a broad array of six multi-modal tasks, conducted over 23 datasets and spanning over 11 medical domains. The chosen tasks and performance metrics are based on their widespread adoption in the community and their diversity, ensuring a thorough evaluation of the model's overall generalizability. We open-source a Python toolkit (github.com/corentin-ryr/MultiMedEval) with a simple interface and setup process, enabling the evaluation of any VLM in just a few lines of code. Our goal is to simplify the intricate landscape of VLM evaluation, thus promoting fair and uniform benchmarking of future models.

    Comment: Under review at MIDL 2024
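
    The abstract's claim is that any VLM can be evaluated in a few lines of code by plugging the model in behind a simple interface. Below is a hedged sketch of that general pattern -- a callable "batcher" wrapping the model, and a harness that loops over per-task datasets and scores the outputs. It is not the actual MultiMedEval API (see the linked repository for the real interface); the names evaluate, Batcher and my_vlm_batcher, and the sample dictionary keys, are hypothetical.

    # Illustrative sketch of the evaluation pattern described in the abstract.
    # This is NOT the MultiMedEval API -- see github.com/corentin-ryr/MultiMedEval.
    from typing import Callable, Dict, List

    # A "batcher": takes a batch of samples (image + question), returns text answers.
    Batcher = Callable[[List[dict]], List[str]]

    def evaluate(batcher: Batcher, tasks: Dict[str, List[dict]], batch_size: int = 8) -> Dict[str, float]:
        """Run the wrapped model over every task's samples and report per-task accuracy.
        Each sample is assumed to be a dict with 'image', 'question', and 'answer' keys."""
        results = {}
        for task_name, samples in tasks.items():
            correct = 0
            for i in range(0, len(samples), batch_size):
                batch = samples[i:i + batch_size]
                predictions = batcher(batch)
                correct += sum(p.strip().lower() == s["answer"].strip().lower()
                               for p, s in zip(predictions, batch))
            results[task_name] = correct / len(samples) if samples else 0.0
        return results

    # Usage: any VLM can be plugged in behind the batcher callable.
    def my_vlm_batcher(batch: List[dict]) -> List[str]:
        return ["yes" for _ in batch]  # placeholder standing in for a real model

    tasks = {"vqa-demo": [{"image": None, "question": "Is there an effusion?", "answer": "yes"}]}
    print(evaluate(my_vlm_batcher, tasks))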