6 research outputs found

    Reliability of MRI interpretation of Discoid Lateral Meniscus: A multicenter study

    Get PDF
    BACKGROUND: Discoid lateral meniscus (DLM) has a varied and complex morphology that can be challenging to assess and treat. Preoperative magnetic resonance imaging (MRI) is frequently used for diagnosis and surgical planning; however, it is not known whether surgeons are reliable and accurate in their interpretation of MRI findings when defining the pathomorphology of DLM. HYPOTHESIS: Surgeons experienced in treating DLM are able to reliably interpret DLM pathology using MRI. STUDY DESIGN: Cohort study (diagnosis); Level of evidence, 3. METHODS: Knee MRI scans from 44 patients (45 knees) were selected from a pool of surgically treated patients with DLM. Five reviewers (fellowship-trained pediatric sports medicine surgeons) performed independent review of each MRI scan using the PRiSM Discoid Meniscus Classification. Inter- and intraobserver reliability of the rating factors, both primary (width, height, presence of peripheral instability or tear) and secondary (location of instability or tear, tear type), was assessed using the Fleiss κ coefficient, designed for multiple readers with nominal variables (fair reliability, 0.21-0.40; moderate, 0.41-0.60; substantial, 0.61-0.80; excellent, 0.81-1.00). Reliability is reported as κ (95% CI). RESULTS: Interobserver reliability in assessing most primary and secondary characteristics ranged from substantial (meniscal width) to moderate (peripheral instability, anterior instability, posterior instability, and posterior tear). Intraobserver reliability for most characteristics ranged from substantial (peripheral instability, presence of tear, anterior instability, posterior instability, and posterior tear) to moderate (meniscal width, anterior tear, and tear type). Notable exceptions were presence of tear, anterior tear, and tear type, all of which had fair interobserver reliability. Height had poor interobserver reliability and fair intraobserver reliability.
    CONCLUSION: Orthopaedic surgeons reliably interpret MRI scans using the PRiSM Discoid Meniscus Classification for the majority of DLM characteristics but vary in their assessment of height and of the presence and type of tear. MRI evaluation may be helpful for diagnosing a discoid meniscus by width and for identifying the presence of instability, two major factors in the decision to proceed with surgery. Arthroscopic evaluation should be used in conjunction with MRI findings for a complete DLM diagnosis.
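    The Fleiss κ statistic used in the study above is computed from a subjects-by-categories count matrix of rater assignments. A minimal NumPy sketch, for illustration only (the function name and the example dataset are not from the study; the dataset is a standard published illustration of the statistic):

    ```python
    import numpy as np

    def fleiss_kappa(counts):
        """Fleiss' kappa for multiple raters assigning nominal categories.

        counts: (n_subjects, n_categories) array; counts[i, j] is the number
        of raters who placed subject i in category j. Each row must sum to
        the same number of raters.
        """
        counts = np.asarray(counts, dtype=float)
        n = counts.sum(axis=1)[0]  # raters per subject
        # Observed agreement per subject: fraction of agreeing rater pairs
        p_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))
        p_bar = p_i.mean()
        # Chance agreement from the marginal category proportions
        p_j = counts.sum(axis=0) / counts.sum()
        p_e = np.sum(p_j ** 2)
        return (p_bar - p_e) / (1 - p_e)

    # Classic worked example: 14 raters, 10 subjects, 5 categories
    ratings = [
        [0, 0, 0, 0, 14], [0, 2, 6, 4, 2], [0, 0, 3, 5, 6], [0, 3, 9, 2, 0],
        [2, 2, 8, 1, 1],  [7, 7, 0, 0, 0], [3, 2, 6, 3, 0], [2, 5, 3, 2, 2],
        [6, 5, 2, 1, 0],  [0, 2, 2, 3, 7],
    ]
    print(round(fleiss_kappa(ratings), 3))  # ≈ 0.21, "fair" on the scale above
    ```

    The κ value discounts the agreement expected by chance, which is why a seemingly high raw agreement can still land in the "fair" band when the category marginals are uneven.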

    FIRE-2 simulations: physics versus numerics in galaxy formation

    No full text

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

    No full text
    Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
