
    Analyzing short-answer questions and their automatic scoring - studies on semantic relations in reading comprehension and the reduction of human annotation effort

    Short-answer questions are a widespread exercise type in many educational areas. Answers given by learners to such questions, typically up to a few sentences long, are scored by teachers based on their content alone, ignoring linguistic correctness as far as possible. Manual scoring is time-consuming, so automatic scoring of short-answer questions using natural language processing techniques has become an important task. This thesis focuses on two aspects of short-answer questions and their scoring: First, we concentrate on a reading comprehension scenario for learners of German as a foreign language, where students answer questions about a reading text. Within this scenario, we examine the multiple relations between reading texts, learner answers and teacher-specified target answers. Second, we investigate how to reduce human scoring workload by both fully automatic and computer-assisted scoring. The latter is a scenario where scoring is not done entirely automatically, but where a teacher receives scoring support, for example, by means of clustering similar answers together. Addressing the first aspect, we conduct a series of corpus annotation studies which highlight the relations between pairs of learner answers and target answers, as well as between both types of answers and the reading text they refer to. We annotate sentences from the reading text that were potentially used by learners or teachers for constructing answers and observe that, unsurprisingly, most correct answers can easily be linked to the text; incorrect answers often link to the text as well, but are frequently backed up by a part of the text that is not relevant to answering the question. Based on these findings, we create a new baseline scoring model which judges correctness by whether or not learners looked for the answer in the right place in the text. After identifying those links into the text, we label the relation between learner answers and target answers, as well as between reading texts and answers, by annotating entailment relations. In contrast to the widespread assumption that scoring can be fully mapped to the task of recognizing textual entailment, we find the two tasks to be closely related but not completely equivalent. Correct answers often, but not always, entail the target answer as well as part of the related text; incorrect answers usually do not stand in an entailment relation to the target answer, but often have some overlap with the text. This close relatedness allows us to use gold-standard entailment information to improve the performance of automatic scoring. We also use links between learner answers and both reading texts and target answers in a statistical alignment-based scoring approach that draws on methods from machine translation, reaching a performance comparable to an existing knowledge-based alignment approach. Our investigations into how human scoring effort can be reduced when learner answers are manually scored by teachers are based on two methods: active learning and clustering. In the active learning approach, we score particularly informative items first, i.e., items from which a classifier can learn most, identifying them using uncertainty-based sample selection. In this way, we reach higher performance with a given number of annotation steps than with randomly selected answers. In the second research strand, we use clustering methods to group similar answers together, such that groups of answers can be scored in one scoring step.
This substantially reduces the number of necessary labeling steps. When we compare clustering-based scoring to classical supervised machine learning setups, in which the human annotations are used to train a classifier, supervised machine learning still leads in terms of performance, whereas clusters provide the advantage of structured output. However, we are able to close part of the performance gap by means of supervised feature selection and semi-supervised clustering. In an additional study, we investigate the automatic processing of learner language with respect to the performance of part-of-speech (POS) tagging tools. We manually annotate a German reading comprehension corpus with both spelling normalization and POS information and find that the performance of automatic POS tagging can be improved by spell-checking the data, using the reading text as additional evidence for the lexical material intended in a learner answer.
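
The uncertainty-based sample selection used in the active learning experiments can be sketched in a few lines. The following is only an illustration, not the setup of the thesis: the toy answers and gold scores are invented, and TF-IDF features with a logistic regression classifier are assumptions made for the sake of a runnable example.

```python
# A minimal sketch of uncertainty-based sample selection (pool-based active
# learning). Toy answers, TF-IDF features and logistic regression are
# illustrative assumptions, not the thesis's actual setup.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical learner answers with gold scores (1 = correct, 0 = incorrect).
answers = [
    "The fox could not reach the grapes and called them sour.",
    "The grapes were sweet and the fox ate them.",
    "Because it failed to get the grapes, the fox pretended not to want them.",
    "The fox climbed the tree and picked the grapes.",
    "The fox gave up and said the grapes were probably sour anyway.",
    "The story is about a crow and a piece of cheese.",
]
gold = np.array([1, 0, 1, 0, 1, 0])

X = TfidfVectorizer().fit_transform(answers)

labeled = [0, 1]          # seed set: one correct, one incorrect answer
pool = [2, 3, 4, 5]       # unlabeled pool

while pool:
    clf = LogisticRegression().fit(X[labeled], gold[labeled])
    proba = clf.predict_proba(X[pool])
    uncertainty = 1.0 - proba.max(axis=1)        # least-confident sampling
    pick = pool[int(np.argmax(uncertainty))]
    print(f"query answer {pick}: uncertainty {uncertainty.max():.2f}")
    labeled.append(pick)   # in practice, a teacher would now score this answer
    pool.remove(pick)
```

In a real setting, the queried answer would be scored by a teacher before the classifier is retrained on the enlarged labeled set.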

    The Influence of Variance in Learner Answers on Automatic Content Scoring

    Automatic content scoring is an important application in the area of automatic educational assessment. Short texts written by learners are scored based on their content, while spelling and grammar mistakes are usually ignored. The difficulty of automatically scoring such texts varies with the variance within the learner answers. In this paper, we first discuss factors that influence variance in learner answers, so that practitioners can better estimate whether automatic scoring might be applicable to their usage scenario. We then compare the two main paradigms in content scoring, (i) similarity-based and (ii) instance-based methods, and discuss how well each of them can deal with the variance-inducing factors described before.
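
A minimal sketch may help make the contrast between the two paradigms concrete. The target answer, the scored answers and the 0.3 similarity threshold below are invented; TF-IDF with cosine similarity and a logistic regression classifier merely stand in for the many possible instantiations of each paradigm.

```python
# A minimal sketch contrasting (i) similarity-based and (ii) instance-based
# content scoring. Target answer, scored answers and the 0.3 threshold are
# invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

target = "Water evaporates, condenses into clouds, and falls back as rain."
scored = [                      # previously scored answers (1 = correct)
    ("Water goes up as vapour and comes back down as rain.", 1),
    ("Rain is made of clouds that melt in the sun.", 0),
    ("Evaporation and condensation bring the water back as rain.", 1),
    ("The sun drinks the water from the sea.", 0),
]
new_answer = "Vapour rises, forms clouds, then returns as rainfall."

vec = TfidfVectorizer().fit([target, new_answer] + [a for a, _ in scored])

# (i) Similarity-based: compare the new answer to the target answer.
sim = cosine_similarity(vec.transform([new_answer]), vec.transform([target]))[0, 0]
print("similarity-based:", "correct" if sim > 0.3 else "incorrect", f"(cosine = {sim:.2f})")

# (ii) Instance-based: learn a classifier from previously scored answers.
clf = LogisticRegression().fit(vec.transform([a for a, _ in scored]),
                               [y for _, y in scored])
print("instance-based:", "correct" if clf.predict(vec.transform([new_answer]))[0] == 1 else "incorrect")
```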

    Automated scoring of teachers’ pedagogical content knowledge : a comparison between human and machine scoring

    To validly assess teachers’ pedagogical content knowledge (PCK), performance-based tasks with open-response formats are required. Automated scoring is considered an appropriate approach to reduce the resource intensity of human scoring and to achieve more consistent scoring results than human raters. The focus is on the comparability of human and automated scoring of PCK for economics teachers. The answers of (prospective) teachers (N = 852) to six open-response tasks from a standardized and validated test were scored by two trained human raters and by the engine “Educational SCoRIng Toolkit” (ESCRITO). The average agreement between human and computer ratings, κw = 0.66, suggests convergent validity of the scoring results. The results of the single-factor analysis of variance show a significant influence of the answers of each homogeneous subgroup (students = 460, trainees = 230, in-service teachers = 162) on the automated scoring. Findings are discussed in terms of implications for the use of automated scoring in educational assessment and its potentials and limitations.
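
The agreement figure reported above is a weighted kappa. As a hedged illustration of how such a statistic is computed, the snippet below uses scikit-learn's quadratically weighted Cohen's kappa on two invented rating vectors; the study's actual data are not reproduced here.

```python
# A minimal sketch of a quadratically weighted Cohen's kappa between one
# human rater and an automated scorer; the two rating vectors are invented.
from sklearn.metrics import cohen_kappa_score

human   = [0, 1, 2, 2, 1, 0, 3, 2, 1, 3]
machine = [0, 1, 2, 1, 1, 0, 3, 3, 1, 2]

kappa_w = cohen_kappa_score(human, machine, weights="quadratic")
print(f"quadratically weighted kappa: {kappa_w:.2f}")
```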

    Designing and implementing a research integrity promotion plan: recommendations for research funders

    Various stakeholders in science have put research integrity high on their agenda. Among them, research funders are prominently placed to foster research integrity by requiring that the organizations and individual researchers they support make an explicit commitment to research integrity. Moreover, funders need to adopt appropriate research integrity practices themselves. To facilitate this, we recommend that funders develop and implement a Research Integrity Promotion Plan (RIPP). This Consensus View offers a range of examples of how funders are already promoting research integrity, distills 6 core topics that funders should cover in a RIPP, and provides guidelines on how to develop and implement a RIPP. We believe that the 6 core topics we put forward will guide funders towards strengthening research integrity policy in their organization and guide the researchers and research organizations they fund.

    Ethnic Minorities Rewarded: Ethnostratification on the Wage Market in Belgium

    Several previous studies have confirmed the hypothesis of ethnostratification, which holds that the labour market is divided into different ethnic layers. While people of European origin are over-represented in the top layers (the primary market), people with non-European roots and/or nationalities are more concentrated in the bottom layers (the secondary market). Relative to the primary market, this secondary market is characterized by a higher chance of unemployment, lower wages, poorer working conditions and greater job insecurity. This paper deals with a very important condition of work: the wage. Does origin have an impact on the wage level? We distinguish nine origin groups: Belgians, North and West Europeans, South Europeans (from Greece, Spain and Portugal), Italians, East Europeans, Moroccans, Turks, Sub-Saharan Africans and Asians. The first part of this article briefly describes the database used for the analyses and presents a few general figures for the total Belgian population. In the second part we examine the impact of origin on wage levels. For each origin group we give an overview of the average daily wages and the distribution over the wage classes. For the weaker populations, gender and age are taken into account. Finally, by means of a regression analysis, we examine the influence of origin while controlling for a few other variables that may influence the wage level.
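
The regression analysis mentioned at the end can be sketched as an ordinary least squares model with origin-group dummies. The data frame, column names and group labels below are purely hypothetical and only illustrate the general specification (wage regressed on origin while controlling for gender and age); the paper's actual variables and estimation details may differ.

```python
# A minimal sketch of a wage regression with origin-group dummies; the data
# frame, columns and group labels are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "daily_wage": [95, 110, 78, 82, 120, 70, 88, 105, 76, 99, 84, 91],
    "origin": ["Belgian", "Belgian", "Moroccan", "Turkish", "Belgian",
               "Moroccan", "Italian", "Belgian", "Turkish", "Italian",
               "Moroccan", "Turkish"],
    "gender": ["m", "f", "m", "m", "f", "f", "m", "f", "f", "m", "m", "f"],
    "age": [42, 35, 29, 51, 38, 27, 45, 33, 30, 40, 36, 44],
})

# Origin enters as a set of dummies with Belgians as the reference group,
# controlling for gender and age.
model = smf.ols("daily_wage ~ C(origin, Treatment(reference='Belgian')) + C(gender) + age",
                data=df).fit()
print(model.summary())
```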

    Participatory Approach in Decision Making Processes for Water Resources Management in the Mediterranean Basin

    This paper deals with the comparative analysis of different policy options for water resources management in three south-eastern Mediterranean countries. The applied methodology follows a participatory approach throughout its implementation and is supported by the use of three different software packages dealing with the water allocation budget, water quality simulation, and Multi Criteria Analysis, respectively. The paper briefly describes the general objectives of the SMART project and then presents the three local case studies, the valuation objectives and the applied methodology, developed as a general replicable framework suitable for implementation in other decision-making processes. All the steps needed for a correct implementation are described. Following the conceptualisation of the problem, the choice of appropriate indicators as well as the calculation of their weights and value functions are detailed. The paper concludes with the results of the Multi Criteria Analysis and of the related Sensitivity Analyses, showing how the different policy responses under consideration can be assessed and compared across the case studies on the basis of their relative performances. The adopted methodology was found to be an effective operational approach for bridging scientific modelling and policy making by integrating the model outputs in a conceptual framework that can be understood and utilised by non-experts, thus showing concrete potential for participatory decision making.
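
The Multi Criteria Analysis step, i.e. combining indicator value functions with weights into an overall score per policy option, can be illustrated with a small weighted-sum sketch. The options, indicators, weights and raw values below are invented and do not come from the SMART case studies.

```python
# A minimal weighted-sum Multi Criteria Analysis sketch; options, weights and
# raw indicator values are invented for illustration.
import numpy as np

options = ["business as usual", "wastewater reuse", "demand management"]
weights = np.array([0.5, 0.3, 0.2])            # importance weights, sum to 1
benefit = np.array([False, False, True])       # True if larger values are better

# Raw indicator values: one row per option; columns are water deficit,
# annualised cost, and a water quality index (all hypothetical).
raw = np.array([
    [40.0, 1.0, 0.4],
    [15.0, 3.5, 0.7],
    [22.0, 2.0, 0.6],
])

# Linear value functions rescaling each indicator to [0, 1].
lo, hi = raw.min(axis=0), raw.max(axis=0)
value = (raw - lo) / (hi - lo)
value[:, ~benefit] = 1.0 - value[:, ~benefit]  # invert cost-type indicators

scores = value @ weights
for name, s in sorted(zip(options, scores), key=lambda t: -t[1]):
    print(f"{name}: {s:.2f}")
```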

    The Role of Risk Aversion and Lay Risk in the Probabilistic Externality Assessment for Oil Tanker Routes to Europe

    Oil spills are a major cause of environmental concern, in particular for Europe. However, the traditional approach to the evaluation of the expected external costs of these accidents fails to take fully into account the implications of their probabilistic nature. By adapting a methodology originally developed for nuclear accidents to the case of oil spills, we extend the traditional approach to assess the welfare losses borne by potentially affected individuals for being exposed to the risk of an oil spill. The proposed methodology differs from the traditional approach in three respects: it allows for risk aversion; it adopts an ex-ante rather than an ex-post perspective; and it allows for subjective oil spill probabilities (held by the lay public) higher than those assessed by the experts in the field. To illustrate this methodology quantitatively, we apply it to the hypothetical (yet realistic) case of an oil spill in the Aegean Sea. We assess the risk premiums that potentially affected individuals would be willing to pay in order to avoid losses to economic activities such as tourism and fisheries, as well as non-use damages resulting from environmental impacts on the Aegean coasts. In the scenarios analysed, the risk premiums on expected losses for tourism and fisheries turn out to be substantial when measured as a percentage of expected losses; by contrast, they are quite small for damages to the natural environment.
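
The gap between expected losses and ex-ante willingness to pay can be illustrated with a small numerical sketch. The wealth level, loss size, spill probability and the CRRA utility function with risk aversion gamma = 2 are assumptions chosen only for illustration; the paper's own valuation framework is richer than this.

```python
# A minimal sketch of an ex-ante risk premium under risk aversion: wealth W,
# a possible loss L with probability p, and CRRA utility with gamma = 2.
# All numbers are invented for illustration.
from scipy.optimize import brentq

W, L, p, gamma = 100_000.0, 30_000.0, 0.05, 2.0

def u(c):                        # CRRA utility (gamma != 1)
    return c ** (1 - gamma) / (1 - gamma)

expected_loss = p * L
eu_risky = p * u(W - L) + (1 - p) * u(W)

# Maximum willingness to pay x that leaves the individual as well off with
# certainty as when facing the risk: u(W - x) = E[u] under the risk.
wtp = brentq(lambda x: u(W - x) - eu_risky, 0.0, L)

print(f"expected loss: {expected_loss:,.0f}")
print(f"ex-ante WTP  : {wtp:,.0f}")
print(f"risk premium : {wtp - expected_loss:,.0f} "
      f"({(wtp - expected_loss) / expected_loss:.0%} of the expected loss)")
```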

    Marginal Cost versus Average Cost Pricing with Climatic Shocks in Senegal: A Dynamic Computable General Equilibrium Model Applied to Water

    The model simulates, over a 20-year horizon, a first phase of increased water resource availability, taking into account the supply policies of the Senegalese government, and a second phase with hydrologic deficits due to demand evolution (demographic growth). The results show that marginal cost pricing of water (with a subsidy ensuring the survival of the water production sector) makes it possible in the long term to absorb the shock of the resource shortage: GDP, investment and welfare increase, unemployment drops, and the sectors of rain-fed rice, market gardening and drinking water distribution grow. In contrast, the current policy of average cost pricing of water leads the economy into a long-term recession, with decreasing agricultural production, a strong degradation of welfare and rising unemployment. This result calls into question the base tariff (average cost) on which block water pricing in Senegal is based.
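
A partial-equilibrium toy example can illustrate why marginal cost pricing requires a subsidy and why it can nonetheless dominate average cost pricing in welfare terms. The linear demand and cost parameters below are invented, and the sketch deliberately ignores the dynamic and general-equilibrium features of the CGE model used in the paper.

```python
# A partial-equilibrium toy comparison of marginal-cost and average-cost
# pricing for a water utility with a fixed cost F and constant marginal cost
# c, facing linear inverse demand P(q) = a - b*q. All parameters are invented.
a, b = 10.0, 0.02        # inverse demand P(q) = a - b*q
F, c = 400.0, 2.0        # total cost     C(q) = F + c*q

def total_surplus(q):
    """Willingness to pay for q units minus the total cost of producing them."""
    return a * q - 0.5 * b * q ** 2 - (F + c * q)

# Marginal-cost pricing: p = c; the deficit F must be covered by a subsidy.
q_mc = (a - c) / b

# Average-cost pricing: p = F/q + c, i.e. solve b*q^2 - (a - c)*q + F = 0
# and take the high-output intersection of demand and average cost.
disc = (a - c) ** 2 - 4.0 * b * F
q_ac = ((a - c) + disc ** 0.5) / (2.0 * b)

print(f"MC pricing: q = {q_mc:.0f}, subsidy = {F:.0f}, surplus = {total_surplus(q_mc):.0f}")
print(f"AC pricing: q = {q_ac:.0f}, subsidy = 0,   surplus = {total_surplus(q_ac):.0f}")
```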

    Cost Effectiveness in River Management: Evaluation of Integrated River Policy System in Tidal Ouse

    The River Ouse forms a significant part of the Humber river system, which drains about one fifth of the land area of England and provides the largest freshwater source to the North Sea from the UK. Water quality in the tidal river has suffered from a sag in dissolved oxygen (DO) during the last few decades, worsened by effluent discharges. The Environment Agency (EA) proposed to improve the water quality of the Ouse by implementing more stringent environmental policies. This paper explores the cost effectiveness of water management in the tidal Ouse under various options, taking into account the variation in the assimilative capacity of the river water in both a static and a dynamic time frame. Reductions in both effluent discharges and water abstraction were considered, along with the choice of effluent discharge location. Different instruments of environmental policy, an emission tax-subsidy (ETS) scheme and a tradable pollution permits (TPP) system, were compared with the direct quantitative control approach. Finally, the paper presents an empirical example of reaching a particular water quality target in the tidal Ouse at least cost, through the solution of a constrained optimisation problem. The results suggest a significant improvement in water quality at a lower cost than the current policy, which would fail the target in a low-flow year.
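
The least-cost framing via constrained optimisation can be sketched with two hypothetical dischargers whose abatement costs are quadratic; the target of 60 abatement units and the cost coefficients are invented and merely illustrate that, at the least-cost solution, marginal abatement costs are equalised.

```python
# A minimal constrained-optimisation sketch: two hypothetical dischargers with
# quadratic abatement costs must jointly abate 60 units (a stand-in for the
# DO target) at least total cost. Coefficients and target are invented.
import numpy as np
from scipy.optimize import minimize

cost_slope = np.array([0.5, 1.5])    # marginal abatement cost = slope * abatement

def total_cost(x):
    return float(np.sum(0.5 * cost_slope * x ** 2))

target = 60.0
constraints = ({"type": "eq", "fun": lambda x: x.sum() - target},)
bounds = [(0.0, None), (0.0, None)]

res = minimize(total_cost, x0=np.array([30.0, 30.0]),
               bounds=bounds, constraints=constraints)

print("abatement per discharger:", np.round(res.x, 1))
print("marginal costs (equalised at the optimum):", np.round(cost_slope * res.x, 2))
print("total cost:", round(float(res.fun), 1))
```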