Adapting Collaborative Chat for Massive Open Online Courses: Lessons Learned
Abstract. In this paper we explore how to bring intelligent support for group learning, which has been demonstrated as effective in classroom instruction, into a Massive Open Online Course (MOOC) context. The Bazaar agent architecture, paired with an innovative Lobby tool that enables coordination for synchronous reflection exercises, provides the technical foundation for our work. We describe lessons learned and directions for future work, and offer pointers to resources for other researchers interested in computer-supported collaborative learning in MOOCs.
Expediting Support for Social Learning with Behavior Modeling
An important research problem for Educational Data Mining is to expedite the cycle of data leading to the analysis of student learning processes and the improvement of support for those processes. Toward this goal in the context of social interaction in learning, we propose a three-part pipeline that includes data infrastructure, learning process analysis with behavior modeling, and intervention for support. We also describe an application of the pipeline to data from a social learning platform to investigate appropriate goal-setting behavior as a qualification of role models. Students following appropriate goal setters persisted longer in the course, showed increased engagement in hands-on course activities, and were more likely to review previously covered materials as they continued through the course. To foster this beneficial social interaction among students, we propose a social recommender system and show its potential for assisting students in interacting with qualified goal setters as role models. We discuss how this generalizable pipeline can be adapted for other support needs in online learning settings.
Comment: in The 9th International Conference on Educational Data Mining, 2016
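As a rough illustration of the intervention stage described above, the sketch below shows how a social recommender might rank candidate role models by a goal-setting qualification score. The data fields, threshold, and ranking rule are hypothetical placeholders, not the paper's actual model.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Hypothetical record for a potential role model (fields are illustrative)."""
    user_id: str
    goal_setting_score: float   # e.g. output of a behavior model, assumed to lie in [0, 1]
    days_active: int

def recommend_goal_setters(candidates, top_k=5, min_score=0.7):
    """Return the top-k candidates whose modeled goal-setting behavior
    qualifies them as role models (threshold and ranking are assumptions)."""
    qualified = [c for c in candidates if c.goal_setting_score >= min_score]
    qualified.sort(key=lambda c: (c.goal_setting_score, c.days_active), reverse=True)
    return qualified[:top_k]

if __name__ == "__main__":
    pool = [Candidate("u1", 0.9, 40), Candidate("u2", 0.5, 80), Candidate("u3", 0.8, 25)]
    print([c.user_id for c in recommend_goal_setters(pool)])  # ['u1', 'u3']
```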
The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-supported Information Quality Management in Wikipedia
Over the past decade, the paradigm of the World Wide Web has shifted from static web pages towards participatory and collaborative content production. The main properties of this user generated content are a low publication threshold and little or no editorial control. While this has improved the variety and timeliness of the available information, it results in an even higher variance in quality than that of the already heterogeneous traditional web content. Wikipedia is the prime example of a successful, large-scale, collaboratively created resource that reflects the spirit of the open collaborative content creation paradigm.
Even though recent studies have confirmed that the overall quality of Wikipedia is high, there is still a wide gap that must be bridged before Wikipedia reaches the state of a reliable, citable source.
A key prerequisite to reaching this goal is a quality management strategy that can cope both with the massive scale of Wikipedia and its open and almost anarchic nature. This includes an efficient communication platform for work coordination among the collaborators as well as techniques for monitoring quality problems across the encyclopedia. This dissertation shows how natural language processing approaches can be used to assist information quality management on a massive scale.
In the first part of this thesis, we establish the theoretical foundations for our work. We first introduce the relatively new concept of open online collaboration with a particular focus on collaborative writing and proceed with a detailed discussion of Wikipedia and its role as an encyclopedia, a community, an online collaboration platform, and a knowledge resource for language technology applications. We then proceed with the three main contributions of this thesis.
Even though there have been previous attempts to adapt existing information quality frameworks to Wikipedia, no quality model has yet incorporated writing quality as a central factor. Since Wikipedia is not only a repository of mere facts but rather consists of full text articles, the writing quality of these articles has to be taken into consideration when judging article quality. As the first main contribution of this thesis, we therefore define a comprehensive article quality model that aims to consolidate both the quality of writing and the quality criteria defined in multiple Wikipedia guidelines and policies into a single model. The model comprises 23 dimensions segmented into the four layers of intrinsic quality, contextual quality, writing quality and organizational quality.
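For illustration only, the four layers of the model could be represented as a simple mapping. The layer names come from the abstract, while the example dimensions listed under each layer are hypothetical placeholders; the actual 23 dimensions are defined in the thesis itself.

```python
# Sketch of the four-layer article quality model; the dimension names are
# illustrative placeholders, not the thesis's actual 23 dimensions.
QUALITY_MODEL = {
    "intrinsic quality":      ["accuracy", "objectivity"],
    "contextual quality":     ["relevance", "currency"],
    "writing quality":        ["readability", "style"],
    "organizational quality": ["structure", "referencing"],
}

def count_dimensions(model=QUALITY_MODEL):
    """Total number of dimensions across all layers (23 in the full model)."""
    return sum(len(dims) for dims in model.values())
```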
As a second main contribution, we present an approach for automatically identifying quality flaws in Wikipedia articles. Even though the general idea of quality detection has been introduced in previous work, we dissect the approach to find that the task is inherently prone to a topic bias which results in unrealistically high cross-validated evaluation results that do not reflect the classifier’s real performance on real world data.
We solve this problem with a novel data sampling approach based on the full article revision history that is able to avoid this bias. It furthermore allows us not only to identify flawed articles but also to find reliable counterexamples that do not exhibit the respective quality flaws. For automatically detecting quality flaws in unseen articles, we present FlawFinder, a modular system for supervised text classification. We evaluate the system on a novel corpus of Wikipedia articles with neutrality and style flaws. The results confirm the initial hypothesis that the reliable classifiers tend to exhibit a lower cross-validated performance than the biased ones, but their scores more closely resemble the actual performance in the wild.
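One plausible reading of the revision-based sampling idea is sketched below: a revision of an article that carries a flaw marker serves as a positive example, and a later revision of the same article from which the marker was removed serves as a reliable counterexample, so positives and negatives share the same topics. The data layout, the marker predicate, and the example template are assumptions for illustration, not the thesis's exact procedure.

```python
def sample_flaw_pairs(revisions, has_flaw_marker):
    """Given one article's revisions (oldest to newest) and a predicate that
    detects a flaw marker, return (flawed_text, repaired_text) pairs.

    Pairing flawed and repaired versions of the *same* article keeps the
    topic distribution of positives and negatives identical, which is the
    point of sampling from the revision history instead of random articles.
    """
    pairs = []
    for older, newer in zip(revisions, revisions[1:]):
        if has_flaw_marker(older) and not has_flaw_marker(newer):
            pairs.append((older, newer))
    return pairs

if __name__ == "__main__":
    history = [
        "Some text. {{POV}}",       # flaw marker present
        "Some text, now neutral.",  # marker removed -> reliable counterexample
    ]
    print(sample_flaw_pairs(history, lambda text: "{{POV}}" in text))
```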
As a third main contribution, we present an approach for automatically segmenting and tagging the user contributions on article Talk pages to improve the work coordination among Wikipedians. These unstructured discussion pages are not easy to navigate and information is likely to get lost over time in the discussion archives. By automatically identifying the quality problems that have been discussed in the past and the solutions that have been proposed, we can help users to make informed decisions in the future.
Our contribution in this area is threefold: (i) We describe a novel algorithm for segmenting the unstructured dialog on Wikipedia Talk pages using their revision history. In contrast to related work, which mainly relies on the rudimentary markup, this new algorithm can reliably extract metadata, such as the identity of a user, and is moreover able to handle discontinuous turns. (ii) We introduce a novel scheme for annotating the turns in article discussions with dialog act labels that capture the coordination efforts of article improvement. The labels reflect the types of criticism discussed in a turn, for example missing information or inappropriate language, as well as any actions proposed for solving the quality problems. (iii) Based on this scheme, we created two automatically segmented and manually annotated discussion corpora extracted from the Simple English Wikipedia (SEWD) and the English Wikipedia (EWD). We evaluate how well text classification approaches can learn to assign the dialog act labels from our scheme to unseen discussion pages, achieving a cross-validated performance of F1 = 0.82 on the SEWD corpus and an average performance of F1 = 0.78 on the larger and more complex EWD corpus.
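A minimal sketch of the kind of supervised text classification used for assigning dialog act labels to discussion turns is shown below, here with a generic TF-IDF plus one-vs-rest logistic regression setup from scikit-learn. The tiny example turns and labels are invented, and the thesis's actual feature set and classifiers may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented example turns with multi-label dialog acts (criticism / proposed action).
turns = [
    "The article is missing information about the early history.",
    "The tone of this section is not neutral, please rewrite it.",
    "I suggest merging the two stub sections.",
    "Sources are missing for the second paragraph.",
]
labels = [["missing-info"], ["inappropriate-language", "rewrite"], ["merge"], ["missing-info"]]

# Turn the label sets into a binary indicator matrix for one-vs-rest training.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(turns, y)

pred = clf.predict(["Please add information on later developments."])
print(mlb.inverse_transform(pred))
```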
Semantic Relations in WordNet and the BNC
From the introduction: It is not always easy to define what a word means. We can choose between a variety of possibilities, from simply pointing at the correct object as we say its name to lengthy definitions in encyclopaedias, which can sometimes fill multiple pages. Although the former approach is fairly straightforward and is also very important for first language acquisition, it is obviously not a practical solution for defining the semantics of the whole lexicon. The latter approach is more widely accepted in this context, but it turns out that defining dictionary and encyclopaedia entries is not an easy task. In order to simplify the challenge of defining the meaning of words, it is of great advantage to organize the lexicon in such a way that the structure into which the words are integrated provides information about their meaning by showing their relations to other words. These semantic relations are the focal point of this paper. In the first chapter, different ways to describe meaning will be discussed. It will become obvious why semantic relations are a very good instrument for organizing the lexicon. The second chapter deals with WordNet, an electronic lexical database which follows precisely this approach. We will examine the semantic relations which are used in WordNet and study the distinct characteristics of each of them. Furthermore, we will see which contribution each relation makes to the organization of the lexicon. Finally, we will look at the downside of the fact that WordNet is a manually engineered network by examining its shortcomings. In the third chapter, an alternative approach to linguistics is introduced. We will discuss the principles of corpus linguistics and, using the example of the British National Corpus, consider possibilities to extract semantic relations from language corpora which could help to overcome the deficiencies of the knowledge-based approach. In the fourth chapter, I will describe a project whose goal is to extend WordNet with findings from cognitive linguistics. To this end, I will discuss the development process of a piece of software programmed in the course of this thesis. Furthermore, the results from a small-scale study using this software will be analysed and evaluated in order to assess the success of the project.
In his Magister's thesis, the author examines the semantic relations between words in great detail. In a project study, Mr. Ferschke attempts to semi-automatically extract certain cognitively relevant objects on the basis of an existing semantic network. His project is supported by a survey of students on the conceptual classification of these objects. In the first chapter, Oliver Ferschke presents, on a very sound linguistic basis, various ways of describing meaning. He distinguishes different views of what constitutes "meaning" and contrasts them clearly. The second chapter is devoted to the semantic network WordNet, which describes so-called synsets for English. Building on the semantic relations represented in WordNet, the author uses selected examples to show how English words are integrated into this network. He refers to semantic relations such as hyponymy, meronymy, opposites and polysemy, illustrating them with examples. He also addresses some desiderata in WordNet.
The British National Corpus (BNC) is presented in detail in the third part of this Magister's thesis. For the project study, frequency information from this corpus is used to place the later categorizations on as quantitatively valid a basis as possible. Mr. Ferschke outlines the most important differences between corpus-linguistic approaches on the one hand and structuralist studies as well as those belonging to the generative school on the other. He concludes these considerations with a syntactically oriented approach based on patterns, which (can) allow certain semantic relations to be inferred from frequent syntactic structures. The author illustrates how these patterns can be integrated into a CQL query. Likewise, Mr. Ferschke uses possible constituents of noun and prepositional phrases to show how these can be identified in the BNC by automatic procedures. The fourth chapter of the thesis is devoted to the project study, which aims to apply insights from prototype theory to the structure of WordNet. Using software developed by the author, an attempt is made to identify certain cognitively relevant levels of the semantic descriptions. Mr. Ferschke pursues the goal of extracting basic level objects from the WordNet hierarchies by means of semi-automatic procedures. His study consists of two parts: in a first, fully automatic part, words that meet certain semantic and quantitative criteria are identified by automatic procedures. In the second part of the project, these basic level objects are rated by test subjects with respect to their properties. The author selected three different semantic domains for which basic level objects are to be determined: athletics, furniture, vehicle. In his analyses, Mr. Ferschke shows which potential basic level objects were selected as such by the study participants. He addresses problems that concern the structure of WordNet and can therefore have a substantial influence on which words are selected as basic level objects. A second problem Mr. Ferschke discusses is the language competence of the test subjects. A further problem, not mentioned by the author, is the extent to which a given word definition influenced the participants' ratings. A not insignificant part of the thesis consists of the design and implementation of the software for the project study. This requires not only detailed knowledge of computer science but also a sound knowledge of linguistics. Through the design of the project, Mr. Ferschke makes it very clear that he has a strong command of both fields. From a linguistic point of view, the present work is thoroughly grounded and excellently presented. It covers a broad spectrum of linguistic theories and explanatory models and comprehensively presents the aspects relevant to this topic. The computational-linguistic component is likewise to be rated as very good, especially since linking prototype theory on the one hand with WordNet on the other is not entirely straightforward. The main difficulty lies in making the given structure of WordNet usable for aspects of prototype theory.
Oliver Ferschke has undoubtedly succeeded in this. The present Magister's thesis deserves the grade "very good" (1.0).
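Purely as an illustration of how a semi-automatic filter over WordNet hierarchies might look (this is not the software developed in the thesis), the sketch below uses NLTK's WordNet interface to collect the hyponyms of a domain synset such as furniture.n.01 and ranks them by corpus lemma frequency and hierarchy depth as crude, hypothetical proxies for basic-level status.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def basic_level_candidates(root_name="furniture.n.01", top_k=10):
    """Rank hyponyms of a domain synset by SemCor lemma count and depth.

    High lemma frequency and moderate depth serve here as rough stand-ins
    for basic-level status; the thesis's actual criteria and the follow-up
    human judgments are not reproduced.
    """
    root = wn.synset(root_name)
    candidates = []
    for syn in root.closure(lambda s: s.hyponyms()):   # all transitive hyponyms
        freq = sum(lemma.count() for lemma in syn.lemmas())
        candidates.append((freq, syn.min_depth(), syn.name()))
    candidates.sort(reverse=True)                      # most frequent first
    return candidates[:top_k]

if __name__ == "__main__":
    for freq, depth, name in basic_level_candidates():
        print(f"{name:25s} freq={freq:3d} depth={depth}")
```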
The Quality of Massive Open Online Collaboration
User generated content is the main driving force of the increasingly social web. Participatory and collaborative content production has largely replaced the traditional ways of information sharing and makes up a large share of the daily information consumed by web users.
The main properties of user generated content are a low publication threshold and little or no editorial control. While this has positively affected the variety and timeliness of the available information, it causes an even higher variance in quality than the already heterogeneous quality of traditional web content.
In this project, we focus on the quality of collaboratively created texts. Using the example of Wikipedia, we investigate how the quality of articles can be assessed automatically and how we can apply language technology to facilitate quality assurance on a large scale.
In a first scenario, we analyze two corpora of Wikipedia article discussion pages extracted from the English Wikipedia and the Simple English Wikipedia, which we manually annotated with quality-directed speech act labels. We show how these corpora can be used to automatically identify the problems and solutions discussed by the community. Finally, we discuss how this approach can help to improve work coordination in Wikipedia and ultimately improve the quality assurance process.
In a second scenario, we focus on analyzing article quality directly. Instead of applying abstract quality scores to each article, we approach the problem from a different direction and aim to identify concrete quality problems. Wikipedia already provides a rich set of quality flaw markers, which we extract on a large scale and use as training data for automatic quality flaw prediction that can assist authors in improving the quality of their articles. In this context, we furthermore analyze the topic prevalence of individual flaw types, i.e., the phenomenon that particular flaws appear more often in articles on certain topics. This biased distribution negatively influences the data samples used for machine learning experiments and thus warrants further investigation.
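As a rough sketch of how such flaw markers can be harvested from article wikitext, the snippet below scans for cleanup templates with a regular expression. The two template names are just common examples of such markers, and the actual large-scale extraction in the project covers a much richer set.

```python
import re

# Illustrative subset of cleanup templates that mark quality flaws in wikitext.
FLAW_TEMPLATES = {
    "POV": "neutrality flaw",
    "Advert": "promotional style flaw",
}
_CANONICAL = {name.lower(): desc for name, desc in FLAW_TEMPLATES.items()}

# Matches e.g. {{POV}}, {{POV|date=June 2012}}, {{advert}}.
_TEMPLATE_RE = re.compile(
    r"\{\{\s*(%s)\b[^}]*\}\}" % "|".join(FLAW_TEMPLATES), re.IGNORECASE
)

def extract_flaw_markers(wikitext):
    """Return the flaw types marked in the given article wikitext."""
    return sorted({_CANONICAL[m.group(1).lower()] for m in _TEMPLATE_RE.finditer(wikitext)})

if __name__ == "__main__":
    text = "{{POV|date=June 2012}} The product is the best on the market. {{advert}}"
    print(extract_flaw_markers(text))  # ['neutrality flaw', 'promotional style flaw']
```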
Wikipedia Article Feedback
The corpus lists article IDs of biographies of living and dead people, rated as above average or below average along four categories (trustworthy, objective, well written, complete) based on the ratings from Wikipedia Article Feedback v4 [http://en.wikipedia.org/wiki/Wikipedia:Article_Feedback_Tool] (each of the listed articles was rated at least 10 times).
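A minimal loader sketch is given below under the assumption of a tab-separated layout with one article ID followed by an above/below-average flag per category. The file name, column order, and flag encoding are all hypothetical and should be adapted to the corpus's actual format.

```python
import csv

# Assumed column layout: article_id, trustworthy, objective, well_written, complete
# with "above" / "below" flags per rating category.
CATEGORIES = ["trustworthy", "objective", "well_written", "complete"]

def load_feedback_ratings(path="article_feedback.tsv"):
    """Read the (assumed) TSV export into {article_id: {category: above_average?}}."""
    ratings = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            article_id, flags = row[0], row[1:]
            ratings[article_id] = {cat: flag == "above"
                                   for cat, flag in zip(CATEGORIES, flags)}
    return ratings
```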
A Lightly Supervised Approach to Role Identification in Wikipedia Talk Page Discussions
In this paper we describe an application of a lightly supervised Role Identification Model (RIM) to the analysis of coordination in Wikipedia talk page discussions. Our goal is to understand the substance of important coordination roles that predict the quality of the Wikipedia pages where the discussions take place. Using the model as a lens, we present an analysis of four important coordination roles it identifies: Workers, Critiquers, Encouragers, and Managers.
Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia’s Edit History
We present an open-source toolkit which allows users (i) to reconstruct past states of Wikipedia and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of data provided. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface for accessing the revision data. The language-independent design allows any language represented in Wikipedia to be processed. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.
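The space savings of such a dedicated storage format typically come from storing revisions as deltas against their predecessors rather than as full texts. The sketch below illustrates that general idea with Python's difflib; it is not the toolkit's actual storage format or API.

```python
import difflib

def make_delta(old, new):
    """Encode `new` against `old`: keep only spans copied from `old` plus inserted text."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old, b=new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse a span of the previous revision
        elif j2 > j1:                          # 'replace' or 'insert'
            ops.append(("text", new[j1:j2]))   # store only the genuinely new characters
    return ops

def apply_delta(old, ops):
    """Rebuild the newer revision from the older one and its delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            _, i1, i2 = op
            parts.append(old[i1:i2])
        else:
            parts.append(op[1])
    return "".join(parts)

if __name__ == "__main__":
    rev1 = "Wikipedia is a free encyclopedia."
    rev2 = "Wikipedia is a free online encyclopedia, edited collaboratively."
    delta = make_delta(rev1, rev2)
    assert apply_delta(rev1, delta) == rev2
    print(delta)
```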