    Technologies for Reusing Text from the Web

    Texts from the web can be reused individually or in large quantities. The former is called text reuse and the latter language reuse. We first present a comprehensive overview of the different ways in which text and language are reused today, and how information retrieval technologies can be applied in this respect. The remainder of the thesis then deals with specific retrieval tasks. In general, our contributions consist of models and algorithms, their evaluation, and, for that purpose, large-scale corpus construction. The thesis is divided into two parts. The first part introduces technologies for text reuse detection, and our contributions are as follows: (1) A unified view of projection-based and embedding-based fingerprinting for near-duplicate detection, and the first evaluation of fingerprint algorithms on Wikipedia revision histories as a new, large-scale corpus of near-duplicates. (2) A new retrieval model for quantifying cross-language text similarity that does not require parallel corpora. We have evaluated the model against other models on many different language pairs. (3) An evaluation framework for text reuse detectors, and plagiarism detectors in particular, consisting of tailored detection performance measures and a large-scale corpus of automatically generated and manually written plagiarism cases, the latter obtained via crowdsourcing. This framework has been successfully applied to evaluate many state-of-the-art plagiarism detection approaches within three international evaluation competitions. The second part introduces technologies that solve three retrieval tasks based on language reuse, and our contributions are as follows: (4) A new model for comparing textual and non-textual web items across media, which exploits web comments as a source of information about the topic of an item. In this connection, we identify web comments as a largely neglected information source and introduce the rationale of comment retrieval. (5) Two new algorithms for query segmentation, which exploit web n-grams and Wikipedia to discern the user intent behind a keyword query. Moreover, we crowdsource a new corpus for the evaluation of query segmentation that surpasses existing corpora by two orders of magnitude. (6) A new writing assistance tool called Netspeak, a search engine for commonly used language. Netspeak indexes the web in the form of web n-grams as a source of writing examples and implements a wildcard query processor on top of it.
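
    The thesis's own fingerprinting models are not reproduced in this abstract; purely as a hedged illustration of the general idea behind fingerprint-based near-duplicate detection, the following sketch builds a minhash-style fingerprint from word shingles and compares two fingerprints by set overlap. All function names and parameters here are illustrative, not taken from the thesis:

    ```python
    import hashlib

    def fingerprint(text, k=3, n=4):
        """Return the n smallest hashes of the text's k-word shingles
        (a minhash-style sketch of the document)."""
        words = text.lower().split()
        shingles = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
        hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
        return set(hashes[:n])

    def resemblance(a, b):
        """Jaccard overlap of two fingerprints; values near 1.0 suggest near-duplicates."""
        return len(a & b) / len(a | b) if a | b else 0.0
    ```

    Projection-based and embedding-based fingerprinting, as studied in the thesis, differ in how the hash space is constructed; this sketch only conveys the shared shingle-and-hash principle.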

    The English as a foreign language writing classroom and weblog: the effect of computer-mediated communication on attitudes of students and implication for EFL learning

    PhD Thesis. Innovative forms of communication technology have generated new educational models and learning environments. Existing literature includes much discussion concerning the consequences of using communication technology in the context of second language learning. However, recent research has not reached any convincing conclusion about the effects of communication technology in EFL teaching and learning. Many variables still need to be accounted for when technology is used in real-life educational environments, particularly where a newly developed communication technology, the Weblog, may work better for language learners under specific circumstances. This empirical study focused on whether the use of Weblogs positively changes learners' attitudes towards EFL writing and their informal use of the English language. Once the focus of this study had been established, the research questions and hypotheses were addressed as a means of examining the effect of Weblogs. A quasi-experimental research design was applied with a mixed-methods approach to elicit data from 119 EFL students in two universities in Taiwan. The collected data included 112 pre- and 102 post-GEPT exam papers, 119 questionnaire responses, and qualitative data from interviews with 24 research participants. These data were then analysed using inductive (qualitative) and deductive (quantitative) methods to test the research assumptions. The results corroborate the theoretical findings on the significance of computer-mediated communication in learners' affective learning. In other words, the use of Weblogs influenced the learners' attitudes towards EFL writing. The combination of quantitative and qualitative findings suggested that Weblog technology engages learners in active reading and encourages learners' reflectivity, collaboration, and participation in EFL writing. Finally, the results also echo theoretical concerns about learners' self-efficacy and language register in the context of second language writing.

    Communicating across cultures in cyberspace


    Case Studies in multiliteracies and inclusive pedagogy: Facilitating meaningful literacy learning

    This thesis presents the results of a study designed to examine ways to engage and scaffold primary school students who experience literacy learning difficulties. Utilising a pedagogy of multiliteracies, proposed by the New London Group (1996, 2000), and a framework for inclusive pedagogy (Florian, 2014), this thesis sought to investigate ways to facilitate meaningful literacy learning for students who experience challenges when participating in print-based classroom activities. A qualitative case study approach was adopted to support the broader sociocultural and multiliteracies perspective that underlies the theoretical direction of this research. Three student case studies were constructed illustrating the students’ in-school and out-of-school literacy practices. Research data indicated that while these students exhibited strong engagement with multiple literacies in their out-of-school environment, their experiences in a classroom context were, at times, challenging and marginalising. During the fieldwork period, which took place in a Western Australian Year 6 primary classroom, a multimodal literacy activity was implemented over one school term. This activity required students to:
    1. Audioread the novel The Bad Beginning,
    2. Create a storyboard utilising the iPad app Kid’s Book Report, and
    3. Create an iMovie review about the novel.
    Data analysis revealed that engagement with the multimodal literacy activity emerged in similar ways for the case study students. These students appeared to be engaged with the literacy activity when they were:
    • Activating prior knowledge and immersed in meaningful practices via situated learning.
    • Experiencing opportunities to create meaning in multiple ways.
    • Fostering shared meanings, scaffolded within a community of practice.
    Results indicate that engagement with multiple literacies, beyond the printed word, allowed the students to navigate literacy within various contexts. Exploring multimodal ways to present their thoughts further enhanced the students’ engagement with the multimodal literacy activity. This study provides insight into key areas in the field of literacy research and contributes to understandings of: multiliteracies; inclusive pedagogy; sociocultural approaches to literacy; and open-ended and flexible approaches to literacy learning. The study may be of interest to pre- and in-service primary school educators, education researchers, and policy makers.

    Insiders' Guide to the Student Academic Conference: 11th Annual SAC

    Minnesota State University Moorhead Student Academic Conference abstract book

    A Conversation Analysis of Facebook Confessions Pages: Identity and Identification

    How individuals identify each other through digital media and display their claims of knowledge is at the core of this study. This work contributes new insights into how participants accomplish identity work by looking at the conversational resources they use in addressing matters of identity in their interaction. The study draws on Conversation Analysis (CA), particularly conceptual work on membership categorization analysis (MCA) and epistemics. The findings are based on two interrelated aspects of the data taken from Facebook Confession Pages interaction. The first concerns the features of the initial (confessional) message, and the second relates to subsequent responses to the initial message. Close examination of the initial message shows how identity work is initiated and how it shapes the subsequent response messages. Two primary forms of messages were identified on the basis of person reference: those that inform and those that inquire. In each category, the analysis demonstrates that person reference is used as an interactional resource in making an epistemic claim about the referent. Person references are contextual in that they are locally based and understood within the specific contexts of the message. Thus, the employment of person reference in the initial message illustrates the epistemic relationship the author has with the referent. Accordingly, analysis of the subsequent response messages demonstrates ways in which the identity presented in the initial message is identified, and offers insight into how identity work is accomplished through commenters’ collaborative epistemic stances. Additionally, the study examines a technological element of FCPs that assists participants in their identity work: the Facebook name. It indicates that this functionality performs various kinds of interactional work, including identification work, as it provides a link to the correct referent. Overall, the findings show that as identity work is performed, epistemic stance is a requisite component of the interactions. This may challenge notions of the invisibility of identity in digital contexts.

    Elements of Scholarly Writing Identified by Writing Center Tutors in the Health Sciences

    Postsession narrative notes written by professional tutors in a health sciences university writing center had never been analyzed to identify the elements most commonly noted as subpar in graduate students’ scholarly writing. The purpose of this basic qualitative study was to examine these notes and identify those elements. The conceptual framework was Bloom’s original taxonomy. Hand coding of archival data was used to analyze 300 postsession narrative notes submitted by professional writing tutors during the fall trimester of 2022. Descriptive first-cycle coding was followed by pattern coding. The five most common elements were flow, style-guide-related concerns, organization, clarity, and alignment. Recommendations include that all elements of scholarly writing should be addressed simultaneously, that professional writing tutors working with health science graduate students should not prioritize higher-order over lower-order concerns, and that predetermined instructional approaches should be secondary to addressing students’ individual needs. Findings may be used to improve writing support services to meet the demand for health care practitioners in the United States. Findings may also encourage equitable access for individuals who have faced barriers to obtaining and completing graduate education.

    Analysing language learning websites

    The increasing use of technology in education has led to the emergence of a variety of computer-aided learning devices and materials, among them an increasing number of language learning websites. The aim of this paper is to create a framework of evaluation for analysing English language learning websites, with a focus on grammar and writing. In order to achieve this aim, deeper insights into the concept of computer-assisted language learning (CALL) are provided. Furthermore, the framework is based on the critical discussion and inclusion of appropriate theoretical concepts, empirical research, and personal reasoning on the basis of pedagogical, linguistic, multimodal, and technical concepts and beliefs. These considerations include insights from the fields of materials evaluation and socio-cultural theory, Second Language Acquisition (SLA), English Language Teaching (ELT), Communicative Language Teaching (CLT), multimodal processing theory, multi-channel communication theory, and usability evaluation. To ensure the applicability of the framework, it has been tested by applying the evaluation criteria in a website test of the language learning website BBC Learning English. This test has shown that, in spite of issues such as subjectivity or representativeness, the framework of evaluation is a valuable tool to assess the overall usability of language learning websites. In addition, the test results indicate that, while achieving relatively high overall usability rates, the tested language learning website might benefit from restructuring in specific areas.

    Cultural Dynamics in a Globalized World

    The book contains essays on current issues in arts and humanities in which peoples and cultures compete as well as collaborate in globalizing the world while maintaining their uniqueness, viewed from cross- and inter-disciplinary perspectives. The book covers areas such as literature, cultural studies, archaeology, philosophy, history, language studies, information and literacy studies, and area studies. Asia and the Pacific are the particular regions that the conference focuses on, as they have become new centers of knowledge production in arts and humanities and seem poised to grow significantly as major contributors of culture, science, and arts to the globalized world. The book will help shed light on what arts and humanities scholars in Asia and the Pacific have done in terms of research and knowledge development, as well as on the new frontiers of research that are being explored and opened up, which can connect the two regions with the rest of the globe.