18 research outputs found

    Revisiting the challenges and surveys in text similarity matching and detection methods

    The massive amount of information on the internet has revolutionized the field of natural language processing. One resulting challenge is estimating the similarity between texts, which has remained an open research problem even though various studies have proposed new methods over the years. This paper surveyed and traced the primary studies in the field of text similarity. The aim was to give a broad overview of existing issues, applications, and methods of text similarity research. The paper identified four issues and several applications of text similarity matching, and classified current studies into intrinsic, extrinsic, and hybrid approaches. It then identified the methods and classified them into lexical-similarity, syntactic-similarity, semantic-similarity, structural-similarity, and hybrid methods. Finally, the study analyzed and discussed method improvements, current limitations, and open challenges as directions for future research.

    Plagiarism detection for Indonesian texts

    As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for automatic plagiarism checkers is becoming more pressing. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not well developed: most studies deal with detecting duplicate or near-duplicate documents, do not address the problem of retrieving source documents, or tend to measure document similarity globally. As a result, the systems produced by this research cannot point to the exact locations of "similar passage" pairs. In addition, no public, standard corpus has been available for evaluating PDS on Indonesian texts. To address these weaknesses, this thesis develops a plagiarism detection system that executes various methods for the plagiarism detection stages in a workflow system. In the retrieval stage, a novel document feature called the phraseword is introduced and used along with word unigrams and character n-grams to address the problem of retrieving source documents whose contents are copied partially or obfuscated in a suspicious document. The detection stage, which uses a two-step paragraph-based comparison, aims to detect and locate source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally weighted significant terms, in order to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created through simulation by human writers and through algorithmic random generation. Using this corpus, the proposed methods were evaluated in three scenarios. In the first scenario, which evaluated source retrieval performance, some methods using phraseword and token features achieved the optimum recall rate of 1. In the second scenario, which evaluated detection performance, our system was compared to Alvi's algorithm and evaluated at four levels of measurement: character, passage, document, and case. The experimental results showed that the methods using tokens as seeds score higher than Alvi's algorithm at all four levels, in both artificial and simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, shaken, and paraphrased passages. However, Alvi's recognition rate on summarized passages is insignificantly higher than that of our system. The third scenario showed the same tendency in its results, except that the precision rates of Alvi's algorithm at the character and paragraph levels are higher than those of our system. The higher Plagdet scores produced by some methods in our system, compared with Alvi's scores, show that this study has fulfilled its objective of implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts. When run on our test document corpus, Alvi's highest scores for recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores when it was tested on the PAN'14 corpus. Thus, this study has contributed a standard evaluation corpus for assessing PDS for Indonesian documents. This study also contributes a source retrieval algorithm that introduces phrasewords as document features, and a paragraph-based text alignment algorithm that relies on two different strategies. One of them is to apply the local word weighting used in the text summarization field to select seeds both for discriminating paragraph pair candidates and for the matching process. The proposed detection algorithm produces almost no multiple detections, which contributes to its strength.

    Musical Landscapes: Theophile Gautier and the Evolution of Nineteenth Century French Poetry

    Theophile Gautier's first edition of Emaux et camees (1852) marks the juncture at which Romantic, Neoclassical, and nascent Symbolist poetic theories converged under the umbrella ideology of Parnassianism. Emaux et camees synthesizes the aesthetics promoted by these diverse groups, primarily by 1) using musical and painterly language, 2) emphasizing correspondences among arts, and 3) paradoxically demanding an attention to form and the artist's labor while also emphasizing art's inutility during a century characterized by Progress. Gautier's Emaux et camees bridges painterly and musical poetics to create a new model for poetry. While the vocabulary of painting captivated many nineteenth-century writers, music became increasingly admired by poets because of its freedom from representation, and as an intention-less language. Musical poets indemnified the mantra art for art's sake and touted the intermingling of art forms, belief systems, and cultural practices during a time when usefulness, authoritarian rule, and homogeny were staunchly reinforced in the political and public spheres. Emaux et camees appeared in 1852, marking a point of departure for poetry. Gautier preserved earlier poetic principles, but also invested a robust work ethic and a devotion to form in his collection. Numerous offshoot poetic groups arose as a result of Gautier, who had reclaimed music's nuanced, fragmented, performative, and anti-utilitarian nature for poetry and poetics.

    Science in the Forest, Science in the Past

    This collection brings together leading anthropologists, historians, philosophers, and artificial-intelligence researchers to discuss the sciences and mathematics used in various Eastern, Western, and Indigenous societies, both ancient and contemporary. The authors analyze prevailing assumptions about these societies and propose more faithful, sensitive analyses of their ontological views about reality, a step toward mutual understanding and translatability across cultures and research fields. Science in the Forest, Science in the Past is a pioneering interdisciplinary exploration that will challenge the way readers interested in sciences, mathematics, humanities, social research, computer sciences, and education think about deeply held notions of what constitutes reality, how it is apprehended, and how to investigate it.

    El Che vive: memory, cinema, art and politics

    Thesis (doctorate) - Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Inglês: Estudos Linguísticos e Literários, Florianópolis, 2020.
    Abstract: Che Guevara, who died more than fifty years ago, keeps resurging through images. This fascination with Che's images is explained by the concept of the anxiety of remembering and non-remembering, caused by the demand for remembrance and redemption, in the Benjaminian sense (LÖWY; BENJAMIN, 2005), which is expressed through Che's ghostly look. Such a ghostly look is ambivalent and can potentially lead to artistic imaginations and emancipatory actions that recreate Che, or to capitalist appropriations or other ways of trying to control Che's images. In this doctoral dissertation, some bridges are created between Marxism and decolonial thought, such as between Arendt's concept of creative action (ARENDT, 1998), Bloch's anticipatory consciousness (BLOCH, 1996) and cosmovision (WILSON, 2001; LACLAU, 2016; ANZALDUA, 2012), and in the broad understanding of the concept of alienation/fetish. Several contemporary examples of artistic imaginations and emancipatory actions are discussed, challenging the rhetoric of the supposed political irrelevance of Che's images. Attempts at appropriation by capitalist corporations, by a Nazi movement, and by a graphic artist are also discussed on the basis of a plural redefinition of the theory of alienation. A short story and two poems of my own about Che are also discussed in the dissertation, as well as five films: The Last Hours of Che Guevara (THE LAST HOURS, 2016), El Dia que Me Quieras (EL DIA, 1997), El Che de los Gays (EL CHE DE LOS GAYS, 2004), Personal Che (PERSONAL CHE, 2007), and Che! (1969).

    Cherry Valley and the Uses of Memory

    Excerpt from Introduction: The Uses of Memory. One of the distinctions of historical maps dealing with the encounters occurring on the frontier between two separate cultures is a unique symbol to indicate massacres. The United States comprised, throughout its pre- and actual history, one great frontier that saw many such exchanges. Some of these were mortal, and many of them, one-sided in nature, tended to be called massacres by the losing side, in an attempt to salvage some moral high ground. However, no one disputes that what happened at Cherry Valley, New York, on 11 November 1778, was a massacre. On that date, Iroquois and Loyalist Rangers raided the hamlet of Cherry Valley on the New York frontier, south of the Mohawk Valley. The raid destroyed the settlement and forced the evacuation of the fort. Forty people died, most of them unarmed civilians. This minor episode seemed to give birth to a considerable body of work, comprising various histories from diverse viewpoints, and works of fiction including dramatic literature and motion pictures. The first question that arose from this material, in the course of preparing research for a historical paper, was simple and factual: 1. Is it possible to find the truth of what happened that day?...
    Master's. College of Arts and Sciences: Liberal Studies. University of Michigan. http://deepblue.lib.umich.edu/bitstream/2027.42/117732/1/Paradis.pd