198 research outputs found

    Plagiarism Detection Techniques for Arabic Script Languages: A Literature Review

    Plagiarism is generally defined as literary theft and academic dishonesty. It is considered a serious issue in academic documents and texts. Numerous plagiarism detection techniques have been developed for various natural languages, mainly English. In this paper we investigate and review the plagiarism detection techniques and algorithms that have been developed for Arabic Script Languages (ASL), and provide a literature review of the methods used in terms of techniques and outcomes. The results of this paper will help researchers who are starting development work or extending their research in ASL such as Arabic, Persian, Urdu, and Kurdish.

    A cluster-based external plagiarism and parallel corpora detection method

    Ankara: The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent Univ., 2011. Thesis (Master's) -- Bilkent University, 2011. Includes bibliographical references, leaves 60-64. Today, different editions and translations of the same literary text can be found. Intuitively, translations based on the same literary text are expected to possess significantly similar structure. In the same way, a text suspected of plagiarism may share structural similarities with the text believed to be its source. Textual plagiarism means using an author’s text, work, or idea in another textual work without giving a reference or without obtaining the permission of the original author. Existing intrinsic and external plagiarism detection methods tend to detect plagiarism cases within a given dataset so that the algorithms run in a reasonable amount of time; hence a reference document set is built against which these algorithms search for plagiarism cases. In this thesis, a method for detecting and quantifying external plagiarism and parallel corpora is introduced. For this purpose, we use structural similarities to analyze the plagiarism detection problem and to quantify the similarity between given texts. In this method, suspicious and source texts are partitioned into corresponding blocks. Each block is represented as a group of documents, where a document consists of a fixed number of words. Then, blocks are indexed and clustered using the cover coefficient clustering algorithm. Cluster formations for both texts are then analyzed and their similarities are measured. The results on the PAN’09 plagiarism dataset and on different versions of the famous literary classic Leylâ and Mecnun show that the proposed method successfully detects and quantifies structurally similar plagiarism cases and succeeds in detecting parallel corpora. Karbeyaz, Ceyhun Efe. M.S.
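
The abstract above gives no code, so the following is only a minimal sketch of the first steps it describes: cutting a text into documents of a fixed number of words and computing the cover coefficient matrix on which cover-coefficient-based clustering relies. Whitespace tokenization, raw term counts, and the names `split_into_documents`, `cover_coefficients`, `estimated_cluster_count`, and `words_per_doc` are assumptions made for illustration, not the thesis's implementation. The sketch stops at the matrix and the estimated number of clusters (the sum of its diagonal); seed selection and cluster assignment are omitted.

```python
from collections import Counter


def split_into_documents(text, words_per_doc):
    """Cut a text into consecutive 'documents' of at most words_per_doc words."""
    words = text.lower().split()
    return [words[i:i + words_per_doc] for i in range(0, len(words), words_per_doc)]


def cover_coefficients(docs):
    """Return the m x m cover coefficient matrix for tokenized documents.

    c[i][j] = alpha_i * sum_k d[i][k] * beta_k * d[j][k], where d is the
    document-by-term count matrix, alpha_i is the reciprocal of row i's sum
    and beta_k is the reciprocal of column k's sum. Each row sums to 1.
    """
    counts = [Counter(doc) for doc in docs]
    term_totals = Counter()
    for c in counts:
        term_totals.update(c)

    m = len(docs)
    matrix = [[0.0] * m for _ in range(m)]
    for i, ci in enumerate(counts):
        alpha = 1.0 / sum(ci.values())
        for j, cj in enumerate(counts):
            # Counter returns 0 for terms that document j does not contain.
            matrix[i][j] = alpha * sum(
                freq * cj[term] / term_totals[term] for term, freq in ci.items()
            )
    return matrix


def estimated_cluster_count(matrix):
    """Sum of the diagonal (decoupling coefficients) of the matrix."""
    return sum(matrix[i][i] for i in range(len(matrix)))


if __name__ == "__main__":
    block = "the rose and the nightingale the rose and the nightingale"
    docs = split_into_documents(block, words_per_doc=5)
    c = cover_coefficients(docs)
    print(c, estimated_cluster_count(c))
```

A diagonal entry close to 1 means a document shares little vocabulary with the others; comparing such matrices (or the resulting cluster formations) for a suspicious block and a source block is one way the structural similarity described above could be quantified.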

    Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization

    Translation alignment is an essential task in Digital Humanities and Natural Language Processing; it aims to link words and phrases in the source text with their equivalents in the translation. In addition to its importance in teaching and learning historical languages, translation alignment builds bridges between ancient and modern languages through which various linguistic annotations can be transferred. This thesis focuses on word-level translation alignment applied to historical languages in general and Ancient Greek and Latin in particular. As the title indicates, the thesis addresses four interdisciplinary aspects of translation alignment. The starting point was developing Ugarit, an interactive annotation tool for manual alignment, with the aim of gathering data to train an automatic alignment model. This effort resulted in more than 190k accurate translation pairs that I later used for supervised training. Ugarit has been used by many researchers and scholars, and also in the classroom at several institutions for teaching and learning ancient languages, which resulted in a large, diverse, crowd-sourced aligned parallel corpus. This corpus allowed us to conduct experiments and qualitative analysis to detect recurring patterns in annotators’ alignment practice and in the generated translation pairs. Further, I employed recent advances in NLP and language modeling to develop an automatic alignment model for historical low-resourced languages, experimenting with various training objectives and proposing a training strategy for historical languages that combines supervised and unsupervised training with mono- and multilingual texts. Then, I integrated this alignment model into other development workflows to project cross-lingual annotations and induce bilingual dictionaries from parallel corpora. Evaluation is essential to assess the quality of any model. To ensure best practice, I reviewed the current evaluation procedure, defined its limitations, and proposed two new evaluation metrics. Moreover, I introduced a visual analytics framework to explore and inspect alignment gold standard datasets and to support quantitative and qualitative evaluation of translation alignment models. In addition, I designed and implemented visual analytics tools and reading environments for parallel texts and proposed various visualization approaches to support different alignment-related tasks, employing the latest advances and best practice in information visualization. Overall, this thesis presents a comprehensive study that includes manual and automatic alignment techniques, evaluation methods, and visual analytics tools that aim to advance the field of translation alignment for historical languages.

    An Urdu semantic tagger - lexicons, corpora, methods and tools

    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics, and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (a.k.a. semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. These tools identify or tag only part of the core semantic information in language data; moreover, they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words), based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, is still desirable for the Urdu language in order to provide comprehensive semantic analysis of Urdu text. This research work reports on the development of an Urdu semantic tagging tool and discusses the challenging issues faced in this Ph.D. research. Since standard NLP pipeline tools are not widely available for Urdu, a suite of newly developed tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer, and a part-of-speech tagger. Results for these proposed tools are as follows: the word tokenizer reports an F1 of 94.01% and an accuracy of 97.21%, the sentence tokenizer shows an F1 of 92.59% and an accuracy of 93.15%, and the POS tagger shows an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical, or hybrid techniques. Furthermore, all semantic lexicons have been developed using a novel combination of automatic or semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings, and named entity. A large multi-target annotated corpus is also constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; the proposed corpus is also used to train and test supervised multi-target Machine Learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming Loss of 0.06% and an Accuracy of 0.94%. The best lexical coverages of 88.59%, 99.63%, 96.71% and 89.63% are obtained on several test corpora. The developed Urdu semantic tagger shows an encouraging precision of 79.47% on the proposed test corpus.
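
For readers unfamiliar with the Hamming Loss reported above, the following is a small sketch of the standard textbook definition for multi-label predictions: the fraction of individual label slots that are predicted incorrectly. The function name `hamming_loss` and the toy label matrices are illustrative assumptions and have no connection to the corpus or classifiers in the abstract.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots where prediction and ground truth differ.

    y_true and y_pred are equally shaped lists of 0/1 label rows,
    one row per instance and one column per label.
    """
    total_slots = sum(len(row) for row in y_true)
    wrong_slots = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong_slots / total_slots


if __name__ == "__main__":
    # Two instances, three candidate labels each; one slot is wrong.
    y_true = [[1, 0, 1], [0, 1, 0]]
    y_pred = [[1, 0, 0], [0, 1, 0]]
    print(hamming_loss(y_true, y_pred))
```

Lower values are better: a loss of 0 means every label of every instance was predicted correctly.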

    The Esoteric, the Islamicate, and 20th Century World Literature

    By exploring the intersections of the esoteric and the islamicate in a series of 20th century literary works from disparate global locations, this dissertation maps out a constellation of countercultural world literature as a model for further advancing the study of literature and esotericism in a planetary context. Chapters are focused on literary works of Iranian Sādeq Hedāyat (1903-1951), Argentine Jorge Luis Borges (1899-1986), and the cut-up collaborations of American William S. Burroughs (1914-1997) and British-Canadian Brion Gysin (1916-1986). Using the statement 'writing is magic and labour,' I argue that these four authors yearned to attain ‘magic’ in their creative writing, while each had their own distinct definition and understanding of what this ‘magic’ would be. These definitions and understandings have been largely shaped by each author’s particular encounters with esoteric and islamicate discourses; they are also products of their ‘labour’: practices and strategies of writing and research affected by the social and political power dynamics of the fields of global cultural production and circulation. Hedāyat’s conception of magic, formed through encounters with European, Islamic, and Zoroastrian esoteric discourses, chiefly refers to practices and texts associated with the ancient magus (Zoroastrian priestly class) that through centuries of religious conflict have transfigured into something distant and incomprehensible. This magic becomes the subject of extensive folklore research for Hedāyat, and is further used and invoked in his works of fiction. For Borges, magic refers to the unexplainable quality of aesthetic events, a quality that flees rational justification. His explorations in pantheism, which expand to a range of esoteric currents such as Kabbalah and Gnosticism, find in the islamicate a culture that has grappled with questions on the nature of divinity and on writing being sacred and magical. In the cut-up collaborations of Burroughs-Gysin, the magic of writing is in the randomness of the process as well as the speech act of language, while its labour is primarily dependent on using scissors instead of conventional instruments of writing. Inspired by the islamicate milieu of post-war Tangier, Burroughs-Gysin opened up new possibilities for writing and for human-machine collaborations that are still influencing the electronic literature of the 21st century.

    The Future of Information Sciences : INFuture2009 : Digital Resources and Knowledge Sharing


    Mixed-Language Arabic-English Information Retrieval

    Includes abstract. Includes bibliographical references. This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, however, it is first essential to suppress the impact of the problems that are caused by the mixed-language feature in both queries and documents and that bias the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, the term frequency, document frequency, and document length components in mixed queries are estimated and adjusted regardless of language, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in the non-English language) would likely outweigh and skew the impact of technical terms (mostly those in English), because the latter have high document frequencies (and thus low weights) in their corresponding collection (mostly the English collection). This phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes a reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries.
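
The weighting problem described above can be illustrated with plain textbook IDF, log(N / df). The sketch below is not the thesis's re-weighted IDF, whose formula the abstract does not give; it only shows, on invented toy collections, why a technical English term that occurs in many English documents receives a lower weight than a rarer non-English term when each is scored against its own collection. The function name `idf` and both collections are hypothetical.

```python
import math


def idf(term, collection):
    """Textbook inverse document frequency: log(N / df), or 0.0 if df is 0."""
    n_docs = len(collection)
    df = sum(1 for doc in collection if term in doc)
    return math.log(n_docs / df) if df else 0.0


if __name__ == "__main__":
    # Toy collections (invented): each document is a set of terms.
    english_docs = [
        {"sorting", "algorithm"},
        {"graph", "algorithm"},
        {"hashing", "algorithm"},
        {"compiler", "parser"},
    ]
    arabic_docs = [
        {"شرح", "مفهوم"},
        {"مثال", "بسيط"},
        {"درس", "جديد"},
        {"كتاب", "قديم"},
    ]

    technical = idf("algorithm", english_docs)   # df = 3 of 4 documents
    non_technical = idf("شرح", arabic_docs)      # df = 1 of 4 documents
    print(technical, non_technical)
```

In a mixed query containing both terms, the non-English term would dominate the score under this plain weighting, which is the imbalance the proposed re-weighting aims to moderate.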

    Bollywood eclipsed : the postmodern aesthetics, scholarly appeal, and remaking of contemporary popular Indian cinema

    This thesis uses postmodern theory to explore aesthetic shifts in post-millennial Bollywood cinema, with a particular focus on films produced by the Bombay film industry over the past nine years (2000-2009) and the recent boom of Hindi cross-cultural and self-remakes. My research investigates reasons behind the lack of appeal of Bollywood films in the West (particularly in their contemporary form), revealing how our understanding and appreciation of them is restricted or misinformed by a long history of censure from critics, scholars, educators and ambassadors of the Indian cinema. Through my analysis of the function and effects of cultural appropriation and postmodern traits in several recent popular Indian films, I expose Bollywood's unique film language in order to raise our appreciation of this cinema and suggest ways in which it can be better incorporated into future film studies courses. My analysis is based on a study of over a hundred contemporary Bollywood remakes and includes close textual analysis and case studies of a wide variety of popular Bollywood films, including: Dil Chahta Hai (2001), Abhay (2001), Kaante (2002), Devdas (2002), Koi Mil Gaya (2003), Sarkar (2005), Krrish (2006) and Om Shanti Om (2007). In my conclusion, I offer a redefinition of contemporary Bollywood and I consider postmodernism's usefulness as a tool for teaching Indian cinema and its value as an international cultural phenomenon.

    The Making of the Humanities, Volume III. The Modern Humanities

    This comprehensive history of the humanities focuses on the modern period (1850-2000). The contributors, including Floris Cohen, Lorraine Daston and Ingrid Rowland, survey the rise of the humanities in interaction with the natural and social sciences, offering new perspectives on the interaction between disciplines in Europe and Asia and new insights generated by digital humanities.