23 research outputs found

    Methods for cross-language plagiarism detection

    NOTICE: this is the author's version (preprint) of a work that was accepted for publication in Knowledge-Based Systems. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Knowledge-Based Systems. 50:211-217. doi:10.1016/j.knosys.2013.06.018.
    Three reasons explain why plagiarism across languages is on the rise: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people immersed in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language other than their native one. Most efforts to detect cross-language plagiarism automatically depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages covering the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T + MA), three models that differ inherently in nature and in the resources they require. The three models are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T + MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher precision, an important factor in plagiarism detection when less user intervention is desired.
    Barrón Cedeño, LA.; Gupta, PA.; Rosso, P. (2013). Methods for cross-language plagiarism detection. Knowledge-Based Systems. 50:211-217. doi:10.1016/j.knosys.2013.06.018
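
    Of the three models named in this abstract, the character n-gram one is the easiest to illustrate. The following is a minimal sketch only, not the authors' implementation: it assumes character 3-grams, raw counts rather than any particular weighting, and a simple normalization (lowercasing, stripping diacritics and punctuation). The function names are hypothetical.

```python
import math
import re
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and replace punctuation with single spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]|_", " ", no_marks)
    return re.sub(r"\s+", " ", cleaned).strip()


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of the normalized text."""
    norm = normalize(text)
    return Counter(norm[i:i + n] for i in range(len(norm) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative sentences (assumed, not from the paper): a Spanish/English pair
# with shared word stems, and an unrelated English sentence.
spanish = char_ngrams("La detección automática de plagio")
english = char_ngrams("Automatic plagiarism detection")
unrelated = char_ngrams("The weather is nice today")

related_score = cosine(spanish, english)
unrelated_score = cosine(spanish, unrelated)
```

    Character n-grams only carry signal across languages that share a script and many cognates, which is why a sketch like this would be expected to work better for Spanish and English than for distant language pairs.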

    The publication of press releases as journalistic information. Comparative study of two Spanish newspapers

    The distinction of what constitutes a "newsworthy event" can give rise to many interpretations. In this world of telematic accessibility, a consequence of globalization, events and occurrences of all kinds can be classified as news simply by dressing them up as news. According to style and communication manuals, news has characteristics of its own: relevance, social interest, and proximity, among others. Press releases have been perfected as a result of increasingly sophisticated public relations agencies, and with them the thin line between information and advertising has become blurred. In this article, we compare press releases issued by public and private companies with news briefs published in the economy sections of newspapers. As will be seen, many of them coincide and share some similarities. The sample uses briefs published during the first half of 2014 in El Mundo and La Vanguardia, the paid Spanish-language newspapers that rank prominently in the Estudio General de Medios analysis. The methodology uses the Maple program with its DetectPlagiarism command to perform an ad hoc comparison of the texts. The default copy threshold for DetectPlagiarism is 0.35. The similarity indices between the briefs and the press releases for La Vanguardia and El Mundo show values above this threshold.
    Peer Reviewed. Postprint (published version)

    The overlap between press releases and economy briefs: quantitative and qualitative analysis of two Spanish newspapers

    This research aims to compare press releases from companies and public bodies with the briefs published by the media. The goal is to detect whether a press release is processed or fact-checked before being published or whether, on the contrary, it is disseminated just as it arrives in the newsroom. What is sought, then, is to establish numerically what percentage of a brief coincides with or is similar to the press release. The sample covers half a year, the first half of 2014, and El Mundo and La Vanguardia were chosen, the paid general-interest newspapers in Spain that carry the most company briefs as a fixed part of a section ("Economy"). Subsequently, a series of in-depth interviews (between 2015 and 2018) helped guide the interpretation of the data obtained. Methodologically, a specific tool, DetectPlagiarism, was used to compare texts by means of the SimilarityScore command. The texts obtained are so similar that they may well have been reproduced verbatim or contain some portion of text that was copied. With the DetectPlagiarism command, the default copy threshold is 0.35. The similarity indices between the briefs and the press releases for La Vanguardia and El Mundo have a mean value of 0.41, above this threshold.
    Postprint (published version)

    A Uyghur text similarity detection method using N-grams and semantic analysis

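    At present, most similarity estimation for natural-language text targets widely used languages such as English. To enable similarity detection for Uyghur text, a similarity detection method based on N-grams and semantic analysis is proposed. First, according to the characteristics of Uyghur words, an N-gram statistical model is used to obtain terms, and a term-text relation matrix is built from the frequency with which terms occur in the texts, serving as the text model. Then, latent semantic analysis (LSA) is used to obtain the hidden associations between terms and their texts, thereby addressing the ambiguity of Uyghur word meanings and yielding an accurate similarity. Experiments on a collection of plagiarized texts containing reordering and synonym substitution show that the method detects similarity accurately and effectively.
    Supported by the National Natural Science Foundation of China (61762086) and the Scientific Research Program of Higher Education Institutions of the Xinjiang Uygur Autonomous Region (XJEDU2016S090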
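
    The two steps this abstract describes (a term-text frequency matrix, then LSA) can be sketched roughly as below. This is an illustration under stated assumptions rather than the paper's method: character 3-grams stand in for the N-gram-derived terms, the example documents are English and invented, LSA is taken to be a truncated SVD via NumPy, and all function names are hypothetical.

```python
import numpy as np


def char_ngrams(text, n=3):
    """Overlapping character n-grams, used here as stand-in 'terms'."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def term_document_matrix(docs, n=3):
    """Rows are terms, columns are documents, cells are raw frequencies."""
    vocab = sorted({g for d in docs for g in char_ngrams(d, n)})
    index = {g: i for i, g in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for g in char_ngrams(d, n):
            matrix[index[g], j] += 1
    return matrix


def lsa_document_vectors(matrix, k=2):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    return (np.diag(s[:k]) @ vt[:k]).T  # one row per document


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Invented example: two near-duplicate sentences and one unrelated sentence.
docs = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stock prices fell sharply",
]
vectors = lsa_document_vectors(term_document_matrix(docs), k=2)
near_duplicate_score = cosine(vectors[0], vectors[1])
unrelated_score = cosine(vectors[0], vectors[2])
```

    The choice of k is an assumption here; in practice it would be tuned, since too small a latent space merges unrelated texts and too large a one reduces LSA to plain term matching.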

    Detecting Cross-Language Plagiarism using Open Knowledge Graphs

    Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs like Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.

    Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

    Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas, is an open research problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving the detection of concealed academic plagiarism, primarily in Science, Technology, Engineering and Mathematics (STEM). We make the following contributions: i) We present a two-stage detection process that combines similarity assessments of mathematical content, academic citations, and text. ii) We introduce new similarity measures that consider the order of mathematical features and outperform the measures in our prior research. iii) We compare the effectiveness of the math-based, citation-based, and text-based detection approaches using confirmed cases of academic plagiarism. iv) We demonstrate that the combined analysis of math-based and citation-based content features allows identifying potentially suspicious cases in a collection of 102K STEM documents. Overall, we show that analyzing the similarity of mathematical content and academic citations is a striking supplement for conventional text-based detection approaches for academic literature in the STEM disciplines.
    Comment: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) 2019. The data and code of our study are openly available at https://purl.org/hybridP

    Interlinking English and Chinese RDF data sets using machine translation

    Data interlinking is a difficult task, particularly in a multilingual environment like the Web. In this paper, we evaluate the suitability of a machine translation approach for interlinking RDF resources described in English and Chinese. We represent resources as text documents, and the similarity between documents is taken as the similarity between resources. Documents are represented as vectors using two weighting schemes, and cosine similarity is then computed. The experiment demonstrates that TF*IDF with a minimal number of preprocessing steps can yield good results.
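
    The vector representation and comparison this abstract mentions can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it assumes the resource descriptions have already been machine-translated into one language, uses whitespace tokenization as the only preprocessing, and uses the plain tf * log(N/df) variant of TF*IDF. The documents and function names are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Minimal preprocessing: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (tf * log(N / df))."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented resource descriptions, assumed to be in one language already.
docs = [
    "beijing capital city of china",
    "beijing is the capital of china",
    "paris capital city of france",
]
vecs = tfidf_vectors(docs)
same_entity_score = cosine(vecs[0], vecs[1])
different_entity_score = cosine(vecs[0], vecs[2])
```

    Terms that occur in every document, such as "capital" and "of" here, get weight zero, so the comparison rests on the terms that actually distinguish one resource from another.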