33,245 research outputs found
On the use of word embedding for cross language plagiarism detection
[EN] Cross language plagiarism is the unacknowledged reuse of text across language pairs. It occurs if a passage of text
is translated from a source language to a target language and no proper citation is provided. Although various methods have been
developed for detection of cross language plagiarism, less attention has been paid to measuring and comparing their performance,
especially when tackling different types of paraphrasing through translation. In this paper, we investigate various approaches to cross language plagiarism detection. Moreover, we present a novel approach to cross language plagiarism detection
using word embedding methods and explore its performance against other state-of-the-art plagiarism detection algorithms. In
order to evaluate the methods, we have constructed an English-Persian bilingual plagiarism detection corpus (referred to as
HAMTA-CL) comprising seven types of obfuscation. The results show that the word embedding approach outperforms the
other approaches with respect to recall when encountering heavily paraphrased passages. On the other hand, the translation-based
approach performs well when precision is the main consideration of the cross language plagiarism detection system.
Asghari, H.; Fatemi, O.; Mohtaj, S.; Faili, H.; Rosso, P. (2019). On the use of word embedding for cross language plagiarism detection. Intelligent Data Analysis. 23(3):661-680. https://doi.org/10.3233/IDA-183985
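The abstract does not say how the word embedding approach scores a pair of passages. As a rough illustration only, assuming both languages share one embedding space and that passages are compared by the cosine of their averaged word vectors (both are assumptions, not details taken from the paper), the idea might be sketched as follows. The vectors and the names `TOY_VECTORS`, `passage_vector`, and `cross_language_similarity` are hypothetical.

```python
import math

# Hypothetical toy vectors; a real system would load trained bilingual embeddings.
TOY_VECTORS = {
    "text": [0.9, 0.1, 0.0],
    "translation": [0.1, 0.9, 0.1],
    "weather": [0.0, 0.1, 0.9],
    "متن": [0.8, 0.2, 0.1],      # Persian for "text"
    "ترجمه": [0.2, 0.8, 0.0],    # Persian for "translation"
}


def passage_vector(tokens):
    """Average the vectors of the tokens that have an embedding."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_language_similarity(source_tokens, target_tokens):
    """Score a source/target passage pair; 0.0 when either side has no known word."""
    u, v = passage_vector(source_tokens), passage_vector(target_tokens)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)


related = cross_language_similarity(["text", "translation"], ["متن", "ترجمه"])
unrelated = cross_language_similarity(["weather"], ["متن", "ترجمه"])
print(related > unrelated)
```

A detector built this way would flag a pair when the score exceeds a threshold. Lowering that threshold tends to raise recall at the cost of precision, which is the trade-off the abstract reports between the embedding-based and translation-based approaches.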
Digital image correlation (DIC) analysis of the 3 December 2013 Montescaglioso landslide (Basilicata, Southern Italy). Results from a multi-dataset investigation
Image correlation remote sensing monitoring techniques are becoming key tools for
providing effective qualitative and quantitative information suitable for natural hazard assessments,
specifically for landslide investigation and monitoring. In recent years, these techniques have
been successfully integrated and shown to be complementary and competitive with more standard
remote sensing techniques, such as satellite or terrestrial Synthetic Aperture Radar interferometry.
The objective of this article is to apply the proposed in-depth calibration and validation analysis,
referred to as the Digital Image Correlation technique, to measure landslide displacement.
The availability of a multi-dataset for the 3 December 2013 Montescaglioso landslide, characterized
by different types of imagery, such as LANDSAT 8 OLI (Operational Land Imager) and TIRS
(Thermal Infrared Sensor), high-resolution airborne optical orthophotos, Digital Terrain Models
and COSMO-SkyMed Synthetic Aperture Radar, allows the actual landslide
displacement field to be retrieved, with values ranging from a few meters (2–3 m in the north-eastern sector of the
landslide) to 20–21 m (local peaks on the central body of the landslide). Furthermore, comprehensive
sensitivity analyses and statistics-based processing approaches are used to identify the role of the
background noise that affects the whole dataset. This noise has a directly proportional relationship to
the different geometric and temporal resolutions of the processed imagery. Moreover, the accuracy
of the environmental-instrumental background noise evaluation allowed the actual displacement
measurements to be correctly calibrated and validated, thereby leading to a better definition of
the threshold values of the maximum Digital Image Correlation sub-pixel accuracy and reliability
(ranging from 1/10 to 8/10 pixel) for each processed dataset.
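The abstract gives no algorithmic detail, so the following is only a generic one-dimensional illustration of how image correlation can yield a sub-pixel displacement, not the authors' processing chain. It assumes zero-mean normalized cross-correlation as the matching score and a parabolic fit around the correlation peak for the sub-pixel part. The function names `ncc` and `estimate_shift` and the synthetic intensity profile are hypothetical.

```python
import math


def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def estimate_shift(reference, moved, start, size, max_shift):
    """Estimate how far the window reference[start:start+size] moved, in pixels."""
    template = reference[start:start + size]
    scores = {}
    for s in range(-max_shift, max_shift + 1):
        lo = start + s
        if lo < 0 or lo + size > len(moved):
            continue  # candidate window would fall outside the second profile
        scores[s] = ncc(template, moved[lo:lo + size])
    if not scores:
        raise ValueError("no candidate window fits inside the second profile")
    best = max(scores, key=scores.get)
    # A parabola through the peak and its two neighbours gives the sub-pixel offset.
    if best - 1 in scores and best + 1 in scores:
        left, mid, right = scores[best - 1], scores[best], scores[best + 1]
        curvature = left - 2 * mid + right
        if curvature < 0:
            return best + 0.5 * (left - right) / curvature
    return float(best)


# Synthetic intensity profiles: the same bump, displaced by 3 pixels.
before = [math.exp(-((i - 20) / 4) ** 2) for i in range(50)]
after = [math.exp(-((i - 23) / 4) ** 2) for i in range(50)]
print(estimate_shift(before, after, start=10, size=20, max_shift=6))
```

In two dimensions the same search runs over row and column shifts for each window, and the pixel displacement is converted to meters with the ground resolution of the imagery.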
Neural Natural Language Inference Models Enhanced with External Knowledge
Modeling natural language inference is a very challenging task. With the
availability of large annotated data, it has recently become feasible to train
complex models such as neural-network-based inference models, which have been
shown to achieve state-of-the-art performance. Although relatively large
annotated datasets exist, can machines learn all the knowledge needed to perform
natural language inference (NLI) from these data? If not, how can
neural-network-based NLI models benefit from external knowledge, and how can
NLI models be built to leverage it? In this paper, we enrich state-of-the-art
neural natural language inference models with external knowledge. We
demonstrate that the proposed models improve neural NLI models to achieve
state-of-the-art performance on the SNLI and MultiNLI datasets.
Comment: Accepted by ACL 201
Query and Output: Generating Words by Querying Distributed Word Representations for Paraphrase Generation
Most recent approaches use the sequence-to-sequence model for paraphrase
generation. The existing sequence-to-sequence model tends to memorize the words
and the patterns in the training dataset instead of learning the meaning of the
words. Therefore, the generated sentences are often grammatically correct but
semantically improper. In this work, we introduce a novel model based on the
encoder-decoder framework, called Word Embedding Attention Network (WEAN). Our
proposed model generates the words by querying distributed word representations
(i.e. neural word embeddings), aiming to capture the meaning of the corresponding
words. Following previous work, we evaluate our model on two
paraphrase-oriented tasks, namely text simplification and short text
abstractive summarization. Experimental results show that our model outperforms
the sequence-to-sequence baseline by 6.3 and 5.5 BLEU on two
English text simplification datasets, and by 5.7 ROUGE-2 F1 on a
Chinese summarization dataset. Moreover, our model achieves state-of-the-art
performance on these three benchmark datasets.
Comment: arXiv admin note: text overlap with arXiv:1710.0231
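The idea of generating a word by querying word embeddings can be illustrated with a minimal sketch. The abstract does not give the scoring function used in WEAN, so this assumes a plain dot product between a query vector and each candidate embedding, followed by a softmax. The vocabulary, vectors, and the names `WORD_EMBEDDINGS` and `query_word` are hypothetical.

```python
import math

# Hypothetical toy vocabulary; a trained model would learn these vectors.
WORD_EMBEDDINGS = {
    "simple": [1.0, 0.0],
    "easy": [0.9, 0.2],
    "complex": [-1.0, 0.1],
}


def softmax(scores):
    """Turn raw scores into probabilities, shifting by the max for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_word(query):
    """Return (word, probability) for the embedding that best matches the query."""
    words = list(WORD_EMBEDDINGS)
    # Assumed scoring: dot product between the query and each word embedding.
    scores = [sum(q * e for q, e in zip(query, WORD_EMBEDDINGS[w])) for w in words]
    probs = softmax(scores)
    best = max(range(len(words)), key=probs.__getitem__)
    return words[best], probs[best]


word, prob = query_word([1.0, 0.0])
print(word)
```

In an encoder-decoder model the query would come from the decoder state at each step. Because the output is chosen by similarity in embedding space, words with close meanings receive close scores.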
WMT 2016 Multimodal translation system description based on bidirectional recurrent neural networks with double-embeddings
Bidirectional Recurrent Neural Networks (BiRNNs) have shown outstanding results on sequence-to-sequence learning tasks. This architecture is especially interesting for the multimodal machine translation task, since BiRNNs can deal with images and text. In most translation systems the same word embedding is fed to both BiRNN units. In this paper, we present several experiments to enhance a baseline sequence-to-sequence system (Elliott et al., 2015), for example, by using double embeddings. These embeddings are trained on the forward and backward directions of the input sequence. Our system is trained, validated and tested on the Multi30K dataset (Elliott et al., 2016) in the context of the WMT 2016 Multimodal Translation Task. The obtained results show that the double-embedding approach performs significantly better than the traditional single-embedding one.
Postprint (published version