Multi-type Obfuscation Corpus for CrossLingual Plagiarism Detection

Asghari, Habibollah; Mohtaj, Salar

Volume 17, Issue 2 (5-2025) itrc 2025, 17(2): 59-74

Asghari H, Mohtaj S. Multi-type Obfuscation Corpus for CrossLingual Plagiarism Detection. itrc 2025; 17(2): 59-74
URL: http://ijict.itrc.ac.ir/article-1-632-en.html

Habibollah Asghari¹, Salar Mohtaj²

1- Department of Advanced Information Systems, ICT Research Institute (ACECR), Tehran, Iran, habib.asghari@ictrc.ac.ir
2- Speech and Language Technology (SLT) Department, German Research Centre for Artificial Intelligence (DFKI), Labor Berlin, Berlin, Germany

Abstract:

In recent years, the wide availability of documents on the Internet has made plagiarism a serious issue in many fields of research. Moreover, machine translation systems make it easy to re-use textual content across languages, so detecting plagiarism in cross-lingual cases, where the source and target languages differ, is now of great importance. Various methods for automatic detection of text reuse have been developed to help human experts investigate suspicious documents for plagiarism cases. Evaluating the performance of these plagiarism detection systems and algorithms requires plagiarism detection corpora. In this paper, we propose an English-Persian plagiarism detection corpus comprising different types of paraphrasing. The goal is to simulate what humans would do to conceal plagiarized passages after translating the text into the target language. The proposed corpus includes seven types of paraphrasing methods that cover (but are not limited to) all of the obfuscation types used in previous works, combined into one integrated cross-lingual plagiarism detection (CLPD) corpus. To evaluate the corpus, an extrinsic evaluation approach has been applied by executing a wide variety of plagiarism detection algorithms as downstream tasks on the proposed corpus. The results show that the performance of the algorithms decreases as the obfuscation complexity increases.
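
The extrinsic evaluation described above can be illustrated with a minimal sketch that groups detection scores by obfuscation type. Everything in it is an assumption for illustration: the case format, the function names, the labels `type_a` and `type_b`, and the simplified character-level precision and recall are hypothetical and are not taken from the paper, which does not state its measures or name its seven paraphrasing types in the abstract.

```python
from collections import defaultdict


def char_set(spans):
    """Expand half-open (start, end) character spans into a set of offsets."""
    return {i for start, end in spans for i in range(start, end)}


def scores_by_obfuscation(cases):
    """Micro-averaged character-level precision, recall and F1 per obfuscation label.

    Each case is a dict with hypothetical keys:
      "obfuscation": label of the obfuscation type applied to the case
      "gold":        annotated plagiarized spans in the suspicious document
      "detected":    spans reported by the detector under test
    """
    # Per label: [overlapping chars, detected chars, gold chars]
    totals = defaultdict(lambda: [0, 0, 0])
    for case in cases:
        gold = char_set(case["gold"])
        detected = char_set(case["detected"])
        counts = totals[case["obfuscation"]]
        counts[0] += len(gold & detected)
        counts[1] += len(detected)
        counts[2] += len(gold)

    results = {}
    for label, (overlap, n_detected, n_gold) in totals.items():
        precision = overlap / n_detected if n_detected else 0.0
        recall = overlap / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = {"precision": precision, "recall": recall, "f1": f1}
    return results


if __name__ == "__main__":
    # Toy cases with made-up spans, only to show the call shape.
    toy_cases = [
        {"obfuscation": "type_a", "gold": [(0, 10)], "detected": [(0, 10)]},
        {"obfuscation": "type_b", "gold": [(0, 10)], "detected": [(5, 10)]},
    ]
    for label, scores in scores_by_obfuscation(toy_cases).items():
        print(label, scores)
```

Comparing the per-label scores for the same detector is one way to observe the reported trend, in which performance drops as the obfuscation becomes more complex.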

Keywords: Cross-lingual plagiarism detection, Corpus construction, Obfuscation strategy, Translation obfuscation

Type of Study: Research | Subject: Information Technology

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
