ICT Research Institute, ACECR, Tehran, Iran, habib.asghari@ictrc.ac.ir
Abstract:
In recent years, the wide availability of documents on the Internet has made plagiarism a serious issue in many fields of research. Moreover, machine translation systems facilitate the reuse of textual content across languages. The detection of cross-lingual plagiarism, in which the source and target languages differ, is therefore of great importance. Various methods for the automatic detection of text reuse have been developed to help human experts investigate suspicious documents for plagiarism cases. Evaluating the performance of these plagiarism detection systems and algorithms requires plagiarism detection corpora. In this paper, we propose an English-Persian plagiarism detection corpus comprising different types of paraphrasing. The goal is to simulate what a human would do to conceal plagiarized passages after translating the text into the target language. The proposed corpus includes seven types of paraphrasing methods that cover, but are not limited to, all of the obfuscation types used in previous works, combined into one integrated cross-lingual plagiarism detection (CLPD) corpus. To evaluate the corpus, an extrinsic evaluation approach was applied by running a wide variety of plagiarism detection algorithms as downstream tasks on the proposed corpus. The results show that the performance of the algorithms decreases as obfuscation complexity increases.