Identifying Machine-Paraphrased Plagiarism (2103.11909v7)

Published 22 Mar 2021 in cs.CL, cs.AI, and cs.DL

Abstract: Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine-learning classifiers and eight state-of-the-art neural language models. We analyzed preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using different configurations of the tools SpinBot and SpinnerChief. The best-performing technique, Longformer, achieved an average F1 score of 81.0% (F1=99.7% for SpinBot and F1=71.6% for SpinnerChief cases), while human evaluators achieved F1=78.4% for SpinBot and F1=65.6% for SpinnerChief cases. We show that the automated classification alleviates shortcomings of widely-used text-matching systems, such as Turnitin and PlagScan. To facilitate future research, all data, code, and two web applications showcasing our contributions are openly available at https://github.com/jpwahle/iconf22-paraphrase.

Summary

The paper demonstrates the superior efficacy of Transformer-based models, especially Longformer, in detecting machine-paraphrased plagiarism with an F1 score of 81.0% overall.
It evaluates detection on academic texts paraphrased with SpinBot and SpinnerChief, where Longformer scores higher than human evaluators on both tools (F1=99.7% vs. 78.4% for SpinBot and F1=71.6% vs. 65.6% for SpinnerChief).
The study underscores the potential for integrating advanced AI models into plagiarism detection systems to enhance academic integrity verification.

Identifying Machine-Paraphrased Plagiarism

Using paraphrasing tools to mask plagiarized content is a significant threat to academic integrity across educational and research settings. The paper "Identifying Machine-Paraphrased Plagiarism" addresses this issue by evaluating how well different techniques distinguish human-written text from machine-paraphrased text. Specifically, it compares five pre-trained word embedding models combined with machine-learning (ML) classifiers against eight neural language models based on the Transformer architecture.

Key Findings

The paper evaluates multiple detection techniques on research paper preprints, graduation theses, and Wikipedia articles that were paraphrased with the tools SpinBot and SpinnerChief. Among the analyzed techniques, Longformer performed best, with an average F1 score of 81.0%: 99.7% on SpinBot cases but only 71.6% on SpinnerChief cases. Human evaluators scored lower on both tools (F1=78.4% for SpinBot and F1=65.6% for SpinnerChief), so Longformer surpasses human identification in both paraphrasing conditions.
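For readers unfamiliar with the metric, the following minimal sketch shows the standard binary definition of F1 as the harmonic mean of precision and recall. The counts are hypothetical and are not taken from the paper, and this summary does not state how the paper averaged F1 across datasets and tools.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct detections: precision and recall are zero (or undefined).
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a detector evaluated on 200 paraphrased paragraphs.
example = f1_score(true_positives=180, false_positives=30, false_negatives=20)
print(f"F1 = {example:.3f}")
```

Because F1 penalizes both missed paraphrases and false alarms, it is a stricter summary than accuracy when the two error types matter equally.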

The paper finds that Transformer-based models such as BERT, RoBERTa, and DistilBERT capture the characteristics of machine-paraphrased text better than the baselines. In particular, Transformer variants that modify BERT's attention mechanism, such as Longformer, clearly outperform the baseline approach of pairing traditional pre-trained word embedding models like GloVe or word2vec with classic ML classifiers.
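The baseline idea of pairing word embeddings with a classic classifier can be sketched as follows. This is a toy illustration only: the two-dimensional vectors are made up, and a nearest-centroid rule stands in for the ML classifiers used in the paper, which this summary does not name.

```python
from statistics import fmean

# Toy 2-dimensional word vectors (made-up values). Real GloVe or word2vec
# vectors have hundreds of dimensions and are learned from large corpora.
TOY_VECTORS = {
    "use": [0.9, 0.1], "method": [0.8, 0.2], "show": [0.7, 0.1],
    "utilize": [0.1, 0.9], "technique": [0.2, 0.8], "exhibit": [0.1, 0.7],
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [fmean(dim) for dim in zip(*vectors)]


def centroid(texts):
    """Mean embedding of a list of training texts."""
    return [fmean(dim) for dim in zip(*(embed(t) for t in texts))]


def classify(text, centroids):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    vec = embed(text)

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))

    return min(centroids, key=distance)


# Hypothetical training examples for each class.
centroids = {
    "original": centroid(["we use this method", "results show the method"]),
    "paraphrased": centroid(["we utilize this technique", "results exhibit the technique"]),
}
print(classify("they utilize a technique", centroids))  # prints "paraphrased"
```

Averaging discards word order and context, which is one plausible reason contextual Transformer models do better on this task.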

Implications and Future Directions

The success of Longformer and similar models in identifying machine-paraphrased content suggests that integrating them into existing text-matching software could significantly improve detection. Text-matching systems like Turnitin and PlagScan have limited effectiveness against paraphrasing tools, so a paraphrase classifier would be a valuable addition. It could serve as a complementary component in academic integrity checks, alerting reviewers to potential cases of misconduct.
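A minimal sketch of such a complementary check is shown below. The `Report` fields, the `review_flags` helper, and both thresholds are hypothetical and chosen only for illustration; they are not part of the paper, Turnitin, or PlagScan.

```python
from dataclasses import dataclass


@dataclass
class Report:
    # Share of the submission matched to known sources, in [0, 1].
    text_match_score: float
    # Classifier's estimate that the text is machine-paraphrased, in [0, 1].
    paraphrase_probability: float


def review_flags(report, match_threshold=0.30, paraphrase_threshold=0.80):
    """Return the reasons a submission should be sent to a human reviewer.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    flags = []
    if report.text_match_score >= match_threshold:
        flags.append("text-match")
    if report.paraphrase_probability >= paraphrase_threshold:
        flags.append("machine-paraphrase")
    return flags


# A paraphrased submission can slip past text matching but still be flagged.
print(review_flags(Report(text_match_score=0.05, paraphrase_probability=0.93)))
```

The two signals are kept separate rather than merged into one score so that a reviewer can see which check raised the alert.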

Moving forward, expanding the dataset to include a broader range of paraphrasing tools, topics, and languages is anticipated to further improve detection accuracy. The authors advocate for a collaborative open data approach to support the extension of paraphrase detection research. Additionally, exploring automatic text generation by neural models could simulate more realistic paraphrased content, providing enriched training data for future AI detection systems.

Overall, the paper shows that automated classification can detect machine-paraphrased plagiarism that text-matching systems miss, and it offers a practical way to strengthen existing academic integrity checks.

GitHub

GitHub - jpwahle/iconf22-paraphrase: The official implementation of the iConference 2022 paper "Identifying Machine-Paraphrased Plagiarism". (16 stars)