SimilaR: R Code Clone and Plagiarism Detection

Abstract:

Third-party software for assuring source code quality is becoming increasingly popular. Tools that evaluate the coverage of unit tests, perform static code analysis, or inspect run-time memory use are crucial in the software development life cycle. More sophisticated methods allow for meta-analyses of large software repositories, e.g., to discover the abstract topics they relate to or the common design patterns applied by their developers. Such methods may be useful in gaining a better understanding of component interdependencies, avoiding cloned code, and detecting plagiarism in programming classes. A meaningful measure of similarity between computer programs often forms the basis of such tools. While there are a few noteworthy instruments for similarity assessment, none of them turns out to be particularly suitable for analysing R code chunks. Existing solutions rely on rather simple techniques and heuristics and fail to provide the user with the sensitivity and specificity required for working with R scripts. To fill this gap, we propose a new algorithm based on a Program Dependence Graph, implemented in the SimilaR package. It can serve as a tool not only for improving R code quality but also for detecting plagiarism, even when it has been masked by obfuscation techniques or by inserting dead code. We demonstrate its accuracy and efficiency in a real-world case study.


Authors

Maciej Bartoszuk

Marek Gagolewski

Published

Sept. 9, 2020

Received

Apr 1, 2020

DOI

10.32614/RJ-2020-017

Volume

12/1

Pages

367-385

Supplementary materials

Supplementary materials are available in addition to this article. They can be downloaded at RJ-2020-017.zip

CRAN packages used

magrittr, SimilaR, nortest, DescTools

CRAN Task Views implied by cited packages

MissingData, WebTechnologies

Footnotes

    Reuse

    Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

    Citation

    For attribution, please cite this work as

    Bartoszuk & Gagolewski, "SimilaR: R Code Clone and Plagiarism Detection", The R Journal, 2020

    BibTeX citation

    @article{RJ-2020-017,
      author = {Bartoszuk, Maciej and Gagolewski, Marek},
      title = {SimilaR: R Code Clone and Plagiarism Detection},
      journal = {The R Journal},
      year = {2020},
      note = {https://doi.org/10.32614/RJ-2020-017},
      doi = {10.32614/RJ-2020-017},
      volume = {12},
      issue = {1},
      issn = {2073-4859},
      pages = {367-385}
    }