Crowdsourced Data Preprocessing with R and Amazon Mechanical Turk

Abstract:

This article introduces the use of the Amazon Mechanical Turk (MTurk) crowdsourcing platform as a resource for R users to leverage crowdsourced human intelligence for preprocessing “messy” data into a form easily analyzed within R. The article first describes MTurk and the MTurkR package, then outlines how to use MTurkR to gather and manage crowdsourced data with MTurk using some of the package’s core functionality. Potential applications of MTurkR include construction of manually coded training sets, human transcription and translation, manual data scraping from scanned documents, content analysis, image classification, and the completion of online survey questionnaires, among others. As an example of massive data preprocessing, the article describes an image rating task involving 225 crowdsourced workers and more than 5500 images using just three MTurkR function calls.


Author

Thomas J. Leeper

Published

June 12, 2016

Received

Oct 30, 2015

DOI

10.32614/RJ-2016-020

Volume

8/1

Pages

276-288

CRAN packages used

MTurkR, MTurkRGUI, tcltk, curl, XML

CRAN Task Views implied by cited packages

WebTechnologies


    Reuse

    Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

    Citation

    For attribution, please cite this work as

    Leeper, "Crowdsourced Data Preprocessing with R and Amazon Mechanical Turk", The R Journal, 2016

    BibTeX citation

    @article{RJ-2016-020,
      author = {Leeper, Thomas J.},
      title = {Crowdsourced Data Preprocessing with R and Amazon Mechanical Turk},
      journal = {The R Journal},
      year = {2016},
      note = {https://doi.org/10.32614/RJ-2016-020},
      doi = {10.32614/RJ-2016-020},
      volume = {8},
      issue = {1},
      issn = {2073-4859},
      pages = {276-288}
    }