R Packages to Aid in Handling Web Access Logs

Abstract:

Web access logs contain information on HTTP(S) requests and form a key part of both industry and academic explorations of human behaviour on the internet. But the preparation (reading, parsing and manipulation) of that data is just unique enough to make generalized tools a poor fit for the task, costing both programming time and processing time, and these costs are compounded by the large data sets common with web access logs. In this paper we explain and demonstrate a series of packages designed to efficiently read in, parse and munge access log data, allowing researchers to handle URLs and IP addresses easily. These packages are substantially faster than existing R methods, with speedups ranging from 3-500% for file reading to 57,000% for URL parsing.

Authors

Oliver Keyes

Bob Rudis

Jay Jacobs

Published

June 12, 2016

Received

Jan 29, 2016

DOI

10.32614/RJ-2016-026

Volume

8/1

Pages

360 - 366

CRAN packages used

httr, ApacheLogProcessor, webreadr, readr, microbenchmark, urltools, XML, lubridate, iptools, rgeolocate, Rcpp

CRAN Task Views implied by cited packages

WebTechnologies, HighPerformanceComputing, NumericalMathematics, ReproducibleResearch, TimeSeries

Footnotes

    Reuse

    Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

    Citation

    For attribution, please cite this work as

    Keyes, et al., "The R Journal: R Packages to Aid in Handling Web Access Logs", The R Journal, 2016

    BibTeX citation

    @article{RJ-2016-026,
      author = {Keyes, Oliver and Rudis, Bob and Jacobs, Jay},
      title = {The R Journal: R Packages to Aid in Handling Web Access Logs},
      journal = {The R Journal},
      year = {2016},
      note = {https://doi.org/10.32614/RJ-2016-026},
      doi = {10.32614/RJ-2016-026},
      volume = {8},
      issue = {1},
      issn = {2073-4859},
      pages = {360-366}
    }