The stringdist Package for Approximate String Matching

Abstract:

Comparing text strings in terms of distance functions is a common and fundamental task in many statistical text-processing applications. Thus far, string distance functionality has been somewhat scattered around R and its extension packages, leaving users with inconsistent interfaces and encoding handling. The stringdist package was designed to offer a low-level interface to several popular string distance algorithms, which have been re-implemented in C for this purpose. The package offers distances based on counting q-grams, edit-based distances, and some lesser-known heuristic distance functions. Based on this functionality, the package also offers inexact matching equivalents of R’s native exact matching functions match and %in%.
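To make the distance families named above concrete, here is a minimal sketch in Python of one edit-based distance (Levenshtein), one q-gram distance, and a distance-bounded lookup in the spirit of an inexact match. It is an illustration only, not the package's C implementation or its R interface: the function names levenshtein, qgram_distance, and approx_match and the max_dist parameter are hypothetical, chosen for this example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the q-gram counts of a and b."""
    grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(grams_a[g] - grams_b[g])
               for g in grams_a.keys() | grams_b.keys())


def approx_match(x: str, table: list, max_dist: int = 1):
    """Index of the closest entry of table within max_dist, else None.

    Hypothetical helper: uses Levenshtein distance and 0-based indices,
    unlike R's match, which is 1-based and returns NA for no match.
    """
    best_index, best_dist = None, max_dist + 1
    for i, candidate in enumerate(table):
        d = levenshtein(x, candidate)
        if d < best_dist:  # strict comparison keeps the first of tied entries
            best_index, best_dist = i, d
    return best_index


print(levenshtein("leia", "leela"))
print(qgram_distance("leia", "leela", q=2))
print(approx_match("leia", ["uhura", "leela"], max_dist=2))
```

An inexact counterpart of %in% follows directly from the last helper: a string counts as present when approx_match returns something other than None.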


Author

Mark P.J. van der Loo

Published

April 26, 2014

Received

Nov 4, 2013

DOI

10.32614/RJ-2014-011

Volume

6/1

Pages

111 - 122

CRAN packages used

kernlab, RecordLinkage, MiscPsycho, cba, MKmisc, deducorrect, vwr, stringdist, textcat, TraMineR

CRAN Task Views implied by cited packages

OfficialStatistics, Cluster, NaturalLanguageProcessing, Graphics, MachineLearning, Multivariate, Optimization, Survival

Footnotes

    Reuse

    Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

    Citation

    For attribution, please cite this work as

    Loo, "The R Journal: The stringdist Package for Approximate String Matching", The R Journal, 2014

    BibTeX citation

    @article{RJ-2014-011,
      author = {Loo, Mark P.J. van der},
      title = {The R Journal: The stringdist Package for Approximate String Matching},
      journal = {The R Journal},
      year = {2014},
      note = {https://doi.org/10.32614/RJ-2014-011},
      doi = {10.32614/RJ-2014-011},
      volume = {6},
      issue = {1},
      issn = {2073-4859},
      pages = {111-122}
    }