A Tidy Data Model for Natural Language Processing using cleanNLP

Abstract:

Recent advances in natural language processing have produced libraries that extract low-level features from a collection of raw texts. These features, known as annotations, are usually stored internally in hierarchical, tree-based data structures. This paper proposes a data model to represent annotations as a collection of normalized relational data tables optimized for exploratory data analysis and predictive modeling. The R package cleanNLP, which calls one of two state-of-the-art NLP libraries (CoreNLP or spaCy), is presented as an implementation of this data model. It takes raw text as input and returns a list of normalized tables. Specific annotations provided include tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, dependency parsing, coreference resolution, and word embeddings. The package currently supports input text in English, German, French, and Spanish.
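The idea of replacing a tree-based annotation with normalized tables can be illustrated with a minimal, language-agnostic sketch. The Python below is not cleanNLP's API and does not reproduce its table schema; the nested input, the column names (`doc_id`, `sid`, `tid`, `tid_start`, `tid_end`) and the helpers `flatten` and `entity_text` are all hypothetical, chosen only to show how one row per token and one row per entity, linked by shared keys, can stand in for a nested structure.

```python
# Hypothetical nested annotation for a single document, similar in shape to
# the tree-based output an NLP library might hold internally.
# This is an illustrative assumption, not cleanNLP's actual format.
nested = {
    "doc_id": 1,
    "sentences": [
        {
            "tokens": [
                {"word": "Taylor", "upos": "PROPN"},
                {"word": "writes", "upos": "VERB"},
                {"word": "R", "upos": "PROPN"},
                {"word": ".", "upos": "PUNCT"},
            ],
            # Entity spans refer to 1-based token positions in the sentence.
            "entities": [{"start": 1, "end": 1, "type": "PERSON"}],
        },
        {
            "tokens": [
                {"word": "It", "upos": "PRON"},
                {"word": "works", "upos": "VERB"},
                {"word": ".", "upos": "PUNCT"},
            ],
            "entities": [],
        },
    ],
}


def flatten(doc):
    """Return (token_rows, entity_rows): one dict per token and per entity."""
    token_rows, entity_rows = [], []
    for sid, sentence in enumerate(doc["sentences"], start=1):
        for tid, token in enumerate(sentence["tokens"], start=1):
            token_rows.append({
                "doc_id": doc["doc_id"], "sid": sid, "tid": tid,
                "word": token["word"], "upos": token["upos"],
            })
        for ent in sentence["entities"]:
            entity_rows.append({
                "doc_id": doc["doc_id"], "sid": sid,
                "tid_start": ent["start"], "tid_end": ent["end"],
                "type": ent["type"],
            })
    return token_rows, entity_rows


def entity_text(entity, token_rows):
    """Join an entity row back to the token table on (doc_id, sid, tid)."""
    words = [
        t["word"] for t in token_rows
        if t["doc_id"] == entity["doc_id"]
        and t["sid"] == entity["sid"]
        and entity["tid_start"] <= t["tid"] <= entity["tid_end"]
    ]
    return " ".join(words)


tokens, entities = flatten(nested)
```

Because every row carries its own keys, the two tables can be filtered, grouped, and joined independently, which is the property the abstract refers to when it describes the tables as suited to exploratory analysis and predictive modeling.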

Author

Taylor Arnold

Published

June 27, 2017

Received

Mar 27, 2017

DOI

10.32614/RJ-2017-035

Volume

9/2

Pages

248 - 267

Supplementary materials

Supplementary materials are available in addition to this article. They can be downloaded as RJ-2017-035.zip

CRAN packages used

dplyr, ggplot2, magrittr, broom, janitor, tidyr, cleanNLP, tidytext, StanfordCoreNLP, coreNLP, XML, spacyr, NLP, lda, lsa, topicmodels, sqliter, rJava, sotu, glmnet

CRAN Task Views implied by cited packages

NaturalLanguageProcessing, WebTechnologies, Graphics, HighPerformanceComputing, MachineLearning, Phylogenetics, Survival

Footnotes

    Reuse

    Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

    Citation

    For attribution, please cite this work as

    Arnold, "A Tidy Data Model for Natural Language Processing using cleanNLP", The R Journal, 2017

    BibTeX citation

    @article{RJ-2017-035,
      author = {Arnold, Taylor},
      title = {A Tidy Data Model for Natural Language Processing using cleanNLP},
      journal = {The R Journal},
      year = {2017},
      note = {https://doi.org/10.32614/RJ-2017-035},
      doi = {10.32614/RJ-2017-035},
      volume = {9},
      issue = {2},
      issn = {2073-4859},
      pages = {248-267}
    }