The R Journal: Finding Optimal Normalizing Transformations via bestNormalize

Ryan A. Peterson

doi:10.32614/RJ-2021-041

Supplementary materials are available in addition to this article. It can be downloaded at RJ-2021-041.zip

CRAN packages used

CRAN Task Views implied by cited packages

References

M. S. Bartlett. The use of transformations. Biometrics, 3(1): 39–52, 1947. URL https://doi.org/10.2307/3001536.

P. J. Bickel and K. A. Doksum. An analysis of transformations revisited. Journal of the American Statistical Association, 76(374): 296–311, 1981. URL https://doi.org/10.1080/01621459.1981.10477649.

G. E. P. Box and D. R. Cox. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), 26(2): 211–252, 1964. URL https://doi.org/10.2307/2984418.

P. Breheny and W. Burchett. Visualization of regression models using visreg. The R Journal, 9(2): 56–71, 2017.

R. B. D’Agostino. Goodness-of-fit-techniques. CRC press, 1986.

R. Gaujoux. doRNG: Generic reproducible parallel backend for ’foreach’ loops. 2020. URL https://CRAN.R-project.org/package=doRNG. R package version 1.8.2.

G. M. Goerg. Lambert w random variables-a new family of generalized skewed distributions with applications to risk estimation. The Annals of Applied Statistics, 5(3): 2197–2230, 2011. URL https://doi.org/10.1214/11-AOAS457.

J. Gross and U. Ligges. Nortest: Tests for normality. 2015. URL https://CRAN.R-project.org/package=nortest. R package version 1.0-4.

F. E. Harrell. Regression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis. Springer, 2015.

J. A. John and N. R. Draper. An alternative family of transformations. Journal of the Royal Statistical Society. Series C (Applied Statistics), 29(2): 190–197, 1980. URL https://doi.org/10.2307/2986305.

M. Kuhn. Caret: Classification and regression training. 2017. URL https://CRAN.R-project.org/package=caret. R package version 6.0-78.

M. Kuhn and K. Johnson. Applied predictive modeling. Springer, 2013. URL https://doi.org/10.1007/978-1-4614-6849-3.

M. Kuhn and D. Vaughan. Yardstick: Tidy characterizations of model performance. 2020. URL https://CRAN.R-project.org/package=yardstick. R package version 0.0.7.

M. Kuhn and H. Wickham. Recipes: Preprocessing tools to create design matrices. 2018. URL https://CRAN.R-project.org/package=recipes. R package version 0.1.2.

M. Kuhn and H. Wickham. Tidymodels: Easily install and load the ’tidymodels’ packages. 2020. URL https://CRAN.R-project.org/package=tidymodels. R package version 0.1.0.

B. F. J. Manly. Exponential data transformations. Journal of the Royal Statistical Society. Series D (The Statistician), 25(1): 37–42, 1976. URL https://doi.org/10.2307/2988129.

R. A. Peterson and J. E. Cavanaugh. Ordered quantile normalization: A semiparametric transformation built for the cross-validation era. Journal of Applied Statistics, 1–16, 2019. URL https://doi.org/10.1080/02664763.2019.1630372.

P. Royston. Estimating departure from normality. Statistics in Medicine, 10(8): 1283–1293, 1991. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780100811.

H. C. Thode. Testing for normality. CRC press, 2002.

B. Van der Waerden. Order tests for the two-sample problem and their power. In Indagationes mathematicae (proceedings), pages. 453–458 1952. Elsevier.

W. N. Venables and B. D. Ripley. Modern applied statistics with s. Fourth New York: Springer, 2002. URL http://www.stats.ox.ac.uk/pub/MASS4. ISBN 0-387-95457-0.

S. N. Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B), 73(1): 3–36, 2011.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

	Overall (N=6,283)
Make
Acura	185 (2.9%)
Buick	252 (4.0%)
Chevrolet	1,257 (20.0%)
GMC	492 (7.8%)
Honda	1,029 (16.4%)
Hyundai	381 (6.1%)
Mazda	272 (4.3%)
Nissan	735 (11.7%)
Pontiac	63 (1.0%)
Toyota	1,202 (19.1%)
Volkswagen	415 (6.6%)
Price ($)
Mean (SD)	17,145 (8,346)
Range	722 - 64,998
Mileage
Mean (SD)	63,638 (49,125)
Range	2 - 325,556
Year
Mean (SD)	2011.9 (3.5)
Range	2000.0 - 2016.0
Age (years old)
Mean (SD)	5.1 (3.5)
Range	1.0 - 17.0

Variable	Estimate	Std. Error	t value	Pr(>\|t\|)
Intercept	0.005	0.010	0.553	0.58
g(Mileage)	-0.234	0.016	-14.966	< 0.001
g(Age)	-0.441	0.016	-27.134	< 0.001

	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.	NA’s
MAE
glmnet	0.181	0.184	0.186	0.189	0.194	0.198	0
earth	0.147	0.151	0.154	0.155	0.158	0.163	0
rf	0.136	0.141	0.143	0.144	0.147	0.157	0
RMSE
glmnet	0.242	0.247	0.252	0.256	0.264	0.276	0
earth	0.203	0.209	0.214	0.217	0.226	0.235	0
rf	0.193	0.208	0.213	0.210	0.215	0.217	0
RSQ
glmnet	0.767	0.772	0.785	0.782	0.789	0.801	0
earth	0.807	0.833	0.845	0.842	0.855	0.864	0
rf	0.835	0.845	0.855	0.854	0.860	0.873	0

Method	RMSE	RSQ	MAE
glmnet	4076	0.772	2847
earth	3619	0.814	2500
rf	3257	0.853	2294

Finding Optimal Normalizing Transformations via bestNormalize

Author

Affiliation

Published

Received

DOI

Volume

Pages

Introduction

Normalization methods

The Box-Cox transformation

The Yeo-Johnson transformation

The Lambert W x F transformation

The Ordered Quantile technique

Other included transformations

Other not-included transformations

Which transformation “best normalizes” the data?

Simple examples

Basic implementation

Performing transformations individually

In-sample normalization efficacy

Out-of-sample normalization efficacy

Important features

Improving speed of estimation

Implementation with caret, recipes

Additional customization

1) Adding user-defined functions

2) Re-defining normality

Application to Autotrader data

Background

Transform-both-sides regression

Implementation with recipes

Discussion

References

Supplementary materials

CRAN packages used

CRAN Task Views implied by cited packages

Footnotes

References

Reuse

Citation