class: center, middle, inverse, title-slide # Wrangling categorical variables in R ##
https://ecodatasci-tlv.github.io/factors/
### 2019-02-27 --- # What are factors? R's representation of categorical variable. Consists of: 1. A set of _values_ 2. An ordered set of valid _levels_ -- ```r eyes <- factor(c("blue", "green", "green"), levels = c("blue", "brown", "green")) eyes ``` ``` ## [1] blue green green ## Levels: blue brown green ``` --- class: inverse, middle, center # Some examples --- # What you see is NOT what you get ```r (treatment <- c(50, 0, 50, 100)) ``` ``` ## [1] 50 0 50 100 ``` -- ```r (treatment_f <- factor(treatment)) ``` ``` ## [1] 50 0 50 100 ## Levels: 0 50 100 ``` -- ```r as.numeric(treatment_f) ``` ``` ## [1] 2 1 2 3 ``` --- class: middle, center # Factors are stored as _integers_ --- class: middle .center[ .img-small[  ]] ```r (drinks_a <- factor(c("coffee", "beer", "beer"))) ``` ``` ## [1] coffee beer beer ## Levels: beer coffee ``` ```r (drinks_b <- factor(c("lemonade", "coffee"))) ``` ``` ## [1] lemonade coffee ## Levels: coffee lemonade ``` --- class: center, middle # Factor levels are _alphabetical_ by default .img-small[  ] --- # Predict the output ```r drinks_a <- factor(c("coffee", "beer", "beer")) drinks_b <- factor(c("lemonade", "coffee")) *c(drinks_a, drinks_b) ``` -- ``` ## [1] 2 1 1 2 1 ``` -- ```r *c(drinks_a, "tea") ``` -- ``` ## [1] "2" "1" "1" "tea" ``` -- ```r *as.numeric(c(drinks_a, "tea")) ``` -- ``` ## Warning: NAs introduced by coercion ``` ``` ## [1] 2 1 1 NA ``` --- class: center, middle  --- # Let's take two data frames... .pull-left[ ```r a ``` ``` ## # A tibble: 3 x 2 ## name drink ## <chr> <fct> ## 1 Alon coffee ## 2 Ela beer ## 3 Shaked beer ``` ```r b ``` ``` ## # A tibble: 2 x 2 ## name drink ## <chr> <fct> ## 1 Oren lemonade ## 2 Erez coffee ``` ] --- # We now know that this will fail .pull-left[ ```r a ``` ``` ## # A tibble: 3 x 2 ## name drink ## <chr> <fct> ## 1 Alon coffee ## 2 Ela beer ## 3 Shaked beer ``` ```r b ``` ``` ## # A tibble: 2 x 2 ## name drink ## <chr> <fct> ## 1 Oren lemonade ## 2 Erez coffee ``` ] .pull-right[ ```r a$drink ``` ``` ## [1] coffee beer beer ## Levels: beer coffee ``` ```r b$drink ``` ``` ## [1] lemonade coffee ## Levels: coffee lemonade ``` ```r *c(a$drink, b$drink) ``` ``` ## [1] 2 1 1 2 1 ``` ] --- # But sometimes, R handles factor magic for you... .pull-left[ ```r a ``` ``` ## # A tibble: 3 x 2 ## name drink ## <chr> <fct> ## 1 Alon coffee ## 2 Ela beer ## 3 Shaked beer ``` ```r b ``` ``` ## # A tibble: 2 x 2 ## name drink ## <chr> <fct> ## 1 Oren lemonade ## 2 Erez coffee ``` ] .pull-right[ ```r *rbind(a, b) ``` ``` ## # A tibble: 5 x 2 ## name drink ## <chr> <fct> ## 1 Alon coffee ## 2 Ela beer ## 3 Shaked beer ## 4 Oren lemonade ## 5 Erez coffee ``` ] --- # The tidyverse reminds you to pay attention to types! .pull-left[ ```r a ``` ``` ## # A tibble: 3 x 2 ## name drink ## <chr> <fct> ## 1 Alon coffee ## 2 Ela beer ## 3 Shaked beer ``` ```r b ``` ``` ## # A tibble: 2 x 2 ## name drink ## <chr> <fct> ## 1 Oren lemonade ## 2 Erez coffee ``` ] .pull-right[ ```r *bind_rows(a, b) ``` ``` ## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character ``` ``` ## Warning in bind_rows_(x, .id): binding character and factor vector, ## coercing into character vector ## Warning in bind_rows_(x, .id): binding character and factor vector, ## coercing into character vector ``` ``` ## # A tibble: 5 x 2 ## name drink ## <chr> <chr> ## 1 Alon coffee ## 2 Ela beer ## 3 Shaked beer ## 4 Oren lemonade ## 5 Erez coffee ``` ] --- class: center, middle  --- # Why care? By default, `stringsAsFactors = TRUE` when you use `read.csv()` Be aware when you: - try to combine (`c()`) factors - try to convert to numeric -- Better practice: Use `stringsAsFactors = FALSE` or `read_csv`. **Only use factors when YOU decide it's useful** --- <blockquote class="twitter-tweet tw-align-center" data-lang="en"><p lang="en" dir="ltr">10 years after convincing all scientists to switch to R: "10% of genetics papers contain error due to stringsAsFactors"</p>— Austin Richardson (@heyaudy) <a href="https://twitter.com/heyaudy/status/770658444286189569?ref_src=twsrc%5Etfw">August 30, 2016</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> <blockquote class="twitter-tweet tw-align-center" data-lang="en" style="text-align: center;"><p lang="en" dir="ltr">I hope R's default stringsAsFactors=TRUE steps on a Lego.</p>— dr allison barner (@algaebarnacle) <a href="https://twitter.com/algaebarnacle/status/1088932294427996160?ref_src=twsrc%5Etfw">January 25, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> <blockquote class="twitter-tweet tw-align-center" data-lang="en"><p lang="en" dir="ltr">How do I hate thee stringsAsFactors = TRUE? Let me count the ways <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a></p>— Hadley Wickham (@hadleywickham) <a href="https://twitter.com/hadleywickham/status/624349074636976128?ref_src=twsrc%5Etfw">July 23, 2015</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> --- class: inverse, middle, center [](https://twitter.com/JennyBryan/status/630052228624068609) --- class: center, middle  ```r library(tidyverse) # or if you only want to load forcats: # library(forcats) ``` --- # forcats overview * `fct_infreq()`: reorder a factor by the frequency of values * `fct_rev()`: reverse the order of factors * `fct_relevel()`: change the order of a factor by hand * `fct_collapse()`: collapse factor levels into manually defined groups * `fct_lump()`: collapse the least/most frequent values of a factor into "other" * `fct_reorder()`: reorder a factor by another variable Source: https://forcats.tidyverse.org/ --- class: center, middle background-image: url(img/forpoke_with_ball_real.png) background-position: center background-size: auto 80% --- ```r # usethis::use_course("http://bit.ly/ecodata-factors-ex") pokemon <- read_csv("data/pokemon.csv") ``` --- ```r # usethis::use_course("http://bit.ly/ecodata-factors-ex") pokemon ``` ``` ## # A tibble: 721 x 23 ## number name type_1 type_2 total hp attack defense sp_atk sp_def ## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 Bulb… Grass Poison 318 45 49 49 65 65 ## 2 2 Ivys… Grass Poison 405 60 62 63 80 80 ## 3 3 Venu… Grass Poison 525 80 82 83 100 100 ## 4 4 Char… Fire <NA> 309 39 52 43 60 50 ## 5 5 Char… Fire <NA> 405 58 64 58 80 65 ## 6 6 Char… Fire Flying 534 78 84 78 109 85 ## 7 7 Squi… Water <NA> 314 44 48 65 50 64 ## 8 8 Wart… Water <NA> 405 59 63 80 65 80 ## 9 9 Blas… Water <NA> 530 79 83 100 85 105 ## 10 10 Cate… Bug <NA> 195 45 30 35 20 20 ## # ... with 711 more rows, and 13 more variables: speed <dbl>, ## # generation <dbl>, is_legendary <lgl>, color <chr>, ## # has_gender <lgl>, pr_male <dbl>, egg_group_1 <chr>, ## # egg_group_2 <chr>, has_mega_evolution <lgl>, height_m <dbl>, ## # weight_kg <dbl>, catch_rate <dbl>, body_style <chr> ``` --- class: exercise # Exercise How many types of pokemon are there? How many body styles do they have? .bottom-right[  ] --- Let's get graphing! ```r theme_set(theme_minimal()) # change the default theme options(ggplot2.continuous.colour = "viridis", # change the default colors ggplot2.continuous.fill = "viridis")) ``` .center[  ] --- exclude: true ```r ggplot(pokemon, aes(weight_kg, height_m)) + geom_point() + scale_x_log10() + scale_y_log10() + facet_wrap(.~ body_style) ``` <img src="index_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> --- ```r pokemon %>% ggplot(aes(type_1)) + geom_bar() + coord_flip() ``` <img src="index_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" /> --- ```r pokemon %>% * mutate(type_1 = fct_infreq(type_1)) %>% ggplot(aes(type_1)) + geom_bar() + coord_flip() ``` <img src="index_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" /> --- ```r pokemon %>% mutate(type_1 = fct_infreq(type_1), * type_1 = fct_rev(type_1)) %>% ggplot(aes(type_1)) + geom_bar() + coord_flip() ``` <img src="index_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" /> --- ```r pokemon %>% mutate(type_1 = fct_infreq(type_1), type_1 = fct_rev(type_1), * type_1 = fct_relevel(type_1, "Normal")) %>% ggplot(aes(type_1)) + geom_bar() + coord_flip() ``` <img src="index_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" /> --- class: exercise # Exercise How do you make this graph? <img src="index_files/figure-html/exercise-2-1.png" style="display: block; margin: auto;" /> .bottom-right[  ] --- ```r pokemon %>% ggplot(aes(body_style)) + geom_bar() + coord_flip() ``` <img src="index_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" /> --- ```r pokemon %>% * mutate(body_style = fct_collapse( * body_style, * head = c("head_base", "head_only", "head_arms", "head_legs"), * bipedal = c("bipedal_tailless", "bipedal_tailed") * )) %>% ggplot(aes(body_style)) + geom_bar() + coord_flip() ``` <img src="index_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> --- ```r pokemon %>% mutate(body_style = fct_collapse( body_style, head = c("head_base", "head_only", "head_arms", "head_legs"), bipedal = c("bipedal_tailless", "bipedal_tailed") ), * body_style = fct_lump(body_style, 3)) %>% ggplot(aes(body_style)) + geom_bar() + coord_flip() ``` <img src="index_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> --- class: exercise # Exercise How do you make this graph? <img src="index_files/figure-html/exercise-3-1.png" style="display: block; margin: auto;" /> .bottom-right[  ] --- ```r pokemon %>% ggplot(aes(generation, attack)) + geom_boxplot() ``` ``` ## Warning: Continuous x aesthetic -- did you forget aes(group=...)? ``` <img src="index_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> --- ```r pokemon %>% * mutate(generation = factor(generation)) %>% ggplot(aes(generation, attack)) + geom_boxplot() ``` <img src="index_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" /> --- ```r pokemon %>% mutate(generation = factor(generation)) %>% * mutate(generation = fct_reorder(generation, attack)) %>% ggplot(aes(generation, attack)) + geom_boxplot() ``` <img src="index_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" /> --- class: exercise # Discuss How is `fct_infreq()` different from `fct_reorder()`? How are they similar? .bottom-right[  ] --- ```r pokemon %>% count(generation, body_style) ``` ``` ## # A tibble: 80 x 3 ## generation body_style n ## <dbl> <chr> <int> ## 1 1 bipedal_tailed 34 ## 2 1 bipedal_tailless 21 ## 3 1 four_wings 4 ## 4 1 head_arms 5 ## 5 1 head_base 7 ## 6 1 head_legs 7 ## 7 1 head_only 7 ## 8 1 insectoid 6 ## 9 1 multiple_bodies 4 ## 10 1 quadruped 25 ## # ... with 70 more rows ``` --- ```r pokemon %>% count(generation, body_style) %>% ggplot(aes(generation, body_style, size = n, color = n)) + geom_point() ``` <img src="index_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" /> --- ```r pokemon %>% * mutate(generation = factor(generation)) %>% count(generation, body_style) %>% ggplot(aes(generation, body_style, size = n, color = n)) + geom_point() ``` <img src="index_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> --- class: middle, center, inverse # Statistics with factors --- ## Statistical tests that can be used when your response is categorical 1. Chi-square 2. Logistic regression 3. Multinomial logistic regression --- class: middle, center ## Chi-square Tests whether there is a different between the expected and the observed frequencies of one or more categories. .small[ In other words, use it when your response and predictor are categorical ] --- # Example: Are electric pokemons are more yellow than expected by chance? .center[  ] --- .small[ ```r #create a table that has the frequency of colors to types of pokemons chi_data <- table(pokemon$type_1,pokemon$color) chi_data ``` ``` ## ## Black Blue Brown Green Grey Pink Purple Red White Yellow ## Bug 3 4 3 7 8 0 6 16 4 12 ## Dark 8 5 2 0 4 0 3 4 1 1 ## Dragon 1 9 1 4 1 0 3 2 2 1 ## Electric 2 6 0 1 3 1 0 3 4 16 ## Fairy 0 1 0 0 0 7 1 0 8 0 ## Fighting 0 4 6 1 6 0 2 2 2 2 ## Fire 0 0 12 0 0 0 0 27 0 8 ## Flying 0 0 0 1 0 0 2 0 0 0 ## Ghost 9 0 4 0 0 0 8 0 1 1 ## Grass 0 5 6 41 3 4 0 1 3 3 ## Ground 1 1 12 4 6 0 2 2 0 2 ## Ice 0 10 3 0 2 1 0 2 5 0 ## Normal 3 6 37 1 10 13 5 4 9 5 ## Poison 1 5 2 3 0 0 17 0 0 0 ## Psychic 3 6 4 6 2 6 7 1 6 6 ## Rock 0 9 11 3 10 1 2 2 0 3 ## Steel 1 4 3 2 10 0 0 0 1 1 ## Water 0 59 4 5 4 8 7 9 6 3 ``` ] --- ```r #run the chi-square test chi_test <- chisq.test(chi_data) ``` ``` ## Warning in chisq.test(chi_data): Chi-squared approximation may be incorrect ``` ```r chi_test ``` ``` ## ## Pearson's Chi-squared test ## ## data: chi_data ## X-squared = 1072.1, df = 153, p-value < 2.2e-16 ``` --- ### Look at the residuls to see what is significantly different than expected by chance ```r res <- as.data.frame(chi_test$residuals) head(res) ``` ``` ## Var1 Var2 Freq ## 1 Bug Black 0.12192828 ## 2 Dark Black 6.06157722 ## 3 Dragon Black -0.06316109 ## 4 Electric Black 0.31820289 ## 5 Fairy Black -0.86862399 ## 6 Fighting Black -1.05336131 ``` --- ##### Let's plot the chi-square residuals and find out how type is associated with color ```r ggpubr::ggballoonplot(res, size = "Freq",fill = "Freq")+ scale_fill_viridis_c(option = "D") ``` <img src="index_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" /> --- class: exercise # Test whether `grass` egg group pokemons hatch into green adults more than expected by chance  --- class: middle, center # Logistic regression A statistical test to study the relationship between a binary response and one or more continious predictors. At the center of the logistic regression analysis is the task estimating the log odds of an event. --- ## Test assumptions * The dependent variable should be dichotomous in nature * There should be no outliers in the data * here should be no high correlations (multicollinearity) among the predictors --- ## Important to remember * **Logistic regression is sensative to overfitting** - adding more predictors will alsways increase the varience explained, but too many predictors can create overfitting and reduces the generalizability of the model beyond your data * **Logistic regresion has a pseudo R-square** - report it with caution because Numerous pseudo-R-square tend to have many computational problems that cause them to be artificially high or low. A solution: present the results of a goodness of fit test such as Hosmer-Lemeshow --- # Example: Does pokemon becomes legendary with the in increase in `speed`, `attack`, and `catch_rate` abilities? --- .small[ ```r mylogit <- glm(is_legendary ~ speed, data = pokemon, family = "binomial") summary(mylogit) ``` ``` ## ## Call: ## glm(formula = is_legendary ~ speed, family = "binomial", data = pokemon) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.6065 -0.3647 -0.2335 -0.1487 2.8500 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -6.32249 0.63305 -9.987 < 2e-16 *** ## speed 0.04557 0.00669 6.812 9.62e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 342.18 on 720 degrees of freedom ## Residual deviance: 283.16 on 719 degrees of freedom ## AIC: 287.16 ## ## Number of Fisher Scoring iterations: 6 ``` ] --- ## Plot the logistic regression ```r newdat<- data.frame(speed=seq(min(pokemon$speed), max(pokemon$speed),len=length(pokemon$speed)*3)) newdat$is_legendary = predict(mylogit, newdata=newdat, type="response") plot(is_legendary~speed, data=pokemon, col="red4") lines(is_legendary ~ speed, newdat, col="green4", lwd=2) ``` <img src="index_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" /> --- class: middle, center Farther reading on logistic regression https://stats.idre.ucla.edu/r/dae/logit-regression/ --- class: exercise ##Exercise Run a logistic regression to see if a Pokemon's chance for mega evolution grows with its height, weight, and defense abilities .bottom-right[  ] --- ## Multinomial Logistic Regression * Used to explain the relationship between one nominal dependent variable and one or more independent variables. * For a nominal dependent variable with k categories, the multinomial regression model estimates k-1 logit equations * Multinomial regression is similar to discriminant analysis. The latter should be used if the predictors are normally distributed. Bottom line: Use multinomial regression when you have a categorical variable with more than two categories, and your predictors are not normally distributed. --- ## Example: We want to know how pokemons' defense, attack, and speed abilities associated with its body style --- .small[ ```r library(nnet) pokemon$body_style<- as.factor(pokemon$body_style) levels(pokemon$body_style) ``` ``` ## [1] "bipedal_tailed" "bipedal_tailless" "four_wings" ## [4] "head_arms" "head_base" "head_legs" ## [7] "head_only" "insectoid" "multiple_bodies" ## [10] "quadruped" "serpentine_body" "several_limbs" ## [13] "two_wings" "with_fins" ``` ```r multi_test <- multinom(body_style ~ defense + attack + speed, data = pokemon) ``` ``` ## # weights: 70 (52 variable) ## initial value 1902.760335 ## iter 10 value 1680.948966 ## iter 20 value 1677.348542 ## iter 30 value 1636.162688 ## iter 40 value 1605.701126 ## iter 50 value 1589.026351 ## final value 1589.011520 ## converged ``` ] --- .small[ ```r z<- summary(multi_test)$coefficients/summary(multi_test)$standard.errors z ``` ``` ## (Intercept) defense attack speed ## bipedal_tailless 1.18749511 0.2400757 -0.4524127 -2.4337327 ## four_wings -2.48903423 -0.6082106 -0.9787830 2.2772756 ## head_arms -0.32452980 2.2516206 -2.5342064 -1.5611265 ## head_base 0.06305546 3.0856705 -3.4499647 -1.7620993 ## head_legs 0.66655338 0.7273578 -1.8135951 -2.4752083 ## head_only 0.57065728 3.3862641 -4.7738335 -0.9185228 ## insectoid -0.29881862 2.1277536 -1.8087106 -2.5092639 ## multiple_bodies -2.46301454 3.8395014 -2.0208084 -0.4735239 ## quadruped 0.38797732 0.7417399 -1.6273184 0.1306347 ## serpentine_body -1.15719645 2.6139819 -2.0045508 -1.5619130 ## several_limbs -1.19130963 1.7628633 -0.3896117 -2.6449461 ## two_wings -2.21854835 0.6181348 -3.1473156 3.8489004 ## with_fins 0.01210612 -0.7475730 -1.5411417 -0.2220883 ``` ```r # Everythig is compared to "bipedal_tailed" ``` ] --- .small[ ```r p <- (1 - pnorm(abs(z), 0, 1)) * 2 p ``` ``` ## (Intercept) defense attack speed ## bipedal_tailless 0.23503239 0.8102715790 6.509717e-01 0.0149440258 ## four_wings 0.01280906 0.5430478252 3.276872e-01 0.0227697743 ## head_arms 0.74553697 0.0243462581 1.127023e-02 0.1184939143 ## head_base 0.94972234 0.0020309373 5.606599e-04 0.0780525204 ## head_legs 0.50505746 0.4670068065 6.974009e-02 0.0133158478 ## head_only 0.56823198 0.0007085117 1.807518e-06 0.3583452256 ## insectoid 0.76507845 0.0333575221 7.049598e-02 0.0120983059 ## multiple_bodies 0.01377743 0.0001232844 4.329961e-02 0.6358394467 ## quadruped 0.69803281 0.4582449362 1.036695e-01 0.8960643129 ## serpentine_body 0.24719211 0.0089493784 4.501109e-02 0.1183084912 ## several_limbs 0.23353206 0.0779235486 6.968237e-01 0.0081703930 ## two_wings 0.02651747 0.5364864923 1.647770e-03 0.0001186492 ## with_fins 0.99034095 0.4547177691 1.232823e-01 0.8242451264 ``` ] --- We can exponentiate the coefficients from our model to see these risk ratios ```r exp(coef(multi_test)) ``` ``` ## (Intercept) defense attack speed ## bipedal_tailless 1.75472339 1.0013276 0.9976274 0.9874592 ## four_wings 0.07932100 0.9921245 0.9893703 1.0227772 ## head_arms 0.80734346 1.0156863 0.9800257 0.9881693 ## head_base 1.04649297 1.0217192 0.9689031 0.9844539 ## head_legs 1.79961903 1.0075551 0.9785287 0.9702592 ## head_only 1.47729755 1.0230719 0.9567726 0.9923737 ## insectoid 0.80471381 1.0158360 0.9843688 0.9779113 ## multiple_bodies 0.07032018 1.0316839 0.9773144 0.9945880 ## quadruped 1.19258253 1.0038961 0.9919404 1.0006196 ## serpentine_body 0.41789830 1.0192781 0.9825792 0.9865751 ## several_limbs 0.28084096 1.0172282 0.9953562 0.9640250 ## two_wings 0.25872831 1.0044195 0.9787804 1.0237580 ## with_fins 1.00868260 0.9926585 0.9863215 0.9982006 ``` --- class: middle, center More information on the test https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/ --- class: exercise ##Exercise Find out how pokemons' defense, attack, and speed abilities associated with its egg group .bottom-right[  ] --- # Resources This presentation was based in large part on the following resources: - [Working with categorical data in R without losing your mind](http://www.amelia.mn/WranglingCats.pdf) by Amelia McNamara - [Be the boss of your factors](https://stat545.com/block029_factors.html), Stat545 - [Wrangling categorical data in R](https://peerj.com/preprints/3163.pdf) by Amelia McNamara and Nicholas Horton