Working with Strings in R: Seattle Pet Names
Welcome to the blog. In this post I’ll walk through a short tutorial on working with strings in R. I’ll show you some of the main functions of the stringr package and the power of the rebus package. The data comes from week 13 of TidyTuesday, which is a good fit for this tutorial because most of its columns are strings. The dataset is called “Seattle Pet Names” and contains the license issue date, license number, name, species, breeds, and zip code of pets registered in Seattle. I’ll focus the analysis on the names given to cats and dogs.
Let’s start the tutorial by loading the needed packages.
library(tidyverse) # wrangling and data visualization
library(kableExtra) # visualize html tables
library(rebus) # string manipulation
library(data.table) # fread() to read the data
library(splitstackshape) # split columns
library(lubridate) # dealing with dates and times
We now read the “Seattle Pet Names” file and take a first look at it with glimpse.
# open file
pet_names <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-26/seattle_pets.csv")
# explore it
glimpse(pet_names)
## Observations: 52,519
## Variables: 7
## $ license_issue_date <chr> "November 16 2018", "November 11 2018", "No...
## $ license_number <chr> "8002756", "S124529", "903793", "824666", "...
## $ animals_name <chr> "Wall-E", "Andre", "Mac", "Melb", "Gingersn...
## $ species <chr> "Dog", "Dog", "Dog", "Cat", "Cat", "Dog", "...
## $ primary_breed <chr> "Mixed Breed, Medium (up to 44 lbs fully gr...
## $ secondary_breed <chr> "Mix", "Dachshund, Standard Wire Haired", N...
## $ zip_code <chr> "98108", "98117", "98136", "98117", "98144"...
We will now create two separate data frames, one for cats and one for dogs.
# cats dataframe
cats_names <- pet_names %>%
filter(species == "Cat")
# dogs dataframe
dogs_names <- pet_names %>%
filter(species == "Dog")
Let’s start to manipulate strings! We’ll use two different functions from the stringr package: str_remove_all to remove all punctuation in the names and str_squish to remove excess white space.
# cats new dataframe
cats_names_tbl <- cats_names %>%
mutate(
animals_name =
animals_name %>%
str_remove_all(pattern = "[:punct:]") %>% # remove punctuation
str_squish()) # remove all excess white space
# explore the data
cats_names_tbl %>%
head(15) %>%
kable()
license_issue_date | license_number | animals_name | species | primary_breed | secondary_breed | zip_code |
---|---|---|---|---|---|---|
November 23 2018 | 824666 | Melb | Cat | Domestic Shorthair | NA | 98117 |
December 30 2018 | S119138 | Gingersnap | Cat | Domestic Shorthair | Mix | 98144 |
August 09 2018 | S142558 | Sebastian | Cat | Domestic Shorthair | Mix | 98122 |
August 20 2018 | S142546 | Madeline | Cat | Domestic Shorthair | Mix | 98105 |
December 08 2018 | S123830 | Cleo | Cat | Domestic Shorthair | NA | 98199 |
October 20 2018 | S149153 | Glitch | Cat | Siamese | Domestic Medium Hair | 98122 |
November 24 2018 | 817137 | Candy | Cat | Domestic Shorthair | NA | 98126 |
December 07 2018 | 895346 | Cinnamon | Cat | Domestic Shorthair | NA | 98144 |
October 31 2018 | S123360 | Sydney2 | Cat | Domestic Medium Hair | NA | 98101 |
October 23 2018 | S122244 | Calvin | Cat | American Shorthair | NA | 98119 |
November 15 2018 | 8002730 | Mochi | Cat | Domestic Medium Hair | Mix | 98105 |
November 27 2018 | S125276 | dmh | Cat | Domestic Medium Hair | Mix | 98117 |
December 20 2018 | 952291 | Justin | Cat | Domestic Medium Hair | NA | 98122 |
December 21 2018 | S150217 | Dash | Cat | Domestic Shorthair | NA | 98117 |
December 10 2018 | 8003722 | Buster | Cat | Domestic Shorthair | Mix | 98115 |
# dogs new dataframe
dogs_names_tbl <- dogs_names %>%
mutate(
animals_name =
animals_name %>%
str_remove_all(pattern = "[:punct:]") %>% # remove punctuation
str_squish()) # remove all excess white space
dogs_names_tbl %>%
head(15) %>%
kable()
license_issue_date | license_number | animals_name | species | primary_breed | secondary_breed | zip_code |
---|---|---|---|---|---|---|
November 16 2018 | 8002756 | WallE | Dog | Mixed Breed, Medium (up to 44 lbs fully grown) | Mix | 98108 |
November 11 2018 | S124529 | Andre | Dog | Terrier, Jack Russell | Dachshund, Standard Wire Haired | 98117 |
November 21 2018 | 903793 | Mac | Dog | Retriever, Labrador | NA | 98136 |
December 16 2018 | S138529 | Cody | Dog | Retriever, Labrador | NA | 98103 |
October 04 2017 | 580652 | Millie | Dog | Terrier, Boston | NA | 98115 |
December 23 2018 | 961052 | Sabre | Dog | Terrier | NA | 98126 |
December 07 2018 | S125461 | Thomas | Dog | Chihuahua, Short Coat | Mix | 98177 |
November 07 2018 | 8002543 | Lulu | Dog | Vizsla, Smooth Haired | Mix | 98105 |
December 15 2018 | S138838 | Milo | Dog | Boxer | Retriever, Labrador | 98109 |
November 27 2018 | S123980 | Anubis | Dog | Poodle, Standard | NA | 98112 |
October 25 2018 | 830506 | Skylar | Dog | Border Collie | Terrier, Airedale | 98144 |
October 23 2018 | S137719 | Cleo | Dog | Bernese Mountain Dog | Spaniel | 98107 |
November 07 2018 | 905090 | Petey | Dog | Pomeranian | Shih Tzu | 98119 |
December 24 2018 | S152290 | Kaia | Dog | Karelian Bear Dog | NA | 98117 |
December 27 2018 | 8004142 | Maya | Dog | Chihuahua, Short Coat | NA | 98126 |
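To see what these two steps do in isolation, here is a minimal sketch on a few made-up names (they are my own examples, not rows from the dataset). I write the class as "[[:punct:]]", which is the more conventional spelling of the punctuation class used above.

```r
library(stringr)

# Made-up names for illustration (not taken from the dataset)
toy_names <- c("Wall-E", "  Mr.   Whiskers ", "O'Malley")

# Step 1: drop every punctuation character
no_punct <- str_remove_all(toy_names, pattern = "[[:punct:]]")

# Step 2: trim the ends and collapse repeated spaces into one
cleaned <- str_squish(no_punct)

print(cleaned)
```

The order matters a little: removing punctuation first can leave double spaces behind, which str_squish then cleans up.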
One of our goals is to know how many words each name has. For this, we can use the str_count function with the regex pattern \\w+, which matches runs of word characters, so the white space between words is not counted.
# count the number of words
# for cats
cats_names_tbl %>%
mutate(count_words = str_count(animals_name, pattern = "\\w+")) %>%
arrange(desc(count_words)) %>%
select(animals_name, count_words) %>%
head()
## animals_name count_words
## 1 King Charles Leon the First aka Chuck 7
## 2 Lady G Lolo Paloma MacGuffie Hunter 6
## 3 Her Ladyship Princess Penelope Peachfuzz Howe 6
## 4 Morris Boney T MacGuffie Hunter 5
## 5 Fern River Brits Gouda Reserve 5
## 6 Jazzmine Primula Rosamund the Fair 5
# for dogs
dogs_names_tbl %>%
mutate(count_words = str_count(animals_name, pattern = "\\w+")) %>%
arrange(desc(count_words)) %>%
select(animals_name, count_words) %>%
head()
## animals_name count_words
## 1 Little Miss Dublin Maeve of the Emerald Isle 8
## 2 Legends of Olde Sir Walter The Lady Killer 8
## 3 Cascade Mountains Out of a Dream BRIA 7
## 4 Nuit Ahathoor Hecate Sappho Jezebel Lilith Crowley 7
## 5 His Royal Highness the Duke of Tacoma 7
## 6 Lady Kassandra Yu Countess of Wallingford KBE 7
So, the cat name with the most words has 7 of them: “King Charles Leon the First aka Chuck”. The dog names with the most words have 8 each: “Little Miss Dublin Maeve of the Emerald Isle” and “Legends of Olde Sir Walter The Lady Killer”.
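As a small sanity check of the word-counting idea, here is a sketch on a few made-up names. It also counts the spaces, to show why matching words with \\w+ is the safer choice: counting separators is always one short.

```r
library(stringr)

# Made-up names for illustration (not taken from the dataset)
toy_names <- c("melb", "king charles", "her ladyship princess")

# Each run of word characters counts as one word
word_counts <- str_count(toy_names, pattern = "\\w+")

# Counting the separators instead gives one less than the number of words
space_counts <- str_count(toy_names, pattern = " ")

print(data.frame(toy_names, word_counts, space_counts))
```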
Let’s imagine we wanted to know the number of characters in each pet’s name rather than the number of words. In this scenario, we could use the str_length function. First, we use str_remove_all to remove the white spaces, and then we apply str_length.
# find the number of characters
# for cats
cats_names_tbl %>%
mutate(animals_name_rem = animals_name %>% str_remove_all(pattern = " "),
number_char = animals_name_rem %>% str_length()) %>%
select(animals_name, number_char) %>%
arrange(desc(number_char)) %>%
head()
## animals_name number_char
## 1 Her Ladyship Princess Penelope Peachfuzz Howe 40
## 2 King Charles Leon the First aka Chuck 31
## 3 Lady G Lolo Paloma MacGuffie Hunter 30
## 4 Jazzmine Primula Rosamund the Fair 30
## 5 Ginger Chanel OBrien SpletzKauff 29
## 6 Samantha Scully HammerstromGuel 29
# for dogs
dogs_names_tbl %>%
mutate(animals_name_rem = animals_name %>% str_remove_all(pattern = " "),
number_char = animals_name_rem %>% str_length()) %>%
select(animals_name, number_char) %>%
arrange(desc(number_char)) %>%
head()
## animals_name number_char
## 1 Nuit Ahathoor Hecate Sappho Jezebel Lilith Crowley 49
## 2 Greta McGonagall Galactica Sunnydale Fugent 39
## 3 Lady Kassandra Yu Countess of Wallingford KBE 39
## 4 Little Miss Dublin Maeve of the Emerald Isle 37
## 5 Legends of Olde Sir Walter The Lady Killer 35
## 6 Lotus Birdie Snufflepupagus Underfoot 34
40 characters is the maximum number of characters for cats, while for dogs the maximum number is 49.
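The reason for removing the spaces first is that str_length counts them like any other character. A minimal sketch with a made-up name shows the difference:

```r
library(stringr)

# Made-up name for illustration
toy_name <- "lady g lolo"

# Spaces are counted as characters
with_spaces <- str_length(toy_name)

# Remove the spaces first to count letters only
without_spaces <- str_length(str_remove_all(toy_name, pattern = " "))

print(c(with_spaces = with_spaces, without_spaces = without_spaces))
```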
We can also change whether a string is in upper case or lower case. For this, we can use the str_to_upper and str_to_lower functions, respectively.
Note: From now on we’ll use the pets’ names in lower case.
# to upper case
# for cats
cats_names_tbl %>%
mutate(animals_name = animals_name %>% str_to_upper()) %>%
head()
## license_issue_date license_number animals_name species
## 1 November 23 2018 824666 MELB Cat
## 2 December 30 2018 S119138 GINGERSNAP Cat
## 3 August 09 2018 S142558 SEBASTIAN Cat
## 4 August 20 2018 S142546 MADELINE Cat
## 5 December 08 2018 S123830 CLEO Cat
## 6 October 20 2018 S149153 GLITCH Cat
## primary_breed secondary_breed zip_code
## 1 Domestic Shorthair <NA> 98117
## 2 Domestic Shorthair Mix 98144
## 3 Domestic Shorthair Mix 98122
## 4 Domestic Shorthair Mix 98105
## 5 Domestic Shorthair <NA> 98199
## 6 Siamese Domestic Medium Hair 98122
# for dogs
dogs_names_tbl %>%
mutate(animals_name = animals_name %>% str_to_upper()) %>%
head()
## license_issue_date license_number animals_name species
## 1 November 16 2018 8002756 WALLE Dog
## 2 November 11 2018 S124529 ANDRE Dog
## 3 November 21 2018 903793 MAC Dog
## 4 December 16 2018 S138529 CODY Dog
## 5 October 04 2017 580652 MILLIE Dog
## 6 December 23 2018 961052 SABRE Dog
## primary_breed
## 1 Mixed Breed, Medium (up to 44 lbs fully grown)
## 2 Terrier, Jack Russell
## 3 Retriever, Labrador
## 4 Retriever, Labrador
## 5 Terrier, Boston
## 6 Terrier
## secondary_breed zip_code
## 1 Mix 98108
## 2 Dachshund, Standard Wire Haired 98117
## 3 <NA> 98136
## 4 <NA> 98103
## 5 <NA> 98115
## 6 <NA> 98126
# to lower case
# for cats
cats_names_tbl <- cats_names_tbl %>%
mutate(animals_name = animals_name %>% str_to_lower())
# for dogs
dogs_names_tbl <- dogs_names_tbl %>%
mutate(animals_name = animals_name %>% str_to_lower())
Before any further analysis, we will split the pets’ names into separate columns, one word per column. We can use separate (with str_c building the new column names) or the str_split function from stringr, but to be honest I prefer using the cSplit function from the splitstackshape package.
# separate columns with separate(), using str_c to build the column names
# for cats
cats_names_tbl %>%
separate(animals_name,
into = str_c("animals_name", 1:7), # 7 because that's the maximum number of words
sep = " ",
remove = FALSE,
extra = "drop") %>%
head()
## Warning: Expected 7 pieces. Missing pieces filled with `NA` in 16887
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
## license_issue_date license_number animals_name animals_name1
## 1 November 23 2018 824666 melb melb
## 2 December 30 2018 S119138 gingersnap gingersnap
## 3 August 09 2018 S142558 sebastian sebastian
## 4 August 20 2018 S142546 madeline madeline
## 5 December 08 2018 S123830 cleo cleo
## 6 October 20 2018 S149153 glitch glitch
## animals_name2 animals_name3 animals_name4 animals_name5 animals_name6
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA>
## animals_name7 species primary_breed secondary_breed zip_code
## 1 <NA> Cat Domestic Shorthair <NA> 98117
## 2 <NA> Cat Domestic Shorthair Mix 98144
## 3 <NA> Cat Domestic Shorthair Mix 98122
## 4 <NA> Cat Domestic Shorthair Mix 98105
## 5 <NA> Cat Domestic Shorthair <NA> 98199
## 6 <NA> Cat Siamese Domestic Medium Hair 98122
# separate columns with separate(), using str_c to build the column names
# for dogs
dogs_names_tbl %>%
separate(animals_name,
into = str_c("animals_name", 1:8), # 8 because that's the maximum number of words
sep = " ",
remove = FALSE,
extra = "drop") %>%
head()
## Warning: Expected 8 pieces. Missing pieces filled with `NA` in 35103
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
## license_issue_date license_number animals_name animals_name1
## 1 November 16 2018 8002756 walle walle
## 2 November 11 2018 S124529 andre andre
## 3 November 21 2018 903793 mac mac
## 4 December 16 2018 S138529 cody cody
## 5 October 04 2017 580652 millie millie
## 6 December 23 2018 961052 sabre sabre
## animals_name2 animals_name3 animals_name4 animals_name5 animals_name6
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA>
## animals_name7 animals_name8 species
## 1 <NA> <NA> Dog
## 2 <NA> <NA> Dog
## 3 <NA> <NA> Dog
## 4 <NA> <NA> Dog
## 5 <NA> <NA> Dog
## 6 <NA> <NA> Dog
## primary_breed
## 1 Mixed Breed, Medium (up to 44 lbs fully grown)
## 2 Terrier, Jack Russell
## 3 Retriever, Labrador
## 4 Retriever, Labrador
## 5 Terrier, Boston
## 6 Terrier
## secondary_breed zip_code
## 1 Mix 98108
## 2 Dachshund, Standard Wire Haired 98117
## 3 <NA> 98136
## 4 <NA> 98103
## 5 <NA> 98115
## 6 <NA> 98126
# separate columns with str_split
# for cats
splitting1 <- as.data.frame(str_split(cats_names_tbl$animals_name, fixed(" "), n = 7, simplify = TRUE)) # 7 because that's the maximum number of words
cats_after_split <- bind_cols(cats_names_tbl, splitting1)
# separate columns with str_split
# for dogs
splitting2 <- as.data.frame(str_split(dogs_names_tbl$animals_name, fixed(" "), n = 8, simplify = TRUE)) # 8 because that's the maximum number of words
dogs_after_split <- bind_cols(dogs_names_tbl, splitting2)
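The 7 and 8 above are hard-coded from the earlier word counts. A minimal sketch of the same str_split call on made-up names, where the maximum number of words is computed instead (the variable names max_words and pieces are my own):

```r
library(stringr)

# Made-up names for illustration (not taken from the dataset)
toy_names <- c("melb", "lady g lolo", "king charles")

# Derive the number of columns from the data instead of hard-coding it
max_words <- max(str_count(toy_names, pattern = "\\w+"))

# simplify = TRUE returns a character matrix with one column per word;
# names with fewer words are padded with empty strings
pieces <- str_split(toy_names, fixed(" "), n = max_words, simplify = TRUE)

print(pieces)
```

Note that the padding here is "" rather than NA, which is one difference from the separate approach shown earlier.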
With the cSplit function it gets simpler.
# with the cSplit function
# for cats
cats_names_tbl <- cats_names_tbl %>%
cSplit("animals_name", sep = " ", drop = FALSE) # separate animals_name column, but keep it
cats_names_tbl %>%
head(20)
## license_issue_date license_number animals_name species
## 1: November 23 2018 824666 melb Cat
## 2: December 30 2018 S119138 gingersnap Cat
## 3: August 09 2018 S142558 sebastian Cat
## 4: August 20 2018 S142546 madeline Cat
## 5: December 08 2018 S123830 cleo Cat
## 6: October 20 2018 S149153 glitch Cat
## 7: November 24 2018 817137 candy Cat
## 8: December 07 2018 895346 cinnamon Cat
## 9: October 31 2018 S123360 sydney2 Cat
## 10: October 23 2018 S122244 calvin Cat
## 11: November 15 2018 8002730 mochi Cat
## 12: November 27 2018 S125276 dmh Cat
## 13: December 20 2018 952291 justin Cat
## 14: December 21 2018 S150217 dash Cat
## 15: December 10 2018 8003722 buster Cat
## 16: October 21 2018 S136705 monk Cat
## 17: October 29 2018 S137285 mari Cat
## 18: November 19 2018 S124303 peko Cat
## 19: November 07 2018 896862 sajah Cat
## 20: November 12 2018 896813 nami Cat
## primary_breed secondary_breed zip_code animals_name_1
## 1: Domestic Shorthair <NA> 98117 melb
## 2: Domestic Shorthair Mix 98144 gingersnap
## 3: Domestic Shorthair Mix 98122 sebastian
## 4: Domestic Shorthair Mix 98105 madeline
## 5: Domestic Shorthair <NA> 98199 cleo
## 6: Siamese Domestic Medium Hair 98122 glitch
## 7: Domestic Shorthair <NA> 98126 candy
## 8: Domestic Shorthair <NA> 98144 cinnamon
## 9: Domestic Medium Hair <NA> 98101 sydney2
## 10: American Shorthair <NA> 98119 calvin
## 11: Domestic Medium Hair Mix 98105 mochi
## 12: Domestic Medium Hair Mix 98117 dmh
## 13: Domestic Medium Hair <NA> 98122 justin
## 14: Domestic Shorthair <NA> 98117 dash
## 15: Domestic Shorthair Mix 98115 buster
## 16: Domestic Medium Hair Mix 98122 monk
## 17: Domestic Shorthair <NA> 98144 mari
## 18: Domestic Medium Hair <NA> 98117 peko
## 19: Domestic Medium Hair <NA> 98112 sajah
## 20: Domestic Shorthair <NA> 98107 nami
## animals_name_2 animals_name_3 animals_name_4 animals_name_5
## 1: <NA> <NA> <NA> <NA>
## 2: <NA> <NA> <NA> <NA>
## 3: <NA> <NA> <NA> <NA>
## 4: <NA> <NA> <NA> <NA>
## 5: <NA> <NA> <NA> <NA>
## 6: <NA> <NA> <NA> <NA>
## 7: <NA> <NA> <NA> <NA>
## 8: <NA> <NA> <NA> <NA>
## 9: <NA> <NA> <NA> <NA>
## 10: <NA> <NA> <NA> <NA>
## 11: <NA> <NA> <NA> <NA>
## 12: <NA> <NA> <NA> <NA>
## 13: <NA> <NA> <NA> <NA>
## 14: <NA> <NA> <NA> <NA>
## 15: <NA> <NA> <NA> <NA>
## 16: <NA> <NA> <NA> <NA>
## 17: <NA> <NA> <NA> <NA>
## 18: <NA> <NA> <NA> <NA>
## 19: <NA> <NA> <NA> <NA>
## 20: <NA> <NA> <NA> <NA>
## animals_name_6 animals_name_7
## 1: <NA> <NA>
## 2: <NA> <NA>
## 3: <NA> <NA>
## 4: <NA> <NA>
## 5: <NA> <NA>
## 6: <NA> <NA>
## 7: <NA> <NA>
## 8: <NA> <NA>
## 9: <NA> <NA>
## 10: <NA> <NA>
## 11: <NA> <NA>
## 12: <NA> <NA>
## 13: <NA> <NA>
## 14: <NA> <NA>
## 15: <NA> <NA>
## 16: <NA> <NA>
## 17: <NA> <NA>
## 18: <NA> <NA>
## 19: <NA> <NA>
## 20: <NA> <NA>
# with the cSplit function
# for dogs
dogs_names_tbl <- dogs_names_tbl %>%
cSplit("animals_name", sep = " ", drop = FALSE) # separate animals_name column, but keep it
dogs_names_tbl %>%
head(20)
## license_issue_date license_number animals_name species
## 1: November 16 2018 8002756 walle Dog
## 2: November 11 2018 S124529 andre Dog
## 3: November 21 2018 903793 mac Dog
## 4: December 16 2018 S138529 cody Dog
## 5: October 04 2017 580652 millie Dog
## 6: December 23 2018 961052 sabre Dog
## 7: December 07 2018 S125461 thomas Dog
## 8: November 07 2018 8002543 lulu Dog
## 9: December 15 2018 S138838 milo Dog
## 10: November 27 2018 S123980 anubis Dog
## 11: October 25 2018 830506 skylar Dog
## 12: October 23 2018 S137719 cleo Dog
## 13: November 07 2018 905090 petey Dog
## 14: December 24 2018 S152290 kaia Dog
## 15: December 27 2018 8004142 maya Dog
## 16: November 18 2018 835221 shirley Dog
## 17: October 18 2018 S101544 diesel Dog
## 18: November 30 2018 904438 jacob Dog
## 19: November 06 2018 8002506 linkin Dog
## 20: December 02 2018 829922 gracie Dog
## primary_breed
## 1: Mixed Breed, Medium (up to 44 lbs fully grown)
## 2: Terrier, Jack Russell
## 3: Retriever, Labrador
## 4: Retriever, Labrador
## 5: Terrier, Boston
## 6: Terrier
## 7: Chihuahua, Short Coat
## 8: Vizsla, Smooth Haired
## 9: Boxer
## 10: Poodle, Standard
## 11: Border Collie
## 12: Bernese Mountain Dog
## 13: Pomeranian
## 14: Karelian Bear Dog
## 15: Chihuahua, Short Coat
## 16: Australian Shepherd
## 17: Shepherd
## 18: German Shepherd
## 19: Vizsla, Smooth Haired
## 20: Terrier, Fox, Smooth
## secondary_breed zip_code animals_name_1 animals_name_2
## 1: Mix 98108 walle <NA>
## 2: Dachshund, Standard Wire Haired 98117 andre <NA>
## 3: <NA> 98136 mac <NA>
## 4: <NA> 98103 cody <NA>
## 5: <NA> 98115 millie <NA>
## 6: <NA> 98126 sabre <NA>
## 7: Mix 98177 thomas <NA>
## 8: Mix 98105 lulu <NA>
## 9: Retriever, Labrador 98109 milo <NA>
## 10: <NA> 98112 anubis <NA>
## 11: Terrier, Airedale 98144 skylar <NA>
## 12: Spaniel 98107 cleo <NA>
## 13: Shih Tzu 98119 petey <NA>
## 14: <NA> 98117 kaia <NA>
## 15: <NA> 98126 maya <NA>
## 16: Retriever, Golden 98144 shirley <NA>
## 17: Mix 98118 diesel <NA>
## 18: Rottweiler 98199 jacob <NA>
## 19: <NA> 98105 linkin <NA>
## 20: <NA> 98118 gracie <NA>
## animals_name_3 animals_name_4 animals_name_5 animals_name_6
## 1: <NA> <NA> <NA> <NA>
## 2: <NA> <NA> <NA> <NA>
## 3: <NA> <NA> <NA> <NA>
## 4: <NA> <NA> <NA> <NA>
## 5: <NA> <NA> <NA> <NA>
## 6: <NA> <NA> <NA> <NA>
## 7: <NA> <NA> <NA> <NA>
## 8: <NA> <NA> <NA> <NA>
## 9: <NA> <NA> <NA> <NA>
## 10: <NA> <NA> <NA> <NA>
## 11: <NA> <NA> <NA> <NA>
## 12: <NA> <NA> <NA> <NA>
## 13: <NA> <NA> <NA> <NA>
## 14: <NA> <NA> <NA> <NA>
## 15: <NA> <NA> <NA> <NA>
## 16: <NA> <NA> <NA> <NA>
## 17: <NA> <NA> <NA> <NA>
## 18: <NA> <NA> <NA> <NA>
## 19: <NA> <NA> <NA> <NA>
## 20: <NA> <NA> <NA> <NA>
## animals_name_7 animals_name_8
## 1: <NA> <NA>
## 2: <NA> <NA>
## 3: <NA> <NA>
## 4: <NA> <NA>
## 5: <NA> <NA>
## 6: <NA> <NA>
## 7: <NA> <NA>
## 8: <NA> <NA>
## 9: <NA> <NA>
## 10: <NA> <NA>
## 11: <NA> <NA>
## 12: <NA> <NA>
## 13: <NA> <NA>
## 14: <NA> <NA>
## 15: <NA> <NA>
## 16: <NA> <NA>
## 17: <NA> <NA>
## 18: <NA> <NA>
## 19: <NA> <NA>
## 20: <NA> <NA>
Let us now find the most common first and last letters used in the first name for both cats and dogs. For this, we will use the str_sub
function.
# first letter
# for cats
cats_names_tbl %>%
mutate(first_letter = animals_name_1 %>% str_sub(1,1)) %>%
count(first_letter) %>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## first_letter n
## <chr> <int>
## 1 m 1998
## 2 s 1831
## 3 b 1356
## 4 c 1234
## 5 l 1142
## 6 p 1095
# for dogs
dogs_names_tbl %>%
mutate(first_letter = animals_name_1 %>% str_sub(1,1)) %>%
count(first_letter) %>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## first_letter n
## <chr> <int>
## 1 m 3408
## 2 b 3366
## 3 s 3042
## 4 c 2760
## 5 l 2712
## 6 r 2199
# last letter
# for cats
cats_names_tbl %>%
mutate(last_letter = animals_name_1 %>% str_sub(-1,-1)) %>%
count(last_letter) %>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## last_letter n
## <chr> <int>
## 1 y 2881
## 2 e 2855
## 3 a 2406
## 4 r 1368
## 5 o 1209
## 6 s 1110
# for dogs
dogs_names_tbl %>%
mutate(last_letter = animals_name_1 %>% str_sub(-1,-1)) %>%
count(last_letter) %>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## last_letter n
## <chr> <int>
## 1 y 7373
## 2 e 6575
## 3 a 4839
## 4 r 3007
## 5 o 2684
## 6 n 2159
The letter “M” is the most common first letter for both cats and dogs. Curiously, “Y” is the most common last letter for both.
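For reference, here is how the two str_sub calls behave on a few made-up names. Positive positions count from the start of the string and negative positions count from the end, so (1, 1) is the first character and (-1, -1) is the last.

```r
library(stringr)

# Made-up names for illustration (not taken from the dataset)
toy_names <- c("luna", "max", "charlie")

# First character: start and end both at position 1
first_letter <- str_sub(toy_names, 1, 1)

# Last character: start and end both at position -1
last_letter <- str_sub(toy_names, -1, -1)

print(data.frame(toy_names, first_letter, last_letter))
```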
After this, our goal is to find the most common names for cats and dogs.
# most common name
# for cats
cats_names_tbl %>%
filter(!is.na(animals_name_1)) %>%
count(animals_name_1) %>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## animals_name_1 n
## <fct> <int>
## 1 luna 116
## 2 lucy 104
## 3 lily 91
## 4 charlie 88
## 5 max 86
## 6 bella 84
# for dogs
dogs_names_tbl %>%
filter(!is.na(animals_name_1)) %>%
count(animals_name_1) %>%
arrange(desc(n)) %>%
head()
## # A tibble: 6 x 2
## animals_name_1 n
## <fct> <int>
## 1 lucy 358
## 2 charlie 325
## 3 bella 273
## 4 luna 261
## 5 daisy 244
## 6 cooper 199
“Luna” is the most popular name for cats, while for dogs “Lucy” is the name given most often.
Back again to stringr
, it would be possible to use str_subset
to find names with a specific pattern.
# find unique first names with the pattern "su"
# for cats
unique(str_subset(cats_names_tbl$animals_name_1, pattern = "su"))
## [1] "sugar" "summer" "sunny" "suzie" "sufjan"
## [6] "susie" "sushi" "sunbeam" "sunset" "suki"
## [11] "susu" "natsumi" "ursula" "suri" "graysun"
## [16] "sultana" "suess" "tsunami" "sisu" "sumi"
## [21] "sundae" "sukapati" "sullivan" "suika" "sultrina"
## [26] "katsu" "sula" "sumo" "suede" "suma"
## [31] "sundance" "sunshine" "sugarpuffs" "sukhi" "sunkist"
## [36] "sunflower" "sunnyside" "sutton" "sucha" "yasuo"
## [41] "sully" "mitsu" "suji" "sunday" "sukie"
## [46] "satsuma" "zasu" "misu" "suzykitty" "surya"
## [51] "natsu" "nasuchan" "suzy" "tonkatsu" "suzi"
## [56] "susan" "superman" "suva" "suleyman" "tsuki"
## [61] "sulley"
# for dogs
unique(str_subset(dogs_names_tbl$animals_name_1, pattern = "su"))
## [1] "sue" "suzie" "suki" "sully"
## [5] "susie" "sunny" "suzy" "ursula"
## [9] "sushi" "sugar" "sun" "katsu"
## [13] "mitsu" "jesus" "sunshine" "su"
## [17] "summer" "sukey" "tamotsu" "sundae"
## [21] "kitsunejip" "sunchaser" "missus" "suri"
## [25] "sumi" "komatsu" "suggen" "yousuke"
## [29] "sundance" "sudo" "tsuki" "sunnee"
## [33] "sukhi" "sunday" "tsunami" "jessup"
## [37] "subaru" "rayasunshine" "susannah" "tsukee"
## [41] "bosun" "sullivan" "suka" "subotai"
## [45] "sunniva" "suzzie" "sug" "zsuska"
## [49] "asuka" "sisu" "sukie" "sunup"
## [53] "sucia" "suni" "sutter" "suzieq"
## [57] "subi" "surti" "sumo" "suze"
## [61] "sugarberry" "jitsu" "summit" "sukha"
## [65] "sudi" "sultan" "kitsune" "susi"
## [69] "colossus" "sukoshi" "narcissus" "sua"
## [73] "kietsu" "sugi" "suma" "sukee"
## [77] "misu" "suzee" "consuela"
Therefore, we have 61 unique cat first names containing the pattern “su”. For dogs the pattern shows up in more names: 79 unique first names.
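A minimal sketch of the same combination on made-up names. str_subset keeps every element that contains the pattern anywhere, including repeats, which is why it is wrapped in unique.

```r
library(stringr)

# Made-up names for illustration (not taken from the dataset)
toy_names <- c("sugar", "luna", "katsu", "sugar", "max")

# Keep only the names containing "su" (duplicates are kept)
matches <- str_subset(toy_names, pattern = "su")

# Drop the duplicates to get the distinct matching names
unique_matches <- unique(matches)

print(unique_matches)
```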
str_which is another interesting function, which tells us in which rows a specific pattern occurs.
# find a pattern row location
# for cats
str_which(cats_names_tbl$animals_name_1, pattern = "car")
## [1] 53 180 271 397 449 811 874 1100 1137 1181 1241
## [12] 1308 1398 1421 1543 1585 1664 1682 2163 2909 3006 3120
## [23] 3188 3207 3361 3375 3442 3571 3849 3899 4047 4288 4388
## [34] 4513 4823 4976 5028 5060 5104 5197 5287 5421 5456 5640
## [45] 5647 5690 5691 5769 5814 5820 6131 6144 6235 6455 6501
## [56] 6685 6829 7027 7079 7110 7306 7321 7369 7499 7778 7865
## [67] 7866 7944 8067 8159 8276 8432 8487 9280 9429 9458 9684
## [78] 9870 9923 10027 10045 10072 10140 10214 10437 10712 10780 11483
## [89] 11530 11575 11669 11814 12054 12140 12306 12732 13404 13436 13449
## [100] 13544 13721 13843 14312 14485 14666 14864 15025 15206 15382 15383
## [111] 15751 15819 15995 16304 16369 16622 16634 16800 16893 16945 17126
## [122] 17163
# for dogs
# returns the row positions in dogs_names_tbl whose first name contains "car"
str_which(dogs_names_tbl$animals_name_1, pattern = "car")
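The positions returned by str_which are most useful for indexing back into the original vector or data frame. A small sketch on made-up names:

```r
library(stringr)

# Made-up names for illustration (not taken from the dataset)
toy_names <- c("oscar", "luna", "carl", "max")

# Positions of the elements that contain "car"
hits <- str_which(toy_names, pattern = "car")

# Use the positions to pull out the matching names
matched_names <- toy_names[hits]

print(hits)
print(matched_names)
```

The same positions can be obtained with which(str_detect(toy_names, "car")).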
Rebus
In the previous section, some important functions within the stringr ecosystem were shown. Now let’s work again with stringr, this time in association with the rebus package. This package makes regular expressions easier to write and read. As a quick example, let’s imagine that we wanted to count all dog and cat names starting with a specific letter. We just have to write START %R% followed by the required pattern:
# names that start with L
# for cats
cats_names_tbl %>%
mutate(L_char = str_count(animals_name_1, pattern = START %R% "l")) %>%
summarize(cats_words_start_with_L = sum(L_char, na.rm = TRUE))
## cats_words_start_with_L
## 1 1142
# dogs
dogs_names_tbl %>%
mutate(L_char = str_count(animals_name_1, pattern = START %R% "l")) %>%
summarize(dogs_words_start_with_L = sum(L_char, na.rm = TRUE))
## dogs_words_start_with_L
## 1 2712
In this case, we have 1142 cat first names starting with the “L” character. For dogs, we have 2712 instances.
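It can help to know what rebus builds under the hood. As far as I know, START stands for the regex anchor ^, so START %R% "l" corresponds to the plain pattern "^l". Here is a sketch with plain stringr on made-up names, using str_detect and sum as an alternative way to count:

```r
library(stringr)

# Made-up names for illustration (not taken from the dataset)
toy_names <- c("luna", "bella", "lily", "milo", NA)

# "^l" is the plain-regex counterpart of START %R% "l"
starts_with_l <- str_detect(toy_names, pattern = "^l")

# Missing names yield NA, so they are dropped when summing
n_starting_with_l <- sum(starts_with_l, na.rm = TRUE)

print(n_starting_with_l)
```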
Moving on with our wrangling of strings, if we would like to check the first names ending with the pattern “cy” we could simply do the following:
# find names that end with the pattern "cy"
# for cats
sort(
table(
str_subset(
cats_names_tbl$animals_name_1, pattern = "cy" %R% END
)
)
, decreasing = TRUE)
##
## lucy percy macy quincy fancy clancy mercy nancy darcy
## 104 13 10 9 6 4 3 3 2
## marcy chauncy gracy lacy legacy peacy purcy purrcy saucy
## 2 1 1 1 1 1 1 1 1
## spicy
## 1
# for dogs
sort(
table(
str_subset(
dogs_names_tbl$animals_name_1, pattern = "cy" %R% END
)
)
, decreasing = TRUE)
##
## lucy quincy percy macy darcy fancy clancy mercy cy icy
## 358 29 16 15 7 4 3 3 2 2
## lacy nancy spicy gacy gancy kacy marcy saucy stacy yancy
## 2 2 2 1 1 1 1 1 1 1
As we can see, we’ve placed the pattern before the END constant of rebus, which indicates that we want names ending with “cy”.
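The nested sort(table(str_subset(...))) idiom used above can be read from the inside out. Here is a sketch on made-up names, with the plain pattern "cy$" standing in for "cy" %R% END (I am assuming END corresponds to the $ anchor):

```r
library(stringr)

# Made-up names for illustration (not taken from the dataset)
toy_names <- c("lucy", "percy", "lucy", "cyrus", "macy", "lucy")

# 1. keep the names that end with "cy" ("cyrus" only starts with it)
ending_cy <- str_subset(toy_names, pattern = "cy$")

# 2. count how often each name occurs
counts <- table(ending_cy)

# 3. put the most frequent names first
sorted_counts <- sort(counts, decreasing = TRUE)

print(sorted_counts)
```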
We could also try to find names where the third letter is a “z”. The rebus package makes this easy with the ANY_CHAR constant, which matches any single character.
# first names where the third letter is a "z"
# for cats
sort(
table(
str_subset(
cats_names_tbl$animals_name_1, pattern = START %R% ANY_CHAR %R% ANY_CHAR %R% "z"
)
)
, decreasing = TRUE)
##
## gizmo izzy hazel taz jazz jazzy
## 20 20 17 12 10 9
## lizzy ozzie enzo suzie buzz lizzie
## 9 9 6 6 4 4
## mozart zazu fuzzy izzie jezebel liza
## 4 4 3 3 3 3
## mazie ouzo ozzy zuzu dizzy fizz
## 3 3 3 3 2 2
## fuzz jaz kaze kizzie liz orzo
## 2 2 2 2 2 2
## pazzo rizhik tazzie wizard zizou azzu
## 2 2 2 2 2 1
## bazil bazique boz boze bozo caz
## 1 1 1 1 1 1
## cazadores cazzy cezanne dazzle dezi dezmond
## 1 1 1 1 1 1
## fez fezzie fezzik fozzie fozzy fuzzjackson
## 1 1 1 1 1 1
## gazpacho giza gozer gozo gyzmo haze
## 1 1 1 1 1 1
## hazelnut itzel izzybella jazl jazzmine jezavell
## 1 1 1 1 1 1
## jezebell jezebelle jezzabelle jezzie kozmo lazer
## 1 1 1 1 1 1
## lazlo lazy lazzlo mazel mazinga mazy
## 1 1 1 1 1 1
## mazzy miz mizu moz muzette muzzy
## 1 1 1 1 1 1
## nazee nozomi puzzle puzzles raz razz
## 1 1 1 1 1 1
## riza sazjha suzi suzy suzykitty tazmin
## 1 1 1 1 1 1
## tazo tazzy tizzie yaz yuzu zazi
## 1 1 1 1 1 1
## zizi
## 1
And now the same pattern, but for dogs.
# for dogs
sort(
table(
str_subset(
dogs_names_tbl$animals_name_1, pattern = START %R% ANY_CHAR %R% ANY_CHAR %R% "z"
)
)
, decreasing = TRUE)
##
## izzy hazel gizmo enzo ozzie
## 60 49 46 30 25
## lizzie ozzy jazz suzie zuzu
## 21 17 14 14 13
## taz dozer izzie rizzo suzy
## 11 10 8 7 7
## jazzy buzz fozzie lizzy cozmo
## 6 5 5 5 4
## dizzy fezzik mazey mazie jazmine
## 4 4 4 4 3
## jezebel lazlo liza mazzy wizard
## 3 3 3 3 3
## zizou bizzy dazzle fez fuzzy
## 3 2 2 2 2
## izzabella izze izzi jaz jazmyne
## 2 2 2 2 2
## kizzy kozmo liz nizhoni orzo
## 2 2 2 2 2
## razzie tazz auzzie bazely bazi
## 2 2 1 1 1
## bazzle bezudry bizkit bozley buzzy
## 1 1 1 1 1
## cazaril cezanne cozette dez dizzle
## 1 1 1 1 1
## elza fizzgig foz fozul fozzy
## 1 1 1 1 1
## fuzz fuzzbucket gazelle gizzmo gizzy
## 1 1 1 1 1
## gozer guzi haze hazelnut hazy
## 1 1 1 1 1
## izzybella jazzelle jazzi jazzie jezzabelle
## 1 1 1 1 1
## jezzebelle jozy kaz kazanova kazi
## 1 1 1 1 1
## kazmer kazu kezie kuzi lizaminion
## 1 1 1 1 1
## lizette lizzi luz mazama mazel
## 1 1 1 1 1
## mazy mizu mizzie mozart moze
## 1 1 1 1 1
## mozi mozzie mozzy muzby nazy
## 1 1 1 1 1
## nazzy ouzel pez puzzle raz
## 1 1 1 1 1
## razus razzmatazz rezi rezso rizz
## 1 1 1 1 1
## rozalia rozsa sazi soz suze
## 1 1 1 1 1
## suzee suzieq suzzie tazman tazzy
## 1 1 1 1 1
## tezcatlipoca tizzy tozer tozzi tuzi
## 1 1 1 1 1
## twz wizzard yzzy zazie zazu
## 1 1 1 1 1
## zezza zizka zozo
## 1 1 1
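In plain regex terms, START %R% ANY_CHAR %R% ANY_CHAR %R% "z" should correspond to "^..z", assuming ANY_CHAR is the dot. A minimal sketch on made-up names shows which ones qualify:

```r
library(stringr)

# Made-up names for illustration (not taken from the dataset)
toy_names <- c("gizmo", "izzy", "zelda", "taz", "oz")

# Two arbitrary characters from the start, then a "z" in third position
third_is_z <- str_subset(toy_names, pattern = "^..z")

print(third_is_z)
```

"zelda" has its z in first position and "oz" is too short to have a third letter, so neither is returned.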
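We could also match a specific pattern anywhere in the name. For instance, let’s try to find the pattern “cc”. Note that the code appends ANY_CHAR, so the “cc” must be followed by at least one more character.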
# match the pattern "cc"
# for cats
sort(
table(
str_subset(
cats_names_tbl$animals_name_1, pattern = "cc" %R% ANY_CHAR
)
)
, decreasing = TRUE)
##
## gnocchi rocco chewbacca gucci piccadilly staccato
## 6 4 3 2 2 2
## zucchini becca blenducci boudiccea flocca hiccup
## 2 1 1 1 1 1
## kucci lucca lucci macchiatto mccone mccovey
## 1 1 1 1 1 1
## moccasin moustacchio nicchi piccolo rebecca recco
## 1 1 1 1 1 1
## soccer torocco yacchan zucca
## 1 1 1 1
“Gnocchi” seems to be the most popular cat name with the “cc” pattern.
# match the pattern "cc"
# for dogs
sort(
table(
str_subset(
dogs_names_tbl$animals_name_1, pattern = "cc" %R% ANY_CHAR
)
)
, decreasing = TRUE)
##
## rocco chewbacca lucca gucci ricco bacchus
## 28 21 17 6 5 4
## bocce zucca boudicca becca cappuccino abbracci
## 4 4 3 2 2 1
## bocca chicca chubacca gnocchi jacco macca
## 1 1 1 1 1 1
## macchia macchiato mocca nicca nicco nucci
## 1 1 1 1 1 1
## pucchi pucci puccini ricci rocca rucca
## 1 1 1 1 1 1
## sacco scaramucci zacca zaccheus
## 1 1 1 1
“Rocco” shows as the most popular dog name where the “cc” pattern is present.
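One detail worth seeing on a toy example: because the pattern is "cc" %R% ANY_CHAR, a name that ends in "cc" is not matched. The sketch below compares the plain equivalents "cc." and "cc" on made-up names (assuming ANY_CHAR is the dot):

```r
library(stringr)

# Made-up names for illustration (not taken from the dataset)
toy_names <- c("rocco", "gucci", "mocc", "luna")

# "cc" followed by any one character: misses names that end in "cc"
cc_then_char <- str_subset(toy_names, pattern = "cc.")

# "cc" anywhere in the name, including at the very end
cc_anywhere <- str_subset(toy_names, pattern = "cc")

print(cc_then_char)
print(cc_anywhere)
```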
Another option is to match more than one pattern at once. During the data exploration, some pet names turned out to be not exactly the same, though very similar, for instance “Lucy”, “Lucie”, and “Luci”. In this case, we can use the or function from the rebus package.
# match multiple patterns with "or"
# for cats
sort(
table(
str_subset(
cats_names_tbl$animals_name_1, pattern = START %R% or("lucy", "lucie", "luci")
%R% END)
)
, decreasing = TRUE)
##
## lucy luci lucie
## 104 3 1
And for dogs:
# match multiple patterns with "or"
# for dogs
sort(
table(
str_subset(
dogs_names_tbl$animals_name_1, pattern = START %R% or("lucy", "lucie", "luci")
%R% END)
)
, decreasing = TRUE)
##
## lucy lucie luci
## 358 9 7
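As a rough illustration of what the alternation does, the sketch below uses base R on a made-up vector (toy_names is hypothetical). It assumes that START %R% or("lucy", "lucie", "luci") %R% END corresponds approximately to the regular expression ^(?:lucy|lucie|luci)$, so only whole-name matches are kept.

```r
# Toy names, invented for illustration only
toy_names <- c("lucy", "lucie", "luci", "lucille", "lucy2", "saint lucy")

# Assumed plain-regex form of START %R% or("lucy", "lucie", "luci") %R% END
lucy_pattern <- "^(?:lucy|lucie|luci)$"

# perl = TRUE is used so the non-capturing group (?: ... ) is supported
lucy_matches <- toy_names[grepl(lucy_pattern, toy_names, perl = TRUE)]
print(lucy_matches)
# [1] "lucy"  "lucie" "luci"
```

Because of the start and end anchors, longer names such as “lucille” or “lucy2” are left out even though they begin with one of the alternatives.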
We could also try to find first names containing only vowels by using the char_class
function from the rebus
package.
# find first names that correspond only to vowels
# for cats
vow <- char_class("aeiou")
sort(
table(
str_subset(
cats_names_tbl$animals_name_1, pattern = START %R% one_or_more(vow) %R% END)
)
, decreasing = TRUE)
##
## o e io
## 2 1 1
And for dogs:
# find first names that correspond only to vowels
# for dogs
vow <- char_class("aeiou")
sort(
table(
str_subset(
dogs_names_tbl$animals_name_1, pattern = START %R% one_or_more(vow) %R% END)
)
, decreasing = TRUE)
##
## io iuiu o uau
## 2 1 1 1
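The following base R sketch mirrors the vowel-only search on a made-up vector (toy_names is hypothetical). It assumes that START %R% one_or_more(char_class("aeiou")) %R% END is roughly the regular expression ^[aeiou]+$.

```r
# Toy names, invented for illustration only
toy_names <- c("io", "o", "uau", "leo", "ay", "")

# Assumed plain-regex form of the rebus pattern:
# one or more vowels and nothing else between start and end
only_vowels <- grepl("^[aeiou]+$", toy_names)

vowel_names <- toy_names[only_vowels]
print(vowel_names)
# [1] "io"  "o"   "uau"
```

Note that “y” is not in the character class, so “ay” is not treated as a vowel-only name, and the empty string fails because at least one vowel is required.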
We could also do the reverse by using the negated_char_class
function.
# find names that do not have vowels
# for cats
not_vow <- negated_char_class("aeiou")
sort(
table(
str_subset(
cats_names_tbl$animals_name_1, pattern = START %R% one_or_more(not_vow) %R% END)
)
, decreasing = TRUE)
##
## mr gypsy ms mrs flynn lynx mj sky dmh dr
## 70 9 9 8 6 6 5 5 4 4
## jj nyx bb bc bw cc dc pj tj cj
## 4 4 3 3 3 3 3 3 3 2
## d dsh g j jr jynx kt p rhys sgt
## 2 2 2 2 2 2 2 2 2 2
## skyy syd ty 1 2 30 99 bj bryn bynx
## 2 2 2 1 1 1 1 1 1 1
## c80 dj dw fly frytz fyn fynn gb glynn grrrly
## 1 1 1 1 1 1 1 1 1 1
## grwh jd jh jp jt jyn k2 kc kk lbj
## 1 1 1 1 1 1 1 1 1 1
## m my prrr pym q qt rd rhythm ryn s
## 1 1 1 1 1 1 1 1 1 1
## sly sphynx sydny tk tyr v whbl z zz
## 1 1 1 1 1 1 1 1 1
And for dogs:
# find names that do not have vowels
# for dogs
not_vow <- negated_char_class("aeiou")
sort(
table(
str_subset(
dogs_names_tbl$animals_name_1, pattern = START %R% one_or_more(not_vow) %R% END)
)
, decreasing = TRUE)
##
## mr gypsy jj ty flynn pj sky cj ms z
## 48 15 12 9 8 8 7 6 6 5
## bb dj dr fynn kc mj p tj b cc
## 4 4 4 4 4 4 4 4 3 3
## fry j nyx sgt bj cy d gg jp jr
## 3 3 3 3 2 2 2 2 2 2
## k ky lj lt m pd sly t tyr wynn
## 2 2 2 2 2 2 2 2 2 2
## 7s bb8 bg bryn c cb cr dc dw f
## 1 1 1 1 1 1 1 1 1 1
## fly gd grr hy jb jd jw jwl k2 kk
## 1 1 1 1 1 1 1 1 1 1
## lb lc lynyrd mc mcfly mrs my pk pm pp
## 1 1 1 1 1 1 1 1 1 1
## pt pym r2d2 rj spy sy sylph tt twy twz
## 1 1 1 1 1 1 1 1 1 1
## vãlkl yzzy
## 1 1
“Mr” and “Gypsy” are the most popular first names without vowels, for both cats and dogs.
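One detail worth noticing in the tables above is that entries such as “99” or “r2d2” appear: a negated character class matches anything that is not a vowel, digits included. The base R sketch below illustrates this on made-up names (toy_names is hypothetical), assuming the rebus pattern is roughly ^[^aeiou]+$, and contrasts it with a stricter consonants-only class that I introduce purely for illustration.

```r
# Toy names, invented for illustration only
toy_names <- c("mr", "sky", "99", "k2", "max", "r2d2")

# Assumed plain-regex form of START %R% one_or_more(negated_char_class("aeiou")) %R% END
no_vowels <- grepl("^[^aeiou]+$", toy_names)
no_vowel_names <- toy_names[no_vowels]
print(no_vowel_names)
# [1] "mr"   "sky"  "99"   "k2"   "r2d2"

# Stricter alternative (not in the original post): lowercase consonants only
consonants_only <- grepl("^[b-df-hj-np-tv-z]+$", toy_names)
consonant_names <- toy_names[consonants_only]
print(consonant_names)
# [1] "mr"  "sky"
```

Which of the two is preferable depends on whether digits should count as “not a vowel” for the question at hand.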
We could also try to find names made up only of digits by using the pattern one_or_more(DIGIT)
.
# find names only with digits
# for cats
sort(
table(
str_subset(
cats_names_tbl$animals_name_1, pattern = START %R% one_or_more(DIGIT) %R% END)
)
, decreasing = TRUE)
##
## 1 2 30 99
## 1 1 1 1
And for dogs:
# find names only with digits
# for dogs
sort(
table(
str_subset(
dogs_names_tbl$animals_name_1, pattern = START %R% one_or_more(DIGIT) %R% END)
)
, decreasing = TRUE)
## integer(0)
For dogs, we do not find first names with digits only.
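Additionally, we could look for first names that combine letters and digits. The pattern below matches a word character followed by one or more digits at the end of the name (a word character can itself be a digit, so purely numeric names match too), and is built with the help of the function capture(WRD)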
:
# capture first names that have words and digits
# for cats
sort(
table(
str_subset(
cats_names_tbl$animals_name_1, pattern = capture(WRD) %R% one_or_more(DIGIT) %R% END)
)
, decreasing = TRUE)
##
## cat2 testcat2 30 99 c80 cat1 charlene2
## 3 2 1 1 1 1 1
## deb0 jojo3 k2 number1 oscar2 slasher2 sydney2
## 1 1 1 1 1 1 1
## tont2
## 1
“Cat2” is the most popular cat name with both words and digits.
Let’s see how this pattern looks for dogs:
# capture first names that have words and digits
# for dogs
sort(
table(
str_subset(
dogs_names_tbl$animals_name_1, pattern = capture(WRD) %R% one_or_more(DIGIT) %R% END)
)
, decreasing = TRUE)
##
## bb8 casper2 cayde6 jacques2 k2 number3 penny2 r2d2
## 1 1 1 1 1 1 1 1
There are 8 dog names with both words and digits, each appearing only once.
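The cat results above include “30” and “99”, which contain no letters at all. A small base R sketch on made-up names (toy_names is hypothetical) shows why, assuming capture(WRD) %R% one_or_more(DIGIT) %R% END behaves like the regular expression \w\d+$ (the capture group is omitted here because it does not change which names match). The second filter is my own stricter variant, not something from the original post.

```r
# Toy names, invented for illustration only
toy_names <- c("cat2", "30", "k2", "r2d2", "bb8", "2pac", "max")

# Assumed plain-regex form of the rebus pattern: a word character
# (letter, digit or underscore) followed by one or more digits at the end
word_then_digits <- grepl("\\w\\d+$", toy_names, perl = TRUE)
wrd_digit_names <- toy_names[word_then_digits]
print(wrd_digit_names)
# [1] "cat2" "30"   "k2"   "r2d2" "bb8"

# Stricter variant: at least one lowercase letter AND a trailing digit
letter_and_digit <- grepl("[a-z]", toy_names) & grepl("\\d$", toy_names, perl = TRUE)
mixed_names <- toy_names[letter_and_digit]
print(mixed_names)
# [1] "cat2" "k2"   "r2d2" "bb8"
```

Since \w also matches digits, “30” passes the first filter; requiring a letter explicitly removes it.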
We can also easily identify first names with repeated letters. We can use the REF1
backreference, which matches again whatever the first capture group matched, to find names with the same letter three times in a row.
# find first names with three repeated letters
# for cats
sort(
table(
str_subset(
cats_names_tbl$animals_name_1, pattern = capture(LOWER) %R% REF1 %R% REF1)
)
, decreasing = TRUE)
##
## copurrrnicus cosettte grrrly katgrrrl prrr
## 1 1 1 1 1
## purrrsia wafffles
## 1 1
And for dogs:
# find first names with three repeated letters
# for dogs
sort(
table(
str_subset(
dogs_names_tbl$animals_name_1, pattern = capture(LOWER) %R% REF1 %R% REF1)
)
, decreasing = TRUE)
##
## dollly ellliah willlow
## 1 1 1
We have 7 cat names with three repeated letters in a row. In dogs, we only observe this pattern 3 times.
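The backreference idea can be sketched in base R as follows, on made-up names (toy_names is hypothetical). I use ([a-z])\1\1 as a stand-in for capture(LOWER) %R% REF1 %R% REF1; treating LOWER as the range a-z is a simplifying assumption on my part.

```r
# Toy names, invented for illustration only
toy_names <- c("prrr", "wafffles", "dollly", "willow", "purr")

# ([a-z]) captures one lowercase letter; each \\1 requires the same
# letter again, so the pattern needs three identical letters in a row
triple_letter <- grepl("([a-z])\\1\\1", toy_names, perl = TRUE)

triple_names <- toy_names[triple_letter]
print(triple_names)
# [1] "prrr"     "wafffles" "dollly"
```

Names with only a doubled letter, such as “willow” and “purr”, do not match because the second backreference finds no third copy.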
To finish, we could use the function exactly
to count how many names match a specific pattern in full, from start to end.
# detect a pattern with exactly
# for dogs
sort(
table(
str_subset(
dogs_names_tbl$animals_name_1, pattern = exactly("lucy"))
)
, decreasing = TRUE)
## lucy
## 358
Thus, the name “Lucy” was given to 358 dogs in Seattle in recent years.
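As a quick sanity check of this idea, the base R sketch below counts whole-name matches on a made-up vector (toy_names is hypothetical). It assumes exactly("lucy") is equivalent to anchoring the pattern at both ends, i.e. ^lucy$, which for a literal string amounts to a plain equality test.

```r
# Toy names, invented for illustration only
toy_names <- c("lucy", "lucy", "lucy lu", "lucille", "lucy")

# Assumed plain-regex form of exactly("lucy"): anchored at start and end
n_regex <- sum(grepl("^lucy$", toy_names))

# For a literal string, the same count comes from a direct comparison
n_equal <- sum(toy_names == "lucy")

print(c(n_regex, n_equal))
# [1] 3 3
```

The anchored form becomes more useful than the equality test once the pattern inside is not a fixed string, for example an alternation of spelling variants.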
Conclusion
This post touched on some of the most important functions for working with strings. I hope you have enjoyed seeing how powerful the stringr
package can be, especially when paired with the rebus
package. Nonetheless, there is still much more to learn when dealing with strings. Keep learning and coding!