Working with Strings in R: Seattle Pet Names

Welcome to the blog. In this post I’ll give a short tutorial on working with strings in R. I’ll show you some of the main functions of the stringr package and the power of the rebus package. The data frame I will be using comes from week 13 of TidyTuesday. It seemed the perfect opportunity for this tutorial, since so much of the data is stored as strings. The data set is called “Seattle Pet Names” and contains the license issue date, name, species, breed, and zip code of the pets registered in Seattle. I’ll focus the analysis on the names given to cats and dogs.

Let’s start the tutorial by loading the needed packages.

library(tidyverse) # wrangling and data visualization
library(kableExtra) # visualize html tables
library(rebus) # string manipulation
library(data.table) # used here to read the data with fread
library(splitstackshape) # split columns
library(lubridate) # dealing with dates and times

We now open the “Seattle Pet Names” file and glimpse it.

# open file
pet_names <- fread("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-26/seattle_pets.csv")


# explore it
glimpse(pet_names)

## Observations: 52,519
## Variables: 7
## $ license_issue_date <chr> "November 16 2018", "November 11 2018", "No...
## $ license_number     <chr> "8002756", "S124529", "903793", "824666", "...
## $ animals_name       <chr> "Wall-E", "Andre", "Mac", "Melb", "Gingersn...
## $ species            <chr> "Dog", "Dog", "Dog", "Cat", "Cat", "Dog", "...
## $ primary_breed      <chr> "Mixed Breed, Medium (up to 44 lbs fully gr...
## $ secondary_breed    <chr> "Mix", "Dachshund, Standard Wire Haired", N...
## $ zip_code           <chr> "98108", "98117", "98136", "98117", "98144"...

We will now create two separate data frames, one for cats and one for dogs.

# cats dataframe
cats_names <- pet_names %>%
  filter(species == "Cat")


# dogs dataframe
dogs_names <- pet_names %>% 
  filter(species == "Dog")

Let’s start manipulating strings! We’ll use two functions from the stringr package: str_remove_all to remove all punctuation from the names, and str_squish to remove excess white space.

# cats new dataframe
cats_names_tbl <- cats_names %>% 
  mutate(
    animals_name = 
      animals_name %>%
      str_remove_all(pattern = "[:punct:]") %>% # remove punctuation
      str_squish()) # remove all excess white space

# explore the data
cats_names_tbl %>%
  head(15) %>% 
  kable()

license_issue_date	license_number	animals_name	species	primary_breed	secondary_breed	zip_code
November 23 2018	824666	Melb	Cat	Domestic Shorthair	NA	98117
December 30 2018	S119138	Gingersnap	Cat	Domestic Shorthair	Mix	98144
August 09 2018	S142558	Sebastian	Cat	Domestic Shorthair	Mix	98122
August 20 2018	S142546	Madeline	Cat	Domestic Shorthair	Mix	98105
December 08 2018	S123830	Cleo	Cat	Domestic Shorthair	NA	98199
October 20 2018	S149153	Glitch	Cat	Siamese	Domestic Medium Hair	98122
November 24 2018	817137	Candy	Cat	Domestic Shorthair	NA	98126
December 07 2018	895346	Cinnamon	Cat	Domestic Shorthair	NA	98144
October 31 2018	S123360	Sydney2	Cat	Domestic Medium Hair	NA	98101
October 23 2018	S122244	Calvin	Cat	American Shorthair	NA	98119
November 15 2018	8002730	Mochi	Cat	Domestic Medium Hair	Mix	98105
November 27 2018	S125276	dmh	Cat	Domestic Medium Hair	Mix	98117
December 20 2018	952291	Justin	Cat	Domestic Medium Hair	NA	98122
December 21 2018	S150217	Dash	Cat	Domestic Shorthair	NA	98117
December 10 2018	8003722	Buster	Cat	Domestic Shorthair	Mix	98115

# dogs new dataframe
dogs_names_tbl <- dogs_names %>% 
  mutate(
    animals_name = 
      animals_name %>%
      str_remove_all(pattern = "[:punct:]") %>% # remove punctuation
      str_squish()) # remove all excess white space

dogs_names_tbl %>%
  head(15) %>% 
  kable()

license_issue_date	license_number	animals_name	species	primary_breed	secondary_breed	zip_code
November 16 2018	8002756	WallE	Dog	Mixed Breed, Medium (up to 44 lbs fully grown)	Mix	98108
November 11 2018	S124529	Andre	Dog	Terrier, Jack Russell	Dachshund, Standard Wire Haired	98117
November 21 2018	903793	Mac	Dog	Retriever, Labrador	NA	98136
December 16 2018	S138529	Cody	Dog	Retriever, Labrador	NA	98103
October 04 2017	580652	Millie	Dog	Terrier, Boston	NA	98115
December 23 2018	961052	Sabre	Dog	Terrier	NA	98126
December 07 2018	S125461	Thomas	Dog	Chihuahua, Short Coat	Mix	98177
November 07 2018	8002543	Lulu	Dog	Vizsla, Smooth Haired	Mix	98105
December 15 2018	S138838	Milo	Dog	Boxer	Retriever, Labrador	98109
November 27 2018	S123980	Anubis	Dog	Poodle, Standard	NA	98112
October 25 2018	830506	Skylar	Dog	Border Collie	Terrier, Airedale	98144
October 23 2018	S137719	Cleo	Dog	Bernese Mountain Dog	Spaniel	98107
November 07 2018	905090	Petey	Dog	Pomeranian	Shih Tzu	98119
December 24 2018	S152290	Kaia	Dog	Karelian Bear Dog	NA	98117
December 27 2018	8004142	Maya	Dog	Chihuahua, Short Coat	NA	98126
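To see what the two cleaning steps do in isolation, here is a minimal sketch on a few made-up names (the vector below is an illustrative assumption, not taken from the data set):

```r
library(stringr)

# made-up names with punctuation and irregular spacing
messy_names <- c("Wall-E", "  O'Brien   Jr. ", "Mr.  Whiskers")

# step 1: drop every punctuation character
no_punct <- str_remove_all(messy_names, pattern = "[:punct:]")

# step 2: trim the ends and collapse repeated inner spaces to one
clean_names <- str_squish(no_punct)

clean_names
```

Note that removing punctuation joins hyphenated parts together, which is why “Wall-E” appears as “WallE” in the table above.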

One of our goals is to know how many words each name contains. For this, we can use the str_count function with the regex pattern \\w+, which matches runs of word characters, so white space is not counted as a word.

# count the number of words 
# for cats
cats_names_tbl %>% 
  mutate(count_words = str_count(animals_name, pattern = "\\w+")) %>%
  arrange(desc(count_words)) %>%
  select(animals_name, count_words) %>%
  head()

##                                    animals_name count_words
## 1         King Charles Leon the First aka Chuck           7
## 2           Lady G Lolo Paloma MacGuffie Hunter           6
## 3 Her Ladyship Princess Penelope Peachfuzz Howe           6
## 4               Morris Boney T MacGuffie Hunter           5
## 5                Fern River Brits Gouda Reserve           5
## 6            Jazzmine Primula Rosamund the Fair           5

# for dogs
dogs_names_tbl %>% 
  mutate(count_words = str_count(animals_name, pattern = "\\w+")) %>%
  arrange(desc(count_words)) %>%
  select(animals_name, count_words) %>%
  head()

##                                              animals_name count_words
## 1            Little Miss Dublin Maeve of the Emerald Isle           8
## 2              Legends of Olde Sir Walter The Lady Killer           8
## 3                   Cascade Mountains Out of a Dream BRIA           7
## 4 NuitÂ AhathoorÂ HecateÂ SapphoÂ JezebelÂ Lilith Crowley           7
## 5                   His Royal Highness the Duke of Tacoma           7
## 6           Lady Kassandra Yu Countess of Wallingford KBE           7

So, the cat name with the most words has 7 of them: “King Charles Leon the First aka Chuck”. The dog names with the most words have 8 each: “Little Miss Dublin Maeve of the Emerald Isle” and “Legends of Olde Sir Walter The Lady Killer”.
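A minimal sketch of how str_count behaves with the \\w+ pattern, using made-up names rather than the real data:

```r
library(stringr)

# made-up names: one word, three words, irregular spacing, and a hyphen
toy_names <- c("luna", "king charles leon", "  mr   whiskers ", "wall-e")

# each run of word characters counts as one match
word_counts <- str_count(toy_names, pattern = "\\w+")

word_counts
```

Extra spaces do not add to the count, but a hyphen splits a name into two matches. Removing punctuation before counting, as done earlier, avoids that.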

Suppose that, instead of the number of words, we wanted to know the number of characters in each pet’s name. In this scenario, we could use the str_length function. First, we use str_remove_all to remove the white spaces, and then apply str_length.

# find the number of characters
# for cats
cats_names_tbl %>% 
  mutate(animals_name_rem = animals_name %>% str_remove_all(pattern = " "),
         number_char = animals_name_rem %>% str_length()) %>%
  select(animals_name, number_char) %>% 
  arrange(desc(number_char)) %>%
  head()

##                                    animals_name number_char
## 1 Her Ladyship Princess Penelope Peachfuzz Howe          40
## 2         King Charles Leon the First aka Chuck          31
## 3           Lady G Lolo Paloma MacGuffie Hunter          30
## 4            Jazzmine Primula Rosamund the Fair          30
## 5              Ginger Chanel OBrien SpletzKauff          29
## 6               Samantha Scully HammerstromGuel          29

# for dogs
dogs_names_tbl %>% 
  mutate(animals_name_rem = animals_name %>% str_remove_all(pattern = " "),
         number_char = animals_name_rem %>% str_length()) %>%
  select(animals_name, number_char) %>% 
  arrange(desc(number_char)) %>%
  head()

##                                              animals_name number_char
## 1 NuitÂ AhathoorÂ HecateÂ SapphoÂ JezebelÂ Lilith Crowley          49
## 2             Greta McGonagall Galactica Sunnydale Fugent          39
## 3           Lady Kassandra Yu Countess of Wallingford KBE          39
## 4            Little Miss Dublin Maeve of the Emerald Isle          37
## 5              Legends of Olde Sir Walter The Lady Killer          35
## 6                   Lotus Birdie Snufflepupagus Underfoot          34

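The longest cat name has 40 characters, while the longest dog name is reported as 49. Note, however, that this dog name contains five stray “Â” characters (encoding debris, most likely from non-breaking spaces in the raw file), which str_remove_all(pattern = " ") does not remove; the name itself has 44 letters.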
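Removing only the plain space character leaves other whitespace-like characters in the string, and str_length counts them. A minimal sketch with a made-up name containing a non-breaking space (the name and the " |\u00a0" pattern are illustrative assumptions, not part of the original analysis):

```r
library(stringr)

# made-up name: a non-breaking space (\u00a0) after "lady", a plain space after "luna"
toy_name <- "lady\u00a0luna moon"

# removing only plain spaces keeps the non-breaking space
only_spaces <- str_remove_all(toy_name, pattern = " ")

# removing plain OR non-breaking spaces leaves just the letters
both_kinds <- str_remove_all(toy_name, pattern = " |\u00a0")

str_length(only_spaces)
str_length(both_kinds)
```

The first count is one higher than the second, because the non-breaking space is still in the string.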

We can also convert a string to upper case or lower case. For this, we can use the str_to_upper and str_to_lower functions, respectively.

Note: From now on we’ll use the pets’ names in lower case.

# to upper case
# for cats
cats_names_tbl %>% 
  mutate(animals_name = animals_name %>% str_to_upper()) %>%
  head()

##   license_issue_date license_number animals_name species
## 1   November 23 2018         824666         MELB     Cat
## 2   December 30 2018        S119138   GINGERSNAP     Cat
## 3     August 09 2018        S142558    SEBASTIAN     Cat
## 4     August 20 2018        S142546     MADELINE     Cat
## 5   December 08 2018        S123830         CLEO     Cat
## 6    October 20 2018        S149153       GLITCH     Cat
##        primary_breed      secondary_breed zip_code
## 1 Domestic Shorthair                 <NA>    98117
## 2 Domestic Shorthair                  Mix    98144
## 3 Domestic Shorthair                  Mix    98122
## 4 Domestic Shorthair                  Mix    98105
## 5 Domestic Shorthair                 <NA>    98199
## 6            Siamese Domestic Medium Hair    98122

# for dogs
dogs_names_tbl %>% 
  mutate(animals_name = animals_name %>% str_to_upper()) %>%
  head()

##   license_issue_date license_number animals_name species
## 1   November 16 2018        8002756        WALLE     Dog
## 2   November 11 2018        S124529        ANDRE     Dog
## 3   November 21 2018         903793          MAC     Dog
## 4   December 16 2018        S138529         CODY     Dog
## 5    October 04 2017         580652       MILLIE     Dog
## 6   December 23 2018         961052        SABRE     Dog
##                                    primary_breed
## 1 Mixed Breed, Medium (up to 44 lbs fully grown)
## 2                          Terrier, Jack Russell
## 3                            Retriever, Labrador
## 4                            Retriever, Labrador
## 5                                Terrier, Boston
## 6                                        Terrier
##                   secondary_breed zip_code
## 1                             Mix    98108
## 2 Dachshund, Standard Wire Haired    98117
## 3                            <NA>    98136
## 4                            <NA>    98103
## 5                            <NA>    98115
## 6                            <NA>    98126

# to lower case
# for cats
cats_names_tbl <- cats_names_tbl %>% 
  mutate(animals_name = animals_name %>% str_to_lower()) 

# for dogs
dogs_names_tbl <- dogs_names_tbl %>% 
  mutate(animals_name = animals_name %>% str_to_lower())
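One reason to settle on a single case before counting names is that differently cased spellings are otherwise treated as different strings. A minimal sketch on made-up names (the vector is an assumption for illustration):

```r
library(stringr)

# made-up names: three spellings of "luna" plus one other name
toy_names <- c("Luna", "LUNA", "luna", "Lucy")

# distinct values before and after lower-casing
n_before <- length(unique(toy_names))
n_after  <- length(unique(str_to_lower(toy_names)))

n_before
n_after
```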

Before any further analysis, we will split the pets’ names into separate columns, one per word. We can do this with tidyr’s separate function (using str_c to build the new column names) or with stringr’s str_split function, but to be honest I prefer the cSplit function from the splitstackshape package.

# separate columns with separate(), using str_c to build the column names
# for cats
cats_names_tbl %>% 
  separate(animals_name,
           into = str_c("animals_name", 1:7), # 7 because that's the maximum number of words
           sep = " ",
           remove = FALSE,
           extra = "drop") %>%
  head()

## Warning: Expected 7 pieces. Missing pieces filled with `NA` in 16887
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].

##   license_issue_date license_number animals_name animals_name1
## 1   November 23 2018         824666         melb          melb
## 2   December 30 2018        S119138   gingersnap    gingersnap
## 3     August 09 2018        S142558    sebastian     sebastian
## 4     August 20 2018        S142546     madeline      madeline
## 5   December 08 2018        S123830         cleo          cleo
## 6    October 20 2018        S149153       glitch        glitch
##   animals_name2 animals_name3 animals_name4 animals_name5 animals_name6
## 1          <NA>          <NA>          <NA>          <NA>          <NA>
## 2          <NA>          <NA>          <NA>          <NA>          <NA>
## 3          <NA>          <NA>          <NA>          <NA>          <NA>
## 4          <NA>          <NA>          <NA>          <NA>          <NA>
## 5          <NA>          <NA>          <NA>          <NA>          <NA>
## 6          <NA>          <NA>          <NA>          <NA>          <NA>
##   animals_name7 species      primary_breed      secondary_breed zip_code
## 1          <NA>     Cat Domestic Shorthair                 <NA>    98117
## 2          <NA>     Cat Domestic Shorthair                  Mix    98144
## 3          <NA>     Cat Domestic Shorthair                  Mix    98122
## 4          <NA>     Cat Domestic Shorthair                  Mix    98105
## 5          <NA>     Cat Domestic Shorthair                 <NA>    98199
## 6          <NA>     Cat            Siamese Domestic Medium Hair    98122

# separate columns with separate(), using str_c to build the column names
# for dogs
dogs_names_tbl %>% 
  separate(animals_name,
           into = str_c("animals_name", 1:8), # 8 because that's the maximum number of words
           sep = " ",
           remove = FALSE,
           extra = "drop") %>%
  head()

## Warning: Expected 8 pieces. Missing pieces filled with `NA` in 35103
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].

##   license_issue_date license_number animals_name animals_name1
## 1   November 16 2018        8002756        walle         walle
## 2   November 11 2018        S124529        andre         andre
## 3   November 21 2018         903793          mac           mac
## 4   December 16 2018        S138529         cody          cody
## 5    October 04 2017         580652       millie        millie
## 6   December 23 2018         961052        sabre         sabre
##   animals_name2 animals_name3 animals_name4 animals_name5 animals_name6
## 1          <NA>          <NA>          <NA>          <NA>          <NA>
## 2          <NA>          <NA>          <NA>          <NA>          <NA>
## 3          <NA>          <NA>          <NA>          <NA>          <NA>
## 4          <NA>          <NA>          <NA>          <NA>          <NA>
## 5          <NA>          <NA>          <NA>          <NA>          <NA>
## 6          <NA>          <NA>          <NA>          <NA>          <NA>
##   animals_name7 animals_name8 species
## 1          <NA>          <NA>     Dog
## 2          <NA>          <NA>     Dog
## 3          <NA>          <NA>     Dog
## 4          <NA>          <NA>     Dog
## 5          <NA>          <NA>     Dog
## 6          <NA>          <NA>     Dog
##                                    primary_breed
## 1 Mixed Breed, Medium (up to 44 lbs fully grown)
## 2                          Terrier, Jack Russell
## 3                            Retriever, Labrador
## 4                            Retriever, Labrador
## 5                                Terrier, Boston
## 6                                        Terrier
##                   secondary_breed zip_code
## 1                             Mix    98108
## 2 Dachshund, Standard Wire Haired    98117
## 3                            <NA>    98136
## 4                            <NA>    98103
## 5                            <NA>    98115
## 6                            <NA>    98126

# separate columns with str_split
# for cats
splitting1 <- as.data.frame(str_split(cats_names_tbl$animals_name, fixed(" "), n = 7, simplify = TRUE)) # 7 because that's the maximum number of words

cats_after_split <- bind_cols(cats_names_tbl, splitting1)

# separate columns with str_split
# for dogs
splitting2 <- as.data.frame(str_split(dogs_names_tbl$animals_name, fixed(" "), n = 8, simplify = TRUE)) # 8 because that's the maximum number of words

dogs_after_split <- bind_cols(dogs_names_tbl, splitting2)
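To make the two helper calls above easier to follow, here is a minimal sketch on made-up names showing what str_c and str_split return on their own (the names and the choice of n = 3 are assumptions for illustration):

```r
library(stringr)

# made-up names with three words and one word
toy_names <- c("king charles leon", "luna")

# str_c pastes the prefix onto each number, giving one column name per piece
col_names <- str_c("animals_name", 1:3)

# simplify = TRUE returns a character matrix, one row per name;
# names with fewer words are padded with empty strings
pieces <- str_split(toy_names, fixed(" "), n = 3, simplify = TRUE)

col_names
pieces
```

One difference worth keeping in mind: separate fills missing pieces with NA, while str_split with simplify = TRUE fills them with empty strings.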

With the cSplit function, it gets simpler.

# with the cSplit function
# for cats
cats_names_tbl <- cats_names_tbl %>%
  cSplit("animals_name", sep = " ", drop = FALSE) # separate animals_name column, but keep it

cats_names_tbl %>%
  head(20)

##     license_issue_date license_number animals_name species
##  1:   November 23 2018         824666         melb     Cat
##  2:   December 30 2018        S119138   gingersnap     Cat
##  3:     August 09 2018        S142558    sebastian     Cat
##  4:     August 20 2018        S142546     madeline     Cat
##  5:   December 08 2018        S123830         cleo     Cat
##  6:    October 20 2018        S149153       glitch     Cat
##  7:   November 24 2018         817137        candy     Cat
##  8:   December 07 2018         895346     cinnamon     Cat
##  9:    October 31 2018        S123360      sydney2     Cat
## 10:    October 23 2018        S122244       calvin     Cat
## 11:   November 15 2018        8002730        mochi     Cat
## 12:   November 27 2018        S125276          dmh     Cat
## 13:   December 20 2018         952291       justin     Cat
## 14:   December 21 2018        S150217         dash     Cat
## 15:   December 10 2018        8003722       buster     Cat
## 16:    October 21 2018        S136705         monk     Cat
## 17:    October 29 2018        S137285         mari     Cat
## 18:   November 19 2018        S124303         peko     Cat
## 19:   November 07 2018         896862        sajah     Cat
## 20:   November 12 2018         896813         nami     Cat
##            primary_breed      secondary_breed zip_code animals_name_1
##  1:   Domestic Shorthair                 <NA>    98117           melb
##  2:   Domestic Shorthair                  Mix    98144     gingersnap
##  3:   Domestic Shorthair                  Mix    98122      sebastian
##  4:   Domestic Shorthair                  Mix    98105       madeline
##  5:   Domestic Shorthair                 <NA>    98199           cleo
##  6:              Siamese Domestic Medium Hair    98122         glitch
##  7:   Domestic Shorthair                 <NA>    98126          candy
##  8:   Domestic Shorthair                 <NA>    98144       cinnamon
##  9: Domestic Medium Hair                 <NA>    98101        sydney2
## 10:   American Shorthair                 <NA>    98119         calvin
## 11: Domestic Medium Hair                  Mix    98105          mochi
## 12: Domestic Medium Hair                  Mix    98117            dmh
## 13: Domestic Medium Hair                 <NA>    98122         justin
## 14:   Domestic Shorthair                 <NA>    98117           dash
## 15:   Domestic Shorthair                  Mix    98115         buster
## 16: Domestic Medium Hair                  Mix    98122           monk
## 17:   Domestic Shorthair                 <NA>    98144           mari
## 18: Domestic Medium Hair                 <NA>    98117           peko
## 19: Domestic Medium Hair                 <NA>    98112          sajah
## 20:   Domestic Shorthair                 <NA>    98107           nami
##     animals_name_2 animals_name_3 animals_name_4 animals_name_5
##  1:           <NA>           <NA>           <NA>           <NA>
##  2:           <NA>           <NA>           <NA>           <NA>
##  3:           <NA>           <NA>           <NA>           <NA>
##  4:           <NA>           <NA>           <NA>           <NA>
##  5:           <NA>           <NA>           <NA>           <NA>
##  6:           <NA>           <NA>           <NA>           <NA>
##  7:           <NA>           <NA>           <NA>           <NA>
##  8:           <NA>           <NA>           <NA>           <NA>
##  9:           <NA>           <NA>           <NA>           <NA>
## 10:           <NA>           <NA>           <NA>           <NA>
## 11:           <NA>           <NA>           <NA>           <NA>
## 12:           <NA>           <NA>           <NA>           <NA>
## 13:           <NA>           <NA>           <NA>           <NA>
## 14:           <NA>           <NA>           <NA>           <NA>
## 15:           <NA>           <NA>           <NA>           <NA>
## 16:           <NA>           <NA>           <NA>           <NA>
## 17:           <NA>           <NA>           <NA>           <NA>
## 18:           <NA>           <NA>           <NA>           <NA>
## 19:           <NA>           <NA>           <NA>           <NA>
## 20:           <NA>           <NA>           <NA>           <NA>
##     animals_name_6 animals_name_7
##  1:           <NA>           <NA>
##  2:           <NA>           <NA>
##  3:           <NA>           <NA>
##  4:           <NA>           <NA>
##  5:           <NA>           <NA>
##  6:           <NA>           <NA>
##  7:           <NA>           <NA>
##  8:           <NA>           <NA>
##  9:           <NA>           <NA>
## 10:           <NA>           <NA>
## 11:           <NA>           <NA>
## 12:           <NA>           <NA>
## 13:           <NA>           <NA>
## 14:           <NA>           <NA>
## 15:           <NA>           <NA>
## 16:           <NA>           <NA>
## 17:           <NA>           <NA>
## 18:           <NA>           <NA>
## 19:           <NA>           <NA>
## 20:           <NA>           <NA>

# with the cSplit function
# for dogs
dogs_names_tbl <- dogs_names_tbl %>%
  cSplit("animals_name", sep = " ", drop = FALSE) # separate animals_name column, but keep it


dogs_names_tbl %>%
  head(20)

##     license_issue_date license_number animals_name species
##  1:   November 16 2018        8002756        walle     Dog
##  2:   November 11 2018        S124529        andre     Dog
##  3:   November 21 2018         903793          mac     Dog
##  4:   December 16 2018        S138529         cody     Dog
##  5:    October 04 2017         580652       millie     Dog
##  6:   December 23 2018         961052        sabre     Dog
##  7:   December 07 2018        S125461       thomas     Dog
##  8:   November 07 2018        8002543         lulu     Dog
##  9:   December 15 2018        S138838         milo     Dog
## 10:   November 27 2018        S123980       anubis     Dog
## 11:    October 25 2018         830506       skylar     Dog
## 12:    October 23 2018        S137719         cleo     Dog
## 13:   November 07 2018         905090        petey     Dog
## 14:   December 24 2018        S152290         kaia     Dog
## 15:   December 27 2018        8004142         maya     Dog
## 16:   November 18 2018         835221      shirley     Dog
## 17:    October 18 2018        S101544       diesel     Dog
## 18:   November 30 2018         904438        jacob     Dog
## 19:   November 06 2018        8002506       linkin     Dog
## 20:   December 02 2018         829922       gracie     Dog
##                                      primary_breed
##  1: Mixed Breed, Medium (up to 44 lbs fully grown)
##  2:                          Terrier, Jack Russell
##  3:                            Retriever, Labrador
##  4:                            Retriever, Labrador
##  5:                                Terrier, Boston
##  6:                                        Terrier
##  7:                          Chihuahua, Short Coat
##  8:                          Vizsla, Smooth Haired
##  9:                                          Boxer
## 10:                               Poodle, Standard
## 11:                                  Border Collie
## 12:                           Bernese Mountain Dog
## 13:                                     Pomeranian
## 14:                              Karelian Bear Dog
## 15:                          Chihuahua, Short Coat
## 16:                            Australian Shepherd
## 17:                                       Shepherd
## 18:                                German Shepherd
## 19:                          Vizsla, Smooth Haired
## 20:                           Terrier, Fox, Smooth
##                     secondary_breed zip_code animals_name_1 animals_name_2
##  1:                             Mix    98108          walle           <NA>
##  2: Dachshund, Standard Wire Haired    98117          andre           <NA>
##  3:                            <NA>    98136            mac           <NA>
##  4:                            <NA>    98103           cody           <NA>
##  5:                            <NA>    98115         millie           <NA>
##  6:                            <NA>    98126          sabre           <NA>
##  7:                             Mix    98177         thomas           <NA>
##  8:                             Mix    98105           lulu           <NA>
##  9:             Retriever, Labrador    98109           milo           <NA>
## 10:                            <NA>    98112         anubis           <NA>
## 11:               Terrier, Airedale    98144         skylar           <NA>
## 12:                         Spaniel    98107           cleo           <NA>
## 13:                        Shih Tzu    98119          petey           <NA>
## 14:                            <NA>    98117           kaia           <NA>
## 15:                            <NA>    98126           maya           <NA>
## 16:               Retriever, Golden    98144        shirley           <NA>
## 17:                             Mix    98118         diesel           <NA>
## 18:                      Rottweiler    98199          jacob           <NA>
## 19:                            <NA>    98105         linkin           <NA>
## 20:                            <NA>    98118         gracie           <NA>
##     animals_name_3 animals_name_4 animals_name_5 animals_name_6
##  1:           <NA>           <NA>           <NA>           <NA>
##  2:           <NA>           <NA>           <NA>           <NA>
##  3:           <NA>           <NA>           <NA>           <NA>
##  4:           <NA>           <NA>           <NA>           <NA>
##  5:           <NA>           <NA>           <NA>           <NA>
##  6:           <NA>           <NA>           <NA>           <NA>
##  7:           <NA>           <NA>           <NA>           <NA>
##  8:           <NA>           <NA>           <NA>           <NA>
##  9:           <NA>           <NA>           <NA>           <NA>
## 10:           <NA>           <NA>           <NA>           <NA>
## 11:           <NA>           <NA>           <NA>           <NA>
## 12:           <NA>           <NA>           <NA>           <NA>
## 13:           <NA>           <NA>           <NA>           <NA>
## 14:           <NA>           <NA>           <NA>           <NA>
## 15:           <NA>           <NA>           <NA>           <NA>
## 16:           <NA>           <NA>           <NA>           <NA>
## 17:           <NA>           <NA>           <NA>           <NA>
## 18:           <NA>           <NA>           <NA>           <NA>
## 19:           <NA>           <NA>           <NA>           <NA>
## 20:           <NA>           <NA>           <NA>           <NA>
##     animals_name_7 animals_name_8
##  1:           <NA>           <NA>
##  2:           <NA>           <NA>
##  3:           <NA>           <NA>
##  4:           <NA>           <NA>
##  5:           <NA>           <NA>
##  6:           <NA>           <NA>
##  7:           <NA>           <NA>
##  8:           <NA>           <NA>
##  9:           <NA>           <NA>
## 10:           <NA>           <NA>
## 11:           <NA>           <NA>
## 12:           <NA>           <NA>
## 13:           <NA>           <NA>
## 14:           <NA>           <NA>
## 15:           <NA>           <NA>
## 16:           <NA>           <NA>
## 17:           <NA>           <NA>
## 18:           <NA>           <NA>
## 19:           <NA>           <NA>
## 20:           <NA>           <NA>

Let us now find the most common first and last letters used in the first name for both cats and dogs. For this, we will use the str_sub function.

# first letter
# for cats
cats_names_tbl %>% 
  mutate(first_letter = animals_name_1 %>% str_sub(1,1)) %>%
  count(first_letter) %>%
  arrange(desc(n)) %>%
  head()

## # A tibble: 6 x 2
##   first_letter     n
##   <chr>        <int>
## 1 m             1998
## 2 s             1831
## 3 b             1356
## 4 c             1234
## 5 l             1142
## 6 p             1095

# for dogs
dogs_names_tbl %>% 
  mutate(first_letter = animals_name_1 %>% str_sub(1,1)) %>%
  count(first_letter) %>%
  arrange(desc(n)) %>%
  head()

## # A tibble: 6 x 2
##   first_letter     n
##   <chr>        <int>
## 1 m             3408
## 2 b             3366
## 3 s             3042
## 4 c             2760
## 5 l             2712
## 6 r             2199

# last letter
# for cats
cats_names_tbl %>% 
  mutate(last_letter = animals_name_1 %>% str_sub(-1,-1)) %>%
  count(last_letter) %>%
  arrange(desc(n)) %>%
  head()

## # A tibble: 6 x 2
##   last_letter     n
##   <chr>       <int>
## 1 y            2881
## 2 e            2855
## 3 a            2406
## 4 r            1368
## 5 o            1209
## 6 s            1110

# for dogs
dogs_names_tbl %>% 
  mutate(last_letter = animals_name_1 %>% str_sub(-1,-1)) %>%
  count(last_letter) %>%
  arrange(desc(n)) %>%
  head()

## # A tibble: 6 x 2
##   last_letter     n
##   <chr>       <int>
## 1 y            7373
## 2 e            6575
## 3 a            4839
## 4 r            3007
## 5 o            2684
## 6 n            2159

“M” is the most common first letter for both cats and dogs. Curiously, “Y” is the most common last letter for both as well.
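The positional arguments of str_sub can be sketched on a few made-up names: positive positions count from the start of the string and negative positions count from the end.

```r
library(stringr)

# made-up names for illustration
toy_names <- c("luna", "charlie", "max")

# characters 1 to 1: the first letter
first_letter <- str_sub(toy_names, 1, 1)

# characters -1 to -1: the last letter, counted from the end
last_letter <- str_sub(toy_names, -1, -1)

first_letter
last_letter
```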

After this, our goal is to find the most common names for cats and dogs.

# most common name
# for cats
cats_names_tbl %>%
  filter(!is.na(animals_name_1)) %>% 
  count(animals_name_1) %>%
  arrange(desc(n)) %>%
  head()

## # A tibble: 6 x 2
##   animals_name_1     n
##   <fct>          <int>
## 1 luna             116
## 2 lucy             104
## 3 lily              91
## 4 charlie           88
## 5 max               86
## 6 bella             84

# for dogs
dogs_names_tbl %>% 
  filter(!is.na(animals_name_1)) %>% 
  count(animals_name_1) %>%
  arrange(desc(n)) %>%
  head()

## # A tibble: 6 x 2
##   animals_name_1     n
##   <fct>          <int>
## 1 lucy             358
## 2 charlie          325
## 3 bella            273
## 4 luna             261
## 5 daisy            244
## 6 cooper           199

“Luna” is the most popular name for cats, while “Lucy” is the name given most often to dogs.
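As a side note, dplyr’s count has a sort argument that orders the result by frequency, which can replace the separate arrange(desc(n)) step. A minimal sketch on a made-up table (the data below is an assumption for illustration):

```r
library(dplyr)

# made-up first names, including one missing value
toy_pets <- tibble(
  animals_name_1 = c("luna", "lucy", "luna", NA, "max", "luna", "lucy")
)

# drop missing names, then count and sort in one call
top_names <- toy_pets %>%
  filter(!is.na(animals_name_1)) %>%
  count(animals_name_1, sort = TRUE)

top_names
```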

Back to stringr: we can use str_subset to find the names that contain a specific pattern.

# find unique first names with the pattern "su"
# for cats
unique(str_subset(cats_names_tbl$animals_name_1, pattern = "su"))

##  [1] "sugar"      "summer"     "sunny"      "suzie"      "sufjan"    
##  [6] "susie"      "sushi"      "sunbeam"    "sunset"     "suki"      
## [11] "susu"       "natsumi"    "ursula"     "suri"       "graysun"   
## [16] "sultana"    "suess"      "tsunami"    "sisu"       "sumi"      
## [21] "sundae"     "sukapati"   "sullivan"   "suika"      "sultrina"  
## [26] "katsu"      "sula"       "sumo"       "suede"      "suma"      
## [31] "sundance"   "sunshine"   "sugarpuffs" "sukhi"      "sunkist"   
## [36] "sunflower"  "sunnyside"  "sutton"     "sucha"      "yasuo"     
## [41] "sully"      "mitsu"      "suji"       "sunday"     "sukie"     
## [46] "satsuma"    "zasu"       "misu"       "suzykitty"  "surya"     
## [51] "natsu"      "nasuchan"   "suzy"       "tonkatsu"   "suzi"      
## [56] "susan"      "superman"   "suva"       "suleyman"   "tsuki"     
## [61] "sulley"

# for dogs
unique(str_subset(dogs_names_tbl$animals_name_1, pattern = "su"))

##  [1] "sue"          "suzie"        "suki"         "sully"       
##  [5] "susie"        "sunny"        "suzy"         "ursula"      
##  [9] "sushi"        "sugar"        "sun"          "katsu"       
## [13] "mitsu"        "jesus"        "sunshine"     "su"          
## [17] "summer"       "sukey"        "tamotsu"      "sundae"      
## [21] "kitsunejip"   "sunchaser"    "missus"       "suri"        
## [25] "sumi"         "komatsu"      "suggen"       "yousuke"     
## [29] "sundance"     "sudo"         "tsuki"        "sunnee"      
## [33] "sukhi"        "sunday"       "tsunami"      "jessup"      
## [37] "subaru"       "rayasunshine" "susannah"     "tsukee"      
## [41] "bosun"        "sullivan"     "suka"         "subotai"     
## [45] "sunniva"      "suzzie"       "sug"          "zsuska"      
## [49] "asuka"        "sisu"         "sukie"        "sunup"       
## [53] "sucia"        "suni"         "sutter"       "suzieq"      
## [57] "subi"         "surti"        "sumo"         "suze"        
## [61] "sugarberry"   "jitsu"        "summit"       "sukha"       
## [65] "sudi"         "sultan"       "kitsune"      "susi"        
## [69] "colossus"     "sukoshi"      "narcissus"    "sua"         
## [73] "kietsu"       "sugi"         "suma"         "sukee"       
## [77] "misu"         "suzee"        "consuela"

So we have 61 unique cat names containing the pattern “su”. The pattern is more common among dogs, with 79 unique names.
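
To make the behaviour of str_subset concrete, here is a minimal sketch on a made-up vector (toy_names and the other object names are hypothetical, not part of the Seattle data). It suggests that str_subset returns the matching strings themselves, much like filtering with str_detect, and that unique then removes the repeats.

```r
library(stringr)

# hypothetical toy vector, not taken from the Seattle data
toy_names <- c("sugar", "luna", "katsu", "max", "sugar")

# str_subset keeps the strings that contain the pattern anywhere
su_names <- str_subset(toy_names, pattern = "su")

# same result as filtering with str_detect
su_names_detect <- toy_names[str_detect(toy_names, pattern = "su")]

# unique() then drops the repeated "sugar"
su_unique <- unique(su_names)
```

Because the pattern is not anchored, it matches at the start of a name (“sugar”) as well as at the end (“katsu”).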

str_which is another interesting function: it tells us in which rows a specific pattern occurs.

# find a pattern row location
# for cats
str_which(cats_names_tbl$animals_name_1, pattern = "car")

##   [1]    53   180   271   397   449   811   874  1100  1137  1181  1241
##  [12]  1308  1398  1421  1543  1585  1664  1682  2163  2909  3006  3120
##  [23]  3188  3207  3361  3375  3442  3571  3849  3899  4047  4288  4388
##  [34]  4513  4823  4976  5028  5060  5104  5197  5287  5421  5456  5640
##  [45]  5647  5690  5691  5769  5814  5820  6131  6144  6235  6455  6501
##  [56]  6685  6829  7027  7079  7110  7306  7321  7369  7499  7778  7865
##  [67]  7866  7944  8067  8159  8276  8432  8487  9280  9429  9458  9684
##  [78]  9870  9923 10027 10045 10072 10140 10214 10437 10712 10780 11483
##  [89] 11530 11575 11669 11814 12054 12140 12306 12732 13404 13436 13449
## [100] 13544 13721 13843 14312 14485 14666 14864 15025 15206 15382 15383
## [111] 15751 15819 15995 16304 16369 16622 16634 16800 16893 16945 17126
## [122] 17163

# for dogs
str_which(dogs_names_tbl$animals_name_1, pattern = "car")

The call returns the row positions in dogs_names_tbl, in the same format as the cat output above.
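
As a small illustration of what those row numbers mean, here is a sketch on a made-up vector (toy_names is hypothetical). It assumes nothing beyond stringr and shows that str_which gives the positions of the matches, which can then be used to pull the names back out.

```r
library(stringr)

# hypothetical toy vector, not taken from the Seattle data
toy_names <- c("oscar", "luna", "carl", "max", "scarlett")

# positions of the elements that contain "car"
car_rows <- str_which(toy_names, pattern = "car")

# equivalent to which() applied to str_detect()
car_rows_detect <- which(str_detect(toy_names, pattern = "car"))

# the positions can be used to recover the matching names
car_names <- toy_names[car_rows]
```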

Rebus

In the previous section, we looked at some important functions from the stringr package. Now let’s keep working with stringr, this time together with the rebus package, which makes regular expressions easier to write and read. As a quick example, imagine that we want to count all dog and cat names starting with a specific letter. We just write START %R% followed by the required pattern:

# names that start with L
# for cats
cats_names_tbl %>% 
  mutate(L_char = str_count(animals_name_1, pattern = START %R% "l")) %>%
  summarize(cats_words_start_with_L = sum(L_char, na.rm = TRUE))

##   cats_words_start_with_L
## 1                    1142

# dogs
dogs_names_tbl %>% 
  mutate(L_char = str_count(animals_name_1, pattern = START %R% "l")) %>%
  summarize(dogs_words_start_with_L = sum(L_char, na.rm = TRUE))

##   dogs_words_start_with_L
## 1                    2712

In this case, we have 1142 cat first names starting with the “L” character. For dogs, we have 2712 instances.
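
For readers who are used to plain regular expressions, the sketch below (on a hypothetical toy_names vector) shows how I understand the START pattern to work: it behaves like the "^" anchor, so str_count returns 1 for names starting with “l”, 0 for the others, and NA for missing names.

```r
library(stringr)
library(rebus)

# START is the rebus constant for the "beginning of string" anchor
starts_with_l <- START %R% "l"

# hypothetical toy vector with one missing name
toy_names <- c("luna", "bella", "lily", NA)

# 1 when the name starts with "l", 0 otherwise, NA for the missing name
l_counts <- str_count(toy_names, pattern = starts_with_l)

# summing the counts gives the number of names starting with "l"
n_start_l <- sum(l_counts, na.rm = TRUE)

# the same idea written as a plain regular expression
n_start_l_regex <- sum(str_detect(toy_names, "^l"), na.rm = TRUE)
```

This is why na.rm = TRUE is needed in the summarize step: a single missing name would otherwise turn the whole sum into NA.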

Moving on, if we want to find the first names ending with the pattern “cy”, we can do the following:

# find names that end with the pattern "cy"
# for cats
sort(
  table(
  str_subset(
    cats_names_tbl$animals_name_1, pattern = "cy" %R% END
  )
)
, decreasing = TRUE)

## 
##    lucy   percy    macy  quincy   fancy  clancy   mercy   nancy   darcy 
##     104      13      10       9       6       4       3       3       2 
##   marcy chauncy   gracy    lacy  legacy   peacy   purcy  purrcy   saucy 
##       2       1       1       1       1       1       1       1       1 
##   spicy 
##       1

# for dogs
sort(
  table(
  str_subset(
    dogs_names_tbl$animals_name_1, pattern = "cy" %R% END
  )
)
, decreasing = TRUE)

## 
##   lucy quincy  percy   macy  darcy  fancy clancy  mercy     cy    icy 
##    358     29     16     15      7      4      3      3      2      2 
##   lacy  nancy  spicy   gacy  gancy   kacy  marcy  saucy  stacy  yancy 
##      2      2      2      1      1      1      1      1      1      1

As we can see, we placed the pattern before rebus’s END constant, which indicates that the match must come at the end of the name.
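
Here is a minimal sketch of the same idea on a made-up vector (toy_names is hypothetical), showing that a name merely containing “cy” is not enough; the pattern has to sit at the very end.

```r
library(stringr)
library(rebus)

# "cy" followed by the end-of-string anchor
ends_with_cy <- "cy" %R% END

# hypothetical toy vector: "cyrus" contains "cy" but does not end with it
toy_names <- c("lucy", "cyrus", "percy", "lucy")

# keep only the names ending in "cy"
cy_names <- str_subset(toy_names, pattern = ends_with_cy)

# frequency table, most common name first
cy_counts <- sort(table(cy_names), decreasing = TRUE)
```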

We could also try to find names where the third letter is a “z”. The rebus package makes this easy with the ANY_CHAR constant, which matches any single character.

# first names where the third letter is a "z"
# for cats
sort(
  table(
  str_subset(
    cats_names_tbl$animals_name_1, pattern = START %R% ANY_CHAR %R% ANY_CHAR %R% "z"
  )
)
, decreasing = TRUE)

## 
##       gizmo        izzy       hazel         taz        jazz       jazzy 
##          20          20          17          12          10           9 
##       lizzy       ozzie        enzo       suzie        buzz      lizzie 
##           9           9           6           6           4           4 
##      mozart        zazu       fuzzy       izzie     jezebel        liza 
##           4           4           3           3           3           3 
##       mazie        ouzo        ozzy        zuzu       dizzy        fizz 
##           3           3           3           3           2           2 
##        fuzz         jaz        kaze      kizzie         liz        orzo 
##           2           2           2           2           2           2 
##       pazzo      rizhik      tazzie      wizard       zizou        azzu 
##           2           2           2           2           2           1 
##       bazil     bazique         boz        boze        bozo         caz 
##           1           1           1           1           1           1 
##   cazadores       cazzy     cezanne      dazzle        dezi     dezmond 
##           1           1           1           1           1           1 
##         fez      fezzie      fezzik      fozzie       fozzy fuzzjackson 
##           1           1           1           1           1           1 
##    gazpacho        giza       gozer        gozo       gyzmo        haze 
##           1           1           1           1           1           1 
##    hazelnut       itzel   izzybella        jazl    jazzmine    jezavell 
##           1           1           1           1           1           1 
##    jezebell   jezebelle  jezzabelle      jezzie       kozmo       lazer 
##           1           1           1           1           1           1 
##       lazlo        lazy      lazzlo       mazel     mazinga        mazy 
##           1           1           1           1           1           1 
##       mazzy         miz        mizu         moz     muzette       muzzy 
##           1           1           1           1           1           1 
##       nazee      nozomi      puzzle     puzzles         raz        razz 
##           1           1           1           1           1           1 
##        riza      sazjha        suzi        suzy   suzykitty      tazmin 
##           1           1           1           1           1           1 
##        tazo       tazzy      tizzie         yaz        yuzu        zazi 
##           1           1           1           1           1           1 
##        zizi 
##           1

And now the same pattern, but for dogs.

# for dogs
sort(
  table(
  str_subset(
    dogs_names_tbl$animals_name_1, pattern = START %R% ANY_CHAR %R% ANY_CHAR %R% "z"
  )
)
, decreasing = TRUE)

## 
##         izzy        hazel        gizmo         enzo        ozzie 
##           60           49           46           30           25 
##       lizzie         ozzy         jazz        suzie         zuzu 
##           21           17           14           14           13 
##          taz        dozer        izzie        rizzo         suzy 
##           11           10            8            7            7 
##        jazzy         buzz       fozzie        lizzy        cozmo 
##            6            5            5            5            4 
##        dizzy       fezzik        mazey        mazie      jazmine 
##            4            4            4            4            3 
##      jezebel        lazlo         liza        mazzy       wizard 
##            3            3            3            3            3 
##        zizou        bizzy       dazzle          fez        fuzzy 
##            3            2            2            2            2 
##    izzabella         izze         izzi          jaz      jazmyne 
##            2            2            2            2            2 
##        kizzy        kozmo          liz      nizhoni         orzo 
##            2            2            2            2            2 
##       razzie         tazz       auzzie       bazely         bazi 
##            2            2            1            1            1 
##       bazzle      bezudry       bizkit       bozley        buzzy 
##            1            1            1            1            1 
##      cazaril      cezanne      cozette          dez       dizzle 
##            1            1            1            1            1 
##         elza      fizzgig          foz        fozul        fozzy 
##            1            1            1            1            1 
##         fuzz   fuzzbucket      gazelle       gizzmo        gizzy 
##            1            1            1            1            1 
##        gozer         guzi         haze     hazelnut         hazy 
##            1            1            1            1            1 
##    izzybella     jazzelle        jazzi       jazzie   jezzabelle 
##            1            1            1            1            1 
##   jezzebelle         jozy          kaz     kazanova         kazi 
##            1            1            1            1            1 
##       kazmer         kazu        kezie         kuzi   lizaminion 
##            1            1            1            1            1 
##      lizette        lizzi          luz       mazama        mazel 
##            1            1            1            1            1 
##         mazy         mizu       mizzie       mozart         moze 
##            1            1            1            1            1 
##         mozi       mozzie        mozzy        muzby         nazy 
##            1            1            1            1            1 
##        nazzy        ouzel          pez       puzzle          raz 
##            1            1            1            1            1 
##        razus   razzmatazz         rezi        rezso         rizz 
##            1            1            1            1            1 
##      rozalia        rozsa         sazi          soz         suze 
##            1            1            1            1            1 
##        suzee       suzieq       suzzie       tazman        tazzy 
##            1            1            1            1            1 
## tezcatlipoca        tizzy        tozer        tozzi         tuzi 
##            1            1            1            1            1 
##          twz      wizzard         yzzy        zazie         zazu 
##            1            1            1            1            1 
##        zezza        zizka         zozo 
##            1            1            1
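
To check the logic of this pattern, here is a small sketch on a hypothetical toy_names vector. It compares the rebus pattern with a direct look at the third character through str_sub; both approaches should keep the same names.

```r
library(stringr)
library(rebus)

# start of string, any two characters, then "z"
third_z <- START %R% ANY_CHAR %R% ANY_CHAR %R% "z"

# hypothetical toy vector: "zelda" and "blitz" have a "z" elsewhere
toy_names <- c("hazel", "zelda", "izzy", "blitz", "taz")

# names whose third character is "z"
z_names <- str_subset(toy_names, pattern = third_z)

# the same check done by extracting the third character
z_names_sub <- toy_names[str_sub(toy_names, 3, 3) == "z"]
```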

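We could also match a specific pattern anywhere in the name. For instance, let’s try to find the pattern “cc”. Note that the code below appends ANY_CHAR, so “cc” must be followed by at least one more character; a name ending in “cc” would not be matched.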

# match the pattern "cc"
# for cats
sort(
  table(
  str_subset(
    cats_names_tbl$animals_name_1, pattern = "cc" %R% ANY_CHAR
    )
)
, decreasing = TRUE)

## 
##     gnocchi       rocco   chewbacca       gucci  piccadilly    staccato 
##           6           4           3           2           2           2 
##    zucchini       becca   blenducci   boudiccea      flocca      hiccup 
##           2           1           1           1           1           1 
##       kucci       lucca       lucci  macchiatto      mccone     mccovey 
##           1           1           1           1           1           1 
##    moccasin moustacchio      nicchi     piccolo     rebecca       recco 
##           1           1           1           1           1           1 
##      soccer     torocco     yacchan       zucca 
##           1           1           1           1

“Gnocchi” seems to be the most popular cat name with the “cc” pattern.

# match the pattern "cc"
# for dogs
sort(
  table(
  str_subset(
    dogs_names_tbl$animals_name_1, pattern = "cc" %R% ANY_CHAR
    )
)
, decreasing = TRUE)

## 
##      rocco  chewbacca      lucca      gucci      ricco    bacchus 
##         28         21         17          6          5          4 
##      bocce      zucca   boudicca      becca cappuccino   abbracci 
##          4          4          3          2          2          1 
##      bocca     chicca   chubacca    gnocchi      jacco      macca 
##          1          1          1          1          1          1 
##    macchia  macchiato      mocca      nicca      nicco      nucci 
##          1          1          1          1          1          1 
##     pucchi      pucci    puccini      ricci      rocca      rucca 
##          1          1          1          1          1          1 
##      sacco scaramucci      zacca   zaccheus 
##          1          1          1          1

“Rocco” shows as the most popular dog name where the “cc” pattern is present.
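
One detail worth seeing in action is the effect of ANY_CHAR after “cc”. The sketch below uses a made-up vector (toy_names, with the invented name “macc”) to compare the pattern from the code above with a plain “cc”.

```r
library(stringr)
library(rebus)

# "cc" followed by any single character
cc_then_char <- "cc" %R% ANY_CHAR

# hypothetical toy vector: "macc" ends in "cc"
toy_names <- c("rocco", "gucci", "macc", "lucy")

# requires a character after "cc", so "macc" is dropped
cc_strict <- str_subset(toy_names, pattern = cc_then_char)

# a plain "cc" matches it wherever it occurs
cc_anywhere <- str_subset(toy_names, pattern = "cc")
```

If names ending in “cc” should be counted too, the plain pattern is the safer choice.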

Another option is to match more than one pattern at once. During the data exploration, some pet names turned out to be very similar without being identical, for instance “Lucy”, “Lucie”, and “Luci”. In this case, we can use the or function from the rebus package.

# match multiple patterns with "or"
# for cats
sort(
  table(
  str_subset(
    cats_names_tbl$animals_name_1, pattern = START %R% or("lucy", "lucie", "luci") 
    %R% END)
)
, decreasing = TRUE)

## 
##  lucy  luci lucie 
##   104     3     1

And for dogs:

# match multiple patterns with "or"
# for dogs
sort(
  table(
  str_subset(
    dogs_names_tbl$animals_name_1, pattern = START %R% or("lucy", "lucie", "luci") 
    %R% END)
)
, decreasing = TRUE)

## 
##  lucy lucie  luci 
##   358     9     7
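
The sketch below, on a hypothetical toy_names vector, illustrates why START and END matter here. As far as I can tell, or groups its alternatives, so the anchors apply to every variant and longer names such as “lucille” are left out.

```r
library(stringr)
library(rebus)

# any one of the three spellings, anchored at both ends
lucy_variants <- START %R% or("lucy", "lucie", "luci") %R% END

# hypothetical toy vector with two longer names that should not match
toy_names <- c("lucy", "lucie", "luci", "lucille", "lucy lou")

# only the three exact spellings survive
lucy_names <- str_subset(toy_names, pattern = lucy_variants)
```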

We could also try to find first names made up only of vowels by using the char_class function from the rebus package.

# find first names that correspond only to vowels
# for cats
vow <- char_class("aeiou")
sort(
  table(
  str_subset(
    cats_names_tbl$animals_name_1, pattern = START %R% one_or_more(vow) %R% END)
)
, decreasing = TRUE)

## 
##  o  e io 
##  2  1  1

And for dogs:

# find first names that correspond only to vowels
# for dogs
vow <- char_class("aeiou")
sort(
  table(
  str_subset(
    dogs_names_tbl$animals_name_1, pattern = START %R% one_or_more(vow) %R% END)
)
, decreasing = TRUE)

## 
##   io iuiu    o  uau 
##    2    1    1    1
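
Here is a minimal sketch of the vowel pattern on a made-up vector (toy_names is hypothetical). The point is that one_or_more combined with START and END requires every character in the name to be a vowel, not just some of them.

```r
library(stringr)
library(rebus)

# character class holding the five vowels
vow <- char_class("aeiou")

# one or more vowels and nothing else
only_vowels <- START %R% one_or_more(vow) %R% END

# hypothetical toy vector: "leo" and "zoe" contain consonants
toy_names <- c("io", "o", "leo", "uau", "zoe")

# names made up exclusively of vowels
vowel_names <- str_subset(toy_names, pattern = only_vowels)
```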

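We could also do the reverse by using the negated_char_class function. Keep in mind that it matches any character that is not one of “aeiou”, so digits and the letter “y” count as non-vowels.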

# find names that do not have vowels
# for cats
not_vow <- negated_char_class("aeiou")

sort(
  table(
  str_subset(
    cats_names_tbl$animals_name_1, pattern = START %R% one_or_more(not_vow) %R% END)
)
, decreasing = TRUE)

## 
##     mr  gypsy     ms    mrs  flynn   lynx     mj    sky    dmh     dr 
##     70      9      9      8      6      6      5      5      4      4 
##     jj    nyx     bb     bc     bw     cc     dc     pj     tj     cj 
##      4      4      3      3      3      3      3      3      3      2 
##      d    dsh      g      j     jr   jynx     kt      p   rhys    sgt 
##      2      2      2      2      2      2      2      2      2      2 
##   skyy    syd     ty      1      2     30     99     bj   bryn   bynx 
##      2      2      2      1      1      1      1      1      1      1 
##    c80     dj     dw    fly  frytz    fyn   fynn     gb  glynn grrrly 
##      1      1      1      1      1      1      1      1      1      1 
##   grwh     jd     jh     jp     jt    jyn     k2     kc     kk    lbj 
##      1      1      1      1      1      1      1      1      1      1 
##      m     my   prrr    pym      q     qt     rd rhythm    ryn      s 
##      1      1      1      1      1      1      1      1      1      1 
##    sly sphynx  sydny     tk    tyr      v   whbl      z     zz 
##      1      1      1      1      1      1      1      1      1

And for dogs:

# find names that do not have vowels
# for dogs
not_vow <- negated_char_class("aeiou")

sort(
  table(
  str_subset(
    dogs_names_tbl$animals_name_1, pattern = START %R% one_or_more(not_vow) %R% END)
)
, decreasing = TRUE)

## 
##     mr  gypsy     jj     ty  flynn     pj    sky     cj     ms      z 
##     48     15     12      9      8      8      7      6      6      5 
##     bb     dj     dr   fynn     kc     mj      p     tj      b     cc 
##      4      4      4      4      4      4      4      4      3      3 
##    fry      j    nyx    sgt     bj     cy      d     gg     jp     jr 
##      3      3      3      3      2      2      2      2      2      2 
##      k     ky     lj     lt      m     pd    sly      t    tyr   wynn 
##      2      2      2      2      2      2      2      2      2      2 
##     7s    bb8     bg   bryn      c     cb     cr     dc     dw      f 
##      1      1      1      1      1      1      1      1      1      1 
##    fly     gd    grr     hy     jb     jd     jw    jwl     k2     kk 
##      1      1      1      1      1      1      1      1      1      1 
##     lb     lc lynyrd     mc  mcfly    mrs     my     pk     pm     pp 
##      1      1      1      1      1      1      1      1      1      1 
##     pt    pym   r2d2     rj    spy     sy  sylph     tt    twy    twz 
##      1      1      1      1      1      1      1      1      1      1 
##  vãlkl   yzzy 
##      1      1

“Mr” and “Gypsy” seem to be the most popular first names without vowels.
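
The outputs above also contain entries such as “30” and “k2”. The sketch below (hypothetical toy_names vector) shows why: the negated class matches anything that is not one of the five vowels, which includes digits and the letter “y”.

```r
library(stringr)
library(rebus)

# any character that is not a, e, i, o or u
not_vow <- negated_char_class("aeiou")

# one or more non-vowel characters and nothing else
no_vowels <- START %R% one_or_more(not_vow) %R% END

# hypothetical toy vector mixing letters and digits
toy_names <- c("mr", "sky", "k2", "30", "luna")

# everything except "luna" is free of the five vowels
no_vowel_names <- str_subset(toy_names, pattern = no_vowels)
```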

We could also try to find names made up only of digits by using one_or_more(DIGIT).

# find names only with digits
# for cats
sort(
  table(
  str_subset(
    cats_names_tbl$animals_name_1, pattern = START %R% one_or_more(DIGIT) %R% END)
)
, decreasing = TRUE)

## 
##  1  2 30 99 
##  1  1  1  1

And for dogs:

# find names only with digits
# for dogs
sort(
  table(
  str_subset(
    dogs_names_tbl$animals_name_1, pattern = START %R% one_or_more(DIGIT) %R% END)
)
, decreasing = TRUE)

## integer(0)

For dogs, we do not find first names with digits only.

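Additionally, we could look for first names that end in a word character followed by digits, capturing that character with capture(WRD). WRD matches any single word character (a letter, a digit, or an underscore), so names made up of two or more digits also match: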

# capture first name that have words and digits
# for cats
sort(
  table(
  str_subset(
    cats_names_tbl$animals_name_1, pattern = capture(WRD) %R% one_or_more(DIGIT) %R% END)
)
, decreasing = TRUE)

## 
##      cat2  testcat2        30        99       c80      cat1 charlene2 
##         3         2         1         1         1         1         1 
##      deb0     jojo3        k2   number1    oscar2  slasher2   sydney2 
##         1         1         1         1         1         1         1 
##     tont2 
##         1

“Cat2” is the most popular cat name matching this pattern. “30” and “99” also appear because WRD matches digits as well as letters.

Let’s see how this pattern looks for dogs:

# capture first name that have words and digits
# for dogs
sort(
  table(
  str_subset(
    dogs_names_tbl$animals_name_1, pattern = capture(WRD) %R% one_or_more(DIGIT) %R% END)
)
, decreasing = TRUE)

## 
##      bb8  casper2   cayde6 jacques2       k2  number3   penny2     r2d2 
##        1        1        1        1        1        1        1        1

There are 8 dog names that match this pattern, each appearing exactly once.
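
To see what this pattern really selects, here is a sketch on a hypothetical toy_names vector. It also includes one possible alternative (my own suggestion, not part of the original analysis) for names that contain at least one letter and at least one digit.

```r
library(stringr)
library(rebus)

# a word character followed by one or more digits at the end of the name
word_then_digits <- capture(WRD) %R% one_or_more(DIGIT) %R% END

# hypothetical toy vector
toy_names <- c("cat2", "30", "r2d2", "k9 unit", "luna", "7")

# "30" matches because WRD also accepts digits; "7" is too short to match
pattern_names <- str_subset(toy_names, pattern = word_then_digits)

# alternative: at least one lowercase letter and at least one digit
has_both <- str_detect(toy_names, "[a-z]") & str_detect(toy_names, "[0-9]")
letter_and_digit_names <- toy_names[has_both]
```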

We can also easily identify first names with repeated letters. By capturing a lowercase letter and repeating it with the REF1 backreference, we can find names with the same letter three times in a row.

# find first names with three repeated letters
# for cats
sort(
  table(
  str_subset(
    cats_names_tbl$animals_name_1, pattern = capture(LOWER) %R% REF1 %R% REF1)
)
, decreasing = TRUE)

## 
## copurrrnicus     cosettte       grrrly     katgrrrl         prrr 
##            1            1            1            1            1 
##     purrrsia     wafffles 
##            1            1

And for dogs:

# find first names with three repeated letters
# for dogs
sort(
  table(
  str_subset(
    dogs_names_tbl$animals_name_1, pattern = capture(LOWER) %R% REF1 %R% REF1)
)
, decreasing = TRUE)

## 
##  dollly ellliah willlow 
##       1       1       1

We have 7 cat names with the same letter three times in a row. Among dogs, we observe this pattern only 3 times.
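
The backreference idea can be checked on a small made-up vector (toy_names is hypothetical). REF1 refers back to whatever the first capture group matched, so the pattern needs the same letter three times, not just any three letters.

```r
library(stringr)
library(rebus)

# a lowercase letter, then the same letter twice more
triple_letter <- capture(LOWER) %R% REF1 %R% REF1

# hypothetical toy vector: "waffles" and "bella" only have double letters
toy_names <- c("wafffles", "waffles", "prrr", "bella", "zzz top")

# names with the same letter three times in a row
triple_names <- str_subset(toy_names, pattern = triple_letter)
```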

Finally, we could use the exactly function, which anchors a pattern at both the start and the end of the string, to count the names that match it exactly.

# detect a pattern with exactly 
# for dogs
sort(
  table(
  str_subset(
    dogs_names_tbl$animals_name_1, pattern = exactly("lucy"))
)
, decreasing = TRUE)

## lucy 
##  358

Thus, the name “Lucy” was given to 358 dogs in Seattle in recent years.
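
As a quick sketch of the difference between exactly and a plain pattern, consider the made-up vector below (toy_names is hypothetical). My understanding is that exactly is a shortcut for wrapping the pattern in START and END, so longer names that merely contain “lucy” are excluded.

```r
library(stringr)
library(rebus)

# hypothetical toy vector with two longer names containing "lucy"
toy_names <- c("lucy", "lucy lou", "lucyfer", "lucy")

# only names that are "lucy" and nothing else
exact_names <- str_subset(toy_names, pattern = exactly("lucy"))

# a plain pattern also keeps the longer names
loose_names <- str_subset(toy_names, pattern = "lucy")
```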

Conclusion

This post touched on some of the most important functions for working with strings. I hope you have enjoyed seeing how powerful the stringr package can be, especially when paired with the rebus package. Still, there is much more to learn about strings. Keep learning and coding!


Hugo Toscano
