String Matching in R Programming
String matching is an important part of any programming language. It is useful for finding, replacing, and removing strings.
A regular expression is a string containing special symbols and characters that describes a pattern, used to find and extract the needed information from the given data.
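As a brief illustration of what such special symbols do, the sketch below uses base R's grepl() on a made-up vector (the words are illustrative and not taken from any dataset). It assumes three common symbols: ^ anchors a match at the start of a string, $ anchors it at the end, and [ ] lists alternative characters.

```r
# Illustrative vector (hypothetical data)
words <- c("apple", "banana", "cherry", "avocado")

# ^ anchors the pattern at the start of the string
starts_with_a <- grepl("^a", words)

# $ anchors the pattern at the end of the string
ends_with_a <- grepl("a$", words)

# [ ] matches any one of the listed characters
has_b_or_c <- grepl("[bc]", words)

print(starts_with_a)
print(ends_with_a)
print(has_b_or_c)
```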
Operations on String Matching
Finding a String
grep() function: It returns the indices of the vector elements in which the pattern is found.
Syntax: grep(pattern, string, ignore.case=FALSE)
Example: To find the elements that contain ‘man’.
str <- c("Man", "woman","baby", "amman", "happy")
> grep('man', str)
Output:
[1] 2 4
str <- c("Man", "woman","baby", "amman", "happy")
> grep('man', str, ignore.case = TRUE)
Output:
[1] 1 2 4
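When the matching elements themselves are needed rather than their positions, grep() also accepts value = TRUE. The following is a minimal sketch that reuses the vector from the examples above under the assumed name words; the other variable names are illustrative only.

```r
words <- c("Man", "woman", "baby", "amman", "happy")

# Positions of the elements that contain "man" (case-sensitive)
idx <- grep("man", words)

# The matching elements themselves
matched <- grep("man", words, value = TRUE)

# Indexing with the positions gives the same elements
same <- words[idx]

# Case-insensitive variant, again returning the elements
matched_ci <- grep("man", words, ignore.case = TRUE, value = TRUE)

print(matched)
print(matched_ci)
```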
grepl() function: It is a logical function that returns TRUE for each element of the vector in which the specified pattern is found and FALSE for each element in which it is not.
Syntax: grepl(pattern, string, ignore.case=FALSE)
Example: To find whether any instance(s) of ‘the’ are present in each string of the vector.
str <- c("Man", "woman","baby", "amman", "happy")
> grepl('the', str)
Output:
[1] FALSE FALSE FALSE FALSE FALSE
> grepl('wo', str)
Output:
[1] FALSE TRUE FALSE FALSE FALSE
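Because grepl() returns one logical value per element, its result can be used directly to filter a vector or to count matches. This is a small sketch under that assumption, reusing the vector from the examples above under the assumed name words.

```r
words <- c("Man", "woman", "baby", "amman", "happy")

# Logical vector: does each element end in "y"?
ends_in_y <- grepl("y$", words)

# Use the logical vector to keep only the matching elements
kept <- words[ends_in_y]

# TRUE counts as 1, so sum() gives the number of matching elements
n_matches <- sum(ends_in_y)

# any() answers whether the pattern occurs anywhere in the vector
has_the <- any(grepl("the", words))

print(kept)
print(n_matches)
print(has_the)
```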
regexpr() function: It searches each element of the vector for the pattern and returns the position of the first match, or -1 if there is no match. The result also carries a match.length attribute, which is omitted from the outputs shown here.
Syntax: regexpr(pattern, string, ignore.case = FALSE)
Example: To find the position of the first occurrence of ‘he’ in each string of the vector.
str <- c("Hello", "hello", "hi", "ahey", "aahead")
> regexpr('he', str)
Output:
[1] -1 1 -1 2 3
Example: To find whether each string of the vector starts with a vowel.
str <- c("abra", "Ubra", "hunt", "quirky")
> regexpr('^[aeiouAEIOU]', str)
Output:
[1] 1 1 -1 -1
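The positions returned by regexpr() can be combined with base R's regmatches() to pull out the matched text. The sketch below reuses the first example vector under the assumed name greetings; it is one possible way to inspect the result, not the only one.

```r
greetings <- c("Hello", "hello", "hi", "ahey", "aahead")

# First match of "he" in each element (case-sensitive)
m <- regexpr("he", greetings)

# Start positions, with the extra attributes dropped
positions <- as.integer(m)

# Length of each match (-1 where nothing matched)
lengths <- attr(m, "match.length")

# Matched substrings, returned only for elements that had a match
found <- regmatches(greetings, m)

print(positions)
print(lengths)
print(found)
```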
Finding and Replacing Strings in R
sub() and gsub()
To search for and replace a particular string, we can use two functions: sub() and gsub().
sub() replaces only the first occurrence of the pattern and returns the modified string.
gsub() replaces all occurrences of the pattern and returns the modified string.
Syntax: sub(pattern, replaced_string, string)
Syntax: gsub(pattern, replaced_string, string)
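The difference between the two functions can be sketched on a small vector. The vector phrases and its contents are made up for illustration. The fixed = TRUE argument, which is not mentioned in the text, makes the pattern a literal string instead of a regular expression.

```r
# Hypothetical input strings
phrases <- c("heutabhe", "hehe", "abc")

# sub() changes only the first match in each element
first_only <- sub("he", "aa", phrases)

# gsub() changes every match in each element
all_matches <- gsub("he", "aa", phrases)

# In a regular expression "." matches any character ...
regex_dot <- gsub(".", "-", "a.b.c")

# ... while fixed = TRUE treats "." as a literal dot
literal_dot <- gsub(".", "-", "a.b.c", fixed = TRUE)

print(first_only)
print(all_matches)
print(regex_dot)
print(literal_dot)
```

Elements without a match, such as "abc" here, are returned unchanged by both functions.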
Example: To replace the first occurrence of ‘he’ with ‘aa’
str <- "heutabhe"
> sub('he', 'aa', str)
Output:
[1] "aautabhe"
Example: To replace all occurrences of ‘he’ with ‘aa’
str <- "heutabhe"
> gsub('he', 'aa', str)
Output:
[1] "aautabaa"
Finding and Removing Strings in R
str_remove() and str_remove_all()
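str_remove() removes only the first occurrence of the pattern and returns the modified string.
str_remove_all() removes all occurrences of the pattern and returns the modified string.
Both functions come from the stringr package.
Syntax: str_remove(string, pattern)
Syntax: str_remove_all(string, pattern)
These functions have no ignore.case argument; for case-insensitive matching, wrap the pattern as regex(pattern, ignore_case = TRUE).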
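A minimal sketch of how these calls might be used, assuming the stringr package is installed. The vector fruits is hypothetical and mixes upper and lower case on purpose. Case-insensitive matching is done here through stringr's regex(pattern, ignore_case = TRUE) helper.

```r
library(stringr)

# Hypothetical input with mixed case
fruits <- c("Apple", "pear", "banana", "Orange")

# Case-sensitive: the capital "A" and "O" are not treated as vowels
first_lower <- str_remove(fruits, "[aeiou]")

# Case-insensitive: the first vowel of either case is removed
first_any <- str_remove(fruits, regex("[aeiou]", ignore_case = TRUE))

# Remove every vowel of either case
all_any <- str_remove_all(fruits, regex("[aeiou]", ignore_case = TRUE))

print(first_lower)
print(first_any)
print(all_any)
```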
Example: Removing the first occurrence of vowels in the vector
library(stringr)
x <- c("apple", "pear", "banana", "orange")
> str_remove(x, "[aeiou]")
Output:
[1] "pple" "par" "bnana" "range"
Example: Removing all occurrences of vowels in the vector
library(stringr)
x <- c("apple", "pear", "banana", "orange")
> str_remove_all(x, "[aeiou]")
Output:
[1] "ppl" "pr" "bnn" "rng"
More examples: text mining applications on novels
Sense and Sensibility is a novel by Jane Austen, published in 1811. It was published anonymously; By A Lady appears on the title page where the author's name might have been. It tells the story of the Dashwood sisters, Elinor (age 19) and Marianne (age 16½) as they come of age. They have an older half-brother, John, and a younger sister, Margaret (age 13).
Pride and Prejudice is an 1813 romantic novel of manners written by Jane Austen. The novel follows the character development of Elizabeth Bennet, the dynamic protagonist of the book who learns about the repercussions of hasty judgments and comes to appreciate the difference between superficial goodness and actual goodness. Its humour lies in its honest depiction of manners, education, marriage, and money during the Regency era in Great Britain.
Mansfield Park is the third published novel by Jane Austen, first published in 1814 by Thomas Egerton. A second edition was published in 1816 by John Murray, still within Austen's lifetime. The novel did not receive any public reviews until 1821.
Emma, by Jane Austen, is a novel about youthful hubris and romantic misunderstandings. It is set in the fictional country village of Highbury and the surrounding estates of Hartfield, Randalls and Donwell Abbey, and involves the relationships among people from a small number of families. The novel was first published in December 1815.
Northanger Abbey is a novel by Jane Austen, published in 1817.
Persuasion is the last novel fully completed by Jane Austen. It was published at the end of 1817, six months after her death.
The story concerns Anne Elliot, a young Englishwoman of twenty-seven years, whose family moves to lower their expenses and reduce their debt by renting their home to an Admiral and his wife. The wife's brother, Navy Captain Frederick Wentworth, was engaged to Anne in 1806, but the engagement was broken when Anne was "persuaded" by her friends and family to end their relationship. Anne and Captain Wentworth, both single and unattached, meet again after a seven-year separation, setting the scene for many humorous encounters as well as a second, well-considered chance at love and marriage for Anne in her second "bloom".
Libraries used in R for text analytics
library(tidytext)
library(tidyverse)
library(janeaustenr)
library(stringr)
library(wordcloud)
library(reshape2)
library(textdata)
Dictionaries used for sentiments
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")
data(sentiments)
#dataset structure
str(sentiments)
Sample output for afinn lexicon
afinn_lexicon <- get_sentiments("afinn")
head(afinn_lexicon)
Output
## # A tibble: 6 × 2
## word score
## 1 abandon -2
## 2 abandoned -2
Sample output for nrc lexicon
#NRC
nrc_lexicon <- get_sentiments("nrc")
head(nrc_lexicon)
Output:
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
Sample output for bing lexicon
#BING
bing_lexicon <- get_sentiments("bing")
head(bing_lexicon)
Output
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
#Get the Emma book and transform it into a tidy dataset
## row_number() with mutate() creates a column of consecutive numbers. It is useful for
## creating an identification number (an ID variable) and for labeling each observation
## within a grouping variable.
## cumsum() calculates the cumulative sum of the vector passed as argument.
## str_detect() returns a logical value (FALSE or TRUE) for each line, so
## cumsum(str_detect(...)) increases by one at every chapter heading.
tidy_books <- austen_books() %>%
  filter(book == "Emma") %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
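How cumsum(str_detect(...)) turns chapter headings into a running chapter number can be seen on a toy vector. The lines below are invented placeholders, not text from the novel, and the sketch assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical lines standing in for the text column
text_lines <- c("EMMA",
                "CHAPTER I",
                "first line of text",
                "second line of text",
                "Chapter II",
                "more text",
                "a chapter is mentioned here")

# TRUE only where a line starts with "chapter " followed by a digit or roman numeral
is_heading <- str_detect(text_lines,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Each TRUE adds 1, so every line is labeled with the chapter it belongs to
chapter <- cumsum(is_heading)

print(is_heading)
print(chapter)
```

Lines before the first heading get chapter 0, and a line that merely mentions the word "chapter" in the middle does not start a new chapter because of the ^ anchor.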
nrc lexicon associated with joy
#Using the nrc lexicon, keep only the words that are associated with a sentiment of `joy`
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
#Summarize the usage of `joy` words
tidy_books %>%
  semi_join(nrc_joy, by = "word") %>%
  count(word, sort = TRUE)
Output
## # A tibble: 303 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
Semi join and inner join
semi_join(x, y): Returns all rows from x that have matching values in y, keeping only the columns from x. A semi join differs from an inner join because an inner join returns one row of x for each matching row of y, whereas a semi join never duplicates rows of x. This is a filtering join.
anti_join(x, y): Returns all rows from x that have no matching values in y. For example, after re-creating an old table from the source data, anti_join() identifies the records from the original table that do not exist in the updated table.
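The behavior of these joins can be compared on two tiny tables. This is a sketch with invented data, assuming the dplyr package is installed; the table names book_words and lexicon are hypothetical, and the word "good" is listed twice in lexicon on purpose.

```r
library(dplyr)

# Hypothetical word list and lexicon
book_words <- tibble(word = c("good", "hope", "ship"))
lexicon <- tibble(word = c("good", "good", "hope"),
                  sentiment = c("joy", "positive", "joy"))

# Filtering join: rows of book_words with a match, never duplicated
semi <- semi_join(book_words, lexicon, by = "word")

# Mutating join: one row per matching pair, so "good" appears twice
inner <- inner_join(book_words, lexicon, by = "word")

# Filtering join: rows of book_words without any match
anti <- anti_join(book_words, lexicon, by = "word")

print(semi)
print(inner)
print(anti)
```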
Application in R
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
The last step, unnest_tokens(), converts the text to the tidy format with one word per row.
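What unnest_tokens() does to a text column can be seen on a two-line toy table. The sketch assumes the dplyr and tidytext packages are installed; the table toy and its sentences are made up. By default the function splits the text into words, lowercases them and drops punctuation.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line document
toy <- tibble(line = 1:2,
              text = c("Hello, world!", "Tidy text is tidy."))

# One row per word; the other columns (here: line) are kept
tokens <- toy %>%
  unnest_tokens(word, text)

print(tokens)
```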
A side example: understanding mutate()
#install.packages("dplyr")
library(dplyr)
mtcars
## mutate() adds a new column computed from existing columns
mtcars2 <- mutate(mtcars, mtcars_new = mpg/cyl)
head(mtcars2)
Understanding mutate() and group_by()
## Creating identification number to represent 20 individual people
ID <- 1:20
## Creating sex variable (10 males/10 females)
Sex <- rep(c("male", "female"), 10) # rep stands for replicate
## Creating age variable (20-39 year olds)
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33,
39, 28, 26, 29, 33, 22, 35, 23, 26, 36)
## Creating a dependent variable called Score
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173,
0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482)
## Creating a unified dataset that puts together all variables
## tibble is a simple dataframe
data <- tibble(ID, Sex, Age, Score)
## group by sex
data %>%
group_by(Sex) %>%
summarize(m = mean(Score), # calculates the mean
s = sd(Score), # calculates the standard deviation
n = n()) %>% # calculates the total number of observations
ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   Sex        m     s     n
##   <chr>  <dbl> <dbl> <int>
## 1 female 0.282 0.184    10
## 2 male   0.363 0.300    10
## mutate() and group_by()
data %>%
group_by(Sex) %>%
mutate(m = mean(Score)) %>% # calculates mean score by Sex
ungroup()
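## Net sentiment (positive minus negative word counts) per 100-line section of each book
janeaustensentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%     # keep words found in the bing lexicon
  count(book, index = linenumber %/% 100, sentiment) %>%  # count words per section and sentiment
  spread(sentiment, n, fill = 0) %>%                      # one column per sentiment
  mutate(sentiment = positive - negative)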
head(janeaustensentiment)
## Source: local data frame [6 x 5]
## Groups: book, index [6]
##
## book index negative positive sentiment
## <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 Sense & Sensibility 0 20 47 27
## 2 Sense & Sensibility 1 22 54 32
## 3 Sense & Sensibility 2 16 35 19
## 4 Sense & Sensibility 3 20 45 25
## 5 Sense & Sensibility 4 21 63 42
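The index column in the pipeline above comes from integer division, which groups consecutive lines into sections of 100. A short base R sketch of that step follows; the line numbers are arbitrary, and the counts are the first two rows of the output shown above.

```r
# Arbitrary line numbers to show how sections are formed
linenumber <- c(1, 50, 99, 100, 199, 200)

# Integer division: lines 0-99 fall in section 0, 100-199 in section 1, ...
index <- linenumber %/% 100

# Net sentiment for the first two sections of the output above
positive <- c(47, 54)
negative <- c(20, 22)
sentiment <- positive - negative

print(index)
print(sentiment)
```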
Plot Sentiment
ggplot(data = janeaustensentiment, mapping = aes(x = index, y = sentiment, fill = book)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(facets = ~ book, ncol = 2, scales = "free_x")