Data Analysis in R - Amazon book data analysis

Book Data Analysis

Explore the Book Details csv file from the Kaggle Amazon Book Reviews dataset.

Purpose: Clean data and conduct basic analysis exploring different packages.

Citations are at the end of the file.

Summary.

The data set contains 212404 observations in 10 columns, each listing information about a published book. A large share of the values (approx. 24%) is missing, and the dates are in a number of different formats. An attempt was made to normalise the dates, but it did not fully succeed because of the multiple different formats.
This is to be re-examined at a later date.
There are multiple entries for some book titles due to spelling variations and also the fact that some books have been published multiple times.
The authors and categories columns contain lists that were unnested as part of the analysis.

Load the environment.

library(readr)
library(lubridate)
library(knitr)
library(rmarkdown)
library(tidyverse)
library(janitor)
library(naniar)
library(stringdist)
library(tidytext)
library(kableExtra)

Import the books_data.csv file. Show how many observations are in the df.

book_names <- read_csv("books_data.csv", show_col_types = FALSE)
nrow(book_names)

## [1] 212404

glimpse(book_names)

## Rows: 212,404
## Columns: 10
## $ Title         <chr> "Its Only Art If Its Well Hung!", "Dr. Seuss: American I…
## $ description   <chr> NA, "Philip Nel takes a fascinating look into the key as…
## $ authors       <chr> "['Julie Strain']", "['Philip Nel']", "['David R. Ray']"…
## $ image         <chr> "http://books.google.com/books/content?id=DykPAAAACAAJ&p…
## $ previewLink   <chr> "http://books.google.nl/books?id=DykPAAAACAAJ&dq=Its+Onl…
## $ publisher     <chr> NA, "A&C Black", NA, "iUniverse", NA, "Wm. B. Eerdmans P…
## $ publishedDate <chr> "1996", "2005-01-01", "2000", "2005-02", "2003-03-01", "…
## $ infoLink      <chr> "http://books.google.nl/books?id=DykPAAAACAAJ&dq=Its+Onl…
## $ categories    <chr> "['Comics & Graphic Novels']", "['Biography & Autobiogra…
## $ ratingsCount  <dbl> NA, NA, NA, NA, NA, 5, NA, 3, NA, NA, NA, NA, NA, NA, NA…

Some of the columns (authors and categories) appear to contain lists stored as strings in square brackets [].

View the first six observations.

head(book_names)

## # A tibble: 6 × 10
##   Title   description authors image previewLink publisher publishedDate infoLink
##   <chr>   <chr>       <chr>   <chr> <chr>       <chr>     <chr>         <chr>   
## 1 Its On…  <NA>       ['Juli… http… http://boo… <NA>      1996          http://…
## 2 Dr. Se… "Philip Ne… ['Phil… http… http://boo… A&C Black 2005-01-01    http://…
## 3 Wonder… "This reso… ['Davi… http… http://boo… <NA>      2000          http://…
## 4 Whispe… "Julia Tho… ['Vero… http… http://boo… iUniverse 2005-02       http://…
## 5 Nation…  <NA>       ['Edwa… <NA>  http://boo… <NA>      2003-03-01    http://…
## 6 The Ch… "In The Ch… ['Ever… http… http://boo… Wm. B. E… 1996          http://…
## # ℹ 2 more variables: categories <chr>, ratingsCount <dbl>

Summarise the number of missing values in each column using the naniar package.

#use the tidyverse to summarise missing values for each column.
#summary_missing_book_names <- book_names %>%
#  summarize(across(everything(), ~ sum(is.na(.))))
#summary_missing_book_names

#use the naniar package to summarise missing values for each column
miss_var_summary(book_names)

## # A tibble: 10 × 3
##    variable      n_miss  pct_miss
##    <chr>          <int>     <num>
##  1 ratingsCount  162652 76.6     
##  2 publisher      75886 35.7     
##  3 description    68442 32.2     
##  4 image          52075 24.5     
##  5 categories     41199 19.4     
##  6 authors        31413 14.8     
##  7 publishedDate  25305 11.9     
##  8 previewLink    23836 11.2     
##  9 infoLink       23836 11.2     
## 10 Title              1  0.000471

Visualise the missing data using the naniar package.

gg_miss_var(book_names)

That's a lot of missing values. Let's calculate the total percentage of missing values in the data.

# using base R 
#percentage_of_missing_values <- sum(is.na(book_names)) / prod(dim(book_names)) * 100
#print(percentage_of_missing_values)

#using the naniar package to calculate the total percentage of missing values in the df
print(total_pct_miss <- pct_miss(book_names))

## [1] 23.7587

The data contains 212404 records in 10 columns. 23.8% of the values are missing.

Let's clean the dataset: create a new df and trim whitespace from the character columns.

book_names_clean <- book_names %>%
  mutate(across(where(is.character), str_trim))   # trim whitespace in character columns only

Find duplicate values based on book title using the dplyr package.

duplicate_titles <- book_names_clean %>%
     dplyr::group_by(Title) %>%
     dplyr::filter(n() > 1) %>%
     ungroup()

print(duplicate_titles)

## # A tibble: 0 × 10
## # ℹ 10 variables: Title <chr>, description <chr>, authors <chr>, image <chr>,
## #   previewLink <chr>, publisher <chr>, publishedDate <chr>, infoLink <chr>,
## #   categories <chr>, ratingsCount <dbl>

dplyr found no duplicates because the filter only matches titles that are identical character for character.

Sort by book title to quickly determine duplicate or similar name titles.

books_title_sorted <- book_names_clean %>%
  select(-description, -categories, -image, -previewLink, -infoLink, -ratingsCount) %>%      #remove unnecessary columns to make the results easier to read
  arrange(Title) %>%                                        # Arrange by title
  slice_head(n = 10)                                        # Get the first 10 rows for clarity

kable(books_title_sorted)  # Display the sorted table

Title	authors	publisher	publishedDate
” Film technique, ” and, ” Film acting ”	[‘V. I. Pudovkin’]	Sims Press	2008-11
” We’ll Always Have Paris”: The Definitive Guide to Great Lines from the Movies	[‘Robert A. Nowlan’, ‘Gwendolyn Wright Nowlan’]	Perennial	1994
“… And Poetry is Born …” Russian Classical Poetry	[‘Aleksandr Sergeevich Pushkin’]	NA	1984
“A Titanic hero” Thomas Andrews, shipbuilder	[‘Shan F. Bullock’]	NA	1913
“A Truthful Impression of the Country”: British and American Travel Writing in China, 1880-1949	[‘Nicholas J. Clifford’, ‘Nicholas Rowland Clifford’, ‘Nick Clifford’]	University of Michigan Press	2001
“A careless word, a needless sinking”: A history of the staggering losses suffered by the U.S. Merchant Marine, both in ships and personnel during World War II	[‘Arthur R. Moore’]	NA	1983
“A careless word– a needless sinking”: A history of the staggering losses suffered by the U.S. Merchant Marine, both in ships and personnel during World War II	[‘Arthur R. Moore’]	NA	1983
“A giant in the earth,”: A biography of Dr. J. B. Boddie,	[‘Charles Emerson Boddie’]	NA	1944
“A kind of life”: Conversations in the combat zone	[‘Roswell Angier’]	NA	1976-01-01
“A parallel”, the basis of the Book of Mormon: B.H. Roberts’ “Parallel” of the Book of Mormon to View of the Hebrews	[‘Hal Hougey’, ‘Brigham Henry Roberts’]	NA	1963

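It is clear from the results that the data contains near-duplicate titles: rows 6 and 7 differ only in punctuation (“,” versus “–”), which is why the exact-match check above missed them.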
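One way to surface such near-duplicates is to compare titles on a normalised key rather than on the raw string. The sketch below uses base R only and three illustrative titles; `normalise_title()` is a hypothetical helper name, not part of the analysis.

```r
# Hypothetical helper: build a comparison key from a raw title
normalise_title <- function(x) {
  x <- tolower(x)                      # ignore case
  x <- gsub("[[:punct:]]", "", x)      # drop punctuation
  x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
  trimws(x)
}

# Illustrative titles only
titles <- c("A Pocket Full of Rye", "A pocket full of rye;", "13 For Luck")
key <- normalise_title(titles)

# Flag every title whose key occurs more than once
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
data.frame(title = titles, key = key, is_dup = is_dup)
```

This only catches titles that become identical after normalisation; spelling variants such as "pocketful" versus "pocket full" still need a string distance.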

Initial inspection of the book_names df shows a number of issues.

1. Four columns can be removed (image, previewLink, infoLink and ratingsCount).
2. Column names to be standardized to lower case.
3. Blank values to be converted to NA.
4. Date values to be normalised to year.
5. Square brackets (list values) and quotes to be removed from the authors and categories columns.
6. Duplicate values should be investigated.

Remove the unwanted columns.

book_names_clean <- book_names_clean %>% 
  select(-image, -previewLink, -infoLink, -ratingsCount)

Standardize column names to lower case.

#tolower(colnames(book_names_clean))   #This didn't work: it only returns the lower-cased names and does not rename the columns
book_names_clean <- book_names_clean %>% 
  rename_with(tolower)
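As a small illustration of the renaming step, `tolower()` returns a new character vector and leaves the data frame untouched unless the result is assigned back. The sketch below uses base R and a toy data frame, which is an assumption for illustration only.

```r
# Toy data frame with mixed-case column names
df <- data.frame(Title = "x", publishedDate = "1996")

lowered <- tolower(names(df))   # returns the new names; df itself is unchanged
names(df)                       # still "Title" "publishedDate"

names(df) <- lowered            # assigning back is what renames the columns
names(df)
```

`rename_with(tolower)` does the same assignment inside a pipe.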

Fill any blank cells with NA. Use the naniar package to get a summary of missing values.

book_names_clean <- book_names_clean %>%
  mutate(across(where(is.character), ~na_if(., "")))
              
miss_var_summary(book_names_clean)

## # A tibble: 6 × 3
##   variable      n_miss  pct_miss
##   <chr>          <int>     <num>
## 1 publisher      75886 35.7     
## 2 description    68442 32.2     
## 3 categories     41199 19.4     
## 4 authors        31413 14.8     
## 5 publisheddate  25305 11.9     
## 6 title              1  0.000471

The data set contains lists inside the authors and categories columns. Create a custom function that will remove the brackets and quotes from these columns and split the cleaned string into individual items.

# Function to extract items from list-like strings
extract_items <- function(x) {
  if (is.na(x)) {
    return(NA_character_)
  } else {
    x <- gsub("\\[|\\]|'", "", x)       # Remove brackets and quotes
    return(strsplit(x, ", ")[[1]])      # Split into individual items
  }
}

# Apply the function to authors and categories using the purrr package map() function
book_names_clean <- book_names_clean %>%
  mutate(authors_list = map(authors, extract_items),
         categories_list = map(categories, extract_items))

# Unnest these lists to get each author/category on a separate row. We will use these later.
book_names_authors_unnested <- book_names_clean %>% 
 unnest(authors_list)

book_names_categories_unnested <- book_names_clean %>%
  unnest(categories_list)
  

#remove the original columns.
book_names_clean <- book_names_clean %>%
  select(-authors, -categories)
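The extract_items() function above strips every apostrophe and splits on ", ", so a name containing an apostrophe loses it and an item that itself contains a comma is split in two. A possible alternative is to pull out each quoted item directly. This is a minimal base R sketch; `extract_items_quoted()` is a hypothetical name, and it assumes that items containing an apostrophe are wrapped in double quotes in the raw data.

```r
# Hypothetical alternative: return the text inside each '...' or "..." pair
extract_items_quoted <- function(x) {
  if (is.na(x)) return(NA_character_)
  # Match double-quoted items first, then single-quoted items
  items <- regmatches(x, gregexpr("\"[^\"]*\"|'[^']*'", x))[[1]]
  if (length(items) == 0) return(NA_character_)
  substr(items, 2, nchar(items) - 1)   # drop the surrounding quotes
}

a <- extract_items_quoted("[\"Louis L'Amour\"]")
b <- extract_items_quoted("['Hal Hougey', 'Brigham Henry Roberts']")
c3 <- extract_items_quoted("['Body, Mind & Spirit']")
```

A single-quoted item that contains an apostrophe would still be split incorrectly, so the result should be spot-checked against the raw strings.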

Change the publisheddate column to a numeric year (yyyy), handling multiple date formats and NA values.

book_names_clean <- book_names_clean %>%
  mutate(publisheddate = case_when(
    is.na(publisheddate) ~ NA_real_,          # Keep NA values as is
    grepl("^\\d{4}-\\d{2}-\\d{2}$", publisheddate) ~ as.numeric(format(as.Date(publisheddate, "%Y-%m-%d"), "%Y")),
    grepl("^\\d{4}-\\d{2}$", publisheddate) ~ as.numeric(format(as.Date(publisheddate, "%Y-%m"), "%Y")),
    grepl("^\\d{4}$", publisheddate) ~ as.numeric(publisheddate),
    TRUE ~ NA_real_  # Handle any other unexpected cases
  ))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `publisheddate = case_when(...)`.
## Caused by warning:
## ! NAs introduced by coercion

#using lubridate this produced more errors
#book_names <- book_names %>%
#  mutate(publisheddate = case_when(
#    grepl("^\\d{4}-\\d{2}-\\d{2}$", publisheddate) ~ year(ymd(publisheddate)),
#    grepl("^\\d{4}-\\d{2}$", publisheddate) ~ year(ym(publisheddate)),
#    grepl("^\\d{4}$", publisheddate) ~ as.numeric(publisheddate),
#    TRUE ~ NA_real_
#  ))
# 96713 failed to parse using lubridate

Count the number of NA values in publisheddate

sum(is.na(book_names_clean$publisheddate))

## [1] 36612

mean(is.na(book_names_clean$publisheddate)) * 100

## [1] 17.237

The share of NA values has increased from 11.9% to 17.2%. The fallback clause assigns NA to every unhandled format, and as.Date() with the format "%Y-%m" returns NA because a year-month string has no day of the month, so values such as 2005-02 are likely lost as well. Further investigation is recommended.
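Since only the year is needed, one option is to skip date parsing and take the leading four digits of each value. The sketch below is base R; `extract_year()` is a hypothetical helper, and the sample values are formats seen in this document.

```r
# Hypothetical helper: keep the leading four-digit year, NA otherwise
extract_year <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x) & grepl("^[0-9]{4}", x)    # value starts with four digits
  out[ok] <- as.numeric(substr(x[ok], 1, 4))
  out
}

dates <- c("1996", "2005-01-01", "2005-02", "19??", "1973*", NA)
years <- extract_year(dates)

# For comparison: a year-month string has no day, so this does not parse
ym_parsed <- as.Date("2005-02", "%Y-%m")
```

Note that this keeps "1973*" as 1973 and drops "19??"; whether uncertain dates like these should be kept is a judgment call.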

Check the original values in the book_names df to check for invalid date formats.

unique_formats <- unique(book_names$publishedDate)
#view(unique_formats)

A few non-standard formats were noted in the original data, such as “19??” and “1973*”.

Explore the first and last dates in the publisheddate column.

first_date <- min(book_names_clean$publisheddate, na.rm = TRUE)
last_date <- max(book_names_clean$publisheddate, na.rm = TRUE)

print(first_date)

## [1] 1016

print(last_date)

## [1] 2030

The dates seem to be out of the expected range. We can filter to see which books are listed for these dates. We will de-select the description column as it is not required.

unusual_dates <- book_names_clean %>%
  dplyr::filter(publisheddate %in% c(1016, 2030)) %>% 
  select(-description)

kable(unusual_dates)

title	publisher	publisheddate	authors_list	categories_list
MANSIONS OF DARKNESS	Valancourt Books	1016	Archie Roy	Fiction
A Wealth of Wisdom: Legendary African American Elders Speak	Simon and Schuster	2030	Camille Cosby , Rene Poussaint	Biography & Autobiography

A quick internet search shows that the published date is wrong on at least these 2 records.

Bonus
Tried to compare the book titles to find similarly named books using the stringdist package. This did not work as it required too much memory (168GB): the full comparison matrix has about 45 billion cells, of which roughly 22.6 billion are unique pairs. Considered using block processing based on other fields; this is done in a later code chunk.

# similarity_scores <- stringdist::stringdistmatrix(book_names_clean$title, method = "jw")
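The memory figure can be checked with a little arithmetic. This sketch assumes each distance is stored as an 8-byte double and that only the unique pairs are kept, as in a dist object.

```r
n <- 212404                       # number of titles

cells <- n^2                      # full n x n matrix
pairs <- n * (n - 1) / 2          # unique pairs only
gib   <- pairs * 8 / 2^30         # 8 bytes per double, converted to GiB

round(cells / 1e9, 1)             # billions of cells in the full matrix
round(pairs / 1e9, 1)             # billions of unique pairs
round(gib)                        # approximate memory in GiB
```

The cost grows with the square of the number of titles, which is why comparing within small blocks (for example one author at a time) is so much cheaper.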

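Find the total number of distinct author entries in the df. Because authors_list is a list column, n_distinct() counts distinct author combinations, so a co-author group is counted separately from its individual members.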

total_authors <- book_names_clean %>%
  dplyr::filter(!is.na(authors_list)) %>%
  summarise(total_authors = n_distinct(authors_list))

kable(total_authors)

total_authors
127275
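The difference between counting distinct author lists and distinct individual authors can be seen on a toy list column. This is a base R sketch with made-up rows, not the real data.

```r
# Toy list column: one co-authored book and two single-author books
authors_list <- list(
  c("Camille Cosby", "Rene Poussaint"),
  "Camille Cosby",
  "Rene Poussaint"
)

n_combinations <- length(unique(authors_list))          # distinct author lists
n_individuals  <- length(unique(unlist(authors_list)))  # distinct single authors
```

On the real data, the individual count could be taken from the unnested df, for example with n_distinct(authors_list) on book_names_authors_unnested.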

Find the Author with the most book releases.

author_ranked <- book_names_clean %>%
  dplyr::filter(!is.na(authors_list)) %>% 
  group_by(authors_list) %>%
  summarise(total_books = n()) %>%
  arrange(desc(total_books))
  
author_ranked %>% 
  head(10) %>%      #use the head() function together with the kable() to display the top 10
  kable() %>% 
  kable_styling(full_width = TRUE)

authors_list	total_books
Rose Arny	236
William Shakespeare	191
Library of Congress. Copyright Office	178
Agatha Christie	142
Erle Stanley Gardner	124
“Louis LAmour”	123
Charles Dickens	89
Edgar Rice Burroughs	85
Zane Grey	75
Rudyard Kipling	75

#print(head(author_ranked, 10))

I had never heard of “Rose Arny” but an internet search revealed, “her prominence in the data could be due to her role in the publishing industry rather than as an author of original works. She was the editor of Publishers Weekly, a major trade publication in the book industry, and her name often appeared in connection with book reviews, announcements, and other publishing-related content.”

Some elements in the authors_list column include lists (e.g. $ : chr [1:3] “K. H. Scheer” “Wendayne Ackerman” “F. J. Ackerman”) rather than a simple character string. To get around this we mutate the column, collapsing list elements into a single string. Then use ggplot to produce a bar graph.

author_ranked <- author_ranked %>%
  mutate(authors_list = sapply(authors_list, paste, collapse = " "))  # Collapse list elements into a single string

top_10_authors <- author_ranked %>% 
  head(10)

ggplot(top_10_authors, aes(y = reorder(authors_list, -total_books), x = total_books)) +
  geom_bar(stat = "identity", fill = "#4183C4") +
  labs(
    title = "Top 10 Authors by books published", x = "Total published", y = "Author")

Wanting to understand more about “Rose Arny”, I list 10 books that name her as author.

rose_arny_books <- book_names_authors_unnested %>%
  dplyr::filter(authors_list == "Rose Arny") %>% 
  select(-description, -categories) %>% 
  arrange((title)) %>% 
  slice_head(n = 10)      # display only 10 for clarity
kable(rose_arny_books)

title	authors	publisher	publisheddate	authors_list	categories_list
01443 DEVELOPING SKILLS IN ALGEBRA ONE, BOOK C	[‘Rose Arny’]	NA	1995	Rose Arny	American literature
1,000 Points of Light: The Public Remains in the Dark (Oswald’s Closest Friend: The George De Mohrenschildt Story, Volume 1)	[‘Rose Arny’]	NA	1999	Rose Arny	American literature
36 propositions for a home/36 modeles pour une maison (English and French Edition)	[‘Rose Arny’]	NA	1998	Rose Arny	American literature
A Bride for Crimson Falls (Silhouette Desire Ser, No. 1076)	[‘Rose Arny’]	NA	1997	Rose Arny	American literature
A Lady’s Point of View (Harlequin Regency Romance, No 14)	[‘Rose Arny’]	NA	2001-06	Rose Arny	American literature
A Man I Used to Know: Love that Man! (Harlequin Superromance No. 831)	[‘Rose Arny’]	NA	1999-04	Rose Arny	American literature
A Man Like Michael (Desire Ser.)	[‘Rose Arny’]	NA	1997	Rose Arny	American literature
A Mother’s Secrets (Randolph Family Ties, Book 3) (Harlequin Intrigue Series #577))	[‘Rose Arny’]	NA	2000	Rose Arny	American literature
A Perfect Pair (Silhouette Special Edition No. 1590)	[‘Rose Arny’]	NA	2003	Rose Arny	American literature
A Rosey Little Christmas / Jingle Bell Bride (Harlequin Duets, No. 64)	[‘Rose Arny’]	NA	2002	Rose Arny	American literature

The wide range of genres suggests there are errors in the data. The titles all come under the single category “American literature”, which reaffirms that “Rose Arny” appears through her role in the publishing industry.

Check the titles under the author “Agatha Christie” for duplicates.

agatha_christie_books <- book_names_authors_unnested %>%
  dplyr::filter(authors_list == "Agatha Christie") %>% 
  select(-description, -categories) %>% 
  arrange((title)) %>% 
  slice_head(n = 30)                # limit to top 30 for clarity

agatha_christie_books %>%         
  head(10) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)

title	authors	publisher	publisheddate	authors_list	categories_list
13 CLUES FOR MISS MARPLE	[‘Agatha Christie’]	NA	1975	Agatha Christie	Detective and mystery stories
13 For Luck	[‘Agatha Christie’]	Dell Publishing Company	1961	Agatha Christie	Fiction
13 clues for Miss Marple;: A collection of mystery stories,	[‘Agatha Christie’]	NA	1966	Agatha Christie	Detective and mystery stories, English
4.50 from Paddington (Agatha Christie Collection S.)	[‘Agatha Christie’]	NA	2000	Agatha Christie	England
A Murder Is Announced	[‘Agatha Christie’]	HarperCollins	2016-12-29	Agatha Christie	NA
A Murder Is Announced (Miss Marple Mysteries)	[‘Agatha Christie’]	HarperCollins	2016-12-29	Agatha Christie	NA
A Pocket Full of Rye	[‘Agatha Christie’]	Signet Book	2000	Agatha Christie	Fiction
A Pocketful of Rye	[‘Agatha Christie’]	NA	2006	Agatha Christie	Detective and mystery stories
A Star over Bethlehem and Other Stories	[‘Agatha Christie’]	William Morrow Paperbacks	2011-10-25	Agatha Christie	Fiction
A pocket full of rye	[‘Agatha Christie’]	Harper Collins	2011-06-14	Agatha Christie	Fiction

The results show that some of the 142 book titles are republished works of the same title. They may also be listed under different categories. The title column should be normalised by converting to lower case and removing unnecessary characters and spaces.

Normalise the title column of agatha_christie_books with the stringr package.

agatha_christie_books <- agatha_christie_books %>%
  mutate(title = str_replace_all(title, "[[:punct:]]", "")) %>%  # Remove punctuation
  mutate(title = str_to_lower(title)) %>%                       # Convert to lowercase
  mutate(title = str_trim(title)) %>%                           # Trim whitespace
  mutate(title = str_replace_all(title, "\\s+", " "))           # Replace extra spaces

Compare the sliced book titles to find similarly named books using the stringdist package, this time with a much smaller data set than we tried before. Note that stringdistmatrix() returns distances, so lower values mean more similar titles. The fig.width and fig.height values for the code chunk were adjusted for clarity. The pheatmap package was used instead of the base R heatmap() function because it gave better customization options for text size.

similarity_scores <- stringdist::stringdistmatrix(agatha_christie_books$title, method = "jw")

similarity_matrix <- as.matrix(similarity_scores)

rownames(similarity_matrix) <- agatha_christie_books$title
colnames(similarity_matrix) <- agatha_christie_books$title

#print(similarity_scores)

# Create a heatmap to identify similar names
library(pheatmap)

## Warning: package 'pheatmap' was built under R version 4.4.3

pheatmap(similarity_matrix,
         main = "Jaro-Winkler Distances (lower = more similar)",
         fontsize_row = 7, fontsize_col = 7)  # Adjust font sizes

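Find the closest non-identical match among the 30 sliced titles. Filtering on values greater than 0 excludes identical titles, and each pair appears twice because the matrix is symmetric.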

which(similarity_matrix == min(similarity_matrix[similarity_matrix > 0]), arr.ind = TRUE)

##                      row col
## a pocketful of rye     8   7
## a pocket full of rye   7   8
## a pocket full of rye  10   8
## a pocketful of rye     8  10
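To list each closest pair only once, the lower triangle of the matrix can be blanked out before searching. The sketch below is self-contained base R and swaps in Levenshtein edit distance via adist() for Jaro-Winkler so that it needs no packages; the four titles are illustrative.

```r
titles <- c("a pocket full of rye", "a pocketful of rye",
            "a pocket full of rye", "13 for luck")

d <- adist(titles)                      # Levenshtein distance matrix
dimnames(d) <- list(titles, titles)

d[lower.tri(d, diag = TRUE)] <- NA      # keep each pair once (upper triangle)
d[which(d == 0)] <- NA                  # ignore exact duplicates

# Row/column positions of the closest non-identical pairs
closest <- which(d == min(d, na.rm = TRUE), arr.ind = TRUE)
closest
```

The same masking would work on the Jaro-Winkler matrix from stringdistmatrix() after converting it with as.matrix().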

Let's count and sort the categories to see how many books are in each category. We know from our exploration above that 19.4% of the categories_list values are NA, so we will remove them.

category_counts <- book_names_clean %>%
  dplyr::filter(!is.na(categories_list)) %>% 
  group_by(categories_list) %>%
  summarise(total_books = n()) %>%
  arrange(desc(total_books)) %>% 
  slice_head(n = 10)
# View the results using kable to ensure categories_list names are printed.
kable(category_counts)

categories_list	total_books
Fiction	23419
Religion	9459
History	9330
Juvenile Fiction	6643
Biography & Autobiography	6324
Business & Economics	5625
Computers	4312
Social Science	3834
Juvenile Nonfiction	3446
Science	2623

Some elements of the categories_list in the category_counts df contain lists (e.g. chr [1:2] “Body” “Mind & Spirit”, likely a single category that was split at its comma) rather than a simple character string. To get around this we mutate the column, collapsing list elements into a single string. Then use ggplot to create a bar graph.

category_counts <- category_counts %>%
  mutate(categories_list = sapply(categories_list, paste, collapse = " "))  # Collapse list elements into a single string


ggplot(category_counts, aes(y = reorder(categories_list, -total_books), x = total_books)) +
  geom_bar(stat = "identity", fill = "#4183C4") +
  labs(
    title = "Top 10 Categories by Total Books published", x = "Total published", y = "Category")

Fiction is by far the most common category in the dataset.

Find the top 10 Fiction Authors.

top_10_fiction_writers <- book_names_categories_unnested %>% 
  dplyr::filter(categories_list == "Fiction") %>% 
  group_by(authors_list) %>% 
  summarise(total_books = n()) %>% 
  arrange(desc(total_books)) %>% 
  slice_head(n = 10)

kable(top_10_fiction_writers)

authors_list	total_books
“Louis LAmour”	105
Agatha Christie	76
Nora Roberts	51
Edgar Rice Burroughs	48
Georgette Heyer	43
Zane Grey	40
Stephen King	38
Rex Stout	38
William W. Johnstone	37
Danielle Steel	36

A quick internet search revealed: “Louis L’Amour was a prolific American author best known for his Western novels, which he often referred to as ‘frontier stories’. His most famous works include Hondo, Last of the Breed, and the Sackett series, which remains a cornerstone of Western literature.”

Explore which year had the most books published. We know from our previous exploration that 17.2% of the published date values are NA values so we will ignore these.

book_year_count <- book_names_clean %>%
  dplyr::filter(!is.na(publisheddate)) %>% 
  group_by(publisheddate) %>%
  summarize(count = n()) %>% 
  arrange(desc(count)) %>% 
  slice_head(n = 50)
  
book_year_count %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(position = "left")

publisheddate	count
2004	7337
2003	7154
2005	6992
2002	6940
2000	6799
2001	6419
1999	6173
1998	5582
2012	5247
2013	5001

#print(book_year_count)

Create a bar graph showing the most popular years for books published.

ggplot(book_year_count, aes(x = publisheddate, y = count)) +
  geom_bar(stat = "identity", fill = "#4183C4") +
  labs(title = "Top 50 years for number of books published")

Further research revealed the global financial crisis of 2008 had a significant effect on consumer spending which also had an effect on book publishing. Publishers were less willing to invest in new authors and new titles. The rise of digital media was also influencing consumer behavior with people turning online for entertainment and information.

Find the most common words in the description column. Start by breaking the description into individual words and counting them. Then compare that list to stop_words in the tidytext package to remove common words.

word_counts <- book_names_clean %>%
  
  dplyr::filter(!is.na(description)) %>%            # filter out rows with NA in the description
  unnest_tokens(word, description) %>%              # break descriptions into individual words (lowercased by default)
  mutate(word = str_to_lower(word)) %>%             # explicit lowercase, kept as a safeguard
  count(word, sort = TRUE)                          # count occurrences of each word and sort by frequency


# use the tidytext package to remove common stop words ("the", "and", etc.)
word_counts <- word_counts %>%
  anti_join(stop_words, by = "word")                # this will compare words in the word column to the stop_words list

#top_10_words <- word_counts %>%                # create a list of top 10 common words
#  slice_head(n = 10)
#print(top_10_words)

word_counts %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)

word	n
book	62910
life	37923
world	30768
time	21432
history	21220
story	17688
guide	17099
author	16805
edition	16463
american	16353
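The tokenise, filter and count steps can be illustrated without any packages. This is a base R sketch on made-up descriptions, and `stop_list` is a tiny hypothetical stand-in for the stop_words table from tidytext.

```r
# Toy descriptions, including a missing value
descriptions <- c("The story of a detective and a murder",
                  "A murder mystery", NA)
stop_list <- c("the", "of", "a", "and")   # hypothetical mini stop list

# Drop NAs, lowercase, and split on anything that is not a letter or apostrophe
words <- unlist(strsplit(tolower(descriptions[!is.na(descriptions)]), "[^a-z']+"))

# Remove empty tokens and stop words, then count and sort
words  <- words[nzchar(words) & !words %in% stop_list]
counts <- sort(table(words), decreasing = TRUE)
head(counts, 3)
```

In the analysis itself, anti_join() plays the role of the `%in%` filter and count(sort = TRUE) replaces table() and sort().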

Find the top 10 words used in the description column to describe fiction books using the book_names_categories_unnested df we created previously. Using unnest_tokens() in the earlier code chunk that created book_names_categories_unnested caused errors. By filtering before tokenising here, we greatly reduce the computing power required and avoid unnecessary computations. Note that str_detect() matches every category containing “Fiction”, such as “Juvenile Fiction”, not just the “Fiction” category itself.

fiction_words <- book_names_categories_unnested %>%
  dplyr::filter(str_detect(categories_list, "Fiction"), !is.na(description)) %>%  # filter for Fiction and remove NA descriptions

  unnest_tokens(word, description) %>%              # break descriptions into individual words
  count(word, sort = TRUE) %>%                      # Count occurrences of each word and sort
  anti_join(stop_words, by = "word")                # this will compare words in the word column to the stop_words list

fiction_words %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)

word	n
life	9353
world	7345
love	6839
book	6499
story	6279
time	5141
de	5097
family	5063
author	4661
stories	3790

#print(head(fiction_words, 10))

Find the top 10 words used in the description column to describe Agatha Christie's books using the book_names_authors_unnested df we created earlier. By filtering before tokenising and converting to lower case here, instead of when we created the book_names_authors_unnested df, we greatly reduce the computing power required and avoid unnecessary computations.

agatha_christie_words <- book_names_authors_unnested %>% 
  dplyr::filter(str_detect(authors_list, "Agatha Christie"), !is.na(description)) %>% 
  unnest_tokens(word, description) %>%              # break descriptions into individual words
  mutate(word = str_to_lower(word)) %>%             # Convert all words to lowercase
  count(word, sort = TRUE) %>%                      # Count occurrences of each word and sort
  anti_join(stop_words, by = "word")                # remove words in the word column that appear in the stop_words list

agatha_christie_words %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)

word	n
poirot	82
agatha	60
hercule	60
murder	53
mystery	52
miss	37
marple	33
christie	30
murderer	30
detective	27

#print(head(agatha_christie_words, 10))

An internet search revealed “Hercule Poirot is a fictional Belgian detective created by British writer Agatha Christie. Poirot is Christie’s most famous and longest-running character, appearing in 33 novels, two plays, and 51 short stories published between 1920 and 1975”

Find the top 10 words used to describe William Shakespeare's books.

Shakespeare_words <- book_names_authors_unnested %>% 
  dplyr::filter(str_detect(authors_list, "William Shakespeare"), !is.na(description)) %>% 
  unnest_tokens(word, description) %>%              # break descriptions into individual words
  mutate(word = str_to_lower(word)) %>%             # Convert all words to lowercase
  count(word, sort = TRUE) %>%                      # Count occurrences of each word and sort
  anti_join(stop_words, by = "word")                # remove words in the word column that appear in the stop_words list

Shakespeare_words %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)

word	n
shakespeare	191
play	124
shakespeare’s	92
edition	84
text	80
plays	56
folger	55
notes	52
introduction	47
series	46

# print(head(Shakespeare_words, 10))

An internet search revealed “Folger is most likely referring to the Folger Shakespeare Library, a renowned institution dedicated to the works of William Shakespeare and the early modern period. Located in Washington, D.C., the Folger Shakespeare Library houses the world’s largest collection of Shakespeare’s printed works, including rare editions like the First Folio”
You live and learn.

References

Dataset Citation

Source: Kaggle
Dataset: Amazon Books Reviews

License: The dataset is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

This analysis was conducted with the assistance of Microsoft Copilot, an AI companion created by Microsoft, to help process data and provide coding assistance.

citation("readr")
citation("lubridate")
citation("knitr")
citation("rmarkdown")
citation("tidyverse")
citation("janitor")
citation("naniar")
citation("stringdist")
citation("tidytext")
citation("kableExtra")
citation("pheatmap")

richrowe.uk