Book Data Analysis

Explore the Book Details csv file from the Kaggle Amazon Book Reviews dataset.

Purpose: Clean data and conduct basic analysis exploring different packages.

Citations are at the end of the file.

Summary.

The data set contains 212404 observations in 10 columns, listing various details of published books. A large share of the values (approx. 23.8%) is missing and the dates come in several different formats. An attempt was made to normalise the dates, but it was only partly successful because of the multiple formats.
This is to be re-examined at a later date.
There are multiple entries for some book titles, due to spelling variations and because some books have been published multiple times.
The author and categories columns contain lists that were unnested as part of the analysis.



Load the environment.

library(readr)
library(lubridate)
library(knitr)
library(rmarkdown)
library(tidyverse)
library(janitor)
library(naniar)
library(stringdist)
library(tidytext)
library(kableExtra)


Import the books_data.csv file. Show how many observations are in the df.

book_names <- read_csv("books_data.csv", show_col_types = FALSE)
nrow(book_names)
## [1] 212404
glimpse(book_names)
## Rows: 212,404
## Columns: 10
## $ Title         <chr> "Its Only Art If Its Well Hung!", "Dr. Seuss: American I…
## $ description   <chr> NA, "Philip Nel takes a fascinating look into the key as…
## $ authors       <chr> "['Julie Strain']", "['Philip Nel']", "['David R. Ray']"…
## $ image         <chr> "http://books.google.com/books/content?id=DykPAAAACAAJ&p…
## $ previewLink   <chr> "http://books.google.nl/books?id=DykPAAAACAAJ&dq=Its+Onl…
## $ publisher     <chr> NA, "A&C Black", NA, "iUniverse", NA, "Wm. B. Eerdmans P…
## $ publishedDate <chr> "1996", "2005-01-01", "2000", "2005-02", "2003-03-01", "…
## $ infoLink      <chr> "http://books.google.nl/books?id=DykPAAAACAAJ&dq=Its+Onl…
## $ categories    <chr> "['Comics & Graphic Novels']", "['Biography & Autobiogra…
## $ ratingsCount  <dbl> NA, NA, NA, NA, NA, 5, NA, 3, NA, NA, NA, NA, NA, NA, NA…

Some of the columns (authors, categories) appear to hold list-like strings wrapped in square brackets [].

View the first six observations.

head(book_names)
## # A tibble: 6 × 10
##   Title   description authors image previewLink publisher publishedDate infoLink
##   <chr>   <chr>       <chr>   <chr> <chr>       <chr>     <chr>         <chr>   
## 1 Its On…  <NA>       ['Juli… http… http://boo… <NA>      1996          http://…
## 2 Dr. Se… "Philip Ne… ['Phil… http… http://boo… A&C Black 2005-01-01    http://…
## 3 Wonder… "This reso… ['Davi… http… http://boo… <NA>      2000          http://…
## 4 Whispe… "Julia Tho… ['Vero… http… http://boo… iUniverse 2005-02       http://…
## 5 Nation…  <NA>       ['Edwa… <NA>  http://boo… <NA>      2003-03-01    http://…
## 6 The Ch… "In The Ch… ['Ever… http… http://boo… Wm. B. E… 1996          http://…
## # ℹ 2 more variables: categories <chr>, ratingsCount <dbl>

Summarise the number of missing values in each column using the naniar package.

#use the tidyverse to summarise missing values for each column.
#summary_missing_book_names <- book_names %>%
#  summarize(across(everything(), ~ sum(is.na(.))))
#summary_missing_book_names

#use the naniar package to summarise missing values for each column
miss_var_summary(book_names)
## # A tibble: 10 × 3
##    variable      n_miss  pct_miss
##    <chr>          <int>     <num>
##  1 ratingsCount  162652 76.6     
##  2 publisher      75886 35.7     
##  3 description    68442 32.2     
##  4 image          52075 24.5     
##  5 categories     41199 19.4     
##  6 authors        31413 14.8     
##  7 publishedDate  25305 11.9     
##  8 previewLink    23836 11.2     
##  9 infoLink       23836 11.2     
## 10 Title              1  0.000471

Visualise the missing data using the naniar package.

gg_miss_var(book_names)

That's a lot of missing values. Let's calculate the total percentage of missing values in the data.

# using base R
#percentage_of_missing_values <- sum(is.na(book_names)) / prod(dim(book_names)) * 100
#print(percentage_of_missing_values)

#using the naniar package to calculate the total percentage of missing values in the df
print(total_pct_miss <- pct_miss(book_names))
## [1] 23.7587

The data contains 212404 records in 10 columns. 23.8% of the values are missing.


Let's clean the dataset: create a new df and trim whitespace from the character columns.

book_names_clean <- book_names %>%
  mutate(across(everything(), ~ if (is.character(.)) str_trim(.) else .))

Find duplicate values based on book title using the dplyr package.

duplicate_titles <- book_names_clean %>%
     dplyr::group_by(Title) %>%
     dplyr::filter(n() > 1) %>%
     ungroup()

print(duplicate_titles)
## # A tibble: 0 × 10
## # ℹ 10 variables: Title <chr>, description <chr>, authors <chr>, image <chr>,
## #   previewLink <chr>, publisher <chr>, publishedDate <chr>, infoLink <chr>,
## #   categories <chr>, ratingsCount <dbl>

No exact duplicates were found: every Title string is unique, so any remaining duplicates must differ in spelling, punctuation or case.
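An exact match on the raw title string cannot catch titles that differ only in case or punctuation. The sketch below shows one way to look for such near-duplicates by matching on a normalised key. It uses base R only, the toy titles are illustrative, and `normalise_title()` is a hypothetical helper that is not part of the original analysis.

```r
# Toy titles (illustrative values, loosely modelled on rows in this dataset)
titles <- c("A Pocket Full of Rye",
            "A pocket full of rye",
            "A Pocketful of Rye",
            "13 For Luck")

# Hypothetical helper: lower-case, drop punctuation, squeeze whitespace
normalise_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

key <- normalise_title(titles)

# Exact matching on the raw strings finds no duplicates
any(duplicated(titles))

# Matching on the normalised key groups the first two titles together
dup_keys <- unique(key[duplicated(key)])
titles[key %in% dup_keys]
```

Spelling variants such as "Pocketful" still need a fuzzy comparison, since normalisation only removes differences in case, punctuation and spacing.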

Sort by book title to quickly determine duplicate or similar name titles.

books_title_sorted <- book_names_clean %>%
  select(-description, -categories, -image, -previewLink, -infoLink, -ratingsCount) %>%      # remove unnecessary columns to make the results easier to read
  arrange(Title) %>%                                        # Arrange by title
  slice_head(n = 10)                                        # Get the first 10 rows for clarity

kable(books_title_sorted)  # Display the sorted table
| Title | authors | publisher | publishedDate |
|:--|:--|:--|:--|
| ” Film technique, ” and, ” Film acting ” | [‘V. I. Pudovkin’] | Sims Press | 2008-11 |
| ” We’ll Always Have Paris”: The Definitive Guide to Great Lines from the Movies | [‘Robert A. Nowlan’, ‘Gwendolyn Wright Nowlan’] | Perennial | 1994 |
| “… And Poetry is Born …” Russian Classical Poetry | [‘Aleksandr Sergeevich Pushkin’] | NA | 1984 |
| “A Titanic hero” Thomas Andrews, shipbuilder | [‘Shan F. Bullock’] | NA | 1913 |
| “A Truthful Impression of the Country”: British and American Travel Writing in China, 1880-1949 | [‘Nicholas J. Clifford’, ‘Nicholas Rowland Clifford’, ‘Nick Clifford’] | University of Michigan Press | 2001 |
| “A careless word, a needless sinking”: A history of the staggering losses suffered by the U.S. Merchant Marine, both in ships and personnel during World War II | [‘Arthur R. Moore’] | NA | 1983 |
| “A careless word– a needless sinking”: A history of the staggering losses suffered by the U.S. Merchant Marine, both in ships and personnel during World War II | [‘Arthur R. Moore’] | NA | 1983 |
| “A giant in the earth,”: A biography of Dr. J. B. Boddie, | [‘Charles Emerson Boddie’] | NA | 1944 |
| “A kind of life”: Conversations in the combat zone | [‘Roswell Angier’] | NA | 1976-01-01 |
| “A parallel”, the basis of the Book of Mormon: B.H. Roberts’ “Parallel” of the Book of Mormon to View of the Hebrews | [‘Hal Hougey’, ‘Brigham Henry Roberts’] | NA | 1963 |

The sorted titles show that the data contains near-duplicate titles that differ only in punctuation, for example the two “A careless word ... a needless sinking” entries.



Initial inspection of the book_names df shows a number of issues.


1. Four columns can be removed (image, previewLink, infoLink and ratingsCount).
2. Column names to be standardised to lower case.
3. Missing values to be converted to NA.
4. Date values to be normalised to year.
5. Square brackets (list values) and quotes to be removed from the authors and categories columns.
6. Duplicate values should be investigated.

Remove the unwanted columns.

book_names_clean <- book_names_clean %>% 
  select(-image, -previewLink, -infoLink, -ratingsCount)

Standardize column names to lower case.

# Calling tolower(colnames(book_names_clean)) on its own only returns the lower-cased names; it does not rename the columns
book_names_clean <- book_names_clean %>% 
  rename_with(tolower)

Fill any blank cells with NA. Use the naniar package to get a summary of missing values.

book_names_clean <- book_names_clean %>%
  mutate(across(where(is.character), ~na_if(., "")))
              
miss_var_summary(book_names_clean)
## # A tibble: 6 × 3
##   variable      n_miss  pct_miss
##   <chr>          <int>     <num>
## 1 publisher      75886 35.7     
## 2 description    68442 32.2     
## 3 categories     41199 19.4     
## 4 authors        31413 14.8     
## 5 publisheddate  25305 11.9     
## 6 title              1  0.000471


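The data set contains list-like strings in the authors and categories columns. Create a custom function that removes the brackets and quotes from these columns and splits the cleaned string into individual items. Note that the function strips every single quote, including apostrophes inside names, and leaves double quotes in place, which is why one author appears as “Louis LAmour” in later tables.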

# Function to extract items from list-like strings
extract_items <- function(x) {
  if (is.na(x)) {
    return(NA_character_)
  } else {
    x <- gsub("\\[|\\]|'", "", x)       # Remove brackets and quotes
    return(strsplit(x, ", ")[[1]])      # Split into individual items
  }
}

# Apply the function to authors and categories using the purrr package map() function
book_names_clean <- book_names_clean %>%
  mutate(authors_list = map(authors, extract_items),
         categories_list = map(categories, extract_items))

# Unnest these lists to get each author/category on a separate row. We will use these later.
book_names_authors_unnested <- book_names_clean %>% 
 unnest(authors_list)

book_names_categories_unnested <- book_names_clean %>%
  unnest(categories_list)
  

#remove the original columns.
book_names_clean <- book_names_clean %>%
  select(-authors, -categories)
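Removing every single quote also removes apostrophes inside names, and double-quoted items keep their quotes. The sketch below is one possible alternative parser, assuming the raw strings follow the Python list style seen in the glimpse output, where an item containing an apostrophe is wrapped in double quotes. `parse_list_string()` is a hypothetical name; it is not the function used to produce the results in this report.

```r
# Hypothetical alternative to extract_items(): strip only the wrapping quotes
parse_list_string <- function(x) {
  if (is.na(x)) return(NA_character_)
  x <- sub("^\\[", "", x)                 # drop the opening bracket
  x <- sub("\\]$", "", x)                 # drop the closing bracket
  # split on a comma that sits between a closing and an opening quote
  items <- strsplit(x, "(?<=['\"]),\\s*(?=['\"])", perl = TRUE)[[1]]
  items <- sub("^['\"]", "", items)       # remove one leading quote
  items <- sub("['\"]$", "", items)       # remove one trailing quote
  items
}

parse_list_string("['Hal Hougey', 'Brigham Henry Roberts']")
parse_list_string("[\"Louis L'Amour\"]")
parse_list_string(NA)
```

Because only the outer quote of each item is removed, an apostrophe inside a name is preserved, and a comma inside a quoted item does not trigger a split.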


Convert the publisheddate column to a numeric year (yyyy), handling multiple date formats and NA values.

book_names_clean <- book_names_clean %>%
  mutate(publisheddate = case_when(
    is.na(publisheddate) ~ NA_real_,          # Keep NA values as is
    grepl("^\\d{4}-\\d{2}-\\d{2}$", publisheddate) ~ as.numeric(format(as.Date(publisheddate, "%Y-%m-%d"), "%Y")),
    grepl("^\\d{4}-\\d{2}$", publisheddate) ~ as.numeric(format(as.Date(publisheddate, "%Y-%m"), "%Y")),
    grepl("^\\d{4}$", publisheddate) ~ as.numeric(publisheddate),
    TRUE ~ NA_real_  # Handle any other unexpected cases
  ))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `publisheddate = case_when(...)`.
## Caused by warning:
## ! NAs introduced by coercion
# Using lubridate instead produced more parsing failures
#book_names <- book_names %>%
#  mutate(publisheddate = case_when(
#    grepl("^\\d{4}-\\d{2}-\\d{2}$", publisheddate) ~ year(ymd(publisheddate)),
#    grepl("^\\d{4}-\\d{2}$", publisheddate) ~ year(ym(publisheddate)),
#    grepl("^\\d{4}$", publisheddate) ~ as.numeric(publisheddate),
#    TRUE ~ NA_real_
#  ))
# 96713 failed to parse using lubridate

Count the number of NA values in publisheddate

sum(is.na(book_names_clean$publisheddate))
## [1] 36612
mean(is.na(book_names_clean$publisheddate)) * 100
## [1] 17.237

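The share of NA values has increased from 11.9% to 17.2%. Values in non-standard formats fall through to the fallback clause, which assigns NA to all unhandled cases. In addition, as.Date() cannot parse a year-month string such as “2005-02” with the format “%Y-%m” because no day is present, so that branch also returns NA. Further investigation is recommended.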
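One way to avoid losing year-month values is to skip date parsing and read the year straight from the string. The sketch below is a minimal illustration on toy values; `extract_year()` is a hypothetical helper, and treating values such as "19??" or "1973*" as NA is an assumption rather than a decision taken in the original analysis.

```r
# Toy values covering the formats seen in the data
dates <- c("1996", "2005-01-01", "2005-02", "19??", "1973*", NA)

# as.Date() needs a day, so a year-month string parses to NA
as.Date("2005-02", "%Y-%m")

# Hypothetical helper: accept yyyy, yyyy-mm or yyyy-mm-dd and keep the year
extract_year <- function(x) {
  ok <- !is.na(x) & grepl("^\\d{4}(-\\d{2}){0,2}$", x)
  year <- rep(NA_real_, length(x))
  year[ok] <- as.numeric(substr(x[ok], 1, 4))   # first four characters are the year
  year
}

extract_year(dates)
```

Converting only the values that pass the pattern check also avoids the "NAs introduced by coercion" warning, because as.numeric() never sees a non-numeric string.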


Inspect the original values in the book_names df for invalid date formats.

unique_formats <- unique(book_names$publishedDate)
#view(unique_formats)

A few non-standard formats were noted in the original data, such as “19??” and “1973*”.
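Scanning every unique value by eye is slow. A possible shortcut, sketched below on toy values, is to replace each digit with a placeholder so that values collapse into a handful of shapes that can be tabulated. The variable names are illustrative.

```r
# Toy values; in the report these would come from book_names$publishedDate
dates <- c("1996", "2005-01-01", "2000", "2005-02", "19??", "1973*", NA)

# Replace every digit with "d" so each value is reduced to its shape
shape <- gsub("[0-9]", "d", dates)

# Count how often each shape occurs, most frequent first
shape_counts <- sort(table(shape, useNA = "ifany"), decreasing = TRUE)
shape_counts
```

Any shape other than dddd, dddd-dd or dddd-dd-dd points to values that need special handling.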

Explore the first and last dates in the publisheddate column.

first_date <- min(book_names_clean$publisheddate, na.rm = TRUE)
last_date <- max(book_names_clean$publisheddate, na.rm = TRUE)

print(first_date)
## [1] 1016
print(last_date)
## [1] 2030


The dates fall outside the expected range. We can filter to see which books are listed for these dates. We de-select the description column as it is not required.

unusual_dates <- book_names_clean %>%
  dplyr::filter(publisheddate %in% c(1016, 2030)) %>% 
  select(-description)

kable(unusual_dates)
| title | publisher | publisheddate | authors_list | categories_list |
|:--|:--|--:|:--|:--|
| MANSIONS OF DARKNESS | Valancourt Books | 1016 | Archie Roy | Fiction |
| A Wealth of Wisdom: Legendary African American Elders Speak | Simon and Schuster | 2030 | Camille Cosby , Rene Poussaint | Biography & Autobiography |

A quick internet search shows that the published date is wrong for at least these two records.

Bonus
Tried to compare every book title with every other to find similarly named books using the stringdist package. This did not work because it required too much memory (168GB) and would involve something like 45 billion comparisons. Considered block processing based on other fields instead; a comparison on a much smaller subset is done in a later code chunk.

# similarity_scores <- stringdist::stringdistmatrix(book_names_clean$title, method = "jw")
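The memory figure can be checked with some quick arithmetic. The sketch below assumes the result is stored as a dist object holding one 8-byte double per unique pair of titles, which is my understanding of what stringdistmatrix() returns when given a single vector. Comparing every title with every title gives about 45 billion combinations, of which roughly half are unique pairs.

```r
n <- 212404                       # rows in the dataset

n_squared <- n^2                  # every title against every title
n_pairs   <- n * (n - 1) / 2      # unique pairs only (lower triangle)

bytes <- n_pairs * 8              # assumption: one 8-byte double per pair
gib   <- bytes / 2^30             # convert bytes to GiB

n_squared
n_pairs
round(gib)
```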

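Find the total number of authors in the df. Because authors_list is a list column here, n_distinct() counts distinct author combinations rather than individual authors.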

total_authors <- book_names_clean %>%
  dplyr::filter(!is.na(authors_list)) %>%
  summarise(total_authors = n_distinct(authors_list))

kable(total_authors)
| total_authors |
|--:|
| 127275 |
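Counting distinct values of a list column counts author combinations, so a co-authored book adds a new entry even when both authors already appear elsewhere. The sketch below uses made-up authors in base R to show the difference between the two counts. In the report itself, counting distinct values in the unnested data frame would give the per-author figure.

```r
# Toy list column: one element per book (made-up authors)
authors_list <- list(
  c("Author A", "Author B"),
  "Author A",
  "Author B",
  c("Author B", "Author A"),
  NA_character_
)

# Distinct combinations: what counting the list column measures
combos <- unique(authors_list[!is.na(authors_list)])
length(combos)

# Distinct individuals: flatten the list first, then drop NA
individuals <- unlist(authors_list)
individuals <- individuals[!is.na(individuals)]
length(unique(individuals))
```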


Find the Author with the most book releases.

author_ranked <- book_names_clean %>%
  dplyr::filter(!is.na(authors_list)) %>% 
  group_by(authors_list) %>%
  summarise(total_books = n()) %>%
  arrange(desc(total_books))
  
author_ranked %>% 
  head(10) %>%      #use the head() function together with the kable() to display the top 10
  kable() %>% 
  kable_styling(full_width = TRUE)
| authors_list | total_books |
|:--|--:|
| Rose Arny | 236 |
| William Shakespeare | 191 |
| Library of Congress. Copyright Office | 178 |
| Agatha Christie | 142 |
| Erle Stanley Gardner | 124 |
| “Louis LAmour” | 123 |
| Charles Dickens | 89 |
| Edgar Rice Burroughs | 85 |
| Zane Grey | 75 |
| Rudyard Kipling | 75 |
#print(head(author_ranked, 10))

I had never heard of “Rose Arny” but an internet search revealed, “her prominence in the data could be due to her role in the publishing industry rather than as an author of original works. She was the editor of Publishers Weekly, a major trade publication in the book industry, and her name often appeared in connection with book reviews, announcements, and other publishing-related content.”

Some elements in the authors_list column include lists (e.g. $ : chr [1:3] “K. H. Scheer” “Wendayne Ackerman” “F. J. Ackerman”) rather than a simple character string. To get around this we mutate the column, collapsing list elements into a single string. Then use ggplot to produce a bar graph.

author_ranked <- author_ranked %>%
  mutate(authors_list = sapply(authors_list, paste, collapse = " "))  # Collapse list elements into a single string

top_10_authors <- author_ranked %>% 
  head(10)

ggplot(top_10_authors, aes(y = reorder(authors_list, -total_books), x = total_books)) +
  geom_bar(stat = "identity", fill = "#4183C4") +
  labs(
    title = "Top 10 Authors by books published", x = "Total published", y = "Author")

Wanting to understand more about “Rose Arny”, I list 10 books that name her as author.

rose_arny_books <- book_names_authors_unnested %>%
  dplyr::filter(authors_list == "Rose Arny") %>% 
  select(-description, -categories) %>% 
  arrange((title)) %>% 
  slice_head(n = 10)      # display only 10 for clarity
kable(rose_arny_books)
| title | authors | publisher | publisheddate | authors_list | categories_list |
|:--|:--|:--|:--|:--|:--|
| 01443 DEVELOPING SKILLS IN ALGEBRA ONE, BOOK C | [‘Rose Arny’] | NA | 1995 | Rose Arny | American literature |
| 1,000 Points of Light: The Public Remains in the Dark (Oswald’s Closest Friend: The George De Mohrenschildt Story, Volume 1) | [‘Rose Arny’] | NA | 1999 | Rose Arny | American literature |
| 36 propositions for a home/36 modeles pour une maison (English and French Edition) | [‘Rose Arny’] | NA | 1998 | Rose Arny | American literature |
| A Bride for Crimson Falls (Silhouette Desire Ser, No. 1076) | [‘Rose Arny’] | NA | 1997 | Rose Arny | American literature |
| A Lady’s Point of View (Harlequin Regency Romance, No 14) | [‘Rose Arny’] | NA | 2001-06 | Rose Arny | American literature |
| A Man I Used to Know: Love that Man! (Harlequin Superromance No. 831) | [‘Rose Arny’] | NA | 1999-04 | Rose Arny | American literature |
| A Man Like Michael (Desire Ser.) | [‘Rose Arny’] | NA | 1997 | Rose Arny | American literature |
| A Mother’s Secrets (Randolph Family Ties, Book 3) (Harlequin Intrigue Series #577)) | [‘Rose Arny’] | NA | 2000 | Rose Arny | American literature |
| A Perfect Pair (Silhouette Special Edition No. 1590) | [‘Rose Arny’] | NA | 2003 | Rose Arny | American literature |
| A Rosey Little Christmas / Jingle Bell Bride (Harlequin Duets, No. 64) | [‘Rose Arny’] | NA | 2002 | Rose Arny | American literature |

The wide range of genres suggests there are errors in the data. The titles all fall under the single category “American literature”, which reaffirms that “Rose Arny” appears through her role in the publishing industry rather than as an author.

Check the titles under the author “Agatha Christie” for duplicates.

agatha_christie_books <- book_names_authors_unnested %>%
  dplyr::filter(authors_list == "Agatha Christie") %>% 
  select(-description, -categories) %>% 
  arrange((title)) %>% 
  slice_head(n = 30)                # limit to top 30 for clarity

agatha_christie_books %>%         
  head(10) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)
| title | authors | publisher | publisheddate | authors_list | categories_list |
|:--|:--|:--|:--|:--|:--|
| 13 CLUES FOR MISS MARPLE | [‘Agatha Christie’] | NA | 1975 | Agatha Christie | Detective and mystery stories |
| 13 For Luck | [‘Agatha Christie’] | Dell Publishing Company | 1961 | Agatha Christie | Fiction |
| 13 clues for Miss Marple;: A collection of mystery stories, | [‘Agatha Christie’] | NA | 1966 | Agatha Christie | Detective and mystery stories, English |
| 4.50 from Paddington (Agatha Christie Collection S.) | [‘Agatha Christie’] | NA | 2000 | Agatha Christie | England |
| A Murder Is Announced | [‘Agatha Christie’] | HarperCollins | 2016-12-29 | Agatha Christie | NA |
| A Murder Is Announced (Miss Marple Mysteries) | [‘Agatha Christie’] | HarperCollins | 2016-12-29 | Agatha Christie | NA |
| A Pocket Full of Rye | [‘Agatha Christie’] | Signet Book | 2000 | Agatha Christie | Fiction |
| A Pocketful of Rye | [‘Agatha Christie’] | NA | 2006 | Agatha Christie | Detective and mystery stories |
| A Star over Bethlehem and Other Stories | [‘Agatha Christie’] | William Morrow Paperbacks | 2011-10-25 | Agatha Christie | Fiction |
| A pocket full of rye | [‘Agatha Christie’] | Harper Collins | 2011-06-14 | Agatha Christie | Fiction |

The results show that some of the 142 book titles are republished works of the same title. They may also be listed under different categories. The title column should be normalised by converting the text to lower case and removing unnecessary characters and spaces.

Normalise the title column of agatha_christie_books with the stringr package.

agatha_christie_books <- agatha_christie_books %>%
  mutate(title = str_replace_all(title, "[[:punct:]]", "")) %>%  # Remove punctuation
  mutate(title = str_to_lower(title)) %>%                       # Convert to lowercase
  mutate(title = str_trim(title)) %>%                           # Trim whitespace
  mutate(title = str_replace_all(title, "\\s+", " "))           # Replace extra spaces

Compare the sliced book titles to find similarly named books using the stringdist package, this time with a much smaller data set than before. The fig.width and fig.height values for the code chunk were adjusted for clarity. The pheatmap package was used instead of the base R heatmap() function because it gave better customisation options for text size.

similarity_scores <- stringdist::stringdistmatrix(agatha_christie_books$title, method = "jw")

similarity_matrix <- as.matrix(similarity_scores)

rownames(similarity_matrix) <- agatha_christie_books$title
colnames(similarity_matrix) <- agatha_christie_books$title

#print(similarity_scores)

# Create a heatmap to identify similar names
library(pheatmap)
## Warning: package 'pheatmap' was built under R version 4.4.3
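# stringdist returns distances (0 = identical); with the default p = 0, method = "jw" gives the plain Jaro distance
pheatmap(similarity_matrix,
         main = "Jaro Distance Between Titles",
         fontsize_row = 7, fontsize_col = 7)  # Adjust font sizes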


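Find the closest non-identical pair of titles among the 30 sliced titles. Pairs with a distance of 0 (identical titles after normalisation) are excluded by the filter.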

which(similarity_matrix == min(similarity_matrix[similarity_matrix > 0]), arr.ind = TRUE)
##                      row col
## a pocketful of rye     8   7
## a pocket full of rye   7   8
## a pocket full of rye  10   8
## a pocketful of rye     8  10
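Taking only the minimum returns a single closest pair. To review every candidate pair, one option is to list all pairs below a distance threshold. The sketch below swaps in base R `adist()` (Levenshtein edit distance) in place of the stringdist distance used above, purely so that it runs without extra packages. The titles are toy values, and the 0.2 threshold and the normalisation by the longer title length are illustrative assumptions.

```r
# Toy normalised titles
titles <- c("a pocket full of rye", "a pocketful of rye",
            "a murder is announced", "13 for luck")

d <- adist(titles)                                  # edit distance between every pair

# Scale by the longer title of each pair so long titles are not penalised
d_norm <- d / outer(nchar(titles), nchar(titles), pmax)

# Keep each pair once (upper triangle) and only those under the threshold
idx <- which(upper.tri(d_norm) & d_norm < 0.2, arr.ind = TRUE)

close_pairs <- data.frame(
  title_a  = titles[idx[, "row"]],
  title_b  = titles[idx[, "col"]],
  distance = d_norm[idx]
)
close_pairs
```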


Let's count and sort the categories to see how many books are in each category. We know from our exploration above that 19.4% of the categories values are NA, so we will remove them.

category_counts <- book_names_clean %>%
  dplyr::filter(!is.na(categories_list)) %>% 
  group_by(categories_list) %>%
  summarise(total_books = n()) %>%
  arrange(desc(total_books)) %>% 
  slice_head(n = 10)
# View the results using kable to ensure categories_list names are printed.
kable(category_counts)
| categories_list | total_books |
|:--|--:|
| Fiction | 23419 |
| Religion | 9459 |
| History | 9330 |
| Juvenile Fiction | 6643 |
| Biography & Autobiography | 6324 |
| Business & Economics | 5625 |
| Computers | 4312 |
| Social Science | 3834 |
| Juvenile Nonfiction | 3446 |
| Science | 2623 |

Some elements of the categories_list column in the category_counts df contain lists (e.g. chr [1:2] “Body” “Mind & Spirit”) rather than a simple character string. To get around this we mutate the column, collapsing list elements into a single string. Then use ggplot to create a bar graph.

category_counts <- category_counts %>%
  mutate(categories_list = sapply(categories_list, paste, collapse = " "))  # Collapse list elements into a single string


ggplot(category_counts, aes(y = reorder(categories_list, -total_books), x = total_books)) +
  geom_bar(stat = "identity", fill = "#4183C4") +
  labs(
    title = "Top 10 Categories by Total Books published", x = "Total published", y = "Category") 

Fiction is by far the most common category, with more than twice as many titles as Religion, the next largest.

Find the top 10 Fiction Authors.

top_10_fiction_writers <- book_names_categories_unnested %>% 
  dplyr::filter(categories_list == "Fiction") %>% 
  group_by(authors_list) %>% 
  summarise(total_books = n()) %>% 
  arrange(desc(total_books)) %>% 
  slice_head(n = 10)

kable(top_10_fiction_writers)
| authors_list | total_books |
|:--|--:|
| “Louis LAmour” | 105 |
| Agatha Christie | 76 |
| Nora Roberts | 51 |
| Edgar Rice Burroughs | 48 |
| Georgette Heyer | 43 |
| Zane Grey | 40 |
| Stephen King | 38 |
| Rex Stout | 38 |
| William W. Johnstone | 37 |
| Danielle Steel | 36 |

A quick internet search revealed: “Louis L’Amour was a prolific American author best known for his Western novels, which he often referred to as ‘frontier stories’. His most famous works include Hondo, Last of the Breed, and the Sackett series, which remains a cornerstone of Western literature.”

Explore which year had the most books published. We know from our previous exploration that 17.2% of the published date values are NA values so we will ignore these.

book_year_count <- book_names_clean %>%
  dplyr::filter(!is.na(publisheddate)) %>% 
  group_by(publisheddate) %>%
  summarize(count = n()) %>% 
  arrange(desc(count)) %>% 
  slice_head(n = 50)
  
book_year_count %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(position = "left")
| publisheddate | count |
|--:|--:|
| 2004 | 7337 |
| 2003 | 7154 |
| 2005 | 6992 |
| 2002 | 6940 |
| 2000 | 6799 |
| 2001 | 6419 |
| 1999 | 6173 |
| 1998 | 5582 |
| 2012 | 5247 |
| 2013 | 5001 |
#print(book_year_count)  

Create a bar graph showing the most popular years for books published.

ggplot(book_year_count, aes(x = publisheddate, y = count)) +
geom_bar(stat = "identity", fill = "#4183C4") +
  labs(title = "Top 50 years for number of books published")

Publication counts in this dataset peak in 2004, and the only years after 2008 in the top 10 are 2012 and 2013. Further research suggests a possible contributing factor: the global financial crisis of 2008 had a significant effect on consumer spending, which also affected book publishing. Publishers were less willing to invest in new authors and new titles. The rise of digital media was also influencing consumer behaviour, with people turning online for entertainment and information.

Find the most common words in the description column. Start by breaking the description into individual words and counting them. Then compare that list to stop_words in the tidytext package to remove common words.

word_counts <- book_names_clean %>%
  
  dplyr::filter(!is.na(description)) %>%            # filter out rows with NA in the description
  unnest_tokens(word, description) %>%              # break descriptions into individual words (lower-cased by default)
  mutate(word = str_to_lower(word)) %>%             # redundant after unnest_tokens(), kept as a safeguard
  count(word, sort = TRUE)                          # count occurrences of each word and sort by frequency


# use the tidytext package to remove common stop words ("the", "and", etc.)
word_counts <- word_counts %>%
  anti_join(stop_words, by = "word")                # this will compare words in the word column to the stop_words list

#top_10_words <- word_counts                    # create a list of top 10 common words
#  slice_head(n = 10)
#print(top_10_words)

word_counts %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)
| word | n |
|:--|--:|
| book | 62910 |
| life | 37923 |
| world | 30768 |
| time | 21432 |
| history | 21220 |
| story | 17688 |
| guide | 17099 |
| author | 16805 |
| edition | 16463 |
| american | 16353 |
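The tokenise, filter and count steps can be illustrated without any packages. The sketch below uses base R on two made-up descriptions and a tiny hand-written stop list. These are assumptions for illustration only; the analysis itself uses unnest_tokens() and the much larger stop_words table from tidytext.

```r
# Made-up descriptions, including one missing value
descriptions <- c("The story of a life in the world of books.",
                  "A guide to the history of the world.",
                  NA)

stop_list <- c("the", "of", "a", "in", "to")   # tiny illustrative stop list

# Lower-case, then split on anything that is not a letter or apostrophe
words <- unlist(strsplit(tolower(descriptions[!is.na(descriptions)]), "[^a-z']+"))

# Drop empty tokens and stop words, then count what is left
words <- words[nzchar(words) & !(words %in% stop_list)]
word_freq <- sort(table(words), decreasing = TRUE)
word_freq
```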


Find the top 10 words used in the description column to describe fiction books, using the book_names_categories_unnested df created previously. Tokenising the descriptions in the earlier code chunk, when book_names_categories_unnested was created, caused errors. Filtering for fiction before calling unnest_tokens() greatly reduces the computation required. Note that str_detect() matches any category containing “Fiction”, so categories such as Juvenile Fiction are included here.

fiction_words <- book_names_categories_unnested %>%
  dplyr::filter(str_detect(categories_list, "Fiction"), !is.na(description)) %>%  # filter for Fiction and remove NA descriptions

  unnest_tokens(word, description) %>%              # break descriptions into individual words
  count(word, sort = TRUE) %>%                      # Count occurrences of each word and sort
  anti_join(stop_words, by = "word")                # this will compare words in the word column to the stop_words list

fiction_words %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)
| word | n |
|:--|--:|
| life | 9353 |
| world | 7345 |
| love | 6839 |
| book | 6499 |
| story | 6279 |
| time | 5141 |
| de | 5097 |
| family | 5063 |
| author | 4661 |
| stories | 3790 |
#print(head(fiction_words, 10))
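The fiction filter here uses a substring match, whereas the top 10 fiction authors were selected with an exact match, so the two results cover different sets of books. The sketch below shows the difference on toy category values, using base R `grepl()` as a stand-in for str_detect().

```r
# Toy category values
categories <- c("Fiction", "Juvenile Fiction", "Science", "Fiction")

exact_n     <- sum(categories == "Fiction")        # exact match only
substring_n <- sum(grepl("Fiction", categories))   # also matches "Juvenile Fiction"

exact_n
substring_n
```

Either choice can be reasonable, but it helps to use the same one in both places so that the tables are comparable.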


Find the top 10 words used in the description column to describe Agatha Christie's books, using the book_names_authors_unnested df created earlier. Filtering before tokenising and converting to lower case here, rather than when book_names_authors_unnested was created, greatly reduces the computation required.

agatha_christie_words <-book_names_authors_unnested %>% 
  dplyr::filter(str_detect(authors_list, "Agatha Christie"), !is.na(description)) %>% 
  
  unnest_tokens(word, description) %>%              # break descriptions into individual words
  mutate(word = str_to_lower(word)) %>%             # Convert all words to lowercase
  count(word, sort = TRUE) %>%                      # Count occurrences of each word and sort
  anti_join(stop_words, by = "word")                # remove words in the word column that appear in the stop_words list
agatha_christie_words %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)
| word | n |
|:--|--:|
| poirot | 82 |
| agatha | 60 |
| hercule | 60 |
| murder | 53 |
| mystery | 52 |
| miss | 37 |
| marple | 33 |
| christie | 30 |
| murderer | 30 |
| detective | 27 |
#print(head(agatha_christie_words, 10))

An internet search revealed: “Hercule Poirot is a fictional Belgian detective created by British writer Agatha Christie. Poirot is Christie’s most famous and longest-running character, appearing in 33 novels, two plays, and 51 short stories published between 1920 and 1975.”

Find the top 10 words used to describe William Shakespeare's books.

Shakespeare_words <-book_names_authors_unnested %>% 
  dplyr::filter(str_detect(authors_list, "William Shakespeare"), !is.na(description)) %>% 
  
  unnest_tokens(word, description) %>%              # break descriptions into individual words
  mutate(word = str_to_lower(word)) %>%             # Convert all words to lowercase
  count(word, sort = TRUE) %>%                      # Count occurrences of each word and sort
  anti_join(stop_words, by = "word")                # remove words in the word column that appear in the stop_words list

Shakespeare_words %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(full_width = TRUE)
| word | n |
|:--|--:|
| shakespeare | 191 |
| play | 124 |
| shakespeare’s | 92 |
| edition | 84 |
| text | 80 |
| plays | 56 |
| folger | 55 |
| notes | 52 |
| introduction | 47 |
| series | 46 |
# print(head(Shakespeare_words, 10))

An internet search revealed “Folger is most likely referring to the Folger Shakespeare Library, a renowned institution dedicated to the works of William Shakespeare and the early modern period. Located in Washington, D.C., the Folger Shakespeare Library houses the world’s largest collection of Shakespeare’s printed works, including rare editions like the First Folio”
You live and learn.



References

Dataset Citation

  • Source: Kaggle
  • Dataset: Amazon Books Reviews

License: The dataset is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

This analysis was conducted with the assistance of Microsoft Copilot, an AI companion created by Microsoft, to help process data and provide coding assistance.


citation("readr")
citation("lubridate")
citation("knitr")
citation("rmarkdown")
citation("tidyverse")
citation("janitor")
citation("naniar")
citation("stringdist")
citation("tidytext")
citation("kableExtra")
citation("pheatmap")