Book Data Analysis
Richard_Rowe
2025-02-27
Explore the Book Details csv file from the Kaggle Amazon Book Reviews dataset.
Purpose: Clean data and conduct basic analysis exploring different packages.
Citations are at the end of the file.
Summary.
The data set contains 212404 observations that list various
information on books in 10 columns. A large share of the values
(approx. 24%) is missing, and the dates are stored in several different
formats. An attempt was made to normalise the dates, but it did not
fully succeed because of the multiple different formats.
This is to be re-examined at a later date.
There are multiple entries for some book titles due to spelling
variations and the fact that some books have been published multiple times.
The authors and categories columns contain
lists that were unnested as part of the analysis.
Load the environment.
library(readr)
library(lubridate)
library(knitr)
library(rmarkdown)
library(tidyverse)
library(janitor)
library(naniar)
library(stringdist)
library(tidytext)
library(kableExtra)
Import the books_data.csv file. Show how many observations are in the df.
book_names <- read_csv("books_data.csv", show_col_types = FALSE)
nrow(book_names)
## [1] 212404
glimpse(book_names)
## Rows: 212,404
## Columns: 10
## $ Title <chr> "Its Only Art If Its Well Hung!", "Dr. Seuss: American I…
## $ description <chr> NA, "Philip Nel takes a fascinating look into the key as…
## $ authors <chr> "['Julie Strain']", "['Philip Nel']", "['David R. Ray']"…
## $ image <chr> "http://books.google.com/books/content?id=DykPAAAACAAJ&p…
## $ previewLink <chr> "http://books.google.nl/books?id=DykPAAAACAAJ&dq=Its+Onl…
## $ publisher <chr> NA, "A&C Black", NA, "iUniverse", NA, "Wm. B. Eerdmans P…
## $ publishedDate <chr> "1996", "2005-01-01", "2000", "2005-02", "2003-03-01", "…
## $ infoLink <chr> "http://books.google.nl/books?id=DykPAAAACAAJ&dq=Its+Onl…
## $ categories <chr> "['Comics & Graphic Novels']", "['Biography & Autobiogra…
## $ ratingsCount <dbl> NA, NA, NA, NA, NA, 5, NA, 3, NA, NA, NA, NA, NA, NA, NA…
It appears that some of the columns (authors and categories) contain list-like strings wrapped in [].
View the first six observations.
head(book_names)
## # A tibble: 6 × 10
## Title description authors image previewLink publisher publishedDate infoLink
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Its On… <NA> ['Juli… http… http://boo… <NA> 1996 http://…
## 2 Dr. Se… "Philip Ne… ['Phil… http… http://boo… A&C Black 2005-01-01 http://…
## 3 Wonder… "This reso… ['Davi… http… http://boo… <NA> 2000 http://…
## 4 Whispe… "Julia Tho… ['Vero… http… http://boo… iUniverse 2005-02 http://…
## 5 Nation… <NA> ['Edwa… <NA> http://boo… <NA> 2003-03-01 http://…
## 6 The Ch… "In The Ch… ['Ever… http… http://boo… Wm. B. E… 1996 http://…
## # ℹ 2 more variables: categories <chr>, ratingsCount <dbl>
Summarise the number of missing values in each column using the naniar package.
#use the tidyverse to summarise missing values for each column.
#summary_missing_book_names <- book_names %>%
# summarize(across(everything(), ~ sum(is.na(.))))
#summary_missing_book_names
#use the naniar package to summarise missing values for each column
miss_var_summary(book_names)
## # A tibble: 10 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 ratingsCount 162652 76.6
## 2 publisher 75886 35.7
## 3 description 68442 32.2
## 4 image 52075 24.5
## 5 categories 41199 19.4
## 6 authors 31413 14.8
## 7 publishedDate 25305 11.9
## 8 previewLink 23836 11.2
## 9 infoLink 23836 11.2
## 10 Title 1 0.000471
Visualise the missing data using the naniar package.
gg_miss_var(book_names)
That's a lot of missing values. Let's calculate the total percentage of missing values in the data.
# using base R
#percentage_of_missing_values <- sum(is.na(book_names)) / prod(dim(book_names)) * 100
#print(percentage_of_missing_values)
#using the naniar package to calculate the total percentage of missing values in the df
print(total_pct_miss <- pct_miss(book_names))
## [1] 23.7587
The data contains 212404 records in 10 columns. 23.8% of the values are missing.
Let's clean the dataset: create a new df and trim whitespace
from the character columns.
book_names_clean <- book_names %>%
mutate(across(everything(), ~ if (is.character(.)) str_trim(.) else .))
Find duplicate values based on book title using the dplyr package.
duplicate_titles <- book_names_clean %>%
dplyr::group_by(Title) %>%
dplyr::filter(n() > 1) %>%
ungroup()
print(duplicate_titles)
## # A tibble: 0 × 10
## # ℹ 10 variables: Title <chr>, description <chr>, authors <chr>, image <chr>,
## # previewLink <chr>, publisher <chr>, publishedDate <chr>, infoLink <chr>,
## # categories <chr>, ratingsCount <dbl>
dplyr found no exact duplicates: grouping by Title only matches titles that are identical character for character.
Sort by book title to quickly determine duplicate or similar name titles.
books_title_sorted <- book_names_clean %>%
select(-description, -categories, -image, -previewLink, -infoLink, -ratingsCount) %>% # remove unnecessary columns to make the results easier to read
arrange(Title) %>% # Arrange by title
slice_head(n = 10) # Get the first 10 rows for clarity
kable(books_title_sorted) # Display the sorted table
Title | authors | publisher | publishedDate |
---|---|---|---|
” Film technique, ” and, ” Film acting ” | [‘V. I. Pudovkin’] | Sims Press | 2008-11 |
” We’ll Always Have Paris”: The Definitive Guide to Great Lines from the Movies | [‘Robert A. Nowlan’, ‘Gwendolyn Wright Nowlan’] | Perennial | 1994 |
“… And Poetry is Born …” Russian Classical Poetry | [‘Aleksandr Sergeevich Pushkin’] | NA | 1984 |
“A Titanic hero” Thomas Andrews, shipbuilder | [‘Shan F. Bullock’] | NA | 1913 |
“A Truthful Impression of the Country”: British and American Travel Writing in China, 1880-1949 | [‘Nicholas J. Clifford’, ‘Nicholas Rowland Clifford’, ‘Nick Clifford’] | University of Michigan Press | 2001 |
“A careless word, a needless sinking”: A history of the staggering losses suffered by the U.S. Merchant Marine, both in ships and personnel during World War II | [‘Arthur R. Moore’] | NA | 1983 |
“A careless word– a needless sinking”: A history of the staggering losses suffered by the U.S. Merchant Marine, both in ships and personnel during World War II | [‘Arthur R. Moore’] | NA | 1983 |
“A giant in the earth,”: A biography of Dr. J. B. Boddie, | [‘Charles Emerson Boddie’] | NA | 1944 |
“A kind of life”: Conversations in the combat zone | [‘Roswell Angier’] | NA | 1976-01-01 |
“A parallel”, the basis of the Book of Mormon: B.H. Roberts’ “Parallel” of the Book of Mormon to View of the Hebrews | [‘Hal Hougey’, ‘Brigham Henry Roberts’] | NA | 1963 |
It is clear from the sorted results that the data contains near-duplicate titles. For example, the two “A careless word” entries differ only in punctuation.
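The gap between "no exact duplicates" and "visible near-duplicates" can be illustrated with a minimal base R sketch. The helper `normalise_title()` and the toy titles are hypothetical and only stand in for the real Title column; this is one possible normalisation, not the method used in the analysis.

```r
# Hypothetical helper: normalise a title so trivial variants compare equal.
normalise_title <- function(x) {
  x <- tolower(x)                  # ignore case
  x <- gsub("[[:punct:]]", "", x)  # drop punctuation
  x <- gsub("\\s+", " ", x)        # collapse runs of whitespace
  trimws(x)
}

# Toy titles (assumed for illustration).
titles <- c("A Pocket Full of Rye", "A pocket full of rye",
            "A Pocketful of Rye", "13 For Luck")

key <- normalise_title(titles)

# Flag every row whose normalised key occurs more than once.
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
near_duplicates <- data.frame(title = titles, key = key)[is_dup, ]
print(near_duplicates)
```

Normalising first catches case and punctuation variants, but spelling variants such as "Pocketful" still need a string-distance comparison.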
Initial inspection of the book_names df shows a number of issues.
1. Four columns can be removed (image, previewLink, infoLink and ratingsCount).
2. Column names to be standardized to lower case.
3. Missing values converted to NA.
4. Date values changed from mixed formats to a single year format (yyyy).
5. Duplicate values should be investigated.
Remove the unwanted columns.
book_names_clean <- book_names_clean %>%
select(-image, -previewLink, -infoLink, -ratingsCount)
Standardize column names to lower case.
#tolower(colnames(book_names_clean)) # This only returns the lower-cased names; it does not rename the columns (e.g. publishedDate), so rename_with() is used instead
book_names_clean <- book_names_clean %>%
rename_with(tolower)
Fill any blank cells with NA. Use the naniar package to get a summary of missing values.
book_names_clean <- book_names_clean %>%
mutate(across(where(is.character), ~na_if(., "")))
miss_var_summary(book_names_clean)
## # A tibble: 6 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 publisher 75886 35.7
## 2 description 68442 32.2
## 3 categories 41199 19.4
## 4 authors 31413 14.8
## 5 publisheddate 25305 11.9
## 6 title 1 0.000471
The data set contains lists inside the authors and categories
columns. Create a custom function that will remove the brackets and
quotes from these columns and split the cleaned string into individual
items.
# Function to extract items from list-like strings
extract_items <- function(x) {
if (is.na(x)) {
return(NA_character_)
} else {
x <- gsub("\\[|\\]|'", "", x) # Remove brackets and quotes
return(strsplit(x, ", ")[[1]]) # Split into individual items
}
}
# Apply the function to authors and categories using the purrr package map() function
book_names_clean <- book_names_clean %>%
mutate(authors_list = map(authors, extract_items),
categories_list = map(categories, extract_items))
# Unnest these lists to get each author/category on a separate row. We will use these later.
book_names_authors_unnested <- book_names_clean %>%
unnest(authors_list)
book_names_categories_unnested <- book_names_clean %>%
unnest(categories_list)
#remove the original columns.
book_names_clean <- book_names_clean %>%
select(-authors, -categories)
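Because extract_items() strips every apostrophe, a name such as L'Amour loses its apostrophe and can keep stray double quotes (the “Louis LAmour” entry in the later tables is consistent with this). Below is a minimal sketch of an alternative parser, assuming the raw strings follow the Python list style with items wrapped in either single or double quotes. The function name `extract_items_quoted()` is hypothetical and is not part of the analysis above.

```r
# Hypothetical alternative to extract_items(): pull out each quoted item
# instead of deleting quote characters, so inner apostrophes survive.
extract_items_quoted <- function(x) {
  if (is.na(x)) {
    return(NA_character_)
  }
  # Match 'single quoted' or "double quoted" items.
  m <- regmatches(x, gregexpr("'[^']*'|\"[^\"]*\"", x))[[1]]
  if (length(m) == 0) {
    return(NA_character_)
  }
  substr(m, 2, nchar(m) - 1)  # strip the surrounding quote characters
}

extract_items_quoted("['Hal Hougey', 'Brigham Henry Roberts']")
extract_items_quoted("[\"Louis L'Amour\"]")
```

Matching whole quoted items also keeps commas that sit inside an item, which splitting on ", " would break apart.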
Convert the publisheddate column to a numeric year (yyyy),
handling multiple date formats and NA values.
book_names_clean <- book_names_clean %>%
mutate(publisheddate = case_when(
is.na(publisheddate) ~ NA_real_, # Keep NA values as is
grepl("^\\d{4}-\\d{2}-\\d{2}$", publisheddate) ~ as.numeric(format(as.Date(publisheddate, "%Y-%m-%d"), "%Y")),
grepl("^\\d{4}-\\d{2}$", publisheddate) ~ as.numeric(format(as.Date(publisheddate, "%Y-%m"), "%Y")),
grepl("^\\d{4}$", publisheddate) ~ as.numeric(publisheddate),
TRUE ~ NA_real_ # Handle any other unexpected cases
))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `publisheddate = case_when(...)`.
## Caused by warning:
## ! NAs introduced by coercion
#using lubridate this produced more errors
#book_names <- book_names %>%
# mutate(publisheddate = case_when(
# grepl("^\\d{4}-\\d{2}-\\d{2}$", publisheddate) ~ year(ymd(publisheddate)),
# grepl("^\\d{4}-\\d{2}$", publisheddate) ~ year(ym(publisheddate)),
# grepl("^\\d{4}$", publisheddate) ~ as.numeric(publisheddate),
# TRUE ~ NA_real_
# ))
# 96713 failed to parse using lubridate
Count the number of NA values in publisheddate
sum(is.na(book_names_clean$publisheddate))
## [1] 36612
mean(is.na(book_names_clean$publisheddate)) * 100
## [1] 17.237
The share of NA values has increased from 11.9% to 17.2%. The fallback clause assigns NA to every unhandled format, and the yyyy-mm branch is a likely contributor as well: as.Date() needs a day component, so as.Date("2005-02", "%Y-%m") returns NA. Further investigation is recommended.
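One way to avoid the date parser altogether is to take the leading four digits whenever the value has one of the three expected shapes. This is a minimal base R sketch under that assumption; `extract_year()` is a hypothetical helper and has not been run against the full data set.

```r
# Hypothetical helper: return the year for yyyy, yyyy-mm and yyyy-mm-dd,
# and NA for anything else (for example "19??" or "1973*").
extract_year <- function(x) {
  ok <- !is.na(x) & grepl("^[0-9]{4}(-[0-9]{2}){0,2}$", x)
  out <- rep(NA_real_, length(x))
  # Only convert values that matched, so no coercion warning is raised.
  out[ok] <- as.numeric(substr(x[ok], 1, 4))
  out
}

raw <- c("1996", "2005-02", "2003-03-01", "19??", "1973*", NA)
years <- extract_year(raw)
print(years)
```

Converting only the matched values also avoids the "NAs introduced by coercion" warning, which arises because case_when() evaluates every right-hand side on all rows.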
Check the original values in the book_names df to check for
invalid date formats.
unique_formats <- unique(book_names$publishedDate)
#view(unique_formats)
A few non-standard formats were noted in the original data, such as
“19??” and “1973*”.
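Rather than scanning the unique values by eye, the formats can be tabulated by replacing every digit with a placeholder. This is a small sketch on assumed sample values; `date_shape()` is a hypothetical helper.

```r
# Hypothetical helper: reduce a date string to its "shape" by masking digits.
date_shape <- function(x) gsub("[0-9]", "9", x)

# Sample values taken from the formats seen above (assumed for illustration).
sample_dates <- c("1996", "2005-01-01", "2000", "2005-02", "19??", "1973*")

# Count how often each shape occurs, most frequent first.
shapes <- sort(table(date_shape(sample_dates)), decreasing = TRUE)
print(shapes)
```

Applied to the real publishedDate column, the rare shapes at the bottom of the table would be the ones that fall through to the NA fallback.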
Explore the first and last dates in the publisheddate column.
first_date <- min(book_names_clean$publisheddate, na.rm = TRUE)
last_date <- max(book_names_clean$publisheddate, na.rm = TRUE)
print(first_date)
## [1] 1016
print(last_date)
## [1] 2030
The dates seem to be out of the expected range. We can filter to
see which books are listed for these dates. We will de-select the
description column as it is not required.
unusual_dates <- book_names_clean %>%
dplyr::filter(publisheddate %in% c(1016, 2030)) %>%
select(-description)
kable(unusual_dates)
title | publisher | publisheddate | authors_list | categories_list |
---|---|---|---|---|
MANSIONS OF DARKNESS | Valancourt Books | 1016 | Archie Roy | Fiction |
A Wealth of Wisdom: Legendary African American Elders Speak | Simon and Schuster | 2030 | Camille Cosby , Rene Poussaint | Biography & Autobiography |
A quick internet search shows that the published date is wrong on at
least 2 of the records.
Bonus
Tried to compare the book titles to find similarly named books
using the stringdist package. This did not work as it required too much
memory (168GB) and would require something like 45 billion
comparisons. Considered using block processing based on other fields, and
this is done in a later code chunk.
# similarity_scores <- stringdist::stringdistmatrix(book_names_clean$title, method = "jw")
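The memory figure can be checked with simple arithmetic, and the effect of blocking can be sketched on a toy example. The blocking key used here (the first character of the title) is an assumption chosen for illustration, not the field used in the analysis.

```r
# Size of the full comparison for n titles.
n <- 212404
ordered_pairs <- n^2              # about 45 billion ordered comparisons
unique_pairs <- n * (n - 1) / 2   # each unordered pair counted once
size_gib <- unique_pairs * 8 / 2^30  # 8 bytes per stored distance

# Blocking: only compare titles that share a key (assumed: first character).
titles <- c("a pocket full of rye", "a pocketful of rye",
            "13 for luck", "13 clues for miss marple", "hamlet")
block <- substr(titles, 1, 1)
pairs_per_block <- sapply(split(titles, block),
                          function(b) choose(length(b), 2))
blocked_pairs <- sum(pairs_per_block)
all_pairs <- choose(length(titles), 2)

cat(round(size_gib), "GiB for the full distance object\n")
cat(blocked_pairs, "of", all_pairs, "pairs remain after blocking\n")
```

The trade-off is that near-duplicates falling into different blocks, for example titles that differ in their first character, are never compared.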
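Find the total number of distinct author entries in the df. Because authors_list is still a list column here, n_distinct() counts distinct author combinations rather than individual authors.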
total_authors <- book_names_clean %>%
dplyr::filter(!is.na(authors_list)) %>%
summarise(total_authors = n_distinct(authors_list))
kable(total_authors)
total_authors |
---|
127275 |
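The difference between counting author combinations and counting individual authors can be seen in a minimal base R sketch. The toy list below is assumed for illustration and mirrors the shape of the authors_list column.

```r
# Toy list column: one element per book, each holding one or more authors.
authors_list <- list(
  c("Hal Hougey", "Brigham Henry Roberts"),
  "Agatha Christie",
  "Agatha Christie",
  NA_character_,
  "Brigham Henry Roberts",
  c("Agatha Christie", "Hal Hougey")
)

# Drop books with no author recorded.
known <- authors_list[!is.na(authors_list)]

# Distinct combinations: each unique vector of co-authors counts once.
distinct_combinations <- length(unique(known))

# Distinct individuals: flatten first, then count unique names.
distinct_individuals <- length(unique(unlist(known)))

cat(distinct_combinations, "combinations,", distinct_individuals, "individuals\n")
```

On the real data, the same individual count could be taken from the unnested df, where each author already sits on a separate row.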
Find the author with the most book releases.
author_ranked <- book_names_clean %>%
dplyr::filter(!is.na(authors_list)) %>%
group_by(authors_list) %>%
summarise(total_books = n()) %>%
arrange(desc(total_books))
author_ranked %>%
head(10) %>% #use the head() function together with the kable() to display the top 10
kable() %>%
kable_styling(full_width = TRUE)
authors_list | total_books |
---|---|
Rose Arny | 236 |
William Shakespeare | 191 |
Library of Congress. Copyright Office | 178 |
Agatha Christie | 142 |
Erle Stanley Gardner | 124 |
“Louis LAmour” | 123 |
Charles Dickens | 89 |
Edgar Rice Burroughs | 85 |
Zane Grey | 75 |
Rudyard Kipling | 75 |
#print(head(author_ranked, 10))
I had never heard of “Rose Arny” but an internet search revealed,
“her prominence in the data could be due to her role in the publishing
industry rather than as an author of original works. She was the editor
of Publishers Weekly, a major trade publication in the book industry,
and her name often appeared in connection with book reviews,
announcements, and other publishing-related content.”
Some elements in the authors_list column include lists (e.g. $ : chr [1:3] “K. H. Scheer” “Wendayne Ackerman” “F. J. Ackerman”) rather than a simple character string. To get around this we mutate the column, collapsing list elements into a single string. Then use ggplot to produce a bar graph.
author_ranked <- author_ranked %>%
mutate(authors_list = sapply(authors_list, paste, collapse = " ")) # Collapse list elements into a single string
top_10_authors <- author_ranked %>%
head(10)
ggplot(top_10_authors, aes(y = reorder(authors_list, -total_books), x = total_books)) +
geom_bar(stat = "identity", fill = "#4183C4") +
labs(
title = "Top 10 Authors by books published", x = "Total published", y = "Author")
Wanting to understand more about “Rose Arny”, I list 10 books that name her as author.
rose_arny_books <- book_names_authors_unnested %>%
dplyr::filter(authors_list == "Rose Arny") %>%
select(-description, -categories) %>%
arrange((title)) %>%
slice_head(n = 10) # display only 10 for clarity
kable(rose_arny_books)
title | authors | publisher | publisheddate | authors_list | categories_list |
---|---|---|---|---|---|
01443 DEVELOPING SKILLS IN ALGEBRA ONE, BOOK C | [‘Rose Arny’] | NA | 1995 | Rose Arny | American literature |
1,000 Points of Light: The Public Remains in the Dark (Oswald’s Closest Friend: The George De Mohrenschildt Story, Volume 1) | [‘Rose Arny’] | NA | 1999 | Rose Arny | American literature |
36 propositions for a home/36 modeles pour une maison (English and French Edition) | [‘Rose Arny’] | NA | 1998 | Rose Arny | American literature |
A Bride for Crimson Falls (Silhouette Desire Ser, No. 1076) | [‘Rose Arny’] | NA | 1997 | Rose Arny | American literature |
A Lady’s Point of View (Harlequin Regency Romance, No 14) | [‘Rose Arny’] | NA | 2001-06 | Rose Arny | American literature |
A Man I Used to Know: Love that Man! (Harlequin Superromance No. 831) | [‘Rose Arny’] | NA | 1999-04 | Rose Arny | American literature |
A Man Like Michael (Desire Ser.) | [‘Rose Arny’] | NA | 1997 | Rose Arny | American literature |
A Mother’s Secrets (Randolph Family Ties, Book 3) (Harlequin Intrigue Series #577)) | [‘Rose Arny’] | NA | 2000 | Rose Arny | American literature |
A Perfect Pair (Silhouette Special Edition No. 1590) | [‘Rose Arny’] | NA | 2003 | Rose Arny | American literature |
A Rosey Little Christmas / Jingle Bell Bride (Harlequin Duets, No. 64) | [‘Rose Arny’] | NA | 2002 | Rose Arny | American literature |
The wide range of genres suggests there are errors in the data.
The titles all come under one category, “American literature”, which
reaffirms the role of “Rose Arny” in the publishing industry.
Check the titles under the author “Agatha Christie” for duplicates.
agatha_christie_books <- book_names_authors_unnested %>%
dplyr::filter(authors_list == "Agatha Christie") %>%
select(-description, -categories) %>%
arrange((title)) %>%
slice_head(n = 30) # limit to top 30 for clarity
agatha_christie_books %>%
head(10) %>%
kable() %>%
kable_styling(full_width = TRUE)
title | authors | publisher | publisheddate | authors_list | categories_list |
---|---|---|---|---|---|
13 CLUES FOR MISS MARPLE | [‘Agatha Christie’] | NA | 1975 | Agatha Christie | Detective and mystery stories |
13 For Luck | [‘Agatha Christie’] | Dell Publishing Company | 1961 | Agatha Christie | Fiction |
13 clues for Miss Marple;: A collection of mystery stories, | [‘Agatha Christie’] | NA | 1966 | Agatha Christie | Detective and mystery stories, English |
4.50 from Paddington (Agatha Christie Collection S.) | [‘Agatha Christie’] | NA | 2000 | Agatha Christie | England |
A Murder Is Announced | [‘Agatha Christie’] | HarperCollins | 2016-12-29 | Agatha Christie | NA |
A Murder Is Announced (Miss Marple Mysteries) | [‘Agatha Christie’] | HarperCollins | 2016-12-29 | Agatha Christie | NA |
A Pocket Full of Rye | [‘Agatha Christie’] | Signet Book | 2000 | Agatha Christie | Fiction |
A Pocketful of Rye | [‘Agatha Christie’] | NA | 2006 | Agatha Christie | Detective and mystery stories |
A Star over Bethlehem and Other Stories | [‘Agatha Christie’] | William Morrow Paperbacks | 2011-10-25 | Agatha Christie | Fiction |
A pocket full of rye | [‘Agatha Christie’] | Harper Collins | 2011-06-14 | Agatha Christie | Fiction |
The results show that, of the 142 book titles, some are republished works of the same title. They may also be listed under different categories. The title column should be normalised with lower case text and the removal of unnecessary characters and spaces.
Normalise the title column of agatha_christie_books with the stringr package.
agatha_christie_books <- agatha_christie_books %>%
mutate(title = str_replace_all(title, "[[:punct:]]", "")) %>% # Remove punctuation
mutate(title = str_to_lower(title)) %>% # Convert to lowercase
mutate(title = str_trim(title)) %>% # Trim whitespace
mutate(title = str_replace_all(title, "\\s+", " ")) # Replace extra spaces
Compare the sliced book titles to find similarly named books using the stringdist package, this time with a much smaller data set than we tried before. The fig.width and fig.height values for the code chunk were adjusted for clarity. The pheatmap package was used instead of the base R heatmap() function because it gave better customization options for text size.
similarity_scores <- stringdist::stringdistmatrix(agatha_christie_books$title, method = "jw")
similarity_matrix <- as.matrix(similarity_scores)
rownames(similarity_matrix) <- agatha_christie_books$title
colnames(similarity_matrix) <- agatha_christie_books$title
#print(similarity_scores)
# Create a heatmap to identify similar names
library(pheatmap)
## Warning: package 'pheatmap' was built under R version 4.4.3
pheatmap(as.matrix(similarity_matrix),
main = "Jaro-Winkler Similarity Scores",
fontsize_row = 7, fontsize_col = 7) # Adjust font sizes
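Find the closest non-identical match among the sliced titles. stringdistmatrix() returns distances, so the smallest value above zero marks the closest pair.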
which(similarity_matrix == min(similarity_matrix[similarity_matrix > 0]), arr.ind = TRUE)
## row col
## a pocketful of rye 8 7
## a pocket full of rye 7 8
## a pocket full of rye 10 8
## a pocketful of rye 8 10
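Each pair appears twice in the output above because the matrix is symmetric. Here is a minimal sketch of reporting each closest pair once by masking the upper triangle. To keep the example self-contained it uses base R adist() (Levenshtein edit distance) instead of the Jaro-Winkler distance from stringdist, so the numbers are not comparable with the heatmap above. The toy titles are assumed.

```r
titles <- c("a pocket full of rye", "a pocketful of rye", "a murder is announced")

# Pairwise edit distances (swapped in for Jaro-Winkler; see the note above).
d <- adist(titles)
dimnames(d) <- list(titles, titles)

# Keep only the lower triangle so every pair is listed once.
d[upper.tri(d, diag = TRUE)] <- NA

closest <- which(d == min(d, na.rm = TRUE), arr.ind = TRUE)
print(closest)
```

Masking the diagonal removes self-comparisons only; two different rows with identical titles would still show up with distance 0, which is usually what a duplicate check wants.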
Let's count and sort the categories to see how many books are in each category. We know from our exploration above that 19.4% of the categories_list values are NA, so we will remove them.
category_counts <- book_names_clean %>%
dplyr::filter(!is.na(categories_list)) %>%
group_by(categories_list) %>%
summarise(total_books = n()) %>%
arrange(desc(total_books)) %>%
slice_head(n = 10)
# View the results using kable to ensure categories_list names are printed.
kable(category_counts)
categories_list | total_books |
---|---|
Fiction | 23419 |
Religion | 9459 |
History | 9330 |
Juvenile Fiction | 6643 |
Biography & Autobiography | 6324 |
Business & Economics | 5625 |
Computers | 4312 |
Social Science | 3834 |
Juvenile Nonfiction | 3446 |
Science | 2623 |
Some elements of the categories_list in the category_counts df contain lists (e.g. chr [1:2] “Body” “Mind & Spirit”) rather than a simple character string. To get around this we mutate the column, collapsing list elements into a single string. Then use ggplot to create a bar graph.
category_counts <- category_counts %>%
mutate(categories_list = sapply(categories_list, paste, collapse = " ")) # Collapse list elements into a single string
ggplot(category_counts, aes(y = reorder(categories_list, -total_books), x = total_books)) +
geom_bar(stat = "identity", fill = "#4183C4") +
labs(
title = "Top 10 Categories by Total Books published", x = "Total published", y = "Category")
Fiction is by far the most common category.
Find the top 10 Fiction Authors.
top_10_fiction_writers <- book_names_categories_unnested %>%
dplyr::filter(categories_list == "Fiction") %>%
group_by(authors_list) %>%
summarise(total_books = n()) %>%
arrange(desc(total_books)) %>%
slice_head(n = 10)
kable(top_10_fiction_writers)
authors_list | total_books |
---|---|
“Louis LAmour” | 105 |
Agatha Christie | 76 |
Nora Roberts | 51 |
Edgar Rice Burroughs | 48 |
Georgette Heyer | 43 |
Zane Grey | 40 |
Stephen King | 38 |
Rex Stout | 38 |
William W. Johnstone | 37 |
Danielle Steel | 36 |
A quick internet search revealed: “Louis L’Amour was a prolific American author best known for his Western novels, which he often referred to as ‘frontier stories’. His most famous works include Hondo, Last of the Breed, and the Sackett series, which remains a cornerstone of Western literature.”
Explore which year had the most books published. We know from our previous exploration that 17.2% of the published date values are NA values so we will ignore these.
book_year_count <- book_names_clean %>%
dplyr::filter(!is.na(publisheddate)) %>%
group_by(publisheddate) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
slice_head(n = 50)
book_year_count %>%
head(10) %>%
kable() %>%
kable_styling(position = "left")
publisheddate | count |
---|---|
2004 | 7337 |
2003 | 7154 |
2005 | 6992 |
2002 | 6940 |
2000 | 6799 |
2001 | 6419 |
1999 | 6173 |
1998 | 5582 |
2012 | 5247 |
2013 | 5001 |
#print(book_year_count)
Create a bar graph showing the most popular years for books published.
ggplot(book_year_count, aes(x = publisheddate, y = count)) +
geom_bar(stat = "identity", fill = "#4183C4") +
labs(title = "Top 50 years for number of books published")
The top 10 contains no years from 2006 to 2011. Further research suggested
a possible explanation: the global financial crisis of 2008 had a
significant effect on consumer spending, which also affected book
publishing. Publishers were less willing to invest in new authors and
new titles. The rise of digital media was also influencing consumer
behavior, with people turning online for entertainment and information.
Find the most common words in the description column. Start by breaking the description into individual words and counting them. Then compare that list to stop_words in the tidytext package to remove common words.
word_counts <- book_names_clean %>%
dplyr::filter(!is.na(description)) %>% # filter out rows with NA in the description
unnest_tokens(word, description) %>% # break descriptions into individual words
mutate(word = str_to_lower(word)) %>% # convert all words to lowercase
count(word, sort = TRUE) # count occurrences of each word and sort by frequency
# use the tidytext package to remove common stop words ("the", "and", etc.)
word_counts <- word_counts %>%
anti_join(stop_words, by = "word") # this will compare words in the word column to the stop_words list
#top_10_words <- word_counts # create a list of top 10 common words
# slice_head(n = 10)
#print(top_10_words)
word_counts %>%
head(10) %>%
kable() %>%
kable_styling(full_width = TRUE)
word | n |
---|---|
book | 62910 |
life | 37923 |
world | 30768 |
time | 21432 |
history | 21220 |
story | 17688 |
guide | 17099 |
author | 16805 |
edition | 16463 |
american | 16353 |
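The tokenise, filter and count steps can be sketched in base R on a toy input. The descriptions and the tiny stop-word vector below are assumed for illustration. The analysis itself uses unnest_tokens() and the much larger stop_words table from tidytext.

```r
# Toy descriptions (assumed) and a tiny stand-in for a stop-word list.
descriptions <- c("The story of a detective.",
                  "A detective story and a murder.",
                  NA)
stop_words_small <- c("the", "of", "a", "and")

# 1. Drop missing descriptions, lower-case, split on anything that is not a letter or apostrophe.
words <- unlist(strsplit(tolower(descriptions[!is.na(descriptions)]), "[^a-z']+"))

# 2. Remove empty tokens and stop words.
words <- words[nzchar(words) & !words %in% stop_words_small]

# 3. Count and sort by frequency.
word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)
```

Removing stop words before counting, as here, gives the same counts as the anti_join after count() used above, because the join only drops whole rows.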
Find the top 10 words used in the description column to describe fiction books, using the book_names_categories_unnested df we created previously. Using unnest_tokens in the earlier code chunk, when creating book_names_categories_unnested, caused errors. By filtering before unnesting here, instead of when we created the book_names_categories_unnested df, we greatly reduce the computing power required. Note that str_detect() also matches any category that contains “Fiction”, such as Juvenile Fiction.
fiction_words <- book_names_categories_unnested %>%
dplyr::filter(str_detect(categories_list, "Fiction"), !is.na(description)) %>% # filter for Fiction and remove NA descriptions
unnest_tokens(word, description) %>% # break descriptions into individual words
count(word, sort = TRUE) %>% # Count occurrences of each word and sort
anti_join(stop_words, by = "word") # this will compare words in the word column to the stop_words list
fiction_words %>%
head(10) %>%
kable() %>%
kable_styling(full_width = TRUE)
word | n |
---|---|
life | 9353 |
world | 7345 |
love | 6839 |
book | 6499 |
story | 6279 |
time | 5141 |
de | 5097 |
family | 5063 |
author | 4661 |
stories | 3790 |
#print(head(fiction_words, 10))
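The difference between a substring match and an exact match on the category can be shown with a short base R sketch. It uses grepl() with fixed = TRUE as a stand-in for str_detect(), on a few category names taken from the table earlier.

```r
categories <- c("Fiction", "Juvenile Fiction", "Juvenile Nonfiction", "History")

# Substring match (what str_detect(categories_list, "Fiction") does):
# case-sensitive, so "Nonfiction" is not matched but "Juvenile Fiction" is.
substring_match <- grepl("Fiction", categories, fixed = TRUE)

# Exact match (what categories_list == "Fiction" does).
exact_match <- categories == "Fiction"

print(data.frame(categories, substring_match, exact_match))
```

This is why the fiction word counts here draw on a broader set of books than the top 10 fiction authors table, which filtered with ==.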
Find the top 10 words used in the description column to describe
Agatha Christie's books, using the book_names_authors_unnested df we
created earlier. By filtering before unnesting and converting to lower
case here, instead of when we created the book_names_authors_unnested df,
we greatly reduce the computing power required and avoid unnecessary
computations.
agatha_christie_words <- book_names_authors_unnested %>%
dplyr::filter(str_detect(authors_list, "Agatha Christie"), !is.na(description)) %>%
unnest_tokens(word, description) %>% # break descriptions into individual words
mutate(word = str_to_lower(word)) %>% # Convert all words to lowercase
count(word, sort = TRUE) %>% # Count occurrences of each word and sort
anti_join(stop_words, by = "word") # remove words that appear in the stop_words list
agatha_christie_words %>%
head(10) %>%
kable() %>%
kable_styling(full_width = TRUE)
word | n |
---|---|
poirot | 82 |
agatha | 60 |
hercule | 60 |
murder | 53 |
mystery | 52 |
miss | 37 |
marple | 33 |
christie | 30 |
murderer | 30 |
detective | 27 |
#print(head(agatha_christie_words, 10))
An internet search revealed “Hercule Poirot is a fictional Belgian
detective created by British writer Agatha Christie. Poirot is
Christie’s most famous and longest-running character, appearing in 33
novels, two plays, and 51 short stories published between 1920 and 1975”
Find the top 10 words for William Shakespeare.
Shakespeare_words <- book_names_authors_unnested %>%
dplyr::filter(str_detect(authors_list, "William Shakespeare"), !is.na(description)) %>%
unnest_tokens(word, description) %>% # break descriptions into individual words
mutate(word = str_to_lower(word)) %>% # Convert all words to lowercase
count(word, sort = TRUE) %>% # Count occurrences of each word and sort
anti_join(stop_words, by = "word") # remove words that appear in the stop_words list
Shakespeare_words %>%
head(10) %>%
kable() %>%
kable_styling(full_width = TRUE)
word | n |
---|---|
shakespeare | 191 |
play | 124 |
shakespeare’s | 92 |
edition | 84 |
text | 80 |
plays | 56 |
folger | 55 |
notes | 52 |
introduction | 47 |
series | 46 |
# print(head(Shakespeare_words, 10))
An internet search revealed “Folger is most likely referring to the
Folger Shakespeare Library, a renowned institution dedicated to the
works of William Shakespeare and the early modern period. Located in
Washington, D.C., the Folger Shakespeare Library houses the world’s
largest collection of Shakespeare’s printed works, including rare
editions like the First Folio”
You live and learn.
References
Dataset Citation
- Source: Kaggle
- Dataset: Amazon Books Reviews
- License: The dataset is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
This analysis was conducted with the assistance of Microsoft Copilot, an AI companion
created by Microsoft, to help process data and provide coding
assistance.
citation("readr") citation("lubridate") citation("knitr")
citation("rmarkdown") citation("tidyverse") citation("janitor")
citation("naniar") citation("stringdist") citation("tidytext")
citation("kableExtra") citation("pheatmap")