Data Analysis in R - Heart attack data exploration

Heart attack social and economic factors

Summary.

After initial cleaning and analysis the data set was found to contain manipulated data. The dataset exhibits an almost perfectly even distribution across several variables, including gender, outcome, education, diet and average age. The relationship between education level and average income is not inline with real world trends. The conclusion is that the dataset is certainly flawed due to manipulation, making the data unreliable for drawing meaningful conclusion about the social economic factors affecting heart attack outcomes.

Set up environment.

library(tidyverse)
library(readr)

Import the data set as ha_data_complete.

ha_data_complete <-heart_attack_dataset <- read_csv("heart_attack_dataset.csv", show_col_types = FALSE)

View first 6 lines of the df.

head(ha_data_complete)

## # A tibble: 6 × 32
##     Age Gender Cholesterol BloodPressure HeartRate   BMI Smoker Diabetes
##   <dbl> <chr>        <dbl>         <dbl>     <dbl> <dbl>  <dbl>    <dbl>
## 1    31 Male           194           162        71  22.9      0        1
## 2    69 Male           208           148        93  33.9      1        1
## 3    34 Female         132           161        94  34        0        0
## 4    53 Male           268           134        91  35        0        1
## 5    57 Female         203           140        75  30.1      0        1
## 6    41 Male           158           154        72  38.7      0        1
## # ℹ 24 more variables: Hypertension <dbl>, FamilyHistory <dbl>,
## #   PhysicalActivity <dbl>, AlcoholConsumption <dbl>, Diet <chr>,
## #   StressLevel <dbl>, Ethnicity <chr>, Income <dbl>, EducationLevel <chr>,
## #   Medication <chr>, ChestPainType <chr>, ECGResults <chr>,
## #   MaxHeartRate <dbl>, ST_Depression <dbl>, ExerciseInducedAngina <chr>,
## #   Slope <chr>, NumberOfMajorVessels <dbl>, Thalassemia <chr>,
## #   PreviousHeartAttack <dbl>, StrokeHistory <dbl>, Residence <chr>, …

View the structure of the df.

glimpse(ha_data_complete)

## Rows: 372,974
## Columns: 32
## $ Age                   <dbl> 31, 69, 34, 53, 57, 41, 45, 61, 45, 30, 34, 84, …
## $ Gender                <chr> "Male", "Male", "Female", "Male", "Female", "Mal…
## $ Cholesterol           <dbl> 194, 208, 132, 268, 203, 158, 237, 218, 216, 143…
## $ BloodPressure         <dbl> 162, 148, 161, 134, 140, 154, 168, 113, 114, 130…
## $ HeartRate             <dbl> 71, 93, 94, 91, 75, 72, 114, 115, 72, 91, 111, 9…
## $ BMI                   <dbl> 22.9, 33.9, 34.0, 35.0, 30.1, 38.7, 21.7, 23.1, …
## $ Smoker                <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, …
## $ Diabetes              <dbl> 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, …
## $ Hypertension          <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ FamilyHistory         <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, …
## $ PhysicalActivity      <dbl> 6, 1, 1, 6, 4, 3, 0, 4, 2, 6, 1, 0, 3, 1, 4, 4, …
## $ AlcoholConsumption    <dbl> 0, 2, 3, 0, 1, 2, 2, 4, 4, 2, 4, 4, 2, 0, 4, 0, …
## $ Diet                  <chr> "Unhealthy", "Unhealthy", "Healthy", "Healthy", …
## $ StressLevel           <dbl> 1, 6, 3, 3, 1, 4, 5, 5, 2, 3, 9, 2, 4, 8, 9, 3, …
## $ Ethnicity             <chr> "Hispanic", "Asian", "Black", "Hispanic", "Hispa…
## $ Income                <dbl> 64510, 91773, 173550, 43861, 83404, 113011, 6106…
## $ EducationLevel        <chr> "High School", "College", "College", "High Schoo…
## $ Medication            <chr> "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "…
## $ ChestPainType         <chr> "Typical", "Atypical", "Non-anginal", "Atypical"…
## $ ECGResults            <chr> "ST-T abnormality", "LV hypertrophy", "Normal", …
## $ MaxHeartRate          <dbl> 173, 189, 122, 104, 126, 155, 171, 101, 174, 130…
## $ ST_Depression         <dbl> 0.52, 3.79, 0.17, 0.67, 5.00, 4.30, 2.48, 2.81, …
## $ ExerciseInducedAngina <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "…
## $ Slope                 <chr> "Downsloping", "Upsloping", "Upsloping", "Flat",…
## $ NumberOfMajorVessels  <dbl> 1, 2, 0, 0, 0, 2, 0, 1, 3, 0, 1, 0, 1, 3, 1, 1, …
## $ Thalassemia           <chr> "Normal", "Normal", "Normal", "Reversible defect…
## $ PreviousHeartAttack   <dbl> 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, …
## $ StrokeHistory         <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Residence             <chr> "Suburban", "Suburban", "Rural", "Suburban", "Ru…
## $ EmploymentStatus      <chr> "Retired", "Unemployed", "Retired", "Retired", "…
## $ MaritalStatus         <chr> "Single", "Married", "Single", "Widowed", "Marri…
## $ Outcome               <chr> "No Heart Attack", "No Heart Attack", "Heart Att…

Create the social_economic df and select relevant columns.

# Create a social_economic df and select the relevant columns.
social_economic <- ha_data_complete %>%
  select(Age, Gender, Diet, Income, Ethnicity, EducationLevel, EmploymentStatus, Residence, MaritalStatus, Outcome)

# Rename columns for better readability
social_economic <- social_economic %>%
  rename(
    age = Age,
    gender = Gender,
    diet = Diet,
    income = Income,
    education = EducationLevel,
    ethnicity = Ethnicity,
    employment = EmploymentStatus,
    residence = Residence,
    married = MaritalStatus,
    outcome = Outcome
  )
summary(social_economic)

##       age           gender              diet               income      
##  Min.   :30.00   Length:372974      Length:372974      Min.   : 20000  
##  1st Qu.:43.00   Class :character   Class :character   1st Qu.: 64957  
##  Median :57.00   Mode  :character   Mode  :character   Median :110111  
##  Mean   :56.98                                         Mean   :110033  
##  3rd Qu.:71.00                                         3rd Qu.:155012  
##  Max.   :84.00                                         Max.   :199999  
##   ethnicity          education          employment         residence        
##  Length:372974      Length:372974      Length:372974      Length:372974     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    married            outcome         
##  Length:372974      Length:372974     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

Convert selected columns from Character to factors for later analysis.

# Convert categorical columns to factors
social_economic <- social_economic %>%
  mutate(
    gender = as.factor(gender),
    diet = as.factor(diet),
    education = as.factor(education),
    ethnicity = as.factor(ethnicity),
    employment = as.factor(employment),
    residence = as.factor(residence),
    married = as.factor(married),
    outcome = as.factor(outcome)
  )

# Check for missing values and summarize data
summary(social_economic)

##       age           gender              diet            income      
##  Min.   :30.00   Female:186204   Healthy  :124091   Min.   : 20000  
##  1st Qu.:43.00   Male  :186770   Moderate :125079   1st Qu.: 64957  
##  Median :57.00                   Unhealthy:123804   Median :110111  
##  Mean   :56.98                                      Mean   :110033  
##  3rd Qu.:71.00                                      3rd Qu.:155012  
##  Max.   :84.00                                      Max.   :199999  
##     ethnicity            education           employment        residence     
##  Asian   :74942   College     :124040   Employed  :123895   Rural   :124336  
##  Black   :74532   High School :124912   Retired   :124373   Suburban:124721  
##  Hispanic:74350   Postgraduate:124022   Unemployed:124706   Urban   :123917  
##  Other   :74181                                                              
##  White   :74969                                                              
##                                                                              
##      married                 outcome      
##  Divorced:93094   Heart Attack   :186316  
##  Married :92814   No Heart Attack:186658  
##  Single  :93599                           
##  Widowed :93467                           
##                                           
##

##### The data set includes an almost even number of male/females participants. Along with almost all the other values being balanced. This could indicate that the data set is biased or has been manipulated. Data set provenance indicates the data was webscraped from 3 data websites.

Check for missing values in the df.

colSums(is.na(social_economic))

##        age     gender       diet     income  ethnicity  education employment 
##          0          0          0          0          0          0          0 
##  residence    married    outcome 
##          0          0          0

Find duplicate values in the df which could further indicate the data set has been manipulated.

sum(duplicated(social_economic))

## [1] 1

Create a new df called social_economic_clean, remove duplicate value and confirm no duplicate values exist.

social_economic_clean <- social_economic[!duplicated(social_economic),]
sum(duplicated(social_economic_clean))

## [1] 0

Analysis

Calculate the min, max and mean age of heart attack victims and non heart attack victims.

min_max_age <-  social_economic_clean %>% 
  group_by(outcome) %>% 
  summarise(min(age), max(age), Average_age =round(mean(age), 1))
print(min_max_age)

## # A tibble: 2 × 4
##   outcome         `min(age)` `max(age)` Average_age
##   <fct>                <dbl>      <dbl>       <dbl>
## 1 Heart Attack            30         84          57
## 2 No Heart Attack         30         84          57

Examine gender distribution and mean age for both victims and non victims

gender_distribution <- social_economic_clean %>%
  group_by(outcome, gender) %>% 
  summarise(Count = n(), Average_age = round(mean(age) ,1), .groups = "drop")


# Print the distribution
print(gender_distribution)

## # A tibble: 4 × 4
##   outcome         gender Count Average_age
##   <fct>           <fct>  <int>       <dbl>
## 1 Heart Attack    Female 93193        57  
## 2 Heart Attack    Male   93123        56.9
## 3 No Heart Attack Female 93010        57  
## 4 No Heart Attack Male   93647        57

Again the figures in both the results above are very closely matched indicating sampling bias or manipulated data.

Examine data integrity further by examining relationship between education and average income

income_education <- social_economic_clean %>% 
  group_by(education) %>% 
  summarise(Average_Income = round(mean(income, 1)))

print(income_education, col.names = c("Education Level", "Average Income"))

## Warning: `...` must be empty in `format.tbl()`
## Caused by error in `format_tbl()`:
## ! `...` must be empty.
## ✖ Problematic argument:
## • col.names = c("Education Level", "Average Income")

## # A tibble: 3 × 2
##   education    Average_Income
##   <fct>                 <dbl>
## 1 College              109975
## 2 High School          110183
## 3 Postgraduate         110182

Create plot to show education level and income.

ggplot(social_economic_clean, aes(x = income, y = education, colour = education)) +
  geom_point() +
  labs(title = "Education v Income")

#### Again the data above is very closely matched and not in accordance with accepted norms. This indicates that the data has been manipulated.

Examine relationship between diet and outcome.

diet_outcome <- table(social_economic_clean$diet, social_economic_clean$outcome)

# Print the table
print(diet_outcome)

##            
##             Heart Attack No Heart Attack
##   Healthy          62046           62045
##   Moderate         62502           62576
##   Unhealthy        61768           62036

Examine relationship between education and diet.

education_diet <-table(social_economic_clean$education, social_economic_clean$diet)

print(education_diet)

##               
##                Healthy Moderate Unhealthy
##   College        41082    41804     41153
##   High School    41697    41697     41518
##   Postgraduate   41312    41577     41133

Examine relationship between residence and diet.

residence_diet <-table(social_economic_clean$residence, social_economic_clean$diet)

print(residence_diet)

##           
##            Healthy Moderate Unhealthy
##   Rural      41546    41685     41105
##   Suburban   41509    41798     41413
##   Urban      41036    41595     41286

Examine the relationship between married status and outcome.

marrier_outcome <-table(social_economic_clean$married, social_economic_clean$outcome)

print(marrier_outcome)

##           
##            Heart Attack No Heart Attack
##   Divorced        46426           46667
##   Married         46490           46324
##   Single          46767           46832
##   Widowed         46633           46834

Examine the relationship between diet, ethnicity and outcome.

ggplot(social_economic_clean) +
  geom_jitter(mapping = aes(x = diet, y = ethnicity, colour = outcome)) +
  labs(title = "Diet v Ethnicity")

Examine the releationship between gener, age and outcome.

ggplot(social_economic_clean) +
  geom_jitter(mapping = aes(x = gender, y = age, colour = outcome)) +
  labs(title = "Gender v Age")

ggplot(social_economic_clean) +
  geom_jitter(mapping = aes(x = income, y = diet, colour = outcome)) +
  labs(title = "Income V Diet")

The plots above show there is a balanced result between diet, income, gender, age and ethinicity in this dataset.

References

####Data Source

misc{panday2021,
author = {Ankush Panday}, title = {Heart Attack Prediction in United States}, year = {2021}, publisher = {Kaggle}, journal = {Kaggle}, howpublished = {},

License The dataset is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

citation("tidyverse")

## To cite package 'tidyverse' in publications use:
## 
##   Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
##   Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
##   E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
##   Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
##   the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
##   doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Welcome to the {tidyverse}},
##     author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
##     year = {2019},
##     journal = {Journal of Open Source Software},
##     volume = {4},
##     number = {43},
##     pages = {1686},
##     doi = {10.21105/joss.01686},
##   }

citation("readr")

## To cite package 'readr' in publications use:
## 
##   Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text
##   Data_. R package version 2.1.5,
##   <https://CRAN.R-project.org/package=readr>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {readr: Read Rectangular Text Data},
##     author = {Hadley Wickham and Jim Hester and Jennifer Bryan},
##     year = {2024},
##     note = {R package version 2.1.5},
##     url = {https://CRAN.R-project.org/package=readr},
##   }

citation("lubridate")

## To cite lubridate in publications use:
## 
##   Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy
##   with lubridate. Journal of Statistical Software, 40(3), 1-25. URL
##   https://www.jstatsoft.org/v40/i03/.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Dates and Times Made Easy with {lubridate}},
##     author = {Garrett Grolemund and Hadley Wickham},
##     journal = {Journal of Statistical Software},
##     year = {2011},
##     volume = {40},
##     number = {3},
##     pages = {1--25},
##     url = {https://www.jstatsoft.org/v40/i03/},
##   }

citation("kableExtra")

## To cite package 'kableExtra' in publications use:
## 
##   Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
##   Pipe Syntax_. R package version 1.4.0,
##   <https://CRAN.R-project.org/package=kableExtra>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {kableExtra: Construct Complex Table with 'kable' and Pipe Syntax},
##     author = {Hao Zhu},
##     year = {2024},
##     note = {R package version 1.4.0},
##     url = {https://CRAN.R-project.org/package=kableExtra},
##   }

```

richrowe.uk

Data Analysis in R - Heart attack data exploration

Heart attack social and economic factors

Richard_Rowe

2025-02-23

Summary.

Analysis

Again the figures in both the results above are very closely matched indicating sampling bias or manipulated data.

The plots above show there is a balanced result between diet, income, gender, age and ethinicity in this dataset.

References

Contact form

richrowe.uk

Data Analysis in R - Heart attack data exploration

Heart attack social and economic factors

Richard_Rowe

2025-02-23

A social economic look at the USA heart attack prediction dataset.

Summary.

Analysis

Again the figures in both the results above are very closely matched indicating sampling bias or manipulated data.

The plots above show there is a balanced result between diet, income, gender, age and ethinicity in this dataset.

References

Contact form