Heart attack social and economic factors
Richard_Rowe
2025-02-23
Summary.
After initial cleaning and analysis the data set was found to contain manipulated data. The dataset exhibits an almost perfectly even distribution across several variables, including gender, outcome, education, diet and average age. The relationship between education level and average income is not inline with real world trends. The conclusion is that the dataset is certainly flawed due to manipulation, making the data unreliable for drawing meaningful conclusion about the social economic factors affecting heart attack outcomes.
Set up environment.
library(tidyverse)
library(readr)
Import the data set as ha_data_complete.
ha_data_complete <-heart_attack_dataset <- read_csv("heart_attack_dataset.csv", show_col_types = FALSE)
View first 6 lines of the df.
head(ha_data_complete)
## # A tibble: 6 × 32
## Age Gender Cholesterol BloodPressure HeartRate BMI Smoker Diabetes
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31 Male 194 162 71 22.9 0 1
## 2 69 Male 208 148 93 33.9 1 1
## 3 34 Female 132 161 94 34 0 0
## 4 53 Male 268 134 91 35 0 1
## 5 57 Female 203 140 75 30.1 0 1
## 6 41 Male 158 154 72 38.7 0 1
## # ℹ 24 more variables: Hypertension <dbl>, FamilyHistory <dbl>,
## # PhysicalActivity <dbl>, AlcoholConsumption <dbl>, Diet <chr>,
## # StressLevel <dbl>, Ethnicity <chr>, Income <dbl>, EducationLevel <chr>,
## # Medication <chr>, ChestPainType <chr>, ECGResults <chr>,
## # MaxHeartRate <dbl>, ST_Depression <dbl>, ExerciseInducedAngina <chr>,
## # Slope <chr>, NumberOfMajorVessels <dbl>, Thalassemia <chr>,
## # PreviousHeartAttack <dbl>, StrokeHistory <dbl>, Residence <chr>, …
- View the structure of the df.
glimpse(ha_data_complete)
## Rows: 372,974
## Columns: 32
## $ Age <dbl> 31, 69, 34, 53, 57, 41, 45, 61, 45, 30, 34, 84, …
## $ Gender <chr> "Male", "Male", "Female", "Male", "Female", "Mal…
## $ Cholesterol <dbl> 194, 208, 132, 268, 203, 158, 237, 218, 216, 143…
## $ BloodPressure <dbl> 162, 148, 161, 134, 140, 154, 168, 113, 114, 130…
## $ HeartRate <dbl> 71, 93, 94, 91, 75, 72, 114, 115, 72, 91, 111, 9…
## $ BMI <dbl> 22.9, 33.9, 34.0, 35.0, 30.1, 38.7, 21.7, 23.1, …
## $ Smoker <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, …
## $ Diabetes <dbl> 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, …
## $ Hypertension <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ FamilyHistory <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, …
## $ PhysicalActivity <dbl> 6, 1, 1, 6, 4, 3, 0, 4, 2, 6, 1, 0, 3, 1, 4, 4, …
## $ AlcoholConsumption <dbl> 0, 2, 3, 0, 1, 2, 2, 4, 4, 2, 4, 4, 2, 0, 4, 0, …
## $ Diet <chr> "Unhealthy", "Unhealthy", "Healthy", "Healthy", …
## $ StressLevel <dbl> 1, 6, 3, 3, 1, 4, 5, 5, 2, 3, 9, 2, 4, 8, 9, 3, …
## $ Ethnicity <chr> "Hispanic", "Asian", "Black", "Hispanic", "Hispa…
## $ Income <dbl> 64510, 91773, 173550, 43861, 83404, 113011, 6106…
## $ EducationLevel <chr> "High School", "College", "College", "High Schoo…
## $ Medication <chr> "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "…
## $ ChestPainType <chr> "Typical", "Atypical", "Non-anginal", "Atypical"…
## $ ECGResults <chr> "ST-T abnormality", "LV hypertrophy", "Normal", …
## $ MaxHeartRate <dbl> 173, 189, 122, 104, 126, 155, 171, 101, 174, 130…
## $ ST_Depression <dbl> 0.52, 3.79, 0.17, 0.67, 5.00, 4.30, 2.48, 2.81, …
## $ ExerciseInducedAngina <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "…
## $ Slope <chr> "Downsloping", "Upsloping", "Upsloping", "Flat",…
## $ NumberOfMajorVessels <dbl> 1, 2, 0, 0, 0, 2, 0, 1, 3, 0, 1, 0, 1, 3, 1, 1, …
## $ Thalassemia <chr> "Normal", "Normal", "Normal", "Reversible defect…
## $ PreviousHeartAttack <dbl> 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, …
## $ StrokeHistory <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Residence <chr> "Suburban", "Suburban", "Rural", "Suburban", "Ru…
## $ EmploymentStatus <chr> "Retired", "Unemployed", "Retired", "Retired", "…
## $ MaritalStatus <chr> "Single", "Married", "Single", "Widowed", "Marri…
## $ Outcome <chr> "No Heart Attack", "No Heart Attack", "Heart Att…
Create the social_economic df and select relevant columns.
# Create a social_economic df and select the relevant columns.
social_economic <- ha_data_complete %>%
select(Age, Gender, Diet, Income, Ethnicity, EducationLevel, EmploymentStatus, Residence, MaritalStatus, Outcome)
# Rename columns for better readability
social_economic <- social_economic %>%
rename(
age = Age,
gender = Gender,
diet = Diet,
income = Income,
education = EducationLevel,
ethnicity = Ethnicity,
employment = EmploymentStatus,
residence = Residence,
married = MaritalStatus,
outcome = Outcome
)
summary(social_economic)
## age gender diet income
## Min. :30.00 Length:372974 Length:372974 Min. : 20000
## 1st Qu.:43.00 Class :character Class :character 1st Qu.: 64957
## Median :57.00 Mode :character Mode :character Median :110111
## Mean :56.98 Mean :110033
## 3rd Qu.:71.00 3rd Qu.:155012
## Max. :84.00 Max. :199999
## ethnicity education employment residence
## Length:372974 Length:372974 Length:372974 Length:372974
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## married outcome
## Length:372974 Length:372974
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Convert selected columns from Character to factors for later analysis.
# Convert categorical columns to factors
social_economic <- social_economic %>%
mutate(
gender = as.factor(gender),
diet = as.factor(diet),
education = as.factor(education),
ethnicity = as.factor(ethnicity),
employment = as.factor(employment),
residence = as.factor(residence),
married = as.factor(married),
outcome = as.factor(outcome)
)
# Check for missing values and summarize data
summary(social_economic)
## age gender diet income
## Min. :30.00 Female:186204 Healthy :124091 Min. : 20000
## 1st Qu.:43.00 Male :186770 Moderate :125079 1st Qu.: 64957
## Median :57.00 Unhealthy:123804 Median :110111
## Mean :56.98 Mean :110033
## 3rd Qu.:71.00 3rd Qu.:155012
## Max. :84.00 Max. :199999
## ethnicity education employment residence
## Asian :74942 College :124040 Employed :123895 Rural :124336
## Black :74532 High School :124912 Retired :124373 Suburban:124721
## Hispanic:74350 Postgraduate:124022 Unemployed:124706 Urban :123917
## Other :74181
## White :74969
##
## married outcome
## Divorced:93094 Heart Attack :186316
## Married :92814 No Heart Attack:186658
## Single :93599
## Widowed :93467
##
##
##### The data set includes an almost even number of
male/females participants. Along with almost all the other values being
balanced. This could indicate that the data set is biased or has been
manipulated. Data set provenance indicates the data was webscraped from
3 data websites.
Check for missing values in the df.
colSums(is.na(social_economic))
## age gender diet income ethnicity education employment
## 0 0 0 0 0 0 0
## residence married outcome
## 0 0 0
Find duplicate values in the df which could further indicate the data set has been manipulated.
sum(duplicated(social_economic))
## [1] 1
Create a new df called social_economic_clean, remove duplicate value and confirm no duplicate values exist.
social_economic_clean <- social_economic[!duplicated(social_economic),]
sum(duplicated(social_economic_clean))
## [1] 0
Analysis
Calculate the min, max and mean age of heart attack victims and non heart attack victims.
min_max_age <- social_economic_clean %>%
group_by(outcome) %>%
summarise(min(age), max(age), Average_age =round(mean(age), 1))
print(min_max_age)
## # A tibble: 2 × 4
## outcome `min(age)` `max(age)` Average_age
## <fct> <dbl> <dbl> <dbl>
## 1 Heart Attack 30 84 57
## 2 No Heart Attack 30 84 57
Examine gender distribution and mean age for both victims and non victims
gender_distribution <- social_economic_clean %>%
group_by(outcome, gender) %>%
summarise(Count = n(), Average_age = round(mean(age) ,1), .groups = "drop")
# Print the distribution
print(gender_distribution)
## # A tibble: 4 × 4
## outcome gender Count Average_age
## <fct> <fct> <int> <dbl>
## 1 Heart Attack Female 93193 57
## 2 Heart Attack Male 93123 56.9
## 3 No Heart Attack Female 93010 57
## 4 No Heart Attack Male 93647 57
Again the figures in both the results above are very closely matched indicating sampling bias or manipulated data.
Examine data integrity further by examining relationship between education and average income
income_education <- social_economic_clean %>%
group_by(education) %>%
summarise(Average_Income = round(mean(income, 1)))
print(income_education, col.names = c("Education Level", "Average Income"))
## Warning: `...` must be empty in `format.tbl()`
## Caused by error in `format_tbl()`:
## ! `...` must be empty.
## ✖ Problematic argument:
## • col.names = c("Education Level", "Average Income")
## # A tibble: 3 × 2
## education Average_Income
## <fct> <dbl>
## 1 College 109975
## 2 High School 110183
## 3 Postgraduate 110182
Create plot to show education level and income.
ggplot(social_economic_clean, aes(x = income, y = education, colour = education)) +
geom_point() +
labs(title = "Education v Income")
#### Again the data above is very closely matched and not in accordance
with accepted norms. This indicates that the data has been
manipulated.
Examine relationship between diet and outcome.
diet_outcome <- table(social_economic_clean$diet, social_economic_clean$outcome)
# Print the table
print(diet_outcome)
##
## Heart Attack No Heart Attack
## Healthy 62046 62045
## Moderate 62502 62576
## Unhealthy 61768 62036
Examine relationship between education and diet.
education_diet <-table(social_economic_clean$education, social_economic_clean$diet)
print(education_diet)
##
## Healthy Moderate Unhealthy
## College 41082 41804 41153
## High School 41697 41697 41518
## Postgraduate 41312 41577 41133
Examine relationship between residence and diet.
residence_diet <-table(social_economic_clean$residence, social_economic_clean$diet)
print(residence_diet)
##
## Healthy Moderate Unhealthy
## Rural 41546 41685 41105
## Suburban 41509 41798 41413
## Urban 41036 41595 41286
Examine the relationship between married status and outcome.
marrier_outcome <-table(social_economic_clean$married, social_economic_clean$outcome)
print(marrier_outcome)
##
## Heart Attack No Heart Attack
## Divorced 46426 46667
## Married 46490 46324
## Single 46767 46832
## Widowed 46633 46834
Examine the relationship between diet, ethnicity and outcome.
ggplot(social_economic_clean) +
geom_jitter(mapping = aes(x = diet, y = ethnicity, colour = outcome)) +
labs(title = "Diet v Ethnicity")
Examine the releationship between gener, age and outcome.
ggplot(social_economic_clean) +
geom_jitter(mapping = aes(x = gender, y = age, colour = outcome)) +
labs(title = "Gender v Age")
ggplot(social_economic_clean) +
geom_jitter(mapping = aes(x = income, y = diet, colour = outcome)) +
labs(title = "Income V Diet")
The plots above show there is a balanced result between diet, income, gender, age and ethinicity in this dataset.
References
####Data Source
- misc{panday2021,
- author = {Ankush Panday}, title = {Heart Attack Prediction in United States}, year = {2021}, publisher = {Kaggle}, journal = {Kaggle}, howpublished = {},
License The dataset is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
citation("tidyverse")
## To cite package 'tidyverse' in publications use:
##
## Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
## E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
## Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
## the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
## doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Welcome to the {tidyverse}},
## author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
## year = {2019},
## journal = {Journal of Open Source Software},
## volume = {4},
## number = {43},
## pages = {1686},
## doi = {10.21105/joss.01686},
## }
citation("readr")
## To cite package 'readr' in publications use:
##
## Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text
## Data_. R package version 2.1.5,
## <https://CRAN.R-project.org/package=readr>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {readr: Read Rectangular Text Data},
## author = {Hadley Wickham and Jim Hester and Jennifer Bryan},
## year = {2024},
## note = {R package version 2.1.5},
## url = {https://CRAN.R-project.org/package=readr},
## }
citation("lubridate")
## To cite lubridate in publications use:
##
## Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy
## with lubridate. Journal of Statistical Software, 40(3), 1-25. URL
## https://www.jstatsoft.org/v40/i03/.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Dates and Times Made Easy with {lubridate}},
## author = {Garrett Grolemund and Hadley Wickham},
## journal = {Journal of Statistical Software},
## year = {2011},
## volume = {40},
## number = {3},
## pages = {1--25},
## url = {https://www.jstatsoft.org/v40/i03/},
## }
citation("kableExtra")
## To cite package 'kableExtra' in publications use:
##
## Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
## Pipe Syntax_. R package version 1.4.0,
## <https://CRAN.R-project.org/package=kableExtra>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {kableExtra: Construct Complex Table with 'kable' and Pipe Syntax},
## author = {Hao Zhu},
## year = {2024},
## note = {R package version 1.4.0},
## url = {https://CRAN.R-project.org/package=kableExtra},
## }
```