Assignment 5 - analyzing patient data

In this assignment, you will use R to analyze anonymized patient data and look for links between various descriptors and outcomes. The dataset is in CSV format and contains a random sample of patient medical records. It will be provided in Canvas, and you will need to read it into R. The following columns are included:

YEAR – Year for which the data is aggregated.
STATE – State abbreviation.
AGE_CATEGORY – Age category.
1 = 18 to 44 years
2 = 45 to 64 years
3 = 65 to 79 years
4 = 80+ years
GENDER – Gender of patient.
DISEASE_CATEGORY – Diagnosis category.
1 = diabetes
2 = hypertension
PATIENTS – Number of patients included in the stratum (rows with fewer than 100 patients were removed)
OFFICE_VISITS – Number of office visits in the stratum
A1C_MEAN – mean HgbA1c value (%)
A1C_MEDIAN – median HgbA1c value (%)
A1C_STDDEV – standard deviation of HgbA1c values (%)
WEIGHT_MEAN – mean weight value (lb)
WEIGHT_MEDIAN – median weight value (lb)
WEIGHT_STDDEV – standard deviation of weight values (lb)
BMI_MEAN – mean BMI
BMI_MEDIAN – median BMI
BMI_STDDEV – standard deviation of BMI values
FBG_MEAN – mean fasting blood glucose
FBG_MEDIAN – median fasting blood glucose
FBG_STDDEV – standard deviation of fasting blood glucose values
SBP_MEAN – mean systolic blood pressure
SBP_MEDIAN – median systolic blood pressure
SBP_STDDEV – standard deviation of systolic blood pressure values
DBP_MEAN – mean diastolic blood pressure
DBP_MEDIAN – median diastolic blood pressure
DBP_STDDEV – standard deviation of diastolic blood pressure values
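
As a starting point, here is a minimal sketch of loading the data and labeling the coded columns. It assumes the column names match the list above. The file name `patient_data.csv` is hypothetical, and the toy values are made up so that the snippet runs on its own.

```r
# Toy stand-in for the real file; the values are made up for illustration only.
# With the real data, replace `text = toy_csv` with the path to the downloaded
# file, e.g. read.csv("patient_data.csv") (hypothetical file name).
toy_csv <- "YEAR,STATE,AGE_CATEGORY,DISEASE_CATEGORY,PATIENTS
2010,IN,1,1,150
2010,IN,3,2,320
2010,OH,2,1,210"

patients <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

# Turn the numeric codes into labeled factors using the code list above.
patients$AGE_CATEGORY <- factor(
  patients$AGE_CATEGORY,
  levels = 1:4,
  labels = c("18-44", "45-64", "65-79", "80+")
)
patients$DISEASE_CATEGORY <- factor(
  patients$DISEASE_CATEGORY,
  levels = 1:2,
  labels = c("diabetes", "hypertension")
)

# Inspect column types and the first few values.
str(patients)
```

Labeling the codes up front is optional, but it tends to make later tables and plot legends easier to read than raw 1/2 values.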

Answer the following questions. You may use R or any other statistical package you prefer. In R, you can use ggplot2 or other plotting libraries to create graphs, and the corrplot package to visualize correlations.
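
For orientation, the sketch below shows one way to compute a correlation matrix with base R's `cor()` and, if the corrplot package is installed, to draw it. The data frame `toy` and its values are invented for illustration; only the column names come from the list above.

```r
# Toy data frame with made-up values; column names follow the list above.
set.seed(1)
bmi <- rnorm(50, mean = 30, sd = 3)
toy <- data.frame(
  BMI_MEAN = bmi,
  SBP_MEAN = 100 + bmi + rnorm(50, sd = 2),  # constructed to depend on BMI_MEAN
  A1C_MEAN = rnorm(50, mean = 7, sd = 0.5)   # constructed independently
)

# Pearson correlation matrix; "complete.obs" drops rows with missing values.
cor_mat <- cor(toy, use = "complete.obs")
print(round(cor_mat, 2))

# Optional visualization, only attempted if corrplot is installed.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle")
}
```

With the real data, the same calls would be applied to the numeric columns of interest rather than to simulated values.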

1. Identify the 5 U.S. states in the dataset that have the largest number of patients.
2. Identify the 5 U.S. states with the highest number of diabetes patients.
3. Create a single pie chart that shows the number of diabetes patients in each state.
4. Which states have the highest and lowest mean BMIs?
5. Plot mean BMI against mean systolic blood pressure. Show your plot, and discuss whether you think they are correlated or not.
6. Plot mean BMI against mean A1C. Show your plot, and discuss whether you think they are correlated or not.
7. Which age bracket has the highest frequency of hypertension?
8. Make box plots (showing means and deviations) of all mean descriptors (A1C, weight, BMI, FBG, SBP, DBP), separately for patients with hypertension and patients with diabetes. What do you learn from these?
9. Identify which of the following variables correlate: A1C_MEAN, A1C_STDDEV, WEIGHT_MEAN, WEIGHT_STDDEV, BMI_MEAN, BMI_STDDEV, FBG_MEAN, FBG_STDDEV, SBP_MEAN, SBP_STDDEV, DBP_MEAN, and DBP_STDDEV. Describe your results, identifying which variables are highly correlated.
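
Because each row describes a stratum rather than a single patient, several of the questions above involve combining rows by state. The sketch below uses invented toy values to show one possible approach: summing PATIENTS per state and taking a patient-weighted mean of BMI_MEAN. Weighting by PATIENTS is an assumption about how strata should be combined, not a stated requirement of the assignment.

```r
# Toy strata with made-up values; column names follow the list above.
toy <- data.frame(
  STATE = c("IN", "IN", "OH", "OH"),
  DISEASE_CATEGORY = c(1, 2, 1, 2),
  PATIENTS = c(100, 300, 200, 300),
  BMI_MEAN = c(32, 28, 31, 29)
)

# Total patients per state, sorted from largest to smallest.
totals <- aggregate(PATIENTS ~ STATE, data = toy, FUN = sum)
totals <- totals[order(-totals$PATIENTS), ]
print(totals)

# Restrict to one disease category (1 = diabetes) before aggregating.
diabetes <- toy[toy$DISEASE_CATEGORY == 1, ]
diabetes_totals <- aggregate(PATIENTS ~ STATE, data = diabetes, FUN = sum)

# Patient-weighted mean BMI per state: larger strata count for more.
weighted_bmi <- sapply(
  split(toy, toy$STATE),
  function(d) weighted.mean(d$BMI_MEAN, d$PATIENTS)
)
print(weighted_bmi)
```

An unweighted `mean()` of BMI_MEAN per state would treat a stratum of 100 patients the same as one of 10,000, so it is worth stating in your answer which choice you made.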

Submit your answers and plots in PDF format on the Oncourse site.