Data Science | DSDHT
Assignment 5 - analyzing patient data
In this assignment, you will use R to analyze some anonymized patient data to look for links between various descriptors and outcomes. The dataset is in CSV format and contains a random sample of patient medical records. It will be provided in Canvas and you will need to read this into R.
Columns included are the following:
YEAR – Year for which the data is aggregated.
STATE – State abbreviation
AGE_CATEGORY – Age bracket.
1 = 18 to 44 years
2 = 45 to 64 years
3 = 65 to 79 years
4 = 80+ years
GENDER – Gender of patient.
DISEASE_CATEGORY – Diagnosis category.
1 = diabetes
2 = hypertension
PATIENTS – Number of patients included in the stratum (rows with fewer than 100 patients removed)
OFFICE_VISITS – Number of office visits in the stratum
A1C_MEAN – mean HgbA1c value (%)
A1C_MEDIAN – median HgbA1c value (%)
A1C_STDDEV – standard deviation of HgbA1c values (%)
WEIGHT_MEAN – mean weight value (lb)
WEIGHT_MEDIAN – median weight value (lb)
WEIGHT_STDDEV – standard deviation of weight values (lb)
BMI_MEAN – mean BMI
BMI_MEDIAN – median BMI
BMI_STDDEV – standard deviation of BMI values
FBG_MEAN – mean fasting blood glucose
FBG_MEDIAN – median fasting blood glucose
FBG_STDDEV – standard deviation of fasting blood glucose values
SBP_MEAN – mean systolic blood pressure
SBP_MEDIAN – median systolic blood pressure
SBP_STDDEV – standard deviation of systolic blood pressure values
DBP_MEAN – mean diastolic blood pressure
DBP_MEDIAN – median diastolic blood pressure
DBP_STDDEV – standard deviation of diastolic blood pressure values
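As a starting point, loading the file and recoding the integer codes might look like the minimal sketch below. It uses base R only. The inline rows are invented purely so the snippet runs on its own, and the filename patient_data.csv in the comment is a hypothetical placeholder for whatever the file provided with the assignment is called.

```r
# Toy rows invented for illustration only; they are NOT from the course dataset.
toy_csv <- "YEAR,STATE,AGE_CATEGORY,DISEASE_CATEGORY,PATIENTS,BMI_MEAN
2012,IN,1,1,250,31.2
2012,IN,2,2,400,29.8
2012,OH,3,1,180,30.5"

# With the real file, read from its path instead, for example:
# patients <- read.csv("patient_data.csv")   # hypothetical filename
patients <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

# Recode the integer codes into labelled factors, following the column key above.
patients$AGE_CATEGORY <- factor(
  patients$AGE_CATEGORY,
  levels = 1:4,
  labels = c("18-44", "45-64", "65-79", "80+")
)
patients$DISEASE_CATEGORY <- factor(
  patients$DISEASE_CATEGORY,
  levels = 1:2,
  labels = c("diabetes", "hypertension")
)

# Inspect column types and the first few values.
str(patients)
```

Labelled factors tend to make later tables and plot legends easier to read than the raw codes 1 to 4.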
Answer the following questions. You may use R or any other statistical package you prefer. In R, you can use ggplot2 or other plotting libraries to create graphs, and R's built-in correlation functions to find correlations.
1. Identify the 5 U.S. states in the dataset which have the largest number of patients
2. Identify the 5 U.S. states with the highest number of diabetes patients
3. Create a single pie chart that shows the number of diabetes patients in each state
4. Which states have the highest and lowest mean BMIs?
5. Plot mean BMI against mean systolic blood pressure. Show your plot, and discuss whether you think they are correlated or not.
6. Plot mean BMI against mean A1C. Show your plot, and discuss whether you think they are correlated or not.
7. Which age bracket has the highest frequency of hypertension?
8. Make a box plot (showing means and deviations) of each mean descriptor (A1C, weight, BMI, FBG, SBP, DBP), separately for patients with hypertension and patients with diabetes. What do you learn from these plots?
9. Identify which of the following variables correlate:
BMI_MEAN, BMI_STDDEV, FBG_MEAN, FBG_STDDEV, SBP_MEAN, SBP_STDDEV, DBP_MEAN and DBP_STDDEV. Describe your results, identifying which variables are highly correlated.
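The per-state totals and the correlation questions above rely on two base R patterns: summing PATIENTS within groups, and computing a correlation matrix. The sketch below shows both on a small data frame whose values are invented for illustration, so the numbers it prints say nothing about the real dataset. It is one possible approach under those assumptions, not the expected solution.

```r
# Toy strata invented for illustration only; NOT the course dataset.
patients <- data.frame(
  STATE            = c("IN", "IN", "OH", "OH", "IL", "IL"),
  DISEASE_CATEGORY = c(1, 2, 1, 2, 1, 2),   # 1 = diabetes, 2 = hypertension
  PATIENTS         = c(250, 400, 180, 320, 500, 200),
  BMI_MEAN         = c(31.2, 29.8, 30.5, 28.9, 32.0, 29.1),
  SBP_MEAN         = c(128, 141, 126, 139, 130, 140)
)

# Total patients per state (summed over all strata), largest first.
by_state <- aggregate(PATIENTS ~ STATE, data = patients, FUN = sum)
by_state <- by_state[order(-by_state$PATIENTS), ]
print(head(by_state, 5))

# The same totals restricted to diabetes strata (DISEASE_CATEGORY == 1).
diabetes <- subset(patients, DISEASE_CATEGORY == 1)
diab_by_state <- aggregate(PATIENTS ~ STATE, data = diabetes, FUN = sum)
diab_by_state <- diab_by_state[order(-diab_by_state$PATIENTS), ]
print(head(diab_by_state, 5))

# Pairwise Pearson correlations between numeric descriptor columns.
# "pairwise.complete.obs" skips missing values pair by pair.
cor_matrix <- cor(
  patients[, c("BMI_MEAN", "SBP_MEAN")],
  use = "pairwise.complete.obs"
)
print(round(cor_matrix, 2))
```

Because each row is a stratum summary rather than an individual patient, an unweighted correlation treats a stratum of 100 patients the same as one of 10,000. Whether to weight by PATIENTS is a judgment call worth stating in your write-up.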
Submit your answers and plots in PDF format on the Oncourse site.