Home
  • Exploratory Data Analysis
    • Exploring Activity Trends and Patterns Using EDA
    • Exploring NYC Crime Data Using EDA
  • Data Analysis - R
    • Analyzing Likert Scale of Depressive Symptoms
    • Differences in Mean Length of NYC Felonies
  • Data Analysis - SAS
    • Alcohol and Smoking Using NHANES
    • Away-From-Home Meals and BMI
  • SQL Queries
    • SQL Queries With NHANES Data
    • SQL Queries With Cancer Data
  • Biostatistics
    • Survival Analysis of Stroke Patients
    • Proposal for Phase III RCT
  • Resume


Exploring Activity Trends and Patterns Using EDA


Skills

  • Exploratory Data Analysis (EDA)
  • Data Wrangling
  • Data Visualization
  • R
  • Statistical Analysis


Introduction

Accelerometers have become an increasingly important tool for studying around-the-clock physical activity in studies. Levels of activities can be counted by measuring voltage signals on a patient within a short span of time. The goal of this project is to summarize trends and patterns from data collected on a male patient who wore an accelerometer device that recorded his physical activity for several months.

Researchers specifically want to understand his patterns of physical activity over periods of time; thus, some important questions to explore include, but are not limited to:

  • How does the patient’s activity level change over the course of the day?
  • Does the patient become active over time?
  • Does the day of the week affect activity levels? Does the week of the year affect activity levels?
  • What is the patient’s 24-hour activity profile for each day? How does time affect the profiles?


Data and Methods

The patient is a 63-year-old male individual with congestive heart failure (CHF). The data for the patient can be found here. The variables in the spreadsheet are:

week: Number indicating week of observational period

day: Day of the week

activity.*: Activity count (measure of activity levels). Numbers replacing asterisk indicate the minute of day on which the activity level was captured (e.g., activity.3 contains the activity count for the 3rd minute after midnight for that particular day)


Results


Tidying Data

After importing the data, we clean up the data by renaming variables, ordering the data by day of the week (Sunday first), creating a variable that indicates each day of the observational period (overall_days), and creating a variable that indicates each minute of the observational period (overall_minutes).

activity = read_csv("data/p8105_mtp_data.csv", col_types = cols()) %>% 
  janitor::clean_names() %>%
  mutate(day = fct_relevel(day, str_c(c("Sunday", "Monday", "Tuesday", "Wednesday",
                                        "Thursday", "Friday", "Saturday")))) %>% 
  arrange(week, day) %>% 
  mutate(overall_day = seq(1:nrow(.))) %>%
  gather(key = minute_of_day, value = activity_count, activity_1:activity_1440) %>% 
  separate(minute_of_day, into = c("activity", "minute_of_day"), sep = "_") %>% 
  select(-activity) %>% 
  mutate(minute_of_day = as.numeric(minute_of_day)) %>% 
  arrange(week, day, overall_day) %>% 
  mutate(overall_minutes = seq(1:nrow(.))) 


We can examine the head and tail ends of the dataset (shown below) to make sure that we cleaned our data properly.

kable(head(activity))
week day overall_day minute_of_day activity_count overall_minutes
1 Sunday 1 1 1 1
1 Sunday 1 2 1 2
1 Sunday 1 3 1 3
1 Sunday 1 4 1 4
1 Sunday 1 5 1 5
1 Sunday 1 6 1 6
kable(tail(activity))
week day overall_day minute_of_day activity_count overall_minutes
47 Saturday 329 1435 1 473755
47 Saturday 329 1436 1 473756
47 Saturday 329 1437 1 473757
47 Saturday 329 1438 1 473758
47 Saturday 329 1439 1 473759
47 Saturday 329 1440 1 473760


I run a summary function to examine the distribution of the variables. We can see that the patient was studied over 473,760 minutes (47 weeks; 329 days). His activity levels ranged from 1 to 10,417 units; 240.3 units was the mean. There were no activity levels below 1 unit, signifying that 1 unit could have been an indication of extremely minimal or no activity in this case, or that the accelerometer device was removed at this time. The mean for activity count (240.3) is moderately greater than its media (43.0), suggesting that activity count is right-skewed; we are probably more likely to encounter lower levels of activity levels rather than higher levels.

(summary(activity))
##       week           day         overall_day  minute_of_day   
##  Min.   : 1   Sunday   :67680   Min.   :  1   Min.   :   1.0  
##  1st Qu.:12   Monday   :67680   1st Qu.: 83   1st Qu.: 360.8  
##  Median :24   Tuesday  :67680   Median :165   Median : 720.5  
##  Mean   :24   Wednesday:67680   Mean   :165   Mean   : 720.5  
##  3rd Qu.:36   Thursday :67680   3rd Qu.:247   3rd Qu.:1080.2  
##  Max.   :47   Friday   :67680   Max.   :329   Max.   :1440.0  
##               Saturday :67680                                 
##  activity_count    overall_minutes 
##  Min.   :    1.0   Min.   :     1  
##  1st Qu.:    1.0   1st Qu.:118441  
##  Median :   43.0   Median :236881  
##  Mean   :  240.3   Mean   :236881  
##  3rd Qu.:  335.0   3rd Qu.:355320  
##  Max.   :10417.0   Max.   :473760  
## 


Activity Levels Over Time - Days

We use the ggplot function to determine whether the patient increased activity levels over the span of the observational period; we run a regression line through the plots and can see that activity levels seem to increase over time.

total_activity =
  activity %>% 
  group_by(week, day, overall_day) %>% 
  summarise(total_per_day = as.integer(sum(activity_count))) 

kable(head(total_activity, 15))
week day overall_day total_per_day
1 Sunday 1 196937
1 Monday 2 432750
1 Tuesday 3 297427
1 Wednesday 4 369435
1 Thursday 5 366178
1 Friday 6 416165
1 Saturday 7 1440
2 Sunday 8 1440
2 Monday 9 275174
2 Tuesday 10 293201
2 Wednesday 11 346427
2 Thursday 12 345304
2 Friday 13 387241
2 Saturday 14 1492
3 Sunday 15 1440
act_graph =
  activity %>% 
  group_by(week, day, overall_day) %>% 
  summarise(total_per_day = as.integer(sum(activity_count))) %>% 
  ggplot(aes(y = total_per_day, x = overall_day)) +
  geom_line(alpha = 0.5) +
  geom_smooth(method = "lm", size = 0.5, color = "red", se = FALSE) +
  labs(
    title = "Total Activity For Each of 329 Days",
    x = "Day of Observational Period",
    y = "Total Activity Levels"
  )

act_graph


Statistical Analysis

I run a formal statistical analysis to determine whether the association between total activity levels per day and the day of the observational period (i.e., time) is statistically significant. From the output below, we see that the parameter estimate for overall_day is around 573; thus, for every 1 day of the observational period that passes, the patient increased his total activity levels by 573 units, on average. The p-value is less than 0.05, indicating that this association is statistically significant - the patient did increase overall activity levels over time.

regression <- lm(total_activity$total_per_day ~ total_activity$overall_day)
tidy(summary(regression))
## # A tibble: 2 x 5
##   term                       estimate std.error statistic  p.value
##   <chr>                         <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                 251538.   15716.      16.0  5.54e-43
## 2 total_activity$overall_day     573.      82.6      6.94 2.10e-11


Variation in Activity Levels

I now want to examine whether there is any variation in activity levels by day of the week. The median activity levels of all 47 weeks for each day of the week is included below. We notice that Friday, Saturday, Thursday, and Sunday have the highest median activity levels.

median_act = 
  total_activity %>% 
  group_by(day) %>% 
  summarise(median_per_day = as.integer(median(total_per_day)))

kable(median_act[order(-median_act$median_per_day),])
day median_per_day
Friday 416165
Saturday 375047
Thursday 371230
Sunday 365415
Wednesday 346681
Monday 341906
Tuesday 319019


I plot the activity levels over 47 weeks for each day to visually examine the impact of day of the week on activity levels. We can see that the slope of the regression lines seems to be higher for Friday, Saturday, and Sunday, suggesting that the patient is more active during those days and that our plots match what we discovered from our table above.

daily_act = 
  activity %>% 
  group_by(week, day, overall_day) %>% 
  summarise(total_per_day = as.integer(sum(activity_count))) %>%
  ggplot(aes(x = week, y = total_per_day, color = day)) +
  geom_boxplot() +
  geom_point() +
    labs(
    title = "Activity Over 47 Weeks - By Each Day",
    x = "Week",
    y = "Total Activity Levels"
    ) +
  scale_x_continuous(breaks = c(0, 10, 20, 30, 40, 50),
                     labels = c("0", "10", "20", "30", "40", "50")) + 
  scale_y_continuous(labels = c("0", "200,000", "400,000", "600,000", "800,000")) +
  geom_smooth(method = "lm", size = 0.5, color = "red", se = FALSE) +
  facet_grid(~day) +
  theme(legend.position = "bottom")

daily_act


Statistical Analysis

I run a formal statistical analysis to determine whether activity levels varied by day of the week, after adjusting for the day of the observational period (i.e., time). Sunday was used as the reference category. From the output below, we see that the parameter estimates for most of the days are negative, suggesting a lower activity level for most of the days in comparison to Sunday. Furthermore, the parameter estimates for total_activity$dayMonday to total_activity$daySaturday all have a p-value above 0.05, indicating that the association is not statistically significant, after adjusting for time. Thus, there was not a statistically significant difference in activity levels by day of the week, after adjusting for time.

regression2 <- lm(total_activity$total_per_day ~ total_activity$day + total_activity$week)
tidy(summary(regression2))
## # A tibble: 8 x 5
##   term                        estimate std.error statistic  p.value
##   <chr>                          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                  263730.    24786.    10.6   7.56e-23
## 2 total_activity$dayMonday     -42278.    29138.    -1.45  1.48e- 1
## 3 total_activity$dayTuesday    -52483.    29138.    -1.80  7.26e- 2
## 4 total_activity$dayWednesday  -19462.    29138.    -0.668 5.05e- 1
## 5 total_activity$dayThursday    -3654.    29138.    -0.125 9.00e- 1
## 6 total_activity$dayFriday      30313.    29138.     1.04  2.99e- 1
## 7 total_activity$daySaturday    -6882.    29138.    -0.236 8.13e- 1
## 8 total_activity$week            3994.      574.     6.96  1.97e-11


24-Hour Activity Profiles

Lastly, I create 24-hour activity profiles for each day of the week aggregating data from all 47 weeks. We see bimodal distributions for most of the days, suggesting that the patient was more active during the mornings and in the evenings. Secondly, we see higher peaks for Thursday, Friday, and Sunday, indicating heavier activity levels during this time. As noted above though, day of the week was not statistically associated with difference in activity levels, after adjusting for time.

minute_act = 
  activity %>% 
  group_by(minute_of_day, day) %>% 
  summarise(total_per_day = as.integer(sum(activity_count))) %>%
  ggplot(aes(x = minute_of_day / 60, y = total_per_day, color = day)) +
  geom_point() +
    labs(
    title = "Activity Over 24 Hours - By Each Day",
    x = "Hour",
    y = "Total Activity Levels"
    ) +
  geom_smooth(size = 0.5, color = "red") +
  facet_grid(~day) +
  theme(legend.position = "bottom")

minute_act
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'