Exploring NYC Crime Data Using EDA


Skills

  • Exploratory Data Analysis (EDA)
  • Data Wrangling
  • Data Visualization
  • R


Introduction

New York City publishes a wealth of open data that anyone can study and analyze. Gathering such data and computing statistics on crime in NYC may provide insights that help law enforcement and government officials deter crime, both in the city and in other dense, highly populated areas. For this particular project, we focused on examining the factors, patterns, and variables associated with crime in New York City.


Data and Methods

We decided to study sex-related, drug-related, and weapons-related felonies occurring from 2014 to 2017 for three reasons. First, we felt that there was significant overlap among these three types of felonies - drug-related crimes, for example, are often committed concurrently with weapons-related crimes. Second, our raw data contained over 6 million observations, and we wanted to limit our scope to a manageable size. Third, felonies are generally considered more serious than misdemeanors and violations in terms of violence, risk, severity, and danger, so they potentially offer more important insights than the other categories.

We used the dataset collected by the New York City Police Department (NYPD); specifically, we used the NYPD Historic Complaint dataset, which provides longitudinal information on complaints filed to the NYPD, the type of crimes committed by a suspect, suspect demographics, victim demographics, location of crime, date and time of crime, and other variables. The link to the raw dataset is here. The link to how we acquired and cleaned the dataset is here.

Our final dataset, sex_drug_weapons, contains 46,692 observations. The list below provides all 15 variables in the dataset with their brief descriptions:

cmplnt_num: Randomly generated ID for each incident
boro_nm: Borough in which the incident occurred
cmplnt_fr_dt: Exact start date of occurrence for the reported incident
cmplnt_to_dt: Exact end date of occurrence for the reported incident
cmplnt_fr_tm: Exact time of occurrence for the reported incident
ky_cd: Three-digit offense classification code
ofns_desc: Description of offense corresponding with key code (ky_cd)
pd_cd: Three-digit internal classification code
pd_desc: Description of internal classification corresponding with PD code (pd_cd)
vic_race: Victim’s race description
vic_sex: Victim’s sex description (D=Business/Organization, E=PSNY/People of the State of New York, F=Female, M=Male)
vic_age_group: Victim’s age group
year: Year the incident occurred
prem_typ_desc: Specific description of premises where incident occurred
crime_group: Identifies whether crime was a sex-related felony, drug-related felony, or weapons-related felony


Length of Reported Crime

For our exploratory analysis, we examined whether the average time between when the crime started and ended differed by borough and felony type. Examining the average time between when the crime started and ended can serve as a proxy indicator of the severity of the crime. Longer times may mean the crime is more severe, harder to resolve, more violent, and may require more resources to deal with. Furthermore, differences in the length of reported crimes may have implications for law enforcement officials, policymakers, and urban residents.
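As a minimal sketch of the quantity being compared, the length of an incident is the number of days between its start and end dates. The two dates below are made up for illustration and are not taken from the dataset; base R is used so the snippet runs without extra packages.

```r
# Hypothetical start and end dates for a single incident (not from the dataset)
start_date <- as.Date("2016-03-01")
end_date   <- as.Date("2016-03-04")

# difftime() returns the elapsed time; as.numeric() turns it into a plain number of days
length_days <- as.numeric(difftime(end_date, start_date, units = "days"))
print(length_days)

# An incident that starts and ends on the same day has a length of 0 days
same_day <- as.numeric(difftime(start_date, start_date, units = "days"))
print(same_day)
```

Most rows in the dataset preview start and end on the same date, so a length of 0 days is the typical case.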


Raw Table

First, we read in our data and preview its columns. The variable measuring the length of each felony in days is created in the sections that follow.

Table 1. Reading in Dataset

library(dplyr)  # provides %>%, mutate(), if_else(), group_by(), and summarise() used below
felonies = readRDS(file = "./data/sex_drug_weapons.rds")


knitr::kable(head(felonies[1:5]))
| cmplnt_num | boro_nm       | cmplnt_fr_dt | cmplnt_to_dt | cmplnt_fr_tm |
|------------|---------------|--------------|--------------|--------------|
| 642372589  | brooklyn      | 2017-09-07   | 2017-09-07   | 06:15:00     |
| 865947766  | queens        | 2014-11-08   | 2014-11-08   | 22:50:00     |
| 265604404  | bronx         | 2014-04-10   | NA           | 19:30:00     |
| 663741947  | brooklyn      | 2017-08-12   | 2017-08-12   | 20:00:00     |
| 831735305  | brooklyn      | 2017-06-29   | 2017-06-29   | 10:45:00     |
| 617379463  | staten_island | 2016-06-17   | 2016-06-17   | 14:30:00     |

knitr::kable(head(felonies[6:10])) 
| ky_cd | ofns_desc         | pd_cd | pd_desc                        | vic_race |
|-------|-------------------|-------|--------------------------------|----------|
| 118   | dangerous weapons | 793   | weapons possession 3           | unknown  |
| 117   | dangerous drugs   | 510   | controlled substance, intent t | unknown  |
| 117   | dangerous drugs   | 501   | controlled substance,possess.  | unknown  |
| 118   | dangerous weapons | 792   | weapons possession 1 & 2       | unknown  |
| 117   | dangerous drugs   | 503   | controlled substance,intent to | unknown  |
| 117   | dangerous drugs   | 501   | controlled substance,possess.  | unknown  |

knitr::kable(head(felonies[11:15]))
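| vic_sex | vic_age_group | year | prem_typ_desc              | crime_group     |
|---------|---------------|------|----------------------------|-----------------|
| e       | unknown       | 2017 | residence-house            | Weapons-Related |
| e       | NA            | 2014 | street                     | Drug-Related    |
| e       | NA            | 2014 | street                     | Drug-Related    |
| e       | unknown       | 2017 | residence - apt. house     | Weapons-Related |
| e       | unknown       | 2017 | residence - public housing | Drug-Related    |
| e       | unknown       | 2016 | grocery/bodega             | Drug-Related    |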


Note that not every observation has a value for cmplnt_to_dt. This could be due to several factors - perhaps the crime was never closed (i.e., it remained an ongoing crime) or perhaps the city was not able to record the value for that variable for whatever reason. To address this issue, we take two approaches:

  • We impute the missing end date as the end of 2017, so the length of the crime is the difference between December 31, 2017 and the start date of the crime (cmplnt_fr_dt). This makes sense if the crime remained ongoing until the end of 2017 or beyond. Our resulting dataset is time_data.
  • We exclude observations with missing end dates (denoted by ‘NA’). This makes sense if the city was not able to record the end dates for the crime for whatever reason. Our resulting dataset is time_data2.

We will examine whether the average length of reported felonies differs when we use these two approaches.
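To make the difference between the two approaches concrete, here is a small sketch on made-up data. The three incidents and the names toy, cutoff, len_approach1, and len_approach2 are hypothetical; only the cutoff date of December 31, 2017 comes from the description above.

```r
# Toy data: three hypothetical incidents, one with a missing end date
toy <- data.frame(
  start = as.Date(c("2017-01-01", "2017-06-01", "2017-12-01")),
  end   = as.Date(c("2017-01-03", NA, "2017-12-01"))
)
cutoff <- as.Date("2017-12-31")  # end of the study window

# Approach 1: treat a missing end date as the cutoff date
end_imputed <- toy$end
end_imputed[is.na(end_imputed)] <- cutoff
len_approach1 <- as.numeric(end_imputed - toy$start)

# Approach 2: drop incidents whose end date is missing
complete <- toy[!is.na(toy$end), ]
len_approach2 <- as.numeric(complete$end - complete$start)

print(mean(len_approach1))  # pulled up by the single imputed incident
print(mean(len_approach2))  # based on the two complete incidents only
```

Even one imputed incident can dominate the average under the first approach, which is worth keeping in mind when comparing the two sets of results.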


Approach 1

Here, we impute the end date for any crime with an “NA” for the variable cmplnt_to_dt. The first few rows of the resulting dataset, time_data, are shown below.

Table 2. Impute End Dates for Crime Occurrence

time_data = felonies %>%
  janitor::clean_names() %>% 
  mutate(crime_group = forcats::fct_relevel(crime_group, "Drug-Related"),
         boro_nm = forcats::fct_relevel(boro_nm, "manhattan")) %>% 
  # Length in days as a difftime; NA when the end date is missing
  mutate(time_diff2 = difftime(cmplnt_to_dt, cmplnt_fr_dt, units = "days")) %>% 
  # Impute missing lengths using 2017-12-31 as the end date.
  # Both if_else() branches are difftime objects so their types match.
  mutate(time_diff2 = if_else(is.na(time_diff2),
                              difftime(as.Date("2017-12-31"), as.Date(cmplnt_fr_dt),
                                       units = "days"),
                              time_diff2)) %>%   
  select(time_diff2, boro_nm, crime_group)

knitr::kable(head(time_data))
| time_diff2 | boro_nm       | crime_group     |
|------------|---------------|-----------------|
| 0 days     | brooklyn      | Weapons-Related |
| 0 days     | queens        | Drug-Related    |
| 1361 days  | bronx         | Drug-Related    |
| 0 days     | brooklyn      | Weapons-Related |
| 0 days     | brooklyn      | Drug-Related    |
| 0 days     | staten_island | Drug-Related    |
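
The 1361-day value in the third row can be checked by hand. That incident (bronx, start date 2014-04-10) has no recorded end date in the Table 1 preview, so its length is measured up to December 31, 2017. The variable names below are illustrative only.

```r
# Third row of the preview: start date 2014-04-10, no recorded end date
start_date <- as.Date("2014-04-10")
cutoff     <- as.Date("2017-12-31")  # imputed end date under Approach 1

# Days elapsed between the start date and the imputed end date
imputed_length <- as.numeric(difftime(cutoff, start_date, units = "days"))
print(imputed_length)  # 1361
```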


Approach 2

Our second approach involves excluding any observations with missing end dates for crimes. The first few rows of the resulting dataset, time_data2, are shown in the table below.

Table 3. Exclude Missing End Dates for Crime Occurrence

time_data2 = felonies %>%
  janitor::clean_names() %>% 
  mutate(crime_group = forcats::fct_relevel(crime_group, "Drug-Related"),
         boro_nm = forcats::fct_relevel(boro_nm, "manhattan")) %>% 
  # Length in days as a plain number; NA when the end date is missing
  mutate(time_diff2 = as.numeric(cmplnt_to_dt - cmplnt_fr_dt, units = "days")) %>% 
  select(time_diff2, boro_nm, crime_group) %>% 
  # Keep only incidents with a recorded end date
  filter(!is.na(time_diff2))

knitr::kable(head(time_data2))
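| time_diff2 | boro_nm       | crime_group     |
|------------|---------------|-----------------|
| 0          | brooklyn      | Weapons-Related |
| 0          | queens        | Drug-Related    |
| 0          | brooklyn      | Weapons-Related |
| 0          | brooklyn      | Drug-Related    |
| 0          | staten_island | Drug-Related    |
| 0          | brooklyn      | Weapons-Related |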


Average Length by Crime Group

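We then create tables showing the average length of felonies in days by borough and crime group for both approaches. Columns labeled “With End Date” come from the first approach, where every observation has an end date because missing ones are imputed; columns labeled “w/o End Date” or “Excluding NAs” come from the second approach, where observations with missing end dates are dropped. Notice the dramatic change in both the counts and average length of felonies between the approaches.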

Table 4. Average Length of Felonies by Crime Group

tidy1 = time_data %>% 
  rename(`Crime Group` = crime_group) %>% 
  group_by(`Crime Group`) %>%
  summarise('Count with End Date' = n(), 
            `Avg. Length With End Date` = mean(time_diff2),
            `SD With End Date` = sd(time_diff2))

tidy2 = time_data2 %>% 
  rename(`Crime Group` = crime_group) %>% 
  group_by(`Crime Group`) %>%
  summarise('Count w/o End Date' = n(), 
            `Avg. Length Excluding NAs` = mean(time_diff2),
            `SD Excluding NAs` = sd(time_diff2))

merged_table <- merge(tidy1, tidy2, by = c("Crime Group"))
knitr::kable(merged_table)
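| Crime Group     | Count with End Date | Avg. Length With End Date | SD With End Date | Count w/o End Date | Avg. Length Excluding NAs | SD Excluding NAs |
|-----------------|---------------------|---------------------------|------------------|--------------------|---------------------------|------------------|
| Drug-Related    | 18293               | 156.8032 days             | 363.8860         | 14591              | 0.7349879                 | 13.78610         |
| Sex-Related     | 8646                | 157.3583 days             | 350.1052         | 7128               | 23.3329008                | 91.21814         |
| Weapons-Related | 19753               | 150.4999 days             | 355.5124         | 15865              | 0.3511451                 | 10.15509         |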
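
The gap between the two sets of averages can be unpacked with a little arithmetic on the values in Table 4. Because the overall mean is a count-weighted mean of the complete and the imputed observations, we can back out the average length assigned to the imputed observations. This is a rough sketch that relies on the rounded figures printed in the table; the vector names are illustrative.

```r
# Counts and means copied from Table 4
groups   <- c("Drug-Related", "Sex-Related", "Weapons-Related")
n_all    <- c(18293, 8646, 19753)              # Approach 1: all incidents
mean_all <- c(156.8032, 157.3583, 150.4999)    # Approach 1: mean length (days)
n_obs    <- c(14591, 7128, 15865)              # Approach 2: recorded end dates only
mean_obs <- c(0.7349879, 23.3329008, 0.3511451)

# Incidents whose end date was missing and therefore imputed
n_missing <- n_all - n_obs

# n_all * mean_all = n_obs * mean_obs + n_missing * mean_imputed, solved for mean_imputed
mean_imputed <- (n_all * mean_all - n_obs * mean_obs) / n_missing

print(data.frame(group = groups,
                 n_missing = n_missing,
                 share_missing = round(n_missing / n_all, 3),
                 mean_imputed = round(mean_imputed, 1)))
```

Under these figures, the imputed observations carry an average length of more than 700 days in every crime group, which is why the first approach produces much larger averages and standard deviations.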


Average Length by Borough

Table 5. Average Length of Felonies by Borough

tidy3 = time_data %>% 
  filter(!is.na(boro_nm)) %>% 
  rename(Borough = boro_nm) %>% 
  group_by(Borough) %>%
  summarise('Count with End Date' = n(), 
            `Avg. Length With End Date` = mean(time_diff2), 
            `SD With End Dates` = sd(time_diff2))

tidy4 = time_data2 %>% 
  filter(!is.na(boro_nm)) %>% 
  rename(Borough = boro_nm) %>% 
  group_by(Borough) %>%
  summarise('Count w/o End Date' = n(), 
            `Avg. Length Excluding NAs` = mean(time_diff2), 
            `SD Excluding NAs` = sd(time_diff2))

merged_table2 <- merge(tidy3, tidy4, by = c("Borough"))
knitr::kable(merged_table2)
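| Borough       | Count with End Date | Avg. Length With End Date | SD With End Dates | Count w/o End Date | Avg. Length Excluding NAs | SD Excluding NAs |
|---------------|---------------------|---------------------------|-------------------|--------------------|---------------------------|------------------|
| bronx         | 12848               | 216.5919 days             | 410.3039          | 9261               | 4.519990                  | 41.55909         |
| brooklyn      | 15014               | 128.3765 days             | 331.8315          | 12615              | 4.258720                  | 38.26172         |
| manhattan     | 9200                | 151.6389 days             | 349.8756          | 7380               | 4.274063                  | 38.52996         |
| queens        | 7995                | 108.5520 days             | 310.8985          | 6985               | 6.540014                  | 47.96291         |
| staten_island | 1634                | 139.4144 days             | 337.9634          | 1343               | 7.300136                  | 62.23328         |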
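
One way to read Table 5 is to compare, for each borough, the share of observations whose end date was missing. The sketch below uses only the counts and averages printed in the table plus the 46,692 total reported earlier; the vector names are illustrative, and this is a descriptive check rather than a formal test.

```r
# Counts and Approach 1 averages copied from Table 5
boroughs <- c("bronx", "brooklyn", "manhattan", "queens", "staten_island")
n_all    <- c(12848, 15014, 9200, 7995, 1634)   # Approach 1: all incidents
n_obs    <- c(9261, 12615, 7380, 6985, 1343)    # Approach 2: recorded end dates only
mean_all <- c(216.5919, 128.3765, 151.6389, 108.5520, 139.4144)

# Share of incidents per borough with a missing (imputed) end date
share_missing <- (n_all - n_obs) / n_all
names(share_missing) <- boroughs
print(round(sort(share_missing, decreasing = TRUE), 3))

# Boroughs ordered by share missing and by Approach 1 average length
print(boroughs[order(share_missing, decreasing = TRUE)])
print(boroughs[order(mean_all, decreasing = TRUE)])

# Observations dropped by filter(!is.na(boro_nm)), given 46,692 rows in total
n_no_borough <- 46692 - sum(n_all)
print(n_no_borough)
```

With these figures, the boroughs rank in the same order by share of missing end dates as by average length under the first approach, which suggests that the borough ranking under that approach largely reflects how often end dates are missing.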


Discussion

From our tables above, we notice that there seems to be a marked difference in the average length of incidents across boroughs and felony types. Notably, sex-related felonies have a longer average incident length than drug-related and weapons-related felonies under both approaches, although the gap is small when we impute end dates and much larger when we exclude NAs. In terms of boroughs, the Bronx ranks highest for average length of felonies when we impute end dates; Staten Island ranks slightly higher than the rest of the boroughs when we exclude NAs.

View the “Differences in Mean Length of NYC Felonies” project under the “Data Analysis - R” page for formal statistical tests.