Exploring NYC Crime Data Using EDA

Skills

Exploratory Data Analysis (EDA)
Data Wrangling
Data Visualization
R

Introduction

NYC publishes a wealth of open data for anyone to study and analyze. Gathering such data and computing statistics on crime in NYC may provide insights that carry over to other dense, highly populated areas and that help law enforcement and government officials deter crime. For this particular project, we focused on examining the various factors, patterns, and variables associated with crime in New York City.

Data and Methods

We chose to study sex-related, drug-related, and weapons-related felonies occurring from 2014 to 2017 for three reasons. First, we felt that there was significant overlap among these three types of felonies; drug-related crimes, for example, are often committed concurrently with weapons-related crimes. Second, our raw data contained over 6 million observations, and we wanted to limit our scope to a manageable size. Third, felonies are generally considered more serious than misdemeanors and violations in terms of violence, risk, severity, and danger, so they may offer more important insights than the other categories.

We used the dataset collected by the New York City Police Department (NYPD); specifically, we used the NYPD Historic Complaint dataset, which provides longitudinal information on complaints filed to the NYPD, the type of crimes committed by a suspect, suspect demographics, victim demographics, location of crime, date and time of crime, and other variables. The link to the raw dataset is here. The link to how we acquired and cleaned the dataset is here.

Our final dataset, sex_drug_weapons, contains 46,692 observations. The list below provides all 15 variables in the dataset with their brief descriptions:

cmplnt_num: Randomly generated ID for each incident
boro_nm: Borough in which the incident occurred
cmplnt_fr_dt: Exact start date of occurrence for the reported incident
cmplnt_to_dt: Exact end date of occurrence for the reported incident
cmplnt_fr_tm: Exact time of occurrence for the reported incident
ky_cd: Three-digit offense classification code
ofns_desc: Description of offense corresponding with key code (ky_cd)
pd_cd: Three-digit internal classification code
pd_desc: Description of internal classification corresponding with PD code (pd_cd)
vic_race: Victim’s race description
vic_sex: Victim’s sex description (D=Business/Organization, E=PSNY/People of the State of New York, F=Female, M=Male)
vic_age_group: Victim’s age group
year: Year the incident occurred
prem_typ_desc: Specific description of premises where incident occurred
crime_group: Identifies whether crime was a sex-related felony, drug-related felony, or weapons-related felony

Length of Reported Crime

For our exploratory analysis, we examined whether the average time between when the crime started and ended differed by borough and felony type. Examining the average time between when the crime started and ended can serve as a proxy indicator of the severity of the crime. Longer times may mean the crime is more severe, harder to resolve, more violent, and may require more resources to deal with. Furthermore, differences in the length of reported crimes may have implications for law enforcement officials, policymakers, and urban residents.
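As a minimal illustration of the quantity examined in this section, the sketch below computes an incident's length in days from a start date and an end date. It uses base R only, and the dates are hypothetical values chosen for illustration rather than rows from the dataset.

```r
# Hypothetical start and end dates for two incidents (illustrative only).
start_date <- as.Date(c("2016-03-01", "2016-03-01"))
end_date   <- as.Date(c("2016-03-01", "2016-03-15"))

# difftime() on Date objects returns a time difference;
# as.numeric() turns it into a plain number of days.
length_days <- as.numeric(difftime(end_date, start_date, units = "days"))

print(length_days)
```

An incident that starts and ends on the same date has a length of 0 days, which is the most common value in the tables that follow.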

Raw Table

First, we read in our data and create a variable calculating the length of the felony in days.

Table 1. Reading in Dataset

library(dplyr)  # provides %>%, mutate(), if_else(), and summarise() used below

felonies = readRDS(file = "./data/sex_drug_weapons.rds")


knitr::kable(head(felonies[1:5]))

cmplnt_num	boro_nm	cmplnt_fr_dt	cmplnt_to_dt	cmplnt_fr_tm
642372589	brooklyn	2017-09-07	2017-09-07	06:15:00
865947766	queens	2014-11-08	2014-11-08	22:50:00
265604404	bronx	2014-04-10	NA	19:30:00
663741947	brooklyn	2017-08-12	2017-08-12	20:00:00
831735305	brooklyn	2017-06-29	2017-06-29	10:45:00
617379463	staten_island	2016-06-17	2016-06-17	14:30:00

knitr::kable(head(felonies[6:10]))

ky_cd	ofns_desc	pd_cd	pd_desc	vic_race
118	dangerous weapons	793	weapons possession 3	unknown
117	dangerous drugs	510	controlled substance, intent t	unknown
117	dangerous drugs	501	controlled substance,possess.	unknown
118	dangerous weapons	792	weapons possession 1 & 2	unknown
117	dangerous drugs	503	controlled substance,intent to	unknown
117	dangerous drugs	501	controlled substance,possess.	unknown

knitr::kable(head(felonies[11:15]))

vic_sex	vic_age_group	year	prem_typ_desc	crime_group
e	unknown	2017	residence-house	Weapons-Related
e	NA	2014	street	Drug-Related
e	NA	2014	street	Drug-Related
e	unknown	2017	residence - apt. house	Weapons-Related
e	unknown	2017	residence - public housing	Drug-Related
e	unknown	2016	grocery/bodega	Drug-Related

Note that not every observation has a value for cmplnt_to_dt. There are several possible reasons: the crime may never have been closed (i.e., it remained ongoing), or the city may not have been able to record the value. To address this issue, we take two approaches:

We impute the length of the crime as the difference between the end of 2017 (December 31, 2017) and the start date of the crime (cmplnt_fr_dt). This makes sense if the crime remained ongoing until the end of 2017 or beyond. Our resulting dataset is time_data.
We exclude observations with missing end dates (denoted by ‘NA’). This makes sense if the city was simply unable to record the end date of the crime. Our resulting dataset is time_data2.

We will examine whether the average length of reported felonies differs when we use these two approaches.
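The two approaches can be sketched on a toy example. The data frame below is hypothetical (three made-up incidents that reuse the dataset's column names), and the code uses base R rather than the dplyr pipeline used in this post, so it is an illustration of the idea rather than the analysis code itself.

```r
# Toy data (hypothetical values); the second incident has no recorded end date.
toy <- data.frame(
  cmplnt_fr_dt = as.Date(c("2017-06-29", "2014-04-10", "2016-06-17")),
  cmplnt_to_dt = as.Date(c("2017-06-30", NA, "2016-06-17"))
)

# Observed length in days; NA wherever the end date is missing.
observed <- as.numeric(toy$cmplnt_to_dt - toy$cmplnt_fr_dt, units = "days")

# Approach 1: replace a missing length with the number of days
# from the start date to the end of 2017.
cutoff <- as.Date("2017-12-31")
approach1 <- ifelse(is.na(observed),
                    as.numeric(cutoff - toy$cmplnt_fr_dt, units = "days"),
                    observed)

# Approach 2: drop incidents whose end date is missing.
approach2 <- observed[!is.na(observed)]

print(mean(approach1))
print(mean(approach2))
```

Even with a single missing end date, the imputed value dominates the mean under approach 1, while approach 2 reflects only the incidents with a recorded end date.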

Approach 1

In the table below, we impute the length of any crime with an “NA” for the variable cmplnt_to_dt. The first few rows of the resulting dataset, time_data, are shown below.

Table 2. Impute End Dates for Crime Occurrence

time_data = felonies %>%
  # Move the chosen reference levels to the front of each factor
  mutate(crime_group = forcats::fct_relevel(crime_group, "Drug-Related"),
         boro_nm = forcats::fct_relevel(boro_nm, "manhattan")) %>% 
  janitor::clean_names() %>% 
  # Observed length in days (NA when cmplnt_to_dt is missing)
  mutate(time_diff2 = as.numeric(cmplnt_to_dt - cmplnt_fr_dt, units = "days")) %>% 
  # Impute missing lengths as the days from the start date to 2017-12-31
  mutate(time_diff2 = if_else(is.na(time_diff2),
                              as.Date("2017-12-31") - as.Date(cmplnt_fr_dt),
                              time_diff2)) %>%   
  select(time_diff2, boro_nm, crime_group)

knitr::kable(head(time_data))

time_diff2	boro_nm	crime_group
0 days	brooklyn	Weapons-Related
0 days	queens	Drug-Related
1361 days	bronx	Drug-Related
0 days	brooklyn	Weapons-Related
0 days	brooklyn	Drug-Related
0 days	staten_island	Drug-Related

Approach 2

Our second approach excludes any observations with missing end dates. The first few rows of the resulting dataset, time_data2, are shown in the table below.

Table 3. Exclude Missing End Dates for Crime Occurrence

time_data2 = felonies %>%
  # Move the chosen reference levels to the front of each factor
  mutate(crime_group = forcats::fct_relevel(crime_group, "Drug-Related"),
         boro_nm = forcats::fct_relevel(boro_nm, "manhattan")) %>% 
  janitor::clean_names() %>% 
  # Observed length in days (NA when cmplnt_to_dt is missing)
  mutate(time_diff2 = as.numeric(cmplnt_to_dt - cmplnt_fr_dt, units = "days")) %>% 
  select(time_diff2, boro_nm, crime_group) %>% 
  # Keep only incidents with a recorded end date
  filter(!is.na(time_diff2))

knitr::kable(head(time_data2))

time_diff2	boro_nm	crime_group
0	brooklyn	Weapons-Related
0	queens	Drug-Related
0	brooklyn	Weapons-Related
0	brooklyn	Drug-Related
0	staten_island	Drug-Related
0	brooklyn	Weapons-Related

Average Length by Crime Group

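We then create tables showing the average length of felonies in days by crime group and by borough for both approaches. In these tables, the “With End Date” columns come from approach 1 (time_data, with missing lengths imputed), while “Count w/o End Date” and the “Excluding NAs” columns come from approach 2 (time_data2, with missing end dates dropped). Notice the dramatic change in both the counts and average length of felonies between the approaches.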

Table 4. Average Length of Felonies by Crime Group

tidy1 = time_data %>% 
  rename(`Crime Group` = crime_group) %>% 
  group_by(`Crime Group`) %>%
  summarise('Count with End Date' = n(), 
            `Avg. Length With End Date` = mean(time_diff2),
            `SD With End Date` = sd(time_diff2))

tidy2 = time_data2 %>% 
  rename(`Crime Group` = crime_group) %>% 
  group_by(`Crime Group`) %>%
  summarise('Count w/o End Date' = n(), 
            `Avg. Length Excluding NAs` = mean(time_diff2),
            `SD Excluding NAs` = sd(time_diff2))

merged_table <- merge(tidy1, tidy2, by = c("Crime Group"))
knitr::kable(merged_table)

Crime Group	Count with End Date	Avg. Length With End Date	SD With End Date	Count w/o End Date	Avg. Length Excluding NAs	SD Excluding NAs
Drug-Related	18293	156.8032 days	363.8860	14591	0.7349879	13.78610
Sex-Related	8646	157.3583 days	350.1052	7128	23.3329008	91.21814
Weapons-Related	19753	150.4999 days	355.5124	15865	0.3511451	10.15509
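The gap between the two approaches in Table 4 can be examined with a back-of-the-envelope calculation. The sketch below assumes that the approach 2 rows are exactly the approach 1 rows with a recorded end date, so that the approach 1 mean is a count-weighted average of the complete rows and the imputed rows. It uses only the rounded values printed in Table 4; the variable names are introduced here for illustration.

```r
# Values copied from Table 4.
crime_group   <- c("Drug-Related", "Sex-Related", "Weapons-Related")
n_all         <- c(18293, 8646, 19753)             # approach 1: all incidents
mean_all      <- c(156.8032, 157.3583, 150.4999)   # approach 1: mean length (days)
n_complete    <- c(14591, 7128, 15865)             # approach 2: recorded end date
mean_complete <- c(0.7349879, 23.3329008, 0.3511451)

# Incidents whose length was imputed, and their share of each group.
n_imputed     <- n_all - n_complete
share_imputed <- n_imputed / n_all

# Solve the weighted-average identity
#   n_all * mean_all = n_complete * mean_complete + n_imputed * mean_imputed
# for the mean length among imputed incidents.
mean_imputed <- (n_all * mean_all - n_complete * mean_complete) / n_imputed

print(data.frame(crime_group,
                 n_imputed,
                 share_imputed = round(share_imputed, 3),
                 mean_imputed  = round(mean_imputed, 1)))
```

Under these assumptions, the implied average imputed length is a bit over two years for every crime group, which is why the roughly one in five incidents without an end date pulls the approach 1 averages so far above the approach 2 averages.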

Average Length by Borough

Table 5. Average Length of Felonies by Borough

tidy3 = time_data %>% 
  filter(!is.na(boro_nm)) %>% 
  rename(Borough = boro_nm) %>% 
  group_by(Borough) %>%
  summarise('Count with End Date' = n(), 
            `Avg. Length With End Date` = mean(time_diff2), 
            `SD With End Dates` = sd(time_diff2))

tidy4 = time_data2 %>% 
  filter(!is.na(boro_nm)) %>% 
  rename(Borough = boro_nm) %>% 
  group_by(Borough) %>%
  summarise('Count w/o End Date' = n(), 
            `Avg. Length Excluding NAs` = mean(time_diff2), 
            `SD Excluding NAs` = sd(time_diff2))

merged_table2 <- merge(tidy3, tidy4, by = c("Borough"))
knitr::kable(merged_table2)

Borough	Count with End Date	Avg. Length With End Date	SD With End Dates	Count w/o End Date	Avg. Length Excluding NAs	SD Excluding NAs
bronx	12848	216.5919 days	410.3039	9261	4.519990	41.55909
brooklyn	15014	128.3765 days	331.8315	12615	4.258720	38.26172
manhattan	9200	151.6389 days	349.8756	7380	4.274063	38.52996
queens	7995	108.5520 days	310.8985	6985	6.540014	47.96291
staten_island	1634	139.4144 days	337.9634	1343	7.300136	62.23328

Discussion

From our tables above, we notice differences in the average length of incidents across boroughs and felony types. Sex-related felonies have the longest average incident length under both approaches, although the gap is small when we impute end dates (about 157 days versus 150 to 157 days for the other groups) and large when we exclude NAs (about 23 days versus less than 1 day). In terms of boroughs, the Bronx ranks highest for average length of felonies when we impute end dates (about 217 days); Staten Island ranks slightly higher than the rest of the boroughs when we exclude NAs (about 7.3 days).
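One way to probe the borough ranking under imputation is to look at how many incidents in each borough lack an end date. The sketch below derives that share from the two count columns in Table 5, assuming the difference between them is the number of incidents with a missing end date; the variable names are introduced here for illustration.

```r
# Counts copied from Table 5.
borough    <- c("bronx", "brooklyn", "manhattan", "queens", "staten_island")
n_all      <- c(12848, 15014, 9200, 7995, 1634)   # approach 1: all incidents
n_complete <- c(9261, 12615, 7380, 6985, 1343)    # approach 2: recorded end date

# Share of incidents in each borough with no recorded end date.
share_missing <- (n_all - n_complete) / n_all
names(share_missing) <- borough

print(round(sort(share_missing, decreasing = TRUE), 3))
```

The Bronx has the largest share of missing end dates and Queens the smallest, and the ordering of these shares matches the ordering of the imputed averages in Table 5. This suggests, but does not establish, that the borough differences under approach 1 largely reflect how often end dates are missing rather than how long incidents last.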

View the “Differences in Mean Length of NYC Felonies” project under the “Data Analysis - R” page for formal statistical tests.