Complete the following questions using SQL.
The following data provide information regarding breast and/or colon cancer patients from a clinic, who have been given drugs (A, B, C, and/or D). Data containing patient diagnoses can be found here. Data containing patient drug treatments can be found here. Relevant information on the variables is included below:
patient_id
- Patient ID
diagnosis_date
- Date of Cancer Diagnosis
diagnosis_code
- Diagnosis Code (ICD-9-CM)
diagnosis
- Breast Cancer or Colon Cancer
treatment_date
- Date Patient Initiated Drug Regimen
drug_code
- Drug Regimen (A, B, C, or D)
I first import and merge the datasets before running any queries.
diag <- read.csv('data/diagnosis.csv')
txt <- read.csv('data/treatment.csv')
merged <- sqldf(
"SELECT d.patient_id, diagnosis_date, diagnosis_code, diagnosis, treatment_date, drug_code
FROM diag AS d
INNER JOIN txt AS t
ON d.patient_id = t.patient_id"
)
knitr::kable(head(merged[1:3]))
patient_id | diagnosis_date | diagnosis_code |
---|---|---|
2120 | 1/9/10 | 174.1 |
2120 | 1/9/10 | 174.1 |
2120 | 1/9/10 | 174.1 |
2120 | 1/9/10 | 174.1 |
2120 | 1/9/10 | 174.1 |
2120 | 1/9/10 | 174.1 |
knitr::kable(head(merged[4:6]))
diagnosis | treatment_date | drug_code |
---|---|---|
Breast Cancer | 1/23/10 | B |
Breast Cancer | 1/24/10 | A |
Breast Cancer | 1/26/10 | A |
Breast Cancer | 1/26/10 | B |
Breast Cancer | 1/27/10 | A |
Breast Cancer | 1/27/10 | B |
Write a query that finds the number of unique patients.
There are 46 unique patients.
unique_patients <- sqldf(
"SELECT COUNT(DISTINCT patient_id) AS unique_patients
FROM merged"
)
unique_patients
## unique_patients
## 1 46
#46 unique patients
How many patients have breast cancer only, colon cancer only, and both breast and colon cancer?
We have 30 patients breast cancer only patients, 11 colon cancer only patients, and 5 patients with both cancers.
breast <- sqldf(
"SELECT COUNT(DISTINCT patient_id)
FROM merged
WHERE diagnosis = 'Breast Cancer'"
)
breast
## COUNT(DISTINCT patient_id)
## 1 35
#35 patients with breast cancer
colon <- sqldf(
"SELECT COUNT(DISTINCT patient_id)
FROM merged
WHERE diagnosis = 'Colon Cancer'"
)
colon
## COUNT(DISTINCT patient_id)
## 1 16
#16 patients with colon cancer.
#Since there are only 46 unique patients, and 35+16 = 51, this suggests we have 5 individuals who have both cancers.
#This means we have 35 - 5 = 30 breast cancer only patients and 16 - 5 = 11 colon cancer only patients
both_cancers <- sqldf(
"SELECT DISTINCT patient_id, diagnosis
FROM merged
ORDER BY patient_id")
both_cancers
## patient_id diagnosis
## 1 2038 Breast Cancer
## 2 2120 Breast Cancer
## 3 2175 Breast Cancer
## 4 2238 Breast Cancer
## 5 2407 Breast Cancer
## 6 2425 Breast Cancer
## 7 2462 Breast Cancer
## 8 2475 Breast Cancer
## 9 2607 Breast Cancer
## 10 2634 Colon Cancer
## 11 2720 Breast Cancer
## 12 2735 Breast Cancer
## 13 2762 Breast Cancer
## 14 2763 Breast Cancer
## 15 2770 Colon Cancer
## 16 3025 Breast Cancer
## 17 3070 Colon Cancer
## 18 3095 Colon Cancer
## 19 3395 Colon Cancer
## 20 3449 Colon Cancer
## 21 3749 Colon Cancer
## 22 3757 Breast Cancer
## 23 3757 Colon Cancer
## 24 3948 Breast Cancer
## 25 4057 Colon Cancer
## 26 4354 Breast Cancer
## 27 4374 Breast Cancer
## 28 4692 Breast Cancer
## 29 5259 Breast Cancer
## 30 5657 Breast Cancer
## 31 6281 Breast Cancer
## 32 6321 Breast Cancer
## 33 6837 Colon Cancer
## 34 6840 Colon Cancer
## 35 6877 Colon Cancer
## 36 6877 Breast Cancer
## 37 6889 Colon Cancer
## 38 6889 Breast Cancer
## 39 6922 Colon Cancer
## 40 6922 Breast Cancer
## 41 7230 Colon Cancer
## 42 7230 Breast Cancer
## 43 7242 Colon Cancer
## 44 7796 Breast Cancer
## 45 7937 Breast Cancer
## 46 7976 Breast Cancer
## 47 8480 Breast Cancer
## 48 8615 Breast Cancer
## 49 8827 Breast Cancer
## 50 9331 Breast Cancer
## 51 9489 Breast Cancer
#Code above confirms that we have 5 patients with both cancers: IDs 3757, 6877, 6889, 6922, and 7230
#Code also confirms numbers for breast cancer only patients and colon cancer only patients
What’s the average duration between diagnosis date and the earliest treatment date among breast cancer patients? Among colon cancer patients?
On average, colon cancer patients tend to start treatment 22.9375 days after being diagnosed while breast cancer patients tend to start treatment 4.857143 days after being diagnosed.
merged$diagnosis_date <- as.Date(merged$diagnosis_date, format("%m/%d/%y"))
merged$treatment_date <- as.Date(merged$treatment_date, format("%m/%d/%y"))
breast_dur <- sqldf(
"SELECT AVG(min_date_diff)
FROM
(SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
FROM merged
WHERE diagnosis = 'Breast Cancer'
GROUP BY patient_id)"
)
breast_dur
## AVG(min_date_diff)
## 1 4.857143
#4.857143 days
colon_dur <- sqldf(
"SELECT AVG(min_date_diff)
FROM
(SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
FROM merged
WHERE diagnosis = 'Colon Cancer'
GROUP BY patient_id)"
)
colon_dur
## AVG(min_date_diff)
## 1 22.9375
#22.9375 days
#Patients with both cancers are included in the calculations for both breast_dur and colon_dur
Do breast cancer patients vary in terms of average duration between diagnosis date and the earliest treatment date? Stratify by drug.
Average duration between diagnosis and treatment for breast cancer patients taking:
breast_a <- sqldf(
"SELECT AVG(min_date_diff)
FROM
(SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
FROM merged
WHERE diagnosis = 'Breast Cancer' AND drug_code = 'A'
GROUP BY patient_id)"
)
breast_a
## AVG(min_date_diff)
## 1 4.782609
breast_b <- sqldf(
"SELECT AVG(min_date_diff)
FROM
(SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
FROM merged
WHERE diagnosis = 'Breast Cancer' AND drug_code = 'B'
GROUP BY patient_id)"
)
breast_b
## AVG(min_date_diff)
## 1 5.4
breast_c <- sqldf(
"SELECT AVG(min_date_diff)
FROM
(SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
FROM merged
WHERE diagnosis = 'Breast Cancer' AND drug_code = 'C'
GROUP BY patient_id)"
)
breast_c
## AVG(min_date_diff)
## 1 42.13333
breast_d <- sqldf(
"SELECT AVG(min_date_diff)
FROM
(SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
FROM merged
WHERE diagnosis = 'Breast Cancer' AND drug_code = 'D'
GROUP BY patient_id)"
)
breast_d
## AVG(min_date_diff)
## 1 NA
#No breast cancer patients using drug D, so no average duration
#Code looks at any drug taken at any point during their treatment dates
What tends to be the most popular drug used by breast cancer patients? What about colon cancer patients?
A and B tend to be the most popular drugs used by breast cancer patients. B and C tend to be the most popular drugs used by colon cancer patients.
breast_drug <- sqldf(
"SELECT drug_code, COUNT(drug_code)
FROM
(SELECT DISTINCT patient_id, drug_code
FROM merged
WHERE diagnosis = 'Breast Cancer'
ORDER BY patient_id)
GROUP BY drug_code"
)
knitr::kable(head(breast_drug))
drug_code | COUNT(drug_code) |
---|---|
A | 23 |
B | 25 |
C | 15 |
#A and B tend to be the most popular drugs used by breast cancer patients.
#Observations total more than 35 because some breast cancer patients use multiple drugs.
#Code looks at any drug taken at any point during their treatment dates
colon_drug <- sqldf(
"SELECT drug_code, COUNT(drug_code)
FROM
(SELECT DISTINCT patient_id, drug_code
FROM merged
WHERE diagnosis = 'Colon Cancer'
ORDER BY patient_id)
GROUP BY drug_code"
)
knitr::kable(head(colon_drug))
drug_code | COUNT(drug_code) |
---|---|
A | 6 |
B | 8 |
C | 9 |
D | 4 |
#B and C tend to be the most popular drugs used by colon cancer patients.
#Observations total more than 16 because some colon cancer patients use multiple drugs.
#Code looks at any drug taken at any point during their treatment date