SQL Queries With Cancer Data


Skills

  • SQL


Task

Complete the following questions using SQL.


Data

The following data provide information regarding breast and/or colon cancer patients from a clinic, who have been given drugs (A, B, C, and/or D). Data containing patient diagnoses can be found here. Data containing patient drug treatments can be found here. Relevant information on the variables is included below:

patient_id - Patient ID

diagnosis_date - Date of Cancer Diagnosis

diagnosis_code - Diagnosis Code (ICD-9-CM)

diagnosis - Breast Cancer or Colon Cancer

treatment_date - Date Patient Initiated Drug Regimen

drug_code - Drug Regimen (A, B, C, or D)


I first import and merge the datasets before running any queries.

diag <- read.csv('data/diagnosis.csv')
txt <- read.csv('data/treatment.csv')

merged <- sqldf(
"SELECT d.patient_id, diagnosis_date, diagnosis_code, diagnosis, treatment_date, drug_code
    FROM diag AS d
    INNER JOIN txt AS t
    ON d.patient_id = t.patient_id"
)

knitr::kable(head(merged[1:3]))
patient_id diagnosis_date diagnosis_code
2120 1/9/10 174.1
2120 1/9/10 174.1
2120 1/9/10 174.1
2120 1/9/10 174.1
2120 1/9/10 174.1
2120 1/9/10 174.1
knitr::kable(head(merged[4:6]))
diagnosis treatment_date drug_code
Breast Cancer 1/23/10 B
Breast Cancer 1/24/10 A
Breast Cancer 1/26/10 A
Breast Cancer 1/26/10 B
Breast Cancer 1/27/10 A
Breast Cancer 1/27/10 B


Queries

Q1

Write a query that finds the number of unique patients.

There are 46 unique patients.

unique_patients <- sqldf(
"SELECT COUNT(DISTINCT patient_id) AS unique_patients
    FROM merged" 
)

unique_patients
##   unique_patients
## 1              46
#46 unique patients


Q2

How many patients have breast cancer only, colon cancer only, and both breast and colon cancer?

We have 30 patients breast cancer only patients, 11 colon cancer only patients, and 5 patients with both cancers.

breast <- sqldf(
"SELECT COUNT(DISTINCT patient_id)
    FROM merged
WHERE diagnosis = 'Breast Cancer'"
)
breast
##   COUNT(DISTINCT patient_id)
## 1                         35
#35 patients with breast cancer
colon <- sqldf(
"SELECT COUNT(DISTINCT patient_id)
    FROM merged
WHERE diagnosis = 'Colon Cancer'"
) 
colon
##   COUNT(DISTINCT patient_id)
## 1                         16
#16 patients with colon cancer.
#Since there are only 46 unique patients, and 35+16 = 51, this suggests we have 5 individuals who have both cancers.
#This means we have 35 - 5 = 30 breast cancer only patients and 16 - 5 = 11 colon cancer only patients
both_cancers <- sqldf(
"SELECT DISTINCT patient_id, diagnosis
    FROM merged
ORDER BY patient_id")
both_cancers
##    patient_id     diagnosis
## 1        2038 Breast Cancer
## 2        2120 Breast Cancer
## 3        2175 Breast Cancer
## 4        2238 Breast Cancer
## 5        2407 Breast Cancer
## 6        2425 Breast Cancer
## 7        2462 Breast Cancer
## 8        2475 Breast Cancer
## 9        2607 Breast Cancer
## 10       2634  Colon Cancer
## 11       2720 Breast Cancer
## 12       2735 Breast Cancer
## 13       2762 Breast Cancer
## 14       2763 Breast Cancer
## 15       2770  Colon Cancer
## 16       3025 Breast Cancer
## 17       3070  Colon Cancer
## 18       3095  Colon Cancer
## 19       3395  Colon Cancer
## 20       3449  Colon Cancer
## 21       3749  Colon Cancer
## 22       3757 Breast Cancer
## 23       3757  Colon Cancer
## 24       3948 Breast Cancer
## 25       4057  Colon Cancer
## 26       4354 Breast Cancer
## 27       4374 Breast Cancer
## 28       4692 Breast Cancer
## 29       5259 Breast Cancer
## 30       5657 Breast Cancer
## 31       6281 Breast Cancer
## 32       6321 Breast Cancer
## 33       6837  Colon Cancer
## 34       6840  Colon Cancer
## 35       6877  Colon Cancer
## 36       6877 Breast Cancer
## 37       6889  Colon Cancer
## 38       6889 Breast Cancer
## 39       6922  Colon Cancer
## 40       6922 Breast Cancer
## 41       7230  Colon Cancer
## 42       7230 Breast Cancer
## 43       7242  Colon Cancer
## 44       7796 Breast Cancer
## 45       7937 Breast Cancer
## 46       7976 Breast Cancer
## 47       8480 Breast Cancer
## 48       8615 Breast Cancer
## 49       8827 Breast Cancer
## 50       9331 Breast Cancer
## 51       9489 Breast Cancer
#Code above confirms that we have 5 patients with both cancers: IDs 3757, 6877, 6889, 6922, and 7230
#Code also confirms numbers for breast cancer only patients and colon cancer only patients


Q3

What’s the average duration between diagnosis date and the earliest treatment date among breast cancer patients? Among colon cancer patients?

On average, colon cancer patients tend to start treatment 22.9375 days after being diagnosed while breast cancer patients tend to start treatment 4.857143 days after being diagnosed.

merged$diagnosis_date <- as.Date(merged$diagnosis_date, format("%m/%d/%y"))
merged$treatment_date <- as.Date(merged$treatment_date, format("%m/%d/%y"))

breast_dur <- sqldf(
"SELECT AVG(min_date_diff)
    FROM
    (SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
        FROM merged
    WHERE diagnosis = 'Breast Cancer'
    GROUP BY patient_id)"
)
breast_dur
##   AVG(min_date_diff)
## 1           4.857143
#4.857143 days
colon_dur <- sqldf(
"SELECT AVG(min_date_diff)
    FROM
    (SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
        FROM merged
    WHERE diagnosis = 'Colon Cancer'
    GROUP BY patient_id)"
)
colon_dur
##   AVG(min_date_diff)
## 1            22.9375
#22.9375 days
#Patients with both cancers are included in the calculations for both breast_dur and colon_dur


Q4

Do breast cancer patients vary in terms of average duration between diagnosis date and the earliest treatment date? Stratify by drug.

Average duration between diagnosis and treatment for breast cancer patients taking:

  • Drug A - 4.782609 days
  • Drug B - 5.4 days
  • Drug C - 42.13333 days
  • Drug D - NA (seems like no breast cancer patients took drug D)
breast_a <- sqldf(
"SELECT AVG(min_date_diff)
    FROM
    (SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
        FROM merged
    WHERE diagnosis = 'Breast Cancer' AND drug_code = 'A'
    GROUP BY patient_id)"
)
breast_a
##   AVG(min_date_diff)
## 1           4.782609
breast_b <- sqldf(
"SELECT AVG(min_date_diff)
    FROM
    (SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
        FROM merged
    WHERE diagnosis = 'Breast Cancer' AND drug_code = 'B'
    GROUP BY patient_id)"
)
breast_b
##   AVG(min_date_diff)
## 1                5.4
breast_c <- sqldf(
"SELECT AVG(min_date_diff)
    FROM
    (SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
        FROM merged
    WHERE diagnosis = 'Breast Cancer' AND drug_code = 'C'
    GROUP BY patient_id)"
)
breast_c
##   AVG(min_date_diff)
## 1           42.13333
breast_d <- sqldf(
"SELECT AVG(min_date_diff)
    FROM
    (SELECT patient_id, MIN(treatment_date - diagnosis_date) AS min_date_diff
        FROM merged
    WHERE diagnosis = 'Breast Cancer' AND drug_code = 'D'
    GROUP BY patient_id)"
)
breast_d
##   AVG(min_date_diff)
## 1                 NA
#No breast cancer patients using drug D, so no average duration 
#Code looks at any drug taken at any point during their treatment dates


Q5

What tends to be the most popular drug used by breast cancer patients? What about colon cancer patients?

A and B tend to be the most popular drugs used by breast cancer patients. B and C tend to be the most popular drugs used by colon cancer patients.

breast_drug <- sqldf(
"SELECT drug_code, COUNT(drug_code)
    FROM
    (SELECT DISTINCT patient_id, drug_code
        FROM merged
    WHERE diagnosis = 'Breast Cancer'
    ORDER BY patient_id)
GROUP BY drug_code"
)
knitr::kable(head(breast_drug))
drug_code COUNT(drug_code)
A 23
B 25
C 15
#A and B tend to be the most popular drugs used by breast cancer patients.
#Observations total more than 35 because some breast cancer patients use multiple drugs.
#Code looks at any drug taken at any point during their treatment dates
colon_drug <- sqldf(
"SELECT drug_code, COUNT(drug_code)
    FROM
    (SELECT DISTINCT patient_id, drug_code
        FROM merged
    WHERE diagnosis = 'Colon Cancer'
    ORDER BY patient_id)
GROUP BY drug_code"
)
knitr::kable(head(colon_drug))
drug_code COUNT(drug_code)
A 6
B 8
C 9
D 4
#B and C tend to be the most popular drugs used by colon cancer patients.
#Observations total more than 16 because some colon cancer patients use multiple drugs.
#Code looks at any drug taken at any point during their treatment date