Chapter 11: Survival Analysis and Censored DataΒΆ
OverviewΒΆ
Survival Analysis studies time until an event occurs
ApplicationsΒΆ
Medicine: Time until death, disease recurrence
Engineering: Time until machine failure
Business: Customer churn, employee retention
Social Science: Time until employment, marriage
Key Challenge: Censored DataΒΆ
Censoring: We donβt observe the event for all subjects
Types:
Right Censoring (most common):
Study ends before event occurs
Subject drops out
We know: βevent time > observed timeβ
Left Censoring:
Event occurred before observation started
Less common
Interval Censoring:
Event occurred within a time interval
Why Not Just Use Regression?ΒΆ
β Ignoring censored data β Biased estimates
β Removing censored data β Loss of information
β
Survival methods β Properly handle censoring
Key ConceptsΒΆ
Survival FunctionΒΆ
Probability of surviving beyond time \(t\)
Properties:
\(S(0) = 1\) (everyone alive at start)
\(S(\infty) = 0\) (eventually all experience event)
Decreasing function
Hazard FunctionΒΆ
Interpretation:
Instantaneous risk of event at time \(t\)
Given survival up to time \(t\)
Cumulative HazardΒΆ
Methods CoveredΒΆ
Kaplan-Meier Estimator: Non-parametric survival curves
Log-Rank Test: Compare survival between groups
Cox Proportional Hazards: Semi-parametric regression
Time-Varying Coefficients: Extensions
11.1 Kaplan-Meier EstimatorΒΆ
Non-Parametric Survival EstimationΒΆ
Kaplan-Meier (KM) Formula: $\(\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)\)$
where:
\(t_i\) = distinct event times
\(d_i\) = number of events at \(t_i\)
\(n_i\) = number at risk at \(t_i\) (not yet experienced event or censored)
Key FeaturesΒΆ
β
No distributional assumptions
β
Handles censored data properly
β
Produces survival curve
β
Confidence intervals available
Median Survival TimeΒΆ
Time when \(S(t) = 0.5\) (50% survival)
# Kaplan-Meier Example: Brain Cancer Data
# Generate sample survival data
np.random.seed(42)
n = 80
# Simulate survival times (exponential distribution)
times = np.random.exponential(scale=20, size=n)
# Simulate censoring (30% censored)
censoring_times = np.random.exponential(scale=25, size=n)
observed_times = np.minimum(times, censoring_times)
event_observed = (times <= censoring_times).astype(int)
# Create DataFrame
df_surv = pd.DataFrame({
'time': observed_times,
'event': event_observed
})
print("π Dataset Summary:")
print(f" Total subjects: {n}")
print(f" Events observed: {event_observed.sum()}")
print(f" Censored: {n - event_observed.sum()}")
print(f" Censoring rate: {(1 - event_observed.mean())*100:.1f}%")
# Fit Kaplan-Meier
kmf = KaplanMeierFitter()
kmf.fit(df_surv['time'], df_surv['event'], label='Overall')
# Plot survival curve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Survival function
kmf.plot_survival_function(ax=ax1, ci_show=True)
ax1.set_xlabel('Time')
ax1.set_ylabel('Survival Probability')
ax1.set_title('Kaplan-Meier Survival Curve')
ax1.grid(True, alpha=0.3)
# Add median survival line
median_surv = kmf.median_survival_time_
ax1.axhline(0.5, color='red', linestyle='--', alpha=0.5, label=f'Median')
ax1.axvline(median_surv, color='red', linestyle='--', alpha=0.5)
ax1.text(median_surv + 1, 0.55, f'Median: {median_surv:.1f}', fontsize=10)
ax1.legend()
# Cumulative hazard
kmf.plot_cumulative_density(ax=ax2, ci_show=True)
ax2.set_xlabel('Time')
ax2.set_ylabel('Cumulative Event Probability')
ax2.set_title('Cumulative Incidence (1 - Survival)')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nπ Survival Statistics:")
print(f" Median survival time: {median_surv:.2f}")
print(f" Survival at t=10: {kmf.predict(10):.3f}")
print(f" Survival at t=20: {kmf.predict(20):.3f}")
print(f" Survival at t=30: {kmf.predict(30):.3f}")
11.2 Comparing Survival Curves: Log-Rank TestΒΆ
Hypothesis TestΒΆ
Hβ: Survival distributions are the same across groups
Hβ: Survival distributions differ
Log-Rank StatisticΒΆ
Compares observed vs expected events at each time point
where:
\(O_i\) = observed events in group
\(E_i\) = expected events under Hβ
\(V_i\) = variance
When to UseΒΆ
β
Compare 2+ groups
β
Non-parametric (no assumptions)
β
Handles censoring
p-value < 0.05 β Significant differenceΒΆ
# Compare Survival by Treatment Group
np.random.seed(42)
n_per_group = 40
# Group 1: Control (worse survival)
times_control = np.random.exponential(scale=15, size=n_per_group)
censoring_control = np.random.exponential(scale=20, size=n_per_group)
# Group 2: Treatment (better survival)
times_treatment = np.random.exponential(scale=25, size=n_per_group)
censoring_treatment = np.random.exponential(scale=30, size=n_per_group)
# Create DataFrame
df_groups = pd.DataFrame({
'time': np.concatenate([np.minimum(times_control, censoring_control),
np.minimum(times_treatment, censoring_treatment)]),
'event': np.concatenate([(times_control <= censoring_control).astype(int),
(times_treatment <= censoring_treatment).astype(int)]),
'group': ['Control']*n_per_group + ['Treatment']*n_per_group
})
# Fit KM for each group
kmf_control = KaplanMeierFitter()
kmf_treatment = KaplanMeierFitter()
control_data = df_groups[df_groups['group'] == 'Control']
treatment_data = df_groups[df_groups['group'] == 'Treatment']
kmf_control.fit(control_data['time'], control_data['event'], label='Control')
kmf_treatment.fit(treatment_data['time'], treatment_data['event'], label='Treatment')
# Log-rank test
results = logrank_test(
control_data['time'], treatment_data['time'],
control_data['event'], treatment_data['event']
)
# Plot
fig, ax = plt.subplots(figsize=(10, 6))
kmf_control.plot_survival_function(ax=ax, ci_show=True)
kmf_treatment.plot_survival_function(ax=ax, ci_show=True)
ax.set_xlabel('Time')
ax.set_ylabel('Survival Probability')
ax.set_title(f'Kaplan-Meier Curves by Treatment Group\n'
f'Log-Rank Test: p-value = {results.p_value:.4f}')
ax.grid(True, alpha=0.3)
ax.legend()
plt.show()
print("\nπ Log-Rank Test Results:")
print(f" Test statistic: {results.test_statistic:.4f}")
print(f" p-value: {results.p_value:.4f}")
if results.p_value < 0.05:
print(" β
Significant difference between groups (p < 0.05)")
else:
print(" β No significant difference (p >= 0.05)")
print(f"\nπ Median Survival Times:")
print(f" Control: {kmf_control.median_survival_time_:.2f}")
print(f" Treatment: {kmf_treatment.median_survival_time_:.2f}")
print(f" Difference: {kmf_treatment.median_survival_time_ - kmf_control.median_survival_time_:.2f}")
11.3 Cox Proportional Hazards ModelΒΆ
Semi-Parametric RegressionΒΆ
Cox Model: $\(h(t|X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\)$
Components:
\(h_0(t)\) = baseline hazard (unspecified, non-parametric)
\(\exp(\beta^T X)\) = relative risk (parametric)
Hazard Ratio (HR)ΒΆ
Interpretation:
\(HR = 1\): No effect
\(HR > 1\): Increased risk (bad)
\(HR < 1\): Decreased risk (good)
For binary variable (0/1): $\(HR = e^\beta\)$
For continuous variable:
1-unit increase β hazard multiplied by \(e^\beta\)
Proportional Hazards AssumptionΒΆ
Key Assumption: Hazard ratio is constant over time
If violated β need time-varying coefficients
AdvantagesΒΆ
β
Handles censoring
β
Multiple covariates
β
No baseline hazard assumption
β
Interpretable hazard ratios
β
Flexible
Model FittingΒΆ
Partial Likelihood (not full likelihood)
Only uses ordering of events
Doesnβt need baseline hazard
Maximized via Newton-Raphson
# Cox Proportional Hazards Model
# Generate data with multiple covariates
np.random.seed(42)
n = 200
# Covariates
age = np.random.normal(60, 10, n)
treatment = np.random.binomial(1, 0.5, n)
stage = np.random.choice([1, 2, 3], size=n, p=[0.3, 0.4, 0.3])
# True hazard ratios
# age: HR = 1.03 per year
# treatment: HR = 0.6 (protective)
# stage: HR = 1.5 per stage
true_log_hazard = 0.03*age - 0.5*treatment + 0.4*stage
# Generate survival times
baseline_hazard = 0.01
u = np.random.uniform(0, 1, n)
times = -np.log(u) / (baseline_hazard * np.exp(true_log_hazard))
# Add censoring
censoring_times = np.random.exponential(scale=50, size=n)
observed_times = np.minimum(times, censoring_times)
event_observed = (times <= censoring_times).astype(int)
# Create DataFrame
df_cox = pd.DataFrame({
'time': observed_times,
'event': event_observed,
'age': age,
'treatment': treatment,
'stage': stage
})
print("π Dataset:")
print(f" N = {n}")
print(f" Events: {event_observed.sum()}")
print(f" Censored: {n - event_observed.sum()}")
print(f"\n{df_cox.head()}")
# Fit Cox model
cph = CoxPHFitter()
cph.fit(df_cox, duration_col='time', event_col='event')
# Summary
print("\n" + "="*70)
print("Cox Proportional Hazards Model")
print("="*70)
print(cph.summary)
# Hazard ratios
print("\nπ Hazard Ratios (exp(coef)):")
print("="*70)
for var in ['age', 'treatment', 'stage']:
hr = np.exp(cph.params_[var])
ci_lower = np.exp(cph.confidence_intervals_.loc[var, '95% lower-bound'])
ci_upper = np.exp(cph.confidence_intervals_.loc[var, '95% upper-bound'])
p_value = cph.summary.loc[var, 'p']
print(f" {var:12s}: HR = {hr:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}], p = {p_value:.4f}")
if var == 'age':
print(f" β 1 year older: {(hr-1)*100:.1f}% increase in hazard")
elif var == 'treatment':
if hr < 1:
print(f" β Treatment reduces hazard by {(1-hr)*100:.1f}%")
else:
print(f" β Treatment increases hazard by {(hr-1)*100:.1f}%")
elif var == 'stage':
print(f" β Each stage increase: {(hr-1)*100:.1f}% increase in hazard")
print(f"\nπ Model Performance:")
print(f" Concordance index: {cph.concordance_index_:.4f}")
print(" (0.5 = random, 1.0 = perfect predictions)")
# Visualize Cox Model Results
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# 1. Coefficient plot
cph.plot(ax=axes[0])
axes[0].set_title('Coefficient Estimates with 95% CI')
axes[0].axvline(0, color='red', linestyle='--', alpha=0.5, label='No effect')
axes[0].legend()
# 2. Baseline survival
cph.baseline_survival_.plot(ax=axes[1], legend=False)
axes[1].set_xlabel('Time')
axes[1].set_ylabel('Baseline Survival')
axes[1].set_title('Estimated Baseline Survival Function')
axes[1].grid(True, alpha=0.3)
# 3. Survival curves for different profiles
# Create example profiles
profiles = pd.DataFrame([
{'age': 50, 'treatment': 1, 'stage': 1},
{'age': 60, 'treatment': 1, 'stage': 2},
{'age': 70, 'treatment': 0, 'stage': 3}
])
for idx, row in profiles.iterrows():
surv = cph.predict_survival_function(row.to_frame().T)
surv.plot(ax=axes[2], label=f"Age={int(row['age'])}, Tx={int(row['treatment'])}, Stage={int(row['stage'])}")
axes[2].set_xlabel('Time')
axes[2].set_ylabel('Survival Probability')
axes[2].set_title('Predicted Survival for Different Profiles')
axes[2].legend(fontsize=8)
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nπ‘ Interpretation:")
print(" β’ Coefficients > 0: Increased hazard (worse survival)")
print(" β’ Coefficients < 0: Decreased hazard (better survival)")
print(" β’ CI crossing 0: Not significant")
11.4 Real Dataset: Rossi Recidivism DataΒΆ
Dataset DescriptionΒΆ
Study: Criminal recidivism (re-arrest)
Subjects: 432 male prisoners
Event: Re-arrest
Time: Weeks until re-arrest or end of study
Covariates: Financial aid, age, race, work experience, etc.
Research QuestionΒΆ
What factors affect time to recidivism?
# Load Rossi dataset
rossi = load_rossi()
print("π Rossi Recidivism Dataset:")
print(f" Shape: {rossi.shape}")
print(f"\nColumns:\n{rossi.columns.tolist()}")
print(f"\nFirst few rows:\n{rossi.head()}")
print(f"\nSummary Statistics:")
print(rossi.describe())
# Event summary
print(f"\nπ Event Summary:")
print(f" Total subjects: {len(rossi)}")
print(f" Re-arrested: {rossi['arrest'].sum()}")
print(f" Not re-arrested (censored): {len(rossi) - rossi['arrest'].sum()}")
print(f" Arrest rate: {rossi['arrest'].mean()*100:.1f}%")
# KM curves by financial aid
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# By financial aid
kmf_fin_yes = KaplanMeierFitter()
kmf_fin_no = KaplanMeierFitter()
fin_yes = rossi[rossi['fin'] == 1]
fin_no = rossi[rossi['fin'] == 0]
kmf_fin_yes.fit(fin_yes['week'], fin_yes['arrest'], label='Financial Aid')
kmf_fin_no.fit(fin_no['week'], fin_no['arrest'], label='No Financial Aid')
kmf_fin_yes.plot_survival_function(ax=ax1)
kmf_fin_no.plot_survival_function(ax=ax1)
ax1.set_xlabel('Weeks')
ax1.set_ylabel('Survival (Not Re-arrested)')
ax1.set_title('Survival by Financial Aid Status')
ax1.grid(True, alpha=0.3)
# Log-rank test
results_fin = logrank_test(fin_yes['week'], fin_no['week'],
fin_yes['arrest'], fin_no['arrest'])
ax1.text(0.05, 0.05, f'Log-rank p = {results_fin.p_value:.4f}',
transform=ax1.transAxes, fontsize=10,
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
# By race
kmf_black = KaplanMeierFitter()
kmf_other = KaplanMeierFitter()
black = rossi[rossi['race'] == 1]
other = rossi[rossi['race'] == 0]
kmf_black.fit(black['week'], black['arrest'], label='Black')
kmf_other.fit(other['week'], other['arrest'], label='Other')
kmf_black.plot_survival_function(ax=ax2)
kmf_other.plot_survival_function(ax=ax2)
ax2.set_xlabel('Weeks')
ax2.set_ylabel('Survival (Not Re-arrested)')
ax2.set_title('Survival by Race')
ax2.grid(True, alpha=0.3)
results_race = logrank_test(black['week'], other['week'],
black['arrest'], other['arrest'])
ax2.text(0.05, 0.05, f'Log-rank p = {results_race.p_value:.4f}',
transform=ax2.transAxes, fontsize=10,
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.tight_layout()
plt.show()
# Fit Cox model to Rossi data
cph_rossi = CoxPHFitter()
cph_rossi.fit(rossi, duration_col='week', event_col='arrest')
print("\n" + "="*80)
print("Cox Model: Rossi Recidivism Data")
print("="*80)
print(cph_rossi.summary)
print("\nπ Key Findings (Hazard Ratios):")
print("="*80)
# Select key variables
key_vars = ['fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio']
var_names = {
'fin': 'Financial Aid',
'age': 'Age',
'race': 'Race (Black)',
'wexp': 'Work Experience',
'mar': 'Married',
'paro': 'Paroled',
'prio': 'Prior Convictions'
}
for var in key_vars:
if var in cph_rossi.params_.index:
hr = np.exp(cph_rossi.params_[var])
p_val = cph_rossi.summary.loc[var, 'p']
sig = '***' if p_val < 0.001 else '**' if p_val < 0.01 else '*' if p_val < 0.05 else ''
print(f" {var_names[var]:25s}: HR = {hr:.3f} (p = {p_val:.4f}) {sig}")
if hr < 1:
print(f" {'':25s} β Reduces re-arrest risk by {(1-hr)*100:.1f}%")
else:
print(f" {'':25s} β Increases re-arrest risk by {(hr-1)*100:.1f}%")
print("\n Significance: *** p<0.001, ** p<0.01, * p<0.05")
print(f"\nπ Model Concordance: {cph_rossi.concordance_index_:.4f}")
# Visualize coefficients
fig, ax = plt.subplots(figsize=(10, 6))
cph_rossi.plot(ax=ax)
ax.set_title('Cox Model Coefficients: Rossi Dataset')
ax.axvline(0, color='red', linestyle='--', alpha=0.5, label='No effect')
plt.tight_layout()
plt.show()
print("\nπ‘ Interpretation:")
print(" β’ Financial aid β Lower re-arrest risk (protective)")
print(" β’ Older age β Lower re-arrest risk")
print(" β’ More prior convictions β Higher re-arrest risk")
print(" β’ Being married β Lower re-arrest risk")
Key TakeawaysΒΆ
When to Use Survival AnalysisΒΆ
Perfect For:
β
Time-to-event data
β
Censored observations present
β
Want to estimate survival curves
β
Compare groups over time
β
Multiple predictors of survival
Not Appropriate:
β No censoring (use regular regression)
β Event can occur multiple times (use recurrent event models)
β Competing risks (use competing risk models)
Method Selection GuideΒΆ
Situation |
Method |
Why |
|---|---|---|
Single group, describe survival |
Kaplan-Meier |
Simple, visual, no assumptions |
Compare 2+ groups |
KM + Log-Rank Test |
Non-parametric comparison |
Multiple continuous/categorical predictors |
Cox PH Model |
Handle many covariates |
Non-proportional hazards |
Stratified Cox or time-varying |
Relax PH assumption |
Predict individual survival |
Cox Model |
Get personalized curves |
Kaplan-Meier: Best PracticesΒΆ
β
Always show confidence intervals
β
Report median survival time
β
Show number at risk over time
β
Check for adequate follow-up
β
Use for exploratory analysis
Cox Model: Best PracticesΒΆ
β Check proportional hazards assumption
cph.check_assumptions(rossi, p_value_threshold=0.05)
β Report hazard ratios, not coefficients
HR = np.exp(cph.params_)
β Include confidence intervals
β Assess model fit
concordance_index # 0.5 = random, 1.0 = perfect
β Check for influential observations
Interpreting Hazard RatiosΒΆ
Binary Predictor (0/1):
HR = 2.0 β Group 1 has 2Γ the hazard of group 0
HR = 0.5 β Group 1 has half the hazard of group 0
Continuous Predictor:
HR = 1.05 β 5% increase in hazard per unit increase
HR = 0.95 β 5% decrease in hazard per unit increase
10-unit increase: $\(HR_{10} = (HR_1)^{10}\)$
Common PitfallsΒΆ
β Ignoring censoring β Biased estimates
β Not checking PH assumption β Invalid Cox model
β Confusing survival with hazard β Wrong interpretation
β Small sample with many covariates β Overfitting
β Not reporting CI β Canβt assess uncertainty
β Using inappropriate time scale β Misleading results
Censoring TypesΒΆ
Right Censoring (most common):
Study ends before event
Loss to follow-up
Handles correctly in KM and Cox
Informative Censoring:
Censoring related to outcome
Violates assumptions!
Example: Sicker patients drop out
ExtensionsΒΆ
Time-Varying Covariates:
Covariates that change over time
Example: Treatment changes during study
Stratified Cox Model:
Different baseline hazards by strata
Use when PH assumption violated for some variables
Competing Risks:
Multiple types of events possible
Example: Death from different causes
Recurrent Events:
Event can happen multiple times
Example: Hospital readmissions
Software RecommendationsΒΆ
Python:
lifelines: Most comprehensivescikit-survival: sklearn-compatible
R:
survival: Standard packagesurvminer: Visualization
Next ChapterΒΆ
Chapter 12: Unsupervised Learning
Principal Component Analysis (PCA)
K-Means Clustering
Hierarchical Clustering
Matrix Completion
Practice ExercisesΒΆ
Exercise 1: Simulating Censored DataΒΆ
Generate survival times from exponential distribution
Generate censoring times independently
Create observed times and event indicators
Fit Kaplan-Meier and verify estimates
Vary censoring proportion (0%, 25%, 50%, 75%)
Compare bias in estimates
Exercise 2: Log-Rank Power AnalysisΒΆ
Simulate two groups with different survival (HR = 0.7)
Vary sample sizes: 20, 50, 100, 200 per group
Run 1000 simulations for each
Calculate power (proportion of p < 0.05)
Plot power curve
Determine minimum sample size for 80% power
Exercise 3: Cox Model with InteractionsΒΆ
Using Rossi data:
Fit Cox model with interaction: age Γ financial aid
Test if effect of financial aid varies by age
Visualize: survival curves for young vs old, by aid status
Interpret interaction term
Exercise 4: Proportional Hazards CheckΒΆ
Fit Cox model to Rossi data
Check PH assumption using:
Schoenfeld residuals
Log-log plots
Statistical tests
Identify variables violating assumption
Refit with stratification if needed
Exercise 5: Time-Varying EffectsΒΆ
Create dataset where treatment effect changes over time
Fit standard Cox model (misspecified)
Fit Cox with time-varying coefficient
Compare models
Visualize how HR changes over time
Exercise 6: Real Data AnalysisΒΆ
Find survival dataset (medical, engineering, social):
Perform exploratory analysis with KM curves
Compare groups with log-rank tests
Fit Cox model with multiple predictors
Check assumptions
Report findings with visualizations
Provide clinical/practical interpretation