Run this notebook: Open in Colab Open in Kaggle

Chapter 11: Survival Analysis and Censored Data¶

Overview¶

Survival Analysis studies time until an event occurs

Applications¶

Medicine: Time until death, disease recurrence
Engineering: Time until machine failure
Business: Customer churn, employee retention
Social Science: Time until employment, marriage

Key Challenge: Censored Data¶

Censoring: We don’t observe the event for all subjects

Types:

Right Censoring (most common):
- Study ends before event occurs
- Subject drops out
- We know: “event time > observed time”
Left Censoring:
- Event occurred before observation started
- Less common
Interval Censoring:
- Event occurred within a time interval

Why Not Just Use Regression?¶

❌ Ignoring censored data → Biased estimates
❌ Removing censored data → Loss of information
✅ Survival methods → Properly handle censoring

Key Concepts¶

Survival Function¶

\[S(t) = P(T > t)\]

Probability of surviving beyond time $t$

Properties:

$S(0) = 1$ (everyone alive at start)
$S(\infty) = 0$ (eventually all experience event)
Decreasing function

Hazard Function¶

\[h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t | T \geq t)}{\Delta t}\]

Interpretation:

Instantaneous risk of event at time $t$
Given survival up to time $t$

Cumulative Hazard¶

\[H(t) = \int_0^t h(u) du = -\log S(t)\]

Methods Covered¶

Kaplan-Meier Estimator: Non-parametric survival curves
Log-Rank Test: Compare survival between groups
Cox Proportional Hazards: Semi-parametric regression
Time-Varying Coefficients: Extensions

11.1 Kaplan-Meier Estimator¶

Non-Parametric Survival Estimation¶

Kaplan-Meier (KM) Formula: $$\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$

where:

$t_i$ = distinct event times
$d_i$ = number of events at $t_i$
$n_i$ = number at risk at $t_i$ (not yet experienced event or censored)

Key Features¶

✅ No distributional assumptions
✅ Handles censored data properly
✅ Produces survival curve
✅ Confidence intervals available

Median Survival Time¶

Time when $S(t) = 0.5$ (50% survival)

# Kaplan-Meier Example: Brain Cancer Data
# Generate sample survival data
np.random.seed(42)
n = 80

# Simulate survival times (exponential distribution)
times = np.random.exponential(scale=20, size=n)
# Simulate censoring (30% censored)
censoring_times = np.random.exponential(scale=25, size=n)
observed_times = np.minimum(times, censoring_times)
event_observed = (times <= censoring_times).astype(int)

# Create DataFrame
df_surv = pd.DataFrame({
    'time': observed_times,
    'event': event_observed
})

print("📊 Dataset Summary:")
print(f"   Total subjects: {n}")
print(f"   Events observed: {event_observed.sum()}")
print(f"   Censored: {n - event_observed.sum()}")
print(f"   Censoring rate: {(1 - event_observed.mean())*100:.1f}%")

# Fit Kaplan-Meier
kmf = KaplanMeierFitter()
kmf.fit(df_surv['time'], df_surv['event'], label='Overall')

# Plot survival curve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Survival function
kmf.plot_survival_function(ax=ax1, ci_show=True)
ax1.set_xlabel('Time')
ax1.set_ylabel('Survival Probability')
ax1.set_title('Kaplan-Meier Survival Curve')
ax1.grid(True, alpha=0.3)

# Add median survival line
median_surv = kmf.median_survival_time_
ax1.axhline(0.5, color='red', linestyle='--', alpha=0.5, label=f'Median')
ax1.axvline(median_surv, color='red', linestyle='--', alpha=0.5)
ax1.text(median_surv + 1, 0.55, f'Median: {median_surv:.1f}', fontsize=10)
ax1.legend()

# Cumulative hazard
kmf.plot_cumulative_density(ax=ax2, ci_show=True)
ax2.set_xlabel('Time')
ax2.set_ylabel('Cumulative Event Probability')
ax2.set_title('Cumulative Incidence (1 - Survival)')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n📈 Survival Statistics:")
print(f"   Median survival time: {median_surv:.2f}")
print(f"   Survival at t=10: {kmf.predict(10):.3f}")
print(f"   Survival at t=20: {kmf.predict(20):.3f}")
print(f"   Survival at t=30: {kmf.predict(30):.3f}")

11.2 Comparing Survival Curves: Log-Rank Test¶

Hypothesis Test¶

H₀: Survival distributions are the same across groups
H₁: Survival distributions differ

Log-Rank Statistic¶

Compares observed vs expected events at each time point

\[\chi^2 = \frac{(\sum O_i - \sum E_i)^2}{\sum V_i}\]

where:

$O_i$ = observed events in group
$E_i$ = expected events under H₀
$V_i$ = variance

When to Use¶

✅ Compare 2+ groups
✅ Non-parametric (no assumptions)
✅ Handles censoring

p-value < 0.05 → Significant difference¶

# Compare Survival by Treatment Group
np.random.seed(42)
n_per_group = 40

# Group 1: Control (worse survival)
times_control = np.random.exponential(scale=15, size=n_per_group)
censoring_control = np.random.exponential(scale=20, size=n_per_group)

# Group 2: Treatment (better survival)
times_treatment = np.random.exponential(scale=25, size=n_per_group)
censoring_treatment = np.random.exponential(scale=30, size=n_per_group)

# Create DataFrame
df_groups = pd.DataFrame({
    'time': np.concatenate([np.minimum(times_control, censoring_control),
                           np.minimum(times_treatment, censoring_treatment)]),
    'event': np.concatenate([(times_control <= censoring_control).astype(int),
                            (times_treatment <= censoring_treatment).astype(int)]),
    'group': ['Control']*n_per_group + ['Treatment']*n_per_group
})

# Fit KM for each group
kmf_control = KaplanMeierFitter()
kmf_treatment = KaplanMeierFitter()

control_data = df_groups[df_groups['group'] == 'Control']
treatment_data = df_groups[df_groups['group'] == 'Treatment']

kmf_control.fit(control_data['time'], control_data['event'], label='Control')
kmf_treatment.fit(treatment_data['time'], treatment_data['event'], label='Treatment')

# Log-rank test
results = logrank_test(
    control_data['time'], treatment_data['time'],
    control_data['event'], treatment_data['event']
)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
kmf_control.plot_survival_function(ax=ax, ci_show=True)
kmf_treatment.plot_survival_function(ax=ax, ci_show=True)
ax.set_xlabel('Time')
ax.set_ylabel('Survival Probability')
ax.set_title(f'Kaplan-Meier Curves by Treatment Group\n'
            f'Log-Rank Test: p-value = {results.p_value:.4f}')
ax.grid(True, alpha=0.3)
ax.legend()
plt.show()

print("\n📊 Log-Rank Test Results:")
print(f"   Test statistic: {results.test_statistic:.4f}")
print(f"   p-value: {results.p_value:.4f}")
if results.p_value < 0.05:
    print("   ✅ Significant difference between groups (p < 0.05)")
else:
    print("   ❌ No significant difference (p >= 0.05)")

print(f"\n📈 Median Survival Times:")
print(f"   Control: {kmf_control.median_survival_time_:.2f}")
print(f"   Treatment: {kmf_treatment.median_survival_time_:.2f}")
print(f"   Difference: {kmf_treatment.median_survival_time_ - kmf_control.median_survival_time_:.2f}")

11.3 Cox Proportional Hazards Model¶

Semi-Parametric Regression¶

Cox Model: $$h(t|X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)$$

Components:

$h_0(t)$ = baseline hazard (unspecified, non-parametric)
$\exp(\beta^T X)$ = relative risk (parametric)

Hazard Ratio (HR)¶

\[HR = \frac{h(t|X_1)}{h(t|X_0)} = \exp(\beta(X_1 - X_0))\]

Interpretation:

$HR = 1$: No effect
$HR > 1$: Increased risk (bad)
$HR < 1$: Decreased risk (good)

For binary variable (0/1): $$HR = e^\beta$$

For continuous variable:

1-unit increase → hazard multiplied by $e^\beta$

Proportional Hazards Assumption¶

Key Assumption: Hazard ratio is constant over time

\[\frac{h(t|X_1)}{h(t|X_0)} = \text{constant}\]

If violated → need time-varying coefficients

Advantages¶

✅ Handles censoring
✅ Multiple covariates
✅ No baseline hazard assumption
✅ Interpretable hazard ratios
✅ Flexible

Model Fitting¶

Partial Likelihood (not full likelihood)

Only uses ordering of events
Doesn’t need baseline hazard
Maximized via Newton-Raphson

# Cox Proportional Hazards Model
# Generate data with multiple covariates
np.random.seed(42)
n = 200

# Covariates
age = np.random.normal(60, 10, n)
treatment = np.random.binomial(1, 0.5, n)
stage = np.random.choice([1, 2, 3], size=n, p=[0.3, 0.4, 0.3])

# True hazard ratios
# age: HR = 1.03 per year
# treatment: HR = 0.6 (protective)
# stage: HR = 1.5 per stage
true_log_hazard = 0.03*age - 0.5*treatment + 0.4*stage

# Generate survival times
baseline_hazard = 0.01
u = np.random.uniform(0, 1, n)
times = -np.log(u) / (baseline_hazard * np.exp(true_log_hazard))

# Add censoring
censoring_times = np.random.exponential(scale=50, size=n)
observed_times = np.minimum(times, censoring_times)
event_observed = (times <= censoring_times).astype(int)

# Create DataFrame
df_cox = pd.DataFrame({
    'time': observed_times,
    'event': event_observed,
    'age': age,
    'treatment': treatment,
    'stage': stage
})

print("📊 Dataset:")
print(f"   N = {n}")
print(f"   Events: {event_observed.sum()}")
print(f"   Censored: {n - event_observed.sum()}")
print(f"\n{df_cox.head()}")

# Fit Cox model
cph = CoxPHFitter()
cph.fit(df_cox, duration_col='time', event_col='event')

# Summary
print("\n" + "="*70)
print("Cox Proportional Hazards Model")
print("="*70)
print(cph.summary)

# Hazard ratios
print("\n📈 Hazard Ratios (exp(coef)):")
print("="*70)
for var in ['age', 'treatment', 'stage']:
    hr = np.exp(cph.params_[var])
    ci_lower = np.exp(cph.confidence_intervals_.loc[var, '95% lower-bound'])
    ci_upper = np.exp(cph.confidence_intervals_.loc[var, '95% upper-bound'])
    p_value = cph.summary.loc[var, 'p']
    
    print(f"   {var:12s}: HR = {hr:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}], p = {p_value:.4f}")
    
    if var == 'age':
        print(f"                 → 1 year older: {(hr-1)*100:.1f}% increase in hazard")
    elif var == 'treatment':
        if hr < 1:
            print(f"                 → Treatment reduces hazard by {(1-hr)*100:.1f}%")
        else:
            print(f"                 → Treatment increases hazard by {(hr-1)*100:.1f}%")
    elif var == 'stage':
        print(f"                 → Each stage increase: {(hr-1)*100:.1f}% increase in hazard")

print(f"\n📊 Model Performance:")
print(f"   Concordance index: {cph.concordance_index_:.4f}")
print("   (0.5 = random, 1.0 = perfect predictions)")

# Visualize Cox Model Results
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# 1. Coefficient plot
cph.plot(ax=axes[0])
axes[0].set_title('Coefficient Estimates with 95% CI')
axes[0].axvline(0, color='red', linestyle='--', alpha=0.5, label='No effect')
axes[0].legend()

# 2. Baseline survival
cph.baseline_survival_.plot(ax=axes[1], legend=False)
axes[1].set_xlabel('Time')
axes[1].set_ylabel('Baseline Survival')
axes[1].set_title('Estimated Baseline Survival Function')
axes[1].grid(True, alpha=0.3)

# 3. Survival curves for different profiles
# Create example profiles
profiles = pd.DataFrame([
    {'age': 50, 'treatment': 1, 'stage': 1},
    {'age': 60, 'treatment': 1, 'stage': 2},
    {'age': 70, 'treatment': 0, 'stage': 3}
])

for idx, row in profiles.iterrows():
    surv = cph.predict_survival_function(row.to_frame().T)
    surv.plot(ax=axes[2], label=f"Age={int(row['age'])}, Tx={int(row['treatment'])}, Stage={int(row['stage'])}")

axes[2].set_xlabel('Time')
axes[2].set_ylabel('Survival Probability')
axes[2].set_title('Predicted Survival for Different Profiles')
axes[2].legend(fontsize=8)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("   • Coefficients > 0: Increased hazard (worse survival)")
print("   • Coefficients < 0: Decreased hazard (better survival)")
print("   • CI crossing 0: Not significant")

11.4 Real Dataset: Rossi Recidivism Data¶

Dataset Description¶

Study: Criminal recidivism (re-arrest)
Subjects: 432 male prisoners
Event: Re-arrest
Time: Weeks until re-arrest or end of study
Covariates: Financial aid, age, race, work experience, etc.

Research Question¶

What factors affect time to recidivism?

# Load Rossi dataset
rossi = load_rossi()

print("📊 Rossi Recidivism Dataset:")
print(f"   Shape: {rossi.shape}")
print(f"\nColumns:\n{rossi.columns.tolist()}")
print(f"\nFirst few rows:\n{rossi.head()}")
print(f"\nSummary Statistics:")
print(rossi.describe())

# Event summary
print(f"\n📈 Event Summary:")
print(f"   Total subjects: {len(rossi)}")
print(f"   Re-arrested: {rossi['arrest'].sum()}")
print(f"   Not re-arrested (censored): {len(rossi) - rossi['arrest'].sum()}")
print(f"   Arrest rate: {rossi['arrest'].mean()*100:.1f}%")

# KM curves by financial aid
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# By financial aid
kmf_fin_yes = KaplanMeierFitter()
kmf_fin_no = KaplanMeierFitter()

fin_yes = rossi[rossi['fin'] == 1]
fin_no = rossi[rossi['fin'] == 0]

kmf_fin_yes.fit(fin_yes['week'], fin_yes['arrest'], label='Financial Aid')
kmf_fin_no.fit(fin_no['week'], fin_no['arrest'], label='No Financial Aid')

kmf_fin_yes.plot_survival_function(ax=ax1)
kmf_fin_no.plot_survival_function(ax=ax1)
ax1.set_xlabel('Weeks')
ax1.set_ylabel('Survival (Not Re-arrested)')
ax1.set_title('Survival by Financial Aid Status')
ax1.grid(True, alpha=0.3)

# Log-rank test
results_fin = logrank_test(fin_yes['week'], fin_no['week'],
                          fin_yes['arrest'], fin_no['arrest'])
ax1.text(0.05, 0.05, f'Log-rank p = {results_fin.p_value:.4f}',
        transform=ax1.transAxes, fontsize=10,
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# By race
kmf_black = KaplanMeierFitter()
kmf_other = KaplanMeierFitter()

black = rossi[rossi['race'] == 1]
other = rossi[rossi['race'] == 0]

kmf_black.fit(black['week'], black['arrest'], label='Black')
kmf_other.fit(other['week'], other['arrest'], label='Other')

kmf_black.plot_survival_function(ax=ax2)
kmf_other.plot_survival_function(ax=ax2)
ax2.set_xlabel('Weeks')
ax2.set_ylabel('Survival (Not Re-arrested)')
ax2.set_title('Survival by Race')
ax2.grid(True, alpha=0.3)

results_race = logrank_test(black['week'], other['week'],
                           black['arrest'], other['arrest'])
ax2.text(0.05, 0.05, f'Log-rank p = {results_race.p_value:.4f}',
        transform=ax2.transAxes, fontsize=10,
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

# Fit Cox model to Rossi data
cph_rossi = CoxPHFitter()
cph_rossi.fit(rossi, duration_col='week', event_col='arrest')

print("\n" + "="*80)
print("Cox Model: Rossi Recidivism Data")
print("="*80)
print(cph_rossi.summary)

print("\n📊 Key Findings (Hazard Ratios):")
print("="*80)

# Select key variables
key_vars = ['fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio']
var_names = {
    'fin': 'Financial Aid',
    'age': 'Age',
    'race': 'Race (Black)',
    'wexp': 'Work Experience',
    'mar': 'Married',
    'paro': 'Paroled',
    'prio': 'Prior Convictions'
}

for var in key_vars:
    if var in cph_rossi.params_.index:
        hr = np.exp(cph_rossi.params_[var])
        p_val = cph_rossi.summary.loc[var, 'p']
        sig = '***' if p_val < 0.001 else '**' if p_val < 0.01 else '*' if p_val < 0.05 else ''
        
        print(f"   {var_names[var]:25s}: HR = {hr:.3f}  (p = {p_val:.4f}) {sig}")
        
        if hr < 1:
            print(f"   {'':25s}  → Reduces re-arrest risk by {(1-hr)*100:.1f}%")
        else:
            print(f"   {'':25s}  → Increases re-arrest risk by {(hr-1)*100:.1f}%")

print("\n   Significance: *** p<0.001, ** p<0.01, * p<0.05")
print(f"\n📈 Model Concordance: {cph_rossi.concordance_index_:.4f}")

# Visualize coefficients
fig, ax = plt.subplots(figsize=(10, 6))
cph_rossi.plot(ax=ax)
ax.set_title('Cox Model Coefficients: Rossi Dataset')
ax.axvline(0, color='red', linestyle='--', alpha=0.5, label='No effect')
plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("   • Financial aid → Lower re-arrest risk (protective)")
print("   • Older age → Lower re-arrest risk")
print("   • More prior convictions → Higher re-arrest risk")
print("   • Being married → Lower re-arrest risk")

Key Takeaways¶

When to Use Survival Analysis¶

Perfect For: ✅ Time-to-event data
✅ Censored observations present
✅ Want to estimate survival curves
✅ Compare groups over time
✅ Multiple predictors of survival

Not Appropriate: ❌ No censoring (use regular regression)
❌ Event can occur multiple times (use recurrent event models)
❌ Competing risks (use competing risk models)

Method Selection Guide¶

Situation	Method	Why
Single group, describe survival	Kaplan-Meier	Simple, visual, no assumptions
Compare 2+ groups	KM + Log-Rank Test	Non-parametric comparison
Multiple continuous/categorical predictors	Cox PH Model	Handle many covariates
Non-proportional hazards	Stratified Cox or time-varying	Relax PH assumption
Predict individual survival	Cox Model	Get personalized curves

Kaplan-Meier: Best Practices¶

✅ Always show confidence intervals
✅ Report median survival time
✅ Show number at risk over time
✅ Check for adequate follow-up
✅ Use for exploratory analysis

Cox Model: Best Practices¶

✅ Check proportional hazards assumption

cph.check_assumptions(rossi, p_value_threshold=0.05)

✅ Report hazard ratios, not coefficients

HR = np.exp(cph.params_)

✅ Include confidence intervals

✅ Assess model fit

concordance_index  # 0.5 = random, 1.0 = perfect

✅ Check for influential observations

Interpreting Hazard Ratios¶

Binary Predictor (0/1):

HR = 2.0 → Group 1 has 2× the hazard of group 0
HR = 0.5 → Group 1 has half the hazard of group 0

Continuous Predictor:

HR = 1.05 → 5% increase in hazard per unit increase
HR = 0.95 → 5% decrease in hazard per unit increase

10-unit increase: $$HR_{10} = (HR_1)^{10}$$

Common Pitfalls¶

❌ Ignoring censoring → Biased estimates
❌ Not checking PH assumption → Invalid Cox model
❌ Confusing survival with hazard → Wrong interpretation
❌ Small sample with many covariates → Overfitting
❌ Not reporting CI → Can’t assess uncertainty
❌ Using inappropriate time scale → Misleading results

Censoring Types¶

Right Censoring (most common):

Study ends before event
Loss to follow-up
Handles correctly in KM and Cox

Informative Censoring:

Censoring related to outcome
Violates assumptions!
Example: Sicker patients drop out

Extensions¶

Time-Varying Covariates:

Covariates that change over time
Example: Treatment changes during study

Stratified Cox Model:

Different baseline hazards by strata
Use when PH assumption violated for some variables

Competing Risks:

Multiple types of events possible
Example: Death from different causes

Recurrent Events:

Event can happen multiple times
Example: Hospital readmissions

Software Recommendations¶

Python:

lifelines: Most comprehensive
scikit-survival: sklearn-compatible

survival: Standard package
survminer: Visualization

Next Chapter¶

Chapter 12: Unsupervised Learning

Principal Component Analysis (PCA)
K-Means Clustering
Hierarchical Clustering
Matrix Completion

Practice Exercises¶

Exercise 1: Simulating Censored Data¶

Generate survival times from exponential distribution
Generate censoring times independently
Create observed times and event indicators
Fit Kaplan-Meier and verify estimates
Vary censoring proportion (0%, 25%, 50%, 75%)
Compare bias in estimates

Exercise 2: Log-Rank Power Analysis¶

Simulate two groups with different survival (HR = 0.7)
Vary sample sizes: 20, 50, 100, 200 per group
Run 1000 simulations for each
Calculate power (proportion of p < 0.05)
Plot power curve
Determine minimum sample size for 80% power

Exercise 3: Cox Model with Interactions¶

Using Rossi data:

Fit Cox model with interaction: age × financial aid
Test if effect of financial aid varies by age
Visualize: survival curves for young vs old, by aid status
Interpret interaction term

Exercise 4: Proportional Hazards Check¶

Fit Cox model to Rossi data
Check PH assumption using:
- Schoenfeld residuals
- Log-log plots
- Statistical tests
Identify variables violating assumption
Refit with stratification if needed

Exercise 5: Time-Varying Effects¶

Create dataset where treatment effect changes over time
Fit standard Cox model (misspecified)
Fit Cox with time-varying coefficient
Compare models
Visualize how HR changes over time

Exercise 6: Real Data Analysis¶

Find survival dataset (medical, engineering, social):

Perform exploratory analysis with KM curves
Compare groups with log-rank tests
Fit Cox model with multiple predictors
Check assumptions
Report findings with visualizations
Provide clinical/practical interpretation