Chapter 11: Survival Analysis and Censored DataΒΆ

OverviewΒΆ

Survival Analysis studies time until an event occurs

ApplicationsΒΆ

  • Medicine: Time until death, disease recurrence

  • Engineering: Time until machine failure

  • Business: Customer churn, employee retention

  • Social Science: Time until employment, marriage

Key Challenge: Censored DataΒΆ

Censoring: We don’t observe the event for all subjects

Types:

  1. Right Censoring (most common):

    • Study ends before event occurs

    • Subject drops out

    • We know: β€œevent time > observed time”

  2. Left Censoring:

    • Event occurred before observation started

    • Less common

  3. Interval Censoring:

    • Event occurred within a time interval

Why Not Just Use Regression?ΒΆ

❌ Ignoring censored data β†’ Biased estimates
❌ Removing censored data β†’ Loss of information
βœ… Survival methods β†’ Properly handle censoring

Key ConceptsΒΆ

Survival FunctionΒΆ

\[S(t) = P(T > t)\]

Probability of surviving beyond time \(t\)

Properties:

  • \(S(0) = 1\) (everyone alive at start)

  • \(S(\infty) = 0\) (eventually all experience event)

  • Decreasing function

Hazard FunctionΒΆ

\[h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t | T \geq t)}{\Delta t}\]

Interpretation:

  • Instantaneous risk of event at time \(t\)

  • Given survival up to time \(t\)

Cumulative HazardΒΆ

\[H(t) = \int_0^t h(u) du = -\log S(t)\]

Methods CoveredΒΆ

  1. Kaplan-Meier Estimator: Non-parametric survival curves

  2. Log-Rank Test: Compare survival between groups

  3. Cox Proportional Hazards: Semi-parametric regression

  4. Time-Varying Coefficients: Extensions

11.1 Kaplan-Meier EstimatorΒΆ

Non-Parametric Survival EstimationΒΆ

Kaplan-Meier (KM) Formula: $\(\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)\)$

where:

  • \(t_i\) = distinct event times

  • \(d_i\) = number of events at \(t_i\)

  • \(n_i\) = number at risk at \(t_i\) (not yet experienced event or censored)

Key FeaturesΒΆ

βœ… No distributional assumptions
βœ… Handles censored data properly
βœ… Produces survival curve
βœ… Confidence intervals available

Median Survival TimeΒΆ

Time when \(S(t) = 0.5\) (50% survival)

# Kaplan-Meier Example: Brain Cancer Data
# Generate sample survival data
np.random.seed(42)
n = 80

# Simulate survival times (exponential distribution)
times = np.random.exponential(scale=20, size=n)
# Simulate censoring (30% censored)
censoring_times = np.random.exponential(scale=25, size=n)
observed_times = np.minimum(times, censoring_times)
event_observed = (times <= censoring_times).astype(int)

# Create DataFrame
df_surv = pd.DataFrame({
    'time': observed_times,
    'event': event_observed
})

print("πŸ“Š Dataset Summary:")
print(f"   Total subjects: {n}")
print(f"   Events observed: {event_observed.sum()}")
print(f"   Censored: {n - event_observed.sum()}")
print(f"   Censoring rate: {(1 - event_observed.mean())*100:.1f}%")

# Fit Kaplan-Meier
kmf = KaplanMeierFitter()
kmf.fit(df_surv['time'], df_surv['event'], label='Overall')

# Plot survival curve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Survival function
kmf.plot_survival_function(ax=ax1, ci_show=True)
ax1.set_xlabel('Time')
ax1.set_ylabel('Survival Probability')
ax1.set_title('Kaplan-Meier Survival Curve')
ax1.grid(True, alpha=0.3)

# Add median survival line
median_surv = kmf.median_survival_time_
ax1.axhline(0.5, color='red', linestyle='--', alpha=0.5, label=f'Median')
ax1.axvline(median_surv, color='red', linestyle='--', alpha=0.5)
ax1.text(median_surv + 1, 0.55, f'Median: {median_surv:.1f}', fontsize=10)
ax1.legend()

# Cumulative hazard
kmf.plot_cumulative_density(ax=ax2, ci_show=True)
ax2.set_xlabel('Time')
ax2.set_ylabel('Cumulative Event Probability')
ax2.set_title('Cumulative Incidence (1 - Survival)')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nπŸ“ˆ Survival Statistics:")
print(f"   Median survival time: {median_surv:.2f}")
print(f"   Survival at t=10: {kmf.predict(10):.3f}")
print(f"   Survival at t=20: {kmf.predict(20):.3f}")
print(f"   Survival at t=30: {kmf.predict(30):.3f}")

11.2 Comparing Survival Curves: Log-Rank TestΒΆ

Hypothesis TestΒΆ

Hβ‚€: Survival distributions are the same across groups
H₁: Survival distributions differ

Log-Rank StatisticΒΆ

Compares observed vs expected events at each time point

\[\chi^2 = \frac{(\sum O_i - \sum E_i)^2}{\sum V_i}\]

where:

  • \(O_i\) = observed events in group

  • \(E_i\) = expected events under Hβ‚€

  • \(V_i\) = variance

When to UseΒΆ

βœ… Compare 2+ groups
βœ… Non-parametric (no assumptions)
βœ… Handles censoring

p-value < 0.05 β†’ Significant differenceΒΆ

# Compare Survival by Treatment Group
np.random.seed(42)
n_per_group = 40

# Group 1: Control (worse survival)
times_control = np.random.exponential(scale=15, size=n_per_group)
censoring_control = np.random.exponential(scale=20, size=n_per_group)

# Group 2: Treatment (better survival)
times_treatment = np.random.exponential(scale=25, size=n_per_group)
censoring_treatment = np.random.exponential(scale=30, size=n_per_group)

# Create DataFrame
df_groups = pd.DataFrame({
    'time': np.concatenate([np.minimum(times_control, censoring_control),
                           np.minimum(times_treatment, censoring_treatment)]),
    'event': np.concatenate([(times_control <= censoring_control).astype(int),
                            (times_treatment <= censoring_treatment).astype(int)]),
    'group': ['Control']*n_per_group + ['Treatment']*n_per_group
})

# Fit KM for each group
kmf_control = KaplanMeierFitter()
kmf_treatment = KaplanMeierFitter()

control_data = df_groups[df_groups['group'] == 'Control']
treatment_data = df_groups[df_groups['group'] == 'Treatment']

kmf_control.fit(control_data['time'], control_data['event'], label='Control')
kmf_treatment.fit(treatment_data['time'], treatment_data['event'], label='Treatment')

# Log-rank test
results = logrank_test(
    control_data['time'], treatment_data['time'],
    control_data['event'], treatment_data['event']
)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
kmf_control.plot_survival_function(ax=ax, ci_show=True)
kmf_treatment.plot_survival_function(ax=ax, ci_show=True)
ax.set_xlabel('Time')
ax.set_ylabel('Survival Probability')
ax.set_title(f'Kaplan-Meier Curves by Treatment Group\n'
            f'Log-Rank Test: p-value = {results.p_value:.4f}')
ax.grid(True, alpha=0.3)
ax.legend()
plt.show()

print("\nπŸ“Š Log-Rank Test Results:")
print(f"   Test statistic: {results.test_statistic:.4f}")
print(f"   p-value: {results.p_value:.4f}")
if results.p_value < 0.05:
    print("   βœ… Significant difference between groups (p < 0.05)")
else:
    print("   ❌ No significant difference (p >= 0.05)")

print(f"\nπŸ“ˆ Median Survival Times:")
print(f"   Control: {kmf_control.median_survival_time_:.2f}")
print(f"   Treatment: {kmf_treatment.median_survival_time_:.2f}")
print(f"   Difference: {kmf_treatment.median_survival_time_ - kmf_control.median_survival_time_:.2f}")

11.3 Cox Proportional Hazards ModelΒΆ

Semi-Parametric RegressionΒΆ

Cox Model: $\(h(t|X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\)$

Components:

  • \(h_0(t)\) = baseline hazard (unspecified, non-parametric)

  • \(\exp(\beta^T X)\) = relative risk (parametric)

Hazard Ratio (HR)ΒΆ

\[HR = \frac{h(t|X_1)}{h(t|X_0)} = \exp(\beta(X_1 - X_0))\]

Interpretation:

  • \(HR = 1\): No effect

  • \(HR > 1\): Increased risk (bad)

  • \(HR < 1\): Decreased risk (good)

For binary variable (0/1): $\(HR = e^\beta\)$

For continuous variable:

  • 1-unit increase β†’ hazard multiplied by \(e^\beta\)

Proportional Hazards AssumptionΒΆ

Key Assumption: Hazard ratio is constant over time

\[\frac{h(t|X_1)}{h(t|X_0)} = \text{constant}\]

If violated β†’ need time-varying coefficients

AdvantagesΒΆ

βœ… Handles censoring
βœ… Multiple covariates
βœ… No baseline hazard assumption
βœ… Interpretable hazard ratios
βœ… Flexible

Model FittingΒΆ

Partial Likelihood (not full likelihood)

  • Only uses ordering of events

  • Doesn’t need baseline hazard

  • Maximized via Newton-Raphson

# Cox Proportional Hazards Model
# Generate data with multiple covariates
np.random.seed(42)
n = 200

# Covariates
age = np.random.normal(60, 10, n)
treatment = np.random.binomial(1, 0.5, n)
stage = np.random.choice([1, 2, 3], size=n, p=[0.3, 0.4, 0.3])

# True hazard ratios
# age: HR = 1.03 per year
# treatment: HR = 0.6 (protective)
# stage: HR = 1.5 per stage
true_log_hazard = 0.03*age - 0.5*treatment + 0.4*stage

# Generate survival times
baseline_hazard = 0.01
u = np.random.uniform(0, 1, n)
times = -np.log(u) / (baseline_hazard * np.exp(true_log_hazard))

# Add censoring
censoring_times = np.random.exponential(scale=50, size=n)
observed_times = np.minimum(times, censoring_times)
event_observed = (times <= censoring_times).astype(int)

# Create DataFrame
df_cox = pd.DataFrame({
    'time': observed_times,
    'event': event_observed,
    'age': age,
    'treatment': treatment,
    'stage': stage
})

print("πŸ“Š Dataset:")
print(f"   N = {n}")
print(f"   Events: {event_observed.sum()}")
print(f"   Censored: {n - event_observed.sum()}")
print(f"\n{df_cox.head()}")

# Fit Cox model
cph = CoxPHFitter()
cph.fit(df_cox, duration_col='time', event_col='event')

# Summary
print("\n" + "="*70)
print("Cox Proportional Hazards Model")
print("="*70)
print(cph.summary)

# Hazard ratios
print("\nπŸ“ˆ Hazard Ratios (exp(coef)):")
print("="*70)
for var in ['age', 'treatment', 'stage']:
    hr = np.exp(cph.params_[var])
    ci_lower = np.exp(cph.confidence_intervals_.loc[var, '95% lower-bound'])
    ci_upper = np.exp(cph.confidence_intervals_.loc[var, '95% upper-bound'])
    p_value = cph.summary.loc[var, 'p']
    
    print(f"   {var:12s}: HR = {hr:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}], p = {p_value:.4f}")
    
    if var == 'age':
        print(f"                 β†’ 1 year older: {(hr-1)*100:.1f}% increase in hazard")
    elif var == 'treatment':
        if hr < 1:
            print(f"                 β†’ Treatment reduces hazard by {(1-hr)*100:.1f}%")
        else:
            print(f"                 β†’ Treatment increases hazard by {(hr-1)*100:.1f}%")
    elif var == 'stage':
        print(f"                 β†’ Each stage increase: {(hr-1)*100:.1f}% increase in hazard")

print(f"\nπŸ“Š Model Performance:")
print(f"   Concordance index: {cph.concordance_index_:.4f}")
print("   (0.5 = random, 1.0 = perfect predictions)")
# Visualize Cox Model Results
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# 1. Coefficient plot
cph.plot(ax=axes[0])
axes[0].set_title('Coefficient Estimates with 95% CI')
axes[0].axvline(0, color='red', linestyle='--', alpha=0.5, label='No effect')
axes[0].legend()

# 2. Baseline survival
cph.baseline_survival_.plot(ax=axes[1], legend=False)
axes[1].set_xlabel('Time')
axes[1].set_ylabel('Baseline Survival')
axes[1].set_title('Estimated Baseline Survival Function')
axes[1].grid(True, alpha=0.3)

# 3. Survival curves for different profiles
# Create example profiles
profiles = pd.DataFrame([
    {'age': 50, 'treatment': 1, 'stage': 1},
    {'age': 60, 'treatment': 1, 'stage': 2},
    {'age': 70, 'treatment': 0, 'stage': 3}
])

for idx, row in profiles.iterrows():
    surv = cph.predict_survival_function(row.to_frame().T)
    surv.plot(ax=axes[2], label=f"Age={int(row['age'])}, Tx={int(row['treatment'])}, Stage={int(row['stage'])}")

axes[2].set_xlabel('Time')
axes[2].set_ylabel('Survival Probability')
axes[2].set_title('Predicted Survival for Different Profiles')
axes[2].legend(fontsize=8)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nπŸ’‘ Interpretation:")
print("   β€’ Coefficients > 0: Increased hazard (worse survival)")
print("   β€’ Coefficients < 0: Decreased hazard (better survival)")
print("   β€’ CI crossing 0: Not significant")

11.4 Real Dataset: Rossi Recidivism DataΒΆ

Dataset DescriptionΒΆ

  • Study: Criminal recidivism (re-arrest)

  • Subjects: 432 male prisoners

  • Event: Re-arrest

  • Time: Weeks until re-arrest or end of study

  • Covariates: Financial aid, age, race, work experience, etc.

Research QuestionΒΆ

What factors affect time to recidivism?

# Load Rossi dataset
rossi = load_rossi()

print("πŸ“Š Rossi Recidivism Dataset:")
print(f"   Shape: {rossi.shape}")
print(f"\nColumns:\n{rossi.columns.tolist()}")
print(f"\nFirst few rows:\n{rossi.head()}")
print(f"\nSummary Statistics:")
print(rossi.describe())

# Event summary
print(f"\nπŸ“ˆ Event Summary:")
print(f"   Total subjects: {len(rossi)}")
print(f"   Re-arrested: {rossi['arrest'].sum()}")
print(f"   Not re-arrested (censored): {len(rossi) - rossi['arrest'].sum()}")
print(f"   Arrest rate: {rossi['arrest'].mean()*100:.1f}%")

# KM curves by financial aid
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# By financial aid
kmf_fin_yes = KaplanMeierFitter()
kmf_fin_no = KaplanMeierFitter()

fin_yes = rossi[rossi['fin'] == 1]
fin_no = rossi[rossi['fin'] == 0]

kmf_fin_yes.fit(fin_yes['week'], fin_yes['arrest'], label='Financial Aid')
kmf_fin_no.fit(fin_no['week'], fin_no['arrest'], label='No Financial Aid')

kmf_fin_yes.plot_survival_function(ax=ax1)
kmf_fin_no.plot_survival_function(ax=ax1)
ax1.set_xlabel('Weeks')
ax1.set_ylabel('Survival (Not Re-arrested)')
ax1.set_title('Survival by Financial Aid Status')
ax1.grid(True, alpha=0.3)

# Log-rank test
results_fin = logrank_test(fin_yes['week'], fin_no['week'],
                          fin_yes['arrest'], fin_no['arrest'])
ax1.text(0.05, 0.05, f'Log-rank p = {results_fin.p_value:.4f}',
        transform=ax1.transAxes, fontsize=10,
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# By race
kmf_black = KaplanMeierFitter()
kmf_other = KaplanMeierFitter()

black = rossi[rossi['race'] == 1]
other = rossi[rossi['race'] == 0]

kmf_black.fit(black['week'], black['arrest'], label='Black')
kmf_other.fit(other['week'], other['arrest'], label='Other')

kmf_black.plot_survival_function(ax=ax2)
kmf_other.plot_survival_function(ax=ax2)
ax2.set_xlabel('Weeks')
ax2.set_ylabel('Survival (Not Re-arrested)')
ax2.set_title('Survival by Race')
ax2.grid(True, alpha=0.3)

results_race = logrank_test(black['week'], other['week'],
                           black['arrest'], other['arrest'])
ax2.text(0.05, 0.05, f'Log-rank p = {results_race.p_value:.4f}',
        transform=ax2.transAxes, fontsize=10,
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()
# Fit Cox model to Rossi data
cph_rossi = CoxPHFitter()
cph_rossi.fit(rossi, duration_col='week', event_col='arrest')

print("\n" + "="*80)
print("Cox Model: Rossi Recidivism Data")
print("="*80)
print(cph_rossi.summary)

print("\nπŸ“Š Key Findings (Hazard Ratios):")
print("="*80)

# Select key variables
key_vars = ['fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio']
var_names = {
    'fin': 'Financial Aid',
    'age': 'Age',
    'race': 'Race (Black)',
    'wexp': 'Work Experience',
    'mar': 'Married',
    'paro': 'Paroled',
    'prio': 'Prior Convictions'
}

for var in key_vars:
    if var in cph_rossi.params_.index:
        hr = np.exp(cph_rossi.params_[var])
        p_val = cph_rossi.summary.loc[var, 'p']
        sig = '***' if p_val < 0.001 else '**' if p_val < 0.01 else '*' if p_val < 0.05 else ''
        
        print(f"   {var_names[var]:25s}: HR = {hr:.3f}  (p = {p_val:.4f}) {sig}")
        
        if hr < 1:
            print(f"   {'':25s}  β†’ Reduces re-arrest risk by {(1-hr)*100:.1f}%")
        else:
            print(f"   {'':25s}  β†’ Increases re-arrest risk by {(hr-1)*100:.1f}%")

print("\n   Significance: *** p<0.001, ** p<0.01, * p<0.05")
print(f"\nπŸ“ˆ Model Concordance: {cph_rossi.concordance_index_:.4f}")

# Visualize coefficients
fig, ax = plt.subplots(figsize=(10, 6))
cph_rossi.plot(ax=ax)
ax.set_title('Cox Model Coefficients: Rossi Dataset')
ax.axvline(0, color='red', linestyle='--', alpha=0.5, label='No effect')
plt.tight_layout()
plt.show()

print("\nπŸ’‘ Interpretation:")
print("   β€’ Financial aid β†’ Lower re-arrest risk (protective)")
print("   β€’ Older age β†’ Lower re-arrest risk")
print("   β€’ More prior convictions β†’ Higher re-arrest risk")
print("   β€’ Being married β†’ Lower re-arrest risk")

Key TakeawaysΒΆ

When to Use Survival AnalysisΒΆ

Perfect For: βœ… Time-to-event data
βœ… Censored observations present
βœ… Want to estimate survival curves
βœ… Compare groups over time
βœ… Multiple predictors of survival

Not Appropriate: ❌ No censoring (use regular regression)
❌ Event can occur multiple times (use recurrent event models)
❌ Competing risks (use competing risk models)

Method Selection GuideΒΆ

Situation

Method

Why

Single group, describe survival

Kaplan-Meier

Simple, visual, no assumptions

Compare 2+ groups

KM + Log-Rank Test

Non-parametric comparison

Multiple continuous/categorical predictors

Cox PH Model

Handle many covariates

Non-proportional hazards

Stratified Cox or time-varying

Relax PH assumption

Predict individual survival

Cox Model

Get personalized curves

Kaplan-Meier: Best PracticesΒΆ

βœ… Always show confidence intervals
βœ… Report median survival time
βœ… Show number at risk over time
βœ… Check for adequate follow-up
βœ… Use for exploratory analysis

Cox Model: Best PracticesΒΆ

βœ… Check proportional hazards assumption

cph.check_assumptions(rossi, p_value_threshold=0.05)

βœ… Report hazard ratios, not coefficients

HR = np.exp(cph.params_)

βœ… Include confidence intervals

βœ… Assess model fit

concordance_index  # 0.5 = random, 1.0 = perfect

βœ… Check for influential observations

Interpreting Hazard RatiosΒΆ

Binary Predictor (0/1):

  • HR = 2.0 β†’ Group 1 has 2Γ— the hazard of group 0

  • HR = 0.5 β†’ Group 1 has half the hazard of group 0

Continuous Predictor:

  • HR = 1.05 β†’ 5% increase in hazard per unit increase

  • HR = 0.95 β†’ 5% decrease in hazard per unit increase

10-unit increase: $\(HR_{10} = (HR_1)^{10}\)$

Common PitfallsΒΆ

❌ Ignoring censoring β†’ Biased estimates
❌ Not checking PH assumption β†’ Invalid Cox model
❌ Confusing survival with hazard β†’ Wrong interpretation
❌ Small sample with many covariates β†’ Overfitting
❌ Not reporting CI β†’ Can’t assess uncertainty
❌ Using inappropriate time scale β†’ Misleading results

Censoring TypesΒΆ

Right Censoring (most common):

  • Study ends before event

  • Loss to follow-up

  • Handles correctly in KM and Cox

Informative Censoring:

  • Censoring related to outcome

  • Violates assumptions!

  • Example: Sicker patients drop out

ExtensionsΒΆ

Time-Varying Covariates:

  • Covariates that change over time

  • Example: Treatment changes during study

Stratified Cox Model:

  • Different baseline hazards by strata

  • Use when PH assumption violated for some variables

Competing Risks:

  • Multiple types of events possible

  • Example: Death from different causes

Recurrent Events:

  • Event can happen multiple times

  • Example: Hospital readmissions

Software RecommendationsΒΆ

Python:

  • lifelines: Most comprehensive

  • scikit-survival: sklearn-compatible

R:

  • survival: Standard package

  • survminer: Visualization

Next ChapterΒΆ

Chapter 12: Unsupervised Learning

  • Principal Component Analysis (PCA)

  • K-Means Clustering

  • Hierarchical Clustering

  • Matrix Completion

Practice ExercisesΒΆ

Exercise 1: Simulating Censored DataΒΆ

  1. Generate survival times from exponential distribution

  2. Generate censoring times independently

  3. Create observed times and event indicators

  4. Fit Kaplan-Meier and verify estimates

  5. Vary censoring proportion (0%, 25%, 50%, 75%)

  6. Compare bias in estimates

Exercise 2: Log-Rank Power AnalysisΒΆ

  1. Simulate two groups with different survival (HR = 0.7)

  2. Vary sample sizes: 20, 50, 100, 200 per group

  3. Run 1000 simulations for each

  4. Calculate power (proportion of p < 0.05)

  5. Plot power curve

  6. Determine minimum sample size for 80% power

Exercise 3: Cox Model with InteractionsΒΆ

Using Rossi data:

  1. Fit Cox model with interaction: age Γ— financial aid

  2. Test if effect of financial aid varies by age

  3. Visualize: survival curves for young vs old, by aid status

  4. Interpret interaction term

Exercise 4: Proportional Hazards CheckΒΆ

  1. Fit Cox model to Rossi data

  2. Check PH assumption using:

    • Schoenfeld residuals

    • Log-log plots

    • Statistical tests

  3. Identify variables violating assumption

  4. Refit with stratification if needed

Exercise 5: Time-Varying EffectsΒΆ

  1. Create dataset where treatment effect changes over time

  2. Fit standard Cox model (misspecified)

  3. Fit Cox with time-varying coefficient

  4. Compare models

  5. Visualize how HR changes over time

Exercise 6: Real Data AnalysisΒΆ

Find survival dataset (medical, engineering, social):

  1. Perform exploratory analysis with KM curves

  2. Compare groups with log-rank tests

  3. Fit Cox model with multiple predictors

  4. Check assumptions

  5. Report findings with visualizations

  6. Provide clinical/practical interpretation