%matplotlib inline

import numpy
import pandas
from lifelines import KaplanMeierFitter
import matplotlib
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

Kaplan-Meier survival analysis¶

Kaplan-Meier estimator is a non-parametric method for estimating the survival function of a dataset that includes duration of lifetimes and the occurance of the event in interest.

df = pandas.read_csv('UNC_GEO_survival.csv')

print(df.head())

  GEO accession number  Age   ER  Node  Grade  Size  RFS E  RFS T  OS E  OS T  \
0             GSM80221   55  0.0     0      3     2      0     31     0    31   
1             GSM34464   61  0.0     1      3     2      1      1     1    13   
2             GSM34562   50  1.0     0      3     1      0     29     0    29   
3             GSM34523   39  0.0     1      2     4      0      5     0     5   
4             GSM34565   55  0.0     0      3     2      0     18     0    18   

    Type PAM50 test subtype PAM50 training subtype  
0  Basal                NaN                  Basal  
1  Basal                NaN                  Basal  
2  Basal                NaN                  Basal  
3  Basal                NaN                  Basal  
4  Basal                NaN                  Basal

T = df['OS T']
E = df['OS E']

kmf = KaplanMeierFitter()
kmf.fit(T, E)

kmf.survival_function_
kmf.median_
kmf.plot();

groups = df['Type']
Basal = (groups == 'Basal')
Her2  = (groups == 'Her2')
LumA = (groups == 'LumA')
LumB = (groups == 'LumB')

kmf.fit(T[Basal], E[Basal], label='Basal')
ax = kmf.plot()

kmf.fit(T[Her2], E[Her2], label='Her2')
kmf.plot(ax=ax)

kmf.fit(T[LumA], E[LumA], label='LumA')
kmf.plot(ax=ax)

kmf.fit(T[LumB], E[LumB], label='LumB')
kmf.plot(ax=ax);

Cox proportional hazard model¶

Cox proportional hazard model is a very commonly used tool in survival analysis to study how survival time depends on predictor variables.

cox_d = pandas.get_dummies(pandas.read_csv('UNC_GEO_cox_reg.csv'))

cf = CoxPHFitter()
cf.fit(cox_d, 'RFS T', event_col='RFS E')
cf.print_summary()

n=186, number of events=39

               coef            exp(coef)  se(coef)       z      p  lower 0.95  upper 0.95     
Age         -0.0019               0.9981    0.0120 -0.1608 0.8722     -0.0255      0.0217     
ER          -0.3067               0.7358    0.4954 -0.6191 0.5358     -1.2779      0.6645     
Node         0.5940               1.8112    0.3195  1.8593 0.0630     -0.0323      1.2202    .
Grade        0.0841               1.0877    0.4541  0.1851 0.8531     -0.8062      0.9744     
Size         0.8033               2.2329    0.1807  4.4454 0.0000      0.4491      1.1576  ***
Type_Basal  33.4089 323066992295555.8125       nan     nan    nan         nan         nan     
Type_Her2   32.5433 135955386781099.1719       nan     nan    nan         nan         nan     
Type_LumA   32.5695 139553077750547.5469       nan     nan    nan         nan         nan     
Type_LumB   33.0890 234620830715346.5312       nan     nan    nan         nan         nan     
Type_Normal 34.2235 729569780021460.1250       nan     nan    nan         nan         nan     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Concordance = 0.784

cox_d.head()