Learning Machine Learning

This project explores different unsupervised and supervised algorithms that can be used to classify tumour types. The original dataset was published by Parker et al. in "Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes" (J Clin Oncol 27:1160-1167, 2009). The paper describes a 50-gene subtype classifier for breast tumours (LumA, LumB, HER2-enriched, basal-like) that was used to build a risk predictor model. This risk predictor combines the gene expression-based subtypes with more traditional clinical variables, such as tumour size or histologic grade. The microarray data that was used to choose the 50 most relevant genes for the classifier is available on the GEO (Gene Expression Omnibus) website.

In [3]:
from bokeh import plotting
from bokeh.charts import Scatter
from bokeh.io import output_notebook
import GEOparse
import numpy
import pandas
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import v_measure_score
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
In [4]:
output_notebook()
Loading BokehJS ...

Exploration of the PAM50 Dataset

In [5]:
full_csv = pandas.read_csv('UNC_GEO_summary.csv')
In [ ]:
platformobj = GEOparse.get_GEO('GPL1390')
In [ ]:
mygeoobj = GEOparse.get_GEO('GSM80221')

The following table shows a summary of the microarray datasets in the GEO repository, together with the clinical variables, survival data and tumour subtype for each patient.

In [8]:
full_csv
Out[8]:
GEO accession number Age ER Node Grade Size RFS event RFS months Overall Survival Event Overall survival months PAM50 test subtype PAM50 training subtype
0 GSM80221 55 0.0 0 3 2 0 31 0 31 NaN Basal
1 GSM34464 61 0.0 1 3 2 1 1 1 13 NaN Basal
2 GSM34562 50 1.0 0 3 1 0 29 0 29 NaN Basal
3 GSM34523 39 0.0 1 2 4 0 5 0 5 NaN Basal
4 GSM34565 55 0.0 0 3 2 0 18 0 18 NaN Basal
5 GSM34527 50 0.0 0 3 1 0 19 0 19 NaN Basal
6 GSM34437 59 0.0 0 3 2 0 25 0 25 NaN Basal
7 GSM34450 80 0.0 1 3 1 0 16 0 16 NaN Basal
8 GSM34452 60 0.0 0 3 1 0 7 0 7 NaN Basal
9 GSM34481 48 1.0 1 3 3 0 14 0 14 NaN Her2
10 GSM34497 51 0.0 1 3 3 0 21 0 21 NaN Her2
11 GSM34544 50 0.0 0 3 2 0 15 0 15 NaN Her2
12 GSM34528 52 0.0 1 2 3 0 9 0 9 NaN Her2
13 GSM34431 42 0.0 1 3 1 0 25 0 25 NaN Her2
14 GSM34532 72 0.0 1 3 2 0 20 0 20 NaN Her2
15 GSM52895 34 1.0 0 1 1 0 1 0 1 NaN LumA
16 GSM34549 83 1.0 0 2 1 0 9 0 9 NaN LumA
17 GSM34548 50 1.0 0 1 1 0 20 0 20 NaN LumA
18 GSM34428 50 1.0 1 2 2 0 24 0 24 NaN LumA
19 GSM34557 63 1.0 1 2 1 0 9 0 9 NaN LumA
20 GSM34451 40 1.0 0 2 1 0 13 0 13 NaN LumA
21 GSM52896 79 1.0 1 3 2 1 15 0 15 NaN LumB
22 GSM34559 68 0.0 1 3 3 1 3 1 3 Basal NaN
23 GSM34552 84 0.0 0 3 2 0 27 1 27 Basal NaN
24 GSM50154 84 0.0 1 3 3 0 2 0 2 Basal NaN
25 GSM34488 55 0.0 1 2 2 0 10 0 10 Her2 NaN
26 GSM34474 67 0.0 0 2 1 0 118 0 118 LumA NaN
27 GSM34541 57 1.0 1 2 4 1 6 0 51 LumA NaN
28 GSM34430 68 1.0 0 3 2 0 37 0 37 LumA NaN
29 GSM34455 85 1.0 0 2 2 0 32 0 32 LumA NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
157 GSM275814 65 1.0 1 3 2 0 73 0 73 LumB NaN
158 GSM275820 59 0.0 1 3 3 1 17 1 17 LumB NaN
159 GSM275824 52 1.0 0 2 1 0 71 0 71 LumA NaN
160 GSM275826 88 1.0 1 3 2 1 35 1 35 LumB NaN
161 GSM275827 57 1.0 0 3 1 0 64 0 64 LumB NaN
162 GSM275828 72 1.0 1 2 4 1 44 1 45 LumA NaN
163 GSM275831 75 1.0 0 1 2 0 81 0 81 LumA NaN
164 GSM275836 74 1.0 0 2 1 0 76 0 76 LumA NaN
165 GSM275838 66 1.0 0 2 1 0 61 0 61 LumB NaN
166 GSM275840 58 0.0 1 2 3 1 17 1 63 LumB NaN
167 GSM275844 52 1.0 2 2 1 1 11 1 55 LumA NaN
168 GSM275845 49 1.0 0 2 1 0 54 0 54 LumA NaN
169 GSM275849 71 1.0 0 3 1 0 73 0 73 LumA NaN
170 GSM275851 89 1.0 0 3 2 0 62 0 62 LumA NaN
171 GSM275855 79 1.0 0 2 4 0 69 0 69 LumA NaN
172 GSM275856 73 1.0 0 3 1 0 68 0 68 LumA NaN
173 GSM275859 93 1.0 0 2 1 0 24 1 24 LumA NaN
174 GSM275860 84 1.0 0 3 2 0 23 1 23 LumB NaN
175 GSM275862 47 1.0 1 3 2 0 61 0 61 LumA NaN
176 GSM275867 80 1.0 1 2 4 1 58 1 58 LumA NaN
177 GSM275871 48 1.0 0 1 1 0 64 0 64 LumA NaN
178 GSM275874 42 1.0 1 3 2 0 63 0 63 LumA NaN
179 GSM275882 88 1.0 0 2 1 0 13 1 13 LumA NaN
180 GSM275884 80 1.0 1 2 1 0 57 0 57 LumA NaN
181 GSM275886 44 0.0 0 3 2 0 47 0 47 LumA NaN
182 GSM275896 50 1.0 2 3 2 0 44 0 44 LumA NaN
183 GSM275899 62 1.0 0 3 3 0 57 0 57 LumA NaN
184 GSM275901 55 1.0 1 2 2 0 39 0 39 LumA NaN
185 GSM275903 46 1.0 1 3 2 0 45 0 45 LumA NaN
186 GSM275906 67 1.0 0 3 2 0 41 0 41 LumA NaN

187 rows × 12 columns

When working with microarray data it is important to keep track of which platform was used for each array. You can easily check this by printing the metadata of each GEO sample.

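For example, the metadata of a single sample can be inspected directly. A minimal sketch using the mygeoobj sample loaded above; the 'platform_id' entry holds the platform accession:

In [ ]:
# Print all metadata fields of one GEO sample; 'platform_id' identifies
# the microarray platform the sample was hybridised on.
for key, value in mygeoobj.metadata.items():
    print(key, value)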
In [ ]:
%%capture
platforms = []
for microarray in full_csv['GEO accession number']:
    microarray = GEOparse.get_GEO(microarray)
    platform = (microarray.metadata['platform_id'])[0]
    platforms.append(platform)

In this case 3 different platforms were used.

In [41]:
platforms_s = set(platforms)
print(platforms_s)
{'GPL885', 'GPL887', 'GPL1390'}

You can filter the microarray datasets by platform.

In [ ]:
%%capture
matrix = []
names = []
subtypes_test = []
subtypes_train = []
subtypes = []
for i in full_csv.index:
    accession_number = full_csv['GEO accession number'][i]
    subtype_test = full_csv['PAM50 test subtype'][i]
    subtype_train = full_csv['PAM50 training subtype'][i]
    microarray = GEOparse.get_GEO(accession_number)
    platform = (microarray.metadata['platform_id'])[0]
    if platform == 'GPL1390':
        names.append(accession_number)
        matrix.append(microarray.table['VALUE'])
        subtypes_test.append(subtype_test)
        subtypes_train.append(subtype_train)
        if type(subtype_test) is str:
            subtypes.append(subtype_test)
        else:
            subtypes.append(subtype_train)
    

Unsupervised learning - clustering algorithms

To try out clustering algorithms we merged all the data from the microarray files into one numpy array.

In [11]:
numpy_matrix = numpy.array(matrix)
In [12]:
numpy_matrix2 = numpy.nan_to_num(numpy_matrix)
In [13]:
numpy_matrix.shape
Out[13]:
(167, 22575)

1) k-means

k-means is an unsupervised iterative clustering method where k (the number of clusters) is defined by the user.

In [49]:
mbk = KMeans(n_clusters=5, max_iter=1000, verbose=1, tol=1e-8)
results = mbk.fit_predict(numpy_matrix2)
Initialization complete
Iteration  0, inertia 2948246.988
Iteration  1, inertia 1695266.394
Iteration  2, inertia 1653150.417
Iteration  3, inertia 1628921.216
Iteration  4, inertia 1618336.549
Iteration  5, inertia 1615704.304
Iteration  6, inertia 1614486.475
Converged at iteration 6
Initialization complete
Iteration  0, inertia 3006705.418
Iteration  1, inertia 1710675.445
Iteration  2, inertia 1682714.046
Iteration  3, inertia 1671517.657
Iteration  4, inertia 1662253.259
Iteration  5, inertia 1655401.265
Iteration  6, inertia 1651168.753
Iteration  7, inertia 1649449.574
Iteration  8, inertia 1648139.666
Iteration  9, inertia 1646258.075
Iteration 10, inertia 1642797.628
Iteration 11, inertia 1635935.062
Iteration 12, inertia 1628202.251
Iteration 13, inertia 1627323.042
Iteration 14, inertia 1626213.510
Iteration 15, inertia 1625353.488
Converged at iteration 15
Initialization complete
Iteration  0, inertia 2908083.736
Iteration  1, inertia 1690362.546
Iteration  2, inertia 1677842.120
Iteration  3, inertia 1666242.913
Iteration  4, inertia 1662191.497
Iteration  5, inertia 1659647.015
Iteration  6, inertia 1657904.840
Iteration  7, inertia 1656572.086
Iteration  8, inertia 1654891.859
Iteration  9, inertia 1653754.417
Iteration 10, inertia 1653118.425
Converged at iteration 10
Initialization complete
Iteration  0, inertia 2924352.404
Iteration  1, inertia 1673575.134
Iteration  2, inertia 1666125.019
Iteration  3, inertia 1665363.096
Converged at iteration 3
Initialization complete
Iteration  0, inertia 3011871.285
Iteration  1, inertia 1705812.497
Iteration  2, inertia 1677108.357
Iteration  3, inertia 1669853.815
Iteration  4, inertia 1663518.640
Iteration  5, inertia 1658358.472
Iteration  6, inertia 1657888.938
Converged at iteration 6
Initialization complete
Iteration  0, inertia 2820217.547
Iteration  1, inertia 1679106.132
Iteration  2, inertia 1660979.866
Iteration  3, inertia 1655597.096
Iteration  4, inertia 1651574.041
Iteration  5, inertia 1648888.544
Iteration  6, inertia 1647568.881
Iteration  7, inertia 1646889.614
Converged at iteration 7
Initialization complete
Iteration  0, inertia 2810351.146
Iteration  1, inertia 1653057.004
Iteration  2, inertia 1630455.694
Iteration  3, inertia 1622956.503
Iteration  4, inertia 1618129.258
Iteration  5, inertia 1617726.866
Converged at iteration 5
Initialization complete
Iteration  0, inertia 2651958.420
Iteration  1, inertia 1653246.220
Iteration  2, inertia 1620636.236
Iteration  3, inertia 1616883.707
Converged at iteration 3
Initialization complete
Iteration  0, inertia 2918010.329
Iteration  1, inertia 1699262.672
Iteration  2, inertia 1677500.885
Iteration  3, inertia 1672942.176
Iteration  4, inertia 1669360.752
Iteration  5, inertia 1667686.618
Iteration  6, inertia 1666329.173
Iteration  7, inertia 1666032.890
Converged at iteration 7
Initialization complete
Iteration  0, inertia 2828859.321
Iteration  1, inertia 1670906.627
Iteration  2, inertia 1660070.622
Iteration  3, inertia 1655685.828
Iteration  4, inertia 1652992.695
Iteration  5, inertia 1652627.216
Converged at iteration 5
In [30]:
v_measure_score(results, subtypes)
Out[30]:
0.38669647446509914
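
The V-measure is the harmonic mean of homogeneity (each cluster contains only members of a single subtype) and completeness (all members of a subtype are assigned to the same cluster). As a quick sanity check you can look at the two components separately; a minimal sketch, assuming the results and subtypes lists defined above:

In [ ]:
from sklearn.metrics import homogeneity_score, completeness_score

# True labels go first, predicted cluster labels second.
print(homogeneity_score(subtypes, results))
print(completeness_score(subtypes, results))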

2) Affinity propagation

Affinity propagation is another unsupervised clustering technique, in which the number of clusters is inferred from the data.

In [46]:
ap = AffinityPropagation(verbose=2)
results = ap.fit_predict(numpy_matrix2)
Converged after 40 iterations.
In [47]:
for j in range(14):
    for i in range(167):
        if results[i] == j:
            print(names[i], results[i], subtypes[i])
GSM34562 0 Basal
GSM34565 0 Basal
GSM34559 0 Basal
GSM141070 0 Basal
GSM141071 0 Basal
GSM141072 0 Basal
GSM141110 0 Basal
GSM275897 0 Basal
GSM34450 1 Basal
GSM34452 1 Basal
GSM34497 1 Her2
GSM52895 1 LumA
GSM34549 1 LumA
GSM34451 1 LumA
GSM52896 1 LumB
GSM34567 1 LumA
GSM34550 1 LumA
GSM52891 1 LumB
GSM34475 1 LumB
GSM53474 1 Normal
GSM141079 1 LumA
GSM141081 1 LumA
GSM141085 1 LumA
GSM141088 1 LumA
GSM141001 1 LumA
GSM141077 1 Normal
GSM34481 2 Her2
GSM34544 2 Her2
GSM34431 2 Her2
GSM34548 2 LumA
GSM34488 2 Her2
GSM34474 2 LumA
GSM34541 2 LumA
GSM34455 2 LumA
GSM34459 2 LumA
GSM34448 2 LumA
GSM34553 2 LumA
GSM34446 2 LumA
GSM141092 2 LumA
GSM141075 2 LumB
GSM141095 2 Normal
GSM275782 3 Basal
GSM275810 3 Basal
GSM275825 3 Basal
GSM275830 3 Basal
GSM275833 3 Basal
GSM275839 3 Basal
GSM275841 3 Basal
GSM275848 3 Basal
GSM275857 3 Basal
GSM275863 3 Basal
GSM275876 3 Basal
GSM275887 3 Basal
GSM275888 3 Basal
GSM275905 3 Basal
GSM275909 3 Basal
GSM141108 3 Basal
GSM275802 4 Basal
GSM275837 4 Basal
GSM275843 4 Basal
GSM275847 4 Basal
GSM275847 4 Basal
GSM275885 4 Basal
GSM275889 4 Basal
GSM80221 5 Basal
GSM140999 5 Basal
GSM275792 5 Basal
GSM275813 5 Basal
GSM275823 5 Basal
GSM275829 5 Basal
GSM275834 5 Basal
GSM275850 5 Basal
GSM275852 5 Basal
GSM275865 5 Basal
GSM275866 5 Basal
GSM275883 5 Basal
GSM275894 5 Basal
GSM275904 5 Basal
GSM275879 6 Basal
GSM141105 7 Basal
GSM141073 7 Her2
GSM141096 7 Her2
GSM141099 7 Her2
GSM141121 7 Her2
GSM141102 7 LumB
GSM141117 7 LumB
GSM141097 7 Basal
GSM141098 7 Her2
GSM275791 8 Her2
GSM275793 8 Her2
GSM275795 8 Her2
GSM275796 8 Her2
GSM275800 8 Her2
GSM275803 8 Her2
GSM275805 8 Her2
GSM275815 8 Her2
GSM275816 8 Her2
GSM275817 8 Her2
GSM275832 8 Her2
GSM275858 8 Her2
GSM275861 8 Her2
GSM275873 8 Her2
GSM275878 8 Her2
GSM275890 8 Her2
GSM275893 8 Her2
GSM275846 8 LumB
GSM275860 8 LumB
GSM275874 8 LumA
GSM275896 8 LumA
GSM275898 9 LumA
GSM275907 9 LumA
GSM275864 9 LumB
GSM275794 9 LumB
GSM275808 9 LumA
GSM275809 9 LumB
GSM275812 9 LumA
GSM275836 9 LumA
GSM275882 9 LumA
GSM275884 9 LumA
GSM141093 10 Basal
GSM275820 11 LumB
GSM275891 12 Her2
GSM275824 12 LumA
GSM275828 12 LumA
GSM275831 12 LumA
GSM275838 12 LumB
GSM275844 12 LumA
GSM275851 12 LumA
GSM275855 12 LumA
GSM275856 12 LumA
GSM275859 12 LumA
GSM275867 12 LumA
GSM275899 12 LumA
GSM275842 13 Her2
GSM275877 13 Her2
GSM275880 13 Her2
GSM275881 13 Her2
GSM275783 13 LumA
GSM275788 13 LumA
GSM275804 13 LumA
GSM275818 13 LumA
GSM275872 13 LumA
GSM275892 13 LumA
GSM275900 13 LumA
GSM275908 13 LumA
GSM141084 13 LumB
GSM141114 13 LumB
GSM275799 13 LumB
GSM275807 13 LumB
GSM275821 13 LumB
GSM275822 13 LumB
GSM275895 13 LumB
GSM141118 13 Her2
GSM141002 13 LumB
GSM275811 13 LumA
GSM275814 13 LumB
GSM275826 13 LumB
GSM275827 13 LumB
GSM275840 13 LumB
GSM275845 13 LumA
GSM275849 13 LumA
GSM275862 13 LumA
GSM275871 13 LumA
GSM275886 13 LumA
GSM275901 13 LumA
GSM275903 13 LumA
GSM275906 13 LumA
In [33]:
v_measure_score(results, subtypes)
Out[33]:
0.40616973903747322

Simple check

Compare your results with randomly generated cluster assignments.

In [ ]:
random_clusters = numpy.random.random_integers(0, 4, len(results))
In [127]:
v_measure_score(random_clusters, subtypes)
Out[127]:
0.05368245531372208

3) Principal Component Analysis

Principal component analysis is a dimensionality reduction technique that retains components in order of decreasing explained variance.

In [50]:
pca = PCA(n_components=50)
transformed = pca.fit_transform(numpy_matrix2)

data_frame = pandas.DataFrame({'x': transformed[:,0], 'y': transformed[:,1], 'our': results, 'truth': subtypes})

myfig = Scatter(data_frame, x='x', y='y', color='our')

plotting.show(myfig)
In [185]:
pca.explained_variance_ratio_
Out[185]:
array([ 0.15281314,  0.11336547,  0.04262466,  0.03526221,  0.03353428,
        0.02901063,  0.02205512,  0.02095834,  0.0190384 ,  0.01618621,
        0.01417247,  0.01265852,  0.01193964,  0.01153478,  0.01100957,
        0.01039784,  0.0098839 ,  0.00953392,  0.00936849,  0.00879479,
        0.00842337,  0.00803581,  0.00790614,  0.00746355,  0.00713668,
        0.00690605,  0.0066281 ,  0.00644893,  0.00631893,  0.00606388,
        0.00595219,  0.00593914,  0.00580179,  0.00575719,  0.00542019,
        0.00527025,  0.00514962,  0.00505571,  0.00495043,  0.0048947 ,
        0.00486142,  0.0047364 ,  0.00468101,  0.00457586,  0.00446293,
        0.0044193 ,  0.00420863,  0.00419003,  0.0040644 ,  0.00403637])
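
To gauge how much of the total variance the 50 retained components explain, you can sum the ratios or look at their cumulative sum; a minimal sketch using the fitted pca object above:

In [ ]:
# Fraction of total variance captured by the 50 retained components,
# and the cumulative fraction after each successive component.
print(pca.explained_variance_ratio_.sum())
print(numpy.cumsum(pca.explained_variance_ratio_))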

4) t-distributed Stochastic Neighbor Embedding

t-SNE is a non-linear dimensionality reduction technique.

In [175]:
tsne = TSNE()
transformed = tsne.fit_transform(numpy_matrix2)
data_frame = pandas.DataFrame({'x': transformed[:,0], 'y': transformed[:,1], 'our': results, 'truth': subtypes})

myfig = Scatter(data_frame, x='x', y='y', color='truth')

plotting.show(myfig)

Supervised learning - classification

Here we take numpy_matrix2 and subtypes and split them into training and test input data (x) and labels (y). After training an algorithm on the training set and running a classification on the test set, you can measure the quality of the classification with precision (the number of true positives found compared to all predicted positives) and recall (the number of true positives found compared to the number of all actual positives).
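
As a concrete illustration of the two metrics, consider a hypothetical classifier that flags 8 samples as Basal, 6 of them correctly, while missing 4 true Basal samples:

In [ ]:
# Hypothetical counts, only to illustrate the definitions above.
true_positives = 6
false_positives = 2   # predicted Basal but actually another subtype
false_negatives = 4   # true Basal samples not predicted as Basal
precision = true_positives / (true_positives + false_positives)  # 0.75
recall = true_positives / (true_positives + false_negatives)     # 0.60
print(precision, recall)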

In [43]:
train_x, test_x, train_y, test_y = train_test_split(numpy_matrix2, subtypes)
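
The subtypes are imbalanced (LumA dominates and Normal is rare), so a purely random split can leave very few samples of a class in the test set. In newer scikit-learn versions train_test_split lives in sklearn.model_selection and accepts a stratify argument that keeps the class proportions similar in both sets; a minimal sketch, assuming such a version is installed (the outputs below were produced with the unstratified split above):

In [ ]:
from sklearn.model_selection import train_test_split as stratified_split

# Keep subtype proportions roughly equal in the training and test sets.
train_x_s, test_x_s, train_y_s, test_y_s = stratified_split(
    numpy_matrix2, subtypes, stratify=subtypes, random_state=0)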

1) k-nearest neighbors

k-nearest neighbors is a simple, non-generalizing learning method that infers the class from a vote among the k nearest neighbors.

In [291]:
knc = KNeighborsClassifier(n_neighbors=5)
knc.fit(train_x, train_y)
pred_y = knc.predict(test_x)
print(classification_report(test_y, pred_y))
             precision    recall  f1-score   support

      Basal       1.00      0.92      0.96        13
       Her2       0.71      0.62      0.67         8
       LumA       0.43      1.00      0.61        10
       LumB       0.00      0.00      0.00        10
     Normal       0.00      0.00      0.00         1

avg / total       0.55      0.64      0.57        42

//anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

2) Logistic regression

Logistic regression is a linear model for classification based on the logistic function.

In [292]:
lr = LogisticRegression(C=0.01)
lr.fit(train_x, train_y)
pred_y = lr.predict(test_x)
print(classification_report(test_y, pred_y))
             precision    recall  f1-score   support

      Basal       1.00      0.92      0.96        13
       Her2       1.00      0.88      0.93         8
       LumA       0.53      1.00      0.69        10
       LumB       1.00      0.40      0.57        10
     Normal       0.00      0.00      0.00         1

avg / total       0.86      0.79      0.78        42

//anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

3) Random Forest

Random forest is an ensemble method based on a forest of decision trees, where each tree is fitted on a random sample of the data.

In [293]:
rf = RandomForestClassifier(n_estimators=500)
rf.fit(train_x, train_y)
pred_y = rf.predict(test_x)
print(classification_report(test_y, pred_y))
             precision    recall  f1-score   support

      Basal       0.92      0.92      0.92        13
       Her2       0.88      0.88      0.88         8
       LumA       0.59      1.00      0.74        10
       LumB       1.00      0.40      0.57        10
     Normal       0.00      0.00      0.00         1

avg / total       0.83      0.79      0.76        42

//anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

4) Linear Support Vector Machines

Support vector machines fit a linear decision boundary in the feature space.

In [295]:
cls = LinearSVC(C=0.1)
cls.fit(train_x, train_y)
pred_y = cls.predict(test_x)
print(classification_report(test_y, pred_y))
             precision    recall  f1-score   support

      Basal       1.00      0.92      0.96        13
       Her2       1.00      0.88      0.93         8
       LumA       0.56      1.00      0.71        10
       LumB       1.00      0.40      0.57        10
     Normal       0.00      0.00      0.00         1

avg / total       0.87      0.79      0.78        42

5) Naive Bayes

The naive Bayes algorithm applies Bayesian inference under the assumption that all features are statistically independent given the class.

In [51]:
cls = GaussianNB()
cls.fit(train_x, train_y)
pred_y = cls.predict(test_x)
print(classification_report(test_y, pred_y))
             precision    recall  f1-score   support

      Basal       0.83      1.00      0.91        15
       Her2       0.89      0.67      0.76        12
       LumA       0.90      1.00      0.95         9
       LumB       0.60      0.60      0.60         5
     Normal       0.00      0.00      0.00         1

avg / total       0.82      0.83      0.82        42

//anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)