This project explores different unsupervised and supervised algorithms for classifying tumour types. The original dataset was published by Parker et al., Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes (J Clin Oncol 27:1160-1167, 2009). The paper describes a 50-gene subtype classifier for breast tumours (luminal A, luminal B, HER2-enriched, basal-like) that was used to build a risk predictor model. This risk predictor combines the gene expression-based subtypes with more traditional clinical variables, such as tumour size or histologic grade. The microarray data used to choose the 50 most relevant genes for the classifier is available from the Gene Expression Omnibus (GEO).
from bokeh import plotting
from bokeh.charts import Scatter  # note: bokeh.charts was removed in later Bokeh releases
from bokeh.io import output_notebook
import GEOparse
import numpy
import pandas
from sklearn.cluster import KMeans, AffinityPropagation
from sklearn.metrics import v_measure_score, classification_report
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
output_notebook()
full_csv = pandas.read_csv('UNC_GEO_summary.csv')
platformobj = GEOparse.get_GEO('GPL1390')  # the platform annotation
mygeoobj = GEOparse.get_GEO('GSM80221')    # a single sample, for inspection
The following table summarises the microarray datasets in the GEO repository: clinical variables, survival data, and tumour subtype for each patient.
full_csv
When working with microarray data it is always good to keep in mind which platform was used for each array. You can easily check this by printing the metadata of each microarray dataset.
%%capture
# Collect the platform ID recorded in each sample's metadata
platforms = []
for accession in full_csv['GEO accession number']:
    microarray = GEOparse.get_GEO(accession)
    platform = microarray.metadata['platform_id'][0]
    platforms.append(platform)
In this case 3 different platforms were used.
platforms_s = set(platforms)
print(platforms_s)
You can filter the microarray datasets by platform.
%%capture
# Keep only the samples measured on platform GPL1390, together with their
# PAM50 subtype labels (the test-set label where available, otherwise the
# training-set label)
matrix = []
names = []
subtypes_test = []
subtypes_train = []
subtypes = []
for i in full_csv.index:
    accession_number = full_csv['GEO accession number'][i]
    subtype_test = full_csv['PAM50 test subtype'][i]
    subtype_train = full_csv['PAM50 training subtype'][i]
    microarray = GEOparse.get_GEO(accession_number)
    platform = microarray.metadata['platform_id'][0]
    if platform == 'GPL1390':
        names.append(accession_number)
        matrix.append(microarray.table['VALUE'])
        subtypes_test.append(subtype_test)
        subtypes_train.append(subtype_train)
        if isinstance(subtype_test, str):
            subtypes.append(subtype_test)
        else:
            subtypes.append(subtype_train)
To try out clustering algorithms we merged all the data from the microarray files into one numpy array.
numpy_matrix = numpy.array(matrix)
numpy_matrix2 = numpy.nan_to_num(numpy_matrix)  # replace missing values with zeros
numpy_matrix2.shape
k-means is an unsupervised iterative clustering method where k (the number of clusters) is defined by the user.
mbk = KMeans(n_clusters=5, max_iter=1000, verbose=1, tol=1e-8)
results = mbk.fit_predict(numpy_matrix2)
v_measure_score(results, subtypes)
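The V-measure is the harmonic mean of homogeneity (each cluster contains only members of a single class) and completeness (all members of a class end up in the same cluster). Both scores are available separately in sklearn; a quick sketch (note that, unlike the symmetric V-measure, these two require the true labels first):
from sklearn.metrics import homogeneity_score, completeness_score
print(homogeneity_score(subtypes, results))
print(completeness_score(subtypes, results))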
Affinity propagation is another unsupervised learning technique, in which the number of clusters is inferred from the data.
ap = AffinityPropagation(verbose=2)
results = ap.fit_predict(numpy_matrix2)
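The fitted model exposes the exemplars it chose, so the number of clusters can be read off rather than hard-coded (a small sketch):
print(len(ap.cluster_centers_indices_))  # number of clusters found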
# Print the members of each cluster next to their true subtype
for j in range(results.max() + 1):
    for i in range(len(names)):
        if results[i] == j:
            print(names[i], results[i], subtypes[i])
v_measure_score(results, subtypes)
To put the score in context, compare your results with randomly assigned cluster labels.
random_clusters = numpy.random.randint(0, 5, len(results))  # numpy.random.random_integers is deprecated
v_measure_score(random_clusters, subtypes)
Principal component analysis is a dimensionality reduction technique that projects the data onto orthogonal components, ordered from highest to lowest explained variance.
pca = PCA(n_components=50)
transformed = pca.fit_transform(numpy_matrix2)
data_frame = pandas.DataFrame({'x': transformed[:,0], 'y': transformed[:,1], 'our': results, 'truth': subtypes})
myfig = Scatter(data_frame, x='x', y='y', color='our')
plotting.show(myfig)
pca.explained_variance_ratio_
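To decide how many components are worth keeping, you can look at the cumulative explained variance; a sketch, assuming the 90% threshold is reached within the 50 components kept above:
cumulative = numpy.cumsum(pca.explained_variance_ratio_)
print(numpy.argmax(cumulative >= 0.9) + 1)  # components needed to explain 90% of the variance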
t-SNE is a non-linear dimensionality reduction technique.
tsne = TSNE()
transformed = tsne.fit_transform(numpy_matrix2)
data_frame = pandas.DataFrame({'x': transformed[:,0], 'y': transformed[:,1], 'our': results, 'truth': subtypes})
myfig = Scatter(data_frame, x='x', y='y', color='truth')
plotting.show(myfig)
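t-SNE scales poorly with the input dimensionality, so a common trick (a sketch, not something done in the original analysis) is to reduce the data to around 50 principal components first:
tsne = TSNE(n_components=2, random_state=0)
transformed = tsne.fit_transform(PCA(n_components=50).fit_transform(numpy_matrix2))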
Here we take numpy_matrix2 and subtypes and split them into training and test input data (x) and labels (y). After training an algorithm on the training dataset and running a classification on the test dataset, you can measure the quality of the classification with precision (the number of true positives found compared to all predicted positives) and recall (the number of true positives found compared to the number of all true positives).
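As a toy illustration of these two metrics (made-up labels, not part of the dataset):
from sklearn.metrics import precision_score, recall_score
truth = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
print(precision_score(truth, pred))  # 2 of 3 predicted positives are correct -> 0.67
print(recall_score(truth, pred))     # 2 of 3 actual positives were found -> 0.67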
train_x, test_x, train_y, test_y = train_test_split(numpy_matrix2, subtypes)
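If the subtype classes are unevenly represented, it may help to stratify the split so that the class proportions are the same in the training and test sets (a sketch, assuming every subtype occurs at least twice):
train_x, test_x, train_y, test_y = train_test_split(
    numpy_matrix2, subtypes, stratify=subtypes, random_state=0)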
k-nearest neighbors is a simple, non-generalizing learning method that infers the class from a vote among the k nearest neighbors.
knc = KNeighborsClassifier(n_neighbors=5)
knc.fit(train_x, train_y)
pred_y = knc.predict(test_x)
print(classification_report(test_y, pred_y))
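The choice of k matters; a quick sketch that scans a few values and reports the mean test accuracy:
for k in (1, 3, 5, 7, 9, 11):
    knc = KNeighborsClassifier(n_neighbors=k)
    knc.fit(train_x, train_y)
    print(k, knc.score(test_x, test_y))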
Logistic regression is a linear model for classification based on the logistic function.
lr = LogisticRegression(C=0.01)
lr.fit(train_x, train_y)
pred_y = lr.predict(test_x)
print(classification_report(test_y, pred_y))
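The regularization strength C was picked by hand here; a cross-validated grid search is a more principled way to choose it (a sketch):
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(LogisticRegression(), {'C': [0.001, 0.01, 0.1, 1, 10]}, cv=3)
grid.fit(train_x, train_y)
print(grid.best_params_, grid.best_score_)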
Random forest is an ensemble method that builds a forest of decision trees, each fitted to a random sample of the data.
rf = RandomForestClassifier(n_estimators=500)
rf.fit(train_x, train_y)
pred_y = rf.predict(test_x)
print(classification_report(test_y, pred_y))
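A fitted random forest also reports how much each feature contributed to its splits; here the indices refer to positions in the microarray's VALUE column (probes), not gene names (a sketch):
importances = rf.feature_importances_
top = numpy.argsort(importances)[::-1][:10]  # the ten most informative probes
print(top, importances[top])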
Support vector machines fit a linear decision boundary in the feature space.
cls = LinearSVC(C=0.1)
cls.fit(train_x, train_y)
pred_y = cls.predict(test_x)
print(classification_report(test_y, pred_y))
The Naive Bayes algorithm applies Bayesian inference under the assumption that all features are statistically independent.
cls = GaussianNB()
cls.fit(train_x, train_y)
pred_y = cls.predict(test_x)
print(classification_report(test_y, pred_y))
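A single train/test split can be noisy with this few samples; cross-validation gives a more stable comparison across all the classifiers tried above (a sketch):
from sklearn.model_selection import cross_val_score
for clf in (KNeighborsClassifier(), LogisticRegression(C=0.01),
            RandomForestClassifier(n_estimators=500), LinearSVC(C=0.1), GaussianNB()):
    scores = cross_val_score(clf, numpy_matrix2, subtypes, cv=3)
    print(type(clf).__name__, scores.mean())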