This project explores different unsupervised and supervised algorithms for classifying tumour types. The original dataset was published by Parker et al., Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes (J Clin Oncol 27:1160-1167, 2009). The paper describes a 50-gene subtype classifier for breast tumours (luminal A, luminal B, HER2-enriched, basal-like) that was used to build a risk predictor model. This risk predictor combines the gene expression-based subtypes with more traditional clinical variables, such as tumour size or histologic grade. The microarray data used to choose the 50 most relevant genes for the classifier is available from the Gene Expression Omnibus (GEO).
from bokeh import plotting
from bokeh.charts import Scatter  # note: bokeh.charts was removed in later Bokeh releases
from bokeh.io import output_notebook
import GEOparse
import numpy
import pandas
from sklearn.cluster import KMeans, AffinityPropagation
from sklearn.metrics import v_measure_score, classification_report
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
output_notebook()
full_csv = pandas.read_csv('UNC_GEO_summary.csv')
platformobj = GEOparse.get_GEO('GPL1390')  # the platform annotation
mygeoobj = GEOparse.get_GEO('GSM80221')    # a single sample, for inspection
The following table summarises the microarray datasets in the GEO repository: clinical variables, survival data, and tumour subtype for each patient.
full_csv
When working with microarray data it is always good to keep in mind which platform was used for each array. You can easily check this by printing the metadata of each microarray dataset.
%%capture
# Collect the platform ID recorded in each sample's metadata
platforms = []
for accession in full_csv['GEO accession number']:
    microarray = GEOparse.get_GEO(accession)
    platform = microarray.metadata['platform_id'][0]
    platforms.append(platform)
In this case 3 different platforms were used.
platforms_s = set(platforms)
print(platforms_s)
You can filter the microarray datasets by platform.
%%capture
# Keep only the samples measured on platform GPL1390, together with their
# PAM50 subtype labels (the test-set label where available, otherwise the
# training-set label)
matrix = []
names = []
subtypes_test = []
subtypes_train = []
subtypes = []
for i in full_csv.index:
    accession_number = full_csv['GEO accession number'][i]
    subtype_test = full_csv['PAM50 test subtype'][i]
    subtype_train = full_csv['PAM50 training subtype'][i]
    microarray = GEOparse.get_GEO(accession_number)
    platform = microarray.metadata['platform_id'][0]
    if platform == 'GPL1390':
        names.append(accession_number)
        matrix.append(microarray.table['VALUE'])
        subtypes_test.append(subtype_test)
        subtypes_train.append(subtype_train)
        if isinstance(subtype_test, str):
            subtypes.append(subtype_test)
        else:
            subtypes.append(subtype_train)
To try out clustering algorithms we merged all the data from the microarray files into one numpy array.
numpy_matrix = numpy.array(matrix)
numpy_matrix2 = numpy.nan_to_num(numpy_matrix)  # replace missing values with zeros
numpy_matrix2.shape
k-means is an unsupervised iterative clustering method where k (the number of clusters) is defined by the user.
mbk = KMeans(n_clusters=5, max_iter=1000, verbose=1, tol=1e-8)
results = mbk.fit_predict(numpy_matrix2)
v_measure_score(results, subtypes)
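The V-measure is the harmonic mean of homogeneity (each cluster contains only members of a single class) and completeness (all members of a class end up in the same cluster). Both scores are available separately in sklearn; a quick sketch (note that, unlike the symmetric V-measure, these two require the true labels first):
from sklearn.metrics import homogeneity_score, completeness_score
print(homogeneity_score(subtypes, results))
print(completeness_score(subtypes, results))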
Affinity propagation is another unsupervised learning technique, in which the number of clusters is inferred from the data.
ap = AffinityPropagation(verbose=2)
results = ap.fit_predict(numpy_matrix2)
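The fitted model exposes the exemplars it chose, so the number of clusters can be read off rather than hard-coded (a small sketch):
print(len(ap.cluster_centers_indices_))  # number of clusters found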
# Print the members of each cluster next to their true subtype
for j in range(results.max() + 1):
    for i in range(len(names)):
        if results[i] == j:
            print(names[i], results[i], subtypes[i])
v_measure_score(results, subtypes)
To put the score in context, compare your results with randomly assigned cluster labels.
random_clusters = numpy.random.randint(0, 5, len(results))  # numpy.random.random_integers is deprecated
v_measure_score(random_clusters, subtypes)
Principal component analysis is a dimensionality reduction technique that projects the data onto orthogonal components, ordered from highest to lowest explained variance.
pca = PCA(n_components=50)
transformed = pca.fit_transform(numpy_matrix2)
data_frame = pandas.DataFrame({'x': transformed[:,0], 'y': transformed[:,1], 'our': results, 'truth': subtypes})
myfig = Scatter(data_frame, x='x', y='y', color='our')
plotting.show(myfig)
pca.explained_variance_ratio_
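To decide how many components are worth keeping, you can look at the cumulative explained variance; a sketch, assuming the 90% threshold is reached within the 50 components kept above:
cumulative = numpy.cumsum(pca.explained_variance_ratio_)
print(numpy.argmax(cumulative >= 0.9) + 1)  # components needed to explain 90% of the variance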
t-SNE is a non-linear dimensionality reduction technique.
tsne = TSNE()
transformed = tsne.fit_transform(numpy_matrix2)
data_frame = pandas.DataFrame({'x': transformed[:,0], 'y': transformed[:,1], 'our': results, 'truth': subtypes})
myfig = Scatter(data_frame, x='x', y='y', color='truth')
plotting.show(myfig)
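t-SNE scales poorly with the input dimensionality, so a common trick (a sketch, not something done in the original analysis) is to reduce the data to around 50 principal components first:
tsne = TSNE(n_components=2, random_state=0)
transformed = tsne.fit_transform(PCA(n_components=50).fit_transform(numpy_matrix2))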
Here we take numpy_matrix2 and subtypes and split them into training and test input data (x) and labels (y). After training an algorithm on the training dataset and running a classification on the test dataset, you can measure the quality of the classification with precision (the number of true positives found compared to all predicted positives) and recall (the number of true positives found compared to the number of all true positives).
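As a toy illustration of these two metrics (made-up labels, not part of the dataset):
from sklearn.metrics import precision_score, recall_score
truth = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
print(precision_score(truth, pred))  # 2 of 3 predicted positives are correct -> 0.67
print(recall_score(truth, pred))     # 2 of 3 actual positives were found -> 0.67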
train_x, test_x, train_y, test_y = train_test_split(numpy_matrix2, subtypes)
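If the subtype classes are unevenly represented, it may help to stratify the split so that the class proportions are the same in the training and test sets (a sketch, assuming every subtype occurs at least twice):
train_x, test_x, train_y, test_y = train_test_split(
    numpy_matrix2, subtypes, stratify=subtypes, random_state=0)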
k-nearest neighbors is a simple, non-generalizing learning method that infers the class from a vote among the k nearest neighbors.
knc = KNeighborsClassifier(n_neighbors=5)
knc.fit(train_x, train_y)
pred_y = knc.predict(test_x)
print(classification_report(test_y, pred_y))
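The choice of k matters; a quick sketch that scans a few values and reports the mean test accuracy:
for k in (1, 3, 5, 7, 9, 11):
    knc = KNeighborsClassifier(n_neighbors=k)
    knc.fit(train_x, train_y)
    print(k, knc.score(test_x, test_y))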
Logistic regression is a linear model for classification based on the logistic function.
lr = LogisticRegression(C=0.01)
lr.fit(train_x, train_y)
pred_y = lr.predict(test_x)
print(classification_report(test_y, pred_y))
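The regularization strength C was picked by hand here; a cross-validated grid search is a more principled way to choose it (a sketch):
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(LogisticRegression(), {'C': [0.001, 0.01, 0.1, 1, 10]}, cv=3)
grid.fit(train_x, train_y)
print(grid.best_params_, grid.best_score_)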
Random forest is an ensemble method that builds a forest of decision trees, each fitted to a random sample of the data.
rf = RandomForestClassifier(n_estimators=500)
rf.fit(train_x, train_y)
pred_y = rf.predict(test_x)
print(classification_report(test_y, pred_y))
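A fitted random forest also reports how much each feature contributed to its splits; here the indices refer to positions in the microarray's VALUE column (probes), not gene names (a sketch):
importances = rf.feature_importances_
top = numpy.argsort(importances)[::-1][:10]  # the ten most informative probes
print(top, importances[top])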
Support vector machines fit a linear decision boundary in the feature space.
cls = LinearSVC(C=0.1)
cls.fit(train_x, train_y)
pred_y = cls.predict(test_x)
print(classification_report(test_y, pred_y))
The Naive Bayes algorithm applies Bayesian inference under the assumption that all features are statistically independent.
cls = GaussianNB()
cls.fit(train_x, train_y)
pred_y = cls.predict(test_x)
print(classification_report(test_y, pred_y))
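A single train/test split can be noisy with this few samples; cross-validation gives a more stable comparison across all the classifiers tried above (a sketch):
from sklearn.model_selection import cross_val_score
for clf in (KNeighborsClassifier(), LogisticRegression(C=0.01),
            RandomForestClassifier(n_estimators=500), LinearSVC(C=0.1), GaussianNB()):
    scores = cross_val_score(clf, numpy_matrix2, subtypes, cv=3)
    print(type(clf).__name__, scores.mean())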