Context RNA sequencing (RNA-seq) is one of the most commonly used techniques in the life sciences and has been widely applied in cancer research, drug development, and cancer diagnosis and prognosis. Sequencing the coding regions or the whole cancer transcriptome provides valuable information about gene expression changes in tumors. Cancer transcriptome sequencing captures both coding and non-coding RNA and preserves strand-specific information, an important component of gene regulation, giving a complete view of expression dynamics.
Objective To use Principal Component Analysis (PCA) to transform a large set of variables into a smaller one that still contains most of the information in the original set.
Key Concepts
Applying PCA for dimensionality reduction
Visualizing the data in the lower dimension
Data Description This collection of data is part of the RNA-Seq (HiSeq) PANCAN data set. It is a random extraction of gene expressions from patients with five different types of tumor:
BRCA
KIRC
COAD
LUAD
PRAD
Samples (instances) are stored row-wise. Variables (attributes) of each sample are RNA-Seq gene expression levels measured on the Illumina HiSeq platform. A dummy name (gene_X) is given to each attribute.
Let's start coding!
Importing necessary libraries
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to perform PCA
from sklearn.decomposition import PCA
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
# loading the dataset
data = pd.read_csv("gene_data/data.csv", index_col=[0])
labels = pd.read_csv("gene_data/labels.csv", index_col=[0])
data.shape
Output:
(801, 20531)
The data has 801 observations and 20531 attributes.
data.head()
Output:
labels.head()
Output:
In the above output, the Class column indicates the tumor type, one of the five types (PRAD, LUAD, BRCA, KIRC, and COAD).
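As a quick sanity check (an added step, not part of the original walkthrough), we can also count how many samples belong to each tumor type:
# counting the number of samples per tumor type
# (assumes the labels dataframe has a single column named Class, as shown above)
labels["Class"].value_counts()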
Let's combine the data and its labels.
frames = [labels, data]
df = pd.concat(frames, axis=1)
df.reset_index(drop=True, inplace=True)
# checking the shape of the combined dataframe
df.shape
Output:
(801, 20532)
df.head()
Output:
Data Preprocessing
Normalization is important for PCA because PCA is a variance-maximizing exercise: it projects the original data onto the directions that maximize the variance.
Without scaling, the features with the largest numeric ranges would dominate the principal components, so we first standardize each feature to zero mean and unit variance (z-score).
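As a minimal illustration of why scaling matters (an added check, not part of the original walkthrough), we can look at how widely the raw feature variances differ; genes with the largest raw variance would otherwise dominate the first components.
# illustrative check: the raw gene expression columns have very different variances,
# so without standardization the highest-variance genes would dominate the principal components
feature_variances = df.iloc[:, 1:].var()
print("Smallest feature variance:", feature_variances.min())
print("Largest feature variance: ", feature_variances.max())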
X = df.iloc[:, 1:].values
# normalizing the features
X = StandardScaler().fit_transform(X)
X.shape
Output:
(801, 20531)
# checking the mean and standard deviation
np.mean(X), np.std(X)
Output:
(-1.091446422314019e-18, 0.9934763587711302)
The mean is nearly zero and the standard deviation is nearly 1, as expected after standardizing the data.
Applying Principal Component Analysis (PCA)
In the next few lines of code, we will project the 20531-dimensional cancer RNA-seq data to two dimensions using PCA.
Original dimensions = 20531
Dimensions after applying PCA = 2
Note: We only use the features for dimensionality reduction; the labels are not needed.
pca = PCA(n_components=2)
pca_data = pca.fit_transform(X)
Let's convert the above result into a dataframe.
pca_df = pd.DataFrame(
    data=pca_data, columns=["Principal Component 1", "Principal Component 2"]
)
pca_df.tail()
Output:
print(
    "Explained variance per principal component: {}".format(
        pca.explained_variance_ratio_
    )
)
Output:
Explained variance per principal component: [0.10539781 0.08754232]
Observations
In the above result, the explained variance is shown.
The first principal component explains 10.54% of total variance in the data.
The second principal component explains 8.75% of total variance in the data.
We see a huge reduction in dimensions, from 20531 to 2, and the 2 dimensions explain nearly 20% of total variance in the data.
On some other datasets, PCA can explain a much larger share of the variance; for example, 2 principal components may explain more than 90% of the total variance (a quick way to check how many components we would need here is sketched below).
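If we wanted to retain a larger share of the variance, one option (an added sketch, not part of the original analysis; the choice of 50 components is arbitrary) is to fit PCA with more components and plot the cumulative explained variance ratio:
# sketch: fit PCA with more components and inspect the cumulative explained variance
pca_check = PCA(n_components=50)
pca_check.fit(X)
cum_var = np.cumsum(pca_check.explained_variance_ratio_)
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance ratio")
plt.title("Cumulative explained variance")
plt.show()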
# defining the targets
targets = list(df.Class.unique())
targets
Output:
['PRAD', 'LUAD', 'BRCA', 'KIRC', 'COAD']
Visualizing the data in the lower dimension
We can visualize data in 2 dimensions and also data in 3 dimensions (using 3-D plots).
In some cases, we can also visualize data in 4 dimensions by using different hues for the 4th dimension in a 3-D plot.
But it's impossible for us to visualize and interpret data in 20531 dimensions.
So using PCA, we scaled down to 2 dimensions, and now it's easy for us to visualize the data.
plt.figure(figsize=(10, 10))
sns.scatterplot(
    data=pca_df, x="Principal Component 1", y="Principal Component 2", hue=df.Class
)
plt.title("Visualizing the data in the lower dimension", fontsize=20)
plt.show()
Output:
Insights
From the above plot, we can observe that the five classes (PRAD, LUAD, BRCA, KIRC, COAD), when projected onto a two-dimensional space, are linearly separable to some extent.
The KIRC class in particular is clearly separated from the other classes (a rough quantitative check is sketched below).
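To put a rough number on "separable to some extent" (an added check, not part of the original notebook; the choice of classifier is an assumption for illustration), we could cross-validate a simple linear classifier on just the two principal components:
# sketch: quantify class separability in the 2-D PCA space with a linear classifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(max_iter=1000), pca_data, df.Class, cv=5)
print("Mean 5-fold accuracy on 2 principal components:", scores.mean().round(3))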
Add-on: Applying t-distributed Stochastic Neighbor Embedding (t-SNE)
In the next few lines of code, we will project the 20531-dimensional cancer RNA-seq data to two dimensions using t-SNE.
Original dimensions = 20531
Dimensions after applying t-SNE = 2
Note: t-SNE is covered in the additional learning material of Week 2.
# to perform t-SNE
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=1)
tsne_data = tsne.fit_transform(X)
Let's convert the above result into a dataframe.
tsne_df = pd.DataFrame(data=tsne_data, columns=["Component 1", "Component 2"])
tsne_df.tail()
Output:
Visualizing the data in the lower dimension
plt.figure(figsize=(10, 10))
sns.scatterplot(data=tsne_df, x="Component 1", y="Component 2", hue=df.Class)
plt.title("Visualizing the data in the lower dimension", fontsize=20)
plt.show()
Output:
Insights
From the above plot, we can observe that the five classes (PRAD, LUAD, BRCA, KIRC, COAD), when projected onto a two-dimensional space with t-SNE, are separated to a large extent.
A few points from the LUAD class lie far from the rest of their class and closer to the KIRC class (a comparison of the two embeddings is sketched below).
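As a rough way to compare how well the PCA and t-SNE embeddings separate the known classes (an added sketch, not part of the original notebook), we could compute the silhouette score of each 2-D embedding, using the tumor labels as the grouping:
# sketch: compare class separation in the PCA and t-SNE embeddings via the silhouette score
from sklearn.metrics import silhouette_score

print("Silhouette score (PCA):  ", round(silhouette_score(pca_data, df.Class), 3))
print("Silhouette score (t-SNE):", round(silhouette_score(tsne_data, df.Class), 3))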
Let's try running t-SNE with different values of perplexity
perplexity = [2, 5, 10, 20, 30, 50, 100, 200]
plt.figure(figsize=(20, 10))
print(
    "Visualizing the lower dimensional representation of data for different values of perplexity"
)
for i in range(len(perplexity)):
    tsne = TSNE(n_components=2, perplexity=perplexity[i], n_jobs=-2, random_state=1)
    # n_jobs specifies the number of parallel jobs to run;
    # -2 means using all processors except one
    X_red = tsne.fit_transform(X)
    red_data_df = pd.DataFrame(data=X_red, columns=["Component 1", "Component 2"])
    plt.subplot(2, int(len(perplexity) / 2), i + 1)
    plt.title("perplexity=" + str(perplexity[i]))
    sns.scatterplot(data=red_data_df, x="Component 1", y="Component 2", hue=df.Class)
    plt.tight_layout(pad=2)
Output: