1 Intro

This exercise will show how to obtain clinical and genomic data from the Cancer Genome Atlas (TGCA) and to perform classical analysis important for clinical data.

These include:

Download the data (clinical and expression) from TGCA
Processing of the data (normalization) and saving it locally using simple table formats.
Unsupervised analysis includes differential expression, PCA and clustering.
Build a machine learning model (classifier) to predict cancer.
Perform survival analysis of molecular markers detected in previous analysis.

First, we start by loading all libraries necessary for this exercise. Please check their documentation if you want to know more.

# Load packages
library("TCGAbiolinks")
library("limma")
library("edgeR")
library("glmnet")
library("factoextra")
library("FactoMineR")
library("caret")
library("SummarizedExperiment")
library("gplots")
library("survival")
library("survminer")
library("RColorBrewer")
library("clusterProfiler")
library("genefilter")

2 TCGA data

In this tutorial, we will focus on Liver Hepatocellular Carcinoma, which is identified in TCGA as LIHC. For LIHC, TCGA provides data for 377 patients including: clinical, expression, DNA methylation and genotyping data. In this tutorial, we will work with clinical and expression data (RNA-seq). Go to https://portal.gdc.cancer.gov/ and search for TCGA-LIHC if you want to understand the data deposited in TCGA. You can also try to find your way through the previous data to look for other data sets of your interest.

We will make use of the TCGAbiolinks library, which allows us to query, prepare and download data from the TCGA portal. TCGAbiolinks provides important functionality as matching data of same the donors across distinct data types (clinical vs expression) and provides data structures to make its analysis in R easy.

To download TCGA data with TCGAbiolinks, you need to follow 3 steps. First, you will query the TCGA database through R with the function GDCquery. This will allow you to investigate the data available at the TCGA database. Next, we use GDCdownload to download raw version of desired files into your computer. Finally GDCprepare will read these files and make R data structures so that we can further analyse them.

Before we get there however we need to know what we are searching for. We can check all the available projects at TCGA with the command bellow. Since there are many lets look at the first 6 projects using the command head().

GDCprojects = getGDCprojects()

head(GDCprojects[c("project_id", "name")])

##      project_id
## 1 CGCI-HTMCP-CC
## 2    TARGET-AML
## 3     GENIE-JHU
## 4     GENIE-MSK
## 5    GENIE-VICC
## 6     GENIE-MDA
##                                                                                          name
## 1                             HIV+ Tumor Molecular Characterization Project - Cervical Cancer
## 2                                                                      Acute Myeloid Leukemia
## 3 AACR Project GENIE - Contributed by Johns Hopkins Sidney Kimmel Comprehensive Cancer Center
## 4                  AACR Project GENIE - Contributed by Memorial Sloan Kettering Cancer Center
## 5                         AACR Project GENIE - Contributed by Vanderbilt-Ingram Cancer Center
## 6                               AACR Project GENIE - Contributed by MD Anderson Cancer Center

As a general rule in R (and especially if you are working in RStudio) whenever some method returns some value or table you are not familiar with, you should check its structure and dimensions. You can always use functions such as head() to only show the first entries and dim() to check the dimension of the data.

We already know that Liver Hepatocellular Carcinoma has as id TCGA-LIHC. We can use the following function to get details on all data deposited for TCGA-LIHC.

TCGAbiolinks:::getProjectSummary("TCGA-LIHC")

## $file_count
## [1] 20162
## 
## $data_categories
##   file_count case_count               data_category
## 1       2634        377            Sequencing Reads
## 2       1634        377                 Biospecimen
## 3       4175        377       Copy Number Variation
## 4       6048        377 Simple Nucleotide Variation
## 5       1698        376     Transcriptome Profiling
## 6       1290        377             DNA Methylation
## 7        803        377                    Clinical
## 8       1696        371        Structural Variation
## 9        184        184          Proteome Profiling
## 
## $case_count
## [1] 377
## 
## $file_size
## [1] 7.142195e+13

Of note, not all patients were measured for all data types. Also, some data types have more files than samples. This is the case when more experiments were performed per patient, i.e. transcriptome profiling was performed both in mRNA and miRNA, or that data have been analysed by distinct computational strategies.

Let us start by querying all RNA-seq data from LIHC project. When using GDCquery we always need to specify the id of the project, i.e. “TCGA_LIHC”, and the data category we are interested in, i.e. “Transcriptome Profiling”. Here, we will focus on a particular type of data summarization for mRNA-seq data (workflow.type), which is based on raw counts estimated with HTSeq.

Note that performing this query will take a few of minutes

query_TCGA = GDCquery(
  project = "TCGA-LIHC",
  data.category = "Transcriptome Profiling", # parameter enforced by GDCquery
  data.type = "Gene Expression Quantification",
  experimental.strategy = "RNA-Seq",
  #workflow.type = "HTSeq - Counts"
  workflow.type = "STAR - Counts"
)

To visualize the query results in a more readable way, we can use the command getResults.

lihc_res = getResults(query_TCGA) # make results as table
# head(lihc_res) # data of the first 6 patients.
colnames(lihc_res) # columns present in the table

##  [1] "id"                        "data_format"              
##  [3] "cases"                     "access"                   
##  [5] "file_name"                 "submitter_id"             
##  [7] "data_category"             "type"                     
##  [9] "file_size"                 "created_datetime"         
## [11] "md5sum"                    "updated_datetime"         
## [13] "file_id"                   "data_type"                
## [15] "state"                     "experimental_strategy"    
## [17] "version"                   "data_release"             
## [19] "project"                   "analysis_id"              
## [21] "analysis_state"            "analysis_submitter_id"    
## [23] "analysis_workflow_link"    "analysis_workflow_type"   
## [25] "analysis_workflow_version" "sample_type"              
## [27] "is_ffpe"                   "cases.submitter_id"       
## [29] "sample.submitter_id"

One interesting question is the tissue type measured at an experiment (normal, solid tissue, cell line). This information is present at column sample_type.

head(lihc_res$sample_type) # first 6 types of tissue.

## [1] "Solid Tissue Normal" "Primary Tumor"       "Primary Tumor"      
## [4] "Primary Tumor"       "Primary Tumor"       "Primary Tumor"

sample_type is a categorical variable, which can be better visualized with the table function, so that we can check how many different categories exist in the data.

table(lihc_res$sample_type) # summary of distinct tissues types present in this study

## 
##       Primary Tumor     Recurrent Tumor Solid Tissue Normal 
##                 371                   3                  50

As you see, there are 50 controls (Solid Tissue Normal) and 371 cancer samples (Primary Tumors). For simplicity, we will ignore the small class of recurrent tumors. Therefore, we will redo the query as

query_TCGA = GDCquery(
  project = "TCGA-LIHC",
  data.category = "Transcriptome Profiling", # parameter enforced by GDCquery
  data.type = "Gene Expression Quantification",
  experimental.strategy = "RNA-Seq",
  workflow.type = "STAR - Counts",
  sample.type = c("Primary Tumor", "Solid Tissue Normal"))

Next, we need to download the files from the query. Before, be sure that you set your current working directory to the place you want to save your data. TCGA will save the data in a directory structure starting with a directory “GDCdata”.

Let us now download the files specified in the query.

GDCdownload(query = query_TCGA)

Given that you need to download many files, the previous operation might take some time. Often the download fails for one or another file. You can re-run the previous command until no error message is given. The method will recognize that the data has already been downloaded and won’t download the data again.

Finally, lets load the actual RNAseq data into R. Remember that the output directory set must be the same to where you downloaded the data.

tcga_data = GDCprepare(query_TCGA)

We can then check the size of the object with the command.

dim(tcga_data)

## [1] 60660   421

There are 3 functions that allow us to access to most important data present in this object, these are: colData(), rowData(), assays(). Use the command ?SummarizedExperiment to find more details. colData() allows us to access the clinical data associated with our samples. The functions colnames() and rownames() can be used to extract the column and rows names from a given table respectively. But be careful if you have large objects. The colnames() and rownames() functions return all column and row names, even if you have tens of thousands. The combination header(colnames()) can make more sense in these cases.

# In R (and other programming languages) you can often
# chain functions to save time and space
colnames(colData(tcga_data))

##  [1] "barcode"                     "patient"                    
##  [3] "sample"                      "shortLetterCode"            
##  [5] "definition"                  "sample_submitter_id"        
##  [7] "sample_type_id"              "tumor_descriptor"           
##  [9] "sample_id"                   "sample_type"                
## [11] "composition"                 "days_to_collection"         
## [13] "state"                       "initial_weight"             
## [15] "pathology_report_uuid"       "submitter_id"               
## [17] "oct_embedded"                "is_ffpe"                    
## [19] "tissue_type"                 "synchronous_malignancy"     
## [21] "ajcc_pathologic_stage"       "days_to_diagnosis"          
## [23] "treatments"                  "last_known_disease_status"  
## [25] "tissue_or_organ_of_origin"   "days_to_last_follow_up"     
## [27] "age_at_diagnosis"            "primary_diagnosis"          
## [29] "prior_malignancy"            "year_of_diagnosis"          
## [31] "prior_treatment"             "ajcc_staging_system_edition"
## [33] "ajcc_pathologic_t"           "morphology"                 
## [35] "ajcc_pathologic_n"           "ajcc_pathologic_m"          
## [37] "classification_of_tumor"     "diagnosis_id"               
## [39] "icd_10_code"                 "site_of_resection_or_biopsy"
## [41] "tumor_grade"                 "progression_or_recurrence"  
## [43] "alcohol_history"             "exposure_id"                
## [45] "race"                        "gender"                     
## [47] "ethnicity"                   "vital_status"               
## [49] "age_at_index"                "days_to_birth"              
## [51] "year_of_birth"               "demographic_id"             
## [53] "days_to_death"               "year_of_death"              
## [55] "bcr_patient_barcode"         "primary_site"               
## [57] "project_id"                  "disease_type"               
## [59] "name"                        "releasable"                 
## [61] "released"                    "sample.aux"

This link provides a basic explanation about tcga_data. Note that both clinical and expression data are present in this object.

Lets look at some potentially interesting features. The table() function (in this context) produces a small summary with the sum of each of the factors present in a given column.

table(tcga_data@colData$vital_status)

## 
##        Alive         Dead Not Reported 
##          255          164            2

table(tcga_data@colData$ajcc_pathologic_stage)

## 
##    Stage I   Stage II  Stage III Stage IIIA Stage IIIB Stage IIIC   Stage IV 
##        189         97          6         73          8         10          3 
##  Stage IVA  Stage IVB 
##          1          2

table(tcga_data@colData$definition)

## 
## Primary solid Tumor Solid Tissue Normal 
##                 371                  50

table(tcga_data@colData$tissue_or_organ_of_origin)

## 
## Liver 
##   421

table(tcga_data@colData$gender)

## 
## female   male 
##    143    278

table(tcga_data@colData$race)

## 
## american indian or alaska native                            asian 
##                                2                              164 
##        black or african american                     not reported 
##                               24                               13 
##                            white 
##                              218

Is there a particular column (feature) that allows you to distinguish tumor tissue from normal tissue?

What about the RNA-seq data? We can use the assay function to obtain the RNA-seq count matrices and rowData to see gene mapping information. Can you tell how many genes and how many samples are included there?

dim(assay(tcga_data))     # gene expression matrices.

## [1] 60660   421

head(assay(tcga_data)[,1:2]) # expression of first 6 genes and first 2 samples

##                    TCGA-DD-A1EG-11A-11R-A213-07 TCGA-RC-A7SF-01A-11R-A352-07
## ENSG00000000003.15                         3729                         2289
## ENSG00000000005.6                            14                            1
## ENSG00000000419.13                          954                          504
## ENSG00000000457.14                          327                          311
## ENSG00000000460.17                           56                           77
## ENSG00000000938.13                          598                          116

head(rowData(tcga_data))     # ensembl id and gene id of the first 6 genes.

## DataFrame with 6 rows and 10 columns
##                      source     type     score     phase            gene_id
##                    <factor> <factor> <numeric> <integer>        <character>
## ENSG00000000003.15   HAVANA     gene        NA        NA ENSG00000000003.15
## ENSG00000000005.6    HAVANA     gene        NA        NA  ENSG00000000005.6
## ENSG00000000419.13   HAVANA     gene        NA        NA ENSG00000000419.13
## ENSG00000000457.14   HAVANA     gene        NA        NA ENSG00000000457.14
## ENSG00000000460.17   HAVANA     gene        NA        NA ENSG00000000460.17
## ENSG00000000938.13   HAVANA     gene        NA        NA ENSG00000000938.13
##                         gene_type   gene_name       level     hgnc_id
##                       <character> <character> <character> <character>
## ENSG00000000003.15 protein_coding      TSPAN6           2  HGNC:11858
## ENSG00000000005.6  protein_coding        TNMD           2  HGNC:17757
## ENSG00000000419.13 protein_coding        DPM1           2   HGNC:3005
## ENSG00000000457.14 protein_coding       SCYL3           2  HGNC:19285
## ENSG00000000460.17 protein_coding    C1orf112           2  HGNC:25565
## ENSG00000000938.13 protein_coding         FGR           2   HGNC:3697
##                             havana_gene
##                             <character>
## ENSG00000000003.15 OTTHUMG00000022002.2
## ENSG00000000005.6  OTTHUMG00000022001.2
## ENSG00000000419.13 OTTHUMG00000032742.2
## ENSG00000000457.14 OTTHUMG00000035941.6
## ENSG00000000460.17 OTTHUMG00000035821.9
## ENSG00000000938.13 OTTHUMG00000003516.3

Finally, we can use some core functionality of R to save the TCGA_data as a .RDS file. This is faster than repeating the previous operations and useful if you need to work in your data for several days. To keep things tidy, we create a directory called data and save our file there.

dir.create("data")

## Warning in dir.create("data"): 'data' already exists

# Save the data as a file, if you need it later, you can just load this file
# instead of having to run the whole pipeline again
saveRDS(object = tcga_data,
        file = file.path("data", "tcga_data.RDS"),
        compress = FALSE)

The data can be loaded with the following command:

tcga_data = readRDS(file = file.path("data","tcga_data.RDS"))

which is the same as:

tcga_data = readRDS(file = "./data/tcga_data.RDS")

3 RNASeq Normalization

3.1 Defining a pipeline

Basics

A typical task on RNA-Seq data is differential expression (DE) analysis, based on some clinical phenotypes. This, in turn, requires normalization of the data, as in its raw format it may have batch effects and other artifacts.

A common approach to such complex tasks is to define a computational pipeline, performing several steps in sequence, allowing the user to select different parameters.

We will now define and run one such pipeline, through the use of an R function.

The function is called limma_pipeline(tcga_data, condition_variable, reference_group), where tcga_data is the data object we have gotten from TCGA and condition_variable is the interesting variable/condition by which you want to group your patient samples. You can also decide which one of the values of your conditional variable is going to be the reference group, with the reference_group parameter.

This function returns a list with three different objects:

A complex object, resulting from running voom, this contains the TMM+voom normalized data;
A complex object, resulting from running eBayes, this contains the the fitted model plus a number of statistics related to each of the probes;
A simple table, resulting from running topTable, which contains the top 100 differentially expressed genes sorted by p-value. This is how the code of this function:

limma_pipeline = function(
  tcga_data,
  condition_variable,
  reference_group = NULL){

  design_factor = colData(tcga_data)[, condition_variable, drop = TRUE]

  group = factor(design_factor)

  if(!is.null(reference_group)){
    group = relevel(group, ref = reference_group)
  }

  design = model.matrix(~ group)

  dge = DGEList(
    counts = assay(tcga_data),
    samples = colData(tcga_data),
    genes = as.data.frame(rowData(tcga_data))
  )

  # filtering
  keep = filterByExpr(dge, design)
  dge = dge[keep, , keep.lib.sizes = FALSE]
  rm(keep)

  # Normalization (TMM followed by voom)
  dge = calcNormFactors(dge)
  v = voom(dge, design, plot = TRUE)

  # Fit model to data given design
  fit = lmFit(v, design)
  fit = eBayes(fit)

  # Show top genes
  topGenes = topTable(fit, coef = ncol(design), number = 100, sort.by = "p")

  return(
    list(
      voomObj = v, # normalized data
      fit = fit, # fitted model and statistics
      topGenes = topGenes # the 100 most differentially expressed genes
    )
  )
}

Please read the Details tab for a step by step explanation of what this pipeline does. For now, copy the limma_pipeline function in your R console, so we can start using it on the TCGA data.

With the following command, we can obtain the DE analysis comparing Primary solid Tumor samples against Solid Tissue Normal. This will be used in the next section, on the classification task.

limma_res = limma_pipeline(
  tcga_data = tcga_data,
  condition_variable = "definition",
  reference_group = "Solid Tissue Normal"
)

Let’s save this object to file, like we did with tcga_data:

# Save the data as a file, if you need it later, you can just load this file
# instead of having to run the whole pipeline again
saveRDS(
  object = limma_res,
  file = file.path("data", "limma_res.RDS"),
  compress = FALSE
)

As an example, we also show here how we can use limma_pipeline to perform DE analysis by grouping patients by gender instead of by tissue type.

gender_limma_res = limma_pipeline(
  tcga_data = tcga_data,
  condition_variable = "gender",
  reference_group = "female"
)

Details

In our pipeline function, we use the package limma. We will select a particular clinical feature of the data to use as class for grouping the samples as either tumor vs normal tissue. This data is available under the column definition for tcga_data, but needs the use of the function colData to be accessed. In addition, limma requires this data to be a factor, so we convert it as such:

clinical_data = colData(tcga_data)
group = factor(clinical_data$definition)

As seen before, we have 2 distinct groups of tissues defined in this column, Solid Tissue Normal (our control samples) and Primary solid Tumor (our samples of interest). In this factor, we also want to define Solid Tissue Normal as being our base or reference level.

group = relevel(group, ref = "Solid Tissue Normal")

Next, we need to create a design matrix, which will indicate the conditions to be compared by the DE analysis. The ~ symbol represents that we are constructing a formula.

design = model.matrix(~group)
head(design)

##   (Intercept) groupPrimary solid Tumor
## 1           1                        0
## 2           1                        1
## 3           1                        1
## 4           1                        1
## 5           1                        1
## 6           1                        1

Before performing DE analysis, we remove genes, which have low amount of counts. We transform our tcga_data object as DGEList, which provides functions for filtering. By default genes with counts with less than 10 reads are removed.

dge = DGEList( # creating a DGEList object
  counts = assay(tcga_data),
  samples = colData(tcga_data),
  genes = as.data.frame(rowData(tcga_data))
)

# filtering
keep = filterByExpr(dge, design) # defining which genes to keep
dge = dge[keep, , keep.lib.sizes = FALSE] # filtering the dge object
rm(keep) #  use rm() to remove objects from memory if you don't need them anymore

Before we fit a model to our data, we normalize the data to minimize batch effects and technical variation with the TMM (trimmed mean of M-values) normalization method. Moreover, to apply limma on RNA-seq, we need to convert the data to have a similar variance as arrays. This is done with the VOOM method.

dge = calcNormFactors(dge, method = "TMM")
v = voom(dge, design, plot = TRUE)

Finally, using lmFit lets fit a series of linear models, one to each of the probes. These data will then be fed to eBayes to produce a complex object which holds a number of statistics that we can use to rank the differentially expressed genes.

fit = lmFit(v, design)
fit = eBayes(fit)

Using the function topTable we can check the top10 genes classified as being differentially expressed. ’

topGenes = topTable(fit, coef = 1, sort.by = "p")
print(topGenes)

##                    source type score phase            gene_id      gene_type
## ENSG00000009307.16 HAVANA gene    NA    NA ENSG00000009307.16 protein_coding
## ENSG00000011304.22 HAVANA gene    NA    NA ENSG00000011304.22 protein_coding
## ENSG00000022840.16 HAVANA gene    NA    NA ENSG00000022840.16 protein_coding
## ENSG00000044115.21 HAVANA gene    NA    NA ENSG00000044115.21 protein_coding
## ENSG00000048828.17 HAVANA gene    NA    NA ENSG00000048828.17 protein_coding
## ENSG00000054118.15 HAVANA gene    NA    NA ENSG00000054118.15 protein_coding
## ENSG00000057608.17 HAVANA gene    NA    NA ENSG00000057608.17 protein_coding
## ENSG00000067560.13 HAVANA gene    NA    NA ENSG00000067560.13 protein_coding
## ENSG00000068697.7  HAVANA gene    NA    NA  ENSG00000068697.7 protein_coding
## ENSG00000070831.17 HAVANA gene    NA    NA ENSG00000070831.17 protein_coding
##                    gene_name level    hgnc_id           havana_gene    logFC
## ENSG00000009307.16     CSDE1     1 HGNC:29905  OTTHUMG00000012060.6 8.426555
## ENSG00000011304.22     PTBP1     2  HGNC:9583 OTTHUMG00000181789.13 7.266500
## ENSG00000022840.16     RNF10     1 HGNC:10055 OTTHUMG00000168999.17 6.839278
## ENSG00000044115.21    CTNNA1     1  HGNC:2509  OTTHUMG00000163502.5 7.433791
## ENSG00000048828.17   FAM120A     2 HGNC:13247  OTTHUMG00000020252.4 7.202252
## ENSG00000054118.15    THRAP3     2 HGNC:22964  OTTHUMG00000007866.8 6.646039
## ENSG00000057608.17      GDI2     1  HGNC:4227 OTTHUMG00000017607.10 7.941541
## ENSG00000067560.13      RHOA     1   HGNC:667  OTTHUMG00000156838.8 7.947602
## ENSG00000068697.7    LAPTM4A     2  HGNC:6924  OTTHUMG00000090750.3 8.189099
## ENSG00000070831.17     CDC42     2  HGNC:1736 OTTHUMG00000002753.10 7.009241
##                     AveExpr        t P.Value adj.P.Val        B
## ENSG00000009307.16 8.203491 125.3471       0         0 759.8161
## ENSG00000011304.22 7.489407 159.3929       0         0 857.5700
## ENSG00000022840.16 6.882460 136.6479       0         0 795.1348
## ENSG00000044115.21 7.801750 121.1519       0         0 746.2174
## ENSG00000048828.17 6.997776 129.3667       0         0 772.8610
## ENSG00000054118.15 6.365231 130.6175       0         0 776.8190
## ENSG00000057608.17 7.627046 131.7043       0         0 779.9930
## ENSG00000067560.13 8.089706 158.9324       0         0 856.2014
## ENSG00000068697.7  7.866955 125.1309       0         0 759.1753
## ENSG00000070831.17 6.925490 125.9530       0         0 762.0301

topTable returns a table with the top genes sorted by P.value expressing if the gene is differentially expressed.

3.2 Visualization

We have also prepared a function that produces PCA plots given the voom object created by the limma_pipeline function. You can inspect this function and try to figure out how it works.

plot_PCA = function(voomObj, condition_variable){
  group = factor(voomObj$targets[, condition_variable])
  pca = prcomp(t(voomObj$E))
  # Take PC1 and PC2 for the plot
  plot(pca$x[, 1:2], col = group, pch = 19)
  # include a legend for points
  legend(
    "bottomleft",
    inset = .01,
    levels(group),
    pch = 19,
    col = seq_along(levels(group))
  )
  return(pca)
}

By calling the function plot_PCA, we get a plot of the first two principal components:

res_pca = plot_PCA(limma_res$voomObj, "definition")

This plot shows that the two sample groups (tumor tissue and healthy tissue) have a well-separated RNA expression profile.

With all the information already available to us created with the limma_pipeline function, we can also, for example, create a heatmap picturing the expression of the top20 differentially expressed genes on our samples.

# first lets get the normalized expression matrix from our limma_res object
expr_mat = as.matrix(t(limma_res$voomObj$E))

# then lets get gene names that are easier to look at
gene_names = limma_res$voomObj$genes[, "gene_name"]
# and use these to rename the genes in our expression matrix
colnames(expr_mat) = gene_names

# we want to get the top20 DE genes
# topGenes is already sorted by the adjusted p-value (in ascending order)
head(limma_res$topGenes, 20)

##                    source type score phase            gene_id      gene_type
## ENSG00000104938.18 HAVANA gene    NA    NA ENSG00000104938.18 protein_coding
## ENSG00000165682.14 HAVANA gene    NA    NA ENSG00000165682.14 protein_coding
## ENSG00000263761.3  HAVANA gene    NA    NA  ENSG00000263761.3 protein_coding
## ENSG00000163217.2  HAVANA gene    NA    NA  ENSG00000163217.2 protein_coding
## ENSG00000136011.15 HAVANA gene    NA    NA ENSG00000136011.15 protein_coding
## ENSG00000164619.10 HAVANA gene    NA    NA ENSG00000164619.10 protein_coding
## ENSG00000145708.11 HAVANA gene    NA    NA ENSG00000145708.11 protein_coding
## ENSG00000160339.16 HAVANA gene    NA    NA ENSG00000160339.16 protein_coding
## ENSG00000019169.10 HAVANA gene    NA    NA ENSG00000019169.10 protein_coding
## ENSG00000251049.2  HAVANA gene    NA    NA  ENSG00000251049.2         lncRNA
## ENSG00000184374.3  HAVANA gene    NA    NA  ENSG00000184374.3 protein_coding
## ENSG00000142748.13 HAVANA gene    NA    NA ENSG00000142748.13 protein_coding
## ENSG00000145824.13 HAVANA gene    NA    NA ENSG00000145824.13 protein_coding
## ENSG00000160323.19 HAVANA gene    NA    NA ENSG00000160323.19 protein_coding
## ENSG00000164100.9  HAVANA gene    NA    NA  ENSG00000164100.9 protein_coding
## ENSG00000114812.13 HAVANA gene    NA    NA ENSG00000114812.13 protein_coding
## ENSG00000164161.10 HAVANA gene    NA    NA ENSG00000164161.10 protein_coding
## ENSG00000181072.11 HAVANA gene    NA    NA ENSG00000181072.11 protein_coding
## ENSG00000160801.14 HAVANA gene    NA    NA ENSG00000160801.14 protein_coding
## ENSG00000183287.14 HAVANA gene    NA    NA ENSG00000183287.14 protein_coding
##                     gene_name level    hgnc_id           havana_gene     logFC
## ENSG00000104938.18     CLEC4M     2 HGNC:13523  OTTHUMG00000182432.4 -9.230079
## ENSG00000165682.14     CLEC1B     1 HGNC:24356  OTTHUMG00000168502.1 -7.672067
## ENSG00000263761.3        GDF2     2  HGNC:4217  OTTHUMG00000188320.2 -9.252944
## ENSG00000163217.2       BMP10     2 HGNC:20869  OTTHUMG00000129573.2 -7.967087
## ENSG00000136011.15      STAB2     2 HGNC:18629  OTTHUMG00000170056.2 -6.135552
## ENSG00000164619.10      BMPER     2 HGNC:24154 OTTHUMG00000128675.24 -6.068348
## ENSG00000145708.11      CRHBP     2  HGNC:2356  OTTHUMG00000102133.3 -6.477126
## ENSG00000160339.16       FCN2     2  HGNC:3624  OTTHUMG00000020892.2 -7.521944
## ENSG00000019169.10      MARCO     2  HGNC:6895  OTTHUMG00000131400.4 -7.222983
## ENSG00000251049.2  AC107396.1     2       <NA>  OTTHUMG00000162168.2 -4.878924
## ENSG00000184374.3     COLEC10     2  HGNC:2220  OTTHUMG00000164971.2 -5.970541
## ENSG00000142748.13       FCN3     2  HGNC:3625  OTTHUMG00000005722.2 -5.866426
## ENSG00000145824.13     CXCL14     2 HGNC:10640  OTTHUMG00000129139.7 -6.679269
## ENSG00000160323.19   ADAMTS13     2  HGNC:1366  OTTHUMG00000020876.4 -3.143669
## ENSG00000164100.9       NDST3     2  HGNC:7682  OTTHUMG00000132959.5 -5.221459
## ENSG00000114812.13      VIPR1     1 HGNC:12694  OTTHUMG00000131792.9 -4.663159
## ENSG00000164161.10       HHIP     2 HGNC:14866  OTTHUMG00000161428.3 -6.529694
## ENSG00000181072.11      CHRM2     2  HGNC:1951  OTTHUMG00000155658.3 -5.542917
## ENSG00000160801.14      PTH1R     2  HGNC:9608  OTTHUMG00000133515.6 -4.167186
## ENSG00000183287.14      CCBE1     2 HGNC:29426  OTTHUMG00000180087.6 -5.430651
##                        AveExpr         t       P.Value     adj.P.Val        B
## ENSG00000104938.18 -3.29792212 -45.73718 7.192449e-166 1.634844e-161 367.9031
## ENSG00000165682.14 -3.36630698 -38.85242 1.182235e-141 1.343610e-137 312.4139
## ENSG00000263761.3  -2.97511698 -38.49714 2.451971e-140 1.857777e-136 309.7851
## ENSG00000163217.2  -3.56629765 -37.47196 1.687960e-136 9.591831e-133 300.6683
## ENSG00000136011.15 -0.87316796 -35.47037 7.669370e-129 3.486495e-125 283.6029
## ENSG00000164619.10 -1.73614838 -32.47381 5.699011e-117 2.158975e-113 256.2907
## ENSG00000145708.11  0.50819057 -31.81064 2.807402e-114 9.116035e-111 250.3490
## ENSG00000160339.16 -0.71253149 -31.77862 3.792409e-114 1.077518e-110 250.0199
## ENSG00000019169.10  0.02106692 -30.75429 6.095338e-110 1.539412e-106 240.3967
## ENSG00000251049.2  -5.58008401 -29.74075 1.000157e-105 2.273357e-102 229.2715
## ENSG00000184374.3   0.41556403 -28.60462 6.106177e-101  1.261758e-97 219.7624
## ENSG00000142748.13  1.97353779 -28.11263  7.540429e-99  1.428283e-95 214.9809
## ENSG00000145824.13 -0.50761409 -26.98018  5.418220e-94  9.473550e-91 203.8152
## ENSG00000160323.19  2.63404340 -26.81833  2.707197e-93  4.395327e-90 202.2345
## ENSG00000164100.9  -4.20542375 -26.40795  1.617913e-91  2.451677e-88 197.4752
## ENSG00000114812.13  0.75603499 -26.26335  6.863412e-91  9.750334e-88 196.7036
## ENSG00000164161.10 -1.63232306 -25.69212  2.107944e-88  2.818445e-85 190.9591
## ENSG00000181072.11 -4.42726426 -25.33904  7.370333e-87  9.307093e-84 186.8836
## ENSG00000160801.14  1.07293354 -25.07932  1.013441e-85  1.212396e-82 184.8461
## ENSG00000183287.14 -1.45879734 -24.92830  4.664431e-85  5.301126e-82 183.2542

# therefore we can just go ahead and select the first 20 values
top_de_genes = limma_res$topGenes$gene_name[1:20]

# we'll get the sample type data so that we can show the groups in the heatmap
sample_type = factor(limma_res$voomObj$targets$sample_type)

# define the color palette for the plot
hmcol = colorRampPalette(rev(brewer.pal(9, "RdBu")))(256)

# perform complete linkage clustering
clust_func = function(x) hclust(x, method = "complete")

# use the inverse of correlation as distance.
dist_func = function(x) as.dist((1 - cor(t(x))) / 2)

# A good looking heatmap involves a lot of parameters and tinkering
gene_heatmap = heatmap.2(
  t(expr_mat[, top_de_genes]),
  scale = "row",          # scale the values for each gene (row)
  density.info = "none",  # turns off density plot inside color legend
  trace = "none",         # turns off trace lines inside the heat map
  col = hmcol,            # define the color map
  labCol = FALSE,         # Not showing column labels
  ColSideColors = as.character(as.numeric(sample_type)), # Show colors for each response class
  dendrogram = "both",    # Show dendrograms for both axis
  hclust = clust_func,  # Define hierarchical clustering method
  distfun = dist_func,  # Using correlation coefficient for distance function
  cexRow = 1.0,           # Resize row labels
  keysize = 1.25,       # Size of the legend
  margins = c(1, 6)        # Define margin spaces
)

3.3 Exercise

Produce a PCA plot using gender as condition (hint: remember that we already run the pipeline for gender). How does it look like? Are the tissues well separated or not?

Can you think about a more interesting feature to group patients? Choose one and pass it to the condition_variable argument of the limma_pipeline function. Make sure to save the result to a variable.

Remember that you can see all available clinical features/phenotypes with the following command:

colnames(colData(tcga_data))

You can also run the plot_PCA function with this new condition_variable that you chose.

Extra challenge for the brave: change the limma_pipeline function so that it returns an arbitrary number of topGenes (defined by the user), instead of always 100.

4 Classification

4.1 Train and test paradigm

Now, we will explore the use of a machine learning method to classify an unseen sample as being a tumor or not. To achieve this goal we are going to build first a simple linear model (with feature selection), and then an Elastic Net model. For this, we need to split the data into two sets: a training set, which we will use to train our model, and a testing set. The test data serves as an independent dataset to validate our results. It is important that you do not use test data to optimize your results or this will include bias in the classifier.

First let’s start by extracting the data that we are going to use to build our model. We want the expression data that has already been normalized and a clinical feature which divides our data into different groups, such as tumor vs. non-tumor or tumor stage. We can get the normalized expression values from limma_res$voomObj$E and the type of sample is determined by the definition column.

limma_res = readRDS(file = "data/limma_res.RDS")

# Transpose and make it into a matrix object
d_mat = as.matrix(t(limma_res$voomObj$E))

# As before, we want this to be a factor
d_resp = as.factor(limma_res$voomObj$targets$definition)

With the data in the correct format we can now divide it into a train set and a test set. We will use the function createDataPartition which creates a vector of booleans (TRUE or FALSE) that we can then use to subset the matrix in this case leaving 75% of samples for training and 25% for testing.

# Divide data into training and testing set

# Set (random-number-generator) seed so that results are consistent between runs
set.seed(42)
train_ids = createDataPartition(d_resp, p = 0.75, list = FALSE)

x_train = d_mat[train_ids, ]
x_test  = d_mat[-train_ids, ]

y_train = d_resp[train_ids]
y_test  = d_resp[-train_ids]

x_train and y_train are the data we will use to train our model, where x is the matrix with the normalized expression values and y is the vector with the response variable, Primary solid Tumor and Solid Tissue Normal.

Following the same logic, x_test and y_test are the matrix with normalized expression values and the response variable respectively. Again, we will only use these (*_test) to perform a prediction and evaluate how good this prediction was after the training process has finished.

4.2 Train Elastic Net model

We will train an Elastic Net model, which is a generalized linear model that combines the best of two other models: LASSO and Ridge Regression. Ridge Regression is often good at doing predictions but its results are not very interpretable. LASSO is good at picking up small signals from lots of noise but it tends to minimize redundancy so if there are two genes that are equally good predictors (features with high correlation) it will tend to select one. Elastic Net is a balance between both methods, it selects the genes or groups of genes (if they are correlated) that best predict each of the conditions and use these to build a model that will then be used for classification.

We can then look at these genes individually to see if some interesting gene of biological relevance for the classification problem is selected. When using Elastic Net there are other parameters than we have to set, specifically alpha. This parameter will define if the Elastic Net will behave more like LASSO (alpha = 1) or like Ridge Regression (alpha = 0). For simplicity we will set it to 0.5 however in a real setting we would probably vary this value in order to find the best model (minimizing the error).

# Train model on training dataset using cross-validation
res = cv.glmnet(
  x = x_train,
  y = y_train,
  alpha = 0.5,
  family = "binomial"
)

4.3 Model Evaluation

After training the model, we can now evaluate it against our test-dataset. The result from cv.glmnet is a complex object that, between other data, holds the coefficients for our model and the mean error found during training.

# Test/Make prediction on test dataset
y_pred = predict(res, newx = x_test, type = "class", s = "lambda.min")

A confusion matrix is a simple table that compares the predictions from our model against their real values. Therefore, it shows us the true positives, true negatives, false positives and false negatives. We can use it to compute a number of accuracy metrics that we can then use to understand how good our model actually is.

confusion_matrix = table(y_pred, y_test)

# Evaluation statistics
print(confusion_matrix)

##                      y_test
## y_pred                Primary solid Tumor Solid Tissue Normal
##   Primary solid Tumor                  92                   0
##   Solid Tissue Normal                   0                  12

print(paste0("Sensitivity: ", sensitivity(confusion_matrix)))

## [1] "Sensitivity: 1"

print(paste0("Specificity: ", specificity(confusion_matrix)))

## [1] "Specificity: 1"

print(paste0("Precision: ",   precision(confusion_matrix)))

## [1] "Precision: 1"

As you see, for this data and under the current parameters, Elastic Net has great accuracy.

We can now take a look at the genes (coefficients), that Elastic Net selected to build its model.

# Getting genes that contribute for the prediction
res_coef = coef(res, s = "lambda.min") # the "coef" function returns a sparse matrix
dim(res_coef)

## [1] 22731     1

head(res_coef) # in a sparse matrix the "." represents the value of zero

## 6 x 1 sparse Matrix of class "dgCMatrix"
##                           s1
## (Intercept)        -5.944552
## ENSG00000000003.15  .       
## ENSG00000000419.13  .       
## ENSG00000000457.14  .       
## ENSG00000000460.17  .       
## ENSG00000000938.13  .

Of course, the number of coefficients is large (there are many genes!). We only want to consider coefficients with non-zero values, as these represent variables (genes) selected by the Elastic Net.

# get coefficients with non-zero values
res_coef = res_coef[res_coef[,1] != 0,]
# note how performing this operation changed the type of the variable
head(res_coef)

##        (Intercept) ENSG00000010610.10 ENSG00000019169.10 ENSG00000043355.12 
##       -5.944551872        0.008458070        0.006874668       -0.023017297 
## ENSG00000066279.18 ENSG00000086991.13 
##       -0.001324842       -0.105565951

# remove first coefficient as this is the intercept, a variable of the model itself
res_coef = res_coef[-1]

relevant_genes = names(res_coef) # get names of the (non-zero) variables.
length(relevant_genes) # number of selected genes

## [1] 102

head(relevant_genes) # few select genes

## [1] "ENSG00000010610.10" "ENSG00000019169.10" "ENSG00000043355.12"
## [4] "ENSG00000066279.18" "ENSG00000086991.13" "ENSG00000087237.12"

You might not know your Ensembl gene annotation by heart, so we can get the common gene name from limma_res$voomObj$genes as in this table we can find the ensembl to gene name correspondence.

head(limma_res$voomObj$genes)

##                    source type score phase            gene_id      gene_type
## ENSG00000000003.15 HAVANA gene    NA    NA ENSG00000000003.15 protein_coding
## ENSG00000000419.13 HAVANA gene    NA    NA ENSG00000000419.13 protein_coding
## ENSG00000000457.14 HAVANA gene    NA    NA ENSG00000000457.14 protein_coding
## ENSG00000000460.17 HAVANA gene    NA    NA ENSG00000000460.17 protein_coding
## ENSG00000000938.13 HAVANA gene    NA    NA ENSG00000000938.13 protein_coding
## ENSG00000000971.16 HAVANA gene    NA    NA ENSG00000000971.16 protein_coding
##                    gene_name level    hgnc_id          havana_gene
## ENSG00000000003.15    TSPAN6     2 HGNC:11858 OTTHUMG00000022002.2
## ENSG00000000419.13      DPM1     2  HGNC:3005 OTTHUMG00000032742.2
## ENSG00000000457.14     SCYL3     2 HGNC:19285 OTTHUMG00000035941.6
## ENSG00000000460.17  C1orf112     2 HGNC:25565 OTTHUMG00000035821.9
## ENSG00000000938.13       FGR     2  HGNC:3697 OTTHUMG00000003516.3
## ENSG00000000971.16       CFH     1  HGNC:4883 OTTHUMG00000035607.6

relevant_gene_names = limma_res$voomObj$genes[relevant_genes, "gene_name"]

head(relevant_gene_names) # few select genes (with readable names now)

## [1] "CD4"   "MARCO" "ZIC2"  "ASPM"  "NOX4"  "CETP"

Are there any genes of particular relevance?

Did limma and Elastic Net select some of the same genes? We can check the common genes between the two results by using the intersect function.

print(intersect(limma_res$topGenes$gene_id, relevant_genes))

##  [1] "ENSG00000104938.18" "ENSG00000136011.15" "ENSG00000019169.10"
##  [4] "ENSG00000251049.2"  "ENSG00000160323.19" "ENSG00000164100.9" 
##  [7] "ENSG00000100362.13" "ENSG00000223922.1"  "ENSG00000126838.10"
## [10] "ENSG00000183166.11" "ENSG00000087237.12" "ENSG00000228842.3" 
## [13] "ENSG00000250266.2"  "ENSG00000100170.10" "ENSG00000260015.1" 
## [16] "ENSG00000107864.15" "ENSG00000184809.13" "ENSG00000226622.6"

Of note, we do not expect a high overlap between genes selected by limma and Elastic net. The reason for this is the fact Elastic Net criteria bias the selection of genes, which are not highly correlated against each other, while not such bias is present in limma.

4.4 Hierarchical clustering

Finally we can take a look at how our samples cluster together by running an hierarchical clustering algorithm. We will only be looking at the genes Elastic Net found and use these to cluster the samples. The genes highlighted in green are the ones that limma had also selected as we’ve seen before. The samples highlighted in red are Solid Tissue Normal, the samples highlighted in black are Primary solid Tumor.

# define the color palette for the plot
hmcol = colorRampPalette(rev(brewer.pal(9, "RdBu")))(256)

# perform complete linkage clustering
clust = function(x) hclust(x, method = "complete")
# use the inverse of correlation as distance.
dist = function(x) as.dist((1 - cor(t(x))) / 2)

# Show green color for genes that also show up in DE analysis
colorLimmaGenes = ifelse(
  # Given a vector of boolean values
  (relevant_genes %in% limma_res$topGenes$gene_id),
  "green", # if true, return green for that value
  "white" # if false, return white for that value
)

# As you've seen a good looking heatmap involves a lot of parameters
d_mat_use <- t(d_mat[,relevant_genes])
rownames(d_mat_use) <- relevant_gene_names
gene_heatmap = heatmap.2(
  d_mat_use,
  scale         = "row",                            # scale the values for each gene (row)
  density.info  = "none",                           # turns off density plot inside color legend
  trace         = "none",                           # turns off trace lines inside the heat map
  col           = hmcol,                            # define the color map
  labRow        = relevant_gene_names,              # use gene names instead of ensembl annotation
  RowSideColors = colorLimmaGenes,
  labCol        = FALSE,                            # Not showing column labels
  ColSideColors = as.character(as.numeric(d_resp)), # Show colors for each response class
  dendrogram    = "both",                           # Show dendrograms for both axis
  hclust        = clust,                            # Define hierarchical clustering method
  distfun       = dist,                             # Using correlation coefficient for distance function
  cexRow        = .6,                               # Resize row labels
  margins       = c(1,5)                            # Define margin spaces
)

4.5 Over-Representation Analysis

As you’ve seen, selected genes group into two classes: genes highly expressed in tumors vs. genes in control. Interestingly, genes also detected by DE analysis are only associated to high expression in the control group. One interesting question is, are selected genes (up in control or up in tumor) associated to any type of common biological problem? Try to do a GO analysis on them.

# Using the same method as in Day-3, get the dendrogram from the heatmap
# and cut it to get the 2 classes of genes

# Extract the hierarchical cluster from heatmap to class "hclust"
hc = as.hclust(gene_heatmap$rowDendrogram)

# Cut the tree into 2 groups, up-regulated in tumor and up-regulated in control
clusters = cutree(hc, k = 2)
table(clusters)

## clusters
##  1  2 
## 42 60

# Getting the gene names in cluster 1 & 2 respectively
genes_cluster_1 = names(clusters[clusters == 1])
genes_cluster_2 = names(clusters[clusters == 2])

head(genes_cluster_1)

## [1] "CD4"    "MARCO"  "CETP"   "SLC5A1" "PVALB"  "PAK5"

head(genes_cluster_2)

## [1] "ZIC2"  "ASPM"  "NOX4"  "AURKA" "MYLK2" "THBS4"

# Here we are using methods from the clusterProfiler package

# Group 1, up in tumor
go_cluster_1 = enrichGO(
  genes_cluster_1,
  OrgDb = "org.Hs.eg.db",
  keyType = "SYMBOL",
  pvalueCutoff = 2,
  qvalueCutoff = 2,
  ont = "ALL",
  pAdjustMethod = "BH"
)

##

head(go_cluster_1@result)

##            ONTOLOGY         ID
## GO:0070372       BP GO:0070372
## GO:0070371       BP GO:0070371
## GO:0070374       BP GO:0070374
## GO:0043410       BP GO:0043410
## GO:0050770       BP GO:0050770
## GO:0031346       BP GO:0031346
##                                                    Description GeneRatio
## GO:0070372                 regulation of ERK1 and ERK2 cascade      5/29
## GO:0070371                               ERK1 and ERK2 cascade      5/29
## GO:0070374        positive regulation of ERK1 and ERK2 cascade      4/29
## GO:0043410                 positive regulation of MAPK cascade      5/29
## GO:0050770                          regulation of axonogenesis      3/29
## GO:0031346 positive regulation of cell projection organization      4/29
##              BgRatio       pvalue   p.adjust     qvalue
## GO:0070372 311/18800 0.0001028177 0.05141972 0.04132284
## GO:0070371 335/18800 0.0001456649 0.05141972 0.04132284
## GO:0070374 220/18800 0.0003444417 0.08105862 0.06514179
## GO:0043410 491/18800 0.0008417299 0.14856534 0.11939275
## GO:0050770 154/18800 0.0016846002 0.19752464 0.15873830
## GO:0031346 341/18800 0.0017651200 0.19752464 0.15873830
##                                   geneID Count
## GO:0070372 CD4/MARCO/DUSP6/MAP2K1/ALKAL1     5
## GO:0070371 CD4/MARCO/DUSP6/MAP2K1/ALKAL1     5
## GO:0070374       CD4/MARCO/MAP2K1/ALKAL1     4
## GO:0043410   CD4/MARCO/RET/MAP2K1/ALKAL1     5
## GO:0050770              RET/MAP2K1/L1CAM     3
## GO:0031346       RET/MAP2K1/ALKAL1/L1CAM     4

# Group 2, up in control
go_cluster_2 = enrichGO(
  genes_cluster_2,
  OrgDb = "org.Hs.eg.db",
  keyType = "SYMBOL",
  pvalueCutoff = 2,
  qvalueCutoff = 2,
  ont = "ALL",
  pAdjustMethod = "BH"
)
head(go_cluster_2@result)

##            ONTOLOGY         ID
## GO:0000212       BP GO:0000212
## GO:0051642       BP GO:0051642
## GO:0061842       BP GO:0061842
## GO:0046326       BP GO:0046326
## GO:0010828       BP GO:0010828
## GO:1902895       BP GO:1902895
##                                                       Description GeneRatio
## GO:0000212                           meiotic spindle organization      2/35
## GO:0051642                                centrosome localization      2/35
## GO:0061842             microtubule organizing center localization      2/35
## GO:0046326                  positive regulation of glucose import      2/35
## GO:0010828 positive regulation of glucose transmembrane transport      2/35
## GO:1902895             positive regulation of miRNA transcription      2/35
##             BgRatio       pvalue  p.adjust   qvalue     geneID Count
## GO:0000212 15/18800 0.0003482062 0.2452173 0.206438 ASPM/AURKA     2
## GO:0051642 32/18800 0.0016125110 0.2452173 0.206438 ASPM/AURKA     2
## GO:0061842 33/18800 0.0017145415 0.2452173 0.206438 ASPM/AURKA     2
## GO:0046326 37/18800 0.0021525892 0.2452173 0.206438  TERT/ERFE     2
## GO:0010828 44/18800 0.0030327224 0.2452173 0.206438  TERT/ERFE     2
## GO:1902895 45/18800 0.0031700817 0.2452173 0.206438  TERT/APLN     2

You could try to make a plot on your own with the top terms per ontology.

There are also other methods that allow us to do this type of analysis. Here we show one of these alternatives, using gprofiler2.

library(gprofiler2)

# Selecting just a few columns so that its easier to visualize the table
gprofiler2_cols = c("significant", "p_value", "intersection_size", "term_id", "term_name")


# Group 1, up in tumor
gost_res_1 = gost(
  genes_cluster_1,
  organism = "hsapiens",
  significant = FALSE
)

head(gost_res_1[["result"]][, gprofiler2_cols])

##   significant    p_value intersection_size    term_id              term_name
## 1       FALSE 0.09931581                 1 CORUM:5921  KSR1-BRAF-MEK complex
## 2       FALSE 0.09931581                 1 CORUM:6505     ERBB3-SPG1 complex
## 3       FALSE 0.09931581                 1 CORUM:6506     ERBB2-SPG1 complex
## 4       FALSE 0.09931581                 1  CORUM:903        RET-Rai complex
## 5       FALSE 0.14890763                 1 CORUM:6953 ARTN-GFRA3-RET comples
## 6       FALSE 0.14890763                 1 CORUM:5222   p14-Mp1-MEK1 complex

# plotting the result using a function from the gprogiler2 package
gprof_plt <- gostplot(gost_res_1, interactive = TRUE, capped = FALSE)
gprof_plt

# Group 2, up in control
gost_res_2 = gost(
  genes_cluster_2,
  organism = "hsapiens",
  significant = FALSE
)

head(gost_res_2[["result"]][, gprofiler2_cols])

##   significant    p_value intersection_size    term_id
## 1       FALSE 0.09931581                 1 CORUM:2581
## 2       FALSE 0.09931581                 1 CORUM:6412
## 3       FALSE 0.09931581                 1 CORUM:6740
## 4       FALSE 0.09931581                 1 CORUM:7462
## 5       FALSE 0.14890763                 1 CORUM:6184
## 6       FALSE 0.14890763                 1  CORUM:631
##                                 term_name
## 1           RasGAP-AURKA-survivin complex
## 2   AURKA-HDAC6 cilia-disassembly complex
## 3                    AURKA-INPP5E complex
## 4 GABA-A receptor (GABRA1, GABRB2, GABRD)
## 5          AuroraB-AuroraC-INCENP complex
## 6            Tacc1-chTOG-AuroraA  complex

You can also try the gprofiler2 web interface: https://biit.cs.ut.ee/gprofiler/gost

4.6 Exercise

Another way to reduce dimensionality is through the use of variance filtering. We can use the varFilter function to do this.

# retain only a small subset of the genes (see documentation for ?varFilter)
d_mat = varFilter(limma_res$voomObj$E, var.func = IQR, var.cutoff = 0.95)

# transpose the matrix, so that it has the same shape as the d_mat we used at the beginning
d_mat = t(d_mat)

# show the dimensions of the matrix
print(dim(d_mat))

## [1]  421 1137

This function takes the normalized RNASeq data stored in the limma_res object, and returns a matrix of the same form. What varFilter does is calculate the interquartile range (IQR) for all genes. It then removes 95% of the genes, starting with those with lower IQR. Therefore, only 5% of the genes will be retained - those with the highest dispersion, and therefore possibly the most useful information. This can easily be understood by remembering that a gene that has the exact same expression across all samples, has a variance (and IQR) of zero. We definitely want to remove those.

Now again, run the commands to train and evaluate the elastic net model. Has the performance increased or decreased? Are there more or less variables selected?

Remember to also re-generate your x_* training and test data (it is not needed for the response, because that hasn’t changed) with the new d_mat:

# size before
print(dim(x_train))

## [1]   317 22730

print(dim(x_test))

## [1]   104 22730

x_train = d_mat[train_ids, ]
x_test  = d_mat[-train_ids, ]

# size after
print(dim(x_train))

## [1]  317 1137

print(dim(x_test))

## [1]  104 1137

If you don’t do this, x_train and x_test will still contain all 22534 genes, instead of only the 1127 we have selected.

5 Survival Analysis

One analysis often performed on TCGA data is survival analysis. In short, this boils down to answering the following question: how more likely is a certain group of patients to live longer than another?

The techniques we are going to see are not, however, limited to survival (to death events) but can be applied to any experiment where patients can be divided into groups and there is an event, something happening at a specific time point.

Before we start, if you don’t have all the libraries we used previously, as well as the objects limma_res and tcga_data already loaded, please do so:

# Load packages
library("TCGAbiolinks")
library("limma")
library("edgeR")
library("glmnet")
library("factoextra")
library("FactoMineR")
library("caret")
library("SummarizedExperiment")
library("gplots")
library("survival")
library("survminer")
library("RColorBrewer")
library("genefilter")

# NOTE: make sure you set the working directory in RStudio correctly
# otherwise R won't be able to find your files
tcga_data = readRDS(file = "data/tcga_data.RDS")
limma_res = readRDS(file = "data/limma_res.RDS")

To warm up and explain the method, we will start with an easy question: does gender influence survival in liver cancer patients?

TCGA, as mentioned before, provides a lot of clinical data for each patient. We need to extract the gender variable, and a few more besides:

# extract clinical data
clinical = tcga_data@colData

dim(clinical)

## [1] 421  62

# these are the columns from the clinical data
# that we are interested in using for survival
survival_cols = c(
  "patient",
  "vital_status",
  "days_to_death",
  "days_to_last_follow_up",
  "gender",
  "ajcc_pathologic_stage"
)


# we are only interested in the "Primary solid Tumor" cases for survival
is_tumor = clinical$definition == "Primary solid Tumor"

# subsetting the clinical data with the relevant cases and columns for survival
clin_df = clinical[is_tumor, survival_cols]

Now we have a new dataframe, clin_df, containing only the information that is relevant to survival analysis. In addition to gender, we have added vital_status (whether patient is alive or dead), ajcc_pathologic_stage (from stage 1 to 4) and two important variables: days_to_death, that is the number of days passed from the initial diagnosis to the patient’s death (clearly, this is only relevant for dead patients), and days_to_last_follow_up that is the number of days passed from the initial diagnosis to the last visit.

Before we can proceed, we need to change part of this information in a way that is acceptable to the methods from the survival package we are using:

# create a new boolean variable that has TRUE for dead patients
# and FALSE for live patients
clin_df$deceased = clin_df$vital_status == "Dead"

# create an "overall survival" variable that is equal to days_to_death
# for dead patients, and to days_to_last_follow_up for patients who
# are still alive
clin_df$overall_survival = ifelse(
  clin_df$deceased, # logical statement
  clin_df$days_to_death, # resuilt if TRUE
  clin_df$days_to_last_follow_up # result if FALSE
)

# show first 10 samples
head(clin_df)

## DataFrame with 6 rows and 8 columns
##                                   patient vital_status days_to_death
##                               <character>  <character>     <integer>
## TCGA-RC-A7SF-01A-11R-A352-07 TCGA-RC-A7SF        Alive            NA
## TCGA-EP-A2KC-01A-11R-A213-07 TCGA-EP-A2KC         Dead            19
## TCGA-ES-A2HS-01A-11R-A180-07 TCGA-ES-A2HS         Dead           688
## TCGA-CC-5259-01A-31R-A213-07 TCGA-CC-5259        Alive            NA
## TCGA-ED-A97K-01A-21R-A38B-07 TCGA-ED-A97K        Alive            NA
## TCGA-DD-A3A1-01A-11R-A213-07 TCGA-DD-A3A1         Dead           233
##                              days_to_last_follow_up      gender
##                                           <integer> <character>
## TCGA-RC-A7SF-01A-11R-A352-07                    579        male
## TCGA-EP-A2KC-01A-11R-A213-07                     NA        male
## TCGA-ES-A2HS-01A-11R-A180-07                     NA        male
## TCGA-CC-5259-01A-31R-A213-07                    250      female
## TCGA-ED-A97K-01A-21R-A38B-07                      6        male
## TCGA-DD-A3A1-01A-11R-A213-07                     NA        male
##                              ajcc_pathologic_stage  deceased overall_survival
##                                        <character> <logical>        <integer>
## TCGA-RC-A7SF-01A-11R-A352-07               Stage I     FALSE              579
## TCGA-EP-A2KC-01A-11R-A213-07               Stage I      TRUE               19
## TCGA-ES-A2HS-01A-11R-A180-07               Stage I      TRUE              688
## TCGA-CC-5259-01A-31R-A213-07            Stage IIIC     FALSE              250
## TCGA-ED-A97K-01A-21R-A38B-07            Stage IIIA     FALSE                6
## TCGA-DD-A3A1-01A-11R-A213-07            Stage IIIA      TRUE              233

Let’s now see if male and female patients have had different prognosis (in this dataset).

5.1 Kaplan-Meier plots

As a first step, we need to define a survival formula with the help of the Surv object.

In R, formulas are special constructs of the form y ~ x, and in the context of linear models you can see x as the independent variable and y as the dependent variable.

This works also for multivariate models: age ~ gender + height is a formula that can be used to predict age from gender and height. You can refer to the documentation of formula for further examples and explanations, by typing ?formula in a R shell.

Let’s get back to survival. We have a categorical variable, gender, that needs to be used to separate (or, more appropriately, stratify) the available death events.

The survival package provides us with an object, Surv, to form a dependent variable out of the overall_survival and deceased information:

Surv(clin_df$overall_survival, clin_df$deceased)

##   [1]  579+   19   688   250+    6+  233   671+ 1567+ 1372   928+ 1115+  621+
##  [13] 1085+  300   423+  140   214  1229   453+ 1088   262  2425+  636+  472+
##  [25]  575+    1+  608+  719+  416   594+  180+ 1731+  802   512+ 1145+  425+
##  [37] 2102+ 2415+  366+  390+ 1791   558   458+ 1823+  408+  399+  303+  612 
##  [49]  415+  357+  601+   52    87  1495+  819+ 2028+   10+  673+ 1804+ 2513+
##  [61]  837   396+ 2455+  724     9+   65   129  1241+ 2746+ 1718+  660   260+
##  [73]    0+  171   347  2324+  752    56  2412+ 1452+ 1622   447+  153  2398+
##  [85]  141+  278    20+  347+  291+ 1852  1970+ 3258  1231+ 1636+  638+ 1633+
##  [97]    0+  410  1030+ 1423    23+   65  1210   931   519+    9  2486     0+
## [109] 1424+  602+  345+ 1067+   12+  322+ 2301+  729+ 2442+  480+    9+ 3125 
## [121] 1450+  409+ 1135   649   555+ 1618+  365+  314+  478+  784+  300   330+
## [133] 1531+    6+  570+ 1516+  848+ 1085+ 2456    36   469   728+  951+   94+
## [145] 2116   486+  372+ 1694  1295+  194  1168+  427+ 1900+  706+  768   229+
## [157]  837   655+  643  1008+  247  1711+  373   382+  387+  219+  608+  879+
## [169]  363+  747+  581   854+  211+ 1553+  500+  412  1302+  328+  308     8+
## [181]  272   507+  430+  304   248+   91  1271   556   425   436+  722+   21+
## [193]  354+  848  1098+  596  2015+  432   344   698+  564+  279  1005    30+
## [205]  799+ 1490  1345+  555  1397    11   129   547   115   552+   NA+  693+
## [217] 1091+  744+  898+  363+  474+  170+   46   627  1339+  827   780+ 1876+
## [229]  631+ 3308+  782+  849+  810+  315   468+  395+  296  1149   520+  770 
## [241] 1562+  644+  585+ 1351+  601   574+ 1049+ 3675+   16   299   906+  102 
## [253]   67    12   101   476+  349   419   763+  365   757  2245+ 3437+  538+
## [265] 2728+  444+  701+ 1386   327+ 2184+  406+    0+ 1939+  554+  899    44+
## [277]    0+ 1233+  412+  262   925+ 2532   860+  520+   20+  359   697+  640+
## [289]   15+  395+  406+ 1624  2131  1989+  361+  107   341+    6+  366   917+
## [301] 3478+  672+  381  1779+ 1560   989+  217   588+   22+  942+   56  2317+
## [313]  639  1363+  898+  693   632+  452   848+   31  2752+  587+  394    97 
## [325]   14  1219+ 2202+  103    27   195  1066+  698+  359+   91   415   183+
## [337] 1531+  633   662+  566+  756+ 1855+  765   829+  171   562+  535   615+
## [349]  283   816+  630+  555+ 1242+  910+  711  2017+   79+  680+  223  2542 
## [361]  361+   34   387+  303   438   658+  137+ 2301+  400+ 1685   449+

This modifies our overall survival vector by adding censoring information (the + just after the time), which requires a small digression.

This data is right censored, meaning that for some patients we only have the time of the last follow up but we don’t know if they died at a later date or not.

These patients are kept in the early stages of the analysis (eg, they are part of the survival curve) but they are dropped (or as it is said, censored) when the time of their last follow up arrives.

Now that the survival time has been tagged with the censoring, we can add the categorical independent variable gender, and effectively create a formula

Surv(clin_df$overall_survival, clin_df$deceased) ~ clin_df$gender

We now have a survival formula that can be passed to the survfit function to fit the survival model, and then to another function to produce the Kaplan-Meier plots. Actually, when executing the survival analysis with survfit, we can exclude the clin_df$ if we tell the function to use clin_df as data by using the data = parameter.

# fit a survival model
fit = survfit(Surv(overall_survival, deceased) ~ gender, data = clin_df)

print(fit)

## Call: survfit(formula = Surv(overall_survival, deceased) ~ gender, 
##     data = clin_df)
## 
##    1 observation deleted due to missingness 
##                 n events median 0.95LCL 0.95UCL
## gender=female 121     51   1490    1005    2456
## gender=male   249     79   2486    1423      NA

# we produce a Kaplan Meier plot
ggsurvplot(fit, data = clin_df)

This Kaplan-Meier plot shows two very similar trends until almost the 2000-day mark, where females seem to have a worse survival probability. But is there a significant difference?

The difference between two such “event curves” is best tested via the logrank test, which is, fundamentally, a repeated test of independence. survminer will add the p-value of such test if we tell it to do so:

ggsurvplot(fit, data = clin_df, pval = TRUE)

The p-value is non-significant, so gender alone does not significantly sway prognosis in this dataset.

If you find this strange because the curves deviate at the 2000-day mark, you have to remember that the amount of patients involved matters. At that point, only a few patients remain, and any difference is likely to not be significant.

Can we see the number of patients dying (or being “censored”) as Time increases? Indeed we can, with what is called the “at risk table”.

ggsurvplot(fit, data = clin_df, pval = TRUE, risk.table = TRUE, risk.table.col = "strata")

With the risk.table = TRUE argument, we get the number of patients “at risk”, that is neither dead nor censored at a certain time point.

The argument risk.table.col = "strata" tells survminer to colour the table in the same way as the strata, or groups, are coloured.

You can see that most of the patients die or are censored before the 2000-day mark, and therefore it makes sense that the p-value would not be significant.

Another question could be: how does tumor stage affect survival?

The ajcc_pathologic_stage variable that TCGA provides for this tumor contains both stages and sub-stages, eg stage iiia or stage ivb. We want to join together the sub-stages, to increase the group size and reduce complexity (and thus increase the power of the logrank statistics).

# remove any of the letters "a", "b" or "c", but only if they are at the end
# of the name, eg "stage iiia" would become simply "stage iii"
clin_df$ajcc_pathologic_stage = gsub("[ABC]$", "", clin_df$ajcc_pathologic_stage)

# let's see how many patients we have for each stage
print(table(clin_df$ajcc_pathologic_stage))

## 
##   Stage I  Stage II Stage III  Stage IV 
##       171        86        85         5

# since we have too few Stage 4 patients, we put them together with Stage IIIs
clin_df[which(clin_df$ajcc_pathologic_stage == "Stage III"), "ajcc_pathologic_stage"] = "Stage III+"
clin_df[which(clin_df$ajcc_pathologic_stage == "Stage IV"),  "ajcc_pathologic_stage"] = "Stage III+"

print(table(clin_df$ajcc_pathologic_stage))

## 
##    Stage I   Stage II Stage III+ 
##        171         86         90

We can now fit a new survival model with the tumor stage groups (one to four, plus the “not reported”):

fit = survfit(Surv(overall_survival, deceased) ~ ajcc_pathologic_stage, data=clin_df)

# we can extract the survival p-value and print it
pval = surv_pvalue(fit, data = clin_df)$pval
print(pval)

## [1] 3.081341e-06

# we produce a Kaplan-Meier plot from the fitted model
ggsurvplot(fit, data = clin_df, pval = TRUE, risk.table = TRUE)

We get an overall p-value testing the null hypothesis that all the curves are similar at every time point. In this case, the p-value is small enough that we can reject the null hypothesis.

What we saw here is an easy way of producing Kaplan-Meier plots to investigate survival, as well as evaluating whether the survival curves are significantly different or not. A more interesting application to this is using, for example, gene expression to divide the patients into groups, to see whether up or down regulation of genes affects survival. We’ll see this in the next section.

6 Gene expression and survival

6.1 Overview

Ealier, we looked at the RNASeq data for this tumor. We found some genes that are differentially expressed between the healthy and disease samples, and we also trained an Elastic Net model and investigated those predictors that are important in discriminating healthy and disease tissue.

So, taking one of the selected genes, we know that they are either up-regulated in the tumor tissue tissue and not in the normal tissue, or viceversa. But do these genes also provide an advantage, or disadvantage, regarding prognosis?

We already have the top differentially expressed genes, ordered by significance, in the limma_res$topGenes data.frame, so we just have to take the first one.

# let's extract the table of differential expression we got earlier
genes_df = limma_res$topGenes

# print the first row, to see the gene name, the logFC value and the p-value
genes_df[1, ]

##                    source type score phase            gene_id      gene_type
## ENSG00000104938.18 HAVANA gene    NA    NA ENSG00000104938.18 protein_coding
##                    gene_name level    hgnc_id          havana_gene     logFC
## ENSG00000104938.18    CLEC4M     2 HGNC:13523 OTTHUMG00000182432.4 -9.230079
##                      AveExpr         t       P.Value     adj.P.Val        B
## ENSG00000104938.18 -3.297922 -45.73718 7.192449e-166 1.634844e-161 367.9031

# get the ensembl gene id of the first row
gene_id = genes_df[1, "gene_id"]

# also get the common gene name of the first row
gene_name = genes_df[1, "gene_name"]

We now have selected a gene. Let’s visualize how much differentially expressed it is:

# get the actual expression matrix from the limma result object.
# we transpose it because we want genes in the columns and patients in the rows
expr_df = t(limma_res$voomObj$E)

# let's show the first five genes for the first five patients
expr_df[1:5, 1:5]

##                              ENSG00000000003.15 ENSG00000000419.13
## TCGA-DD-A1EG-11A-11R-A213-07           6.837526           4.871361
## TCGA-RC-A7SF-01A-11R-A352-07           6.029358           3.847252
## TCGA-EP-A2KC-01A-11R-A213-07           6.718381           4.467630
## TCGA-ES-A2HS-01A-11R-A180-07           7.704514           4.778052
## TCGA-CC-5259-01A-31R-A213-07           7.419676           4.137047
##                              ENSG00000000457.14 ENSG00000000460.17
## TCGA-DD-A1EG-11A-11R-A213-07           3.328110          0.7929382
## TCGA-RC-A7SF-01A-11R-A352-07           3.151630          1.1446657
## TCGA-EP-A2KC-01A-11R-A213-07           3.643625          3.6100075
## TCGA-ES-A2HS-01A-11R-A180-07           2.021699         -0.9282608
## TCGA-CC-5259-01A-31R-A213-07           2.900462          0.4534566
##                              ENSG00000000938.13
## TCGA-DD-A1EG-11A-11R-A213-07          4.1979667
## TCGA-RC-A7SF-01A-11R-A352-07          1.7327274
## TCGA-EP-A2KC-01A-11R-A213-07          1.5842699
## TCGA-ES-A2HS-01A-11R-A180-07          0.6855566
## TCGA-CC-5259-01A-31R-A213-07          0.7479039

# visualize the gene expression distribution on the diseased samples (in black)
# versus the healthy samples (in red)
expr_diseased = expr_df[rownames(clin_df), gene_id]
expr_healthy = expr_df[setdiff(rownames(expr_df), rownames(clin_df)), gene_id]

boxplot(
  expr_diseased, expr_healthy,
  names = c("Diseased", "Healthy"),
  main = "Distribution of gene expression"
)

Quite so! To see if its expression also influences prognosis, we take all the expression values in the diseased samples, then take the median of them.

Patients with expression greater than the median we put in the up-regulated groups, and the others in the down-regulated group.

# get the expression values for the selected gene
clin_df$gene_value = expr_df[rownames(clin_df), gene_id]

# find the median value of the gene and print it
median_value = median(clin_df$gene_value)
median_value

## [1] -5.043144

# divide patients in two groups, up and down regulated.
# if the patient expression is greater or equal to them median we put it
# among the "up-regulated", otherwise among the "down-regulated"
clin_df$gene = ifelse(clin_df$gene_value >= median_value, "UP", "DOWN")

# we can fit a survival model, like we did in the previous section
fit = survfit(Surv(overall_survival, deceased) ~ gene, data = clin_df)

# we can extract the survival p-value and print it
pval = surv_pvalue(fit, data = clin_df)$pval
pval

## [1] 0.8437825

# and finally, we produce a Kaplan-Meier plot
ggsurvplot(
  fit,
  data = clin_df,
  pval = TRUE,
  risk.table = TRUE,
  title = paste(gene_name)
)

You can also save the plot to file instead of simply visualizing it:

This gene does not appear to make a difference for prognosis. But what about the other differentially expressed genes?

6.2 Exercise

Loops, plots and multiple testing

We are not limited to the top gene, though. We could, for example, look at the ten most differentially expressed genes, one by one, and save the survival plot to see if some of them provide a significant effect on survival.

To achieve this, we can place the previous code in a for loop, and save each result to a different file. Try it.

for (i in 1:10) {
  # get the ensembl gene id of the i-th row
  gene_id = genes_df[i, "gene_id"]

  # also get the common gene name of the i-th row
  gene_name = genes_df[i, "gene_name"]

  # get the expression values for the selected gene
  clin_df$gene_value = expr_df[rownames(clin_df), gene_id]

  # find the median value of the gene
  median_value = median(clin_df$gene_value)

  # divide patients in two groups, up and down regulated.
  # if the patient expression is greater or equal to them median we put it
  # among the "up-regulated", otherwise among the "down-regulated"
  clin_df$gene = ifelse(clin_df$gene_value >= median_value, "UP", "DOWN")

  # we can now run survival like we did in the previous section
  fit = survfit(Surv(overall_survival, deceased) ~ gene, data = clin_df)
  survp = ggsurvplot(fit, data = clin_df, pval = TRUE, risk.table = TRUE, title = paste(gene_name))

  # we can save the survival plot to file, try it by uncommenting it
  # ggsave(file=paste0(gene_name, ".pdf"), print(survp))

  # let's also save the p-value for later.. you'll see why :)
  genes_df[i, "surv_pv"] = surv_pvalue(fit, data = clin_df)$pval
}

Now we have generated ten plots, and likely found some interesting candidates with significant p-value. Since we performed a statistical test multiple times (a task called multiple testing), we need to correct all the p-values.

The function p.adjust helps us do this. Read how to use it with ?p.adjust, and make sure to select the Benjamini-Hochberg method (also called fdr).

# take the 10 pvalues you produced earlier, and put them in a vector called vec_pvalues.
vec_pvalues <- genes_df[1:10, "surv_pv"]
# Then apply the Benjamini-Hochberg procedure, also called the FDR procedure
p.adjust(vec_pvalues, method = "fdr")

##  [1] 0.914484980 0.008500502 0.527638992 0.914484980 0.220585499 0.169574398
##  [7] 0.030626886 0.169626164 0.914484980 0.022865840

After you do this, you will notice that some of the p-values that were significant are now insignificant.

This is important to remember when applying statistical procedures multiple times:

if you keep trying, there’s always a chance of getting a significant p-value when the null hypothesis is actually true.

You must, therefore, control for this chance via a multiple test correction procedure.

# Let's now run the previous steps again, but now we add the adjusted p-values back on our table
adjusted_p_values = p.adjust(genes_df[1:10, "surv_pv"], method = "fdr")

# add these values as a new column in the genes_df dataframe
genes_df[1:10, "surv_padj"] = adjusted_p_values

# let's show them in a nice tabular way
genes_df[1:10, c("gene_name", "surv_pv", "surv_padj")]

##                     gene_name      surv_pv   surv_padj
## ENSG00000104938.18     CLEC4M 0.8437825222 0.914484980
## ENSG00000165682.14     CLEC1B 0.0008500502 0.008500502
## ENSG00000263761.3        GDF2 0.3693472946 0.527638992
## ENSG00000163217.2       BMP10 0.7956483148 0.914484980
## ENSG00000136011.15      STAB2 0.1323512995 0.220585499
## ENSG00000164619.10      BMPER 0.0678297594 0.169574398
## ENSG00000145708.11      CRHBP 0.0091880658 0.030626886
## ENSG00000160339.16       FCN2 0.0848130819 0.169626164
## ENSG00000019169.10      MARCO 0.9144849797 0.914484980
## ENSG00000251049.2  AC107396.1 0.0045731680 0.022865840

What is multiple testing?

Whenever we perform a statistical test, there is a chance to produce a “discovery”, or significant result, that does not in fact correspond to a ground truth (that is: despite a significant p-value, the null hypothesis is true).

That is because “significance” is an arbitrary probabilistic cutoff. In many fields, a 5% chance of rejecting the null hypothesis when, in fact, it is true is widely accepted. Other fields or applications require much stricter cutoffs.

Can you guess what would happen if you performed a multitude of such tests, each at a significance level of 5%? In the worst case (that is: all null hypotheses are true), you might get an incorrectly-significant test every 20 or so.

Those wrongly-rejected null hypotheses are called false positives, and if you don’t control for it they will impede your future research efforts. This is very relevant in genomics: when we check the expression of tens of thousands of genes for an effect on survival, for example, we don’t want to end up with a bunch of candidates that won’t hold up in replication experiments.

Correction methods like the Benjamini-Hochberg (also called the “False Discovery Rate” method) help us control the amount of false positives that we are willing to accept, by further rejecting the weakest “discoveries” we’ve made in a multiple-testing setup - like today’s Exercise.

6.3 Exercise

Perform the following tasks.

Retrieve the following numbers from the data: how many patients are dead or alive, how many patients were in which tumor stage and how many patients had a prior malignancy. You can do this using the table function.
Check if groups of patients are associated to survival, tumour grade or any other clinical variable. You can use the table function for some of these analysis.
If you do a heatmap with all the differentially expressed genes, do you get nice clusters in the dendrogram? Can you separate these clusters into 2 main cluster (up-regulated in cancer vs up-regulated in control)?
Regarding the genes associated to each group (cancer or control), are they associated to some particular biological function? Use differential expression analysis and GO enrichment analysis to solve this task.

Good luck!

Analysis of Cancer Genome Atlas in R

Tiago Maié

Fabio Ticconi

Martin Grasshoff

Martin Manolov

17 November 2023