Introduction

In this section, we will learn how to search for and download DNA methylation (epigenetic) and gene expression (transcription) data from the newly created NCI Genomic Data Commons (GDC) portal and prepare them as a SummarizedExperiment object.

The figure below highlights the part of the workflow covered in this section.

Part of the workflow covered in this section

Downloading data

Loading required libraries

library(TCGAbiolinks)
library(SummarizedExperiment)
library(DT)
library(dplyr)

Gene expression

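# Search GDC for FPKM-UQ gene expression files of two TCGA-LUSC samples
query.exp <- GDCquery(project = "TCGA-LUSC",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - FPKM-UQ",
                      barcode = c("TCGA-34-5231-01", "TCGA-77-7138-01"))
# Download the files found by the query
GDCdownload(query.exp)
# Read the downloaded files into a SummarizedExperiment and save it to disk
exp <- GDCprepare(query = query.exp,
                  save = TRUE,
                  save.filename = "Exp_LUSC.rda",
                  summarizedExperiment = TRUE)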
exp
## class: RangedSummarizedExperiment 
## dim: 57035 2 
## metadata(0):
## assays(1): HTSeq - FPKM-UQ
## rownames(57035): ENSG00000000003 ENSG00000000005 ...
##   ENSG00000281912 ENSG00000281920
## rowData names(3): ensembl_gene_id external_gene_name
##   original_ensembl_gene_id
## colnames(2): TCGA-34-5231-01A-21R-1820-07
##   TCGA-77-7138-01A-41R-2045-07
## colData names(69): patient barcode ...
##   subtype_Homozygous.Deletions subtype_Expression.Subtype
colData(exp) %>% as.data.frame() %>% datatable(options = list(scrollX = TRUE), rownames = TRUE)
assay(exp)[1:5, ] %>% datatable(options = list(scrollX = TRUE), rownames = TRUE)
rowRanges(exp)
## GRanges object with 57035 ranges and 3 metadata columns:
##                   seqnames                 ranges strand | ensembl_gene_id
##                      <Rle>              <IRanges>  <Rle> |     <character>
##   ENSG00000000003     chrX [100627109, 100639991]      - | ENSG00000000003
##   ENSG00000000005     chrX [100584802, 100599885]      + | ENSG00000000005
##   ENSG00000000419    chr20 [ 50934867,  50958555]      - | ENSG00000000419
##   ENSG00000000457     chr1 [169849631, 169894267]      - | ENSG00000000457
##   ENSG00000000460     chr1 [169662007, 169854080]      + | ENSG00000000460
##               ...      ...                    ...    ... .             ...
##   ENSG00000281904     chr2   [90365737, 90367699]      + | ENSG00000281904
##   ENSG00000281909    chr15   [22480439, 22484840]      - | ENSG00000281909
##   ENSG00000281910    chr16   [58559796, 58559931]      - | ENSG00000281910
##   ENSG00000281912     chr1   [45303910, 45305619]      + | ENSG00000281912
##   ENSG00000281920     chr2   [65623272, 65628424]      + | ENSG00000281920
##                   external_gene_name original_ensembl_gene_id
##                          <character>              <character>
##   ENSG00000000003             TSPAN6       ENSG00000000003.13
##   ENSG00000000005               TNMD        ENSG00000000005.5
##   ENSG00000000419               DPM1       ENSG00000000419.11
##   ENSG00000000457              SCYL3       ENSG00000000457.12
##   ENSG00000000460           C1orf112       ENSG00000000460.15
##               ...                ...                      ...
##   ENSG00000281904         AC233263.6        ENSG00000281904.1
##   ENSG00000281909         AC100757.4        ENSG00000281909.1
##   ENSG00000281910           SNORA50A        ENSG00000281910.1
##   ENSG00000281912          LINC01144        ENSG00000281912.1
##   ENSG00000281920         AC007389.5        ENSG00000281920.1
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
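
The query above used sample-level barcodes (for example TCGA-34-5231-01), while the column names of the prepared object are full aliquot barcodes. The following base R sketch shows how such a barcode splits into fields and how the sample-level prefix is recovered. The helper parse_barcode and its column names are illustrative labels chosen here, not part of TCGAbiolinks; the field layout follows the published TCGA barcode convention.

```r
# Hypothetical helper (not a TCGAbiolinks function): split full TCGA aliquot
# barcodes into their dash-separated fields.
parse_barcode <- function(barcodes) {
  parts <- strsplit(barcodes, "-", fixed = TRUE)
  # A full aliquot barcode has exactly seven dash-separated fields
  stopifnot(all(lengths(parts) == 7))
  fields <- do.call(rbind, parts)
  data.frame(
    barcode     = barcodes,
    project     = fields[, 1],
    tss         = fields[, 2],                # tissue source site
    participant = fields[, 3],
    sample_type = substr(fields[, 4], 1, 2),  # "01" = primary solid tumor
    vial        = substr(fields[, 4], 3, 3),
    portion     = substr(fields[, 5], 1, 2),
    analyte     = substr(fields[, 5], 3, 3),  # "R" = RNA, "D" = DNA
    plate       = fields[, 6],
    center      = fields[, 7],
    sample      = substr(barcodes, 1, 15),    # prefix used in the query
    stringsAsFactors = FALSE
  )
}

# Column names printed for the expression object above
exp_barcodes <- c("TCGA-34-5231-01A-21R-1820-07",
                  "TCGA-77-7138-01A-41R-2045-07")
parsed <- parse_barcode(exp_barcodes)
parsed[, c("sample", "sample_type", "analyte")]
```

The sample column reproduces the two barcodes passed to GDCquery, which is why a short barcode in the query can select a file whose full aliquot barcode is longer.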

DNA methylation

This subsection describes how to download DNA methylation data from the NCI Genomic Data Commons (GDC) portal using the Bioconductor package TCGAbiolinks (Colaprico et al. 2016). In this example, we will download DNA methylation data (Infinium HumanMethylation450 platform) for two TCGA-LUSC (TCGA Lung Squamous Cell Carcinoma) samples. The GDCquery function searches the GDC database for the information required to download the data. The GDCdownload function then uses this information to request the files from GDC, which delivers them compressed into a single 76 MB tar.gz file. After the download is complete, GDCdownload uncompresses the tar.gz file and moves its files to a folder. The default is GDCdata/(project)/(source)/(data.category)/(data.type); in our example, this is GDCdata/TCGA-LUSC/harmonized/DNA_Methylation/Methylation_Beta_Value/
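
The folder layout described above can be sketched in base R. The helper gdc_data_dir is hypothetical and only mirrors the pattern stated in the text; the data type "Methylation Beta Value" and the replacement of spaces by underscores are assumptions read off the example path, not documented behaviour.

```r
# Hypothetical helper: build the folder that GDCdownload is described as
# using, <directory>/<project>/<source>/<data.category>/<data.type>.
gdc_data_dir <- function(project, data.category, data.type,
                         source = "harmonized", directory = "GDCdata") {
  # Assumption: spaces in the category and type become underscores,
  # as they do in the example path
  clean <- function(x) gsub(" ", "_", x, fixed = TRUE)
  file.path(directory, project, source, clean(data.category), clean(data.type))
}

met_dir <- gdc_data_dir(project = "TCGA-LUSC",
                        data.category = "DNA Methylation",
                        data.type = "Methylation Beta Value")
met_dir

# Once GDCdownload has run, this lists the saved files
# (it returns an empty vector if the folder does not exist yet)
list.files(met_dir, recursive = TRUE)
```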

Data saved after GDCdownload is executed


Finally, GDCprepare transforms the downloaded data into a SummarizedExperiment object (Huber et al. 2015) or a data frame. If the summarizedExperiment argument is set to TRUE, TCGAbiolinks adds two kinds of sample annotation to the object: clinical information and molecular subtype information, the latter as defined in The Cancer Genome Atlas (TCGA) Research Network reports (the full list of papers can be found in the TCGAquery_subtype section of the TCGAbiolinks vignette).

query.met <- GDCquery(project = "TCGA-LUSC", 
                      data.category = "DNA Methylation",
                      platform = "Illumina Human Methylation 450", 
                      barcode = c("TCGA-34-5231-01A-21D-1818-05","TCGA-77-7138-01A-41D-2043-05"))
GDCdownload(query.met)
met <- GDCprepare(query = query.met,
                  save = TRUE, 
                  save.filename = "DNAmethylation_LUSC.rda",
                  summarizedExperiment = TRUE)
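
Because save = TRUE, the call above also asks for the prepared object to be written to DNAmethylation_LUSC.rda. The following sketch shows how such a file could be reloaded in a later session, assuming it is a standard R data file readable with load(). The name of the object stored inside the file is not stated in this section, so the helper load_single (hypothetical, not part of TCGAbiolinks) returns whatever single object the file contains. A small stand-in list is saved to a temporary file so that the sketch runs without any download.

```r
# Hypothetical helper: load an .rda file into a private environment and
# return the single object it holds, whatever its name.
load_single <- function(path) {
  env <- new.env()
  loaded <- load(path, envir = env)  # character vector of object names
  stopifnot(length(loaded) == 1)
  get(loaded, envir = env)
}

# Stand-in for a prepared object (illustrative values only)
toy <- list(beta = matrix(c(0.1, 0.9, 0.5, 0.4), nrow = 2))
tmp <- tempfile(fileext = ".rda")
save(toy, file = tmp)

restored <- load_single(tmp)
identical(restored, toy)
```

Under the same assumption, calling load_single("DNAmethylation_LUSC.rda") would restore the prepared methylation object without repeating the download.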

The object created by GDCprepare is a SummarizedExperiment object:

met
## class: RangedSummarizedExperiment 
## dim: 485577 2 
## metadata(0):
## assays(1): ''
## rownames(485577): cg00000029 cg00000108 ... rs966367 rs9839873
## rowData names(7): Composite.Element.REF Gene_Symbol ...
##   CGI_Coordinate Feature_Type
## colnames(2): TCGA-34-5231-01A-21D-1818-05
##   TCGA-77-7138-01A-41D-2043-05
## colData names(69): patient barcode ...
##   subtype_Homozygous.Deletions subtype_Expression.Subtype
colData(met) %>% as.data.frame() %>% datatable(options = list(scrollX = TRUE), rownames = TRUE)
assay(met)[1:5, ] %>% datatable(options = list(scrollX = TRUE), rownames = TRUE)
rowRanges(met)
## GRanges object with 485577 ranges and 7 metadata columns:
##              seqnames                 ranges strand |
##                 <Rle>              <IRanges>  <Rle> |
##   cg00000029    chr16 [ 53434200,  53434201]      * |
##   cg00000108     chr3 [ 37417715,  37417716]      * |
##   cg00000109     chr3 [172198247, 172198248]      * |
##   cg00000165     chr1 [ 90729117,  90729118]      * |
##   cg00000236     chr8 [ 42405776,  42405777]      * |
##          ...      ...                    ...    ... .
##    rs9363764     chr6   [67522149, 67522149]      * |
##     rs939290     chr3   [14617359, 14617359]      * |
##     rs951295    chr15   [45707625, 45707625]      * |
##     rs966367     chr2   [12008094, 12008094]      * |
##    rs9839873     chr3   [86613005, 86613005]      * |
##              Composite.Element.REF
##                        <character>
##   cg00000029            cg00000029
##   cg00000108            cg00000108
##   cg00000109            cg00000109
##   cg00000165            cg00000165
##   cg00000236            cg00000236
##          ...                   ...
##    rs9363764             rs9363764
##     rs939290              rs939290
##     rs951295              rs951295
##     rs966367              rs966367
##    rs9839873             rs9839873
##                                                                  Gene_Symbol
##                                                                  <character>
##   cg00000029                                                  RBL2;RBL2;RBL2
##   cg00000108 C3orf35;C3orf35;C3orf35;C3orf35;C3orf35;C3orf35;C3orf35;C3orf35
##   cg00000109                       FNDC3B;FNDC3B;FNDC3B;FNDC3B;FNDC3B;FNDC3B
##   cg00000165                                                               .
##   cg00000236                                                           VDAC3
##          ...                                                             ...
##    rs9363764                                                               .
##     rs939290                                                               .
##     rs951295                                     RP11-718O11.1;RP11-718O11.1
##     rs966367                     AC096559.1;AC096559.1;AC096559.1;AC096559.1
##    rs9839873                                                               .
##                                                                                              Gene_Type
##                                                                                            <character>
##   cg00000029                                              protein_coding;protein_coding;protein_coding
##   cg00000108                           lincRNA;lincRNA;lincRNA;lincRNA;lincRNA;lincRNA;lincRNA;lincRNA
##   cg00000109 protein_coding;protein_coding;protein_coding;protein_coding;protein_coding;protein_coding
##   cg00000165                                                                                         .
##   cg00000236                                                                            protein_coding
##          ...                                                                                       ...
##    rs9363764                                                                                         .
##     rs939290                                                                                         .
##     rs951295                                                                           lincRNA;lincRNA
##     rs966367                                                           lincRNA;lincRNA;lincRNA;lincRNA
##    rs9839873                                                                                         .
##                                                                                                                                                Transcript_ID
##                                                                                                                                                  <character>
##   cg00000029                                                                                           ENST00000262133.9;ENST00000544405.5;ENST00000567964.5
##   cg00000108 ENST00000328376.8;ENST00000332506.6;ENST00000425564.2;ENST00000425932.4;ENST00000426078.4;ENST00000452017.3;ENST00000466204.4;ENST00000481400.4
##   cg00000109                                     ENST00000336824.7;ENST00000415807.5;ENST00000416957.4;ENST00000443501.1;ENST00000469491.4;ENST00000478016.1
##   cg00000165                                                                                                                                               .
##   cg00000236                                                                                                                               ENST00000022615.7
##          ...                                                                                                                                             ...
##    rs9363764                                                                                                                                               .
##     rs939290                                                                                                                                               .
##     rs951295                                                                                                             ENST00000559600.1;ENST00000560705.1
##     rs966367                                                                         ENST00000412294.4;ENST00000438292.4;ENST00000450916.1;ENST00000451644.4
##    rs9839873                                                                                                                                               .
##                                           Position_to_TSS
##                                               <character>
##   cg00000029                               -221;-1420;222
##   cg00000108 18552;18552;6505;31445;18143;447;18552;18552
##   cg00000109      157692;158618;151333;71272;158587;71273
##   cg00000165                                            .
##   cg00000236                                        13872
##          ...                                          ...
##    rs9363764                                            .
##     rs939290                                            .
##     rs951295                                    2429;2546
##     rs966367                           965;142208;919;977
##    rs9839873                                            .
##                            CGI_Coordinate Feature_Type
##                               <character>  <character>
##   cg00000029  CGI:chr16:53434489-53435297      N_Shore
##   cg00000108   CGI:chr3:37451927-37453047            .
##   cg00000109 CGI:chr3:172039703-172040934            .
##   cg00000165   CGI:chr1:90724932-90727247      S_Shore
##   cg00000236   CGI:chr8:42410918-42411241            .
##          ...                          ...          ...
##    rs9363764   CGI:chr6:68634840-68635154            .
##     rs939290   CGI:chr3:14602211-14603323            .
##     rs951295  CGI:chr15:45704255-45705206      S_Shelf
##     rs966367   CGI:chr2:11784857-11785127            .
##    rs9839873   CGI:chr3:86990460-86991366            .
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths
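
In the output above, Gene_Symbol, Gene_Type, Transcript_ID and Position_to_TSS hold semicolon-separated lists that appear to carry one entry per annotated transcript, with "." where a probe has no annotation. The following base R sketch collapses such a column to the unique values per probe. The input vector is copied from the printed rows rather than read from met, and the variable names are illustrative.

```r
# Values copied from the Gene_Symbol column printed above
gene_symbol <- c(cg00000029 = "RBL2;RBL2;RBL2",
                 cg00000165 = ".",
                 cg00000236 = "VDAC3",
                 rs951295   = "RP11-718O11.1;RP11-718O11.1")

# Split each entry on ";", drop the "." placeholder, keep unique symbols
unique_symbols <- lapply(strsplit(gene_symbol, ";", fixed = TRUE),
                         function(x) unique(x[x != "."]))

# Collapse back to one string per probe; unannotated probes become ""
collapsed <- vapply(unique_symbols, paste, character(1), collapse = ";")
collapsed
```

On the real object, the same steps could be applied to the Gene_Symbol column of rowRanges(met), which is shown above as a character column.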

Session Info

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Bioc2017.TCGAbiolinks.ELMER_0.0.0.9000
##  [2] BiocInstaller_1.26.0                  
##  [3] ELMER_2.0.1                           
##  [4] TCGAbiolinks_2.5.6                    
##  [5] dplyr_0.7.2                           
##  [6] SummarizedExperiment_1.6.3            
##  [7] DelayedArray_0.2.7                    
##  [8] matrixStats_0.52.2                    
##  [9] Biobase_2.36.2                        
## [10] GenomicRanges_1.28.4                  
## [11] GenomeInfoDb_1.12.2                   
## [12] IRanges_2.10.2                        
## [13] S4Vectors_0.14.3                      
## [14] BiocGenerics_0.22.0                   
## [15] bindrcpp_0.2                          
## [16] MultiAssayExperiment_1.2.1            
## [17] DT_0.2                                
## [18] ELMER.data_2.0.1                      
## 
## loaded via a namespace (and not attached):
##   [1] rtracklayer_1.36.4            ggthemes_3.4.0               
##   [3] prabclus_2.2-6                R.methodsS3_1.7.1            
##   [5] tidyr_0.6.3                   ggplot2_2.2.1                
##   [7] acepack_1.4.1                 bit64_0.9-7                  
##   [9] knitr_1.16                    aroma.light_3.6.0            
##  [11] R.utils_2.5.0                 data.table_1.10.4            
##  [13] rpart_4.1-11                  hwriter_1.3.2                
##  [15] RCurl_1.95-4.8                AnnotationFilter_1.0.0       
##  [17] doParallel_1.0.10             GenomicFeatures_1.28.4       
##  [19] RSQLite_2.0                   commonmark_1.2               
##  [21] bit_1.1-12                    BiocStyle_2.4.0              
##  [23] xml2_1.1.1                    httpuv_1.3.5                 
##  [25] assertthat_0.2.0              viridis_0.4.0                
##  [27] hms_0.3                       evaluate_0.10.1              
##  [29] DEoptimR_1.0-8                dendextend_1.5.2             
##  [31] km.ci_0.5-2                   DBI_0.7                      
##  [33] geneplotter_1.54.0            htmlwidgets_0.9              
##  [35] reshape_0.8.6                 EDASeq_2.10.0                
##  [37] matlab_1.0.2                  purrr_0.2.2.2                
##  [39] selectr_0.3-1                 ggpubr_0.1.4                 
##  [41] backports_1.1.0               trimcluster_0.1-2            
##  [43] annotate_1.54.0               biomaRt_2.32.1               
##  [45] ensembldb_2.0.3               withr_1.0.2                  
##  [47] Gviz_1.20.0                   BSgenome_1.44.0              
##  [49] robustbase_0.92-7             checkmate_1.8.3              
##  [51] GenomicAlignments_1.12.1      mclust_5.3                   
##  [53] mnormt_1.5-5                  cluster_2.0.6                
##  [55] lazyeval_0.2.0                genefilter_1.58.1            
##  [57] edgeR_3.18.1                  pkgconfig_2.0.1              
##  [59] labeling_0.3                  nlme_3.1-131                 
##  [61] ProtGenerics_1.8.0            nnet_7.3-12                  
##  [63] devtools_1.13.2               bindr_0.1                    
##  [65] rlang_0.1.1                   diptest_0.75-7               
##  [67] downloader_0.4                AnnotationHub_2.8.2          
##  [69] dichromat_2.0-0               rprojroot_1.2                
##  [71] Matrix_1.2-10                 KMsurv_0.1-5                 
##  [73] zoo_1.8-0                     base64enc_0.1-3              
##  [75] whisker_0.3-2                 GlobalOptions_0.0.12         
##  [77] viridisLite_0.2.0             rjson_0.2.15                 
##  [79] bitops_1.0-6                  shinydashboard_0.6.1         
##  [81] R.oo_1.21.0                   ConsensusClusterPlus_1.40.0  
##  [83] Biostrings_2.44.1             blob_1.1.0                   
##  [85] shape_1.4.2                   stringr_1.2.0                
##  [87] ShortRead_1.34.0              readr_1.1.1                  
##  [89] scales_0.4.1                  memoise_1.1.0                
##  [91] magrittr_1.5                  plyr_1.8.4                   
##  [93] zlibbioc_1.22.0               compiler_3.4.1               
##  [95] RColorBrewer_1.1-2            Rsamtools_1.28.0             
##  [97] XVector_0.16.0                htmlTable_1.9                
##  [99] Formula_1.2-2                 MASS_7.3-47                  
## [101] stringi_1.1.5                 yaml_2.1.14                  
## [103] locfit_1.5-9.1                latticeExtra_0.6-28          
## [105] ggrepel_0.6.5                 survMisc_0.5.4               
## [107] grid_3.4.1                    VariantAnnotation_1.22.3     
## [109] tools_3.4.1                   circlize_0.4.1               
## [111] rstudioapi_0.6                foreach_1.4.3                
## [113] foreign_0.8-69                git2r_0.18.0                 
## [115] gridExtra_2.2.1               digest_0.6.12                
## [117] shiny_1.0.3                   cmprsk_2.2-7                 
## [119] fpc_2.1-10                    Rcpp_0.12.12                 
## [121] broom_0.4.2                   httr_1.2.1                   
## [123] survminer_0.4.0               AnnotationDbi_1.38.1         
## [125] biovizBase_1.24.0             ComplexHeatmap_1.14.0        
## [127] psych_1.7.5                   kernlab_0.9-25               
## [129] colorspace_1.3-2              rvest_0.3.2                  
## [131] XML_3.98-1.9                  splines_3.4.1                
## [133] flexmix_2.3-14                plotly_4.7.0                 
## [135] xtable_1.8-2                  jsonlite_1.5                 
## [137] UpSetR_1.3.3                  modeltools_0.2-21            
## [139] R6_2.2.2                      Hmisc_4.0-3                  
## [141] htmltools_0.3.6               mime_0.5                     
## [143] glue_1.1.1                    BiocParallel_1.10.1          
## [145] DESeq_1.28.0                  class_7.3-14                 
## [147] interactiveDisplayBase_1.14.0 codetools_0.2-15             
## [149] mvtnorm_1.0-6                 lattice_0.20-35              
## [151] tibble_1.3.3                  curl_2.7                     
## [153] survival_2.41-3               limma_3.32.3                 
## [155] roxygen2_6.0.1                rmarkdown_1.6                
## [157] munsell_0.4.3                 GetoptLong_0.1.6             
## [159] GenomeInfoDbData_0.99.0       iterators_1.0.8              
## [161] reshape2_1.4.2                gtable_0.2.0

Bibliography

Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2). Nature Publishing Group: 115–21.