Metadata
- Name
- hg38 reference and annotation files
- Repository
- ZENODO
- Identifier
- doi:10.5281/zenodo.5146236
- Description
- This repo contains reference and annotation files for hg38. We are following the [TOPMed pipeline](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md). Reach out to Arushi Varshney at arushiv AT umich DOT edu if you have any questions.
Files:
1. bwa index = bwa.tar.gz
2. star index = star.tar.gz
3. ENCODE blacklist = blacklist.tar
4. gencode v30 annotations = gencode.tar.gz
5. containers with STAR (RNA) and BWA (ATAC) = containers.tar.gz
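A minimal sketch for unpacking the archives listed above, assuming they have already been downloaded into the current directory under exactly these names. The function name `unpack_refs` and the default output directory `hg38_refs` are hypothetical, and the internal layout of each archive is not assumed.

```shell
# Unpack every archive from the list above into one destination directory.
# unpack_refs and the default directory name hg38_refs are illustrative names only.
unpack_refs() {
    dest="${1:-hg38_refs}"
    mkdir -p "$dest"
    for archive in bwa.tar.gz star.tar.gz blacklist.tar gencode.tar.gz containers.tar.gz; do
        if [ -f "$archive" ]; then
            # tar detects gzip compression on its own when extracting from a file
            tar -xf "$archive" -C "$dest"
        else
            echo "missing: $archive" >&2
        fi
    done
}

# Usage:
#   unpack_refs            # extracts into ./hg38_refs
#   unpack_refs /some/dir  # extracts into a directory of your choice
```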
Notes on these files:
### hg38 fasta:
I downloaded the TOPMed fasta tar [Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz](https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz) and use it as-is. The TOPMed GitHub describes that they obtained the Broad Institute's GRCh38 reference, removed ALT, HLA and Decoy contigs, and added ERCC spike-in reference annotations. Refer to their [README](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) for more details. They don't mention the pseudoautosomal regions (PARs), but we checked the reference files and both chrY PARs are hard masked, as [ENCODE](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/) also recommends.
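One way to repeat the PAR check is to count the non-N bases inside each region; a count of 0 means the region is fully hard masked. The sketch below is not necessarily how the original check was done. The helper name `count_unmasked` is hypothetical, and the chrY coordinates in the usage comments are the commonly cited GRCh38 PAR boundaries, which are an assumption not stated in this record.

```shell
# Print the number of non-N bases of one contig within a 1-based inclusive range.
# Arguments: FASTA file, contig name, start, end.
count_unmasked() {
    awk -v contig="$2" -v start="$3" -v end="$4" '
        /^>/ { name = substr($1, 2); pos = 0; next }  # contig name is the first header token
        name == contig {
            len = length($0)
            # this line covers positions pos+1 .. pos+len; scan it only if it overlaps the range
            if (pos + len >= start && pos < end) {
                for (i = 1; i <= len; i++) {
                    p = pos + i
                    if (p >= start && p <= end && toupper(substr($0, i, 1)) != "N") n++
                }
            }
            pos += len
        }
        END { print n + 0 }
    ' "$1"
}

# Usage (coordinates are assumed GRCh38 chrY PAR boundaries):
#   count_unmasked Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta chrY 10001 2781479      # PAR1
#   count_unmasked Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta chrY 56887903 57217415  # PAR2
```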
### Gencode v30 gene annotations: gencode.tar.gz
I downloaded the file [gencode.v30.annotation.gtf.gz](https://www.gencodegenes.org/human/release_30.html) from the GENCODE website, and the file [ERCC92.genes.patched.gtf](https://personal.broadinstitute.org/francois/resources/). I then appended the ERCC patched gtf to the gencode annotation gtf:
```
gunzip gencode.v30.annotation.gtf.gz
cat gencode.v30.annotation.gtf ERCC92.genes.patched.gtf > gencode.v30.annotation.ERCC92.gtf
```
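A quick sanity check on the concatenation can catch a truncated download or a missing final newline in the first file, which would fuse two records into one line. The following is a sketch with a hypothetical helper name, `check_merged_gtf`; it only compares line counts and does not validate GTF syntax.

```shell
# Succeed only if the merged file has exactly as many lines as both inputs combined
# and the first input ends with a newline.
# Arguments: first input, second input, merged output.
check_merged_gtf() {
    first="$1"; second="$2"; merged="$3"
    # Command substitution strips a trailing newline, so non-empty output
    # means the last byte of the file is not a newline.
    if [ -n "$(tail -c 1 "$first")" ]; then
        echo "no trailing newline: $first" >&2
        return 1
    fi
    n_first=$(wc -l < "$first")
    n_second=$(wc -l < "$second")
    n_merged=$(wc -l < "$merged")
    [ "$((n_first + n_second))" -eq "$((n_merged))" ]
}

# Usage:
#   check_merged_gtf gencode.v30.annotation.gtf ERCC92.genes.patched.gtf \
#       gencode.v30.annotation.ERCC92.gtf && echo "merged gtf looks complete"
```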
### STAR index: star.tar.gz; container with star in containers.tar.gz
A STAR index is shared on the TOPMed GitHub, but it was generated with STAR version 2.6.1d. Since I use version 2.7.3a, I followed their steps to regenerate the STAR index, using the gencode gtf described above:
```
STAR --runMode genomeGenerate --genomeDir STAR_genome_GRCh38_noALT_noHLA_noDecoy_ERCC_v30_test --genomeFastaFiles Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta --sjdbGTFfile gencode.v30.annotation.ERCC92.gtf --sjdbOverhang 100 --runThreadN 10
```
### BWA index: bwa.tar.gz
I generated the BWA index using the fasta above:
```
ln -s Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta hg38.fa
bwa index hg38.fa
```
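`bwa index` writes five files next to the input, with the suffixes `.amb`, `.ann`, `.bwt`, `.pac` and `.sa`. A small sketch to confirm they are all present and non-empty after indexing or after unpacking the archive; the helper name `bwa_index_complete` is hypothetical, and the prefix `hg38.fa` in the usage comment assumes the symlink name used above.

```shell
# Succeed only if all five BWA index files exist and are non-empty for the given prefix.
bwa_index_complete() {
    prefix="$1"
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$prefix.$ext" ]; then
            echo "missing or empty: $prefix.$ext" >&2
            return 1
        fi
    done
}

# Usage:
#   bwa_index_complete hg38.fa && echo "BWA index is complete"
```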
### ENCODE Blacklist: blacklist.tar
I used the blacklist [here](https://theparkerlab.med.umich.edu/data/arushiv/hg38_references_annots/blacklist/) that I obtained from this [Kundaje website](https://sites.google.com/site/anshulkundaje/projects/blacklists).
- Data or Study Types
- multiple
- Source Organization
- Unknown
- Access Conditions
- available
- Year
- 2021
- Access Hyperlink
- https://doi.org/10.5281/zenodo.5146236
Distributions
- Encoding Format: HTML ; URL: https://doi.org/10.5281/zenodo.5146236