DataMed Data Discovery Index

Metadata

Name: hg38 reference and annotation files
Repository: ZENODO
Identifier: doi:10.5281/zenodo.5146236
Description: This repo contains reference and annotation files for hg38. We are following the [TOPMed pipeline](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md). Reach out to Arushi Varshney at arushiv AT umich DOT edu if you have any questions.

Files:

1. bwa index = bwa.tar.gz

2. star index = star.tar.gz

3. ENCODE blacklist = blacklist.tar

4. gencode v30 annotations = gencode.tar.gz

5. containers with STAR (RNA) and BWA (ATAC) = containers.tar.gz

Notes on these files:
### hg38 fasta:
I downloaded the TOPMed fasta tar [Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz](https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz) and use it as-is. The TOPMed GitHub page states that they obtained the Broad Institute's GRCh38 reference, removed the ALT, HLA, and decoy contigs, and added ERCC spike-in reference annotations. Refer to their [README](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) for more details. They don't mention the pseudoautosomal regions (PARs), but we checked the reference files and both chrY PARs are hard-masked, as [ENCODE](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/) also recommends.
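The hard-masking check mentioned above can be approximated with a short script. This is only a sketch, not the check that was originally run: the helper name `count_non_n` is hypothetical, and the PAR coordinates in the usage note are the commonly cited GRCh38 values, which should be verified against your own source before relying on them. The function counts bases other than `N` in a 1-based, inclusive interval of one contig.

```shell
# Hypothetical helper (not part of the original record):
# count_non_n FASTA CONTIG START END
# Prints the number of non-N bases in CONTIG:START-END (1-based, inclusive).
count_non_n() {
    awk -v contig="$2" -v start="$3" -v end="$4" '
        /^>/ {
            # Header line: the contig name is the first word after ">"
            name = substr($1, 2)
            in_contig = (name == contig)
            pos = 0
            next
        }
        in_contig {
            line_start = pos + 1
            line_end = pos + length($0)
            pos = line_end
            # Skip sequence lines that do not overlap the interval
            if (line_end < start || line_start > end) next
            from = (start > line_start ? start : line_start) - line_start + 1
            to = (end < line_end ? end : line_end) - line_start + 1
            seg = substr($0, from, to - from + 1)
            # gsub returns how many non-N characters it removed
            n += gsub(/[^Nn]/, "", seg)
        }
        END { print n + 0 }
    ' "$1"
}
```

Assuming PAR1 is chrY:10001-2781479 and PAR2 is chrY:56887903-57217415, running `count_non_n Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta chrY 10001 2781479` (and likewise for PAR2) should print 0 if the regions are fully hard-masked.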
### Gencode v30 gene annotations: gencode.tar.gz
I downloaded the file [gencode.v30.annotation.gtf.gz](https://www.gencodegenes.org/human/release_30.html) from the GENCODE website and the file [ERCC92.genes.patched.gtf](https://personal.broadinstitute.org/francois/resources/) from the Broad Institute resources page. I then appended the patched ERCC GTF to the GENCODE annotation GTF:
```
gunzip gencode.v30.annotation.gtf.gz
cat gencode.v30.annotation.gtf  ERCC92.genes.patched.gtf > gencode.v30.annotation.ERCC92.gtf
```
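A quick sanity check on the concatenated file is to confirm that the ERCC records made it in. The sketch below assumes that the ERCC sequence names in the patched GTF start with `ERCC-` (for example `ERCC-00002`). That prefix is an assumption about the file contents, and the helper name `count_ercc_records` is hypothetical.

```shell
# Hypothetical helper (not part of the original record):
# count_ercc_records GTF
# Prints the number of non-comment GTF records whose sequence name
# (column 1) starts with "ERCC-". The prefix is an assumption.
count_ercc_records() {
    awk -F'\t' '!/^#/ && $1 ~ /^ERCC-/ { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_ercc_records gencode.v30.annotation.ERCC92.gtf` should print the same number as `count_ercc_records ERCC92.genes.patched.gtf`, while the unmodified GENCODE GTF should give 0.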
### STAR index: star.tar.gz; container with star in containers.tar.gz
A STAR index is shared on the TOPMed GitHub, but it was generated with STAR version STAR_2.6.1d. Since I've been using version 2.7.3a, I followed their steps to regenerate the STAR index, using the GENCODE GTF described above.

```
STAR --runMode genomeGenerate \
  --genomeDir STAR_genome_GRCh38_noALT_noHLA_noDecoy_ERCC_v30_test \
  --genomeFastaFiles Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
  --sjdbGTFfile gencode.v30.annotation.ERCC92.gtf \
  --sjdbOverhang 100 \
  --runThreadN 10
```
### BWA index: bwa.tar.gz
I generated the BWA index using the fasta described above:
```
ln -s Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta hg38.fa
bwa index hg38.fa
```
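After `bwa index` finishes, or after unpacking `bwa.tar.gz`, it can be worth confirming that the index is complete. `bwa index` writes five files next to the fasta, with the extensions `.amb`, `.ann`, `.bwt`, `.pac`, and `.sa`. The helper below is a minimal sketch with a hypothetical name, `check_bwa_index`. It only tests that each file exists and is non-empty, not that its contents are valid.

```shell
# Hypothetical helper (not part of the original record):
# check_bwa_index PREFIX
# Returns 0 if all five BWA index files for PREFIX exist and are non-empty,
# otherwise reports the missing ones on stderr and returns 1.
check_bwa_index() {
    prefix="$1"
    missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "${prefix}.${ext}" ]; then
            echo "missing or empty: ${prefix}.${ext}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

With the symlink name used above, the call would be `check_bwa_index hg38.fa`.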

### ENCODE Blacklist: blacklist.tar
I used the blacklist available [here](https://theparkerlab.med.umich.edu/data/arushiv/hg38_references_annots/blacklist/), which I obtained from this [Kundaje website](https://sites.google.com/site/anshulkundaje/projects/blacklists).
Data or Study Types: multiple
Source Organization: Unknown
Access Conditions: available
Year: 2021
Access Hyperlink: https://doi.org/10.5281/zenodo.5146236

Distributions

Encoding Format: HTML ; URL: https://doi.org/10.5281/zenodo.5146236