
Metadata

Name
hg38 reference and annotation files
Repository
ZENODO
Identifier
doi:10.5281/zenodo.5146236
Description
This repo contains reference and annotation files for hg38. We are following the [TOPMed pipeline](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md). Reach out to Arushi Varshney at arushiv AT umich DOT edu if you have any questions.

Files:

1. bwa index = bwa.tar.gz

2. star index = star.tar.gz

3. ENCODE blacklist = blacklist.tar

4. gencode v30 annotations = gencode.tar.gz

5. containers with STAR (RNA) and BWA (ATAC) = containers.tar.gz

Notes on these files:
### hg38 fasta:
I downloaded the TOPMed fasta tar [Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz](https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz) and use it as-is. The TOPMed GitHub describes that they obtained the Broad Institute's GRCh38 reference, removed ALT, HLA and Decoy contigs, and added ERCC spike-in reference annotations. Refer to their [README](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) for more details. They don't mention PARs, but we checked the reference files and both chrY PARs are hard masked, as [ENCODE](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/) also recommends.
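
The hard-masking check mentioned above can be reproduced with a short script. The following is a minimal sketch, not the exact procedure we used: `count_unmasked` is a hypothetical helper name, and the chrY PAR coordinates in the usage comments are the commonly cited GRCh38 intervals, which you should confirm against the GRC before relying on them. The helper counts bases other than `N`/`n` in a 1-based inclusive interval, so a fully hard-masked region should report 0.

```shell
# Count bases other than N/n on one contig within a 1-based inclusive interval.
# Usage: count_unmasked <fasta> <contig> <first> <last>
# (hypothetical helper; plain POSIX awk, no samtools required)
count_unmasked() {
  awk -v contig="$2" -v first="$3" -v last="$4" '
    /^>/ {
      # Header line: the contig name is the first word after ">"
      name = substr($1, 2)
      inside = (name == contig)
      pos = 0
      next
    }
    inside {
      n = length($0)
      line_start = pos + 1
      line_end = pos + n
      pos = line_end
      # Skip sequence lines that do not overlap the interval
      if (line_end < first || line_start > last) next
      lo = (first > line_start ? first : line_start) - line_start + 1
      hi = (last < line_end ? last : line_end) - line_start + 1
      seg = substr($0, lo, hi - lo + 1)
      # gsub returns the number of characters it replaced
      unmasked += gsub(/[^Nn]/, "", seg)
    }
    END { print unmasked + 0 }
  ' "$1"
}

# Example calls (coordinates are assumptions, see the note above):
#   count_unmasked Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta chrY 10001 2781479
#   count_unmasked Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta chrY 56887903 57217415
```

A result of 0 for both intervals is consistent with the PARs being hard masked; any other number is the count of unmasked bases in that interval.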
### Gencode v30 gene annotations: gencode.tar.gz
I downloaded the file [gencode.v30.annotation.gtf.gz](https://www.gencodegenes.org/human/release_30.html) from the GENCODE website, and the file [ERCC92.genes.patched.gtf](https://personal.broadinstitute.org/francois/resources/) from the Broad resources page. I then appended the ERCC patched gtf to the gencode annotation gtf:
```
gunzip gencode.v30.annotation.gtf.gz
cat gencode.v30.annotation.gtf ERCC92.genes.patched.gtf > gencode.v30.annotation.ERCC92.gtf
```
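
A quick sanity check after the `cat` step is to confirm that no records were lost or merged, for example if the first file lacked a trailing newline. This is an illustrative sketch only: `gtf_records` and `check_gtf_concat` are hypothetical helper names and were not part of the original workflow.

```shell
# Count GTF records, ignoring "#" comment/header lines.
gtf_records() {
  grep -vc '^#' "$1"
}

# Succeed only if the combined file has exactly as many records
# as the two input files together.
# Usage: check_gtf_concat <first.gtf> <second.gtf> <combined.gtf>
check_gtf_concat() {
  a=$(gtf_records "$1")
  b=$(gtf_records "$2")
  combined=$(gtf_records "$3")
  [ "$combined" -eq $((a + b)) ]
}

# Example call:
#   check_gtf_concat gencode.v30.annotation.gtf ERCC92.genes.patched.gtf \
#     gencode.v30.annotation.ERCC92.gtf && echo "record counts match"
```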
### STAR index: star.tar.gz; container with star in containers.tar.gz
A STAR index is shared on the TOPMed GitHub, but it was generated with STAR version 2.6.1d. Since I've been using version 2.7.3a, I followed their steps to regenerate the STAR index, using the gencode gtf described above.

```
STAR --runMode genomeGenerate --genomeDir STAR_genome_GRCh38_noALT_noHLA_noDecoy_ERCC_v30_test --genomeFastaFiles Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta --sjdbGTFfile gencode.v30.annotation.ERCC92.gtf --sjdbOverhang 100 --runThreadN 10
```
### BWA index: bwa.tar.gz
I generated the BWA index using the fasta above:
```
ln -s Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta hg38.fa
bwa index hg38.fa
```
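
`bwa index` writes five files next to the fasta, with the suffixes `.amb`, `.ann`, `.bwt`, `.pac` and `.sa`. After indexing, or after unpacking bwa.tar.gz, a check along these lines can confirm that the index is complete. It is a sketch only: `bwa_index_complete` is a hypothetical helper name, and the `hg38.fa` prefix in the usage comment assumes the symlink name used above.

```shell
# Succeed only if all five BWA index files exist and are non-empty.
# Usage: bwa_index_complete <index prefix>
bwa_index_complete() {
  for ext in amb ann bwt pac sa; do
    if [ ! -s "$1.$ext" ]; then
      echo "missing or empty: $1.$ext" >&2
      return 1
    fi
  done
}

# Example call:
#   bwa_index_complete hg38.fa && echo "BWA index looks complete"
```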

### ENCODE Blacklist: blacklist.tar
I used the blacklist [here](https://theparkerlab.med.umich.edu/data/arushiv/hg38_references_annots/blacklist/) that I obtained from this [Kundaje website](https://sites.google.com/site/anshulkundaje/projects/blacklists).
Data or Study Types
multiple
Source Organization
Unknown
Access Conditions
available
Year
2021
Access Hyperlink
https://doi.org/10.5281/zenodo.5146236

Distributions

  • Encoding Format: HTML ; URL: https://doi.org/10.5281/zenodo.5146236
This project was funded in part by grant U24AI117966 from the NIH National Institute of Allergy and Infectious Diseases as part of the Big Data to Knowledge program. We thank all members of the bioCADDIE community for their valuable input on the overall project.