Umap and Bismap quantifying genome and methylome mappability

Karimzadeh M, Ernst C, Kundaje A, Hoffman MM. 2018. Umap and Bismap: quantifying genome and methylome mappability. doi: https://doi.org/10.1093/nar/gky677 Nucleic Acids Research, Volume 46, Issue 20, 16 November 2018, Page e120. (BibTeX)

The free Umap software package efficiently identifies uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome (methylome).

Mappability of unconverted genome

Umap uses three steps to identify the mappability of a genome for a given read length k. First, it generates all possible k-mers of the genome. Second, it maps these unique k-mers to the genome with Bowtie. Third, Umap marks the start position of each k-mer that aligns to only one region in the genome. Umap repeats these steps for a range of different k-mers and stores the data of each chromosome in a binary vector X with the same length as the chromosome's sequence. For read length k, X_i= 1 means that the sequence starting at X_i and ending at X_i+k is uniquely mappable on the forward strand. Since we align to both strands of the genome, the reverse complement of this same sequence starting at X_i+k in the reverse strand is also uniquely mappable. X_i= 0 means that the sequence starting at X_i and ending at X_i+k can be mapped to at least two different regions in the genome.

Mappability of the bisulfite-converted genome

To identify the single-read mappability of a bisulfite-converted genome, we create two altered genome sequences. In the first sequence, we convert all cytosines to thymine (C → T). In the other sequence we convert all guanines to adenine (G → A). Our approach follows those of Bismark and BWA-meth. We convert the genome sequence this way because bisulfite treatment converts un-methylated cytosine to uracil which is read as thymine. Similarly the guanine that base-pairs with the un-methylated cytosine in the - strand converts to adenine. These two conversions, however, never occur at the same time on the same read. We identify the uniquely mappable regions of these two genomes separately, and then combine the data to represent the single-read mappability of the forward and reverse strands in the bisulfite-converted genome.

Bismap requires special handling of reverse complementation of C → T or G → A converted genomes. Conversion of C → T on the sequence AATTCCGG produces AATTTTGG. In the Bowtie index, the reverse complement would be CCAAAATT. However for the purpose of identifying the mappability of the bisulfite-converted genome, we expect the reverse complement to be TTGGAATT. The reason is that both forward and reverse strands undergo bisulfite treatment simultaneously. There is no DNA replication after bisulfite treatment. To handle this issue, Bismap creates its own reverse complemented chromosomes and suppresses Bowtie's usual reverse complement mapping.

Umap and Bismap each take approximately 200 CPU hours to run on the human genome for a given read length. This can be parallelized in a computing cluster over 400 cores to take only 30 min.

Measures of mappability

Umap efficiently identifies the single-read mappability of any genome for a range of sequencing read lengths. The single-read mappability of a genomic region is a fraction of that region which overlaps with at least one uniquely mappable k-mer. The Bismap extension of Umap produces the single-read mappability of a bisulfite-converted genome. Both Umap and Bismap produce an integer vector for each chromosome that efficiently defines the mappability for any region and can be converted to a browser extensible data (BED) file. In addition to single-read mappability, we can measure the mappability of a genomic region by another approach. To quantify the single-read mappability of a given genomic region, we measure the fraction of potential uniquely mappable reads in that region. A region, however, can have 100% single-read mappability, but in practice require a high coverage sequencing to properly identify that region. For example, a 1 kbp region with 100% single-read mappability can be mappable due to a minimum of 10 unique 100-mers that none of them overlap or a maximum of 1100 unique 100-mers that highly overlap. Therefore, we define the multi-read mappability, the probability that a randomly selected read of length k in a given region is uniquely mappable. For the genomic region G_i:j starting at i and ending at j, there are j - i + k + 1 k-mers that overlap with G_i:j. The multi-read mappability of G_i:j is the fraction of those k-mers that are uniquely mappable.

File formats

After March 2020: The single-read mappability files are in correct BED filemat with 0-based indexing. In addition, these BED files now contain non-overlapping intervals. Umap version 1.2.0 will now generate 0-based BED files with non-overlapping intervals.

Before March 2020: The single-read mappability files use 1-based indexing with inclusive intervals. Overlapping entries in these BED files occur due to non-unique k-mers. We recommend that you ignore non-unique entries in single-read mappability BED files and use multi-read mappability bedGraph or bigWig files for investigating the occurence of non-unique k-mers in an interval.

Update on March 2020

Thanks to Roger Kramer, we noticed that the BED files provided before March 2020 follow 1-based inclusive interval indexing, and therefore are not proper BED files. The trackhub and links below are updated and have fixed this issue. Also, the BED files of single-read mappability do not have overlapping intervals anymore as occurred with the previous version. The bedGraph and bigWig files, however, have always followed the proper format. The software repository on Bitbucket has also been updated to generate proper BED files with non-overlapping intervals.

Update on 0-values for mappability - July 2024

Currently 0-values in the single-read and multi-read mappability datasets are omitted entirely from the resulting BED and bigWig files. Future data releases will include 0-values instead of omitting the data to mark areas of no unique mappability for a given k-mer length.

Track hub, file access, and software

UCSC Genome Browser

View the Umap and Bismap track hub in the UCSC genome browser for GRCh37/hg19, GRCh38/hg38, GRCm37/mm9, or GRCm38/mm10.

Using the track hub

The track hub is composed of two different supertracks; Umap and Bismap. The Umap supertrack contains the tracks for the unconverted genome and the Bismap supertrack contains the tracks for the bisulfite-converted genome.

Each supertrack contains the single-read mappability and multi-read mappability tracks for four different read lengths: 24 bp, 36 bp, 50 bp, and 100 bp.

By default, the Umap and Bismap single-read mappability and multi-read mappability tracks for 24 bp are shown, while other tracks are hidden. In order to change the tracks you want to view in the genome browser, click on the supertrack name and change the default settings.

The single-read mappability tracks are in BED6 format, showing regions of the genome that are uniquely mappable by at least one k-mer. In Umap, unlike Bismap, the mappability of the reverse strand is the same as the forward strand. Thus in Bismap, the mappability of both strands is shown. In Bismap, only regions that are uniquely mappable in both strands must be considered uniquely mappable. For this reason, we calculated the multi-read mappability of bisulfite-converted genome only for the regions that are uniquely mappable in both strands.

View the Umap and Bismap track hub in UCSC genome browser.

The current track hub data is based on Umap V1.1.0. If you need to access data produced using Umap V1.0.0 (before May 2^nd 2017), they are deposited in Zenodo.

For the data in the track hub, we did not use any of the haplotypes, unlocalized contigs, or unplaced contigs. Upon frequent requests, we generated mappability of the hg38 genome without haplotypes but including unlocalized and unplaced contigs using the following kmers: 24, 36, 50, 101, 150, 200, 250, and 500 (NOTE: For 500 k-mer lengths, these have been overflowed in the uint8 files to 244 lengths). You can download the Umap uint8 files of this assembly from here.

Direct links

Download the BED files as a tarball for further analysis:

Download the uint8 files below:

Download individual k-mer files of Umap or Bismap:

Method	Organism	Genome	Measure	k24	k36	k50	k100
Umap	Human	hg38	Single-read	k24	k36	k50	k100
Umap	Human	hg38	Multi-read	k24	k36	k50	k100
Umap	Human	hg19	Single-read	k24	k36	k50	k100
Umap	Human	hg19	Multi-read	k24	k36	k50	k100
Umap	Mouse	mm10	Single-read	k24	k36	k50	k100
Umap	Mouse	mm10	Multi-read	k24	k36	k50	k100
Umap	Mouse	mm9	Single-read	k24	k36	k50	k100
Umap	Mouse	mm9	Multi-read	k24	k36	k50	k100
Bismap	Human	hg38	Single-read	k24	k36	k50	k100
Bismap	Human	hg38	Multi-read	k24	k36	k50	k100
Bismap	Human	hg19	Single-read	k24	k36	k50	k100
Bismap	Human	hg19	Multi-read	k24	k36	k50	k100
Bismap	Mouse	mm10	Single-read	k24	k36	k50	k100
Bismap	Mouse	mm10	Multi-read	k24	k36	k50	k100
Bismap	Mouse	mm9	Single-read	k24	k36	k50	k100
Bismap	Mouse	mm9	Multi-read	k24	k36	k50	k100

Software and documentation

Read the documentation for Umap and Bismap software, which begins with a quick start.

Support

Please ask questions about Umap on our mailing list. If you want to report a bug or request a feature, use Umap issue tracker. We are interested in all comments on the package, and the ease of use of installation and documentation.

Source code

GitHub repository

Credits

Umap was originally develped by Anshul Kundaje in MATLAB. The original Umap repository is available here. Here we describe a reimplementation of Umap in Python, where we added new features including Bismap.