Reorder chromosomes

TLDR

# Important! --bin-size must match the resolution of matrix.cool
user@dev:/tmp hictk load <(hictk dump --join matrix.cool) \
                         output.cool \
                         --chrom-sizes=<(hictk dump --table=chroms matrix.cool | sort -k2,2nr) \
                         --format=bg2 \
                         --bin-size=1kbp \
                         --transpose-lower-triangular-pixels

Why is this needed?

Sometimes we want to compare files that are based on the same reference genome assembly but list chromosomes in a different order (e.g. in one file chromosomes are sorted by size, while in the other they are sorted by name). This is a problem especially when comparing such files visually. This tutorial shows how to convert a .cool file with chromosomes sorted by name to a .cool file with chromosomes sorted by size. The same procedure can be applied to .hic files.

Walkthrough

For this tutorial, we will use file 4DNFIOTPSS3L.hic as an example, which can be downloaded from here.

First, we extract the list of chromosomes from the input file:

user@dev:/tmp hictk dump 4DNFIOTPSS3L.hic --table=chroms | tee chrom.sizes

2L  23513712
2R  25286936
3L  28110227
3R  32079331
4   1348131
X   23542271
Y   3667352

Second, we re-order chromosomes:

user@dev:/tmp sort -k2,2nr chrom.sizes | tee chrom.sizes.sorted

3R  32079331
3L  28110227
2R  25286936
X   23542271
2L  23513712
Y   3667352
4   1348131

Next, we dump pixels in bedGraph2 format (see below for how to make this step more efficient):

user@dev:/tmp hictk dump 4DNFIOTPSS3L.hic --join --resolution 1kbp > pixels.bg2

user@dev:/tmp head pixels.bg2

2L  5000    6000    2L      5000    6000    41
2L  5000    6000    2L      6000    7000    126
2L  5000    6000    2L      7000    8000    60
2L  5000    6000    2L      8000    9000    77
2L  5000    6000    2L      9000    10000   97
2L  5000    6000    2L      10000   11000   3
2L  5000    6000    2L      11000   12000   1
2L  5000    6000    2L      12000   13000   66
2L  5000    6000    2L      13000   14000   116
2L  5000    6000    2L      14000   15000   64
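
Because --bin-size has to agree with the resolution of the dumped pixels, it can help to read the bin width back from the bg2 file before loading it. The following is a minimal sketch, assuming a tab-separated bg2 layout (chrom1, start1, end1, chrom2, start2, end2, count). The file name pixels.example.bg2 is hypothetical and holds three rows copied from the output above. The widest interval among the first rows is used, since the last bin of a chromosome can be shorter than the bin size.

```shell
# Hypothetical sample file with three rows copied from the dump above (tab-separated).
printf '2L\t5000\t6000\t2L\t5000\t6000\t41\n'  > pixels.example.bg2
printf '2L\t5000\t6000\t2L\t6000\t7000\t126\n' >> pixels.example.bg2
printf '2L\t5000\t6000\t2L\t7000\t8000\t60\n'  >> pixels.example.bg2

# Widest first-bin interval (end1 - start1) among the first 1000 rows.
bin_size=$(head -n 1000 pixels.example.bg2 |
           awk -F '\t' '{ w = $3 - $2; if (w > max) max = w } END { print max }')
echo "bin size: ${bin_size} bp"
```

For the rows shown above this reports a bin size of 1000 bp, which corresponds to --bin-size=1kbp.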

Then, we load the pixels into a new .hic file:

user@dev:/tmp hictk load pixels.bg2 \
                         output.hic \
                         --chrom-sizes=chrom.sizes.sorted \
                         --transpose-lower-triangular-pixels \
                         --format=bg2 \
                         --bin-size=1kbp

[2024-09-27 19:00:40.344] [info]: Running hictk v1.0.0-fbdcb591
[2024-09-27 19:00:40.353] [info]: begin loading pixels into a .hic file...
[2024-09-27 19:00:42.504] [info]: preprocessing chunk #1 at 4847310 pixels/s...
[2024-09-27 19:00:45.244] [info]: preprocessing chunk #2 at 3649635 pixels/s...
[2024-09-27 19:00:48.180] [info]: preprocessing chunk #3 at 3407155 pixels/s...
[2024-09-27 19:00:50.616] [info]: preprocessing chunk #4 at 4105090 pixels/s...
[2024-09-27 19:00:53.251] [info]: preprocessing chunk #5 at 3203434 pixels/s...
[2024-09-27 19:00:54.358] [info]: writing header at offset 0
[2024-09-27 19:00:54.358] [info]: begin writing interaction blocks to file "output.hic"...
[2024-09-27 19:00:54.358] [info]: [1000 bp] writing pixels for 3R:3R matrix at offset 171...
[2024-09-27 19:01:01.039] [info]: [1000 bp] written 9571521 pixels for 3R:3R matrix
...
[2024-09-27 19:01:26.831] [info]: [1000 bp] initializing expected value vector
[2024-09-27 19:01:32.649] [info]: [1000 bp] computing expected vector density
[2024-09-27 19:01:32.649] [info]: writing 1 expected value vectors at offset 93720080...
[2024-09-27 19:01:32.649] [info]: writing 0 normalized expected value vectors at offset 93848475...
[2024-09-27 19:01:32.682] [info]: ingested 114355295 interactions (48437845 nnz) in 52.337885908s!
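
The --transpose-lower-triangular-pixels flag matters here because reordering chromosomes moves some pixels below the diagonal: a pixel dumped as 2L:3R falls in the lower triangle once 3R precedes 2L. The sketch below illustrates the idea with awk. It is not how hictk implements the flag; the file names and the inter-chromosomal pixel (count 7) are made up for illustration.

```shell
# New chromosome order (subset of chrom.sizes.sorted from the walkthrough).
printf '3R\t32079331\n2L\t23513712\n' > order.example.sizes

# Two pixels: one intra-chromosomal row from the dump above,
# and one made-up 2L:3R pixel that ends up below the diagonal.
printf '2L\t5000\t6000\t2L\t6000\t7000\t126\n' > pixels.example.in.bg2
printf '2L\t5000\t6000\t3R\t1000\t2000\t7\n'  >> pixels.example.in.bg2

# First file: remember the rank of each chromosome in the new order.
# Second file: swap the two bins whenever bin1 comes after bin2.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { rank[$1] = NR; next }
     {
       lower = rank[$1] > rank[$4] || (rank[$1] == rank[$4] && $2 > $5)
       if (lower) print $4, $5, $6, $1, $2, $3, $7
       else print
     }' order.example.sizes pixels.example.in.bg2 > pixels.example.out.bg2

cat pixels.example.out.bg2
```

The first pixel is left untouched, while the second one is written with 3R as the first bin and 2L as the second.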

Lastly, we check that chromosomes are properly sorted:

user@dev:/tmp hictk dump output.hic --table=chroms

3R  32079331
3L  28110227
2R  25286936
X   23542271
2L  23513712
Y   3667352
4   1348131
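
For longer chromosome lists, checking the order by eye is error-prone. A minimal sketch of an automated check, assuming a sort implementation that supports -c (check only); the file name chroms.example.txt is hypothetical and holds the table printed above:

```shell
# Hypothetical file holding the chromosome table printed above (tab-separated).
printf '3R\t32079331\n3L\t28110227\n2R\t25286936\nX\t23542271\n' > chroms.example.txt
printf '2L\t23513712\nY\t3667352\n4\t1348131\n' >> chroms.example.txt

# sort -c does not reorder anything: it exits with 0 when the input is
# already sorted according to the given key (column 2, numeric, descending).
if sort -c -k2,2nr chroms.example.txt; then
  status="sorted by size"
else
  status="NOT sorted by size"
fi
echo "$status"
```

The same check can read from a pipe, e.g. hictk dump output.hic --table=chroms | sort -c -k2,2nr.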

Tips and tricks

There is one potential problem with the above solution: the size of the pixels.bg2 file. Luckily, we can avoid generating this file altogether by using process substitution:

user@dev:/tmp hictk load <(hictk dump 4DNFIOTPSS3L.hic --join --resolution 1kbp) \
                         output.hic \
                         --chrom-sizes=chrom.sizes.sorted \
                         --transpose-lower-triangular-pixels \
                         --format=bg2 \
                         --bin-size=1kbp

Note that hictk still needs to generate some temporary files to load interactions into a new .cool or .hic file. When processing large files, it is a good idea to specify a custom folder for these temporary files through the --tmpdir flag:

user@dev:/tmp hictk load <(hictk dump 4DNFIOTPSS3L.hic --join --resolution 1kbp) \
                         output.hic \
                         --chrom-sizes=chrom.sizes.sorted \
                         --transpose-lower-triangular-pixels \
                         --format=bg2 \
                         --bin-size=1kbp \
                         --tmpdir=/var/tmp/
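
One way to keep these temporary files contained is to give each run its own scratch directory and remove it when the shell exits. A minimal sketch, assuming mktemp is available and that the chosen location has enough free space; the hictk-load prefix is arbitrary, and the hictk command is left as a comment:

```shell
# Create a private scratch directory under $TMPDIR (or /var/tmp when unset).
tmpdir="$(mktemp -d "${TMPDIR:-/var/tmp}/hictk-load.XXXXXX")"

# Remove the scratch directory when the shell exits.
trap 'rm -rf "$tmpdir"' EXIT

echo "temporary files will be created under: $tmpdir"

# The directory can then be passed to hictk load, for example:
# hictk load pixels.bg2 output.hic \
#            --chrom-sizes=chrom.sizes.sorted \
#            --transpose-lower-triangular-pixels \
#            --format=bg2 \
#            --bin-size=1kbp \
#            --tmpdir="$tmpdir"
```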

Another option worth considering when working with .hic files is --threads, which can significantly reduce the time required to load interactions into .hic files.