Reordering chromosomes
TLDR
# Important! --bin-size must match the resolution of matrix.cool
user@dev:/tmp hictk load --format=bg2 \
--bin-size=1000 \
<(hictk dump --table=chroms matrix.cool |
sort -k2,2nr) \
output.cool \
< <(hictk dump --join matrix.cool)
Why is this needed?
Sometimes we want to compare files using the same reference genome assembly, but with different chromosome orders (e.g. in one file chromosomes are sorted by size while in the other they are sorted by name). This can be a problem especially when trying to visually compare such files. This tutorial shows how to convert a .cool file with chromosomes sorted by name to a .cool file with chromosomes sorted by size. The same procedure can be applied to .hic files.
Walkthrough
For this tutorial, we will use file 4DNFIZ1ZVXC8.mcool as an example, which can be downloaded from here.
First, we extract the list of chromosomes from the input file:
user@dev:/tmp hictk dump 4DNFIZ1ZVXC8.mcool --table=chroms | tee chrom.sizes
chr2L 23513712
chr2R 25286936
chr3L 28110227
chr3R 32079331
chr4 1348131
chrX 23542271
chrY 3667352
Second, we re-order chromosomes:
user@dev:/tmp sort -k2,2nr chrom.sizes | tee chrom.sizes.sorted
chr3R 32079331
chr3L 28110227
chr2R 25286936
chrX 23542271
chr2L 23513712
chrY 3667352
chr4 1348131
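The sort keys used above can be tried in isolation. The sketch below uses made-up chromosome names and sizes (they are not taken from the example file) and adds a second key, -k1,1, as one possible way to get a reproducible order when two chromosomes have the same length. The helper name sort_chroms_by_size is hypothetical.

```shell
# Sort a chrom.sizes table read from stdin by decreasing chromosome size.
sort_chroms_by_size() {
    # -k2,2nr: use column 2 only as the key, compare numerically, largest first
    # -k1,1:   break ties by chromosome name so the order is reproducible
    sort -k2,2nr -k1,1
}

# Made-up table with two chromosomes of equal length (illustration only).
printf 'chrB\t100\nchrA\t100\nchrC\t2500\nchrD\t30\n' | sort_chroms_by_size
```

With this input, chrC comes first, followed by chrA and chrB (tied on size, ordered by name), and then chrD.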
Next, we dump pixels in bedGraph2 format (see below for how to make this step more efficient):
user@dev:/tmp hictk dump 4DNFIZ1ZVXC8.mcool --join --resolution=1000 > pixels.bg2
user@dev:/tmp head pixels.bg2
chr2L 5000 6000 chr2L 5000 6000 127
chr2L 5000 6000 chr2L 6000 7000 129
chr2L 5000 6000 chr2L 7000 8000 60
chr2L 5000 6000 chr2L 8000 9000 77
chr2L 5000 6000 chr2L 9000 10000 97
chr2L 5000 6000 chr2L 10000 11000 3
chr2L 5000 6000 chr2L 11000 12000 1
chr2L 5000 6000 chr2L 12000 13000 66
chr2L 5000 6000 chr2L 13000 14000 116
chr2L 5000 6000 chr2L 14000 15000 64
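Because the value passed to --bin-size in the next step has to match the resolution of the dumped pixels, it can be useful to check the bin size directly from the bedGraph2 records. The following is a minimal sketch, assuming whitespace-separated records as shown above. The helper name bg2_bin_size is hypothetical, and the function looks at the first 1000 records only and reports the largest bin it sees, since the last bin of a chromosome can be shorter than the resolution.

```shell
# Estimate the bin size of a bg2 stream (read from stdin) from its first records.
# Columns 2 and 3 hold the start and end of the first bin of each pixel.
bg2_bin_size() {
    awk 'NR > 1000 { exit }
         { size = $3 - $2; if (size > max + 0) max = size }
         END { print max + 0 }'
}

# Two records copied from the pixels.bg2 excerpt above.
printf 'chr2L\t5000\t6000\tchr2L\t5000\t6000\t127\nchr2L\t5000\t6000\tchr2L\t6000\t7000\t129\n' |
    bg2_bin_size
```

On a real file the same check would be run as bg2_bin_size < pixels.bg2.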
Then, we load the pixels into a new .cool file:
user@dev:/tmp hictk load --format=bg2 \
--bin-size=1000 \
chrom.sizes.sorted \
output.cool < pixels.bg2
[2024-03-21 12:27:16.998] [info]: Running hictk v0.0.10-1c2bafd
[2024-03-21 12:27:16.998] [info]: begin loading unsorted pixels into a .cool file...
[2024-03-21 12:27:17.077] [info]: writing chunk #1 to intermediate file "/tmp/output.cool.tmp/output.cool.tmp"...
[2024-03-21 12:27:20.945] [info]: done writing chunk #1 to tmp file "/tmp/output.cool.tmp/output.cool.tmp".
[2024-03-21 12:27:20.945] [info]: writing chunk #2 to intermediate file "/tmp/output.cool.tmp/output.cool.tmp"...
[2024-03-21 12:27:24.890] [info]: done writing chunk #2 to tmp file "/tmp/output.cool.tmp/output.cool.tmp".
[2024-03-21 12:27:24.890] [info]: writing chunk #3 to intermediate file "/tmp/output.cool.tmp/output.cool.tmp"...
[2024-03-21 12:27:28.823] [info]: done writing chunk #3 to tmp file "/tmp/output.cool.tmp/output.cool.tmp".
[2024-03-21 12:27:28.823] [info]: writing chunk #4 to intermediate file "/tmp/output.cool.tmp/output.cool.tmp"...
[2024-03-21 12:27:32.668] [info]: done writing chunk #4 to tmp file "/tmp/output.cool.tmp/output.cool.tmp".
[2024-03-21 12:27:32.668] [info]: writing chunk #5 to intermediate file "/tmp/output.cool.tmp/output.cool.tmp"...
[2024-03-21 12:27:36.070] [info]: done writing chunk #5 to tmp file "/tmp/output.cool.tmp/output.cool.tmp".
[2024-03-21 12:27:36.070] [info]: writing chunk #6 to intermediate file "/tmp/output.cool.tmp/output.cool.tmp"...
[2024-03-21 12:27:36.079] [info]: done writing chunk #6 to tmp file "/tmp/output.cool.tmp/output.cool.tmp".
[2024-03-21 12:27:36.080] [info]: merging 6 chunks into "output.cool"...
[2024-03-21 12:27:38.572] [info]: processing chr3R:20786000-20787000 chr3R:20808000-20809000 at 4091653 pixels/s...
[2024-03-21 12:27:41.443] [info]: processing chr3L:7391000-7392000 chr3L:7417000-7418000 at 3484321 pixels/s...
[2024-03-21 12:27:44.292] [info]: processing chr2R:9278000-9279000 chrX:5993000-5994000 at 3510004 pixels/s...
[2024-03-21 12:27:47.062] [info]: processing chrX:14217000-14218000 chrX:17476000-17477000 at 3611412 pixels/s...
[2024-03-21 12:27:49.901] [info]: ingested 119208613 interactions (48469783 nnz) in 32.902465965s!
Lastly, we check that chromosomes are properly sorted:
user@dev:/tmp hictk dump output.cool --table=chroms
chr3R 32079331
chr3L 28110227
chr2R 25286936
chrX 23542271
chr2L 23513712
chrY 3667352
chr4 1348131
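This check can also be automated instead of comparing the two listings by eye. The sketch below assumes that the chromosome table dumped from the new file has exactly the same layout as chrom.sizes.sorted (both come from hictk dump --table=chroms in this walkthrough). The helper name same_chrom_order and the file name chrom.sizes.output are hypothetical.

```shell
# Succeed when two chrom.sizes-style tables list the same chromosomes,
# with the same sizes, in the same order.
same_chrom_order() {
    expected_table=$1
    actual_table=$2
    if cmp -s "$expected_table" "$actual_table"; then
        echo "chromosome order matches"
    else
        echo "chromosome order differs:" >&2
        # diff exits with a non-zero status when the files differ; ignore it here
        diff "$expected_table" "$actual_table" >&2 || true
        return 1
    fi
}

# Example usage (assumes the files produced in the walkthrough exist):
#   hictk dump output.cool --table=chroms > chrom.sizes.output
#   same_chrom_order chrom.sizes.sorted chrom.sizes.output
```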
Tips and tricks
The approach above has one potential problem: the size of the intermediate file pixels.bg2.
Luckily, we can avoid generating this file altogether by combining input redirection with process substitution:
user@dev:/tmp hictk load --format=bg2 \
--bin-size=1000 \
chrom.sizes.sorted \
output.cool \
< <(hictk dump 4DNFIZ1ZVXC8.mcool --join --resolution=1000)
Note that hictk still needs to create temporary files when loading interactions into a new .cool or .hic file.
When processing large files, it is a good idea to specify a custom folder for these temporary files through the --tmpdir flag:
user@dev:/tmp hictk load --format=bg2 \
--bin-size=1000 \
--tmpdir=/var/tmp/ \
chrom.sizes.sorted \
output.cool \
< <(hictk dump 4DNFIZ1ZVXC8.mcool --join --resolution=1000)
Another option worth considering when working with .hic files is --threads, which can significantly reduce the time required to load interactions into .hic files.
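The steps from this tutorial can be collected into a single shell function. The following is only a sketch under stated assumptions: the function name reorder_by_size and its argument order are hypothetical, it uses only the hictk flags shown earlier on this page, and it feeds pixels to hictk load through a pipe, which provides the same standard input as the process substitution used above.

```shell
# Usage: reorder_by_size INPUT RESOLUTION OUTPUT [TMPDIR]
# Writes OUTPUT with the chromosomes of INPUT sorted by decreasing size.
reorder_by_size() {
    rbs_input=$1
    rbs_resolution=$2
    rbs_output=$3
    rbs_tmpdir=${4:-/tmp}

    # Temporary chrom.sizes table holding the new chromosome order
    rbs_chroms=$(mktemp "$rbs_tmpdir/chrom.sizes.XXXXXX") || return 1

    # 1) dump and sort chromosomes, 2) stream pixels into the new file.
    # Only the exit status of the last command of each pipeline is checked.
    hictk dump "$rbs_input" --table=chroms | sort -k2,2nr > "$rbs_chroms" &&
        hictk dump "$rbs_input" --join --resolution="$rbs_resolution" |
        hictk load --format=bg2 \
                   --bin-size="$rbs_resolution" \
                   --tmpdir="$rbs_tmpdir" \
                   "$rbs_chroms" \
                   "$rbs_output"
    rbs_status=$?

    rm -f "$rbs_chroms"
    return "$rbs_status"
}

# Example usage with the file from the walkthrough:
#   reorder_by_size 4DNFIZ1ZVXC8.mcool 1000 output.cool /var/tmp
```

Passing the resolution once and reusing it for both --resolution and --bin-size avoids a mismatch between the dumped pixels and the bin size of the new file.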