Dump interactions to .cool or .hic file#

TLDR#

# Important! --bin-size should be the same resolution as matrix.cool
user@dev:/tmp hictk load --format=bg2 \
                         --bin-size=1000 \
                         <(hictk dump --table=chroms matrix.cool)
                         output.cool \
                         < <(hictk dump --join
                                        --range=chr2L:0-10,000,000
                                        --range2=chr3R:0-10,000,000
                                        matrix.cool)

Why is this needed?#

hictk dump can read interactions from .cool, .mcool, and .hic files and write them in text format to stdout. Additionally, hictk dump supports fetching interactions overlapping a pair of regions of interest through the --range and --range2 CLI options. However, instead of writing interactions to stdout, we may want to write them to a new .cool or .hic file. This tutorial shows how this can be accomplished using hictk dump and hictk load.

Walkthrough#

For this tutorial, we will use file 4DNFIZ1ZVXC8.mcool as an example, which can be downloaded from here.

First, we extract the list of chromosomes from the input file:

user@dev:/tmp hictk dump 4DNFIZ1ZVXC8.mcool --table=chroms | tee chrom.sizes

chr2L       23513712
chr2R       25286936
chr3L       28110227
chr3R       32079331
chr4        1348131
chrX        23542271
chrY        3667352

Second, we dump pixels in bedGraph2 format (see below for how to make this step more efficient):

user@dev:/tmp hictk dump 4DNFIZ1ZVXC8.mcool \
                         --join \
                         --resolution=1000 \
                         --range=chr2L:5,000,000-10,000,000 \
                         --range2=chr3R:7,500,000-10,000,000 > pixels.bg2

user@dev:/tmp head pixels.bg2

chr2L       5000000 5001000 chr3R   7506000 7507000 1
chr2L       5000000 5001000 chr3R   7624000 7625000 1
chr2L       5000000 5001000 chr3R   7943000 7944000 1
chr2L       5000000 5001000 chr3R   8014000 8015000 1
chr2L       5000000 5001000 chr3R   8130000 8131000 1
chr2L       5000000 5001000 chr3R   8245000 8246000 1
chr2L       5000000 5001000 chr3R   8855000 8856000 1
chr2L       5000000 5001000 chr3R   9032000 9033000 1
chr2L       5000000 5001000 chr3R   9171000 9172000 1
chr2L       5000000 5001000 chr3R   9380000 9381000 1

Finally, we load pixels into a new .cool file

user@dev:/tmp hictk load --format=bg2 \
                         --bin-size=1000 \
                         chrom.sizes \
                         output.cool < pixels.bg2

[2024-03-21 13:22:57.542] [info]: Running hictk v0.0.10-1c2bafd
[2024-03-21 13:22:57.542] [info]: begin loading unsorted pixels into a .cool file...
[2024-03-21 13:22:57.613] [info]: writing chunk #1 to intermediate file "/tmp/output.cool.tmp/output.cool.tmp"...
[2024-03-21 13:22:57.630] [info]: done writing chunk #1 to tmp file "/tmp/output.cool.tmp/output.cool.tmp".
[2024-03-21 13:22:57.630] [info]: writing chunk #2 to intermediate file "/tmp/output.cool.tmp/output.cool.tmp"...
[2024-03-21 13:22:57.634] [info]: done writing chunk #2 to tmp file "/tmp/output.cool.tmp/output.cool.tmp".
[2024-03-21 13:22:57.634] [info]: merging 2 chunks into "output.cool"...
[2024-03-21 13:22:57.676] [info]: ingested 26214 interactions (25085 nnz) in 0.133955616s!

Removing empty chromosomes from the reference genome#

This can be easily achieved by grepping chr2L and chr3R when generating the chrom.sizes file.

user@dev:/tmp hictk dump 4DNFIZ1ZVXC8.mcool --table=chroms |
              grep -e 'chr2L' -e 'chr3R' |
              tee chrom.sizes

chr2L       23513712
chr3R       32079331

Tips and tricks#

There is one potential problem with the above solution, and that is the size of file pixels.bg2 Luckily, we can completely avoid generating this file by using output redirection and process substitutions:

user@dev:/tmp hictk load --format=bg2 \
                         --bin-size=1000 \
                         chrom.sizes \
                         output.cool \
                         < <(hictk dump 4DNFIZ1ZVXC8.mcool \
                                        --join \
                                        --resolution=1000 \
                                        --range=chr2L:0-10,000,000 \
                                        --range2=chr3R:0-10,000,000)

Note that hictk still needs to generate some temporary file to load interactions into a new .cool or .hic file. When processing large files, it is a good idea to specify custom folder where to create temporary files through the --tmpdir flag:

user@dev:/tmp hictk load --format=bg2 \
                         --bin-size=1000 \
                         --tmpdir=/var/tmp/ \
                         chrom.sizes.sorted \
                         output.cool \
                         < <(hictk dump 4DNFIZ1ZVXC8.mcool \
                                        --join \
                                        --resolution=1000 \
                                        --range=chr2L:0-10,000,000 \
                                        --range2=chr3R:0-10,000,000)

Another option you may want to consider when working with .hic files, is the --threads option, which can significantly reduce the time required to load interactions into .hic files.