Dump interactions to .cool or .hic file¶
TLDR¶
# Important! --bin-size should be the same resolution as matrix.cool
user@dev:/tmp hictk load - \
output.cool \
--chrom-sizes=<(hictk dump --table=chroms matrix.cool) \
--format=bg2 \
--bin-size=1kbp \
< <(hictk dump --join
--range=2L:0-10,000,000
--range2=3R:0-10,000,000
matrix.cool)
Why is this needed?¶
hictk dump can read interactions from .cool, .mcool, and .hic files and write them in text format to stdout.
Additionally, hictk dump supports fetching interactions overlapping a pair of regions of interest through the --range and --range2 CLI options.
However, instead of writing interactions to stdout, we may want to write them to a new .cool or .hic file.
This tutorial shows how this can be accomplished using hictk dump and hictk load.
Walkthrough¶
For this tutorial, we will use file 4DNFIOTPSS3L.hic as an example, which can be downloaded from here.
First, we extract the list of chromosomes from the input file:
user@dev:/tmp hictk dump 4DNFIOTPSS3L.hic --table=chroms | tee chrom.sizes
2L 23513712
2R 25286936
3L 28110227
3R 32079331
4 1348131
X 23542271
Y 3667352
Second, we dump pixels in bedGraph2 format (see below for how to make this step more efficient):
user@dev:/tmp hictk dump 4DNFIOTPSS3L.hic \
--join \
--resolution=1kbp \
--range=2L:5,000,000-10,000,000 \
--range2=3R:7,500,000-10,000,000 > pixels.bg2
user@dev:/tmp head pixels.bg2
2L 5000000 5001000 3R 7506000 7507000 1
2L 5000000 5001000 3R 7624000 7625000 1
2L 5000000 5001000 3R 7943000 7944000 1
2L 5000000 5001000 3R 8014000 8015000 1
2L 5000000 5001000 3R 8130000 8131000 1
2L 5000000 5001000 3R 8245000 8246000 1
2L 5000000 5001000 3R 8855000 8856000 1
2L 5000000 5001000 3R 9032000 9033000 1
2L 5000000 5001000 3R 9171000 9172000 1
2L 5000000 5001000 3R 9380000 9381000 1
Finally, we load pixels into a new .cool file
user@dev:/tmp hictk load pixels.bg2 \
output.cool \
--chrom-sizes=chrom.sizes \
--format=bg2 \
--bin-size=1kbp
[2024-09-27 18:54:58.532] [info]: Running hictk v1.0.0-fbdcb591
[2024-09-27 18:54:58.540] [info]: begin loading unsorted pixels into a .cool file...
[2024-09-27 18:54:58.629] [info]: writing chunk #1 to intermediate file "/tmp/hictk-tmp-XXXXatmfuM/output.cool.tmp"...
[2024-09-27 18:54:58.641] [info]: done writing chunk #1 to tmp file "/tmp/hictk-tmp-XXXXatmfuM/output.cool.tmp".
[2024-09-27 18:54:58.642] [info]: merging 1 chunks into "output.cool"...
[2024-09-27 18:54:58.672] [info]: ingested 26214 interactions (25085 nnz) in 0.139864314s!
Removing empty chromosomes from the reference genome¶
This can be easily achieved by grepping 2L and 3R when generating the chrom.sizes file.
user@dev:/tmp hictk dump 4DNFIOTPSS3L.hic --table=chroms |
grep -e '2L' -e '3R' |
tee chrom.sizes
2L 23513712
3R 32079331
Tips and tricks¶
There is one potential problem with the above solution, and that is the size of file pixels.bg2
Luckily, we can completely avoid generating this file by using output redirection and process substitutions:
user@dev:/tmp hictk load - \
output.cool \
--chrom-sizes=chrom.sizes \
--format=bg2 \
--bin-size=1kbp \
< <(hictk dump 4DNFIOTPSS3L.hic \
--join \
--resolution=1kbp \
--range=2L:0-10,000,000 \
--range2=3R:0-10,000,000)
Note that hictk still needs to generate some temporary file to load interactions into a new .cool or .hic file.
When processing large files, it is a good idea to specify custom folder where to create temporary files through the --tmpdir flag:
user@dev:/tmp hictk load - \
output.cool \
--chrom-sizes=chrom.sizes \
--format=bg2 \
--bin-size=1kbp \
--tmpdir=/var/tmp/ \
< <(hictk dump 4DNFIOTPSS3L.hic \
--join \
--resolution=1kbp \
--range=2L:0-10,000,000 \
--range2=3R:0-10,000,000)
Another option you may want to consider when working with .hic files, is the --threads option, which can significantly reduce the time required to load interactions into .hic files.