File validation
Why is this needed?
hictk validate can detect several types of data corruption in .hic and .cool files, from simple file truncation (e.g. due to a failed download) to subtle index corruption in .cool files.
Cooler index corruption
In short, older versions of cooler (including v0.8.3) had a bug in cooler zoomify that caused it to generate invalid file indexes. As a result, duplicate pixels, sometimes with different values, are reported for the affected regions.
Example:
| chrom1 | start1 | end1 | chrom2 | start2 | end2 | count | balanced |
|---|---|---|---|---|---|---|---|
| chr1 | 10828000 | 10830000 | chr1 | 11002000 | 11004000 | 1 | 0.000208987 |
| chr1 | 10828000 | 10830000 | chr1 | 11002000 | 11004000 | 1 | 0.000208987 |
| chr1 | 10828000 | 10830000 | chr1 | 11006000 | 11008000 | 1 | 0.000199523 |
| chr1 | 10828000 | 10830000 | chr1 | 11006000 | 11008000 | 3 | 0.000598569 |
| chr1 | 10828000 | 10830000 | chr1 | 11010000 | 11012000 | 4 | 0.000695946 |
| chr1 | 10828000 | 10830000 | chr1 | 11010000 | 11012000 | 2 | 0.000347973 |
| chr1 | 10828000 | 10830000 | chr1 | 11020000 | 11022000 | 1 | 0.000219669 |
| chr1 | 10828000 | 10830000 | chr1 | 11020000 | 11022000 | 1 | 0.000219669 |
| chr1 | 10828000 | 10830000 | chr1 | 11030000 | 11032000 | 3 | 0.000499071 |
| chr1 | 10828000 | 10830000 | chr1 | 11030000 | 11032000 | 2 | 0.000332714 |
| … | … | … | … | … | … | … | … |
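The symptom in the table (the same pair of bins reported more than once) can be spotted mechanically. The following is a minimal sketch, not part of hictk: it assumes the pixels have already been dumped to tab-separated text with the same column order as the table, and the function name find_duplicate_pixels is hypothetical.

```shell
# Hypothetical helper: print each pixel coordinate that occurs more than once.
# Input (stdin): tab-separated chrom1 start1 end1 chrom2 start2 end2 count [balanced]
find_duplicate_pixels() {
  awk -F '\t' '
    {
      # The six coordinate columns identify a pixel; count/balanced are ignored.
      key = $1 FS $2 FS $3 FS $4 FS $5 FS $6
      seen[key]++
      # Report a coordinate once, the first time it is seen again.
      if (seen[key] == 2) print key
    }
  '
}

# Demo using three rows from the table: the 11006000-11008000 pixel is duplicated.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 10828000 10830000 chr1 11006000 11008000 1 0.000199523 \
  chr1 10828000 10830000 chr1 11006000 11008000 3 0.000598569 \
  chr1 10828000 10830000 chr1 11010000 11012000 4 0.000695946 |
  find_duplicate_pixels
```

Any output from the helper indicates duplicated coordinates; a file with a valid index should produce none. This check reads every pixel, so for routine use the dedicated validation command described in this page is the better tool.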
Unfortunately, this is not a rare issue: the bug described above currently affects most (possibly all) .mcool files released by 4DNucleome.
hictk validate
hictk validate was initially developed to detect files affected by the above issue and was later extended to also validate .cool, .scool and .hic files.
Perform a quick check to detect truncated or otherwise invalid files:
# Validate a .hic file
user@dev:/tmp$ hictk validate test/data/hic/4DNFIZ1ZVXC8.hic8
### SUCCESS: "test/data/hic/4DNFIZ1ZVXC8.hic8" is a valid .hic file.
# Validate a .mcool file
user@dev:/tmp$ hictk validate test/data/integration_tests/4DNFIZ1ZVXC8.mcool
uri="test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/2500000"
is_hdf5=true
unable_to_open_file=false
file_was_properly_closed=true
missing_or_invalid_format_attr=false
missing_or_invalid_bin_type_attr=false
missing_groups=[]
is_valid_cooler=true
index_is_valid=not_checked
### SUCCESS: "test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/2500000" is a valid Cooler.
uri="test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/1000000"
is_hdf5=true
unable_to_open_file=false
file_was_properly_closed=true
missing_or_invalid_format_attr=false
missing_or_invalid_bin_type_attr=false
missing_groups=[]
is_valid_cooler=true
index_is_valid=not_checked
### SUCCESS: "test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/1000000" is a valid Cooler.
...
uri="test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/1000"
is_hdf5=true
unable_to_open_file=false
file_was_properly_closed=true
missing_or_invalid_format_attr=false
missing_or_invalid_bin_type_attr=false
missing_groups=[]
is_valid_cooler=true
index_is_valid=not_checked
### SUCCESS: "test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/1000" is a valid Cooler.
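When many files need the quick check, the command can be wrapped in a loop. The sketch below is an illustration under stated assumptions: it assumes that hictk validate exits with a non-zero status when a file is invalid, and the function name validate_all and the VALIDATOR variable are hypothetical (the variable exists only so the command can be swapped out, e.g. for testing).

```shell
# Hypothetical helper: run the validator on each file and list the failures.
# ASSUMPTION: the validator (default: `hictk validate`) exits non-zero
# when a file is invalid.
validate_all() {
  failed=0
  for f in "$@"; do
    # VALIDATOR is intentionally unquoted so "hictk validate" splits into
    # a command and its subcommand.
    if ! ${VALIDATOR:-hictk validate} "$f" >/dev/null 2>&1; then
      printf 'FAILED\t%s\n' "$f"
      failed=$((failed + 1))
    fi
  done
  # Succeed only if every file passed.
  [ "$failed" -eq 0 ]
}
```

A call such as `validate_all *.mcool *.hic` would then print one FAILED line per rejected file and return a non-zero status if any file failed.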
The quick check does not detect Cooler files with a corrupted index. Detecting these requires the --validate-index option:
user@dev:/tmp$ hictk validate --validate-index 4DNFI9GMP2J8.mcool::/resolutions/1000000
uri="4DNFI9GMP2J8.mcool::/resolutions/1000000"
is_hdf5=true
unable_to_open_file=false
file_was_properly_closed=true
missing_or_invalid_format_attr=false
missing_or_invalid_bin_type_attr=false
missing_groups=[]
is_valid_cooler=true
index_is_valid=false
### FAILURE: "4DNFI9GMP2J8.mcool::/resolutions/1000000" is not a valid Cooler.
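For files with many resolutions, the report can be condensed to one line per URI. The sketch below only parses text in the key=value layout shown above; it assumes the report has been captured as text (for example redirected to a file), and the function name summarize_validation is hypothetical.

```shell
# Hypothetical helper: reduce a validation report to "<uri><TAB><index_is_valid>".
# Input (stdin): report text in the key=value layout shown above.
summarize_validation() {
  awk '
    /^uri=/ {
      # Remember the most recent URI, without the key and the quotes.
      uri = $0
      sub(/^uri=/, "", uri)
      gsub(/"/, "", uri)
    }
    /^index_is_valid=/ {
      status = $0
      sub(/^index_is_valid=/, "", status)
      print uri "\t" status
    }
  '
}
```

Piping the report through `summarize_validation | grep -v 'true$'` would then leave only the resolutions whose index is invalid or was not checked.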
Restoring corrupted .mcool files
Luckily, the base resolution of .mcool files corrupted as described in Cooler index corruption is still valid, so the corrupted resolutions can be regenerated from it.
File restoration is automated by hictk fix-mcool:
hictk fix-mcool 4DNFI9GMP2J8.mcool 4DNFI9GMP2J8.fixed.mcool
hictk fix-mcool is essentially a wrapper around hictk zoomify and hictk balance.
When balancing, hictk fix-mcool tries to reuse the parameters that were used to balance the original .mcool file. When this is not possible, it falls back to the default parameters of hictk balance.
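After a repair it can be worth confirming that the new file passes the index check. The sketch below chains the two commands shown on this page. It is an illustration only: the function name fix_and_verify and the HICTK variable are hypothetical, and it assumes that a failed validation either exits with a non-zero status or prints an index_is_valid=false line as in the report above.

```shell
# Hypothetical helper: repair an .mcool file, then re-check its index.
# HICTK can point to a different executable; it defaults to "hictk".
fix_and_verify() {
  src=$1
  dst=$2
  ${HICTK:-hictk} fix-mcool "$src" "$dst" || return 1
  # ASSUMPTION: an invalid file is signalled by a non-zero exit status or by
  # "index_is_valid=false" in the report (stdout and stderr are both captured).
  if ! report=$(${HICTK:-hictk} validate --validate-index "$dst" 2>&1) ||
     printf '%s\n' "$report" | grep -q '^index_is_valid=false'; then
    echo "index still invalid: $dst" >&2
    return 1
  fi
  echo "restored: $dst"
}
```

For example, `fix_and_verify 4DNFI9GMP2J8.mcool 4DNFI9GMP2J8.fixed.mcool` would run the repair from this section and then validate the result.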