File validation

Why is this needed?

hictk validate can detect several types of data corruption in .hic and .cool files, ranging from simple file truncation (e.g. caused by a failed download) to subtle index corruption in .cool files.

Cooler index corruption

In short, older versions of cooler (including v0.8.3) had a bug in cooler zoomify that caused invalid file indexes to be generated. As a result, duplicate pixels, in some cases with different values, are reported for the affected region.

Example:

Output of cooler dump for corrupted file 4DNFI9GMP2J8.mcool

chrom1  start1    end1      chrom2  start2    end2      count  balanced
chr1    10828000  10830000  chr1    11002000  11004000  1      0.000208987
chr1    10828000  10830000  chr1    11002000  11004000  1      0.000208987
chr1    10828000  10830000  chr1    11006000  11008000  1      0.000199523
chr1    10828000  10830000  chr1    11006000  11008000  3      0.000598569
chr1    10828000  10830000  chr1    11010000  11012000  4      0.000695946
chr1    10828000  10830000  chr1    11010000  11012000  2      0.000347973
chr1    10828000  10830000  chr1    11020000  11022000  1      0.000219669
chr1    10828000  10830000  chr1    11020000  11022000  1      0.000219669
chr1    10828000  10830000  chr1    11030000  11032000  3      0.000499071
chr1    10828000  10830000  chr1    11030000  11032000  2      0.000332714
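
Duplicates like those in the dump above can also be spotted mechanically. The following is a minimal sketch, not part of hictk or cooler: find_duplicate_pixels is a hypothetical helper that assumes whitespace-separated rows in the column order shown above (chrom1, start1, end1, chrom2, start2, end2, count, ...) and prints each coordinate pair that occurs more than once, followed by its number of occurrences.

```shell
# Hypothetical helper (not part of hictk or cooler).
# Reads rows shaped like the table above from stdin and prints every
# pair of bins that appears more than once, with the occurrence count.
# Possible usage: cooler dump --join <uri> | find_duplicate_pixels
find_duplicate_pixels() {
  awk '
    $1 == "chrom1" { next }                 # skip an optional header row
    NF >= 7 {
      key = $1 ":" $2 "-" $3 "\t" $4 ":" $5 "-" $6
      if (!(key in seen)) order[++n] = key  # remember first-seen order
      seen[key]++
    }
    END {
      for (i = 1; i <= n; i++)
        if (seen[order[i]] > 1) print order[i] "\t" seen[order[i]]
    }
  '
}
```

An empty output means that no coordinate pair was repeated in the rows that were read. This only inspects the dumped region, so it is not a substitute for a full index check.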

Unfortunately, this is not a rare issue, as the above bug currently affects most (possibly all) .mcool files released by 4DNucleome:

[Figure: 4DNucleome bug notice]

hictk validate

hictk validate was initially developed to detect files affected by the above issue and was later extended to also validate .cool, .scool and .hic files.

Perform a quick check to detect truncated or otherwise invalid files:

# Validate a .hic file
user@dev:/tmp$ hictk validate test/data/hic/4DNFIZ1ZVXC8.hic8
### SUCCESS: "test/data/hic/4DNFIZ1ZVXC8.hic8" is a valid .hic file.

# Validate a .mcool file (each resolution is checked in turn)
user@dev:/tmp$ hictk validate test/data/integration_tests/4DNFIZ1ZVXC8.mcool
uri="test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/2500000"
is_hdf5=true
unable_to_open_file=false
file_was_properly_closed=true
missing_or_invalid_format_attr=false
missing_or_invalid_bin_type_attr=false
missing_groups=[]
is_valid_cooler=true
index_is_valid=not_checked
### SUCCESS: "test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/2500000" is a valid Cooler.
uri="test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/1000000"
is_hdf5=true
unable_to_open_file=false
file_was_properly_closed=true
missing_or_invalid_format_attr=false
missing_or_invalid_bin_type_attr=false
missing_groups=[]
is_valid_cooler=true
index_is_valid=not_checked
### SUCCESS: "test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/1000000" is a valid Cooler.
...
uri="test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/1000"
is_hdf5=true
unable_to_open_file=false
file_was_properly_closed=true
missing_or_invalid_format_attr=false
missing_or_invalid_bin_type_attr=false
missing_groups=[]
is_valid_cooler=true
index_is_valid=not_checked
### SUCCESS: "test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/1000" is a valid Cooler.

The quick check does not detect Cooler files with a corrupted index; detecting these requires the --validate-index option:

user@dev:/tmp$ hictk validate --validate-index 4DNFI9GMP2J8.mcool::/resolutions/1000000
uri="4DNFI9GMP2J8.mcool::/resolutions/1000000"
is_hdf5=true
unable_to_open_file=false
file_was_properly_closed=true
missing_or_invalid_format_attr=false
missing_or_invalid_bin_type_attr=false
missing_groups=[]
is_valid_cooler=true
index_is_valid=false
### FAILURE: "4DNFI9GMP2J8.mcool::/resolutions/1000000" is not a valid Cooler.
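
When many files or resolutions are validated at once, the report can be condensed to the entries that failed. The sketch below relies only on the "### FAILURE" line format shown above; list_failed_uris is a hypothetical helper, and it assumes the report reaches it on stdin (if hictk writes the report to stderr, redirect it with 2>&1 first).

```shell
# Hypothetical helper (not part of hictk).
# Reads a validation report from stdin, prints the quoted URI of every
# "### FAILURE" line, and returns non-zero if at least one was found.
# Possible usage: hictk validate --validate-index <uri> 2>&1 | list_failed_uris
list_failed_uris() {
  awk -F'"' '
    /^### FAILURE:/ { print $2; failed = 1 }
    END { exit (failed + 0) }
  '
}
```

Matching on the printed report rather than on the exit status of hictk validate keeps the sketch within what this page documents.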

Restoring corrupted .mcool files

Luckily, in .mcool files corrupted as described in Cooler index corruption, the base resolution is still valid, so the corrupted resolutions can be regenerated from it.

File restoration is automated with hictk fix-mcool:

hictk fix-mcool 4DNFI9GMP2J8.mcool 4DNFI9GMP2J8.fixed.mcool

hictk fix-mcool is basically a wrapper around hictk zoomify and hictk balance.

When balancing, hictk fix-mcool will try to use the same parameters used to balance the original .mcool file. When this is not possible, hictk fix-mcool will fall back to the default parameters used by hictk balance.
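
After restoring a file, it may be worth confirming that the new file passes the index check. The following is a minimal sketch built only from the commands shown on this page; fix_and_verify_mcool is a hypothetical helper, and it assumes that --validate-index can be applied to a whole .mcool file in the same way as the quick check.

```shell
# Hypothetical helper (not part of hictk).
# Repairs an .mcool file with hictk fix-mcool, then re-validates the
# result, including its index.
# Assumption: --validate-index accepts a whole .mcool file.
fix_and_verify_mcool() {
  hictk fix-mcool "$1" "$2" || return 1
  # Capture the report and look for a FAILURE line instead of relying
  # on the exit status, whose meaning is not documented on this page.
  report="$(hictk validate --validate-index "$2" 2>&1)" || true
  printf '%s\n' "$report"
  case "$report" in
    *"### FAILURE"*) return 1 ;;
  esac
  return 0
}
```

For example, fix_and_verify_mcool 4DNFI9GMP2J8.mcool 4DNFI9GMP2J8.fixed.mcool would restore the file from the example above and print the validation report for the restored copy.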