chdman leaves trailing garbage in the last hunk after the CD data #13048

Open
schellingb opened this issue Dec 5, 2024 · 8 comments


@schellingb

I generated a CUE/BIN pair with 1 empty data track and 81 very short audio tracks.

INPUT: 82Tracks-BINCUE.zip

When converting the CUE file with

chdman createcd -c none -i 82Tracks.cue -o 82Tracks.cue.chd

OUTPUT: 82Tracks-CHD.zip

It generates a seemingly correct .CHD file, but the last hunk (440, mapped to index 422 starting at 8264448) has garbage bytes instead of zeroes after the end of the data. That hunk seems to be made of:

  • 1 unit that is the final sector of the final track
  • 3 zeroed units (padding?)
  • 1 unit of repeated data from a seemingly random previous audio track (track 37 in this case)
  • 3 more zeroed units

Where is this duplicated unit of track 37 coming from? I tried debugging chdman but couldn't figure out why it ends up with bytes from track 37. My guess is that the asynchronously processed work items share a pool of buffers, and for the last hunk one gets reused that previously held a full 8 units; because it's at the end of the data, it only gets filled partially, leaving the previously written trailing bytes in place.

The audio tracks are each filled from start to end with a single repeated byte, starting with 0x01 0x01 0x01 0x01 ... in the first audio track (track 2, after the data track) and going up to 0x51 0x51 0x51 0x51 ... in the final, 81st audio track (track 82). So it's very easy to see in a hex editor that after what should be the last run of 0x51 bytes, there is suddenly one more block of 0x24 bytes.
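
For anyone who wants to check without a hex editor, here's a rough standalone sketch (not part of chdman, and assuming a CHD written with -c none so the hunk bytes appear verbatim in the file) that prints the first byte of each 2,448-byte unit in a hunk, given the hunk's file offset (8264448 for this input):

```cpp
// Dumps the leading byte of each unit in one hunk of an uncompressed CHD,
// so the stray 0x24 unit stands out among the 0x51 and 0x00 units.
// Error handling is kept minimal for brevity.
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char **argv)
{
    if (argc != 3) { std::fprintf(stderr, "usage: %s file.chd hunk_offset\n", argv[0]); return 1; }
    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    const long UNIT = 2448;              // CD unit: 2,352-byte frame + 96 bytes of subcode
    const int UNITS_PER_HUNK = 8;        // 19,584 / 2,448
    const long hunk_offset = std::atol(argv[2]);  // e.g. 8264448 for this input

    std::vector<unsigned char> unit(UNIT);
    for (int u = 0; u < UNITS_PER_HUNK; ++u)
    {
        std::fseek(f, hunk_offset + u * UNIT, SEEK_SET);
        if (std::fread(unit.data(), 1, UNIT, f) == (std::size_t)UNIT)
            std::printf("unit %d starts with 0x%02x\n", u, unit[0]);
    }
    std::fclose(f);
    return 0;
}
```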

I'm wondering whether the selection of garbage bytes is deterministic across all platforms and versions of chdman.

@schellingb
Author

So I boiled this down to a much simpler case. I generated (many) BIN/CUE files with just one track and X frames full of data, then counted how many units end up in the CHD. Whenever there are more units than frames, I know there's garbage at the end. This made me realize the behavior is fairly simple and consistent:

  • When creating a CHD for a CD-ROM (hunk size is 19,584 bytes and unit size is 2,448 bytes),
  • and writing the header, the metadata, and then more than 256 hunks of data,
  • and the final hunk has 4 or fewer units that map to an actual CD track,
  • then the final hunk will contain the latter half of the hunk from exactly 256 hunks earlier.

So the byte range from EOF - 9792 to EOF is a copy of the range from EOF - 5023296 to EOF - 5013504 (9,792 bytes is half a hunk, and 256 hunks × 19,584 bytes = 5,013,504 bytes).
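
That claim is easy to verify on a given file. A minimal sketch (again assuming an uncompressed CHD written with -c none, so file bytes equal hunk bytes) that compares the last 9,792 bytes with the 9,792 bytes located 5,013,504 bytes earlier:

```cpp
// Checks whether the trailing half-hunk of an uncompressed CHD duplicates
// the corresponding bytes from exactly 256 hunks (5,013,504 bytes) earlier.
#include <cstdio>
#include <cstring>
#include <vector>

int main(int argc, char **argv)
{
    if (argc != 2) { std::fprintf(stderr, "usage: %s file.chd\n", argv[0]); return 1; }
    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    const long HALF_HUNK = 19584 / 2;   // 9,792 bytes: half of a CD hunk
    const long BACK = 256L * 19584;     // 5,013,504 bytes: 256 hunks

    std::vector<unsigned char> tail(HALF_HUNK), older(HALF_HUNK);
    std::fseek(f, -HALF_HUNK, SEEK_END);
    std::fread(tail.data(), 1, HALF_HUNK, f);
    std::fseek(f, -(HALF_HUNK + BACK), SEEK_END);
    std::fread(older.data(), 1, HALF_HUNK, f);
    std::fclose(f);

    const bool same = std::memcmp(tail.data(), older.data(), HALF_HUNK) == 0;
    std::printf("trailing half-hunk %s the bytes from 256 hunks earlier\n",
                same ? "matches" : "differs from");
    return 0;
}
```

Run against an affected CHD, it should report a match.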

Here's the simplest input I could come up with: 2049 frames of audio, which make a CHD with 256 full hunks plus a final 257th hunk whose single real unit is the final frame of the CD track. After that, there's repeated garbage in the latter half of that 257th hunk.

Input: 2049Frames-BINCUE.zip - Output: 2049Frames-CHD.zip
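
For reference, such an input is trivial to regenerate. A minimal sketch (the file names and the 0x01 fill byte are arbitrary choices here) that writes one constant-filled audio track of 2049 raw 2,352-byte frames plus a matching CUE sheet:

```cpp
// Generates a one-track BIN/CUE test input: 2049 audio frames, every byte 0x01.
// 2048 frames fill 256 hunks exactly; the 2049th spills into a 257th hunk.
#include <cstdio>
#include <vector>

int main()
{
    const int FRAMES = 2049;        // 256 full hunks (2048 frames) plus one extra frame
    const int FRAME_BYTES = 2352;   // raw audio sector size in the BIN file

    std::vector<unsigned char> frame(FRAME_BYTES, 0x01);  // constant fill byte

    std::FILE *bin = std::fopen("2049Frames.bin", "wb");
    for (int i = 0; i < FRAMES; ++i)
        std::fwrite(frame.data(), 1, FRAME_BYTES, bin);
    std::fclose(bin);

    std::FILE *cue = std::fopen("2049Frames.cue", "w");
    std::fprintf(cue, "FILE \"2049Frames.bin\" BINARY\n"
                      "  TRACK 01 AUDIO\n"
                      "    INDEX 01 00:00:00\n");
    std::fclose(cue);
    return 0;
}
```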

I tried making a file with 256 hunks and 99 tracks, which makes the metadata section take up more than 9792 bytes, but the final hunk didn't repeat part of the metadata section (which is 5023296 bytes from the end of the file). So it only seems to start happening with 257 or more hunks of data.

I have confirmed this on Win64, Win32, Linux x64 and Linux arm64, and it behaves the same with 0.220, 0.251, 0.267 and 0.272. So overall it's very consistent. Maybe this should then be classified as a "known quirk" instead of a bug that needs fixing, to avoid needlessly changing the checksums of the large number of CHDs that have already been created with this behavior. If written specs for CHD ever exist, I guess this behavior could be documented there.

By the way, I have no idea how this affects non-CD type CHDs.

@FoxhackDN
Contributor

Does extracting a CHD with garbage back to bin/cue keep the garbage at the end?

@schellingb
Author

No it doesn't; the BIN/CUE gets rebuilt exactly like the input, with all the units that map to the actual CD track. The garbage is only in the "unused" area of the final hunk of the CHD. When I made the first post it seemed much more random than it turned out to be. It's still probably unintended behavior that would be cleaner if it had never existed. But with the behavior likely having been there for years, it might be best to just leave it and document it somewhere. The source code is hard for me to read; it would be great if someone who knows it could confirm that the behavior is exactly what I described and that it is consistent across platforms and past (and future) versions:

When there are more than 256 hunks of data, and only half or less of the final hunk is used, the latter half of the hunk from 256 hunks before the final one is repeated.

If that can be confirmed in the source code, we can maybe infer how this affects other non-CD type CHDs, and hopefully sleep well knowing this is consistent and deterministic :-)

@cuavas
Member

cuavas commented Dec 7, 2024

Well, it always has to pad out the CHD file to a whole number of hunks. It isn’t extracting the padding when it extracts the content. The padding shouldn’t be included when calculating digests.

It appears the padding can’t leak any sensitive data because at worst it contains leftover data from earlier in the input.

Maybe it should be zeroed (or we could do something akin to PKCS padding if we want to get fancy), but I don’t think it’s a severe bug.

@schellingb
Author

I'm operating under the assumption that the output of chdman is supposed to be deterministic (at least when not using compression), meaning it should create bit-exact output for the same input no matter what machine or environment it runs on. When I opened the issue I wasn't sure whether the selection of garbage bytes was more random (or dependent on CPU, threads, ...), but further investigation has shown that to be unlikely: I got the exact same 256-hunk-offset data duplication on various platforms and inputs. It would be nice to confirm the behavior in the source code. I can give it another shot; maybe I can find that 256 somewhere in there and figure it out. Then we can at least document this here and close the issue as "known, won't fix" or something.

@ajrhacker
Contributor

This is one of the issues I think I might have fixed in my uncommitted CHD WIP.

@firewave
Contributor

This sounds like an issue I fixed ages ago. I discovered this (among a lot of others) while writing unit tests for chdman. Unfortunately those live outside the tree and were not run in CI, so they bit-rotted badly.

I might be willing to look into resurrecting them. But that would only make sense if we can figure out how to integrate them into the builds.

@schellingb
Author

I still think there's a good argument for leaving this as is, to avoid suddenly creating different but equally valid CHD files. The behavior really does seem consistent and deterministic, and doesn't expose any weird data. I debugged into chdman and realized the garbage bytes in the last hunk coming from 256 hunks earlier are due to this in chd.h:

```cpp
static constexpr int WORK_BUFFER_HUNKS = 256;
```

Given an input CD-ROM where the total number of (padded) frames exceeds 2048 (i.e. more than 256 hunks) and, when divided by 8, leaves a remainder of 4 (which can only be 4 or 0 due to the track padding), the repeated bytes will be present in the latter half of the final hunk.

Besides WORK_BUFFER_HUNKS, this is due to cdrom_file::TRACK_PADDING being 4 and cdrom_file::FRAMES_PER_HUNK being 8 (in cdrom.h). The very last call to chd_file_compressor::async_read then only fills half of the final hunk. If there are 256 or fewer total hunks, the latter half is zeroes thanks to the zeroing of the newly initialized work buffer in chd_file_compressor::compress_begin, so that case is of course already perfectly fine.
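
To illustrate the mechanism, here's a simplified model (not chdman's actual code): the read-ahead uses a ring of WORK_BUFFER_HUNKS hunk-sized buffers that are zeroed only once, each hunk index reuses buffer hunk % 256, and a final partial read leaves the tail of that buffer holding whatever the read 256 hunks earlier put there:

```cpp
// Simplified model of the work-buffer ring: 256 hunk buffers zeroed once at
// startup, reused modulo 256, with a partial final read leaving stale bytes.
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

constexpr long WORK_BUFFER_HUNKS = 256;
constexpr long HUNK_BYTES = 19584;

int main()
{
    // value-initialized: every buffer starts out zeroed, as in compress_begin
    std::vector<std::array<unsigned char, HUNK_BYTES>> ring(WORK_BUFFER_HUNKS);

    // 257 hunks of input where the last hunk is only half filled
    const long total_bytes = 257 * HUNK_BYTES - HUNK_BYTES / 2;

    for (long hunk = 0, offset = 0; offset < total_bytes; ++hunk, offset += HUNK_BYTES)
    {
        auto &buf = ring[hunk % WORK_BUFFER_HUNKS];
        const long avail = std::min(HUNK_BYTES, total_bytes - offset);
        for (long i = 0; i < avail; ++i)
            buf[i] = (unsigned char)(hunk & 0xff);  // stand-in for the real source bytes

        // bytes buf[avail..HUNK_BYTES) are never cleared: they still hold what
        // hunk - 256 left behind, and they get written out with this hunk
        if (avail < HUNK_BYTES)
            std::printf("hunk %ld: tail still holds bytes from hunk %ld (fill 0x%02x)\n",
                        hunk, hunk - WORK_BUFFER_HUNKS, buf[HUNK_BYTES - 1]);
    }
    return 0;
}
```

With 256 or fewer hunks no buffer is ever reused, so the tail of a partial final hunk keeps its initial zeroes, matching the behavior described above.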

So again, I'd argue for keeping this as is to avoid introducing somewhat unnecessary inconsistency. As long as the constants WORK_BUFFER_HUNKS, TRACK_PADDING and FRAMES_PER_HUNK and the CHD version number stay the same, it might be best to just document this quirk somewhere (which this issue ticket kind of is).
