chdman leaves trailing garbage in the last hunk after the CD data #13048

Open
schellingb opened this issue Dec 5, 2024 · 8 comments


@schellingb

I generated a CUE/BIN pair with 1 empty data track and 81 very short audio tracks.

INPUT: 82Tracks-BINCUE.zip

When converting the CUE file with

chdman createcd -c none -i 82Tracks.cue -o 82Tracks.cue.chd

OUTPUT: 82Tracks-CHD.zip

It generates a seemingly correct .CHD file, but the last hunk (440, mapped to index 422 starting at 8264448) has garbage bytes instead of zeroes after the end of the data. That hunk seems to be made of:

  • 1 unit that is the final sector of the final track
  • 3 zeroed units (padding?)
  • 1 unit of repeated data from a seemingly random previous audio track (track 37 in this case)
  • 3 more zeroed units

Where is this duplicated unit of track 37 coming from? I tried debugging chdman but couldn't figure out why it ends up with bytes from track 37. My guess is that the asynchronously processed work items share a pool of buffers, and for the last hunk one gets reused that previously held a full 8 units; because it's at the end of the data, it only gets filled partially, leaving the previously written trailing bytes in place.

The audio tracks are each filled from start to end with a single repeated byte, starting with 0x01 0x01 0x01 0x01 ... in the first audio track (track 2, after the data track) and going up to 0x51 0x51 0x51 0x51 ... in the final, 81st audio track (track 82). So it's very easy to see in a hex editor that after what should be the last run of 0x51 bytes, there is suddenly one more block of 0x24 bytes.
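
For anyone who wants to check without a hex editor, here's a rough standalone sketch (not part of chdman, and assuming a CHD written with -c none so the hunk bytes appear verbatim in the file) that prints the first byte of each 2,448-byte unit in a hunk, given the hunk's file offset (8264448 for this input):

```cpp
// Dumps the leading byte of each unit in one hunk of an uncompressed CHD,
// so the stray 0x24 unit stands out among the 0x51 and 0x00 units.
// Error handling is kept minimal for brevity.
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char **argv)
{
    if (argc != 3) { std::fprintf(stderr, "usage: %s file.chd hunk_offset\n", argv[0]); return 1; }
    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    const long UNIT = 2448;              // CD unit: 2,352-byte frame + 96 bytes of subcode
    const int UNITS_PER_HUNK = 8;        // 19,584 / 2,448
    const long hunk_offset = std::atol(argv[2]);  // e.g. 8264448 for this input

    std::vector<unsigned char> unit(UNIT);
    for (int u = 0; u < UNITS_PER_HUNK; ++u)
    {
        std::fseek(f, hunk_offset + u * UNIT, SEEK_SET);
        if (std::fread(unit.data(), 1, UNIT, f) == (std::size_t)UNIT)
            std::printf("unit %d starts with 0x%02x\n", u, unit[0]);
    }
    std::fclose(f);
    return 0;
}
```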

I'm wondering whether the selection of garbage bytes is deterministic across all platforms and versions of chdman.

@schellingb
Author

So I boiled this down to a much simpler case. I generated (many) BIN/CUE files with just one track and X frames full of data, then counted how many units end up in the CHD. Whenever there are more units than frames, I know there's garbage at the end. This made me realize the behavior is fairly simple and consistent:

  • When creating a CHD for a CD-ROM (hunk size is 19,584 bytes and unit size is 2,448 bytes),
  • and writing the header, the metadata, and then more than 256 hunks of data,
  • and the final hunk has 4 or fewer units that map to an actual CD track,
  • then the final hunk will contain the latter half of the hunk from exactly 256 hunks earlier.

So the byte range from EOF - 9792 to EOF is a copy of the range from EOF - 5023296 to EOF - 5013504 (9,792 bytes is half a hunk, and 256 hunks × 19,584 bytes = 5,013,504 bytes).
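
That claim is easy to verify on a given file. A minimal sketch (again assuming an uncompressed CHD written with -c none, so file bytes equal hunk bytes) that compares the last 9,792 bytes with the 9,792 bytes located 5,013,504 bytes earlier:

```cpp
// Checks whether the trailing half-hunk of an uncompressed CHD duplicates
// the corresponding bytes from exactly 256 hunks (5,013,504 bytes) earlier.
#include <cstdio>
#include <cstring>
#include <vector>

int main(int argc, char **argv)
{
    if (argc != 2) { std::fprintf(stderr, "usage: %s file.chd\n", argv[0]); return 1; }
    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    const long HALF_HUNK = 19584 / 2;   // 9,792 bytes: half of a CD hunk
    const long BACK = 256L * 19584;     // 5,013,504 bytes: 256 hunks

    std::vector<unsigned char> tail(HALF_HUNK), older(HALF_HUNK);
    std::fseek(f, -HALF_HUNK, SEEK_END);
    std::fread(tail.data(), 1, HALF_HUNK, f);
    std::fseek(f, -(HALF_HUNK + BACK), SEEK_END);
    std::fread(older.data(), 1, HALF_HUNK, f);
    std::fclose(f);

    const bool same = std::memcmp(tail.data(), older.data(), HALF_HUNK) == 0;
    std::printf("trailing half-hunk %s the bytes from 256 hunks earlier\n",
                same ? "matches" : "differs from");
    return 0;
}
```

Run against an affected CHD, it should report a match.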

Here's the simplest input I could come up with: 2049 frames of audio, which make a CHD with 256 full hunks plus a final 257th hunk whose single real unit is the final frame of the CD track. After that, there's repeated garbage in the latter half of that 257th hunk.

Input: 2049Frames-BINCUE.zip - Output: 2049Frames-CHD.zip
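
For reference, such an input is trivial to regenerate. A minimal sketch (the file names and the 0x01 fill byte are arbitrary choices here) that writes one constant-filled audio track of 2049 raw 2,352-byte frames plus a matching CUE sheet:

```cpp
// Generates a one-track BIN/CUE test input: 2049 audio frames, every byte 0x01.
// 2048 frames fill 256 hunks exactly; the 2049th spills into a 257th hunk.
#include <cstdio>
#include <vector>

int main()
{
    const int FRAMES = 2049;        // 256 full hunks (2048 frames) plus one extra frame
    const int FRAME_BYTES = 2352;   // raw audio sector size in the BIN file

    std::vector<unsigned char> frame(FRAME_BYTES, 0x01);  // constant fill byte

    std::FILE *bin = std::fopen("2049Frames.bin", "wb");
    for (int i = 0; i < FRAMES; ++i)
        std::fwrite(frame.data(), 1, FRAME_BYTES, bin);
    std::fclose(bin);

    std::FILE *cue = std::fopen("2049Frames.cue", "w");
    std::fprintf(cue, "FILE \"2049Frames.bin\" BINARY\n"
                      "  TRACK 01 AUDIO\n"
                      "    INDEX 01 00:00:00\n");
    std::fclose(cue);
    return 0;
}
```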

I tried making a file with 256 hunks and 99 tracks, which makes the metadata section take up more than 9792 bytes, but the final hunk didn't repeat part of the metadata section (which is 5023296 bytes from the end of the file). So it only seems to start happening with 257 or more hunks of data.

I have confirmed this on Win64, Win32, Linux x64 and Linux arm64, and it behaves the same with 0.220, 0.251, 0.267 and 0.272. So overall it's very consistent. Maybe this should then be classified as a "known quirk" instead of a bug that needs fixing, to avoid needlessly changing the checksums of the large number of CHDs that have already been created with this behavior. If written specs for CHD ever exist, I guess this behavior could be documented there.

By the way, I have no idea how this affects non-CD type CHDs.

@FoxhackDN
Contributor

Does extracting a CHD with garbage back to bin/cue keep the garbage at the end?

@schellingb
Author

No it doesn't; the BIN/CUE gets rebuilt exactly like the input, with all the units that map to the actual CD track. The garbage is only in the "unused" area of the final hunk of the CHD. When I made the first post it seemed much more random than it turned out to be. It's still probably unintended behavior that would be cleaner if it had never existed. But with the behavior likely having been there for years, it might be best to just leave it and document it somewhere. The source code is hard for me to read; it would be great if someone who knows it could confirm that the behavior is exactly what I described and that it is consistent across platforms and past (and future) versions:

When there are more than 256 hunks of data, and only half or less of the final hunk is used, the latter half of the hunk from 256 hunks before the final one is repeated.

If that can be confirmed in the source code, we can maybe infer how this affects other non-CD type CHDs, and hopefully sleep well knowing this is consistent and deterministic :-)

@cuavas
Member

cuavas commented Dec 7, 2024

Well, it always has to pad out the CHD file to a whole number of hunks. It isn’t extracting the padding when it extracts the content. The padding shouldn’t be included when calculating digests.

It appears the padding can’t leak any sensitive data because at worst it contains leftover data from earlier in the input.

Maybe it should be zeroed (or we could do something akin to PKCS padding if we want to get fancy), but I don’t think it’s a severe bug.

@schellingb
Author

I'm operating under the assumption that the output of chdman is supposed to be deterministic (at least when not using compression), meaning it should create bit-exact output for the same input no matter what machine or environment it runs on. When I opened the issue I wasn't sure whether the selection of garbage bytes was more random (or dependent on CPU, threads, ...), but further investigation has shown that to be unlikely: I got the exact same 256-hunk-offset data duplication on various platforms and inputs. It would be nice to confirm the behavior in the source code. I can give it another shot; maybe I can find that 256 somewhere in there and figure it out. Then we can at least document this here and close the issue as "known, won't fix" or something.

@ajrhacker
Contributor

This is one of the issues I think I might have fixed in my uncommitted CHD WIP.

@firewave
Contributor

This sounds like an issue I fixed ages ago. I discovered this (among a lot of others) while writing unit tests for chdman. Unfortunately those live outside the tree and were not run in CI, so they bit-rotted badly.

I might be willing to look into resurrecting them. But that would only make sense if we can figure out how to integrate them into the builds.

@schellingb
Author

I still think there's a good argument for leaving this as is, to avoid suddenly creating different but equally valid CHD files. The behavior really does seem consistent and deterministic, and doesn't expose any weird data. I debugged into chdman and realized the garbage bytes in the last hunk coming from 256 hunks earlier are due to this in chd.h:

```cpp
static constexpr int WORK_BUFFER_HUNKS = 256;
```

Given an input CD-ROM where the total number of (padded) frames exceeds 2048 (i.e. more than 256 hunks) and, when divided by 8, leaves a remainder of 4 (which can only be 4 or 0 due to the track padding), the repeated bytes will be present in the latter half of the final hunk.

Besides WORK_BUFFER_HUNKS, this is due to cdrom_file::TRACK_PADDING being 4 and cdrom_file::FRAMES_PER_HUNK being 8 (in cdrom.h). The very last call to chd_file_compressor::async_read then only fills half of the final hunk. If there are 256 or fewer total hunks, the latter half is zeroes thanks to the zeroing of the newly initialized work buffer in chd_file_compressor::compress_begin, so that case is of course already perfectly fine.
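
To illustrate the mechanism, here's a simplified model (not chdman's actual code): the read-ahead uses a ring of WORK_BUFFER_HUNKS hunk-sized buffers that are zeroed only once, each hunk index reuses buffer hunk % 256, and a final partial read leaves the tail of that buffer holding whatever the read 256 hunks earlier put there:

```cpp
// Simplified model of the work-buffer ring: 256 hunk buffers zeroed once at
// startup, reused modulo 256, with a partial final read leaving stale bytes.
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

constexpr long WORK_BUFFER_HUNKS = 256;
constexpr long HUNK_BYTES = 19584;

int main()
{
    // value-initialized: every buffer starts out zeroed, as in compress_begin
    std::vector<std::array<unsigned char, HUNK_BYTES>> ring(WORK_BUFFER_HUNKS);

    // 257 hunks of input where the last hunk is only half filled
    const long total_bytes = 257 * HUNK_BYTES - HUNK_BYTES / 2;

    for (long hunk = 0, offset = 0; offset < total_bytes; ++hunk, offset += HUNK_BYTES)
    {
        auto &buf = ring[hunk % WORK_BUFFER_HUNKS];
        const long avail = std::min(HUNK_BYTES, total_bytes - offset);
        for (long i = 0; i < avail; ++i)
            buf[i] = (unsigned char)(hunk & 0xff);  // stand-in for the real source bytes

        // bytes buf[avail..HUNK_BYTES) are never cleared: they still hold what
        // hunk - 256 left behind, and they get written out with this hunk
        if (avail < HUNK_BYTES)
            std::printf("hunk %ld: tail still holds bytes from hunk %ld (fill 0x%02x)\n",
                        hunk, hunk - WORK_BUFFER_HUNKS, buf[HUNK_BYTES - 1]);
    }
    return 0;
}
```

With 256 or fewer hunks no buffer is ever reused, so the tail of a partial final hunk keeps its initial zeroes, matching the behavior described above.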

So again, I'd argue for keeping this as is to avoid introducing somewhat unnecessary inconsistency. As long as the constants WORK_BUFFER_HUNKS, TRACK_PADDING and FRAMES_PER_HUNK and the CHD version number stay the same, it might be best to just document this quirk somewhere (which this issue ticket kind of is).
