-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chdman leaves trailing garbage in the last hunk after the CD data #13048
Comments
So I boiled this down to a much simpler case. I generated (many) BIN/CUE files with just one track and X frames full of data, then count how many units are in the CHD. Whenever there are more units than frames, I know there's garbage at the end. This made me realize it's somewhat simple and consistent:
So the byte range Here's the simplest input I could come up with, it has 2049 frames of audio that make a CHD with 256 full hunks and then the final 257th hunk with a single unit the is the final frame of the CD track. After that there's repeated garbage in the latter half of that 257th hunk. Input: 2049Frames-BINCUE.zip - Output: 2049Frames-CHD.zip I tried making a file with 256 hunks and 99 tracks which makes the meta section take up more than 9792 bytes, but the final hunk didn't repeat part of the meta info section (which is 5023296 bytes from the end of the file). So it only seems to start happening with 257 hunks of data. I have confirmed this on Win64, Win32, Linux-x64 and Linux-arm64. Also it seems the same with 0.220, 0.251, 0.267 and 0.272. So overall very consistent. This then maybe should be classified as a "known quirk" instead of a bug that needs to be fixed to avoid needless changing of checksums of a large number of CHDs that have already been created with this behavior. If there ever are written specs for CHD, I guess this behavior could be documented there. By the way, I have no idea how this affects non-CD type CHDs. |
Does extracting a CHD with garbage back to bin/cue keep the garbage at the end? |
No it doesn't, the BIN/CUE gets rebuilt exactly like the input, with all the units that map to the actual CD track. The garbage is just in the "unused" area of the final hunk of the CHD. When I made the first post it seemed much more random than how it turned out. It still probably is unintended behavior that would be cleaner if it has never existed. But with the behavior likely having been there for years, it might be best to just leave it and document it somewhere. The source code is hard for me to read, it would be great if someone who knows the source could confirm that it is exactly what I described and that it is consistent across platforms and past (and future) versions. When there are more than 256 hunks of data, and only half or less of the final hunk is used, then there will be a repeat of the latter half of the hunk 256 hunks before the final one. If that can be confirmed in the source code, we can maybe infer to how this affects other non-CD type CHDs and hopefully sleep well knowing this is consistent and deterministic :-) |
Well, it always has to pad out the CHD file to a whole number of hunks. It isn’t extracting the padding when it extracts the content. The padding shouldn’t be included when calculating digests. It appears the padding can’t leak any sensitive data because at worst it contains leftover data from earlier in the input. Maybe it should be zeroed (or we could do something akin to PKCS padding if we want to get fancy), but I don’t think it’s a severe bug. |
I'm operating under the assumption that the output of chdman is supposed to be deterministic (at least when not using compression). Meaning it is supposed to create bit-exact outputs no matter what machine/environment it is being run from (given the same input). When I opened the issue I wasn't sure if the selection of garbage bytes was more random (or dependent on CPU/threads/...). But further investigation has shown that to be less likely, I got the same exact 256 hunk offset data duplication on various platforms and inputs. It would be nice to confirm the behavior in the source code. I can give it another shot, maybe I can find that 256 somewhere in there and figure it out. Then we can document this at least here and close the issue as "known won't fix" or something. |
This is one of the issues I think I might have fixed in my uncommitted CHD WIP. |
This sounds like an issue I fixed ages ago. I discovered this (among a lot of others) while writing unit tests for I might be willing to look into resurrecting them. But that would only make sense if we can figure out how to integrate them into the builds. |
I still think there's a good argument to leave this as is, to avoid suddenly creating different but also-valid CHD files. It really seems consistent and deterministic behavior, not exposing any weird data or anything. I debugged into chdman and realized the use of garbage bytes in the last hunks from 256 hunks ago is due to this in chd.h: static constexpr int WORK_BUFFER_HUNKS = 256; Given an input CD-ROM where the total number of (padded) frames exceeds more than 2048 (more than 256 hunks) but when divided by 8 has a remainder of 4 (can only be 4 or 0 due to the track padding), the repeated bytes will be present in the latter half of the final hunk. Besides So again, I'd argue for keeping this as is to avoid introducing somewhat unnecessary inconsistency. As long as the constants |
I generated a CUE/BIN pair with 1 empty data track and 81 very short audio tracks.
INPUT: 82Tracks-BINCUE.zip
When converting the CUE file with
OUTPUT: 82Tracks-CHD.zip
It generates a seemingly correct .CHD file, but the last hunk (440, mapped to index 422 starting at 8264448) has garbage bytes instead of zeroes after the end of the data. That hunk seems to be made of:
Where is this duplicated unit of track 37 coming from? I tried debugging chdman but couldn't figure out why it ends up with bytes from track 37. My guess would be that the asynchronously processed work items somehow have a bucket of buffers that for the last hunk will reuse something that already had 8 units in it but then because it's at the end of the data it just fills it out partially, leaving the previously written trailing data.
The audio tracks all are filled from start to end with the same byte, starting with
0x01 0x01 0x01 0x01 ...
in the first audio track (track 2 after the data track) going until0x51 0x51 0x51 0x51...
for the final 81st audio track (track 82). So it's very easy to see in a hex editor that after the (supposed to be) last series of0x51
bytes there suddenly is one more block of0x24
bytes.I'm wondering whether the selection of garbage bytes is deterministic across all platforms and versions of chdman.
The text was updated successfully, but these errors were encountered: