Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: add internal erofs writer code #57

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

allisonkarlitskaya
Copy link
Collaborator

This introduces experimental code for writing erofs images ourselves, instead of using the external mkcomposefs CLI.

It's currently disabled by default. You can test it by setting the COMPOSEFS_FORMAT=new environment variable.

This currently produces a different output than the output of mkcomposefs, which is why it's gated behind an environment variable. The plan is to add a compatibility mode to our internal writer code so that it produces as similar of an output as possible and then switch over to using it once we are convinced that it's equivalent. Then the COMPOSEFS_FORMAT= variable will disable this compatibility mode.

There's also a COMPOSEFS_DUMP_EROFS=1 environment variable (which works with both mkcomposefs and our internal code) which will dump the erofs layout for diffing. There's also a standalone erofs-debug binary that will do the same.

Additionally, this introduces two new files in docs:

  • a detailed description of the parts of erofs that we use
  • a document which attempts to describe the decisions made in creating an erofs composefs image (in terms of which order the files are in, etc).

The main idea here is to start a serious effort towards standardizing the composefs label we want to start adding to container images: it should be possible to define what will be in that label by way of documentation instead of saying "run this software and use the output". These two new documents, taken together with the existing "oci.md" form a rough (and still incomplete) outline for that.

Many thanks to Gao Xiang [email protected] for helping clarify many points about the erofs file format for the documentation.

Closes #56

This introduces experimental code for writing erofs images ourselves,
instead of using the external mkcomposefs CLI.

It's currently disabled by default.  You can test it by setting the
`COMPOSEFS_FORMAT=new` environment variable.

This currently produces a different output than the output of
mkcomposefs, which is why it's gated behind an environment variable.
The plan is to add a compatibility mode to our internal writer code so
that it produces as similar of an output as possible and then switch
over to using it once we are convinced that it's equivalent.  Then the
`COMPOSEFS_FORMAT=` variable will disable this compatibility mode.

There's also a `COMPOSEFS_DUMP_EROFS=1` environment variable (which
works with both `mkcomposefs` and our internal code) which will dump the
erofs layout for diffing.  There's also a standalone `erofs-debug`
binary that will do the same.

Additionally, this introduces two new files in docs:
 - a detailed description of the parts of erofs that we use
 - a document which attempts to describe the decisions made in creating
   an erofs composefs image (in terms of which order the files are in,
   etc).

The main idea here is to start a serious effort towards standardizing
the composefs label we want to start adding to container images: it
should be possible to define what will be in that label by way of
documentation instead of saying "run this software and use the output".
These two new documents, taken together with the existing "oci.md" form
a rough (and still incomplete) outline for that.

Many thanks to Gao Xiang <[email protected]> for helping clarify many
points about the erofs file format for the documentation.

Closes containers#56

Signed-off-by: Allison Karlitskaya <[email protected]>
@allisonkarlitskaya
Copy link
Collaborator Author

cc @hsiangkao

ignores it.
- `XATTR_FILTER` (`0x0004`): set if the xattr bloom filter should be
used. Read about this in the inode section.
* `blkszbits`: log2 of the block size. Better set this to 12 (4096).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, also it may be helpful to indicate that it's useful to make the image mountable on any page sizes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please clarify what you meant by this? Should the block size be larger or smaller than the page size in order to be able to mount it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the modern platforms, the smallist page size is 4KiB. EROFS has supported < page size blocks for all platforms now. So leave it as 4KiB will keep images mountable all the time..

* `blkszbits`: log2 of the block size. Better set this to 12 (4096).
* `root_nid`: the reference to the root inode. See the inodes section for
what that means. Normally inodes are stored in u64, but this is somewhat
randomly a u16, which means that you're gonna need to put the root

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For our initial use cases, __le16 is enough, however for the recent incremental builds, if the root inode is updated and outplaced update is applied, it may not be enough.
Currently, it's been worked around to inplace update the root inode, but I will find a chance to extend it as 64 bits later.

The basic idea is that you find the prefix for your xattr from the list (like
`user.` or `security.`) and then you store only the "suffix" part, along with
the prefix index. If you can't find a prefix, you use 0 (which is conceptually
a prefix of ""). If the prefix matches the entire name then the suffix is `""`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could define 0 as a valid prefix number, although 0 will be ignored and cannot be listed at all in kernel.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good to know. I think I didn't fully understand how the "prefix" part of xattrs is so special when I was writing this...

Note: you really need to do this "compression" step, because it's assumed
during the lookup phase. ie: if we're looking for an xattr `"user.xyz"` then
we'll only consider the entries that have the prefix index for `user.` set on
them. If you didn't properly "compress" your xattr names, they won't be found.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is due to 0 will be always ignored by the kernel.

There's support in the erofs format for custom prefixes. That's when the high
bit of the prefix index is set. These got added circa kernel version 6.4 with
a patch series ending with `6a318ccd7e08` ("erofs: enable long extended
attribute name prefixes") but aren't documented here because we don't use them.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I guess you could consider to use them since this feature is landed before overlayfs data-only layer feature, so if the new on-disk format uses data-only layer feature, I guess you could safely use it too.

For some reason a bit value of 1 here indicates the absence of a particular
xattr, which is opposite to the usual arrangement. You'd think it was for
compatibility, but the filter is only engaged if the feature bit is present in
the superblock.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is due to incremental builds too, because some on-disk inodes could be previously generated and we don't need to touch them to do an incremental build

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right! Makes sense, thanks!

the kernel.
* `mkfs.erofs` also tries to ensure that the inline data ends in the same
disk block as the last byte of the inode metadata (ie: inode header plus
xattrs). This is theoretically not required by the kernel.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but the whole concept of inlined data is to save extra i/o as well as storage. Although the kernel supports it but it's never a design goal of tail inline inodes...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this IO amplification question a bit. I have basically zero understanding of how filesystems actually work in the kernel, so this is possibly quite naive.

Maybe thinking about how much IO you need to do to load a single inode is a bit of a skewed metric. I imagine that (during boot, for example) more or less the entire inode table is going to end up paged in and cached. And in that case the best way to reduce IO would be to reduce the number of pages that need to be loaded, by using the least amount of padding possible, even if it hurts the alignment of a single inode.

Do you have benchmarks or anything like that in this direction?

Copy link

@hsiangkao hsiangkao Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this IO amplification question a bit. I have basically zero understanding of how filesystems actually work in the kernel, so this is possibly quite naive.

Maybe thinking about how much IO you need to do to load a single inode is a bit of a skewed metric. I imagine that (during boot, for example) more or less the entire inode table is going to end up paged in and cached. And in that case the best way to reduce IO would be to reduce the number of pages that need to be loaded, by using the least amount of padding possible, even if it hurts the alignment of a single inode.

Do you have benchmarks or anything like that in this direction?

In principle, the kernel doesn't do any policy strategies, because it might be good for your workload, but not good for others (e.g. users just would like to read specific inodes). So you might win for some cases, but lose for other useful cases. It's not quite something to serve for generic use cases.

EROFS always reads needed block-sized metadata when users request for the best random metadata performance as other kernel fses. if you need to warm up metadata, you could use fadvise (maybe as a seperate thread) in advance.

incorrect symlink detection in fast symlink").
* In general, when faced with the task of writing out an inode with inline
data present, you may need to add padding bytes before the start of the
inode in order to ensure that the inline data falls within a single block.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the negative impacts of tail inline is that
previous in principle you could read 64 (4096/64) inodes at once for each metadata 4k i/o,
now you can only read 1 or 2 inodes if the tail inline is too huge.
It could cause some impacts to ls workload (because it just read metadata), so I tend to just inline small data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty good point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind, though, that for composefs, this is only about directories. We inline symlinks unconditionally but they're usually very small. As for regular files, we only inline those when they're less than 32 bytes: otherwise they're handled as redirects to the data layer.

Being about directories it again turns into a question of the cost of a single operation vs the value of reducing the overall amount of IO (because adjacent operations effectively end up prepopulating the cache for future operations when things are more tightly packed). Imagine the metric is find / instead of ls for example.

Unlike the question of padding between inodes, though, my gut feeling is that the question could go in the other direction here: wasting more space (in the blocks sections) helps keep the inode table smaller, which helps load it faster and keep it in the cache....

This is again where it would be nice to say something about benchmarks...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your gut feeling about the appropriate number for inlining directories? or: at what point do we do another block?

Oh: on that topic there is a slight ambiguity here that I never resolved. Let's say I'm writing a small directory in 100 bytes and I put it into a single block. What is the size field? 100 or 4096? If it was inline it would obviously be 100. This has a potential impact in knowing where the name of the last entry is terminated (end of file vs scanning for nul). I'd say 100 but composefs currently does 4096.

Does it matter at all?

Copy link

@hsiangkao hsiangkao Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind, though, that for composefs, this is only about directories. We inline symlinks unconditionally but they're usually very small. As for regular files, we only inline those when they're less than 32 bytes: otherwise they're handled as redirects to the data layer.

Being about directories it again turns into a question of the cost of a single operation vs the value of reducing the overall amount of IO (because adjacent operations effectively end up prepopulating the cache for future operations when things are more tightly packed). Imagine the metric is find / instead of ls for example.

Unlike the question of padding between inodes, though, my gut feeling is that the question could go in the other direction here: wasting more space (in the blocks sections) helps keep the inode table smaller, which helps load it faster and keep it in the cache....

This is again where it would be nice to say something about benchmarks...

So from the on-disk arrangement, there is no sliver bullet too. But I guess you could also consider fadvise warm up.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your gut feeling about the appropriate number for inlining directories? or: at what point do we do another block?

Oh: on that topic there is a slight ambiguity here that I never resolved. Let's say I'm writing a small directory in 100 bytes and I put it into a single block. What is the size field? 100 or 4096? If it was inline it would obviously be 100. This has a potential impact in knowing where the name of the last entry is terminated (end of file vs scanning for nul). I'd say 100 but composefs currently does 4096.

Does it matter at all?

It should be 100, 4096 might work if the trailing filename has \0 but it's not recommended...

Copy link
Collaborator

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of code here obviously...I only skimmed it. I hope we can keep on iterating towards merging some of this work with the C composefs (especially docs) and aim towards interop.


use anyhow::{bail, Result};

use crate::{dumpfile::write_dumpfile, image::FileSystem};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The imports here seem like a pretty random order...I've more commonly seen

  • std
  • external crates
  • internal modules

image,
};

fn round_up(n: usize, to: usize) -> usize {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one seems like it could use a few unit tests and a doc comment

};

fn round_up(n: usize, to: usize) -> usize {
(n + to - 1) & !(to - 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like at least the systemd equivalent of this does a bit more work to handle overflow

@@ -0,0 +1,276 @@
# Canonical composefs file format
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently more "proposed composefs 1.1 format" per below for now

`libcomposefs` writes 256 files named from `00` to `ff` into the root directory
as character devices with major/minor of (0, 0). Those are overlayfs whiteouts
and they are needed for older versions of overlayfs which don't support "data
only" layers. We don't target these versions, so *we don't add these files*.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and almost everywhere else...you use "we" to mean this codebase only but I think it'd be more helpful for us to consider "we" to also include the C composefs project.

So we'd rephrase this to something more like "1.1 format doesn't have these files" or so?

FileType, InodeXAttrHeader, Superblock, XAttrHeader,
};

fn round_up(n: usize, to: usize) -> usize {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second copy of this?


#[derive(Clone, Copy, Debug, Default, Immutable, IntoBytes, TryFromBytes)]
#[repr(u8)]
pub enum FileType {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use https://docs.rs/rustix/latest/rustix/fs/enum.FileType.html ?

Ohh, is it for the IntoBytes?

#[derive(Debug, Default, Immutable, IntoBytes, KnownLayout, TryFromBytes)]
#[repr(C)]
pub struct Superblock {
// vertical whitespace every 16 bytes (hexdump-friendly)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems misplaced?

}
if offset < data.len() {
println!("{offset:08x} Padding");
let padding = &data[offset..data.len()];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just leave off the data.len() here i.e. .. means ..data.len()

Comment on lines +384 to +386
let entry_data = &self.data
[n * size_of::<DirectoryEntryHeader>()..(n + 1) * size_of::<DirectoryEntryHeader>()];
DirectoryEntryHeader::try_ref_from_bytes(entry_data).unwrap()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be surprised if zerocopy didn't have a way to cast the data to a &[DirectoryEntryHeader]` that we can just index into

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an internal erofs writer implementation
3 participants