This document defines the semantics of a gcsfuse file system mounted with a GCS bucket, including how files and directories map to object names, what consistency guarantees are made, etc.
The following compatibility constraints are worth noting:
-
gcsfuse does not support GCS buckets with object versioning enabled. (The default is to have this disabled.) No guarantees are made about its behavior when used with such a bucket.
-
Reading or modifying a file backed by an object with a
contentEncoding
property set may yield surprising results, and no guarantees are made about the behavior. It is recommended that you do not use gcsfuse to interact with such objects.(See the Object resource and performance tips pages for more info on
contentEncoding
, and this writeup for an explanation of why it can't be reliably supported.)
By default, gcsfuse has two forms of caching enabled that reduce consistency guarantees. They are discussed in this section, along with their trade-offs and the situations in which they are and are not safe to use.
The default behavior is appropriate, and brings significant performance benefits, when the bucket is never modified or is modified only via a single gcsfuse mount. If you are using gcsfuse in a situation where multiple actors will be modifying a bucket, be sure to read the rest of this section carefully and consider disabling caching.
Important: The rest of this document assumes that caching is disabled (by
setting --stat-cache-ttl 0
and --type-cache-ttl 0
. This is not the default.
If you want the consistency guarantees discussed in this document, you must use
these options to disable caching.
The cost of the consistency guarantees discussed in the rest of this document
is that gcsfuse must frequently send stat object requests to GCS in order to
get the freshest possible answer for the kernel when it asks about a particular
name or inode, which happens frequently. This can make what appear to the user
to be simple operations, like ls -l
, take quite a long time.
To alleviate this slowness, gcsfuse supports using cached data where it would
otherwise send a stat object request to GCS, saving some round trips. This
behavior is controlled by the --stat-cache-ttl
flag, which can be set to a
value like 10s
or 1.5h
. (The default is one minute.) Positive and negative
stat results will be cached for the specified amount of time.
--stat-cache-ttl
also controls the duration for which gcsfuse allows the
kernel to cache inode attributes. Caching these can help with file system
performance, since otherwise the kernel must send a request for inode attributes
to gcsfuse for each call to write(2)
, stat(2)
, and others.
Warning: Using stat caching breaks the consistency guarantees discussed in this document. It is safe only in the following situations:
- The mounted bucket is never modified.
- The mounted bucket is only modified on a single machine, via a single gcsfuse mount.
- The mounted bucket is modified by multiple actors, but the user is confident that they don't need the guarantees discussed in this document.
Because GCS does not forbid an object named foo
from existing next to an
object named foo/
(see the Name conflicts section above),
when gcsfuse is asked to look up the name "foo" it must stat both objects.
The stat cache enabled with --stat-cache-ttl
can help with this, but it does
not help fully until after the first request. For example, assume that there is
an object named foo
but not one named foo/
, and the stat cache is enabled.
When the user runs ls -l
, the following happens:
-
The objects in the bucket are listed. This causes a stat cache entry for
foo
to be created. -
ls
asks to stat the name "foo", causing a lookup request to be sent for that name. -
gcsfuse sends GCS stat requests for the object named
foo
and the object namedfoo/
. The first will hit in the stat cache, but the second will have to go all the way to GCS to receive a negative result.
The negative result for foo/
will be cached, but that only helps with the
second invocation of ls -l
.
To alleviate this, gcsfuse supports a "type cache" on directory inodes. When
--type-cache-ttl
is set, each directory inode will maintain a mapping from
the name of its children to whether those children are known to be files or
directories or both. When a child is looked up, if the parent's cache says that
the child is a file but not a directory, only one GCS object will need to be
statted. Similarly if the child is a directory but not a file.
Warning: Using type caching breaks the consistency guarantees discussed in this document. It is safe only in the following situations:
- The mounted bucket is never modified.
- The type (file or directory) for any given path never changes.
GCS has a feature called object versioning that allows buckets to be put into a mode where the history of each object is maintained, even when it is overwritten or deleted. gcsfuse does not support such buckets, and the file system semantics discussed below do not apply to buckets in this mode—the behavior for such buckets is undefined.
GCS object names map directly to file paths using the separator '/'. Object names ending in a slash represent a directory, and all other object names represent a file. Directories are by default not implicitly defined; they exist only if a matching object ending in a slash exists.
This is all much clearer with an example. Say that the GCS bucket contains the following objects:
- burrito/
- enchilada/
- enchilada/0
- enchilada/1
- queso/
- queso/carne/
- queso/carne/nachos
- taco
Then the gcsfuse directory structure will be as follows, where a trailing slash indicates a directory and the top level is the contents of the root directory of the file system:
burrito/
enchilada/
0
1
queso/
carne/
nachos
taco
As mentioned above, by default there is no allowance for the implicit existence
of directories. Since the usual file system operations like mkdir
will do the
right thing, if you set up a bucket's structure using only gcsfuse then you
will not notice anything odd about this. If, however, you use some other tool
to set up objects in GCS (such as the storage browser in the Google Developers
Console), you may notice that not all objects are visible until you create
leading directories for them.
For example, say that you use some other tool to set up a single object named
"foo/bar" in your bucket, then mount the bucket with gcsfuse. The file system
will initially appear empty, since there is no "foo/" object. However if you
subsequently run mkdir foo
, you will now see a directory named "foo"
containing a file named "bar".
gcsfuse supports a flag called --implicit-dirs
that changes the behavior.
When this flag is enabled, name lookup requests from the kernel use the GCS
API's Objects.list operation to search for objects that would implicitly define
the existence of a directory with the name in question. So, in the example
above, there would appear to be a directory named "foo".
The use of --implicit-dirs
has some drawbacks (see issue #7 for a
more thorough discussion):
-
The feature requires an additional request to GCS for each name lookup, which may have costs in terms of request budget and latency.
-
GCS object listings are only eventually consistent, so directories that recently implicitly sprang into existence due to the creation of a child object may not show up for several minutes (or, in rare extreme cases, hours or days). Similarly, recently deleted objects may continue for a time to implicitly define directories that eventually wink out of existence. Even if an up to date listing is seen once, it is not guaranteed to be seen on the next lookup.
-
With this setup, it will appear as if there is a directory called "foo" containing a file called "bar". But when the user runs
rm foo/bar
, suddenly it will appear as if the file system is completely empty. This is contrary to expectations, since the user hasn't runrmdir foo
. -
gcsfuse sends a single Objects.list request to GCS, and treats the directory as being implicitly defined if the results are non-empty. In rare cases (notably when many objects have recently been deleted) Objects.list may return an arbitrary number of empty responses with continuation tokens, even for a non-empty name range. In order to bound the number of requests, gcsfuse simply ignores this subtlety. Therefore in rare cases an implicitly defined directory will fail to appear.
With each record in GCS is stored object and metadata generation numbers. These provide a total order on requests to modify an object's contents and metadata, compatible with causality. So if insert operation A happens before insert operation B, then the generation number resulting from A will be less than that resulting from B.
In the discussion below, the term "generation" refers to both object generation
and meta-generation numbers from GCS. In other words, what we call "generation"
is a pair (G, M)
of GCS object generation number G
and associated
meta-generation number M
.
As in any file system, file inodes in a gcsfuse file system logically contain file contents and metadata. A file inode is initialized with a particular generation of a particular object within GCS (the "source generation"), and its contents are initially exactly the contents and metadata of that generation.
When a file is created anew because it doesn't already exist and open(2)
was
called with O_CREAT
, an empty object with the appropriate name is created in
GCS. The resulting generation is used as the source generation for the inode,
and it is as if that object had been pre-existing and was opened.
Inodes may be opened for writing. Modifications are reflected immediately in
reads of the same inode by processes local to the machine using the same file
system. After a successful fsync
or a successful close
, the contents of the
inode are guaranteed to have been written to the GCS object with the matching
name if the object's generation and meta-generation numbers still matched the
source generation of the inode. (They may not have if there had been
modifications from another actor in the meantime.) There are no guarantees
about whether local modifications are reflected in GCS after writing but before
syncing or closing.
Modification time (stat::st_mtim
on Linux) is tracked for file inodes, and can
be updated in usual the usual way using utimes(2)
or futimens(2)
. When dirty
inodes are written out to GCS objects, mtime is stored in the custom metadata
key gcsfuse_mtime
in an unspecified format.
There is one special case worth mentioning: mtime updates to unlinked inodes may be silently lost. (Of course content updates to these inodes will also be lost once the file is closed.)
There are no guarantees about other inode times (such as stat::st_ctim
and
stat::st_atim
on Linux) except that they will be set to something reasonable.
If a new generation is assigned to a GCS object due to a flush of a file inode, the source generation of the inode is updated and the inode ID remains stable. Otherwise, if a new generation is created by another machine or in some other manner from the local machine, the new generation is treated as an inode distinct from any other inode already created for the object name.
In other words: inode IDs don't change when the file system causes an update to GCS, but any update caused remotely will result in a new inode.
Inode IDs are local to a single gcsfuse process, and there are no guarantees about their stability across machines or invocations on a single machine.
One of the fundamental operations in the VFS layer of the kernel is looking up the inode for a particular name within a directory. gcsfuse responds to such lookups as follows:
- Stat the object with the given name within the GCS bucket.
- If the object does not exist, return an error.
- Call the generation of the object
(G, M)
. If there is already an inode for this name with source generation(G, M)
, return it. - Create a new inode for this name with source generation
(G, M
).
The intent of these conventions is to make it appear as though local writes to
a file are in-place modifications as with a traditional file system, whereas
remote overwrites of a GCS object appear as some other process unlinking the
file from its directory and then linking a distinct file using the same name.
The st_nlink
field will reflect this when using fstat(2)
.
Note the following consequence: if machine A opens a file and writes to it, then machine B deletes or replaces its backing object, or updates it metadata, then machine A closes the file, machine A's writes will be lost. This matches the behavior on a single machine when process A opens a file and then process B unlinks it. Process A continues to have a consistent view of the file's contents until it closes the file handle, at which point the contents are lost.
gcsfuse sets the following pieces of GCS object metadata for file objects:
-
contentType
is set to GCS's best guess as to the MIME type of the file, based on its file extension. -
The custom metadata key
gcsfuse_mtime
is set to track mtime, as discussed above.
gcsfuse directory inodes exist simply to satisfy the kernel and export a way to look up child inodes. Unlike file inodes:
-
There are no guarantees about stability of directory inode IDs. They may change from lookup to lookup even if nothing has changed in the GCS bucket. They may not change even if the directory object in the bucket has been overwritten.
-
gcsfuse does not keep track of modification time for directories. There are no guarantees for the contents of
stat::st_mtim
or equivalent, or the behavior ofutimes(2)
and similar. -
There are no guarantees about
stat::st_nlink
.
Despite no guarantees about the actual times for directories, their time fields
in stat
structs will be set to something reasonable.
Unlike reads for a particular object, listing operations in GCS are
eventually consistent. This means that directory listings in
gcsfuse may be arbitrarily far out of date. Additionally, seeing a fresh
listing once does not imply that future listings will be fresh. This applies at
the user level to commands like ls
, and to the posix interfaces they use like
readdir
.
GCS offers no way to delete an object if and only if other objects don't exist. It is therefore impossible to atomically check whether a directory is empty and delete its backing object.
gcsfuse does the pragmatic thing here: it lists objects with the directory's
name as a prefix, returning ENOTEMPTY
if anything shows up, and otherwise
deletes the backing object. Because listing operations in GCS are only
eventually consistent, this may sometimes cause an ENOTEMPTY
error when unlinking an empty directory, and may sometimes mean that a
non-empty directory is successfully unlinked.
Note that by their definition, implicit directories cannot be empty.
gcsfuse represents symlinks with empty GCS objects that contain the custom
metadata key gcsfuse_symlink_target
, with the value giving the target of a
symlink. In other respects they work like a file inode, including receiving the
same permissions.
gcsfuse offers close-to-open and fsync-to-open consistency. As discussed above,
close
and fsync
create a new generation of the object before returning, as
long as the object hasn't been changed since it was last observed by the
gcsfuse process. On the other end, open
guarantees to observe a generation at
least as recent as all generations created before open
was called.
Therefore if:
- machine A opens a file and writes then successfully closes or syncs it, and
- the file was not concurrently unlinked from the point of view of A, and
- machine B opens the file after machine A finishes closing or syncing,
then machine B will observe a version of the file at least as new as the one created by machine A.
By default, all inodes in a gcsfuse file system show up as being owned by the
UID and GID of the gcsfuse process itself, i.e. the user who mounted the file
system. All files have permission bits 0644
, and all directories have
permission bits 0755
(but see below for issues with use by other users).
Changing inode mode (using chmod(2)
or similar) is unsupported, and changes
are silently ignored.
These defaults can be overriden with the --uid
, --gid
, --file-mode
, and
--dir-mode
flags.
The fuse kernel layer itself restricts file system access to the mounting user (cf. fuse.txt). So no matter what the configured inode permissions, by default other users will receive "permission denied" errors when attempting to access the file system. This includes the root user.
This can be overridden by setting -o allow_other
to allow other users to
access the file system. Be careful! There may be security
implications.
Because GCS offers no way to delete an object conditional on the non-existence
of other objects, there is no way for gcsfuse to unlink a directory if and only
if it is empty. So gcsfuse takes the simple route, and always allows a
directory to be unlinked, even if non-empty. The contents of a non-empty
directory that is unlinked are not deleted but simply become inaccessible—the
placeholder object for the unlinked directory is simply removed. (Unless
--implicit-dirs
is set; see the section on implicit directories above.)
gcsfuse implements requests from the kernel to read the contents of a directory
(as when listing a directory with ls
, for example) by calling
Objects.list in the GCS API. The call uses a delimiter of /
to avoid
paying the bandwidth and request cost of also listing very large
sub-directories.
However, with this implementation there is no way for gcsfuse to distinguish a
child directory that actually exists (because its placeholder object is
present) and one that is only implicitly defined. So when --implicit-dirs
is
not set, directory listings may contain names that are inaccessible in a later
call from the kernel to gcsfuse to look up the inode by name. For example, a
call to readdir(3) may return names for which fstat(2) returns ENOENT
.
It is possible to have a GCS bucket containing an object named foo
and
another object named foo/
:
-
This situation can easily happen when writing to GCS directly, since there is nothing special about those names as far as GCS is concerned.
-
This situation may happen if two different machines have the same bucket mounted with gcsfuse, and at about the same time one creates a file named "foo" and the other creates a directory with the same name. This is because the creation of the object
foo/
is not preconditioned on the absence of the object namedfoo
, and vice versa.
Traditional file systems did not allow multiple directory entries with the same name, so all tools and kernel code are structured around this assumption. Therefore it's not possible for gcsfuse to faithfully preserve both the file and the directory in this case.
Instead, when a conflicting pair of foo
and foo/
objects both exist, it
appears in the gcsfuse file system as if there is a directory named foo
and a
file or symlink named foo\n
(i.e. foo
followed by U+000A, line feed). This
is what will appear when the parent's directory entries are read, and gcsfuse
will respond to requests to look up the inode named foo\n
by returning the
file inode. \n
in particular is chosen because it is not
legal in GCS object names, and therefore is not ambiguous.
gcsfuse files can be memory-mapped for reading and writing using mmap(2). If you make modifications to such a file and want to ensure that they are durable, you must do the following:
-
Keep the file descriptor you supplied to mmap(2) open while you make your modifications.
-
When you are finished modifying the mapping, call msync(2) and check for errors.
-
Call munmap(2) and check for errors.
-
Call close(2) on the original file descriptor and check for errors.
If none of the calls returns an error, the modifications have been made durable in GCS, according to the usual rules documented above.
See the notes on fuseops.FlushFileOp for more details.
Not all of the usual file system features are supported. Most prominently:
-
Renaming directories is not supported. A directory rename cannot be performed atomically in GCS and would therefore be arbitrarily expensive in terms of GCS operations, and for large directories would have high probability of failure, leaving the two directories in an inconsistent state.
-
File and directory permissions and ownership cannot be changed. See the section above.
-
Modification times are not tracked for any inodes except for files.
-
No other times besides modification time are tracked. For example, ctime and atime are not tracked (but will be set to something reasonable). Requests to change them will appear to succeed, but the results are unspecified.