Safe filesystem API for a FUSE implementation in Rust

UPDATE: By late February 2024 this discussion had moved away from the Syncer based approach and on to exploring various ways of providing a filesystem on Safe Network for a FUSE based mountable drive, including the potential Safe filesystem APIs and the features they might support.

I’ve summarised the status from my perspective in the following reply:


Syncer

This is a reminder of an earlier discussion and some experiments I did using the old Safe CLI / APIs (remember Safe NFS?) and a very interesting but still experimental Rust project called Syncer. See:

Syncer has some very interesting goals and is structured in a similar way to Safe Network: files are hashed into blobs (chunks) which live on a remote server but are cached locally on disk, and rsync is used to retrieve blobs as needed when reading the filesystem and to push mutations back to the server. A POSIX-like API is used to access this cache, and on top of that sits a FUSE based mountable drive.

The project has been dormant for a long time, but the author claimed very good performance, and its goals are very similar to Safe's: a massive virtual drive allowing access to files as if they were held locally.

I had some success in getting Syncer to mount a local drive connected to a local Safe Network testnet, using the old Safe CLI and Safe NFS API. Those experiments are summarised in the topic above. A problem I had was that Safe NFS had no concept of directories, but I understand directories will be present in the new FoldersAPI, which is just being explored and is based on CRDT Registers (see notes on the early draft PR).

It might be worth looking for synergies between the Syncer codebase and the Safe Network ‘backend’.

Syncer Design

Here’s a diagram I came up with trying to summarise how Syncer worked.


(image from this reply).

Here’s the author’s summary of the design:

Syncer github: GitHub - runfalk/syncer: A filesystem that pretends you have all your files locally while caching only the most recently used

Syncer and Safe Network

I wonder if this could be revisited to provide a shortcut towards a capable, more POSIX-like Safe Network Files API which could easily be used to create a mountable FUSE drive, and to give API level support to numerous other projects.

Limitations of Syncer - inodes

A limitation of the Syncer code is that it uses the high-level FUSE API, which uses temporary inodes and so lacks support for hard links.

For a Safe Network filesystem based on the TreeCRDT, the handling of inodes across multiple concurrent user systems was a problem, and while some stochastic solutions were considered they had their flaws (e.g. some risk of inode collisions in a large filesystem).

This limitation is, I think, solvable using the Register APIs: inodes can be kept on the local system as temporary identifiers for the life of the disk mount, and the Safe CRDT implementation can use 256-bit content hashes with no practical risk of collision.

A POSIX Style FUSE File System on Safe Network

Each client device will manage local inodes (which never cross the Safe APIs), and the underlying FUSE implementation can map between inodes and hashes to implement the low-level FUSE interface on top of a Register CRDT based filesystem. Soft links, hard links and full POSIX-style support become feasible, provided the Register CRDT based filesystem implements support for symbolic links and hard links. I believe this is feasible by using multiple types of metadata chunk: directory, file, symbolic link and hard link.
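
To make that concrete, here is a minimal sketch of the kind of per-mount inode table I have in mind. The names and the NetworkHash type are illustrative placeholders of mine, not an existing Safe or FUSE API.

    // Sketch of a per-mount inode table: local u64 inodes (never sent across the
    // Safe APIs) mapped to 256-bit network hashes, and back. Illustrative only.
    use std::collections::HashMap;

    type NetworkHash = [u8; 32]; // 256-bit content hash / register address

    struct InodeTable {
        next_inode: u64,
        by_inode: HashMap<u64, NetworkHash>,
        by_hash: HashMap<NetworkHash, u64>,
    }

    impl InodeTable {
        fn new() -> Self {
            // inode 1 is conventionally reserved for the FUSE root, so start at 2
            InodeTable { next_inode: 2, by_inode: HashMap::new(), by_hash: HashMap::new() }
        }

        /// Return the existing inode for a hash, or allocate a fresh one.
        /// Inodes live only for the duration of the mount.
        fn inode_for(&mut self, hash: NetworkHash) -> u64 {
            if let Some(&ino) = self.by_hash.get(&hash) {
                return ino;
            }
            let ino = self.next_inode;
            self.next_inode += 1;
            self.by_inode.insert(ino, hash);
            self.by_hash.insert(hash, ino);
            ino
        }

        /// Resolve an inode back to its network hash (e.g. in a FUSE lookup/getattr).
        fn hash_for(&self, ino: u64) -> Option<&NetworkHash> {
            self.by_inode.get(&ino)
        }
    }

    fn main() {
        let mut table = InodeTable::new();
        let dir_hash = [0u8; 32]; // placeholder for a real content hash
        let ino = table.inode_for(dir_hash);
        assert_eq!(table.hash_for(ino), Some(&dir_hash));
        println!("local inode {} -> network hash {:02x?}...", ino, &dir_hash[..4]);
    }

Because every directory entry that resolves to the same metadata chunk maps to the same local inode, hard link semantics fall out of this mapping naturally.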

14 Likes

Syncer seems interesting.

Regarding the registers approach, I wonder if it addresses the fundamental problems of cycle detection and concurrent moves. That was pretty much the point of Kleppmann’s crdt-tree paper.

WHY A MOVE OPERATION IS HARD

Applications that rely on a tree data model often need to move a node from one location to another location within the tree, such that all of its children move along with it:

• In a filesystem, any file or directory can be moved to a different parent directory. Moreover, renaming a file or directory is equivalent to moving it to a new name without changing its parent directory.
• In a rich text editor, a paragraph can be turned into a bullet point. In the XML tree this corresponds to creating new list and bullet point nodes, and then moving the paragraph node inside the bullet point.
• In graphics software, grouping two objects corresponds to creating a new group node, and then moving the two objects into the new group node.

As these operations are so common, it is not obvious why a move operation should be difficult in a replicated setting. In this section we demonstrate some problems that arise with replicated trees, before proceeding to our solution in §3.

2.1 Concurrent moves of the same node

The first difficulty arises when the same node is concurrently moved into different locations on different replicas. This scenario is illustrated in Figure 1, where replica 1 moves node A to be a child of B, while concurrently replica 2 moves A to be a child of C. After the replicas communicate, what should the merged state of the tree be?

If a move operation is implemented by deleting the moved subtree from its old location, and then re-creating it at the new location, the merged state will be as shown in Figure 1a: the concurrent moves will duplicate the moved subtree, since each move independently recreates the subtree in each destination location. We believe that this duplication is undesirable, since subsequent edits to nodes in the duplicated subtree will apply to only one of the copies. Two users who believe they are collaborating on the same file may in fact be editing two different copies, which will then become inconsistent with each other. In the rich text editor and graphics software examples, such duplication is also undesirable.

Another possible resolution is for the destination locations of both moves to refer to the same node, as shown in Figure 1b. However, the result is a DAG, not a tree. POSIX filesystems do not allow this outcome, since they do not allow hardlinks to directories.

In our opinion, the only reasonable outcomes are those shown in Figures 1c and 1d: the moved subtree appears either in replica 1’s destination location or in replica 2’s destination location, but not in both. Which one of these two is picked is arbitrary, due to the symmetry between the two replicas. The “winning” location could be picked based on a timestamp in the operations, similarly to the “last writer wins” conflict resolution method of Thomas’s write rule [8]. The timestamp in this context need not come from a physical clock; it could also be logical, such as a Lamport timestamp [9].

We tested this scenario with file sync products Dropbox and Google Drive by concurrently moving the same directory to two different destination directories. Dropbox exhibited the undesirable duplication behaviour of Figure 1a, while the outcome on Google Drive was as in Figure 1c/d.

2.2 Moving a node to be a descendant of itself

On a filesystem, the destination directory of a move operation must not be a subdirectory of the directory being moved. For example, if b is a subdirectory of a, then the Unix shell command mv a a/b/ will fail with an error. This restriction is required because allowing this operation would introduce a cycle into the directory graph, and so the filesystem would no longer be a tree. Any tree data structure that supports a move operation must handle this case.

In an unreplicated tree it is easy to prevent cycles being introduced: if the node being moved is an ancestor of the destination node, the operation is rejected. However, in a replicated setting, different replicas may perform operations that are individually safe, but whose combination leads to a cycle. One such example is illustrated in Figure 2. Here, replica 1 moves B to be a child of A, while concurrently replica 2 moves A to be a child of B. As each replica propagates its operation to the other replica, a careless implementation might end up in the state shown in Figure 2a: A and B have formed a cycle, detached from the tree.

Another possible outcome is shown in Figure 2b: the nodes involved in the concurrent moves (and their children) could be duplicated, so that both “A as a child of B” and “B as a child of A” can exist in the tree. However, such duplication is undesirable for the same reasons as in §2.1.

In our opinion, the best way of handling the conflicting operations of Figure 2 is to choose either Figure 2c or 2d: that is, either the result of applying replica 1’s operation and ignoring replica 2’s operation, or vice versa. Like in §2.1, the winning operation can be picked based on a timestamp.

As before, we tested this scenario with Google Drive and Dropbox. In Google Drive, one replica was able to successfully sync with the server, while the other replica displayed the “unknown error” message shown in Figure 3.

4 Likes

Good question. Unfortunately the author never responded to the couple of issues I opened and hasn’t touched the repo in about 6 years I think, so I don’t think we can ask.

I didn’t know any Rust when I was looking at it last time, so if I do so again I’ll keep that in mind.

4 Likes

yeah, same question would apply both to the syncer approach and to maidsafe’s register based approach.

As a purist, I always want things to be provably correct. Though I suppose an argument can be made that dropbox and google drive achieved success even with clear concurrency problems.

5 Likes

I suspect we are fine. In SAFE we can move (i.e. add a part of a tree to a new location/dir) but we don’t remove. So the old data is still there, although it may be back in history.

The key element I think is the perpetual part of SAFE and maintaining history.

Also the Registers that contain either links to data_maps (files) or links to more registers (directories) don’t actually contain the data. That is all in chunks on the network.

A nice feature is this: one client app may just make this a standard FS with limited metadata (i.e. the data_map pointed to is a chunk holding metadata plus the data_map itself, which the client app can parse however it wants).

So we could have container based storage, blobs, SOLID, other labelled data and so on. The apps can fight it out over which is best.

What I think will be neat though is to wasm all of this and make things like these filesystem representations, dns, movie players, audio players, tts and stt systems all appear as modules. Then the actual end user apps can collect whichever modules they need and wrap them in the app logic.

8 Likes

So then, concretely, how does the registers based solution deal with the cycle problem illustrated here, where concurrent moves result in A and B pointing at each other and disconnected from the rest of the tree?

It may be that the answer is “It doesn’t and we are ok with that”. But I think it’s good to clarify.

3 Likes

I don’t see the issue of parenting in our setup. We don’t move stuff to a new place as such. We do move stuff, but the old stuff is still there.

The only issue I see is if somebody links to the old location while we were “moving”, but that’s OK as the old location still exists. It’s just in history.

8 Likes

The beauty of a safe COW approach? :cow2::cow::cowboy_hat_face:

1 Like

A way to implement soft and hard links for a FUSE based POSIX style filesystem on top of the Register CRDT. [Added to the OP]

Limitations of Syncer - inodes

A limitation of the Syncer code is that it uses the high-level FUSE API, which uses temporary inodes and so lacks support for hard links.

For a Safe Network filesystem based on the TreeCRDT, the handling of inodes across multiple concurrent user systems was a problem, and while some stochastic solutions were considered they had their flaws (e.g. some risk of inode collisions in a large filesystem).

This limitation is, I think, solvable using the Register APIs: inodes can be kept on the local system as temporary identifiers for the life of the disk mount, and the Safe CRDT implementation can use 256-bit content hashes with no practical risk of collision.

A POSIX Style FUSE File System on Safe Network

Each client device will manage local inodes (which never cross the Safe APIs), and the underlying FUSE implementation can map between inodes and hashes to implement the low-level FUSE interface on top of a Register CRDT based filesystem. Soft links, hard links and full POSIX-style support become feasible, provided the Register CRDT based filesystem implements support for symbolic links and hard links. I believe this is feasible by using multiple types of metadata chunk: directory, file, symbolic link and hard link.

@danda - does this make sense to you?

4 Likes

yes, I think so. local inode: u64 → safe hash: u256.

Inode is identifier for FUSE/OS. Hash is world-wide global identifier for safe network.

Mapping is stored in the inode metadata.

Should be no collisions.

regarding soft and hard links, the sn_fs code implements those, if you need an example.

4 Likes

Some further random thoughts.

I would suggest tackling such a filesystem in phases:

  1. read-only, no caching.
  2. read-only, caching.
  3. read-write, caching (local first).

In each phase, experience and experimental results will be gained that inform the next phase.

Writes will be more complicated than in a typical [networked] filesystem because there is the added complexity of payments, which also requires some sort of wallet access.

Meanwhile, a read-only filesystem is still a useful thing. It would basically function as a local cache of data on the safenetwork, starting from a particular root node.

Mounting

I would think a mount command should look something like:

mount safe://<xorname> /path/to/local/mountpoint

This provides an initial u256 network identifier to act as the root network node, which will be mapped to a local Inode u64 identifier for the local FS root. This is the mapping bootstrap.

In other words, each mount will map a specific network directory as the root of a local filesystem, which will then locally mirror whatever exists as a sub-tree of the root network node.
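
For illustration, here is a minimal sketch of that bootstrap, assuming the xorname is given as 64 hex characters in the safe:// URL; the parsing and constants are my assumptions, not an existing API.

    // Sketch of the mount bootstrap: map the network root (a 256-bit xorname taken
    // from a safe:// URL) to the local FUSE root inode. Purely illustrative.
    const FUSE_ROOT_INODE: u64 = 1;

    type XorName = [u8; 32];

    /// Parse "safe://<64 hex chars>" into a 256-bit xorname (assumed URL form).
    fn parse_safe_url(url: &str) -> Option<XorName> {
        let hex = url.strip_prefix("safe://")?;
        if hex.len() != 64 || !hex.is_ascii() {
            return None;
        }
        let mut name = [0u8; 32];
        for i in 0..32 {
            name[i] = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).ok()?;
        }
        Some(name)
    }

    fn main() {
        // e.g. mount safe://<xorname> /path/to/local/mountpoint
        let url = "safe://".to_owned() + &"ab".repeat(32); // placeholder xorname
        let root = parse_safe_url(&url).expect("valid safe:// URL");
        // The bootstrap mapping: network root node <-> local inode 1.
        println!("network root {:02x?}... maps to local inode {}", &root[..4], FUSE_ROOT_INODE);
    }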

This probably works best with the proposed register filesystem. It could perhaps be adapted to the existing FileContainer filesystem with the addition of some logic to handle safe-url paths when mounting, but syncing might prove challenging.

Synchronizing

With the TreeCRDT approach in sn_fs, sync was a built-in property of the data structure. All operations are logged, append-only. They can be sent to the other side and each op applied remotely, resulting in the same final state for the entire tree.

Something similar may be workable with the Register CRDT API. The FUSE filesystem will need a network (register) API to request all log operations since a specific clock time, where the clock is native to the register CRDT, not realtime. I don’t know if such an API presently exists or not.

Also, iiuc, the register based FS would have one register per filesystem node (directory, file, or link), so there is probably one log per register. This complicates things compared to the TreeCRDT, where there is just one log for the entire tree. So that seems an important area of API coordination between the network nodes and the client FUSE FS: the client basically needs a way to tell the server, “give me all the updates for every register in the sub-tree of the root register”. But registers are stored on different nodes, so the client must fetch recursively. And if the client already has a tree, then it effectively needs to provide a clock-time for each tree node, which could be a lot of data.

The devil may be in the details here. There will likely be a sweet spot for the size of trees that can be synced in a reasonable time. @anselme might have further ideas/insights, or @davidrusu.
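
To show the shape of the API I mean, here is a rough sketch of that recursive fetch against a hypothetical client trait. get_ops_since, the op types and the per-register clocks are all assumptions of mine, not anything that exists today.

    // Sketch only: recursively pull register ops for a filesystem sub-tree.
    // RegisterClient, RegisterOp and the per-register clocks are hypothetical.
    use std::collections::HashMap;

    type RegisterAddr = [u8; 32];
    type Clock = u64; // logical clock native to the register CRDT, not realtime

    enum RegisterOp {
        AddChild { child: RegisterAddr }, // e.g. a subdirectory's register
        SetMetadata { bytes: Vec<u8> },   // file/link metadata chunk reference
    }

    trait RegisterClient {
        /// Hypothetical API: all ops applied to `addr` after `since`.
        fn get_ops_since(&self, addr: RegisterAddr, since: Clock) -> Vec<(Clock, RegisterOp)>;
    }

    /// Walk the sub-tree rooted at `root`, asking each register for ops newer than
    /// the clock we last saw for it. `clocks` is the per-register state the client
    /// must keep, which is the extra cost compared to a single-log TreeCRDT.
    fn sync_subtree<C: RegisterClient>(
        client: &C,
        root: RegisterAddr,
        clocks: &mut HashMap<RegisterAddr, Clock>,
    ) {
        let mut pending = vec![root];
        while let Some(addr) = pending.pop() {
            let since = *clocks.get(&addr).unwrap_or(&0);
            let mut latest = since;
            for (clock, op) in client.get_ops_since(addr, since) {
                latest = latest.max(clock);
                if let RegisterOp::AddChild { child } = op {
                    pending.push(child); // recurse into newly discovered registers
                }
                // apply the op to the local FUSE view here
            }
            clocks.insert(addr, latest);
        }
    }

    struct NullClient; // stand-in: a real client would query the network

    impl RegisterClient for NullClient {
        fn get_ops_since(&self, _addr: RegisterAddr, _since: Clock) -> Vec<(Clock, RegisterOp)> {
            Vec::new()
        }
    }

    fn main() {
        let mut clocks = HashMap::new();
        sync_subtree(&NullClient, [0u8; 32], &mut clocks);
        println!("synced {} register(s)", clocks.len());
    }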

I’m uncertain how sync would be performed using the present FilesContainer API. It seems, umm, tricky.

edit: ps: and I still think cycles are a problem that really can’t be solved with registers. ie, it is possible that the state of the tree as a whole will diverge between network and client, and convergence is supposed to be a guaranteed property of a CRDT. The point of the tree-crdt paper was basically solving this problem. In other words, the register CRDTs individually will converge, but not necessarily the tree as a whole, which exists “outside” of any single CRDT. maybe I’m wrong, or it’s not important? :wink:

8 Likes

Going a bit further with the “ps:” in the last comment, I still think a better (more correct and efficient?) model would be to make tree-crdt a native safe-network type, as register is.

The filesystems could be truly modelled as a synchronizable tree. The tradeoff is that each tree has a fixed root and is basically its own graph. Huge trees probably wouldn’t be such a good idea. Linking between trees would also be an unsolved problem.

Also, research would be needed to make a much more efficient tree-crdt type than my crude prototype. There’s been a lot of crdt optimization in recent years, so perhaps some of that could be applied. If I were maidsafe, I would hire or contract with an expert for exactly this purpose, and also another filesystem expert. Two such domain experts could probably make something kick-ass.

I don’t think any of that will happen though. At least not now. just my $.02 from the sidelines.

If the network works well enough in its core functionality, perhaps such a sub-project will happen eventually anyway. So let’s call this future prognostication.

4 Likes

Thanks very much for your thoughts and suggestions, all very helpful.

There’s a lot to get my head around so I will definitely be breaking the process and design up into components and building most of them in stages, swapping out temporary placeholders when new parts are ready and so on.

My current view on Registers is one per directory, each with a root node. The children of that root node include one entry per directory entry, plus the preceding root node, whose children are the entries from the previous state, and so on. So the history is retained through the trail from one root node to the next.
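
Here is a toy model of that scheme, with plain Rust structs standing in for register entries; the real Register API will differ, this is just to show the chaining of root nodes.

    // Toy model (not the real Register API) of one register per directory, where
    // each new "root" entry points at the current directory entries plus the
    // previous root, so the history forms a chain.
    #[derive(Clone, Debug)]
    enum DirEntry {
        File { name: String, data_map: [u8; 32] },   // pointer to a file's data_map
        Subdir { name: String, register: [u8; 32] }, // pointer to another directory register
    }

    #[derive(Debug)]
    struct DirRoot {
        entries: Vec<DirEntry>,         // the directory's current contents
        previous: Option<Box<DirRoot>>, // the preceding root node, i.e. the earlier state
    }

    impl DirRoot {
        /// Produce a new root on top of the old one (e.g. after an add or rename).
        fn update(self, entries: Vec<DirEntry>) -> DirRoot {
            DirRoot { entries, previous: Some(Box::new(self)) }
        }

        /// Walk back through the history, newest state first.
        fn history(&self) -> Vec<&DirRoot> {
            let mut out = vec![self];
            let mut cur = self;
            while let Some(prev) = cur.previous.as_deref() {
                out.push(prev);
                cur = prev;
            }
            out
        }
    }

    fn main() {
        let v1 = DirRoot { entries: vec![], previous: None };
        let v2 = v1.update(vec![DirEntry::File { name: "notes.txt".into(), data_map: [0; 32] }]);
        println!("{} versions in history", v2.history().len()); // prints 2
    }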

Using that approach I also foresee being able to mount by specifying the xor address for a register, and an optional hash if you want to choose a point in the history. I’m still interested to hear whether that’s the way to support access to register histories, but so far I can’t see an alternative.

Today I’ve identified the major components and begun to define the interfaces, and stages in their development. I’m not sure how feasible it will be for me to develop all of this, but I’ll at least be able to share a design and outline the capabilities of the different parts.

The good thing is that significant parts can be well defined using existing Rust crates/traits (such as std::fs and easyfuse), making it straightforward to develop them independently, so others could contribute if it generates interest. I’m also hopeful that as MaidSafe develop their own filesystem, some of the SN APIs can fill in some areas, but that will depend very much on how much programmatic support they implement for file access. There may also be old MaidSafe code that is still applicable.

A lot to look at!

6 Likes

btw: There’s an important distinction between using Registers vs tree-crdt as native datatype for a filesystem.

The way I envision it, the tree-crdt would be stored on a single node. A register is also stored on a single node.

The difference is that the tree-crdt would contain all entries and history for an entire filesystem tree. This makes it quite efficient for a client node to synchronize.

With registers, each filesystem node (directory, symlink, hardlink) is in its own register, and each register is typically stored on a different network node. This puts a lot more burden on the client and the network, as the client must recursively fetch log operations for each register. And similarly for writing, though only for filesystem nodes that changed.

I think this is the first time this distinction has been discussed.

edit: it could be possible (future enhancement) for a tree-crdt to be split across nodes, eg living in a few chunks as it grows beyond what a single node could accommodate. But even then, it should be split into just a few pieces, rather than potentially thousands or tens of thousands.

6 Likes

@happybeing I haven’t thought this all the way through yet, but building on my previous comment, it seems possible to simply use the safe-network’s present chunk-storage capability to create a local-first FUSE crdt-tree filesystem with no extra network support.

At simplest, one could take the existing sn_fs and write the logs to local disk. Remember these logs are append-only. When the log size reaches a multiple of the network chunk size[1], write it to the network as a “versioned file”, which is basically just a sequence of chunks.

With this information, any client can take a safeurl argument for the location of the “file” and easily retrieve the entire append-only log, which may be spread across multiple chunks.

voila: we have just synced filesystems from client A to client B.

Both A and B should have the full log locally, which they can use for faster re-mounts.
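
A minimal sketch of the local log handling I mean, assuming a ~1 MiB chunk size; the upload step is just a placeholder for whatever mechanism is used (e.g. safe-cli, as mentioned below), and none of this is the real sn_fs or Safe API.

    // Sketch of the append-only op log: ops are appended locally, and once the log
    // has grown past another chunk-sized boundary the new full chunks are pushed to
    // the network (here just a print statement). Not the real sn_fs or Safe API.
    use std::fs::{File, OpenOptions};
    use std::io::{Read, Seek, SeekFrom, Write};

    const CHUNK_SIZE: u64 = 1024 * 1024; // assumed ~1 MiB network chunk size

    struct OpLog {
        file: File,
        uploaded: u64, // how many bytes have already been written to the network
    }

    impl OpLog {
        fn open(path: &str) -> std::io::Result<OpLog> {
            let file = OpenOptions::new().create(true).append(true).read(true).open(path)?;
            Ok(OpLog { file, uploaded: 0 })
        }

        /// Append one serialized filesystem op (Write, Move, Mkdir, ...).
        fn append(&mut self, op_bytes: &[u8]) -> std::io::Result<()> {
            self.file.write_all(op_bytes)?;
            let len = self.file.metadata()?.len();
            // Only push full chunks; a partial tail chunk would waste paid space.
            if len - self.uploaded >= CHUNK_SIZE {
                let from = self.uploaded;
                let to = (len / CHUNK_SIZE) * CHUNK_SIZE;
                self.upload_range(from, to)?;
                self.uploaded = to;
            }
            Ok(())
        }

        /// Placeholder for the real upload, e.g. appending to the "versioned file"
        /// via the FilesContainer, or shelling out to safe-cli.
        fn upload_range(&mut self, from: u64, to: u64) -> std::io::Result<()> {
            let mut buf = vec![0u8; (to - from) as usize];
            self.file.seek(SeekFrom::Start(from))?;
            self.file.read_exact(&mut buf)?;
            println!("would upload log bytes {}..{} to the network", from, to);
            Ok(())
        }
    }

    fn main() -> std::io::Result<()> {
        let mut log = OpLog::open("snfs-oplog.bin")?;
        log.append(b"example serialized op")?;
        Ok(())
    }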

A limitation of this system is that only client A can write. B and all others are read-only[2], so they should not be permitted to perform local writes, or that confuses the log. But this limitation is really just a matter of network permissions, and I’m uncertain what the present status is there, with regards to eg group write perms. With group write perms, multi-party writes should “just work”.

I may be forgetting about something, but so far it seems super simple and workable. To me, a lot more workable than the registers approach, but time will tell for that. The hardest part will likely be making the payments for writes. In a first prototype, that could probably be farmed out to safe-cli, ie just call safe-cli put <snfs-log-file>[3], either from within the mounted process, or possibly from an external daemon.

[1] logs may be written to network more frequently, though it wastes space to write less-than-full chunks.

[2] it shouldn’t actually be any problem for B to mount read/write. They will just be working in a (local-first) mode where they can incorporate updates from the network but can never write back their changes.

[3] Given that the log file is append-only, I believe the client will only actually upload the latest chunks that do not already exist in the network, which is a very nice property. This also avoids having to use lower-level APIs to write the chunks + chunk-list + payment ourself.

5 Likes

After mulling overnight and a couple of hours on the sofa going around designs, problems, limitations etc I think there’s a workable approach.

First though, I think it is hard, if not impossible, to implement a versioned traditional filesystem using the Register CRDTs to represent the tree; at least not anything that seems practical and sensible, and not ridiculously complex, limited or broken.

Again I’d love to be contradicted about that and shown how it can work, but it’s beyond me to implement a versioned tree with useful qualities that way, even without hard links.

So I understand why you suggest the TreeCRDT. I’m inclined to agree, but also think there’s a useful simpler step that can be taken before and towards that.

A Versioned POSIX Style Filesystem

I’m thinking: use one Register to track versions of a single object that holds the full filesystem tree (ie a tree of directories, plus pointers to data_maps for the files in each directory) and that is serialised to disk as a single file.

This can be stored on the local disk as encrypted chunks for security along with the chunks for any files that have not yet been uploaded, as well as any cached chunks. The Register holds a single history of pointers to different versions of that immutable filesystem tree object.

The tree object is a snapshot of the state of the whole filesystem, but it doesn’t need to store the content of any files, so it is compact. That makes operations equivalent to a full backup, or to publishing the entire tree, very cheap and efficient. For example, the tree for thousands of files could fit in a 1MB chunk, so publishing or backing up would cost one chunk and one Register update.

The Register is updated each time any modified files and the filesystem tree are uploaded, with the pointer to the latest tree written on top of the previous one.
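
As a rough sketch of the shapes involved (plain Rust standing in for the real Safe types, with serialisation, self-encryption and upload all elided):

    // Toy sketch of the versioned snapshot idea: the whole directory tree (no file
    // content, just data_map pointers) is one object, and a single Register holds
    // the chain of pointers to successive versions of it. Illustrative only.
    use std::collections::BTreeMap;

    type ChunkAddr = [u8; 32]; // content address of an uploaded immutable chunk

    #[derive(Debug, Clone)]
    enum Node {
        File { data_map: ChunkAddr },
        Symlink { target: String },
        Dir { children: BTreeMap<String, Node> },
        // hard links could be modelled as a separate table of link targets
    }

    #[derive(Debug, Clone)]
    struct FsSnapshot {
        root: Node,
    }

    /// Stand-in for the version Register: an append-only list of snapshot addresses.
    #[derive(Default)]
    struct VersionRegister {
        versions: Vec<ChunkAddr>,
    }

    impl VersionRegister {
        /// "Upload" a snapshot and record its address as the latest version.
        fn publish(&mut self, snapshot: &FsSnapshot) -> ChunkAddr {
            let addr = fake_content_address(snapshot); // real code: serialise, self-encrypt, upload
            self.versions.push(addr);
            addr
        }

        fn latest(&self) -> Option<&ChunkAddr> {
            self.versions.last()
        }
    }

    /// Placeholder: a real implementation would hash the serialised snapshot.
    fn fake_content_address(snapshot: &FsSnapshot) -> ChunkAddr {
        let mut addr = [0u8; 32];
        addr[0] = format!("{:?}", snapshot).len() as u8; // obviously not a real hash
        addr
    }

    fn main() {
        let mut register = VersionRegister::default();
        let snapshot = FsSnapshot { root: Node::Dir { children: BTreeMap::new() } };
        register.publish(&snapshot);
        println!("versions so far: {}", register.versions.len());
        println!("latest: {:02x?}...", &register.latest().unwrap()[..4]);
    }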

Versioning is therefore trivial, implementation is very much simpler, and your filesystem can support any features you want including soft and hard links.

What’s tricky is merging changes, but this design allows the TreeCRDT to be slotted in later in place of the single snapshot tree if that functionality is wanted, perhaps at the expense of hard links. But if people are working on different parts of the tree, merging might still be possible and even automatic for many use cases.

So you could support different styles of working with different features such as collaboration without hard links, single author publishing with hard links, and easy versioning in either case. Both very fast and cheap.

I’m still chewing this over but it seems a good first step that creates a truly versioned filesystem that can start as one person r/w with many people read-only, but is also a step towards collaboration (many people r/w).

Fingers crossed there might already be useful code based on an in memory filesystem for example, that could help get to a proof-of-concept relatively quickly.

Any thoughts?

6 Likes

I think it sounds workable if you can truly limit it to a single writer, but that may not be possible.

A single owner could mount the filesystem in 2 or 100 different places and be updating all of them at the same time. In other words, I think it would be possible to create cycles, whether deliberately or by accident. And it’s also unclear to me how each client merges in changes from the master network copy and deals with conflicts.

Eg, let’s say Bob is running two instances. So both can write. And both need to somehow read each other’s changes and merge them in, regularly. Now consider (a) conflicts, and (b) the cycles example I posted earlier. I think you would do well to analyze what the outcomes will be with this design, and whether you consider them acceptable.

4 Likes

One additional observation about the sn_fs approach I posted above. Recall that in the simplest implementation, the logs are simply stored as a single logical file using the existing safe-network FilesContainer, which will auto-chunk the file. This log actually contains the content of all files as part of the Write operation.

This means that the individual files within the tree crdt would not be accessible individually within the safe network, ie each file is not split into its own chunks. Rather, the log as a whole is split into chunks, and thus the only way to extract individual files is to read the entire log, applying ops until reaching the final state.
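
In code, extracting one file then looks roughly like this, with illustrative op types rather than the real sn_fs ops:

    // Illustrative only: because the whole tree lives in one op log, reading a
    // single file means replaying every op to rebuild the final state first.
    use std::collections::HashMap;

    enum Op {
        Mkdir { path: String },
        Write { path: String, content: Vec<u8> },
        Remove { path: String },
    }

    /// Replay the entire log into a path -> content map, then look up one file.
    fn extract_file(log: &[Op], wanted: &str) -> Option<Vec<u8>> {
        let mut files: HashMap<String, Vec<u8>> = HashMap::new();
        for op in log {
            match op {
                Op::Mkdir { .. } => {}
                Op::Write { path, content } => {
                    files.insert(path.clone(), content.clone());
                }
                Op::Remove { path } => {
                    files.remove(path);
                }
            }
        }
        files.remove(wanted) // the final state of just the file we asked for
    }

    fn main() {
        let log = vec![
            Op::Mkdir { path: "/docs".into() },
            Op::Write { path: "/docs/a.txt".into(), content: b"v1".to_vec() },
            Op::Write { path: "/docs/a.txt".into(), content: b"v2".to_vec() },
        ];
        // Even for one small file, the whole log has to be replayed:
        let content = extract_file(&log, "/docs/a.txt").unwrap();
        assert_eq!(content, b"v2".to_vec());
        println!("extracted {} bytes", content.len());
    }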

This is analogous to an .iso file, where a single file in filesystem A (eg ext4) contains all directory and file content of filesystem B (cdrom, iso9660), which can be independently mounted. So here, the FilesContainer FS is A, and sn_fs is B.

edit: filesystem A could also be the register-based proposal @happybeing is contemplating or whatever @bochaco is cooking up. sn_fs would not care, just as a .iso file can exist on ext2, ext4, btrfs, xfs, zfs, reiserfs, ntfs, fat, etc.

Of course, there are many ways to design, optimize, complicate the scheme, with various tradeoffs. But I figured that is a non-obvious consideration worth pointing out.

6 Likes

Thanks all useful points.

This is very early stage of course so I’ll continue working and think about your points and criticisms.

5 Likes

What I am seeing is this (it also relates to the other question Safe API - Registers - #89 by happybeing )

  • A Register Entry holds a Key
  • That Key points to a Value (a chunk)
  • That chunk holds metadata:
    • File structs with name/size/data_map etc.
    • Links to another Directory (Register)

When ANY entry in the Directory (Register) changes, a new RegisterEntry is written.

This can happen with concurrent access and create a fork in the register. This means two separate entities had write permissions and wrote to the register (Directory) at the same time. This is OK, as users need to merge that and registers allow this merge (a new RegisterEntry is written which becomes the latest entry, with both of the forks as its ancestors).
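
A toy model of that fork and merge, with plain structs standing in for register entries (not the real Register API):

    // Toy model of a directory Register fork and its merge: entries form a DAG,
    // and a merge is just a new entry whose ancestors are both fork tips.
    #[derive(Debug)]
    struct Entry {
        key: [u8; 32],       // points at a metadata chunk (files, subdir links, ...)
        parents: Vec<usize>, // indices of ancestor entries within this register
    }

    #[derive(Default, Debug)]
    struct DirRegister {
        entries: Vec<Entry>,
        heads: Vec<usize>, // current branch tips; more than one means a fork
    }

    impl DirRegister {
        /// Write a new entry on top of the given ancestors. If two writers do this
        /// concurrently from the same ancestor, the register ends up with two heads.
        fn write(&mut self, key: [u8; 32], parents: Vec<usize>) -> usize {
            let idx = self.entries.len();
            self.entries.push(Entry { key, parents: parents.clone() });
            self.heads.retain(|h| !parents.contains(h));
            self.heads.push(idx);
            idx
        }

        /// Resolve a fork: the user merges the two directory states into one
        /// metadata chunk and writes it with both fork tips as its ancestors.
        fn merge(&mut self, merged_key: [u8; 32]) -> usize {
            let tips = self.heads.clone();
            self.write(merged_key, tips)
        }
    }

    fn main() {
        let mut dir = DirRegister::default();
        let root = dir.write([0; 32], vec![]);
        // Two clients write concurrently on top of the same entry -> a fork:
        dir.write([1; 32], vec![root]);
        dir.write([2; 32], vec![root]);
        assert_eq!(dir.heads.len(), 2);
        // A later write merges the fork by naming both tips as ancestors:
        dir.merge([3; 32]);
        assert_eq!(dir.heads.len(), 1);
        println!("{} entries, {} head(s)", dir.entries.len(), dir.heads.len());
    }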

I am not seeing the issue, but I could be totally missing something here.

8 Likes