As a compromise - maybe you tell us if you make it through the voting..?
So we have an additional motivation to throw votes at your app..?
Will this be similar to Syncthing?
Yes, very similar.
I actually think it’s spoken like rhythm - just without the t and an n instead of an m
I think @Southside found one of your ads that are floating around the internet
I should perhaps add that in the process of developing Ryyn, two independent libraries will be developed, which I could have entered as independent projects in IF (but didn’t think of).
These are Autonomi-backed storage libraries:
Later these will have bindings to a few of the most used languages.
For those not into the technical side of things, these are sorts of databases suited for different types of use cases/applications. They will be usable by developers when building software.
So Ryyn is actually 3 projects in 1.
Will extract those after everything is wired up and ready for an alpha release.
I thought I’d give a brief overview of the differences. It could go deeper, but I’m keeping it short for now.
I’m continuously looking for inspiration and learning from projects like these.
The reason is also to shine some light on what it actually is that Ryyn does, what makes its internals stand out and how it leverages Autonomi to do this (this part is more implied than directly explained here, will have to wait).
I know this is a bit too abstract and technical for many; I just didn’t have more time for this in particular. I am working on documentation in parallel with coding, and most work is on the technical, conceptual level, because everything user-facing just requires a few lines, so far.
For a user doing basic sync, these two systems look almost the same. Add a file and it appears elsewhere. Modify a file and the change propagates. Delete a file and the deletion propagates.
Superficially, Ryyn and SyncThing behave identically.
But when conditions get slightly more difficult, Ryyn’s choice of system immediately stands out.
A brief comparison:
| Scenario | SyncThing | Ryyn |
|---|---|---|
| Two edits on disconnected devices | Conflict copy, unstructured | Two causal forks, trackable |
| Accidental deletion | Gone | Recoverable from DAG |
| Reinstall, DB wipe, device loss, crash | Treated as new device | Identity history persists, pick up where you left off |
| File changes back and forth | Overwrites or oscillates | Multiple versions stored, switchable |
| Merge required | Impossible | Expressible by diff ancestry |
| Data loss possible? | Yes, on overwrite | No, by design |
Ryyn is a multi-writer, eventual-consistency system, tracking ancestry with inbuilt preservation of all forks. The system creates a bounded DAG of known, ordered versions per file, per replica.
It is fundamentally more powerful than SyncThing. It’s an entirely different system class, doing distributed state tracking instead of a snapshot/file-copy model.
Ryyn has a selective materialization model, not just “sync latest”, but “materialize version X” (choose version for use), since in contrast to SyncThing, Ryyn treats replicas as causally constrained emitters, not naive writers.
Conceptually, Ryyn’s internal cost model is higher, in order to give a better and expanded user experience. SyncThing has a shallower internal logic and a conceptually lower internal cost, but becomes more fragile in correctness in edge cases.
(Note: This says nothing about the actual codebase which can be inflated, complicated, messy and vice versa regardless of conceptual internal cost model. Idea vs implementation.)
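To make that a bit more concrete, here is a rough sketch of what a per-file version DAG could look like. All names and fields are illustrative, not Ryyn’s actual types.

```rust
use std::collections::HashMap;

/// Content address of one stored version (illustrative).
type VersionId = [u8; 32];

/// One node in the per-file version DAG: a version plus its causal parents.
struct VersionNode {
    id: VersionId,
    parents: Vec<VersionId>, // two+ parents = a merge; siblings = a fork
    author_replica: String,  // which replica emitted this version
    timestamp: u64,
}

/// The bounded DAG of known, ordered versions for a single file.
struct FileHistory {
    nodes: HashMap<VersionId, VersionNode>,
    heads: Vec<VersionId>, // current forks; a single head means converged
}

impl FileHistory {
    /// "Materialize version X": pick any known version, not just the latest.
    fn select(&self, id: &VersionId) -> Option<&VersionNode> {
        self.nodes.get(id)
    }

    /// Concurrent edits never overwrite each other; they just add heads.
    fn record(&mut self, node: VersionNode) {
        self.heads.retain(|h| !node.parents.contains(h));
        self.heads.push(node.id);
        self.nodes.insert(node.id, node);
    }
}
```

That’s the whole point of the table above: forks, deletions and old versions stay in the graph, so nothing is silently overwritten.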
One thing I am avoiding on purpose is a multitude of user input options (and non-options). More fine-grained control can be implemented later, but I’ve estimated that the largest need and most powerful tool for a broad user base is pure and simple fire-and-forget backup of folders.
I imagine that another difference would be that SyncThing needs less maintenance (unless you lose data and have to redo your changes), while Ryyn will need merging, recovering etc., right? So, SyncThing requires you to stick to some rules to keep your data consistent, and it’s probably not suitable for certain use cases where you cannot do this, while Ryyn keeps you covered, but needs your attention to recover from disaster?
Question 2 – could Ryyn be used as a simple Version Control System? Is it sitting conceptually between Git and SyncThing?
This is awesome @oetyng and shows why I said what I did earlier. @Erwin you need to read this!
It is a VCS, or will be once someone builds a UI on it.
I don’t see any way in which SyncThing is going to be better than this, because if Edward implements the underlying data model correctly (as he surely will), you can build VCS, backup, anything you want on top - including something that behaves like SyncThing (i.e. simple) but never loses data, should you need to hit the “Advanced recovery” button.
Awesome stuff!
No, I wouldn’t say so. Recovery steps are practically as simple as starting a new clone.
This is because the merging is happening all the time. The replicas in a sync group (devices watching the same folder) periodically look for new events in each other’s streams, and merge them right then and there (if participating with pull_enabled), so a converging stream is always held by every replica. The sync group converges on the same state in basic configuration (more on advanced further down).
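Roughly, one tick of that continuous pull-and-merge could be sketched like this (all names are illustrative placeholders, not the actual code):

```rust
/// Placeholder for one fs-change event in a replica's stream.
struct Event;

/// Placeholder for a remote replica's event stream.
struct PeerStream;

impl PeerStream {
    /// Return events appended since the given cursor position.
    fn events_since(&self, _cursor: u64) -> Vec<Event> {
        Vec::new()
    }
}

/// One periodic tick: pull new events from every peer in the sync group
/// and fold them into the local, converging stream.
fn sync_tick(pull_enabled: bool, peers: &[PeerStream], cursor: u64) {
    if !pull_enabled {
        return; // emitter-only and archive replicas don't pull
    }
    for peer in peers {
        for event in peer.events_since(cursor) {
            merge_event(event);
        }
    }
}

fn merge_event(_event: Event) {
    // append/merge into the local stream; all full participants converge on the same state
}
```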
On disaster, the disk is assumed to be wiped, and the steps are just as when starting from scratch: a selected folder is cloned.
So, on disaster (or generally when starting anew), the user:
This is the default path from first iterations of Ryyn.
In later iterations, when selecting a sync group, if the user only wants a specific replica within the sync group, they clone only that one. It may have been an emitter-only replica (push_enabled && !pull_enabled), or an archive (!push_enabled && !pull_enabled), and thus the user only wants the specific event stream it held, not the latest or more progressed streams.
First steps of implementation will not delve deep into those features, but they are there and enabled, ready to be built. I.e. first implementations will use full participants (push_enabled && pull_enabled) by default and have a default sync behaviour/strategy. From there it’s ready for expanding into the full feature set.
The merge model is also simple to begin with (though still preserving all data, fully DAG-compliant), where DAG features are prepared for but will not be developed, and thus not exposed to users, in earlier iterations. (With that I’m basically referring to more fine-grained user interaction with versions etc.)
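For reference, those two participation flags map onto three replica roles. A hypothetical way to spell that out (not the actual types):

```rust
/// The replica roles implied by the two participation flags.
enum ReplicaRole {
    FullParticipant, // push_enabled && pull_enabled (the default in first iterations)
    EmitterOnly,     // push_enabled && !pull_enabled (writes its own stream, never pulls)
    Archive,         // !push_enabled && !pull_enabled (holds a frozen event stream)
}

fn role(push_enabled: bool, pull_enabled: bool) -> Option<ReplicaRole> {
    match (push_enabled, pull_enabled) {
        (true, true) => Some(ReplicaRole::FullParticipant),
        (true, false) => Some(ReplicaRole::EmitterOnly),
        (false, false) => Some(ReplicaRole::Archive),
        (false, true) => None, // a pull-only role isn't described above
    }
}
```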
Yes, it’s like what @happybeing said.
There is a large difference in that SyncThing uses block-level diffs, while Ryyn uses binary diffs. They both have their trade-offs, but I would say that they are specifically apt for their respective system classes - and they both have considerable weaknesses; it’s not easy to do this.
Since making the choice there, I have not currently looked deeper into it, but there is possibility to optimize for different uses. And perhaps a clean solution could be made for covering more of the cases in one go.
This is 200% a killer app! A fast “eventually consistent” syncer would be extremely useful as a backend for a lot of (clearnet) applications.
I have added a description of the file sharing protocol to github. (I.e. not the replication between devices, but the sharing of files and folders to others.)
I have a few of these in progress, I was thinking they should go to the community github as well.
So, file sharing protocol found here.
Hey everyone,
There have not been many updates in this thread; I have been working away on the code base like a little beaver.
I’m going to share a tiny little bit on what’s going on. The full details are always available in the GitHub repo (as code primarily, the readme is far behind).
But, I have written up a simplified, and incomplete, overview of the system.
All of the described parts are evolving in parallel over continuous iterations. The code base is in heavy flux and the iterations are quite rapid, with a mix of vertical and horizontal coverage. This is a result of evolving a combination of functionality that works together in a sometimes innovative way, and to some degree applying fairly ambitious requirements.
I’m prioritizing type safety (designing to not allow invalid states), logical, coherent and efficient code on all levels from architecture down to things like “zero-copy data pipelines”. Basically the essentials of a well-designed system. These are heavy boulders to push, but the payoff momentum down the line is invaluable.
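As an aside, “designing to not allow invalid states” basically means encoding states in the type system so that the wrong combinations cannot even be constructed. A generic illustration (not taken from the Ryyn codebase):

```rust
/// A pending upload can't carry a receipt, and a stored one can't lack an
/// address: the invalid combinations are simply unrepresentable.
enum UploadState {
    Pending,
    Paid { receipt: PaymentReceipt },
    Stored { receipt: PaymentReceipt, address: ChunkAddress },
}

struct PaymentReceipt(Vec<u8>);
struct ChunkAddress([u8; 32]);
```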
That said, things are still in the context of meeting the IF definition of deployed app at the final date.
So, anyway, the simplified and incomplete overview of current system:
(depends on entropy requirements)
If we have files etc: → 2.3.2 Scan
Derive keys for groups as long as a group is found.
→ 2.3.2 Clone or → 2.3.3 Add folder
From all groups of replicas listed in the Discovery step, the user picks one or more groups, i.e. folders. A group is formed by the devices that created/cloned a specific folder. Such a device-folder pair is called a replica. (Technically though, it’s actually the device+folder+local-path triplet that defines a replica, i.e. you can have many replicas of the same folder on the same device, if you clone them to different paths. But there’s no immediately apparent use for that for the average user, so we can just say device+folder = replica.)
The actual cloning consists of reading the event streams of all replicas of a group selected by the user. Usually only the stream of one replica is needed when cloning.
The events are replayed locally, and for every fs change they represent, the corresponding folder and file is created (with metadata, a temp name suffix and empty content).
Thus, first the file hierarchy is materialised...
...then the contents of individual files are materialised (if opted for; otherwise lazy loading can be used).
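A rough sketch of that clone-and-replay step, with a made-up event type and naming (the real stream format lives in the repo):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Illustrative replayable fs events from a replica's stream.
enum FsEvent {
    CreateDir { rel_path: PathBuf },
    CreateFile { rel_path: PathBuf, size: u64 },
    // ... renames, deletes, modifications, metadata updates
}

/// Replay a stream into a target folder: first the hierarchy, with empty,
/// temp-suffixed placeholder files; contents are materialised afterwards or lazily.
fn replay(target: &Path, events: &[FsEvent]) -> std::io::Result<()> {
    for event in events {
        match event {
            FsEvent::CreateDir { rel_path } => {
                fs::create_dir_all(target.join(rel_path))?;
            }
            FsEvent::CreateFile { rel_path, .. } => {
                let mut name = target.join(rel_path).into_os_string();
                name.push(".ryyn-tmp"); // temp name suffix until contents arrive
                let tmp = PathBuf::from(name);
                if let Some(parent) = tmp.parent() {
                    fs::create_dir_all(parent)?;
                }
                fs::File::create(&tmp)?; // empty placeholder for now
            }
        }
    }
    Ok(())
}
```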
A strict isolation between replicas is held with instance scoped keys and local state (dbs and snapshots).
(uploads, writes to events)
Two types of interfaces, a GUI and a CLI, with the same functionality:
Graceful shutdown when the device shuts down: workers finish what was in flight, etc.
*Will be implemented later
I am considering doing a combination of binary diff and chunked hashing, to meet the needs over the full spectrum of file sizes. With an application like this it is not really doable to identify a typical use case. It’s a very broad spectrum of users and applications where this will have utility.
In less tech-speak this is a performance-related improvement that increases the utility of the application.
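One way this could be wired up, just as a rough sketch (the threshold is made up):

```rust
/// Illustrative cutoff: small files get exact binary diffs, large files get
/// chunked hashing so they can be streamed instead of loaded in full.
const BINARY_DIFF_MAX_BYTES: u64 = 64 * 1024 * 1024; // 64 MiB, made-up number

enum DiffStrategy {
    BinaryDiff,     // exact, but holds old + new version in memory
    ChunkedHashing, // streaming, but re-uploads whole chunks on any change
}

fn pick_strategy(file_size: u64) -> DiffStrategy {
    if file_size <= BINARY_DIFF_MAX_BYTES {
        DiffStrategy::BinaryDiff
    } else {
        DiffStrategy::ChunkedHashing
    }
}
```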
The range of metadata has been ambitious from the start, with the approach that everything that is even slightly meaningful to capture will be captured, and used as needed or applicable down the line. This is something where SyncThing has been lacking, with a very limited set of metadata.
Initial design work has been done to extract these independent applications. It’s not a priority, but I think I will find it rewarding with regard to development quality, much in the same way as the focus on type safety and system coherence. Still, the initial plan of deferring this to after an alpha release stands for now.
Even though not directing work towards it, I’m keeping this in mind and figuring out what this could look like, and how aligned every design decision is with it. It will be interesting to hear from users what needs could be met.
I’ll leave it at that for now. There are many protocols and specifications to add to the GitHub readme when time allows. I’m looking forward to it because I think many will find it very interesting, and feedback will be valuable. I will post about such updates here.
I don’t want to interrupt, but I’m curious if big files might cause performance issues?
Don’t worry @Toivo, that’s all good. Besides, this particular question is very good.
For small files, binary diff performs better than chunked hashing.
However, the entire file and the previous version are loaded into memory to do the diffing. So, it’s 2x the file size. The larger the file, the more you load into memory at compare time.
Additionally, with the library I’m using, limitations start to appear with files of 2+ GiB.
Not great. But with small files it is better than chunked hashing, because the diff is exact, while chunked hashing has overhead. Extreme example: if the chunks are, say, 4 KiB, and only 1 byte changed, then that is roughly 4000x overhead, as we back up a whole chunk when we detect that its hash has changed.
With chunked hashing, you only need to read one chunk at a time. So, that’s streaming. And you can choose granularity. Of course, you can read them all in parallel to speed things up, but you can dial it down to load only as much into memory as you want. Which is what you’d do with large files, or when handling many files concurrently. Still though, there is no need to load an entire previous version, so reading all chunks into memory at once for hashing and hash-comparison will not mean 2x memory footprint as with binary diff, only 1x.
So, the great thing with chunked hashing is that you just keep hashes, and then pass around a path, an offset and a size for every chunk, and you read it into memory when you need it, where you need it. You have your previous version hashes, you’re notified that a file has changed, and you read the chunks one by one or in batches, from start to end, hash them and compare each with the previous version hash at that offset. If it changed, that chunk will be uploaded as the new version at that offset. And hopefully, the actual change within that chunk was very close to the full size of that chunk, for minimum overhead. There doesn’t seem to be a good way to know beforehand how large the change actually is, so we’ll probably use some heuristics based on file size - e.g. a larger file may usually have larger changes than a smaller one.
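A minimal sketch of that chunk-compare loop, using blake3 for the hashing purely as an example; the chunk size and all names here are assumptions, not Ryyn’s actual choices:

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

const CHUNK_SIZE: usize = 4 * 1024; // 4 KiB, matching the example above

/// A chunk whose hash no longer matches the previous version:
/// (offset, len) is all that's needed to re-read and upload it later.
struct ChangedChunk {
    offset: u64,
    len: usize,
}

/// Stream the file chunk by chunk, compare against the previous version's
/// hashes, and collect the chunks that need to go up as the new version.
fn changed_chunks(path: &Path, prev_hashes: &[blake3::Hash]) -> std::io::Result<Vec<ChangedChunk>> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut changed = Vec::new();
    let mut index = 0usize;
    loop {
        let n = read_full(&mut file, &mut buf)?;
        if n == 0 {
            break;
        }
        let hash = blake3::hash(&buf[..n]);
        if prev_hashes.get(index) != Some(&hash) {
            changed.push(ChangedChunk { offset: (index * CHUNK_SIZE) as u64, len: n });
        }
        index += 1;
    }
    Ok(changed)
}

/// Read until the buffer is full or EOF; returns the number of bytes read.
fn read_full(r: &mut impl Read, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        let n = r.read(&mut buf[total..])?;
        if n == 0 {
            break;
        }
        total += n;
    }
    Ok(total)
}
```

Only one chunk needs to be resident at a time, which is why the memory footprint stays flat regardless of file size.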
About self-encryption. @neo mentioned the other week or so an example of 10 files of 400 KiB each that would result in 41 chunks with self-encryption. With Ryyn it will be 1 chunk (the data), and a fraction (say 1 KiB) co-located in another chunk (the event). Roughly, self-encryption gives 40x more chunks for that file size. That’s not great.
With large files, the difference is much smaller.
Self-encryption is cool, but it doesn’t really check any boxes here.
With Ryyn, a 1 MiB change to a 2 GiB file will lead to about 1 MiB of new data uploaded to the network, and this is what it is supposed to do: be efficient with any number and size of continuous changes to large or small files. There used to be code for updating a self-encrypted file without producing an entire new copy of the changed file for upload. It was intricate, and I didn’t get into it properly before it was removed. Maybe it can be introduced again, I don’t know.
One thing btw that these captured changes as versions of a file allow is the possibility to share a specific version of a file, without revealing either previous versions or later versions. I think that can be quite useful.
In short: It’s a balance between speed and memory. The faster we go, the more memory is required (up to the size of the file). The less memory we want to use, the slower it goes (down to the time it takes end-to-end for one chunk-read and upload x number of chunks).
Optimally, we never load more into memory than we can push through to the network, but if that as well is too much for our requirements, that’s when we slow down the reads.
Ryyn uses fault-tolerant uploads, where every change is described as an upload task which is stored in a local db. It has all the info: path, offset and size, in addition to time of insert, time of last upload attempt, number of attempts, payment receipt, time of last verification attempt (verifying it is online), etc. Things like that. You can “drop” (you’ll point to them) 2000 files of 4 GiB into Ryyn, and it will not lose a chunk, even in difficult cases (we’ll see how difficult, in testing); you won’t load more into memory than what is configured, you won’t try to upload more data than configured (or than optimal, if using dynamic metrics), and you can track upload rate, cost per GiB etc. in realtime.
This approach allows full control of the data pipeline.
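Just to visualise the kind of record meant here (field names are rough guesses, not the actual schema):

```rust
use std::path::PathBuf;

/// One row in the local upload-task db, roughly as described above.
struct UploadTask {
    path: PathBuf,                 // which file the data comes from
    offset: u64,                   // where in the file the change starts
    size: u64,                     // how many bytes to upload
    inserted_at: u64,              // time of insert (unix seconds)
    last_attempt_at: Option<u64>,  // time of last upload attempt
    attempts: u32,                 // number of attempts so far
    payment_receipt: Option<Vec<u8>>,
    last_verified_at: Option<u64>, // last time the chunk was verified to be online
}
```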
Well, you see, questions are great.
Have you thought about using an mmap-style crate (memmap2)? Mapping files to virtual memory means only the parts of the file you are accessing are in memory. That saves loading the whole file first and then doing the diff.
Another advantage is that processing the diff can begin immediately, rather than waiting for a 1 or 10 or 100 GB or even a TB file to load. In the end the overall time taken should be similar.
Been using it to access an index file that is over 200GB and works well for that.
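For reference, basic memmap2 usage looks roughly like this; the diff function here is just a placeholder, not bidiff’s actual API:

```rust
use std::fs::File;

use memmap2::Mmap;

fn diff_mapped(old_path: &str, new_path: &str) -> std::io::Result<()> {
    let old_file = File::open(old_path)?;
    let new_file = File::open(new_path)?;

    // Mapping is unsafe because the underlying files must not change while mapped.
    let old = unsafe { Mmap::map(&old_file)? };
    let new = unsafe { Mmap::map(&new_file)? };

    // Both maps deref to &[u8]; pages are faulted in only as the differ touches them,
    // so nothing close to 2x the file size needs to be resident in memory.
    run_binary_diff(&old[..], &new[..]);
    Ok(())
}

fn run_binary_diff(_old: &[u8], _new: &[u8]) {
    // hand the slices to the binary differ
}
```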
Ah, that would be very good. I have been thinking about it but pushed it back. I am looking at the bidiff dependency now and I think I can implement this in there. I’m forking it. Let’s see what results.
@neo, I have something here oetyng/mmap_bidiff: Zero-copy binary diffing and patching using memory-mapped files.
I’m at the sorting-out-kinks stage with it. But this should be (if it works) a major improvement over the binary diff I used before.