As a compromise - maybe you tell us if you make it through the voting..?
So we have an additional motivation to throw votes at your app..?
Will this be similar to Syncthing?
Yes, very similar.
I actually think it’s spoken like rhythm - just without the t and an n instead of an m
I think @Southside found one of your ads that are floating around the internet
I should perhaps add that in the process of developing Ryyn, two independent libraries will be developed, which I could have entered as independent projects in IF (but didn’t think of).
These are Autonomi-backed storage libraries:
Later these will have bindings to a few of the most used languages.
For those not into the technical side of things, these are sorts of databases suited for different types of use cases/applications. They will be usable by developers when building software.
So Ryyn is actually 3 projects in 1.
Will extract those after everything is wired up and ready for an alpha release.
I thought I’d give a brief overview of the differences. It could go deeper, but I’m keeping it short for now.
I’m continuously looking for inspiration and learning from projects like these.
The reason is also to shine some light on what it actually is that Ryyn does, what makes its internals stand out and how it leverages Autonomi to do this (this part is more implied than directly explained here, will have to wait).
I know this is a bit too abstract and technical for many; I just didn’t have more time for this in particular. I am working on documentation in parallel with coding, and most work is on the technical, conceptual level, because everything user-facing just requires a few lines, so far.
For a user doing basic sync, these two systems look almost the same. Add a file and it appears elsewhere. Modify a file and the change propagates. Delete a file and the deletion propagates.
Superficially, Ryyn and SyncThing behave identically.
But when conditions get slightly more difficult, Ryyn’s choice of system immediately stands out.
A brief comparison:
| Scenario | SyncThing | Ryyn |
|---|---|---|
| Two edits on disconnected devices | Conflict copy, unstructured | Two causal forks, trackable |
| Accidental deletion | Gone | Recoverable from DAG |
| Reinstall, DB wipe, device loss, crash | Treated as new device | Identity history persists, pick up where you left off |
| File changes back and forth | Overwrites or oscillates | Multiple versions stored, switchable |
| Merge required | Impossible | Expressible by diff ancestry |
| Data loss possible? | Yes, on overwrite | No, by design |
Ryyn is a multi-writer, eventual-consistency system, tracking ancestry with inbuilt preservation of all forks. The system creates a bounded DAG of known, ordered versions per file, per replica.
It is fundamentally more powerful than SyncThing. It’s an entirely different system class, doing distributed state tracking instead of a snapshot/file-copy model.
Ryyn has a selective materialization model, not just “sync latest”, but “materialize version X” (choose version for use), since in contrast to SyncThing, Ryyn treats replicas as causally constrained emitters, not naive writers.
Conceptually, Ryyn’s internal cost model is higher, in order to give a better and expanded user experience. SyncThing has a shallower internal logic and a conceptually lower internal cost, but becomes more fragile in correctness in edge cases.
(Note: This says nothing about the actual codebase which can be inflated, complicated, messy and vice versa regardless of conceptual internal cost model. Idea vs implementation.)
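To make that a bit more concrete, here is a rough sketch of what a per-file version DAG could look like. All names and fields are illustrative, not Ryyn’s actual types.

```rust
use std::collections::HashMap;

/// Content address of one stored version (illustrative).
type VersionId = [u8; 32];

/// One node in the per-file version DAG: a version plus its causal parents.
struct VersionNode {
    id: VersionId,
    parents: Vec<VersionId>, // two+ parents = a merge; siblings = a fork
    author_replica: String,  // which replica emitted this version
    timestamp: u64,
}

/// The bounded DAG of known, ordered versions for a single file.
struct FileHistory {
    nodes: HashMap<VersionId, VersionNode>,
    heads: Vec<VersionId>, // current forks; a single head means converged
}

impl FileHistory {
    /// "Materialize version X": pick any known version, not just the latest.
    fn select(&self, id: &VersionId) -> Option<&VersionNode> {
        self.nodes.get(id)
    }

    /// Concurrent edits never overwrite each other; they just add heads.
    fn record(&mut self, node: VersionNode) {
        self.heads.retain(|h| !node.parents.contains(h));
        self.heads.push(node.id);
        self.nodes.insert(node.id, node);
    }
}
```

That’s the whole point of the table above: forks, deletions and old versions stay in the graph, so nothing is silently overwritten.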
One thing I am avoiding on purpose is a multitude of user input options (and non-options). More fine-grained control can be implemented later, but I’ve estimated that the largest need and most powerful tool for a broad user base is pure and simple fire-and-forget backup of folders.
I imagine that another difference would be that SyncThing needs less maintenance (unless you lose data and have to redo your changes), while Ryyn will need merging, recovering etc., right? So, SyncThing requires you to stick to some rules to keep your data consistent, and it’s probably not suitable for certain use cases where you cannot do this, while Ryyn keeps you covered, but needs your attention to recover from disaster?
Question 2 – could Ryyn be used as a simple Version Control System? Is it sitting conceptually between Git and SyncThing?
This is awesome @oetyng and shows why I said what I did earlier. @Erwin you need to read this!
It is a VCS, or will be once someone builds a UI on it.
I don’t see any way in which SyncThing is going to be better than this, because if Edward implements the underlying data model correctly (as he surely will), you can build VCS, backup, anything you want on top - including something that behaves like SyncThing (i.e. simple) but never loses data, should you need to hit the “Advanced recovery” button.
Awesome stuff!
No, I wouldn’t say so. Recovery steps are practically as simple as starting a new clone.
This is because the merging is happening all the time. The replicas in a sync group (devices watching the same folder) periodically look for new events in each other’s streams, and merge them right then and there (if participating with pull_enabled), so a converging stream is always held by every replica. The sync group converges on the same state in basic configuration (more on advanced further down).
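Roughly, one tick of that continuous pull-and-merge could be sketched like this (all names are illustrative placeholders, not the actual code):

```rust
/// Placeholder for one fs-change event in a replica's stream.
struct Event;

/// Placeholder for a remote replica's event stream.
struct PeerStream;

impl PeerStream {
    /// Return events appended since the given cursor position.
    fn events_since(&self, _cursor: u64) -> Vec<Event> {
        Vec::new()
    }
}

/// One periodic tick: pull new events from every peer in the sync group
/// and fold them into the local, converging stream.
fn sync_tick(pull_enabled: bool, peers: &[PeerStream], cursor: u64) {
    if !pull_enabled {
        return; // emitter-only and archive replicas don't pull
    }
    for peer in peers {
        for event in peer.events_since(cursor) {
            merge_event(event);
        }
    }
}

fn merge_event(_event: Event) {
    // append/merge into the local stream; all full participants converge on the same state
}
```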
On disaster, the disk is assumed to be wiped, and the steps are just as when starting from scratch: a selected folder is cloned.
So, on disaster (or generally when starting anew), the user:
This is the default path from first iterations of Ryyn.
In later iterations, when selecting a sync group, if the user only wants a specific replica within the sync group, they clone only that one. It may have been an emitter-only replica (push_enabled && !pull_enabled), or an archive (!push_enabled && !pull_enabled), and thus the user only wants the specific event stream it held, not the latest or more progressed streams.
First steps of implementation will not delve deep into those features, but they are there and enabled, ready to be built. I.e. first implementations will use full participants (push_enabled && pull_enabled) by default and have a default sync behaviour/strategy. From there it’s ready for expanding into the full feature set.
The merge model is also simple to begin with (though still preserving all data, fully DAG-compliant), where DAG features are prepared for but will not be developed, and thus not exposed to users, in earlier iterations. (With that I’m basically referring to more fine-grained user interaction with versions etc.)
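For reference, those two participation flags map onto three replica roles. A hypothetical way to spell that out (not the actual types):

```rust
/// The replica roles implied by the two participation flags.
enum ReplicaRole {
    FullParticipant, // push_enabled && pull_enabled (the default in first iterations)
    EmitterOnly,     // push_enabled && !pull_enabled (writes its own stream, never pulls)
    Archive,         // !push_enabled && !pull_enabled (holds a frozen event stream)
}

fn role(push_enabled: bool, pull_enabled: bool) -> Option<ReplicaRole> {
    match (push_enabled, pull_enabled) {
        (true, true) => Some(ReplicaRole::FullParticipant),
        (true, false) => Some(ReplicaRole::EmitterOnly),
        (false, false) => Some(ReplicaRole::Archive),
        (false, true) => None, // a pull-only role isn't described above
    }
}
```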
Yes, it’s like what @happybeing said.
There is a large difference in that SyncThing uses block-level diffs, while Ryyn uses binary diffs. They both have their trade-offs, but I would say that they are specifically apt for their respective system classes - and they both have considerable weaknesses; it’s not easy to do this.
Since making the choice there, I have not currently looked deeper into it, but there is possibility to optimize for different uses. And perhaps a clean solution could be made for covering more of the cases in one go.
This is 200% a killer app! A fast “eventually consistent” syncer would be extremely useful as a backend for a lot of (clearnet) applications.
I have added a description of the file sharing protocol to github. (I.e. not the replication between devices, but the sharing of files and folders to others.)
I have a few of these in progress, I was thinking they should go to the community github as well.
So, file sharing protocol found here.
Hey everyone,
There have not been many updates in this thread; I have been working away on the code base like a little beaver.
I’m going to share a tiny little bit on what’s going on. The full details are always available in the GitHub repo (as code primarily, the readme is far behind).
But, I have written up a simplified, and incomplete, overview of the system.
All of the described parts are evolving in parallel over continuous iterations. The code base is in heavy flux and the iterations are quite rapid, with a mix of vertical and horizontal coverage. This is a result of evolving a combination of functionality that works together in a sometimes innovative way, and to some degree applying fairly ambitious requirements.
I’m prioritizing type safety (designing to not allow invalid states), logical, coherent and efficient code on all levels from architecture down to things like “zero-copy data pipelines”. Basically the essentials of a well-designed system. These are heavy boulders to push, but the payoff momentum down the line is invaluable.
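As an aside, “designing to not allow invalid states” basically means encoding states in the type system so that the wrong combinations cannot even be constructed. A generic illustration (not taken from the Ryyn codebase):

```rust
/// A pending upload can't carry a receipt, and a stored one can't lack an
/// address: the invalid combinations are simply unrepresentable.
enum UploadState {
    Pending,
    Paid { receipt: PaymentReceipt },
    Stored { receipt: PaymentReceipt, address: ChunkAddress },
}

struct PaymentReceipt(Vec<u8>);
struct ChunkAddress([u8; 32]);
```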
That said, things are still in the context of meeting the IF definition of deployed app at the final date.
So, anyway, the simplified and incomplete overview of current system:
(depends on entropy requirements)
If we have files etc: → 2.3.2 Scan
Derive keys for groups as long as a group is found.
→ 2.3.2 Clone or → 2.3.3 Add folder
From all groups of replicas listed in the Discovery step, the user picks one or more groups, i.e. folders. A group is formed by the devices that created/cloned a specific folder. Such a device-folder pair is called a replica. (Technically though, it’s actually the device+folder+local-path triplet that defines a replica, i.e. you can have many replicas of the same folder on the same device, if you clone them to different paths. But there’s no immediately apparent use for that for the average user, so we can just say device+folder = replica.)
The actual cloning consists of reading the event streams of all replicas of a group selected by the user. Usually only the stream of one replica is needed when cloning.
The events are replayed locally, and for every fs change they represent, the corresponding folder and file is created (with metadata, a temp name suffix and empty content).
Thus, first the file hierarchy is materialised...
...then the contents of individual files are materialised (if opted for; otherwise lazy loading can be used).
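A rough sketch of that clone-and-replay step, with a made-up event type and naming (the real stream format lives in the repo):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Illustrative replayable fs events from a replica's stream.
enum FsEvent {
    CreateDir { rel_path: PathBuf },
    CreateFile { rel_path: PathBuf, size: u64 },
    // ... renames, deletes, modifications, metadata updates
}

/// Replay a stream into a target folder: first the hierarchy, with empty,
/// temp-suffixed placeholder files; contents are materialised afterwards or lazily.
fn replay(target: &Path, events: &[FsEvent]) -> std::io::Result<()> {
    for event in events {
        match event {
            FsEvent::CreateDir { rel_path } => {
                fs::create_dir_all(target.join(rel_path))?;
            }
            FsEvent::CreateFile { rel_path, .. } => {
                let mut name = target.join(rel_path).into_os_string();
                name.push(".ryyn-tmp"); // temp name suffix until contents arrive
                let tmp = PathBuf::from(name);
                if let Some(parent) = tmp.parent() {
                    fs::create_dir_all(parent)?;
                }
                fs::File::create(&tmp)?; // empty placeholder for now
            }
        }
    }
    Ok(())
}
```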
A strict isolation between replicas is held with instance scoped keys and local state (dbs and snapshots).
(uploads, writes to events)
Two types of interfaces, a GUI and a CLI, with the same functionality:
Graceful shutdown when the device shuts down: workers finish what was in flight, etc.
*Will be implemented later
I am considering doing a combination of binary diff and chunked hashing, to meet the needs over the full spectrum of file sizes. With an application like this it is not really doable to identify a typical use case. It’s a very broad spectrum of users and applications where this will have utility.
In less tech-speak this is a performance-related improvement that increases the utility of the application.
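One way this could be wired up, just as a rough sketch (the threshold is made up):

```rust
/// Illustrative cutoff: small files get exact binary diffs, large files get
/// chunked hashing so they can be streamed instead of loaded in full.
const BINARY_DIFF_MAX_BYTES: u64 = 64 * 1024 * 1024; // 64 MiB, made-up number

enum DiffStrategy {
    BinaryDiff,     // exact, but holds old + new version in memory
    ChunkedHashing, // streaming, but re-uploads whole chunks on any change
}

fn pick_strategy(file_size: u64) -> DiffStrategy {
    if file_size <= BINARY_DIFF_MAX_BYTES {
        DiffStrategy::BinaryDiff
    } else {
        DiffStrategy::ChunkedHashing
    }
}
```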
The range of metadata has been ambitious from the start, with the approach that everything that is even slightly meaningful to capture will be captured, and used as needed or applicable down the line. This is something where SyncThing has been lacking, with a very limited set of metadata.
Initial design work has been done to extract these independent applications. It’s not a priority, but I think I will find it rewarding with regard to development quality, much in the same way as the focus on type safety and system coherence. Still, the initial plan of deferring this to after an alpha release stands for now.
Even though not directing work towards it, I’m keeping this in mind and figuring out what this could look like, and how aligned every design decision is with it. It will be interesting to hear from users what needs could be met.
I’ll leave it at that for now. There are many protocols and specifications to add to the GitHub readme when time allows. I’m looking forward to it because I think many will find it very interesting, and feedback will be valuable. I will post about such updates here.
I don’t want to interrupt, but I’m curious if big files might cause performance issues?
Don’t worry @Toivo, that’s all good. Besides, this particular question is very good.
For small files, binary diff performs better than chunked hashing.
However, the entire file and the previous version are loaded into memory to do the diffing. So, it’s 2x the file size. The larger the file, the more you load into memory at compare time.
Additionally, with the library I’m using, limitations start to appear with files of 2+ GiB.
Not great. But with small files it is better than chunked hashing, because the diff is exact, while chunked hashing has overhead. Extreme example: if the chunks are, say, 4 KiB, and only 1 byte changed, then that is roughly 4000x overhead, as we back up a whole chunk when we detect that its hash has changed.
With chunked hashing, you only need to read one chunk at a time. So, that’s streaming. And you can choose granularity. Of course, you can read them all in parallel to speed things up, but you can dial it down to load only as much into memory as you want. Which is what you’d do with large files, or when handling many files concurrently. Still though, there is no need to load an entire previous version, so reading all chunks into memory at once for hashing and hash-comparison will not mean 2x memory footprint as with binary diff, only 1x.
So, the great thing with chunked hashing is that you just keep hashes, and then pass around a path, an offset and a size for every chunk, and you read it into memory when you need it, where you need it. You have your previous version hashes, you’re notified that a file has changed, and you read the chunks one by one or in batches, from start to end, hash them and compare each with the previous version hash at that offset. If it changed, that chunk will be uploaded as the new version at that offset. And hopefully, the actual change within that chunk was very close to the full size of that chunk, for minimum overhead. There doesn’t seem to be a good way to know beforehand how large the change actually is, so we’ll probably use some heuristics based on file size - e.g. a larger file may usually have larger changes than a smaller one.
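A minimal sketch of that chunk-compare loop, using blake3 for the hashing purely as an example; the chunk size and all names here are assumptions, not Ryyn’s actual choices:

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

const CHUNK_SIZE: usize = 4 * 1024; // 4 KiB, matching the example above

/// A chunk whose hash no longer matches the previous version:
/// (offset, len) is all that's needed to re-read and upload it later.
struct ChangedChunk {
    offset: u64,
    len: usize,
}

/// Stream the file chunk by chunk, compare against the previous version's
/// hashes, and collect the chunks that need to go up as the new version.
fn changed_chunks(path: &Path, prev_hashes: &[blake3::Hash]) -> std::io::Result<Vec<ChangedChunk>> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut changed = Vec::new();
    let mut index = 0usize;
    loop {
        let n = read_full(&mut file, &mut buf)?;
        if n == 0 {
            break;
        }
        let hash = blake3::hash(&buf[..n]);
        if prev_hashes.get(index) != Some(&hash) {
            changed.push(ChangedChunk { offset: (index * CHUNK_SIZE) as u64, len: n });
        }
        index += 1;
    }
    Ok(changed)
}

/// Read until the buffer is full or EOF; returns the number of bytes read.
fn read_full(r: &mut impl Read, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        let n = r.read(&mut buf[total..])?;
        if n == 0 {
            break;
        }
        total += n;
    }
    Ok(total)
}
```

Only one chunk needs to be resident at a time, which is why the memory footprint stays flat regardless of file size.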
About self-encryption. @neo mentioned the other week or so an example of 10 files of 400 KiB each that would result in 41 chunks with self-encryption. With Ryyn it will be 1 chunk (the data), and a fraction (say 1 KiB) co-located in another chunk (the event). Roughly, self-encryption gives 40x more chunks for that file size. That’s not great.
With large files, the difference is much smaller.
Self-encryption is cool, but it doesn’t really check any boxes here.
With Ryyn, a 1 MiB change to a 2 GiB file will lead to about 1 MiB of new data uploaded to the network, and this is what it is supposed to do: be efficient with any number and size of continuous changes to large or small files. There used to be code for updating a self-encrypted file without producing an entire new copy of the changed file for upload. It was intricate, and I didn’t get into it properly before it was removed. Maybe it can be introduced again, I don’t know.
One thing btw that these captured changes as versions of a file allow is the possibility to share a specific version of a file, without revealing either previous versions or later versions. I think that can be quite useful.
In short: It’s a balance between speed and memory. The faster we go, the more memory is required (up to the size of the file). The less memory we want to use, the slower it goes (down to the time it takes end-to-end for one chunk-read and upload x number of chunks).
Optimally, we never load more into memory than we can push through to the network, but if that as well is too much for our requirements, that’s when we slow down the reads.
Ryyn uses fault-tolerant uploads, where every change is described as an upload task which is stored in a local db. It has all the info: path, offset and size, in addition to time of insert, time of last upload attempt, number of attempts, payment receipt, time of last verification attempt (verifying it is online), etc. Things like that. You can “drop” (you’ll point to them) 2000 files of 4 GiB into Ryyn, and it will not lose a chunk, even in difficult cases (we’ll see how difficult, in testing); you won’t load more into memory than what is configured, you won’t try to upload more data than configured (or than optimal, if using dynamic metrics), and you can track upload rate, cost per GiB etc. in realtime.
This approach allows full control of the data pipeline.
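Just to visualise the kind of record meant here (field names are rough guesses, not the actual schema):

```rust
use std::path::PathBuf;

/// One row in the local upload-task db, roughly as described above.
struct UploadTask {
    path: PathBuf,                 // which file the data comes from
    offset: u64,                   // where in the file the change starts
    size: u64,                     // how many bytes to upload
    inserted_at: u64,              // time of insert (unix seconds)
    last_attempt_at: Option<u64>,  // time of last upload attempt
    attempts: u32,                 // number of attempts so far
    payment_receipt: Option<Vec<u8>>,
    last_verified_at: Option<u64>, // last time the chunk was verified to be online
}
```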
Well, you see, questions are great.
Have you thought about using an mmap-style crate (memmap2)? Mapping files to virtual memory means only the parts of the file you are accessing are in memory. That saves loading the whole file first and then doing the diff.
Another advantage is that processing the diff can begin immediately, rather than waiting for a 1 or 10 or 100 GB or even a TB file to load. In the end the overall time taken should be similar.
Been using it to access an index file that is over 200GB and works well for that.
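For reference, basic memmap2 usage looks roughly like this; the diff function here is just a placeholder, not bidiff’s actual API:

```rust
use std::fs::File;

use memmap2::Mmap;

fn diff_mapped(old_path: &str, new_path: &str) -> std::io::Result<()> {
    let old_file = File::open(old_path)?;
    let new_file = File::open(new_path)?;

    // Mapping is unsafe because the underlying files must not change while mapped.
    let old = unsafe { Mmap::map(&old_file)? };
    let new = unsafe { Mmap::map(&new_file)? };

    // Both maps deref to &[u8]; pages are faulted in only as the differ touches them,
    // so nothing close to 2x the file size needs to be resident in memory.
    run_binary_diff(&old[..], &new[..]);
    Ok(())
}

fn run_binary_diff(_old: &[u8], _new: &[u8]) {
    // hand the slices to the binary differ
}
```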
Ah, that would be very good. I have been thinking about it but pushed it back. I am looking at the bidiff dependency now and I think I can implement this in there. I’m forking it. Let’s see what results.
@neo, I have something here oetyng/mmap_bidiff: Zero-copy binary diffing and patching using memory-mapped files.
I’m at the sorting-out-kinks stage with it. But this should be (if it works) a major improvement over the binary diff I used before.