Thank you guys!
Hey Mark. I agree these are important things. I have been thinking about it because my format is definitely unique in some ways, very different from ant file. I don’t particularly like that fact, but I didn’t really see what I wanted out there, so I went for what made most sense to me. But it’s much better for everyone if there is a shared format, so I’m definitely prepared to change things to reach that. The storage format is fairly well abstracted away (it hasn’t been the full focus though, so it can be better).
I’ll describe the major differences briefly:
Changes to files are made in two parts, metadata and content.
Big difference 1:
Metadata is cached in a local db and is straightforward to compare.
The actual change though is saved in network as an event that is a part of the event stream of the backed up folder. (This event stream is what makes it possible to restore it to any state it ever had.)
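To illustrate the idea of restoring any earlier state from the event stream, here is a rough sketch. The Event type, its fields and the event kinds are made up for the example and are not the actual Ryyn types; it only shows that replaying a prefix of the stream gives the folder state at that point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape, for illustration only.
    kind: str          # "added", "modified" or "removed"
    path: str
    version: int = 0   # stand-in for whatever identifies the content


def replay(events, upto=None):
    """Rebuild the folder state (path -> version) from the first `upto` events.

    With upto=None the whole stream is replayed, giving the latest state.
    """
    state = {}
    for event in events[:upto]:
        if event.kind == "removed":
            state.pop(event.path, None)
        else:
            state[event.path] = event.version
    return state


stream = [
    Event("added", "a.txt", 1),
    Event("added", "b.txt", 1),
    Event("modified", "a.txt", 2),
    Event("removed", "b.txt"),
]

latest = replay(stream)       # state after all events
earlier = replay(stream, 2)   # state after the first two events
```

Since the stream is append-only, any earlier state is just a shorter replay.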
Big difference 2:
Content is compared by calculating the binary diff (actually using the rsync diff algo for that, so there can be similarities). The previous version of the content is cached on the fs. If it is not there when access is requested, it is downloaded. So, the diff of the contents is what is saved on every change. Getting the file from the network means base_content + diff_0 + diff_1 + .. + diff_n.
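A small sketch of the base + diffs reconstruction, to make it concrete. Note the assumptions: I use Python's difflib here as a stand-in for the rsync diff algorithm, and the delta format (copy/data operations) and function names are invented for the example, so this is not the real on-network format.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Describe `new` as copies from `old` plus literal data (difflib stand-in)."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))     # reuse bytes from the old version
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))      # bytes only present in the new version
        # "delete" needs no operation: the old bytes are simply not copied
    return ops


def apply_delta(old: bytes, delta) -> bytes:
    """Apply one delta to the previous version."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += old[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def restore(base: bytes, deltas) -> bytes:
    """base_content + diff_0 + diff_1 + .. + diff_n, applied in order."""
    content = base
    for delta in deltas:
        content = apply_delta(content, delta)
    return content


versions = [b"hello world", b"hello brave world", b"hello brave new world!"]
deltas = [make_delta(a, b) for a, b in zip(versions, versions[1:])]
current = restore(versions[0], deltas)
```

Applying only the first k deltas gives version k, which is what makes older versions of the content reachable.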
Big difference 3:
Very small file changes are buffered in shards (Scratchpads.. this magic data type, so far), which are compacted into chunks when they reach a threshold size. The address of a specific piece of small content (whether base or diff) is a chunk address plus an offset, from which to read the length prefix and get the full range of bytes of that piece. (This is significantly more efficient for storage, but access efficiency depends on how sparse and random the future reads are.)
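The chunk + offset addressing could look roughly like this. The 4-byte big-endian length prefix and the function names are assumptions for the example, not necessarily what is used in the real format.

```python
import struct

# Assumed prefix layout: 4-byte big-endian unsigned length.
PREFIX = struct.Struct(">I")


def append_piece(buffer: bytearray, piece: bytes) -> int:
    """Append a length-prefixed piece to the buffer and return its offset."""
    offset = len(buffer)
    buffer += PREFIX.pack(len(piece))
    buffer += piece
    return offset


def read_piece(chunk: bytes, offset: int) -> bytes:
    """Read the length prefix at `offset`, then return that many bytes after it."""
    (length,) = PREFIX.unpack_from(chunk, offset)
    start = offset + PREFIX.size
    return chunk[start:start + length]


# Buffer two small pieces, then treat the buffer as a compacted chunk.
buffer = bytearray()
base_offset = append_piece(buffer, b"base")
diff_offset = append_piece(buffer, b"diff-0")
chunk = bytes(buffer)

piece = read_piece(chunk, diff_offset)
```

So an address is just (chunk address, offset), and many small pieces share one chunk instead of each taking a whole chunk of its own.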
Not sure what would be possible for us all within the IF timeframe. I think we could all look at what particular solutions there are, see what makes sense, and help each other out. I’ve seen it mentioned in discord, but I’ve been so busy lately that I haven’t been able to follow up on it. I am about to go through all the other repos soon to sanity check things etc., and I think I will be able to say more after that.
I think yes, conceptually there are parts here that could be a basis for an Autonomi file system, and there is also previous work on this. I’d love to look into that sometime after Ryyn.
The efficiency is very tightly connected to the data format, and there is still much to work on there. Scratchpad mutability plays a part there as well.
Thanks!