So the hash of the name is the address? How do you avoid collisions?
If it exists you cannot create it. A bit like DNS.
There are other approaches as well, e.g. a hash of owner + type is much harder to collide, but collision avoidance is as simple as trying to store.
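A minimal sketch of the "try to store" rule, assuming a 256-bit address derived with SHA3-256 and an in-memory dict standing in for the network. `address_for`, `ToyNetwork` and `AddressTaken` are hypothetical names for illustration, not the real API.

```python
import hashlib


class AddressTaken(Exception):
    """Raised when something is already stored at the requested address."""


def address_for(name: str) -> bytes:
    # Assumption: the address is the SHA3-256 digest of the name (32 bytes).
    return hashlib.sha3_256(name.encode("utf-8")).digest()


class ToyNetwork:
    """In-memory stand-in for the network: one value per address."""

    def __init__(self):
        self._store = {}

    def create(self, address: bytes, data: bytes) -> None:
        # Collision avoidance is just "try to store": refuse if occupied.
        if address in self._store:
            raise AddressTaken(address.hex())
        self._store[address] = data
```

Creating the same name twice raises `AddressTaken` on the second attempt, much like trying to register a DNS name that already exists.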
I get that with, say, a person selecting a URL for a website, but what if an app wants to generate a mutable data item? Does it just have to keep on trying until it finds an address that hasn't been used, or must app developers be persuaded to always use very long identifiers for data they create to lessen the risk of collisions?
Another quick question on this subject. Is there always just one level of xor addresses, or can there be multiple levels differentiated by type-tag?
So, for example, is it possible to have
123…xyz : type-tag 1500 and
123…xyz : type-tag 1501
as two separate addresses on the network? Or is the type-tag just to give additional information about how the data should be processed?
Currently the name is the address in types like Sequence (and previously in AppendOnlyData), and it's chosen randomly if the user doesn't provide a xorname. E.g. at the moment in the CLI, providing such an optional xorname is only possible for Sequence creation:
$ safe seq store --xorname <xorname hex> <data>
You should get an error if you have a collision; eventually we can have the CLI retry with a new address in such scenarios.
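The retry idea could look roughly like this. It is only a sketch of what a client might do, not existing CLI behaviour: `random_xorname`, `store_with_retry` and the 32-byte address length are assumptions, and a plain dict stands in for the network.

```python
import os


def random_xorname() -> bytes:
    # Assumption: a xorname is 32 random bytes.
    return os.urandom(32)


def store_with_retry(store: dict, data: bytes, max_attempts: int = 5,
                     pick=random_xorname) -> bytes:
    """Try fresh addresses until one is free, then store and return it."""
    for _ in range(max_attempts):
        address = pick()
        if address not in store:
            store[address] = data
            return address
    raise RuntimeError("no free address found after retries")
```

With properly random 32-byte addresses the loop should almost never go round more than once, but catering for the collision costs very little.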
This is how it used to be (yes, each tag refers to a different xor address space), but we'll have to wait to see how the new data types will work.
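To illustrate "each tag refers to a different xor address space": a toy store keyed by the pair (xorname, type tag). This only models the old behaviour as described here; `TaggedStore` is a hypothetical name and the new data types may work differently.

```python
class TaggedStore:
    """Toy model: the full key is (xorname, type_tag), not the xorname alone."""

    def __init__(self):
        self._items = {}

    def create(self, xorname: bytes, type_tag: int, data: bytes) -> None:
        key = (xorname, type_tag)
        if key in self._items:
            raise KeyError(f"address in use: {xorname.hex()} tag {type_tag}")
        self._items[key] = data

    def get(self, xorname: bytes, type_tag: int) -> bytes:
        return self._items[(xorname, type_tag)]
```

In this model the same xorname under tags 1500 and 1501 holds two independent items, while a second create under the same tag is rejected.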
Yes, but it isn't as bad as it sounds:
- you can convert short names into long, effectively random strings (cf. DNS)
- the address space is so large that if the addresses are reasonably random, clashes will be so rare you are unlikely to ever encounter them (although all good programmers should cater for them!)
E.g. by adding a salt or random number and hashing?
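A small sketch of the salt-and-hash idea, assuming SHA3-256 as the hash. `salted_address` is a hypothetical helper, not a network API.

```python
import hashlib
import os


def salted_address(short_name: str, salt: bytes) -> bytes:
    # Hash salt + name to get a long, effectively random 32-byte address.
    return hashlib.sha3_256(salt + short_name.encode("utf-8")).digest()


# A fresh random salt gives a fresh address for the same short name.
salt = os.urandom(16)
address = salted_address("notes", salt)
```

The app has to remember the salt (or derive it from something it already knows) to find the data again; the same salt and name always give the same address.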
Thanks again Mark,
Just brainstorming, but that might theoretically allow a type tag where the first part of the address was derived from a person's public name (and thus reserved for only that public name), and the second part derived from a file name. This could allow a type of "flat" filesystem that could be accessed without ever downloading an index of pointers.
So if you had a public file called "useful-information" at {hash of "Mark"} + {hash of "useful-information"}, anybody could access that by knowing the formula, but nobody could squat it either.
If people can discover your method they can squat, at a cost, so it isn't a solution for that. Only for creating a way to look data up in a very large address space. If you want to prevent squatting you either have to keep the mechanism secret (not possible) or have some kind of gatekeeping (e.g. lookup in an index which a squatter cannot write to).
The idea is that with a particular type tag, the network would prevent anybody except "Mark" from storing information at an address beginning with (hash of "Mark"). Same for addresses beginning with (hash of "David") on that type tag.
Whether that's desirable for the network to do I don't know, but I can't see that it's necessarily impossible.
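A rough sketch of that proposal, purely hypothetical since nothing like it exists in the network today. It assumes a 32-byte address split into two 16-byte halves and SHA3-256 as the hash; `flat_address` and `may_write` are made-up names.

```python
import hashlib

HALF = 16  # assumption: 32-byte address, split evenly between the two parts


def _digest(text: str) -> bytes:
    return hashlib.sha3_256(text.encode("utf-8")).digest()


def flat_address(public_name: str, file_name: str) -> bytes:
    # First half from the public name, second half from the file name.
    return _digest(public_name)[:HALF] + _digest(file_name)[:HALF]


def may_write(address: bytes, requester_public_name: str) -> bool:
    # Hypothetical network rule: only the holder of the public name whose
    # hash matches the address prefix may store at that address.
    return address[:HALF] == _digest(requester_public_name)[:HALF]
```

Anyone can compute `flat_address("Mark", "useful-information")` to read the file, but under this rule only "Mark" could create it. One tradeoff is that each half is only 128 bits, so the collision margin per part is smaller than for a full 256-bit hash.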
It's not impossible, but it would require this to be built into the network. This isn't what the old API allowed, so you could create a topic to advocate for this.
Yes, maybe it would be worth making a proper suggestion.
For a while there I was thinking type tags didn't have that power to open up a whole new address scheme.
For a first implementation, I'm just planning to keep the design as simple as possible. After that is working, we can think of possible optimizations.
Do you mean immutable data files? The current thinking is that inode metadata entries in the tree will store an XorName representing a file. This is actually the same as, or similar to, the current FilesContainer design. We can't hash the XorName to a smaller (e.g. 16-bit) value, because a) that loses data, so we couldn't actually find the file, and b) there is a chance of hash collisions.
It "could" store the data_map itself, which can represent a dir or a file. It will be unique (hash of content), so stored like a blob. We just need to watch for files smaller than 3 KB though. We have a few options, but just actually encrypting those may be enough? Needs some thought.
Yes, definitely more thought/design will be needed here, but we know we have some options. I plan to look at that in more detail once basic fuse+file_api+crdt_tree is demonstrated.
Content of very small files could even be stored directly in the inode. In this case, they would not be directly accessible on the network via SafeUrl, so that's a tradeoff.
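To make the tradeoff concrete, here is a sketch of an inode entry that holds either a pointer or inline content. Everything here is an assumption for illustration: the `Inode` shape, the 3 KB threshold borrowed from the figure mentioned above, and SHA3-256 of the content standing in for a real XorName.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

INLINE_LIMIT = 3 * 1024  # assumption: inline anything smaller than 3 KB


@dataclass
class Inode:
    name: str
    xorname: Optional[bytes] = None  # pointer to content stored on the network
    inline: Optional[bytes] = None   # content kept directly in the inode


def make_inode(name: str, content: bytes) -> Inode:
    if len(content) < INLINE_LIMIT:
        # Small file: no separate network object, so no SafeUrl of its own.
        return Inode(name=name, inline=content)
    # Larger file: the inode only stores the content-derived address.
    return Inode(name=name, xorname=hashlib.sha3_256(content).digest())
```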
Ooh… that just sparked a thought.
Could the network handle a URL type that leads nowhere but either exists or does not? xorurls are almost long enough to be useful as data in and of themselves. I don't know about making them per user to avoid clashes, but those could be very useful if they were of the order of 64 B/chars available as simple data points.
A risk perhaps of those being dust and spam… depends how it was implemented… assuming that it could be.
No, I meant mutable specifically, on the basis that we can choose the address of mutable data when we create it, as opposed to immutable where it's obviously dictated by the content.
The requirements of the use case I was thinking of would be quite different I suppose from the filesystem usage, which needs to be pretty flexible. I was thinking for the example of a search index (inverted index) the tree could be quite static, but point to mutable containers where all the action would take place.
If you'll bear with me I'll try and explain what I meant, though I may be hopelessly oversimplifying the idea of trees - I've not quite got my head round the different types yet! But here goes, to store the words A, AN, ANT and AND:
A (27)
  N (16)
    T (34)   D (58)
I've just put 2-digit numbers here, but theoretically 2 bytes could give us 256x256 random possibilities for the value at each node (I think).
When the new file is created, we create its address by hashing e.g. AND + 58 + public name of owner, so there is no information loss, at least not in quite the sense you were thinking.
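The scheme above can be sketched as follows, using the example tree for A, AN, ANT and AND. SHA3-256, the "+" separator and the function names are assumptions; the owner's public name is passed in by the caller.

```python
import hashlib

# Each node maps a letter to (node value, children), as in the example tree.
TREE = {"A": (27, {"N": (16, {"T": (34, {}), "D": (58, {})})})}


def node_value(tree: dict, word: str) -> int:
    """Walk the tree letter by letter and return the value at the last node."""
    value, children = None, tree
    for letter in word:
        value, children = children[letter]
    return value


def file_address(word: str, owner: str) -> bytes:
    # Address = hash of word + node value + owner's public name.
    value = node_value(TREE, word)
    return hashlib.sha3_256(f"{word}+{value}+{owner}".encode("utf-8")).digest()
```

Anyone holding the tree can recompute the address of the mutable file for "AND", but only by first reading the value 58 from the tree, which is what ties the file's location to the "signpost" structure.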
To finish off, I was then thinking that after the file gets to a certain size, new nodes would point to new trees, instead of straight to the file, to keep the download size down.
Probably nothing in it as an idea, but thought I might as well explain what I meant at least!
@david-beinn thx for the explanation.
So iiuc you are contemplating using a crdt_tree to represent the actual content of the file. Thus far, our design is only using it for the directory structure and file metadata. Partly this is because SAFE already has infrastructure/design around immutable data, so it seems best to leverage that.
In a fully clean room design, all avenues for storing file content could be explored. There are various algorithms that have been designed for collaborative text editing that might work for ASCII files at least, e.g. Logoot, LSEQ, Treedoc, WOOT, RGA.
I'm not ruling anything out at this point.
I guess you could look at it like that, but the crdt tree "file" is only being used to point to other files, in the same way that a directory structure is. The files it's pointing to could be any kind of (mutable) data.
Perhaps I confused things by saying store the "words" when these are really more akin to filenames in the way I'm thinking. In the search index example above, hash of (ANT + 34 + publicname) leads us to an address where the (mutable) file "ANT" has a list of all the websites containing references to ants.
An obvious problem for a lot of use cases is that "34" has to always be the value of that node, because the location of the mutable data file stays constant - if it was a file structure you would never be able to move a file within it. To put it another way, the location of a file is intrinsically derived from and linked to the "signpost" file.
I was generating this idea when I'd given up on being able to get to addresses more directly by hashing. I think perhaps for the kind of use cases I have in mind, the following idea might be more promising: (quoted from my post above)
Thinking further on this, does the fact that every data object requires a unique address on the network mean that using namespaces becomes impossible, because effectively every variable, or at least those that are written to disk rather than held in memory, is global?