Safe Network Dev Update - August 27, 2020

So the hash of the name is the address? How do you avoid collisions?

If it exists you cannot create it. A bit like DNS.

There are other approaches as well: for example, a hash of owner + type is much harder to collide. But collision avoidance is as simple as trying to store.
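
To make the "try to store" rule concrete, here is a minimal sketch in Python. It is not the real network API: `ToyNetwork`, `AddressTaken`, both address helpers and the choice of SHA3-256 are assumptions made purely for illustration.

```python
import hashlib


class AddressTaken(Exception):
    """Raised when data already exists at the requested address."""


class ToyNetwork:
    """Hypothetical in-memory stand-in for the network: one value per address."""

    def __init__(self):
        self._store = {}

    def put(self, address: bytes, data: bytes) -> None:
        # Collision avoidance is just "try to store": refuse if occupied.
        if address in self._store:
            raise AddressTaken(address.hex())
        self._store[address] = data

    def get(self, address: bytes) -> bytes:
        return self._store[address]


def address_from_name(name: str) -> bytes:
    """Address = hash of the name (SHA3-256 is an assumption here)."""
    return hashlib.sha3_256(name.encode("utf-8")).digest()


def address_from_owner_and_type(owner: str, type_tag: int) -> bytes:
    """Alternative: hash owner + type, so different owners land on different addresses."""
    material = owner.encode("utf-8") + b"\x00" + str(type_tag).encode("ascii")
    return hashlib.sha3_256(material).digest()
```

A second `put` to the same address raises `AddressTaken`, which is the DNS-like behaviour described above: if it exists, you cannot create it.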

I get that with say a person selecting a URL for a website, but what if an app wants to generate a mutable data item? Does it just have to keep on trying til it finds an address that hasn’t been used, or must app developers be persuaded to always use very long identifiers for data they create to lessen the risk of collisions?

Another quick question on this subject. Is there always just one level of xor addresses, or can there be multiple levels differentiated by type-tag?

so for example is it possible to have

123…xyz : type-tag 1500 and
123…xyz : type-tag 1501

as two separate addresses on the network? Or is the type-tag just to give additional information about how the data should be processed?

Currently the name is the address in types like Sequence (also in the previous AppendOnlyData), and it’s chosen randomly if the user doesn’t provide a xorname. E.g. at the moment in the CLI, providing such an optional xorname is only possible for Sequence creation: $ safe seq store --xorname <xorname hex> <data>. You should get an error if you have a collision; eventually we can have the CLI retry with a new address in such scenarios.
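
The retry behaviour mentioned above could look roughly like the following sketch. Nothing here is the actual CLI or client code: `store_sequence`, the 32-byte name length, the `max_attempts` parameter and the dict-backed store are all hypothetical.

```python
import secrets
from typing import Callable, Dict, Optional

XOR_NAME_LEN = 32  # bytes; assumed length of a xorname


class AddressTaken(Exception):
    """Raised when the chosen address is already occupied."""


def store_sequence(
    store: Dict[bytes, bytes],
    data: bytes,
    xorname: Optional[bytes] = None,
    max_attempts: int = 5,
    random_name: Callable[[], bytes] = lambda: secrets.token_bytes(XOR_NAME_LEN),
) -> bytes:
    """Store data at a user-chosen xorname, or at a random one with retries."""
    if xorname is not None:
        # The user picked the address: a collision is reported, not retried.
        if xorname in store:
            raise AddressTaken(xorname.hex())
        store[xorname] = data
        return xorname
    for _ in range(max_attempts):
        candidate = random_name()
        if candidate not in store:
            store[candidate] = data
            return candidate
    raise AddressTaken("no free address found after %d attempts" % max_attempts)
```

With random 256-bit names the retry loop should almost never run more than once, but it keeps the caller from having to handle a collision by hand.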

This is how it used to be (yes, each tag refers to a different xor address space), but we’ll have to wait to see how the new data types will work.
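
As a rough illustration of "each tag is its own address space", the sketch below keys a store on the pair (type tag, xorname). `TaggedStore` is a hypothetical toy, and whether the new data types behave this way is exactly the open question above.

```python
from typing import Dict, Tuple

Address = Tuple[int, bytes]  # (type_tag, xorname)


class TaggedStore:
    """Toy store where the type tag is part of the address."""

    def __init__(self) -> None:
        self._items: Dict[Address, bytes] = {}

    def put(self, xorname: bytes, type_tag: int, data: bytes) -> None:
        key = (type_tag, xorname)
        if key in self._items:
            raise KeyError("address already in use for type tag %d" % type_tag)
        self._items[key] = data

    def get(self, xorname: bytes, type_tag: int) -> bytes:
        return self._items[(type_tag, xorname)]
```

Under this model the same xorname with type tags 1500 and 1501 names two independent items.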

Yes, but it isn’t as bad as it sounds:

  • you can convert short names into long, effectively random strings (cf. DNS)
  • the address space is so large that if the addresses are reasonably random, clashes will be so rare you are unlikely to ever encounter them (although all good programmers should cater for them!)

e.g. by adding a salt or random number and hashing?
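
A minimal sketch of that salt-and-hash idea, assuming SHA3-256 and a 16-byte salt (both choices, and the function names, are illustrative only):

```python
import hashlib
import secrets


def new_salt() -> bytes:
    """Fresh random salt; 16 bytes is an arbitrary choice for this sketch."""
    return secrets.token_bytes(16)


def derive_address(short_name: str, salt: bytes) -> bytes:
    """Turn a short identifier into a long, effectively random address."""
    # Length-prefix the salt so different (salt, name) pairs cannot be confused.
    material = len(salt).to_bytes(2, "big") + salt + short_name.encode("utf-8")
    return hashlib.sha3_256(material).digest()
```

The catch is that anyone who needs to find the data again must know the salt as well as the short name, so the app has to keep the salt somewhere it can be looked up.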

Thanks again Mark,

Just brainstorming, but that might theoretically allow a type tag where the first part of the address was derived from a person’s public name (and thus reserved for only that public name), and the second part derived from a file name. This could allow a type of ‘flat’ filesystem that could be accessed without ever downloading an index of pointers.

So if you had a public file called ‘useful-information’ at {hash of ‘Mark’} + {hash of ‘useful-information’}, anybody could access that by knowing the formula, but nobody could squat it either.
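
The formula could be sketched as below. The even 16 + 16 byte split, the use of SHA3-256 and the function names are assumptions for illustration, not anything the network defines.

```python
import hashlib

HALF = 16  # bytes per part; the even split is an assumption


def _half_hash(text: str) -> bytes:
    """First HALF bytes of the SHA3-256 hash of the text."""
    return hashlib.sha3_256(text.encode("utf-8")).digest()[:HALF]


def owner_prefix(public_name: str) -> bytes:
    """Part of the address that depends only on the public name."""
    return _half_hash(public_name)


def flat_file_address(public_name: str, file_name: str) -> bytes:
    """{hash of public name} + {hash of file name}, as one 32-byte address."""
    return owner_prefix(public_name) + _half_hash(file_name)
```

Anyone who knows the formula can compute `flat_file_address("Mark", "useful-information")` directly, with no index to download. Note that the formula alone does not stop someone else from writing to that address first.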

If people can discover your method they can squat, at a cost, so it isn’t a solution for that. Only for creating a way to look data up in a very large address space. If you want to prevent squatting you either have to keep the mechanism secret (not possible) or have some kind of gatekeeping (e.g. lookup in an index which a squatter cannot write to).

The idea is that with a particular type tag, the network would prevent anybody except ‘mark’ from storing information at an address beginning with (hash of ‘Mark’). Same for addresses beginning with (hash of ‘David’) on that type tag.

Whether that’s desirable for the network to do I don’t know, but I can’t see that it’s necessarily impossible.
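
For what it is worth, the gatekeeping rule being proposed can be sketched in a few lines. This is entirely hypothetical: `PrefixGuardedStore`, the 16-byte prefix and the string owner keys are invented here to illustrate the idea, and no existing API behaves this way.

```python
import hashlib
from typing import Dict

PREFIX_LEN = 16  # bytes of the address reserved for the public name (assumed)


def name_prefix(public_name: str) -> bytes:
    return hashlib.sha3_256(public_name.encode("utf-8")).digest()[:PREFIX_LEN]


class PrefixGuardedStore:
    """Toy model of one type tag whose address prefixes are owned."""

    def __init__(self) -> None:
        self._owners: Dict[bytes, str] = {}  # prefix -> owner key
        self._data: Dict[bytes, bytes] = {}

    def register_name(self, public_name: str, owner_key: str) -> None:
        prefix = name_prefix(public_name)
        if prefix in self._owners:
            raise PermissionError("public name already registered")
        self._owners[prefix] = owner_key

    def put(self, address: bytes, data: bytes, requester_key: str) -> None:
        # Only the registered owner of the prefix may write under it.
        if self._owners.get(address[:PREFIX_LEN]) != requester_key:
            raise PermissionError("prefix not owned by requester")
        if address in self._data:
            raise KeyError("address already in use")
        self._data[address] = data
```

The check in `put` is the part that would have to live inside the network itself: it refuses writes under a prefix unless the requester owns the matching public name.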

It’s not impossible, but it would require this to be built into the network. This isn’t what the old API allowed, so you could create a topic to advocate for this.

Yes, maybe it would be worth making a proper suggestion.

For a while there I was thinking type tags didn’t have that power to open up a whole new address scheme.

For a first implementation, I’m just planning to keep the design as simple as possible. After that is working, we can think of possible optimizations.

Do you mean immutable data files? The current thinking is that inode metadata entries in the tree will store an XorName representing a file. This is actually the same as, or similar to, the current FilesContainer design. We can’t hash the XorName to a smaller (e.g. 16-bit) value, because a) that loses data so we couldn’t actually find the file and b) there would be a chance of hash collisions.
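
To put a rough number on point b), the standard birthday approximation can be used. The function below is only that approximation, not anything from the codebase.

```python
import math


def collision_probability(items: int, bits: int) -> float:
    """Approximate chance that at least two of `items` random values collide
    in a space of 2**bits values (birthday approximation)."""
    space = 2 ** bits
    exponent = items * (items - 1) / (2 * space)
    # 1 - exp(-x), computed with expm1 so tiny probabilities are not rounded to 0.
    return -math.expm1(-exponent)
```

With 16-bit values the odds of a clash reach about even at roughly 300 files, whereas with 256-bit names the probability stays negligible even for a trillion items.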

It “could” store the data_map itself, which can represent a dir or a file. It will be unique (hash of content), so it would be stored like a blob. We just need to watch for files smaller than 3 KB though. We have a few options, but just actually encrypting those may be enough? Needs some thought.

Yes, definitely more thought/design will be needed here, but we know we have some options. I plan to look at that in more detail once basic fuse+file_api+crdt_tree is demonstrated.

Content of very small files could even be stored directly in the inode. In this case, they would not be directly accessible on the network via SafeUrl, so that’s a tradeoff.
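
The tradeoff might be modelled like this. It is a sketch only: the `Inode` fields, the 3 KB threshold (borrowed from the small-file remark earlier in the thread) and hashing the raw content as a stand-in for real immutable-data storage are all assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional

INLINE_THRESHOLD = 3 * 1024  # bytes; assumed cut-off for "very small" files


@dataclass(frozen=True)
class Inode:
    name: str
    size: int
    xorname: Optional[bytes] = None  # set when content lives on the network
    inline: Optional[bytes] = None   # set when content lives in the inode

    @property
    def network_addressable(self) -> bool:
        """Only content stored under its own xorname could have its own URL."""
        return self.xorname is not None


def make_inode(name: str, content: bytes, blobs: Dict[bytes, bytes]) -> Inode:
    if len(content) < INLINE_THRESHOLD:
        return Inode(name, len(content), inline=content)
    # Stand-in for immutable data: address derived from the content itself.
    xorname = hashlib.sha3_256(content).digest()
    blobs[xorname] = content
    return Inode(name, len(content), xorname=xorname)
```

Inlined files cost no extra network fetch, but `network_addressable` is false for them, which is the tradeoff mentioned above.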

ooh… that just sparked a thought.

Could the network handle a url type that leads nowhere but either exists or does not? xorurls are almost long enough to be useful as data in and of themselves??.. I don’t know about making them per user, to avoid clashes but those could be very useful, if they were of the order of 64B/chars available as simple data points.

A risk perhaps of those being dust and spam … depends how it was implemented … assuming that it could be.

No, I meant mutable specifically, on the basis that we can choose the address of mutable data when we create it, as opposed to immutable where it’s obviously dictated by the content.

The requirements of the use case I was thinking of would be quite different I suppose from the filesystem usage, which needs to be pretty flexible. I was thinking for the example of a search index (inverted index) the tree could be quite static, but point to mutable containers where all the action would take place.

If you’ll bear with me I’ll try and explain what I meant, though I may be hopelessly oversimplifying the idea of trees - I’ve not quite got my head round the different types yet! But here goes, to store the words A, AN, ANT and AND:

               A (27)

       N (16)

T (34) D (58)

I’ve just put numerical 2 digit numbers here, but theoretically 2 bytes could give us 256x256 random possibilities for the value at each node (I think.)

When the new file is created, we create its address by hashing eg. AND + 58 + public name of owner, therefore no information loss, at least not in quite the sense you were thinking.
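
Here is one way the scheme could be sketched, using the example words and node values from the diagram. The `Node` class, the separator byte and SHA3-256 are assumptions for illustration, and a plain in-memory trie stands in for whatever tree type would really hold the nodes.

```python
import hashlib
import secrets
from typing import Dict, Optional


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.value: Optional[int] = None  # 2-byte value, set when a word ends here


def insert(root: Node, word: str, value: Optional[int] = None) -> int:
    """Add a word; its node gets a random 2-byte value unless one is supplied."""
    node = root
    for letter in word:
        node = node.children.setdefault(letter, Node())
    if node.value is None:
        node.value = secrets.randbelow(256 * 256) if value is None else value
    return node.value


def lookup(root: Node, word: str) -> Optional[int]:
    node: Optional[Node] = root
    for letter in word:
        node = node.children.get(letter)
        if node is None:
            return None
    return node.value


def container_address(word: str, value: int, owner: str) -> bytes:
    """Address of the mutable container = hash(word + node value + public name)."""
    material = "\x00".join([word, str(value), owner]).encode("utf-8")
    return hashlib.sha3_256(material).digest()
```

For example, after `insert(root, "AND", 58)`, a reader walks the tree to find 58 and then computes `container_address("AND", 58, owner)` to reach the container.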

To finish off, I was then thinking that after the file gets to a certain size, new nodes would point to new trees, instead of straight to the file, to keep the download size down.

Probably nothing in it as an idea, but thought I might as well explain what I meant at least!

@david-beinn thx for the explanation.

So iiuc you are contemplating using a crdt_tree to represent the actual content of the file. Thus far, our design is only using it for the directory structure and file metadata. Partly this is because SAFE already has infrastructure/design around immutable data, so it seems best to leverage that.

In a fully clean-room design, all avenues for storing file content could be explored. There are various algorithms that have been designed for collaborative text editing that might work for ASCII files at least, e.g. Logoot, LSEQ, Treedoc, WOOT, RGA.

I’m not ruling anything out at this point.

I guess you could look at it like that, but the crdt tree ‘file’ is only being used to point to other files, in the same way that a directory structure is. The files it’s pointing to could be any kind of (mutable) data.

Perhaps I confused things by saying store the ‘words’ when these are really more akin to filenames in the way I’m thinking. In the search index example above, hash of (ANT + 34 + publicname) leads us to an address where the (mutable) file ‘ANT’ has a list of all the websites containing references to ants.

Obvious problem for a lot of use cases is that ‘34’ has to always be the value of that node, because the location of the mutable data file stays constant - if it was a file structure you would never be able to move a file within it. To put it another way, the location of a file is intrinsically derived from and linked to the ‘signpost’ file.

I was generating this idea when I’d given up on being able to get to addresses more directly by hashing. I think perhaps for the kind of use cases I have in mind, the following idea might be more promising: (quoted from my post above)

Thinking further on this, does the fact that every data object requires a unique address on the network mean that using namespaces become impossible because effectively every variable, or at least those that are written to disk rather than held in memory, is global?