Pre-RFC: Linkable Data Structure

knuppe · January 5, 2016, 1:21am

I started writing a proposal to improve support for dynamic data in the SAFE network and I would like some opinions on my idea.

The basic idea is to link the structured data as a parent element and implement some methods to retrieve the child identifiers/XorName using the root identifier. Something like get_child_sd(root_identifier, offset, length) -> Vec<(XorName, PubKey)>

@dirvine propose a RFC to append the data in the main data structure, but I think this design has a limitation… The appended data will always be transferred with the structured data. This can be potentially slow/inefficient.

Here is a early^2 draft of the RFC.

Thanks in advance (=

neo · January 5, 2016, 2:25am

What about using 2 SD’s for each element. First is owned by the APP for the linking info and the other is owned by say the APP & user and contains the info.

This allows the structure to be flexible and defined by the APP and allows the user to modify the data. EG a database element.

Drawbacks is it is expensive requiring 2 SD’s per element

Pros: it can be done using the current system and allows different apps (databases) to implement their own “linking” structure and not be tied to a “network defined” linking

knuppe · January 5, 2016, 10:20am

[quote=“neo, post:2, topic:6598, full:true”]
What about using 2 SD’s for each element. First is owned by the APP for the linking info and the other is owned by say the APP & user and contains the info.[/quote]
The problem though, is security. If the “linking” logic is in the client side (app or web)… It will be very easily compromised/faked.

tfa · January 5, 2016, 11:06am

We don’t need a new data structure to handle hierarchical data. Current SD can handle this by using a name composed of parent name concatenated with child number. For example, a simple tree of names could be:

Name
- Name-0
  - Name-0-0
  - Name-0-1
- Name-1
  - Name-1-0
  - Name-1-1

Like all SDs the name that must be passed to the create function is the hash of this name. The unhashed name has to be stored in the data part so that we can retrieve the parent and the children of each node.

For example: if the unhashed name of current node is “Topic1234-4-5” then:

You hash “Topic1234” to retrieve the root node
You hash “Topic1234-4” to retrieve the parent node
To loop over the children, you increment a counter N and for each value you hash “Topic1234-4-5-N”. You stop the loop when you get nothing.

knuppe · January 5, 2016, 11:18am

To add new data the user must fetch ALL the data, just the get the number of elements in the node… Even so, we are talking about a huge concurrent system… Imagine some system like “SAFEreddit”, this link logic will be broken in seconds

knuppe · January 5, 2016, 11:27am

Here is some design concepts/ideas.

Root Element

I’m the root element, and I allow other “users” to relate data to me [here is blacklist/whitelist - optional].
The maximum of child levels is one. [optional]
The users can link only one reply/data to me.[optional]
The users can’t update/delete the linked data.[optional]

Child Element 1

I’m linked to the root
The network can store me in the linking data because my signature is valid, I’m aware (and respect) the rules from root.

Child Element 2

I’m linked to the child element 1 (CE1)
I can’t be linked by the network to the element because the root element does not allow two levels of data.

…

tfa · January 5, 2016, 11:40am

You can store the number of children in the parent node if that’s you want. I didn’t do it because:

most of the times you only need to loop over the children
adding a child doesn’t modify the parent

But if you need to get the number of children and a node can have many children then just add it in the data part.

knuppe · January 5, 2016, 12:31pm

Sorry to disagree with you, but number of elements cannot be stored in the parent node, because, the parent node is encrypted/signed by a user and the child node by another user.

The problem is not to store the data, is to retrieve the data in a robust way. For instance, if two users add the key in the same time “node-0-1” (node zero, child 1), what happens to the data? How can we retrieve all of them?

happybeing · January 5, 2016, 12:34pm

To do this the parent has to be writable by anyone creating a child - this is the problem @knuppe highlights as making the whole structure vulnerable.

One of the aims here is to allow multiple (untrusted) people to be able to append data (eg add comments to a blog post, or make replies to a comment) without them being able to break any other part of the structure (accidentally or deliberately) .

This is the key challenge of this new decentralised approach - and it’s hard, so please keep trying!

neo · January 5, 2016, 12:59pm

You could then have the core network update the linkages SDs for the APP

Even if you have a core method of doing the linkage using whatever method, the APP still has to identify itself and prove its right to add/subtract to the linked structure. So if you use my initial idea, or a core implementation of any sort then it can be faked because the APP supplies the authentication and people could extract that from the APP, or fork the APP to do as they please.

What you really need is a method for an APP to authenticate itself without the authentication being revealed to the PC running the APP. Some sort of APP registration that the core network recognises and uses to lock/own the linkage SDs.

The APP developer would then use his keys to register the APP, and subsequent updates and the core network uses the APP registration to unlock the SDs. That way the keys to decode the SDs is never sent to the APP running or stored with the APP and the linkages SDs are owned by the developer whose keys were used to register the APP.

So as far as I can see, its not a special SD linkage you need but a method for the core network to register an APP and let it have keys that are never sent to/from the PC running the APP

tfa · January 5, 2016, 1:48pm

In this case you don’t store the children count in the parent node and then parent node doesn’t need to be modified when a child is added. This was my initial proposal in my first post here.

Adding the children count was only for @knuppe who didn’t want to do this:

But it was clear that I prefer not storing the children count. This way there is no redudancy and the structure cannot be broken.

knuppe · January 5, 2016, 2:11pm

In my opinion, there is no point to create a APP registration. A well-designed APP, will store the data with some unique key (eg. XorName("{APP-GUID} + {content body} or {other 'unique' key logic}")), also the structured data has a tag property, that allows the APP developer define what “type” the data is. To the network, this object contents is just a hashed key with some encrypted value and the network don’t care about it’s contents…

With this proposal, we are just saying that this piece of data “A” is linked to piece of data “B” by some rules defined by the root element.

Let’s suppose that we have the “SAFEreddit” scenario, and we have a topic with 8894 comments:

Why we need to fetch all the 8894 comments to store a new comment?
How the APP knows, what are the child keys related to the topic (without a link system)?
How the APP shows the string “8894 comments” in the post?

tfa · January 5, 2016, 2:13pm

The first user will win. The second user will have to try “node-0-2”, and then “node-0-3” … until the put succeeds.

As I said in my first post to retrieve all the children of a node you loop over them by incrementing a counter.

To get a complete tree, you do this recursively from the root node or any other intermediate node depending on what your are looking for.

et4te · January 5, 2016, 2:17pm

I’m not sure it makes sense to add a Linked / Appendable type to routing given that there is a structured data type which is designed specifically for the purpose of defining custom data types. For example, the NFS example code uses a composite directory listing type which it defines in terms of the StructuredData type provided by routing, the same concept could be applied to lists.

The unified structure RFC also details that this is a way of creating data types which are secure by default in that they are verified at every modification to protect against various forms of attacks.

As for updating two structured data elements at the same time by two different users, I don’t see how this would be a problem given that only incremented versions can be ‘put’ in the network, so two items with the same version cannot both be added, one will fail. I am of the opinion that the client should be responsible for ensuring that the version which is ‘put’ gets updated appropriately rather than the routing. (as @tfa seems to be suggesting)

et4te · January 5, 2016, 2:44pm

@knuppe In your SAFEreddit example,

It wouldn’t be possible to fetch the 8894 comments unless they were in combination smaller than 100kb given the design of StructuredData dictates that this is the size limit. So you fetch the data item which points to the comments but never ask for the chunks unless you want to do that (e.g when you are listing the comments).
After closer inspection of the datamap data structure: http://maidsafe.net/safe_core/master/self_encryption/datamap/index.html, it seems that there is no concept of keys here so in reality you would have to fetch all of the chunks to reconstruct a keyed data structure. You could simply store the keys as a separate put if you didn’t want to reconstruct all the chunks to get to them.

tfa · January 5, 2016, 2:56pm

Because he has to know the next child number to use. But the search can be optimized by not doing a linear search: the successive tries could be 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 12288, 10240, 9216, 8704, 8960, 8832, 8911, 8886, 8898, 8892, 8895, 8893, 8894 (in bold the lookups that succeed)

Here he knows the next index to try is 8894. If the put fails (because another user added a comment in the meantime) then he tries 8895, 8896, … until it works.

There is a link system, but it is implicit with the name (stored in the data part): Dummy-3-2 is the third child of Dummy-3, which itself is the fourth child of Dummy.

happybeing · January 5, 2016, 4:00pm

We’re going round in circles here. Obviously the above doesn’t scale, and your solution to that was to store the count, which doesn’t solve the problem of scale for the reasons @knuppe and I pointed out above.

We need something that scales, and works, unless we’re only talking about a small subset of applications.

knuppe · January 5, 2016, 4:08pm

@tfa, this lookup system will be possible but not more efficient, imagine hundreds of users using this lookup technique just to discover a simple/common information like “number of comments of a single post”… This is just a very simple sample, imagine if each comment has more information like “number of likes”, “number of up/down votes” or the “reputation of the commenter”, etc.

This thing will consume the entire network just with lookups. I have the feeling that such logic to link the data should be in the core of the network, not in the application layer.

@et4te, about the datamap: In my thoughts yesterday, I designed a new map called LinkedData that has the root identifier and a vector of child elements + their public keys. When a “user” adds a child element, it references the parent identifier in the SD, so then a network manager (or other authority), reads this parent identifier during the Put event and append the child identifier to the LinkedData object (here we can design a consensus system or algorithm to handle this).

@happybeing, perfect you got the idea!

et4te · January 5, 2016, 7:38pm

I can’t shake the feeling that this would be a better fit with a memory based resource rather than disk. There I would imagine it making sense to have lists, tuple spaces and all kinds of nice data structures which naturally map to memory.

Persisting things in memory would be a really cool addition to safe_vault too and could form the beginning of a compute layer, where a ‘transient’ memory use combined with CPU could be used to compute arbitrary computations. I think an argument could be made to keep the disk primitives as simple as possible, since ultimately this will be the slowest form of persistence and would fit better with files.

I’m having a look at safe_vault at the moment in order to see how the chunk_store can be re-created for a memory persistence model. Then its a matter of weighting the value of memory relative to disk (clearly it is a more expensive resource).

neo · January 6, 2016, 1:11am

As soon as the APP running on the PC uses its keys then the user can extract (hack) those keys for ownership of the SD data and scramble your linkage.

That is the inherent problem with an APP accessing its SD data for writing, it has to expose its ownership keys for the SD data in order to instruct the network to write/rewrite SD data.

And I was under the impression that this is the problem you were trying to fix by getting the core network to implement a linkage system. But the problem is that if your APP uses its keys while running on a users PC then a USER has the opportunity to extract those security keys to scramble your linkage/database or whatever you are building in SD data.

It does not matter what you put into the core network if the APP which runs on a users PC uses its security keys to ask the network to do that work. That is why if you create a core network system that registers APPs then the core network can do the (re)writes without requiring the keys to the ownership of the SD data elements be sent by the APP. Any forks/mods to the APP then is a new APP and thus has different registration.

I am assuming you understand that APPs do not run on a protected server (there are none), but on the users PC.

Topic		Replies	Views
RFC: Dynamic Data Support RFCs	13	2014	April 5, 2016
RFC - Unified Structured Data RFCs	12	3306	September 22, 2015
[RFC] Data Hierarchy Refinement RFCs	22	2470	January 31, 2020
Simple Structured Data implementation with new routing Features	0	770	August 14, 2016
New data type - distributed set (multi value per key) Features	3	987	September 3, 2016

Pre-RFC: Linkable Data Structure

Related topics