[RFC] Data Types Refinement

This has been a key thought throughout: directories and files are really insanely poor, and we need something like the semantic web, but one that works. I suspect this is an area where we as humans are rubbish at categorising, and I really hope computers/AI etc. will do that job for us. Not searches as per Google, but knowledge gathering in a new form. Don’t ask me what yet though :wink:

8 Likes

Maybe a bit off-topic here, but speaking about organising directory paths and files, this was trending on GitHub this week (a Rust cmdline program):

3 Likes

Having only skimmed the proposal, I’d say it looks good.

FWIW to add potentially relevant information:

I’ve also encountered dictionary used to refer to the map structure; perhaps to avoid collision with map, the common function?

Separately, is list a possible alternative to sequence?

3 Likes

Hi @anon57419684,
Thanks for reading through the proposal, and good to hear your input.

This has certainly been on the map (pun intended), as I’ve used C# a lot (where Dictionary is used).
But given that Map is shorter, and is used in Rust for basically the same thing, it felt like it had better potential in this case. I haven’t really been bothered by any collision while implementing this.

4 Likes

Changelog: Data Types RFC

2020-01-21

  • Replaced Guarded types with per-request configuration of concurrency control.

Note: If you strongly disagree with any of the above updates, please discuss it in the forum topic, for a possible revert or other change.


Simplify the optimistic concurrency

(…and the data types, even further)

Background

Currently we have two distinct types for the optimistic concurrency check. This means that once you create an instance, it is fixed to one of those configurations for its lifetime.

One could ask why this design was chosen.

Let’s look at what the reason for optimistic concurrency is:

When multiple writers operate on the same location, you have a desire to not overwrite a value that you didn’t intend to overwrite. That means, if you read value 5 and want to increment it and write 6, you only want to store that 6 as long as the original value 5 has not changed in between. Otherwise, if another writer stored their operation in between, you would overwrite that change and lose information.
This is called a race condition.
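
To make the race concrete, here is a minimal sketch of the usual read-modify-write loop with a version check, assuming a hypothetical in-memory Store with versioned writes (none of these names are from the proposal):

enum StoreError {
    VersionMismatch, // someone else wrote in between
}

struct Store {
    value: u64,
    version: u64,
}

impl Store {
    fn read(&self) -> (u64, u64) {
        (self.value, self.version)
    }

    // The write only succeeds if the stored version still equals `expected`.
    fn write(&mut self, new_value: u64, expected: u64) -> Result<(), StoreError> {
        if self.version != expected {
            return Err(StoreError::VersionMismatch);
        }
        self.value = new_value;
        self.version += 1;
        Ok(())
    }
}

// Read 5, write 6 -- but only if nobody wrote in between; otherwise reload
// and retry, so no intermediate write is silently overwritten.
fn increment(store: &mut Store) {
    loop {
        let (value, version) = store.read();
        match store.write(value + 1, version) {
            Ok(()) => break,
            Err(StoreError::VersionMismatch) => continue, // lost the race, retry
        }
    }
}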

Now, let’s look at this in our context.

  • Who are we doing this for?
  • What are we protecting?

In our capacity as one of the writers, we do this for ourselves, because we don’t want to overwrite some value unintentionally. The optimistic concurrency doesn’t protect the data from being corrupted by the other writers. They can still just write anything they want. They just reload until they have the correct version and can send it in with whatever value they want.

So, with optimistic concurrency control, we are working under the assumption that all writers share the same desire to not overwrite the data of someone else, and therefore carry out the correct operations to do so.

The problem

In the example above, we describe how a writer wanting to bypass the optimistic concurrency just has to keep fetching the version and retrying the write, while the request is rejected, until it passes.

In fact, we have an application doing exactly this in our code today: the Authenticator. It does this because the need for optional concurrency control is there, but we have designed it as if it isn’t. So we circumvent it in code instead. As explained here:

“[For the safe authenticator] we may authorise two apps from different devices at the same time. App auth is versioned. So one will pass and one will fail. From an authenticator (not API) POV we’d like to recover from the error by retrying using the next version.”

So, we have introduced the concept of data type flavours, implemented two additional data type configurations denoted by these flavours, increased the API footprint, and bloated the end user (as in developer) experience. All of this to try to enforce a certain usage pattern, based on the false assumption that it’s the only one they need.

Already before release, we have proved ourselves wrong with the implementation in the Authenticator, which has to bypass the concurrency control to work correctly.

So, what could we do differently?

Solution

Do not fix the concurrency control of an instance over its entire lifespan; instead, pass the optimistic concurrency check as a parameter to operations.

It’s as simple as adding an ExpectedVersion enum parameter, with one of the variants being Any, which would indicate that we will write regardless of the version.

pub enum ExpectedVersion {
    Any, // this means concurrency check is OFF
    Specific(u64), // this means concurrency check is ON
}
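
To illustrate, an operation could then take the expected version per call. A minimal sketch reusing the ExpectedVersion enum above; the trait, method, and error names are assumptions for the example, not the actual API:

pub trait MapWrite {
    /// Inserts `value` at `key`, subject to the caller's concurrency choice:
    /// `Specific(v)` is rejected unless the key is currently at version `v`,
    /// while `Any` writes unconditionally.
    fn insert(
        &mut self,
        key: Vec<u8>,
        value: Vec<u8>,
        expected_version: ExpectedVersion,
    ) -> Result<(), VersionMismatch>;
}

/// Returned when a `Specific(v)` expectation doesn't match the stored version.
pub struct VersionMismatch {
    pub expected: u64,
    pub actual: u64,
}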

Why is this safe to do?

Because of what we described above: the concurrency control is meant for us, the current writer. We are not preventing someone else from circumventing it. Thus we can just as well simplify and widen the capability, and the end result is the same; the other writers still have to execute code correctly for the concurrency control to work. If it’s not in their interest, then the original concurrency control was no help either; they are rogue players with write access (so you messed up, or were hacked). And we will do it because it is in our interest.

Implementation status

Until we order mutations with PARSEC at the data handlers (on the “after Phase2a” list), the versions might not be the same in all the vaults.

That is to say, version handling at the network requires more work, and until then it might not work as expected.

While this is the case, the ExpectedVersion::Any variant might be disabled, or implemented later, in case we want to avoid that uncertainty while waiting for the work to finish.

Existing implementations

In event sourcing databases, where we have concurrent writers to streams with versions, this has been standard practice since the concept was introduced to programming. Streams have optional concurrency control throughout their lifetime, and the writer decides from time to time whether it wants to preserve others’ changes or overwrite them regardless.

Results

Let’s look at what it results in:

The current types…

UnpublishedUnsequencedMutableData
UnpublishedSequencedMutableData

UnpublishedUnsequencedAppendOnlyData
PublishedUnsequencedAppendOnlyData
UnpublishedSequencedAppendOnlyData
PublishedSequencedAppendOnlyData

UnpublishedImmutableData
PublishedImmutableData

…are on their way to becoming:

PrivateMap
PublicMap
PrivateGuardedMap
PublicGuardedMap

PrivateSequence
PublicSequence
PrivateGuardedSequence
PublicGuardedSequence

PrivateBlob
PublicBlob

…and with this proposal become:

PrivateMap
PublicMap

PrivateSequence
PublicSequence

PrivateBlob
PublicBlob

Less code

Additionally, we will be able to cut down on code and various more or less pointless distinctions there, which are nice exercises in ninja coding (kudos for that), but are not solving a real problem.

The implementations of Map and Sequence will simply require less code, and the API footprint will be smaller, leaving a much more digestible impression on developers new to the platform, as well as a more ergonomic long-term experience of working with SAFENetwork.

15 Likes

Very well explained and looks like a very desirable change.

I think we already had this with the old NFS API, where passing a version of 0 succeeded regardless of the current version, so I think the old MutableData was already capable of this? Anyway :+1:

8 Likes

Just to be sure I understand correctly:

  • The number of data types is reduced from 8 to 5
  • The number of possible operations will be preserved in the long term, but is temporarily reduced by removing “unsequenced” updates, where the user doesn’t pass an expected version number

This part is overall OK with me, except for this element:

This data structure is not allowed under the Perpetual Web. This is the reason why I count only 5 new data structures.

3 Likes

No, to 6. See below.

Yes.

Not sure why you would say it is not.
I guess you think values are deleted. They are not.
Every key has a history, so you could say that every key maps to a Sequence (simplified).

Edit: I would however rather say that there are 3 data types (Blob | Map | Sequence), and two scopes (Private | Public), for a total of 6 flavours.

3 Likes

I thought that Map meant the old MutableData, and so we cannot have a public one, be it a sequenced one or an unsequenced one (in the old terminology, that is, with or without a version number passed by the caller). And you even agreed to this in a past topic.

But maybe it is just that I am lost with the new terms.

1 Like

Ah, I don’t know what past topic you refer to. I might have been unclear at some point (this thread here is the only place I’ve discussed Map, though, I think).

To clarify:

Map replaces MutableData. The MutableData features go more or less into PrivateMap. So what is newly introduced is PublicMap, so as to be consistent with the other types (every type has a Private and a Public scope).

In reality though, it is the key-val semantics of AppendOnlyData that have been extracted out into a Map type, and MutableData has more or less been scrapped, with the exception of a few parts that were merged.

This is because the AD key-val parts and MD were overlapping (two different implementations of key-val that were never consolidated before), and there was no Private|Public duality over one single type; instead it was spread over two different types (MD and AD), while AD was in essence two types in itself (Map and Sequence). And that was a very limited “Map”, meaning there was no fully capable equivalent of a “Public MutableData”. Fixing that is the biggest feature addition of this RFC.

You can do Update and Delete on both Private and Public Map, but on the Public one there is always a history of values for every key, i.e. no public data can be truly deleted.

3 Likes

I suppose you could even add (Private | Public) as an enum parameter. Not sure whether that would be better or not. It simplifies the data types a bit, but makes it slightly less immediately clear whether something is private or public.

2 Likes

Might be eventually, with other improvements that are being planned as we speak.
I think for now, with this RFC, it’s good as is.

But there is currently a “step 2” of the data types work which, if accepted, will go into a new RFC. It deals with the backend handling of data, but also with how the Private | Public scope applies to the data type representations and the underlying data.

So, these things are likely to be discussed there in that case.

6 Likes

Sorry, I meant a past post in this topic. More precisely:

PublicMap is equivalent to these 2 elements (the previous Sequenced/Unsequenced subclassing being replaced by the new ExpectedVersion enum), simply because:

and so, logically, PublicMap shouldn’t exist.

I am not sure I understand what you intend to do about public/private scopes, but don’t forget that we need the previous AppendableData feature not exclusively in a public way, but also in a private way.

Edit: To be more precise on my doubts:

Historical values are a feature that should be orthogonal to the public/private scope.

1 Like

I think you need to read the RFC again, because the things you say don’t make sense.

These few lines from the RFC should say it all:

This proposal replaces MutableData and AppendOnlyData.
It merges MD and the key-val part of AD into a single type Map, and separates a Sequence type out from AD.

Map forms a perpetual MD in private and public form, while Sequence is essentially an AD without key semantics.

[…]

PrivateMap ~ ( UnpublishedUnsequencedMutableData / UnpublishedSequencedMutableData )
PublicMap - New!

PrivateSequence ~ ( UnpublishedUnsequencedAppendOnlyData / UnpublishedSequencedAppendOnlyData )
PublicSequence ~ ( PublishedUnsequencedAppendOnlyData / PublishedSequencedAppendOnlyData )

But let me know if it’s still unclear.


It’s the same as before.


Public types are perpetual: nothing is deleted, data is only appended, i.e. a “history of values” builds up.

3 Likes

I’ve been reflecting on what I’ve read in the forum and on GitHub with regard to how the data types are used/structured, in addition to the ideas presented in this RFC. I wanted to offer the following diagram and set of definitions as a variation on a possible object hierarchy. I won’t go into the nuances of private vs. public and guarded vs. unguarded flavours discussed in this RFC, since those are easy extensions.

A few definitions.

  • Chunk: a generic term to refer to any contiguous piece of data that is 1MB in size.
      • Block: a permanent and immutable Chunk that cannot be changed once it is written. Its XOR address is based on a hash of the data content.
      • Blob: a mutable Chunk that can be updated. All of a Blob’s content is rewritten during every update, but its unique XOR address stays the same.
      • Blend: an appendable Chunk that consists of 250 lines/entries, each 4kB in size. Only the first empty line is written during each update. Each entry is immutable once written, but the Chunk itself is mutated as new lines are added; hence it is a Blend of a Block and a Blob.

  • File: the highest level of organization for a set of data. “Everything is a File…” before it enters the SAFE Network and when it leaves. This is what is provided by and to the human client.

  • Sequence: an ordered set/collection of Chunks where repetitions are allowed. Could be built with, or be an extension of, a Blend.

  • Map: a File descriptor that points to zero or more Chunks and/or Sequences. Could be built with, or be an extension of, a Blob. Small files could be stored entirely within a Map, without the need for a Sequence. This is analogous to the way modern filesystems store small files within an inode, rather than taking up another entire disk sector.
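
To summarise these definitions in code, here’s a rough Rust sketch of the hierarchy; the type names follow the definitions above, while the field layout and constants are just illustrative:

const BLEND_LINE_SIZE: usize = 4 * 1024; // 4kB per line
const BLEND_LINES: usize = 250;          // 250 * 4kB = ~1MB per Chunk

/// A generic contiguous 1MB piece of data.
enum Chunk {
    /// Immutable once written; XOR address derived from a content hash.
    Block(Vec<u8>),
    /// Mutable; fully rewritten on every update, XOR address stays fixed.
    Blob(Vec<u8>),
    /// Append-only lines; written entries are immutable, but the Chunk
    /// itself mutates as new lines are filled in.
    Blend(Vec<[u8; BLEND_LINE_SIZE]>),
}

/// An ordered collection of Chunks, repetitions allowed.
struct Sequence(Vec<Chunk>);

/// A File descriptor pointing at zero or more Chunks and/or Sequences;
/// small files can be inlined directly, like small files in an inode.
struct Map {
    inline_data: Option<Vec<u8>>,
    chunks: Vec<Chunk>,
    sequences: Vec<Sequence>,
}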

18 Likes

Great to see all these ideas and inputs. It really feels like great group think and this is proving it works.

15 Likes

Thanks @jlpell!
Fantastic that you are putting in time and thinking of how we can do this better.

Your object hierarchy idea is similar to what I am working on as step 2 of the data types work. I get very tempted to put that out for comments now :smile:

I will look closer at your ideas soon, and I think they are going to be absolutely great to weave into that discussion.

10 Likes

I will phrase it differently to try to make myself understood. I have 2 fears:

Firstly, when you say:

I interpret this as: if the map is private, then previous versions of keys and values are not preserved, and the map needs to be public to preserve their whole history. But by doing this, you lose the former Unpublished(…)AppendOnlyData possibilities.

Secondly, the extraction of the notion of Map might open the door to a workaround of the Perpetual Web principle.

I cannot be more specific because you did not show how the elements you want to separate (map and sequence) will be recombined to implement the existing data structures.

Concretely, can you show how the following example would be translated into the new system:

Take an AD at version V0:

    V0
    KeyA  ValueA
    KeyB  ValueB

Update it in a transaction with one insertion, one deletion, and one update, and then it has 2 versions:

    V0              V1
    KeyA  ValueA    (KeyA deleted)
    KeyB  ValueB    KeyB  New ValueB
                    KeyC  ValueC

The current version is V1, but the data at version V0 is still accessible. This was possible for both a published and an unpublished AD (this is about my first fear)

We couldn’t cheat because the AD was a single structure. With a diagram for an equivalent example in the new system, I could check that there would be no risk of cheating with the extracted map part (this is about my second fear)

2 Likes

OK, but I go through this quite extensively. Did you actually read the RFC again, @tfa?
I think this part here answers all your questions. But I will go back there now and clarify even further.

From the RFC, under Map:

Key versioning

Both Public and Private Maps use key versioning. This means that there is a history of values for each key in the Map. An Update SHALL append a new value at the specific key, thus incrementing the key version. A Delete SHALL append a Tombstone to that key’s value array (and similarly increment the key version).
[…]
The difference between Private and Public is that the Private Map allows hard-delete and hard-update, which erase the actual value.
[…]

Private Map

The characteristic of a PrivateMap is that Delete actually deletes the value; that data is permanently removed from the network.
The PrivateMap API is extended with hard_delete and hard_update to allow for this.
Using delete, the actual value is still available in the history. With hard_delete, the value is deleted, but the previous history is left intact. Deleting a key along with its entire history is also possible (this is similar to how MD works today).
And ultimately, the entire instance can be deleted from the network.

hard_delete and hard_update

These operations replace the old value with a Tombstone, and then append another Tombstone in the case of delete, or the new value in the case of update.

The vector also reflects the version, which is the reason for inserting a Tombstone: it hard-deletes the data, but still increments the version.

Using hard_delete and hard_update operations, the data history would look something like this:

[value] -> update(new_value) -> [Tombstone, new_value] -> expected_version = 2
[Tombstone, new_value] -> delete -> [Tombstone, Tombstone, Tombstone] -> expected_version = 3

…while the ‘soft’ delete and update gives:

[value] -> update(new_value) -> [value, new_value] -> expected_version = 2
[value, new_value] -> delete -> [value, new_value, Tombstone] -> expected_version = 3

There will always be a version. The PrivateMap allows for erasing a value, but the version will always be bumped, so there’s no fooling anyone about that.
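
To tie the transitions above together, here is a small sketch of the per-key value history; Entry and the function names are illustrative assumptions, not the real types:

#[derive(Clone)]
enum Entry {
    Value(Vec<u8>),
    Tombstone, // erases data, but still occupies a version slot
}

/// Soft update: the old value stays readable in the history.
fn update(history: &mut Vec<Entry>, new_value: Vec<u8>) {
    history.push(Entry::Value(new_value)); // [value] -> [value, new_value]
}

/// Soft delete: appends a Tombstone; the history stays intact.
fn delete(history: &mut Vec<Entry>) {
    history.push(Entry::Tombstone); // [value, new_value] -> [value, new_value, Tombstone]
}

/// Hard update (PrivateMap only): the previous value is replaced by a
/// Tombstone before the new value is appended.
fn hard_update(history: &mut Vec<Entry>, new_value: Vec<u8>) {
    if let Some(last) = history.last_mut() {
        *last = Entry::Tombstone; // erase the old value
    }
    history.push(Entry::Value(new_value)); // [value] -> [Tombstone, new_value]
}

/// Hard delete (PrivateMap only): the previous value is erased and a
/// Tombstone appended, so the version still bumps.
fn hard_delete(history: &mut Vec<Entry>) {
    if let Some(last) = history.last_mut() {
        *last = Entry::Tombstone;
    }
    history.push(Entry::Tombstone); // [Tombstone, new_value] -> [Tombstone, Tombstone, Tombstone]
}

/// The key's version is simply the length of its history, so even a
/// hard delete leaves a visible version bump.
fn version(history: &[Entry]) -> u64 {
    history.len() as u64
}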

1 Like

Maybe the source of the misunderstanding is that, after you extracted the basic operations (key/value management and sequence management), I thought you would re-implement the original (Un)published(Un)sequencedAppendOnlyData multi-key feature by combining these basic operations in a certain way.

In fact you didn’t mention this, but in my mind it was obvious and went without saying that this feature would be preserved. For example, these structures are used in the safe file sync command, because a directory is implemented by such an object.

Maybe the problem is that you don’t intend to do this and this explains why we are going in circles.

But if you intend to re-implement the original *AppendOnlyData structures, you should tell us more about it. For example, you could explain how the use case example I mentioned in my last post would be done.

2 Likes