[RFC] Data Types Refinement

oetyng · November 28, 2019, 6:29pm

Data Types Refinement

tags: `rfc`

Status: proposed
Type: enhancements and simplifications of data types
Related components: safe-nd, safe-api, safe_vault, safe_client_libs
Start Date: 18-11-2019
Discussion: [ … ]
Supersedes:
Superseded by: N.A.

Summary

This proposal replaces MutableData and AppendOnlyData.
It merges MD and the key-val part of AD into a single type Map, and separates a Sequence type out from AD.

Map forms a perpetual MD in private and public form, while Sequence is essentially an AD without key semantics.

This proposal additionally removes the Sequenced and Unsequenced flavours, and lets an instance be capable of both.
The result is that we remove the lifetime distinction between concurrency controlled types and those without.
Instead a parameter is used to determine the level of concurrency control on a specific operation, essentially making it optional per request for any given instance.

Conventions

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Motivation

Keeping the MD behaviour when published.
Removing unnecessary requirement of a key for a strict append-only structure.
Streamline / unify data type permissions and ownership (as well as any data type related parts).
Separate key-val and queue capabilities into distinct data types.
Cut down on different data type flavours.
Improve naming.

Keeping behaviour

It is quite plausible that apps start out with (or default to) a purely private iteration of the user data. There are numerous examples of very popular and widely used apps today that work in this way. A later, possibly optional, iteration in the app lifecycle would include making the data public, while still supporting the same functionality.

While it probably would be possible to work around the different behaviour client side, it is a profoundly unnecessary complication, as we can already now identify a very probable use case, where we want the private and public data to be handled in the same way, with extensions to the API for handling the specific differences between private and public.

Removing key from append-only type

With the current design of AD we are forcing the user of this data type to always provide a unique key which in most use cases wouldn’t be required for this kind of structure, plus we are forcing them to pay the price of the uniqueness validation. It is also confusing with regards to the purpose and usage of the data type, and speaks of a catch-all attempt and unfinished design.

As the saying goes:

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. - Antoine de Saint-Exupery

Streamline data types design

The data types, having been implemented over a longer period of time, have seen compromises that probably wouldn’t have been done had it all been designed at the same time. For example AD and MD have two different implementations of permissions and ownership. We have three data types, of which two have some overlap in capabilities (MD and AD key-value), and one mixes two traditional data structures (AD with key-val and queue). We also have a Public / Private concept that applies differently to one of the key-val types than the other (MD is simply Private all the time, but can sort-of be made Public if copying to the other type, AD can be both Private and Public).
It is quite a low-hanging fruit to let these data types be implemented more consistently, so that MD is replaced with PrivateMap, and permissions and ownership are handled in one way all over, instead of in multiple ways.

Cut down on different data type flavours

The earlier notion of Sequenced and Unsequenced, is a definition that is neither necessary nor adding value. What it does though, is to complicate the data type nomenclature and ecosystem, and to introduce accidental complexity, as well as a very confusing developer experience.
The purpose of these types is to denote concurrency control. When fixing them on a per-life-time-basis for an instance, we have not solved any actual problem, however we have limited real use cases.
Instead of the type distinction, we will allow for a parameter variant signalling that a write is allowed to go through, regardless of underlying data version.
This way, we keep the feature, but remove the artificial limit that the per-life-time distinction imposes.
The decision is a result of studying existing long-time practices with concurrency controlled streams of data, as well as analyzing the actual requirements on such a feature.
Important to know, is that this feature is based on future work not yet completed, with regards to the network consistency management.

Improve naming

This is really a bonus, and not a driving motivator.
However, if we consider Arrays, HashMaps, Lists etc. they are not named ArrayData, HashMapData, ListData. We have room to simplify the name of our most fundamental components.

This is not a trivial question, since a logical and elegant (simple) nomenclature is very important for code readability, system understanding and developer experience. Furthermore, it is not an isolated phenomenon but speaks of the logic and elegance (simplicity) of the entire system.

Assumptions

This document doesn’t cover the basic functionality of the data types that remains unchanged.

Detailed Description

Synopsis

MutableData and AppendOnlyData become Map and Sequence
Unpublished / Published becomes Private / Public
Unsequenced / Sequenced types are removed, and instead this is decided on a per-request-basis.
ImmutableData becomes Blob.

The current types…

UnpublishedUnsequencedMutableData
UnpublishedSequencedMutableData

UnpublishedUnsequencedAppendOnlyData
PublishedUnsequencedAppendOnlyData
UnpublishedSequencedAppendOnlyData
PublishedSequencedAppendOnlyData

UnpublishedImmutableData
PublishedImmutableData

…thus become the following:

PrivateMap
PublicMap

PrivateSequence
PublicSequence

PrivateBlob
PublicBlob

New and old names side by side:

PrivateMap ~ (UnpublishedUnsequencedMutableData / UnpublishedSequencedMutableData)
PublicMap - New!

PrivateSequence ~ (UnpublishedUnsequencedAppendOnlyData / UnpublishedSequencedAppendOnlyData)
PublicSequence ~ (PublishedUnsequencedAppendOnlyData / PublishedSequencedAppendOnlyData)

PrivateBlob ~ (UnpublishedImmutableData)
PublicBlob ~ (PublishedImmutableData)

The following table gives a rough overview of the capabilities.

| Type | Can edit | Can add to | Can delete | Write once | Can share | Concurrency control | Can change owner |
|---|---|---|---|---|---|---|---|
| PrivateMap | x | x | x | - | x | x (*) | x |
| PublicMap | - | x | - | - | x | x (*) | x |
| PrivateSequence | x | x | x | - | x | x (*) | x |
| PublicSequence | - | x | - | - | x | x (*) | x |
| PrivateBlob | - | - | x | x | x (**) | - | - |
| PublicBlob | - | - | - | x | x | - | - |

(*) Optional on a per-request basis, via parameter.
(**) Only as part of instantiation, as the entire Blob is immutable.
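
As a rough illustration, the capability table can be read as a handful of predicates over two dimensions (kind and scope). The sketch below is only a model of the table: `Kind`, `Scope`, `DataType` and the method names are hypothetical and are not taken from safe-nd. The "Can share" column is left out, since it is the same for every type.

```rust
/// Hypothetical model of the capability table; all names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Map,
    Sequence,
    Blob,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scope {
    Private,
    Public,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DataType {
    pub scope: Scope,
    pub kind: Kind,
}

impl DataType {
    /// Builds the proposed type name, e.g. "PrivateMap".
    pub fn name(&self) -> String {
        format!("{:?}{:?}", self.scope, self.kind)
    }
    /// "Can edit": only Private Map and Private Sequence.
    pub fn can_edit(&self) -> bool {
        self.scope == Scope::Private && self.kind != Kind::Blob
    }
    /// "Can add to": Map and Sequence, regardless of scope.
    pub fn can_add_to(&self) -> bool {
        self.kind != Kind::Blob
    }
    /// "Can delete": every Private type, no Public type.
    pub fn can_delete(&self) -> bool {
        self.scope == Scope::Private
    }
    /// "Write once": Blobs only.
    pub fn write_once(&self) -> bool {
        self.kind == Kind::Blob
    }
    /// "Concurrency control": optional per request for Map and Sequence.
    pub fn has_concurrency_control(&self) -> bool {
        self.kind != Kind::Blob
    }
    /// "Can change owner": Map and Sequence.
    pub fn can_change_owner(&self) -> bool {
        self.kind != Kind::Blob
    }
}

fn main() {
    for scope in [Scope::Private, Scope::Public] {
        for kind in [Kind::Map, Kind::Sequence, Kind::Blob] {
            let t = DataType { scope, kind };
            println!(
                "{}: edit={} add={} delete={}",
                t.name(),
                t.can_edit(),
                t.can_add_to(),
                t.can_delete()
            );
        }
    }
}
```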

Blob

The current data type ImmutableData is in this proposal renamed to Blob.

The type is further sub-divided into:

PrivateBlob (former UnpublishedImmutableData)
PublicBlob (former PublishedImmutableData)

Sequence

This API is similar to current AppendOnlyData.

A Sequence MAY expose the following APIs (or equivalent):
(Slightly simplified.)

Sequence Writes (aka mutations)

- PutSequence()
- DeletePrivateSequence()
- Append(value)

Sequence Reads

- GetSequence -> Result[Sequence]
- GetSequenceShell -> Result[Sequence] // only metadata
- GetSequenceRange {
            versionStart, 
            versionEnd
        } -> Result[Values]
- GetSequenceValueAt(version) -> Result[Value]
- GetSequenceExpectedVersions -> Result[ExpectedVersions]
- GetSequenceCurrentEntry -> Result[SequenceEntry]

Additionally there is the regular Permissions and Owners API from AD (see Owner and Permissions) which MAY remain largely unchanged.
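
To make the read operations above more concrete, here is a minimal in-memory sketch of a Sequence. It assumes that a value's version is simply its index in the sequence and that the end of a range is exclusive; both are assumptions for illustration. The `Sequence` struct and its method names are hypothetical and do not reflect the safe-nd implementation.

```rust
/// Hypothetical in-memory model of a Sequence (not the safe-nd type).
#[derive(Debug, Default)]
pub struct Sequence {
    values: Vec<Vec<u8>>,
}

impl Sequence {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append(value): stores the value and returns the version (index) it got.
    pub fn append(&mut self, value: Vec<u8>) -> u64 {
        self.values.push(value);
        (self.values.len() - 1) as u64
    }

    /// GetSequenceValueAt(version).
    pub fn value_at(&self, version: u64) -> Option<&Vec<u8>> {
        self.values.get(version as usize)
    }

    /// GetSequenceRange { versionStart, versionEnd }.
    /// Assumption: the end is exclusive; out-of-bounds ranges give None.
    pub fn range(&self, start: u64, end: u64) -> Option<&[Vec<u8>]> {
        self.values.get(start as usize..end as usize)
    }

    /// GetSequenceCurrentEntry: the latest (version, value) pair, if any.
    pub fn current_entry(&self) -> Option<(u64, &Vec<u8>)> {
        self.values
            .last()
            .map(|value| ((self.values.len() - 1) as u64, value))
    }
}

fn main() {
    let mut seq = Sequence::new();
    seq.append(b"first".to_vec());
    seq.append(b"second".to_vec());
    println!("current entry: {:?}", seq.current_entry());
    println!("range 0..2: {:?}", seq.range(0, 2));
}
```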

Private Sequence

A PrivateSequence MUST allow:

Deletion of the entire structure.

… and MAY allow:

Deletion of specific indices (no restrictions).

Implementation of specific-index-deletion is deferred for now, until we reach a consensus on the scope of such mutability.

Public Sequence

A PublicSequence MUST NOT allow deletions. Data is there in perpetuity.

Map

This API is supposed to reflect current MutableData as well as extend it for a more complete access to the versions.

The combined flavours of a Map MAY expose the following APIs (or equivalent):
(Slightly simplified.)

- PutMap(data) // Creates map.
- DeletePrivateMap() // Deletes the map from the network.
- CommitMapTx(tx) // Commit multiple operations, fail or succeed all together.
- GetMap()  -> Result[Map] // Returns everything.
- GetMapShell {
                expected_data_version,
            } -> Result[Map] // All metadata, but not the actual data.
- GetMapVersion()  -> Result[Version] // Data version.
- GetMapValue(key ) -> Result[Value] // Current value of specific key.
- GetMapValueAt {
                key,
                version,
            } -> Result[Value] // Value of a specific key at a specific version.
- GetMapValues() -> Result[Values] // All current values.
- GetMapEntries() -> Result[Entries] // All key-value pairs (current value).
- GetMapExpectedVersions() -> Result[ExpectedVersions] // Expected versions for data, owner and permissions.
- GetMapKeyHistory(key) -> Result[Values] // All history of specific key.
- GetMapKeyHistoryRange { 
            key,
            versionStart, 
            versionEnd
        } -> Result[Values] // A range of the history of a specific key.
- GetMapKeys() -> Result[Vec<Key>] // All keys.
- GetMapKeyHistories() -> Result[KeyHistories] // All history of all keys.

Here, tx is a transaction object (equivalent to current MdEntryActions) that we load with any add/update/delete-operations we want to perform atomically.
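
The all-or-nothing behaviour of CommitMapTx could be sketched as follows: the operations are applied to a staged copy, and the copy replaces the live data only if every operation succeeds. This is a simplified model that ignores key history and versions. `SimpleMap`, `Op` and `TxError` are hypothetical names, and cloning the whole map is just the simplest way to show the idea, not a suggestion for the real implementation.

```rust
use std::collections::BTreeMap;

/// Hypothetical operations that a transaction can carry.
#[derive(Debug, Clone)]
pub enum Op {
    Add(String, String),
    Update(String, String),
    Delete(String),
}

/// Hypothetical error type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum TxError {
    KeyExists(String),
    EntryDoesNotExist(String),
}

/// Simplified map without key history, to keep the focus on atomicity.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct SimpleMap {
    entries: BTreeMap<String, String>,
}

impl SimpleMap {
    /// Applies all operations to a staged copy; the live entries are
    /// replaced only if every single operation succeeded.
    pub fn commit(&mut self, tx: &[Op]) -> Result<(), TxError> {
        let mut staged = self.entries.clone();
        for op in tx {
            match op {
                Op::Add(key, value) => {
                    if staged.contains_key(key) {
                        return Err(TxError::KeyExists(key.clone()));
                    }
                    staged.insert(key.clone(), value.clone());
                }
                Op::Update(key, value) => match staged.get_mut(key) {
                    Some(slot) => *slot = value.clone(),
                    None => return Err(TxError::EntryDoesNotExist(key.clone())),
                },
                Op::Delete(key) => {
                    if staged.remove(key).is_none() {
                        return Err(TxError::EntryDoesNotExist(key.clone()));
                    }
                }
            }
        }
        self.entries = staged;
        Ok(())
    }

    pub fn get(&self, key: &str) -> Option<&String> {
        self.entries.get(key)
    }
}

fn main() {
    let mut map = SimpleMap::default();
    let setup = [Op::Add("a".into(), "1".into())];
    println!("setup: {:?}", map.commit(&setup));

    // The second operation fails, so the first one must not be applied either.
    let failing = [Op::Update("a".into(), "2".into()), Op::Delete("missing".into())];
    println!("failing tx: {:?}", map.commit(&failing));
    println!("a is still: {:?}", map.get("a"));
}
```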

Additionally there is the regular Permissions and Owners API (see Owner and Permissions) from AD which MAY remain largely unchanged in functionality, while it also MAY be renamed as per the above description; AccessList etc…

Updating Map owners and permissions

As these reference the Map data as it was at a specific point in time, there MUST be a top level version for its data. This version MUST be incremented every time there is an operation that increments a key-version.

This means that owners and permissions work largely the same in Map (and Sequence), as they currently do for AD.

Key versioning

Both Public and Private Maps use key versioning. This means that there is a history of values for each key in the Map. An Update SHALL append a new value at the specific key, thus incrementing the key version. A Delete SHALL append a Tombstone to that key’s value array (and similarly increment the key version). At that point, it would make sense that Add(key, value) is successful again for that key (while it previously would give a KeyExists error), in which case the new value is simply appended after the Tombstone. An Update or Get returns EntryDoesNotExist when the current value is a Tombstone.
The difference between Private and Public is that the Private Map allows hard-delete and hard-update, which erases the actual value.
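
The soft (history-preserving) semantics described above can be sketched like this. The model keeps one history vector per key and a top-level data version that is bumped whenever a key version is bumped. `VersionedMap`, `Entry`, `MapError` and the method names are hypothetical, and using the history length as the key version is an assumption made for the example.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Error names follow the RFC text; the type itself is illustrative.
#[derive(Debug, PartialEq, Eq)]
pub enum MapError {
    KeyExists,
    EntryDoesNotExist,
}

/// Hypothetical model: one history per key, plus a top-level data version.
#[derive(Debug, Default)]
pub struct VersionedMap {
    histories: BTreeMap<String, Vec<Entry>>,
    data_version: u64,
}

impl VersionedMap {
    /// A key is "live" when its latest entry is a value, not a Tombstone.
    fn is_live(&self, key: &str) -> bool {
        matches!(
            self.histories.get(key).and_then(|h| h.last()),
            Some(Entry::Value(_))
        )
    }

    /// Appends to the key history and bumps the top-level data version.
    fn push(&mut self, key: &str, entry: Entry) {
        self.histories.entry(key.to_string()).or_default().push(entry);
        self.data_version += 1;
    }

    /// Add succeeds for a new key, or for a key whose latest entry is a Tombstone.
    pub fn add(&mut self, key: &str, value: Vec<u8>) -> Result<(), MapError> {
        if self.is_live(key) {
            return Err(MapError::KeyExists);
        }
        self.push(key, Entry::Value(value));
        Ok(())
    }

    pub fn update(&mut self, key: &str, value: Vec<u8>) -> Result<(), MapError> {
        if !self.is_live(key) {
            return Err(MapError::EntryDoesNotExist);
        }
        self.push(key, Entry::Value(value));
        Ok(())
    }

    /// Soft delete: appends a Tombstone, the history stays intact.
    pub fn delete(&mut self, key: &str) -> Result<(), MapError> {
        if !self.is_live(key) {
            return Err(MapError::EntryDoesNotExist);
        }
        self.push(key, Entry::Tombstone);
        Ok(())
    }

    pub fn get(&self, key: &str) -> Result<&Vec<u8>, MapError> {
        match self.histories.get(key).and_then(|h| h.last()) {
            Some(Entry::Value(value)) => Ok(value),
            _ => Err(MapError::EntryDoesNotExist),
        }
    }

    /// Number of entries ever written to the key (assumed to be its version).
    pub fn history_len(&self, key: &str) -> u64 {
        self.histories.get(key).map_or(0, |h| h.len() as u64)
    }

    pub fn data_version(&self) -> u64 {
        self.data_version
    }
}

fn main() {
    let mut map = VersionedMap::default();
    map.add("k", b"v1".to_vec()).unwrap();
    map.delete("k").unwrap();
    println!("get after delete: {:?}", map.get("k"));
    println!("add after delete: {:?}", map.add("k", b"v2".to_vec()));
    println!("history length: {}", map.history_len("k"));
}
```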

Private Map

A PrivateMap SHALL allow:
→ Deletion of the entire structure.
→ Deletion of current key value.
→ Deletion of keys.

What distinguishes a PrivateMap is that it additionally allows for hard-deleting a value, so that the data is permanently removed from the network. This, however, will always bump the version.
The PrivateMap API is extended with ‘hard_delete’ and ‘hard_update’ to allow for this, in addition to the capabilities it shares with PublicMap.

The key can be treated as a single value, where delete or hard_delete just renders the key virtually deleted (i.e. GET will respond KeyDoesNotExist). Using delete, the actual value is still available in the history.
With hard_delete, the value is deleted, but the previous history is intact. Deletion of a key along with the entire history of it MAY also be possible (this is similar to how MD works today).
The options are then to delete the entire history but keep the last version, to delete the entire history without keeping the version, or to delete the key together with its history (which leaves no opportunity to keep the key version).
And ultimately, the entire instance can be deleted from the network.

hard_delete and hard_update

These operations replace the old value with a Tombstone, and then append another Tombstone in the case of a delete, or the new value in the case of an update.

The vector is also reflecting version, which is the reason for inserting a Tombstone, as it hard-deletes the data, but also increments version.

Using hard_delete and hard_update operations, the data history would look something like this:

[value] -> update(new_value) -> [Tombstone, new_value] -> expected_version = 2
[Tombstone, new_value] -> delete -> [Tombstone, Tombstone, Tombstone] -> expected_version = 3

…while the ‘soft’ delete and update gives:

[value] -> update(new_value) -> [value, new_value] -> expected_version = 2
[value, new_value] -> delete -> [value, new_value, Tombstone] -> expected_version = 3

If possible, this MAY be extended with hard_delete_at and hard_update_at, to allow deletion at a specific version.
That said, the above scheme would likely need to be implemented differently for that, so that the version is properly bumped without changing any existing data versions.
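
The hard and soft histories listed above can be reproduced with a few small functions over a single key's history. This is a sketch under the assumption that only the current (last) entry is erased and that the expected version equals the length of the history, which matches the traces in the text. The function names and the `Entry` type are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Entry {
    Value(String),
    Tombstone,
}

/// Erases the current value in place, if the history is non-empty.
fn erase_current(history: &mut Vec<Entry>) {
    if let Some(current) = history.last_mut() {
        *current = Entry::Tombstone;
    }
}

/// hard_update: erase the current value, then append the new value.
pub fn hard_update(history: &mut Vec<Entry>, new_value: &str) {
    erase_current(history);
    history.push(Entry::Value(new_value.to_string()));
}

/// hard_delete: erase the current value, then append another Tombstone.
pub fn hard_delete(history: &mut Vec<Entry>) {
    erase_current(history);
    history.push(Entry::Tombstone);
}

/// Soft update: the previous value stays in the history.
pub fn soft_update(history: &mut Vec<Entry>, new_value: &str) {
    history.push(Entry::Value(new_value.to_string()));
}

/// Soft delete: only a Tombstone is appended.
pub fn soft_delete(history: &mut Vec<Entry>) {
    history.push(Entry::Tombstone);
}

/// Assumption: the expected version is the number of entries written so far.
pub fn expected_version(history: &[Entry]) -> u64 {
    history.len() as u64
}

fn main() {
    let mut hard = vec![Entry::Value("value".to_string())];
    hard_update(&mut hard, "new_value");
    hard_delete(&mut hard);
    println!("hard: {:?}, expected_version = {}", hard, expected_version(&hard));

    let mut soft = vec![Entry::Value("value".to_string())];
    soft_update(&mut soft, "new_value");
    soft_delete(&mut soft);
    println!("soft: {:?}, expected_version = {}", soft, expected_version(&soft));
}
```

Both variants end at the same expected version; the only difference is whether the earlier values are still readable in the history.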

Public Map

A PublicMap would essentially expose the exact same API as the above with the exception of hard_delete, hard_update and deletion of the instance, as the PublicMap data is perpetual.

The API for Update/Delete exists, however no data is ever removed from the network, the key history is maintained. Any data going in, will exist there in perpetuity, albeit new data might be appended for a given key, which effectively emulates the mutability of the PrivateMap.

Owner and Permissions

For Map and Sequence, the ownership and permissions API is as follows:
(Slightly simplified.)

- Set[Map|Sequence]Owner {
        owner,
        expected_version,
    }
- Set[Public|Private][Map|Sequence]AccessList {
        access_list,
        expected_version,
    }
- Get[Map|Sequence]Owner() -> Result(Owner)
- Get[Map|Sequence]OwnerAt(version) -> Result(Owner)
- Get[Map|Sequence]OwnerHistory() -> Result(Vec(Owner))
- Get[Map|Sequence]OwnerHistoryRange { 
        versionStart, 
        versionEnd
    } -> Result(Vec(Owner))
- Get[Map|Sequence]AccessList -> Result(AccessList)
- Get[Map|Sequence]AccessListAt(version) -> Result(AccessList)
- Get[Public|Private][Map|Sequence]AccessListHistory -> Result(Vec([Public|Private]AccessList))
- Get[Public|Private][Map|Sequence]AccessListHistoryRange(versionStart, versionEnd) -> Result(Vec([Public|Private]AccessList))
- Get[Public|Private][Map|Sequence]UserPermissions(user) -> Result([Public|Private]UserAccess)
- Get[Public|Private][Map|Sequence]UserPermissionsAt {
        version,
        (user | public_key),
    }  -> Result([Public|Private]UserAccess)
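
As an illustration of the owner part of this API, the sketch below keeps owners as an append-only history, where each entry records the data version it refers to. It assumes that the owner version is the index in that history and that SetOwner succeeds only when expected_version equals the next index. `OwnerHistory`, `Owner` and `VersionMismatch` are hypothetical names, and the public key is a plain string stand-in.

```rust
/// Hypothetical owner entry: who owns the data, and as of which data version.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Owner {
    pub public_key: String, // stand-in for a real public key type
    pub data_version: u64,  // top-level data version when the owner was set
}

#[derive(Debug, PartialEq, Eq)]
pub struct VersionMismatch {
    pub expected: u64,
    pub actual: u64,
}

/// Append-only owner history; the index of an entry is its owner version.
#[derive(Debug, Default)]
pub struct OwnerHistory {
    owners: Vec<Owner>,
}

impl OwnerHistory {
    /// SetOwner { owner, expected_version }.
    pub fn set_owner(&mut self, owner: Owner, expected_version: u64) -> Result<(), VersionMismatch> {
        let actual = self.owners.len() as u64;
        if expected_version != actual {
            return Err(VersionMismatch { expected: expected_version, actual });
        }
        self.owners.push(owner);
        Ok(())
    }

    /// GetOwner: the current owner.
    pub fn owner(&self) -> Option<&Owner> {
        self.owners.last()
    }

    /// GetOwnerAt(version).
    pub fn owner_at(&self, version: u64) -> Option<&Owner> {
        self.owners.get(version as usize)
    }

    /// GetOwnerHistoryRange { versionStart, versionEnd }, end exclusive (assumption).
    pub fn history_range(&self, start: u64, end: u64) -> Option<&[Owner]> {
        self.owners.get(start as usize..end as usize)
    }
}

fn main() {
    let mut history = OwnerHistory::default();
    let alice = Owner { public_key: "alice".to_string(), data_version: 0 };
    let bob = Owner { public_key: "bob".to_string(), data_version: 3 };
    println!("{:?}", history.set_owner(alice, 0));
    println!("{:?}", history.set_owner(bob, 1));
    println!("owner at version 0: {:?}", history.owner_at(0));
}
```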

Concurrency control

The previous Sequenced / Unsequenced flavours of Map and Sequence are with this proposal removed.
The types were subject to concurrency control, specifically optimistic concurrency. In contrast to previous logic, where an instance was forever locked to one behaviour, without actually enforcing the behaviour (it was easily bypassed), we now allow for the user to decide on a case-by-case-basis if a request should honour existing values or not.
This works exactly the same as with the previous terminology, but we are now combining the two capabilities into one, and thereby cutting down on the data types, as well as giving developers more freedom, as we assume less about their requirements and behaviour.

In other words: the solution chosen here is to not fix the concurrency control of an instance over its entire lifespan, and instead pass in optimistic concurrency check as parameter to operations.
It’s as simple as adding an ExpectedVersion enum parameter, with one of the variants being Any, which would indicate that we will write regardless of the version.

pub enum ExpectedVersion {
    Any, // this means concurrency check is OFF
    Specific(u64), // this means concurrency check is ON
}

This is safe to do because the concurrency control is meant for us - the current writer. We are not preventing someone else from circumventing it. Thus we can just as well simplify, widen the capability, and the end result is the same; we will execute code correctly because it is in our interest, and the other writers (defined as not being “us”) still have to execute code correctly, for the concurrency control to work. If it’s not in their interest, then the original concurrency control was no help; they are rogue players with write access (so the owner of the data messed up in one way or another).
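
A minimal sketch of how such a check might look on the receiving side: the enum is the one from the text above, while `check_version`, `WriteError` and the notion of a "next version" that a successful write would produce are assumptions made for illustration.

```rust
/// The enum from the RFC text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExpectedVersion {
    Any,           // concurrency check is OFF
    Specific(u64), // concurrency check is ON
}

/// Hypothetical error type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum WriteError {
    VersionMismatch { expected: u64, actual: u64 },
}

/// Hypothetical helper: decides whether a write may proceed.
/// `next_version` is the version the data would get if the write succeeds.
pub fn check_version(expected: ExpectedVersion, next_version: u64) -> Result<(), WriteError> {
    match expected {
        // The writer opted out of the check for this request.
        ExpectedVersion::Any => Ok(()),
        // The writer's view of the data is up to date.
        ExpectedVersion::Specific(v) if v == next_version => Ok(()),
        // Someone else wrote in between; reject so the writer can re-read.
        ExpectedVersion::Specific(v) => Err(WriteError::VersionMismatch {
            expected: v,
            actual: next_version,
        }),
    }
}

fn main() {
    let next_version = 3;
    println!("{:?}", check_version(ExpectedVersion::Any, next_version));
    println!("{:?}", check_version(ExpectedVersion::Specific(3), next_version));
    println!("{:?}", check_version(ExpectedVersion::Specific(2), next_version));
}
```

The same instance accepts both kinds of request, which is the point of the proposal: the choice is made per operation and not fixed for the lifetime of the data.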

Encryption

In all of the above cases, the contents of the structures can be either Encrypted or RawContent.
This can simply be called what it is, since the concept of something being encrypted is commonly known to mean secret, very secure and protected from unauthorized access (even though how it works and the handling of keys is something quite alien to many - so as long as encrypting something is as simple as pressing a button or ticking a checkbox, it should be fine UX wise).

It enables us to reclaim the describing property Private for usage in a more natural taxonomy of our data types (as per the current proposal).

Sharing

No changes are proposed here.

Considerations for Private/Public naming

The question was raised whether there is a risk that people may incorrectly believe something they upload as Private and unencrypted is private as in 100% not accessible whatsoever by anyone else (while in reality, as long as it is not encrypted, even Private data can be read by nodes when it is in transit).

Response to considerations

Data put as Private should be encrypted by default. For many reasons, one being that it removes the “whisper-game” rumours of “your data can be read by the nodes when in transit”, which will be very sticky on the various corners of internet, and degrade the message of “your data is safe on the network”. Another reason is that a user regardless has to be educated on the fact that the data is not encrypted by default; Unpublished doesn’t really avoid the risk suggested above. It’s not fully clear from that name that the data would not be encrypted, so it’s still something a user has to learn (or we solve it for them).
So, it is actually better that we educate a user that if you want to upload your data unencrypted you have to do that explicitly (with all warnings and education there).
The motivation for that is that those having the use case that they want it unencrypted, would be in a much better position to actually grasp what that entails.

Additionally, having “private conversations” etc. is a common and well understood concept, which means that the data of the conversation is private, but it is still shared with someone of your choice (the other party/-ies in the conversation). It is therefore argued here that “Private” is commonly known to be optionally shared, and that we explicitly use the word Encrypted instead when we want to point out that it is in fact not readable by anyone else, unless explicitly granted by the owner. (See motivation behind use of the term in the Encryption section.)

Alternatives

Some alternatives have come up regarding the new names of the data structures.
The current set of names have been chosen as a whole, with the intent to have them roughly levelled aesthetically - meaning that they carry about the same amount of information and do not differ much in how descriptive they are. It has been considered more important with consistency in the general design with preference for palatable and easy-enough to understand names, than having the ultimate descriptive name at the cost of the former properties.
The idea being that any developer regardless will have to visit the RFC / documentation at some point to fully understand the data types (how to use them, what their capabilities and limitations are), so we might just as well lean towards more palatable naming, and leave the very detailed descriptions for the documentation.

Alternatives to Map

An alternative name KeyValue or KeyValueStore has been proposed as well, which would be more descriptive than Map.
While it probably is more descriptive, the arguments for Map, are that “map” is a common representation of key-value structures within many programming languages, and obviously a bit more slick of a name as well.

Alternatives to Sequence

A question was raised regarding the descriptive value of this name, and whether it is easy to relate to it based on previous usage within the programming domain. AppendOnly was suggested.
Also here, it might very well be the case that it is not the ultimate name in the aspect of description, but just as with Map, a slick and catchy name that was easy-enough to understand, was preferred.

Arguments against the alternatives

A “sequence” and a “map” are real things in themselves and also describe the properties of these data structures fairly well. An “appendonly” and a “keyvalue” are a thing only by convention if we decide so; they’re not actual words or things otherwise.
It would seem that both KeyValue and AppendOnly are not entirely complete names as they are, and that tucking the word Store (or similar) to them is almost necessary as to get all the way in terms of clarity (KeyValueStore, AppendOnlyStore). And by that we are again quite far away from the slick and simple names. (Not to mention we’ve now introduced additional connotations with Store that we might not want there.)

Additional naming considerations

The use of Private / Public antonyms is a deliberate step away from sharing the same word base in the two opposing descriptions (e.g. Unpublished / Published).
It is RECOMMENDED that we use natural, existing and commonly used antonyms which do not share word base (e.g. Open / Closed), by which we achieve a much higher level of readability and clarity, additionally lowering the risk of confusion when speaking of the terms - since the words do not sound similar.

Implementation roadmap

This RFC may or may not be implemented for Fleming.
Important to note is that until we order mutations with PARSEC at the data handlers (on the After Phase2a list of Fleming), the data versions might not be the same in all the vaults.
That means to say, that version handling at the network requires more work, and until this feature is implemented (given that it is before Fleming), it might not work as expected.
While this is the case, ExpectedVersion.Any variant might be disabled or implemented later in case we want to avoid that uncertainty in wait for the work to finish.

Unresolved questions

There needs to be more details (to be added later) covering aspects of unified permissions handling, sharing, encryption keys and how encrypted content is to be handled by default for Private types.

Changelog Data Types RFC

2020-01-30

Clarified Private | Public scope implications for Map. (thanks @tfa)
Added description of possible extensions for deletion scheme in PrivateMap.

2020-01-21

Replaced Guarded types with per-request-configuration of concurrency control.

2020-01-16

Ordered the listing of old and new type names to match index of them. (thanks @david-beinn)
Added a listing with new and old name side by side.
Fixed flipped scope / concurrency (Public / Private, Guarded) in names.
Fixed flipped order of Public / Private scopes between listing and table. (thanks @david-beinn)
Removed non-existing old types from listing. (thanks @tfa)
Updated Sentried to Guarded. (thanks @JPL, @dask, @jlpell)
Removed idea of excluding Private from names, from Unresolved questions section. (thanks @jlpell)
Found antonym with different wordbase for Encrypted; RawContent, replacing RawContent. (thanks @jlpell)

Note: If you strongly disagree with any of the above updates, please discuss it in the forum topic, for possible revert, or other change.

oetyng · November 28, 2019, 6:43pm

I’ve got some very WIP code going on here: https://github.com/oetyng/safe-nd/tree/datatypes-refinement
if someone would like to look at actual implementation as well. I’ve been working on it both as a way to get more rust-time under the belt, but also as to delve deeper into the proposal.

So far the idea has been to copy append_only_data.rs into sequence.rs and map.rs, and then remove key-value semantics from sequence, and append/index semantics from map.
I’ve worked on top of @adam’s simplification of append_only_data (removing the macro), and also the indexing naming conventions (both of which are currently pending PRs in safe-nd).

I’ve tried to have minimal differences in owners and permissions handling between these. But as I said, still very much WIP, so we’ll see where that ends up.

joshuef · November 29, 2019, 11:15am

Nice work @oetyng.

Really good write up. I dig the new names. I’d go with what you’re proposing, and even the Blob and PublicBlob, I think makes sense and keeps things terse/clear. Same w/ Map for me, it’s clear and simple.

The functionality changes make sense and I think the whole aim here is solid. Improving the data types and making them simpler (for users) too.

oetyng · December 2, 2019, 5:11pm

Update

Sentried Map hard-delete etc.

I think that while using the vector for determining version (which is very slick), this is preferred for the sentried flavour, otherwise the concurrency control is disabled.

Private Sequence

For now I think we can defer implementation of specific-index-deletion to a later point in time, and leave PrivateSequence be identical to PublicSequence (with the one exception that the entire instance can be deleted).

Background

With regards to the hard-delete functionality for PrivateSequence individual indices, it is not yet clear what the boundaries of that would be, or if it is even necessary.

Example questions:

Do we want to allow hard-delete on any and all indices? (this would be consistent with PrivateMap)
Do we want to allow hard-delete only on the current index?
Do we want to allow hard-update? (this would be consistent with PrivateMap)
Are there other operations that we want?

I did identify some additional operations that suddenly made more sense if allowing deletion, but it certainly added a lot more faff, so in the end I got a bit doubtful about it. I even began to think that maybe Sequence should only have the Append capability (except for deleting the entire instance)?

Originally considered

Append // Same as before, nothing special here.
Delete // Deletes value at an index. What happens with subsequent entries? Should their index remain unchanged or be decremented?

Potential additions

Insert // If we can delete at an index, it becomes natural that we can insert at an index.
Update // If we can delete and insert at an index, we should be able to update as well.
Swap // If we can delete and insert at indices, it seems to make sense to allow for swapping as well.

happybeing · January 7, 2020, 12:30pm

Bravo @oetyng, this is very clear and you’ve managed to convey this extremely well given that it can be understood (I think!) in a single reading when it includes so much detail and explanation. I really appreciate the extra explanations on various aspects including naming which can be so contentious (because it matters) although I didn’t really have any difficulty accepting the changes you propose.

At first read I think this is sensible and a worthwhile improvement, with obvious reservations about any delay to Fleming. I’d always go for a change like this though unless it really isn’t practical with resource and time concerns, because I think it offers a lot of value down the road.

One question re permissions - this is written with (pre Labels) ‘access lists’ in mind, but presumably can be easily upgraded later, or the APIs modified to support Labels if that is also adopted now. Is that clear or do you think it might be more tricky and need some thought?

BTW you get a strong vote from me for encrypting private data by default. Lots of benefits there, are there any counter arguments? They aren’t obvious to me if there are.

Finally, is this too late for Fleming? I hope not.

oetyng · January 7, 2020, 2:09pm

That’s good to hear Mark!

Regarding permissions, yes most of this was written before the Labels. We’ve been discussing this briefly, I think @joshuef can perhaps fill in with his view on how this fits together. At this stage there’s no hindrance to upgrading, I would say, anyway.
(In a live network, upgrading with a new format of authorization, would be trickier. Mostly in that the history of previous permissions would be using the old schema, and a migration might not be perfect. So, keeping the old schema would need the old API as to keep access to that information. But on the other hand, a migration might be absolutely fine as well.)

It could be part of Fleming, and that’s an aim.
Most of the code for this has already been written now (comes a bit as a bonus for me because I do spiking / designing iteratively), and there’s been some code review.

But there are still reviews and merges to be done, and they could take some while from what it seems.

Additionally, there are quite widespread updates to do throughout the libraries (safe_client_libs, safe-api, safe_vault, of which I’ve done a great deal in safe_vault already as well), and more tests to be (re-)written.

So, we’ll have to see in the end.

joshuef · January 7, 2020, 2:13pm

I think these changes don’t have to be considered in terms of Fleming. If we have them, then sweet.

IMO they have more impact when considering API releases etc. Though again, I think we can be building out the API with these in mind, using any new naming we settle upon in APIs to get ahead of the curve and avoid any more breaking API changes.

w/r/t labels, I don’t think there’s that much knock on here. This RFC changes some data types and adds some functionality, but isn’t touching permissions on data themselves. So this should work well alongside labels, I believe.

dirvine · January 7, 2020, 3:55pm

I hear music in my ears. Let’s hope so, as that would be nice. It seems so from first glance, so hopefully any crossover would be minimal.

Traktion · January 7, 2020, 6:28pm

Just seeing this thread now (didn’t see it before?) and I think it is excellent! Well thought through and a big improvement!

Nothing critical to add and having private data encrypted by default seems absolutely logical. Looking forward to seeing it integrated!

danda · January 7, 2020, 8:47pm

while in reality, as long as it is not encrypted, even Private data can be read by nodes when it is in transit

basic safe network question here: why is any data sent unencrypted, especially Private data?

if I understand correctly, all data in vaults is encrypted, so vault owners can’t be responsible for what is in there.

It seems to me a simpler story to tell people that all data is sent encrypted across the wire, period. hence: Secure.

oetyng · January 7, 2020, 9:56pm

I don’t know. I’ve wondered that myself. I’m afraid someone who’s been here longer than me will have to answer it.
(Here’s a relevant post from some time back that talks about this as well: RFC 55 - Unpublished ImmutableData - #10 by happybeing)

But I’m totally with you there, everything leaving the client should be encrypted. So, that’s what I’m advocating based on what I know.

Maybe there’s some use case when it’s not important, but it should be opt out then IMO.

Edit: I’ve been informed that this task is on the roadmap: Safe Network

I’m not sure of the scope at the moment, but at least that would cover the data-at-rest part. Then there is in-transit. But that feature would be a major part of addressing this issue. Let’s await the guys who know this, and they’ll probably enlighten us on the subject.

dirvine · January 7, 2020, 11:09pm

The issue is just that. The client needs to encrypt. The network itself won’t care. So if anyone writes a client (and their own SCL etc.) then the network will say, OK store what you want.

This is where the network cannot enforce that all clients do X, and it still cannot (the network only stores stuff and cannot encrypt / decrypt client data all the way from the client). However, moving critical things to the network is good, where possible. Encryption is tough, as the network would need to create keys and cannot really do that securely, so we abdicate to clients. But clients need to use SCL/safe_nd etc. for that, and they could still bypass all of it.

jlpell · January 8, 2020, 8:46am

How did I miss this post for over two months?? Thank you @happybeing for bumping it back to the top. It’s hard to convey my thoughts on how superb this RFC is. You really have outdone yourself @oetyng! (Again, I repeat, 5 out of 5 stars and two thumbs up.) This RFC does a spectacular job of distilling the essential properties of the past datatypes into a readily understandable and coherent system.

A few recommendations on wording/nomenclature:

Instead of having delete and hard_delete, or update and hard_update, I would use unique terms for each operation. Based on your description IMO you should have something like delete, update, kill, and revive. These also fit well with your “tombstone” nomenclature.
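To make the distinction being named here concrete, the following is a minimal sketch of how a soft delete and a hard delete could differ when a key keeps a history of entries. It assumes that a soft delete appends a tombstone while a hard delete erases the stored values. The names `VersionedMap`, `Entry` and `history_len` are hypothetical and not taken from the RFC or from safe-nd.

```rust
use std::collections::BTreeMap;

/// One entry in a key's history: either a stored value or a tombstone.
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Hypothetical map keeping a history of entries per key (illustration only).
#[derive(Default)]
struct VersionedMap {
    data: BTreeMap<String, Vec<Entry>>,
}

impl VersionedMap {
    /// Appends a new value to the key's history.
    fn update(&mut self, key: &str, value: Vec<u8>) {
        self.data
            .entry(key.to_string())
            .or_default()
            .push(Entry::Value(value));
    }

    /// Soft delete: appends a tombstone, earlier values stay in the history.
    fn delete(&mut self, key: &str) {
        if let Some(history) = self.data.get_mut(key) {
            history.push(Entry::Tombstone);
        }
    }

    /// Hard delete (assumed semantics): erases stored values, leaving one tombstone.
    fn hard_delete(&mut self, key: &str) {
        if let Some(history) = self.data.get_mut(key) {
            *history = vec![Entry::Tombstone];
        }
    }

    /// Current value, or None if the key is missing or tombstoned.
    fn get(&self, key: &str) -> Option<&[u8]> {
        match self.data.get(key)?.last()? {
            Entry::Value(v) => Some(v.as_slice()),
            Entry::Tombstone => None,
        }
    }

    /// Number of entries (values and tombstones) recorded for the key.
    fn history_len(&self, key: &str) -> usize {
        self.data.get(key).map_or(0, Vec::len)
    }
}
```

In this sketch both operations make `get` return `None`, but only the hard variant shrinks the history, which is the kind of difference that distinct verbs would make visible.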

I like your line of thinking here and see your reasoning. However, it only became clear after your explanation. Also, “SentriedSequence” is a bit of a mouthful. I would recommend using an alternative and often-used term to convey the same meaning, i.e. “Protected” instead. So a “ProtectedMap” or a “ProtectedSequence” are Maps and Sequences protected from concurrent race conditions.

When one thinks of “ImmutableData”, the concept implied is a form of data that is permanent, rigid, carved in stone… an unchangeable construct. However, the term “blob” evokes the exact opposite connotation, i.e. an amorphous, variable, undefined, and fluid thing that changes every time you poke it. My suggestion is to use a term more intuitively in line with the intent of immutable data. For example, “Block” evokes a more rigid mental image. This term is not really ideal though, due to its overuse in recent years, i.e. blockchain, parsec blocks etc. (EDIT: Maybe Block really is a great term to capitalize on and take control of?) For me the best imagery that comes to mind is stone tiles such as those found in Sumerian cuneiform, or hieroglyphics carved on a tablet or slab of granite.

So instead of blob, what about something like these?

Tile, PublicTile, PrivateTile

Glyph, PublicGlyph, PrivateGlyph

Of those I think Tile is the best… it even rhymes with file, such as “a Map of many Tiles makes a File.”

The only concern here is that if you eliminate the Private and just go with Map and PublicMap, things might get confused with the std::Map data structure if you are not careful with namespaces. Map is nice because it fits well with the terminology used by dirvine since the beginning, i.e. the “Data Map”. It also represents quite literally a map to find your data. The only other terms off the top of my head that are similar would be “Chart” or “Graph”.

I do think Sequence is nice, but is there a reason why you didn’t pick List instead?

I agree with you here, that NotEncrypted is a poor choice and distinct antonyms are better for improved understanding and readability. I also agree that PlainText is a poor choice since it’s not always text. How about one of the following combinations?

Encrypted / Deciphered

Encrypted / Decrypted

Encrypted / Decoded

Encrypted / Raw

happybeing · January 8, 2020, 10:00am

I think it was posted to a limited group (not me) and only yesterday was that restriction removed.

BTW I think you make good suggestions on naming.

oetyng · January 8, 2020, 1:22pm

So, this is what I’ve been having in mind.
At least we could make sure that anything we implement for client-side execution does this?

Then if someone else comes and does a separate SCL, it will stand on its own merits, and if users want encryption by default, and it doesn’t do it, then probably less luck out there for that implementation.

Yep, this is correct.

Thanks @Traktion, good to hear.

Very happy to hear it @jlpell and thanks for the kind words, and the input! Very valuable.

I like it, would you like to expand on those suggestions? I’m a bit unclear about the revive one for example.

Tombstone is an old database term, so I can’t take credit for that one:
Tombstone (data store) - Wikipedia.

I’ll toss in a bit of background on how I work, for fun and fact.
My process when it comes to design is to always look for standard wordings in the domain. Most often every usage of it has its own flavor attached to it, depending on context, so there is always wiggle room for interpretation, overlap etc. and we can choose which of them we find most relevant. That’s why it always takes a bit of research to find the most suitable words to include in the vocabulary for the specific context.
Then, I most often do an extensive lookup of synonyms, and their usage in both nearby and other contexts. I weigh words against each other and see how they fit with existing concepts and together with the current set of words under consideration.

I’m fine with inventing words, and carving out a space for my domain in the world by claiming that this is now the meaning of this word, and being confident that it will break new ground. But that is a practice I reserve for remarkable / innovative things where it could be motivated. In all other cases I prefer to be as lean and comprehensible as possible and make everything fit in nicely with well-understood existing concepts.

This kind of work is often completely left out when doing near-the-metal coding, where it’s all about byte shifts and elliptic curves (so to speak), and if no one picks up on that task, we often end up with quite confusing and alien code, APIs and in the end also the product that will be used.

So, that is the basis for a coherent system. The language all through. By getting a clear language, also the concepts become clear, and the actual system flows and logic can not only be organized in smarter and more intuitive ways, which makes them more robust, but also actually solve the real world problems and not artificial problems that arise out of accidental complexity and concept confusion.

A misconception is that low-level code doesn’t need to be nice and easy to understand, that design is only for UX and not for actual code bases, and that if only the most hardcore devs understand it, then all the better, since it proves their elevated mastermind position. And that is a path to failure IMO. It’s a lack of understanding that tools, all tools, need to be ergonomic. Programming languages are just tools for problem solving, and they should never stand in the way of the important work: the problem solving.

So, I specialize in problem solving, and as any engineer I am very keen to see the tools in a good shape, easy to use, sharp and fit for their purpose.

Ah yes. I think your reasoning is sound. It’s good to emphasize the immutability (while keeping it slick). Block, Chunk, Tile are good candidates. We have a slight problem with Block and existing connotations, but we can also choose to be bold and claim the word for our context.
I’d love to hear more people chip in with suggestions here.
I’ll be revisiting this specific part actually, since there are currently ongoing internal discussions about the concepts here, as well as a new proposal brewing which relates to this. But more about that another time.
Blob is used in cloud storage world to denote a big piece of (more or less) immutable data. Now, nothing is truly immutable in today’s blob storage, but it often has a bit more … inertia … than other types of storage. So, that’s why it was chosen, basically for closeness to existing related usage.

Yeah, I agree. And I’ve had similar thoughts. I think Protected is a good alternative, but also it is overlapping a bit with other things, like private and encrypted, often used interchangeably with those. So, since we already are dealing with those concepts, it seems to me that there can be confusion and the user still has to look up what exactly this means.
The benefit of Sentried in that case is that at least there’s no overlap with other words there.
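For readers new to the thread, the behaviour that the Sentried / Protected name is meant to label can be sketched roughly as below. It follows the RFC summary's idea that concurrency control is optional per request: the caller may pass an expected version, and the write is rejected if it does not match. The type `Sequence`, the method `append`, the `expected_len` parameter and `AppendError` are all hypothetical names for this sketch, not the actual safe-nd API.

```rust
/// Hypothetical error returned when the optional concurrency check fails.
#[derive(Debug, PartialEq)]
enum AppendError {
    VersionMismatch { expected: u64, actual: u64 },
}

/// Hypothetical append-only sequence (illustration only).
#[derive(Default)]
struct Sequence {
    items: Vec<Vec<u8>>,
}

impl Sequence {
    /// Appends `item` and returns the index it was stored at.
    /// With `expected_len = Some(n)` the append only succeeds if the sequence
    /// currently holds exactly `n` items; with `None` no check is made.
    fn append(&mut self, item: Vec<u8>, expected_len: Option<u64>) -> Result<u64, AppendError> {
        let actual = self.items.len() as u64;
        if let Some(expected) = expected_len {
            if expected != actual {
                return Err(AppendError::VersionMismatch { expected, actual });
            }
        }
        self.items.push(item);
        Ok(actual)
    }
}
```

A caller that does not care about races passes `None`, while one that needs the guarantee passes the length it last observed, so the same instance serves both kinds of request.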

I considered it. But to me personally, it is very closely associated with operations that are not available on this data structure. I’m not averse to going with List anyway, but I think it will disconnect us slightly from the notion of append-only there (which could be an OK compromise).
So, Sequence, while less familiar, I think intuitively conveys the append-only nature a bit better.

Nice. Good ones. Out of those I would probably pick (with slight modification)

Encrypted / RawContent

What do you think? I’m happy to go for that one.
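As a rough illustration of how that pair of names might read in code, here is a small sketch of a payload type with the two variants. The enum name `Content` and its helper methods are assumptions made for this example only. Nothing here performs encryption; the variant only records what the client says the bytes are.

```rust
/// Hypothetical payload marker using the Encrypted / RawContent naming.
#[derive(Debug, Clone, PartialEq)]
enum Content {
    /// Bytes that the client encrypted before sending.
    Encrypted(Vec<u8>),
    /// Bytes stored exactly as provided, with no encryption applied.
    RawContent(Vec<u8>),
}

impl Content {
    /// Returns true for the Encrypted variant.
    fn is_encrypted(&self) -> bool {
        matches!(self, Content::Encrypted(_))
    }

    /// Returns the underlying bytes regardless of variant.
    fn bytes(&self) -> &[u8] {
        match self {
            Content::Encrypted(b) | Content::RawContent(b) => b.as_slice(),
        }
    }
}
```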

jlpell · January 8, 2020, 1:54pm

Yes. Agreed.

Yes, looks good to me.

Your thoughts on “the Blob” vs. Block or Tile?

oetyng · January 8, 2020, 1:55pm

Yep, was just adding those in an edit above, when you responded

jlpell · January 8, 2020, 2:26pm

I’d say be bold and unchain the blocks unless there is a better use for the term Block within the SAFE ecosystem. The term Chunk is a nice general term that can refer to any and all data objects that were formed by splitting up a file.

I suspect part of the reason that was chosen is because blob is often defined in the dictionary as a large drop of liquid. Real “clouds” have large drops of liquid water in them.

oetyng · January 8, 2020, 2:30pm

“unchain the blocks”, catchy.

That’s interesting, and plausible. Etymology is always such a stimulating practice.

Today Chunk is reserved for this usage, i.e. a blob of data is split up in chunks.
But as can be read from this citation: “Blobs were originally just big amorphous chunks of data […]”, a blob and a chunk of data can be synonymous.

The chunk of data just means to say some data which has been amassed out of the environment.
So, not necessarily a piece of a file or a blob, but just a piece of data [from our world].

This distinction is going to be more important if we want to move towards having all data fundamentally stored as chunks in the network, as it would require that some content would have to fit all in one single chunk, as it would be too small for self-encryption. But there will be more information and discussion about this later, in another topic.
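The blob / chunk relationship described above can be sketched as follows. This is only plain fixed-size splitting, not self-encryption, and `MAX_CHUNK_SIZE` is a tiny made-up value chosen so the behaviour is easy to see. The function name `split_into_chunks` is likewise hypothetical.

```rust
/// Hypothetical maximum chunk size in bytes, chosen only for illustration.
const MAX_CHUNK_SIZE: usize = 4;

/// Splits a blob into chunks of at most MAX_CHUNK_SIZE bytes.
/// Content no larger than MAX_CHUNK_SIZE ends up in a single chunk.
fn split_into_chunks(blob: &[u8]) -> Vec<Vec<u8>> {
    blob.chunks(MAX_CHUNK_SIZE).map(|c| c.to_vec()).collect()
}
```

With these assumptions a 3-byte blob yields exactly one chunk, which corresponds to the case mentioned above where small content has to fit in a single chunk.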

I’ll be circling back to this shortly, but I’d love to hear other people’s input on the use of Block or similar, when talking about immutable data in the network.

jlpell · January 8, 2020, 2:47pm

Yes, marketing-wise people often just shrug when you tell them that project SAFE started before bitcoin and blockchain. A common question is, “well what is taking you so long then?” Instead, a different narrative might be:

First blockchain was launched and the people rejoiced. But chains are heavy, slow, constraining, they limit freedom, they limit choice. Then came MaidSafe, the breaker of chains, who created a thing with none of those pains. The blocks now roam free, in their SAFE place. You will need a Map to find them; hidden in spare space.

Yes folks, that was my first crypto poem. Bring on the memes.
