Data Types Refinement
tags: rfc
- Status: proposed
- Type: enhancements and simplifications of data types
- Related components: safe-nd, safe-api, safe_vault, safe_client_libs
- Start Date: 18-11-2019
- Discussion: [ … ]
- Supersedes:
- Superseded by: N.A.
Summary
This proposal replaces MutableData and AppendOnlyData.
It merges MD and the key-val part of AD into a single type Map, and separates a Sequence type out from AD.
Map forms a perpetual MD in private and public form, while Sequence is essentially an AD without key semantics.
This proposal additionally removes the Sequenced and Unsequenced flavours, and lets an instance be capable of both.
The result is that we remove the lifetime distinction between concurrency-controlled types and those without concurrency control.
Instead a parameter is used to determine the level of concurrency control on a specific operation, essentially making it optional per request for any given instance.
Conventions
- The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
Motivation
- Keeping the MD behaviour when published.
- Removing unnecessary requirement of a key for a strict append-only structure.
- Streamline / unify data type permissions and ownership (as well as any data type related parts).
- Separate key-val and queue capabilities into distinct data types.
- Cut down on different data type flavours.
- Improve naming.
Keeping behaviour
It is quite plausible that apps start out with (or default to) a purely private iteration of the user data. There are numerous examples of very popular and widely used apps today that would work in this way. A supposed, or optional, later iteration in the app lifecycle would include making the data public, while still supporting the same functionality.
While it probably would be possible to work around the different behaviour client side, it is a profoundly unnecessary complication, as we can already now identify a very probable use case, where we want the private and public data to be handled in the same way, with extensions to the API for handling the specific differences between private and public.
Removing key from append-only type
With the current design of AD we are forcing the user of this data type to always provide a unique key, which most use cases for this kind of structure would not require, and we are forcing them to pay the price of the uniqueness validation. It is also confusing with regard to the purpose and usage of the data type, and speaks of a catch-all attempt and unfinished design.
As the saying goes:
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. - Antoine de Saint-Exupery
Streamline data types design
The data types, having been implemented over a longer period of time, have seen compromises that probably wouldn’t have been made had it all been designed at the same time. For example, AD and MD have two different implementations of permissions and ownership. We have three data types, of which two have some overlap in capabilities (MD and AD key-value), and one mixes two traditional data structures (AD with key-val and queue). We also have a Public / Private concept that applies differently to one of the key-val types than the other (MD is simply Private all the time, but can sort of be made Public by copying to the other type; AD can be both Private and Public).
It is quite a low-hanging fruit to let these data types be implemented more consistently, so that MD is replaced with PrivateMap, and permissions and ownership are handled in one way all over, instead of in multiple ways.
Cut down on different data type flavours
The earlier notion of Sequenced and Unsequenced is a definition that is neither necessary nor adding value. What it does, though, is complicate the data type nomenclature and ecosystem, and introduce accidental complexity as well as a very confusing developer experience.
The purpose of these types is to denote concurrency control. By fixing them on a per-lifetime basis for an instance, we have not solved any actual problem; we have only limited real use cases.
Instead of the type distinction, we will allow for a parameter variant signalling that a write is allowed to go through, regardless of underlying data version.
This way, we keep the feature, but remove the artificial limit that the per-life-time distinction imposes.
The decision is a result of studying existing long-time practices with concurrency controlled streams of data, as well as analyzing the actual requirements on such a feature.
It is important to know that this feature is based on future work, not yet completed, on network consistency management.
Improve naming
This is really a bonus, and not a driving motivator.
However, if we consider Arrays, HashMaps, Lists etc., they are not named ArrayData, HashMapData, ListData. We have room to simplify the name of our most fundamental components.
This is not a trivial question, since a logical and elegant (simple) nomenclature is very important for code readability, system understanding and developer experience. Furthermore, it is not an isolated phenomenon but speaks of the logic and elegance (simplicity) of the entire system.
Assumptions
This document doesn’t cover the basic functionality of the data types that remains unchanged.
Detailed Description
Synopsis
- MutableData and AppendOnlyData become Map and Sequence.
- Unpublished / Published becomes Private / Public.
- Unsequenced / Sequenced types are removed, and instead this is decided on a per-request basis.
- ImmutableData becomes Blob.
The current types…
UnpublishedUnsequencedMutableData
UnpublishedSequencedMutableData
UnpublishedUnsequencedAppendOnlyData
PublishedUnsequencedAppendOnlyData
UnpublishedSequencedAppendOnlyData
PublishedSequencedAppendOnlyData
UnpublishedImmutableData
PublishedImmutableData
…thus become the following:
PrivateMap
PublicMap
PrivateSequence
PublicSequence
PrivateBlob
PublicBlob
New and old names side by side:
PrivateMap ~ (UnpublishedUnsequencedMutableData / UnpublishedSequencedMutableData)
PublicMap - New!
PrivateSequence ~ (UnpublishedUnsequencedAppendOnlyData / UnpublishedSequencedAppendOnlyData)
PublicSequence ~ (PublishedUnsequencedAppendOnlyData / PublishedSequencedAppendOnlyData)
PrivateBlob ~ (UnpublishedImmutableData)
PublicBlob ~ (PublishedImmutableData)
The following table gives a rough overview of the capabilities.
Type | Can edit | Can add to | Can delete | Write once | Can share | Concurrency control | Can change owner |
---|---|---|---|---|---|---|---|
PrivateMap | x | x | x | - | x | x (*) | x |
PublicMap | - | x | - | - | x | x (*) | x |
PrivateSequence | x | x | x | - | x | x (*) | x |
PublicSequence | - | x | - | - | x | x (*) | x |
PrivateBlob | - | - | x | x | x (**) | - | - |
PublicBlob | - | - | - | x | x | - | - |

(*) Optional on a per-request basis, via parameter.
(**) Only as part of instantiation, as the entire Blob is immutable.
Blob
The current data type ImmutableData is in this proposal renamed to Blob.
The type is further sub-divided into:
- PrivateBlob (former UnpublishedImmutableData)
- PublicBlob (former PublishedImmutableData)
Sequence
This API is similar to current AppendOnlyData.
A Sequence MAY expose the following APIs (or equivalent):
(Slightly simplified.)
Sequence Writes (aka mutations)
- PutSequence()
- DeletePrivateSequence()
- Append(value)
Sequence Reads
- GetSequence -> Result[Sequence]
- GetSequenceShell -> Result[Sequence] // only metadata
- GetSequenceRange {
    versionStart,
    versionEnd
  } -> Result[Values]
- GetSequenceValueAt(version) -> Result[Value]
- GetSequenceExpectedVersions -> Result[ExpectedVersions]
- GetSequenceCurrentEntry -> Result[SequenceEntry]
Additionally there is the regular Permissions and Owners API from AD (see Owner and Permissions) which MAY remain largely unchanged.
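To make the Sequence read APIs listed above more concrete, here is a minimal in-memory sketch in Rust. It assumes that a version is simply the 0-based index of an appended value and that a range is half-open, `[start, end)`. The method names (`append`, `value_at`, `range`, `current_entry`) and the byte-vector value type are illustrative stand-ins, not the actual safe-nd API.

```rust
/// Hypothetical in-memory model of a Sequence: an append-only list of values,
/// where the version of an entry is assumed to be its 0-based index.
#[derive(Debug, Default)]
pub struct Sequence {
    values: Vec<Vec<u8>>,
}

impl Sequence {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append(value): stores the value last and returns the version it got.
    pub fn append(&mut self, value: Vec<u8>) -> u64 {
        self.values.push(value);
        (self.values.len() - 1) as u64
    }

    /// GetSequenceValueAt(version): the value stored at that version, if any.
    pub fn value_at(&self, version: u64) -> Option<&Vec<u8>> {
        self.values.get(version as usize)
    }

    /// GetSequenceRange { versionStart, versionEnd }: assumed half-open here.
    /// Returns None when the range is out of bounds or inverted.
    pub fn range(&self, start: u64, end: u64) -> Option<&[Vec<u8>]> {
        let (start, end) = (start as usize, end as usize);
        self.values.get(start..end)
    }

    /// GetSequenceCurrentEntry: the latest version together with its value.
    pub fn current_entry(&self) -> Option<(u64, &Vec<u8>)> {
        self.values
            .last()
            .map(|value| ((self.values.len() - 1) as u64, value))
    }
}
```

Note that the sketch has no operation for removing or replacing an individual value; in this model only the whole structure could be dropped, which mirrors the append-only intent.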
Private Sequence
A PrivateSequence MUST allow:
- Deletion of the entire structure.
… and MAY allow:
- Deletion of specific indices (no restrictions).
Implementation of specific-index deletion is deferred until we reach a consensus on the scope of such mutability.
Public Sequence
A PublicSequence MUST NOT allow deletions. Data is there in perpetuity.
Map
This API is supposed to reflect current MutableData as well as extend it for more complete access to the versions.
The combined flavours of a Map MAY expose the following APIs (or equivalent):
(Slightly simplified.)
- PutMap(data) // Creates map.
- DeletePrivateMap() // Deletes the map from the network.
- CommitMapTx(tx) // Commit multiple operations, fail or succeed all together.
- GetMap() -> Result[Map] // Returns everything.
- GetMapShell {
    expected_data_version,
  } -> Result[Map] // All metadata, but not the actual data.
- GetMapVersion() -> Result[Version] // Data version.
- GetMapValue(key) -> Result[Value] // Current value of specific key.
- GetMapValueAt {
    key,
    version,
  } -> Result[Value] // Value of a specific key at a specific version.
- GetMapValues() -> Result[Values] // All current values.
- GetMapEntries() -> Result[Entries] // All key-value pairs (current value).
- GetMapExpectedVersions() -> Result[ExpectedVersions] // Expected versions for data, owner and permissions.
- GetMapKeyHistory(key) -> Result[Values] // All history of specific key.
- GetMapKeyHistoryRange {
    key,
    versionStart,
    versionEnd
  } -> Result[Values] // A range of the history of a specific key.
- GetMapKeys() -> Result[Vec<Key>] // All keys.
- GetMapKeyHistories() -> Result[KeyHistories] // All history of all keys.
Here, tx is a transaction object (equivalent to current MdEntryActions) that we load with any add/update/delete operations we want to perform atomically.
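The all-or-nothing behaviour of CommitMapTx can be sketched as follows. This is only an illustration under simplifying assumptions: values are plain strings, key histories and versions are left out, and the names `Action`, `TxError` and `commit_tx` are hypothetical rather than taken from the existing code base. The operations are applied to a staged copy, and the copy replaces the live state only if every operation succeeded.

```rust
use std::collections::BTreeMap;

/// Hypothetical transaction operations, loosely modelled on add/update/delete.
#[derive(Debug, Clone)]
pub enum Action {
    Add { key: String, value: String },
    Update { key: String, value: String },
    Delete { key: String },
}

#[derive(Debug, PartialEq)]
pub enum TxError {
    KeyExists(String),
    EntryDoesNotExist(String),
}

/// Simplified Map that only tracks current values (no key history).
#[derive(Debug, Default, Clone)]
pub struct Map {
    entries: BTreeMap<String, String>,
}

impl Map {
    pub fn get(&self, key: &str) -> Option<&String> {
        self.entries.get(key)
    }

    /// Applies all actions to a staged copy; the live state is replaced
    /// only when every action succeeds, so a failure changes nothing.
    pub fn commit_tx(&mut self, tx: &[Action]) -> Result<(), TxError> {
        let mut staged = self.entries.clone();
        for action in tx {
            match action {
                Action::Add { key, value } => {
                    if staged.contains_key(key) {
                        return Err(TxError::KeyExists(key.clone()));
                    }
                    staged.insert(key.clone(), value.clone());
                }
                Action::Update { key, value } => match staged.get_mut(key) {
                    Some(slot) => *slot = value.clone(),
                    None => return Err(TxError::EntryDoesNotExist(key.clone())),
                },
                Action::Delete { key } => {
                    if staged.remove(key).is_none() {
                        return Err(TxError::EntryDoesNotExist(key.clone()));
                    }
                }
            }
        }
        self.entries = staged;
        Ok(())
    }
}
```

Cloning the whole map is the simplest way to get atomicity in a sketch; a real implementation would more likely validate first and then apply, or keep an undo log.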
Additionally there is the regular Permissions and Owners API (see Owner and Permissions) from AD which MAY remain largely unchanged in functionality, while it also MAY be renamed as per the above description; AccessList etc…
Updating Map owners and permissions
As these reference the Map data as it was at a specific point in time, there MUST be a top-level version for its data. This version MUST be incremented every time there is an operation that increments a key version.
This means that owners and permissions work largely the same in Map (and Sequence) as they currently do for AD.
Key versioning
Both Public and Private Maps use key versioning. This means that there is a history of values for each key in the Map. An Update SHALL append a new value at the specific key, thus incrementing the key version. A Delete SHALL append a Tombstone to that key’s value array (and similarly increment the key version). At that point, it would make sense that Add(key, value) is again successful for that key (while it previously would give a KeyExists error), in which case the new value is simply appended after the tombstone. An Update or Get returns EntryDoesNotExist when the current value is a Tombstone.
The difference between Private and Public is that the Private Map allows hard-delete and hard-update, which erases the actual value.
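The key versioning rules described above can be sketched with a per-key history vector. The following Rust model is a rough illustration only: it assumes string keys and values, takes the key version to be the length of the key's history, and also bumps a top-level data version on every write. The type and method names are hypothetical, and treating Delete on an already tombstoned key as EntryDoesNotExist is an assumption, since the text does not specify that case.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Value(String),
    Tombstone,
}

#[derive(Debug, PartialEq)]
pub enum Error {
    KeyExists,
    EntryDoesNotExist,
}

/// Hypothetical Map model: every key owns a history of entries.
#[derive(Debug, Default)]
pub struct VersionedMap {
    histories: BTreeMap<String, Vec<Entry>>,
    data_version: u64,
}

impl VersionedMap {
    /// A key is "live" when its latest entry is a value, not a Tombstone.
    fn is_live(&self, key: &str) -> bool {
        matches!(
            self.histories.get(key).and_then(|h| h.last()),
            Some(Entry::Value(_))
        )
    }

    /// Appends to the key history and bumps the top-level data version.
    fn push(&mut self, key: &str, entry: Entry) {
        self.histories.entry(key.to_string()).or_default().push(entry);
        self.data_version += 1;
    }

    /// Add succeeds for a new key or for a key whose current value is a Tombstone.
    pub fn add(&mut self, key: &str, value: &str) -> Result<(), Error> {
        if self.is_live(key) {
            return Err(Error::KeyExists);
        }
        self.push(key, Entry::Value(value.to_string()));
        Ok(())
    }

    pub fn update(&mut self, key: &str, value: &str) -> Result<(), Error> {
        if !self.is_live(key) {
            return Err(Error::EntryDoesNotExist);
        }
        self.push(key, Entry::Value(value.to_string()));
        Ok(())
    }

    /// Soft delete: appends a Tombstone, earlier values stay in the history.
    pub fn delete(&mut self, key: &str) -> Result<(), Error> {
        if !self.is_live(key) {
            return Err(Error::EntryDoesNotExist);
        }
        self.push(key, Entry::Tombstone);
        Ok(())
    }

    pub fn get(&self, key: &str) -> Result<&str, Error> {
        match self.histories.get(key).and_then(|h| h.last()) {
            Some(Entry::Value(v)) => Ok(v.as_str()),
            _ => Err(Error::EntryDoesNotExist),
        }
    }

    pub fn key_version(&self, key: &str) -> u64 {
        self.histories.get(key).map_or(0, |h| h.len() as u64)
    }

    pub fn data_version(&self) -> u64 {
        self.data_version
    }
}
```

In this model nothing is ever removed from a history, which corresponds to the soft operations shared by both Map flavours.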
Private Map
A PrivateMap SHALL allow:
→ Deletion of the entire structure.
→ Deletion of current key value.
→ Deletion of keys.
The characteristic of a PrivateMap is that it additionally allows for deleting a value, so that the data is permanently removed from the network. This, however, will always bump the version.
The PrivateMap API is extended with ‘hard_delete’ and ‘hard_update’ to allow for this, in addition to the capabilities it shares with PublicMap.
The key can be treated as a single value, where delete or hard_delete just renders the key virtually deleted (i.e. GET will respond KeyDoesNotExist). Using delete, the actual value is still available in the history.
With hard_delete, the value is deleted, but the previous history is intact. Deletion of a key along with its entire history MAY also be possible (this is similar to how MD works today).
The options are then: deleting the entire history but leaving the last version, deleting the entire history without leaving a version, or deleting the key and the history (which leaves no opportunity to keep the key version).
And ultimately, the entire instance can be deleted from the network.
hard_delete and hard_update
These operations replace the old value with a Tombstone, and then append another Tombstone in the case of delete, or the new value in the case of update.
The vector also reflects the version, which is the reason for inserting a Tombstone: it hard-deletes the data, but also increments the version.
Using hard_delete and hard_update operations, the data history would look something like this:
[value] -> update(new_value) -> [Tombstone, new_value] -> expected_version = 2
[Tombstone, new_value] -> delete -> [Tombstone, Tombstone, Tombstone] -> expected_version = 3
…while the ‘soft’ delete and update gives:
[value] -> update(new_value) -> [value, new_value] -> expected_version = 2
[value, new_value] -> delete -> [value, new_value, Tombstone] -> expected_version = 3
If possible, this MAY be extended with hard_delete_at and hard_update_at, so as to allow deletion at a specific version. The above scheme would, however, likely need to be implemented differently to allow the version to be properly bumped without changing any existing data versions.
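The hard and soft histories shown above can be reproduced with a small Rust sketch that operates on a single key's history. It assumes that the expected version equals the length of the history vector, as in the examples; the free functions and the `Entry` type are illustrative names only.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Value(String),
    Tombstone,
}

/// Overwrites the current (last) entry with a Tombstone, erasing its value.
fn erase_current(history: &mut Vec<Entry>) {
    if let Some(last) = history.last_mut() {
        *last = Entry::Tombstone;
    }
}

/// Soft update: the old value stays in the history.
pub fn soft_update(history: &mut Vec<Entry>, new_value: &str) {
    history.push(Entry::Value(new_value.to_string()));
}

/// Soft delete: the old value stays, a Tombstone marks the deletion.
pub fn soft_delete(history: &mut Vec<Entry>) {
    history.push(Entry::Tombstone);
}

/// Hard update: erase the old value, then append the new one.
pub fn hard_update(history: &mut Vec<Entry>, new_value: &str) {
    erase_current(history);
    history.push(Entry::Value(new_value.to_string()));
}

/// Hard delete: erase the old value, then append a Tombstone.
pub fn hard_delete(history: &mut Vec<Entry>) {
    erase_current(history);
    history.push(Entry::Tombstone);
}

/// Assumed here to be the number of entries written so far.
pub fn expected_version(history: &[Entry]) -> u64 {
    history.len() as u64
}
```

Because both variants append exactly one entry per operation, the version advances identically in the hard and the soft case; only the retained values differ.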
Public Map
A PublicMap would essentially expose the exact same API as the above with the exception of hard_delete, hard_update and deletion of the instance, as the PublicMap data is perpetual.
The API for Update/Delete exists; however, no data is ever removed from the network, and the key history is maintained. Any data going in will exist there in perpetuity, although new data might be appended for a given key, which effectively emulates the mutability of the PrivateMap.
Owner and Permissions
For Map and Sequence, the ownership and permissions API is as follows:
(Slightly simplified.)
- Set[Map|Sequence]Owner {
    owner,
    expected_version,
  }
- Set[Public|Private][Map|Sequence]AccessList {
    access_list,
    expected_version,
  }
- Get[Map|Sequence]Owner() -> Result(Owner)
- Get[Map|Sequence]OwnerAt(version) -> Result(Owner)
- Get[Map|Sequence]OwnerHistory() -> Result(Vec(Owner))
- Get[Map|Sequence]OwnerHistoryRange {
    versionStart,
    versionEnd
  } -> Result(Vec(Owner))
- Get[Map|Sequence]AccessList -> Result(AccessList)
- Get[Map|Sequence]AccessListAt(version) -> Result(AccessList)
- Get[Public|Private][Map|Sequence]AccessListHistory -> Result(Vec([Public|Private]AccessList))
- Get[Public|Private][Map|Sequence]AccessListHistoryRange(versionStart, versionEnd) -> Result(Vec([Public|Private]AccessList))
- Get[Public|Private][Map|Sequence]UserPermissions(user) -> Result([Public|Private]UserAccess)
- Get[Public|Private][Map|Sequence]UserPermissionsAt {
    version,
    (user | public_key),
  } -> Result([Public|Private]UserAccess)
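As an illustration of how the owner part of this API could behave, the sketch below keeps owners in an append-only history indexed by version. It is written under several assumptions: an owner is identified by a string key, each owner entry records the data version it refers to, the expected version of a SetOwner call must equal the current number of owner entries, and ranges are half-open. All names are hypothetical.

```rust
/// Hypothetical owner record: who owns the data, and which data version
/// the ownership change refers to.
#[derive(Debug, Clone, PartialEq)]
pub struct Owner {
    pub public_key: String,
    pub data_version: u64,
}

#[derive(Debug, PartialEq)]
pub enum Error {
    VersionMismatch { current: u64 },
}

/// Append-only owner history; the index of an entry is its owner version.
#[derive(Debug, Default)]
pub struct OwnerHistory {
    owners: Vec<Owner>,
}

impl OwnerHistory {
    /// SetOwner { owner, expected_version }: rejected when the caller's
    /// expected version does not match the current number of entries.
    pub fn set_owner(&mut self, owner: Owner, expected_version: u64) -> Result<(), Error> {
        let current = self.owners.len() as u64;
        if expected_version != current {
            return Err(Error::VersionMismatch { current });
        }
        self.owners.push(owner);
        Ok(())
    }

    /// GetOwner: the latest owner, if one has been set.
    pub fn owner(&self) -> Option<&Owner> {
        self.owners.last()
    }

    /// GetOwnerAt(version).
    pub fn owner_at(&self, version: u64) -> Option<&Owner> {
        self.owners.get(version as usize)
    }

    /// GetOwnerHistoryRange { versionStart, versionEnd }, half-open here.
    pub fn history_range(&self, start: u64, end: u64) -> Option<&[Owner]> {
        let (start, end) = (start as usize, end as usize);
        self.owners.get(start..end)
    }
}
```

The access list history could be modelled the same way, with an access list type in place of `Owner`.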
Concurrency control
The previous Sequenced / Unsequenced flavours of Map and Sequence are removed with this proposal.
The types were subject to concurrency control, specifically optimistic concurrency. In contrast to previous logic, where an instance was forever locked to one behaviour, without actually enforcing the behaviour (it was easily bypassed), we now allow for the user to decide on a case-by-case-basis if a request should honour existing values or not.
This works exactly the same as with the previous terminology, but we are now combining the two capabilities into one, and thereby cutting down on the data types, as well as giving developers more freedom, as we assume less about their requirements and behaviour.
In other words: the solution chosen here is to not fix the concurrency control of an instance over its entire lifespan, and instead pass in optimistic concurrency check as parameter to operations.
It’s as simple as adding an ExpectedVersion enum parameter, with one of the variants being Any, which would indicate that we will write regardless of the version.
pub enum ExpectedVersion {
    Any,           // this means concurrency check is OFF
    Specific(u64), // this means concurrency check is ON
}
This is safe to do because the concurrency control is meant for us - the current writer. We are not preventing someone else from circumventing it. Thus we can just as well simplify, widen the capability, and the end result is the same; we will execute code correctly because it is in our interest, and the other writers (defined as not being “us”) still have to execute code correctly, for the concurrency control to work. If it’s not in their interest, then the original concurrency control was no help; they are rogue players with write access (so the owner of the data messed up in one way or another).
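A minimal sketch of how a write could honour the ExpectedVersion parameter is given below. It assumes that the current version of an instance is the number of entries written so far, and that Specific(v) must match it exactly; the `Sequence` model, the `VersionMismatch` error and the `append` signature are illustrative, not the final API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ExpectedVersion {
    Any,           // concurrency check is OFF for this request
    Specific(u64), // concurrency check is ON for this request
}

#[derive(Debug, PartialEq)]
pub enum Error {
    /// Hypothetical error carrying the version the instance is actually at.
    VersionMismatch { current: u64 },
}

#[derive(Debug, Default)]
pub struct Sequence {
    values: Vec<String>,
}

impl Sequence {
    /// Assumed here to be the number of values written so far.
    pub fn version(&self) -> u64 {
        self.values.len() as u64
    }

    /// Appends a value; the optimistic concurrency check is applied only
    /// when the caller asks for it on this particular request.
    pub fn append(&mut self, value: &str, expected: ExpectedVersion) -> Result<u64, Error> {
        let current = self.version();
        if let ExpectedVersion::Specific(v) = expected {
            if v != current {
                return Err(Error::VersionMismatch { current });
            }
        }
        self.values.push(value.to_string());
        Ok(self.version())
    }
}
```

The same instance accepts both checked and unchecked writes, which is the per-request behaviour that replaces the former Sequenced / Unsequenced type split.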
Encryption
In all of the above cases, the contents of the structures can be either Encrypted or RawContent.
This can simply be called what it is, since the concept of something being encrypted is commonly known to mean secret, very secure and protected from unauthorized access (even though how it works and the handling of keys is something quite alien to many - so as long as encrypting something is as simple as pressing a button or ticking a checkbox, it should be fine UX-wise).
It enables us to reclaim the describing property Private for usage in a more natural taxonomy of our data types (as per the current proposal).
Sharing
No changes are proposed here.
Considerations for Private/Public naming
The question was raised whether there is a risk that people may incorrectly believe something they upload as Private and unencrypted is private as in 100% not accessible whatsoever by anyone else (while in reality, as long as it is not encrypted, even Private data can be read by nodes when it is in transit).
Response to considerations
Data put as Private should be encrypted by default. There are many reasons for this, one being that it removes the “whisper-game” rumours of “your data can be read by the nodes when in transit”, which will be very sticky in various corners of the internet, and degrade the message of “your data is safe on the network”. Another reason is that a user has to be educated regardless on the fact that the data is not encrypted by default; Unpublished doesn’t really avoid the risk suggested above. It’s not fully clear from that name that the data would not be encrypted, so it’s still something a user has to learn (or we solve it for them).
So, it is actually better that we educate a user that if you want to upload your data unencrypted you have to do that explicitly (with all warnings and education there).
The motivation for that is that those having the use case that they want it unencrypted, would be in a much better position to actually grasp what that entails.
Additionally, having “private conversations” etc. is a common and well understood concept, which means that the data of the conversation is private, but it is still shared with someone of your choice (the other party/-ies in the conversation). It is therefore argued here that “Private” is commonly known to be optionally shared, and that we explicitly use the word Encrypted instead when we want to point out that it is in fact not readable by anyone else, unless explicitly granted by the owner. (See motivation behind use of the term in the Encryption section.)
Alternatives
Some alternatives have come up regarding the new names of the data structures.
The current set of names has been chosen as a whole, with the intent to have them roughly levelled aesthetically - meaning that they carry about the same amount of information and do not differ much in how descriptive they are. Consistency in the general design, with a preference for palatable and easy-enough-to-understand names, has been considered more important than having the ultimate descriptive name at the cost of those properties.
The idea being that any developer regardless will have to visit the RFC / documentation at some point to fully understand the data types (how to use them, what their capabilities and limitations are), so we might just as well lean towards more palatable naming, and leave the very detailed descriptions for the documentation.
Alternatives to Map
An alternative name KeyValue or KeyValueStore has been proposed as well, which would be more descriptive than Map.
While it probably is more descriptive, the arguments for Map are that “map” is a common representation of key-value structures within many programming languages, and that it is a slicker name as well.
Alternatives to Sequence
A question was raised regarding the descriptive value of this name, and whether it is easy to relate to it based on previous usage within the programming domain. AppendOnly was suggested.
Here too, it might very well be the case that it is not the ultimate name in terms of description, but just as with Map, a slick and catchy name that is easy enough to understand was preferred.
Arguments against the alternatives
A “sequence” and a “map” are real things in themselves and also describe the properties of these data structures fairly well. An “appendonly” and a “keyvalue” are a thing only by convention if we decide so; they’re not actual words or things otherwise.
It would seem that both KeyValue and AppendOnly are not entirely complete names as they are, and that appending the word Store (or similar) to them is almost necessary to get all the way in terms of clarity (KeyValueStore, AppendOnlyStore). And by that we are again quite far away from the slick and simple names. (Not to mention we’ve now introduced additional connotations with Store that we might not want there.)
Additional naming considerations
The use of Private / Public antonyms is a deliberate step away from sharing the same word base in the two opposing descriptions (e.g. Unpublished / Published).
It is RECOMMENDED that we use natural, existing and commonly used antonyms which do not share word base (e.g. Open / Closed), by which we achieve a much higher level of readability and clarity, additionally lowering the risk of confusion when speaking of the terms - since the words do not sound similar.
Implementation roadmap
This RFC may or may not be implemented for Fleming.
Important to notice is that until we order mutations with PARSEC at the data handlers (on the After Phase2a list of Fleming), the data versions might not be the same in all the vaults.
That is to say, version handling at the network requires more work, and until this feature is implemented (given that it is before Fleming), it might not work as expected.
While this is the case, the ExpectedVersion.Any variant might be disabled or implemented later, in case we want to avoid that uncertainty while waiting for the work to finish.
Unresolved questions
- There need to be more details (to be added later) covering aspects of unified permissions handling, sharing, encryption keys and how encrypted content is to be handled by default for Private types.
Changelog Data Types RFC
2020-01-30
- Clarified Private | Public scope implications for Map. (thanks @tfa)
- Added description of possible extensions for deletion scheme in PrivateMap.
2020-01-21
- Replaced Guarded types with per-request-configuration of concurrency control.
2020-01-16
- Ordered the listing of old and new type names to match index of them. (thanks @david-beinn)
- Added a listing with new and old name side by side.
- Fixed flipped scope / concurrency (Public / Private, Guarded) in names.
- Fixed flipped order of Public / Private scopes between listing and table. (thanks @david-beinn)
- Removed non-existing old types from listing. (thanks @tfa)
- Updated Sentried to Guarded. (thanks @JPL, @dask, @jlpell)
- Removed idea of excluding Private from names, from Unresolved questions section. (thanks @jlpell)
- Found antonym with different wordbase for Encrypted; RawContent, replacing RawContent. (thanks @jlpell)
Note: If you strongly disagree with any of the above updates, please discuss it in the forum topic, for possible revert, or other change.