[RFC] Data Types Refinement

tfa · January 9, 2020, 10:49pm

Not a combination of letters but a single letter: A, M or I, for AppendOnly, Mutable or Immutable respectively.

You cannot dismiss a carefully designed naming convention like this. Consistency is important and there are many data structures based on it.

We are talking about a convention for developers, and for them the meaning of a name must be instantaneous. When I see “Data” I know the structure is a safe-nd structure, the prefixed letter specifies which main data structure it is, and further prefixes like Pub/Unpub or Seq/Unseq are sub-categorizations of that data structure. So, in summary: Pub/Unpub + Seq/Unseq + A/M/I + Data, which is clear and regular.
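
To make the regularity of this convention concrete, here is a minimal Rust sketch that splits a name of the form Pub/Unpub + Seq/Unseq + A/M/I + Data back into its three parts. The enum names and the function parse_type_name are hypothetical, chosen only for illustration; they are not taken from safe-nd.

```rust
/// Hypothetical enums mirroring the three parts of the naming convention.
#[derive(Debug, PartialEq)]
enum Scope {
    Pub,
    Unpub,
}

#[derive(Debug, PartialEq)]
enum Sequencing {
    Seq,
    Unseq,
}

#[derive(Debug, PartialEq)]
enum Kind {
    AppendOnly,
    Mutable,
    Immutable,
}

/// Splits a name such as "UnpubSeqAData" into (scope, sequencing, kind).
/// Returns None when the name does not follow the convention.
fn parse_type_name(name: &str) -> Option<(Scope, Sequencing, Kind)> {
    // First part: Pub / Unpub.
    let (scope, rest) = if let Some(rest) = name.strip_prefix("Unpub") {
        (Scope::Unpub, rest)
    } else if let Some(rest) = name.strip_prefix("Pub") {
        (Scope::Pub, rest)
    } else {
        return None;
    };

    // Second part: Seq / Unseq.
    let (sequencing, rest) = if let Some(rest) = rest.strip_prefix("Unseq") {
        (Sequencing::Unseq, rest)
    } else if let Some(rest) = rest.strip_prefix("Seq") {
        (Sequencing::Seq, rest)
    } else {
        return None;
    };

    // Third part: a single letter A, M or I followed by "Data".
    let kind = match rest {
        "AData" => Kind::AppendOnly,
        "MData" => Kind::Mutable,
        "IData" => Kind::Immutable,
        _ => return None,
    };

    Some((scope, sequencing, kind))
}
```

With this sketch, parse_type_name("UnpubSeqAData") yields the unpublished, sequenced, append-only combination, while a name outside the convention yields None.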

To illustrate the huge risk of what you want to undertake, here are a few figures from a search for “[AMI]Data” as a regular expression in the source code:

safe_client_libs (master): 1993 results
safe_vault (vNext): 982 results
safe-nd (master): 444 results
safe-api (master): 66 results

And, since function names use a different capitalization, the same search for “_[ami]data”:

safe_client_libs (master): 620 results
safe_vault (vNext): 253 results
safe-nd (master): 3 results
safe-api (master): 41 results
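
For readers who want to see what the two patterns actually match, here is a small Rust sketch using only the standard library. The figures above were presumably produced with an editor or grep-style search; the function count_pattern is a hypothetical stand-in that counts occurrences of a fixed prefix, one character from a set, and a fixed suffix.

```rust
/// Counts occurrences of `prefix` + one byte from `class` + `suffix` in `text`.
/// For example ("", "AMI", "Data") behaves like the regular expression
/// [AMI]Data, and ("_", "ami", "data") like _[ami]data.
/// Hypothetical helper, not part of any MaidSafe tooling.
fn count_pattern(text: &str, prefix: &str, class: &str, suffix: &str) -> usize {
    let (p, c, s) = (prefix.as_bytes(), class.as_bytes(), suffix.as_bytes());
    // Every match has the same width: prefix, one class byte, suffix.
    let width = p.len() + 1 + s.len();
    text.as_bytes()
        .windows(width)
        .filter(|w| w.starts_with(p) && c.contains(&w[p.len()]) && w.ends_with(s))
        .count()
}
```

Note that a spelled-out name such as MutableData does not match [AMI]Data, because the character before Data is a lowercase letter; only the abbreviated forms AData, MData and IData are counted.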

This convention is an asset that must be preserved and changing it now would be sabotage.

david-beinn · January 9, 2020, 10:51pm

I only meant this as a trivial point. Just that in the list of new names Public always comes first, and then when the capabilities are listed, Private always comes first!
It was just that my first instinct was that all the lists would match up with each other precisely, and for a moment I had to adjust. On the plus side though, it forced me to read and understand it!

As a couple more suggestions for Sentried, how about regulated or resolved?

danda · January 9, 2020, 10:53pm

A related idea is to use words from another language if a particular word is too overloaded in English.

bloque, bloc, piedra, lapis, fijo, etc, etc.

Traktion · January 10, 2020, 8:11am

These are really good points, @tfa. While directly coupling API naming to the underlying code should be avoided, common naming really helps to keep the code base aligned. Having not spent much time in these repos, I wasn’t aware of the extent of this.

I think this comes down to competing requirements between internal and external development. Ideally, both should align where possible. Is it a step too far, too late, to change the external naming? It is a good question, imo.

dask · January 10, 2020, 10:43pm

Great RFC and discussion. All points make sense to me on a single read, and this is particularly vital because SAFE is going to need to be explained to the masses at some point, and every word used to describe data risks undesirable connotations.

I agree Sentried is confusing… I have been following safe for years and programmed my whole life and I had to triple check my understanding of it.

How about ‘guarded’, ‘barred’, ‘constrained’, ‘restricted’, ‘kept’, ‘reserved’?

They don’t really get across the idea that writes are version checked without introducing other weird concepts, imho. But this is a hard one. Guarded is the best of that list, I think.

Along those lines, how about Checked, Consistent, etc.?

Or the good old route of simply saying exactly what it is: OCC, e.g. PrivateOCCMap.

jlpell · January 11, 2020, 8:19am

Per your example, hard_delete and hard_update are what is expected from a true conventional delete and update operation, so you should just call them delete and update to keep it short and sweet.

In your “soft delete” example, you simply append a Tombstone. You have temporarily “killed” the object, the action verb being kill, so just call the operation what it is. All is not lost, however, because as I understand it your “soft update” operation will add another value entry after the tombstone if so desired. This offers programmers the ability to revive an object after it has been given a tombstone. The object becomes useful again and stores a meaningful value. It has been revived.
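
The behaviour discussed here can be sketched in Rust as follows. This is a minimal illustration, not the RFC's implementation: the types Entry and History are hypothetical, the method names follow the terms used in this thread, and it is an assumption that the hard variants simply discard the whole trail.

```rust
/// One entry in the history of a key: a stored value, or a tombstone
/// marking a soft delete. Hypothetical types for illustration.
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

#[derive(Debug, Default)]
struct History {
    entries: Vec<Entry>,
}

impl History {
    /// The current value, or None if the history is empty or ends in a tombstone.
    fn current(&self) -> Option<&[u8]> {
        match self.entries.last() {
            Some(Entry::Value(v)) => Some(v.as_slice()),
            _ => None,
        }
    }

    /// Soft update: append a new value and keep the trail.
    /// After a tombstone this "revives" the entry.
    fn soft_update(&mut self, value: &[u8]) {
        self.entries.push(Entry::Value(value.to_vec()));
    }

    /// Soft delete: append a tombstone and keep the trail.
    fn soft_delete(&mut self) {
        self.entries.push(Entry::Tombstone);
    }

    /// Hard update (assumed semantics): drop the trail, keep only the new value.
    fn hard_update(&mut self, value: &[u8]) {
        self.entries.clear();
        self.entries.push(Entry::Value(value.to_vec()));
    }

    /// Hard delete (assumed semantics): drop everything, irreversibly.
    fn hard_delete(&mut self) {
        self.entries.clear();
    }
}
```

In this sketch a soft delete followed by a soft update leaves three entries in the history (value, tombstone, value), whereas the hard operations leave no trace of earlier values.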

I forgot to mention this earlier. Looking at the examples you gave in the RFC it looks like you are starting the version index at 1. Can you please begin with version 0? This way the version count matches the array or vector index used to store the data. Indices that start with 1 instead of 0 often lead to misfortune and grief.

Ok, now for a look at the comments related to chunk, blob, and block…

This was not clear in the RFC. My interpretation was that Blob was being used as a specific term to describe an immutable chunk of data.

Yes, and it should be.

My interpretation of the RFC was that we were only talking about a new name for immutable chunks. Blob would be a good name for mutable chunks.

Agreed, in my post the intent was not to call the collection or map of objects a block. Far better to call a map a map, a set a set, and a collection a collection.

When you said ImmutableData would now be called a Blob, I thought you were giving a specialized term for an immutable chunk. I guess I am confused a bit too.

Agreed. I think you struck gold with guarded. Guard is a simple synonym for Sentry.

oetyng · January 11, 2020, 10:02am

It is 0-based. The expected version is the length of the array.

Expected version is the value passed in for optimistic concurrency control. It is the version you expect to come next, not the current version. This is a convention.
You could pass in the current version instead, but then you’d need to be able to pass in a value meaning “empty”, which is a complication. Additionally, with an unsigned type that value would not be representable.

Here is the convention explained (only, it’s called index instead of version, something that got updated later): https://github.com/maidsafe/safe-nd/pull/126
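
A minimal Rust sketch of the expected-version convention described above, under the assumption that values live in a plain vector. The types Sequence and VersionMismatch are hypothetical and are not the safe-nd API.

```rust
/// Returned when the caller's expected version does not match.
#[derive(Debug, PartialEq)]
struct VersionMismatch {
    passed: u64,
    required: u64,
}

/// Hypothetical append-only structure guarded by an expected version.
#[derive(Debug, Default)]
struct Sequence {
    values: Vec<Vec<u8>>,
}

impl Sequence {
    /// The version the next write must carry. It equals the current length,
    /// so it is 0 for an empty structure and no special "empty" value is needed.
    fn expected_version(&self) -> u64 {
        self.values.len() as u64
    }

    /// Appends only if the caller expected exactly the next version.
    fn append(&mut self, value: &[u8], expected_version: u64) -> Result<(), VersionMismatch> {
        let required = self.expected_version();
        if expected_version != required {
            return Err(VersionMismatch {
                passed: expected_version,
                required,
            });
        }
        self.values.push(value.to_vec());
        Ok(())
    }
}
```

The first write passes 0, which is also the index at which the value ends up in the vector. A second writer that still passes 0 is rejected, which is the optimistic concurrency check.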

stout77 · January 11, 2020, 10:37am

@oetyng Apologies if I missed this in my, superficial I have to admit, review of the above, but when would you be thinking to implement this proposal? It looks like a valuable improvement to me, but considering the complex changes to the code required, as highlighted by tfa, can this be implemented after Fleming? And I assume this would break the APIs?

happybeing · January 11, 2020, 12:27pm

@jlpell my reading is that the proposed Blob is an immutable file/object not a chunk. Chunks remain as they were if I understand correctly. Hence my suggestion we use ‘Immutable’ as a short form of Immutable Blob, rather than Blob which doesn’t suggest immutability.

oetyng · January 11, 2020, 2:30pm

This is to me an incomprehensible combination of letters without any prior knowledge. Additionally, if we were to just make up words like that, then much more ergonomic ones are easy to find.

Renaming is not really something I’d categorize as a “huge risk”, basing that both on all the capabilities of IDEs, Git etc., and on plenty of experience, both on my part and among the others in the team, of doing it with far larger impact in terms of lines of code. I don’t know if you read the entire RFC, but the renaming is a trivial part of the changes.

Additionally, this RFC has been up for discussion for quite some time now internally as we were building up consensus for the idea. While there are of course some who (correctly) point out the extra work and would be hesitant to introduce changes, I haven’t heard anyone categorize it as such a difficult transition, or object strongly to the idea.

Hey @stout77, no worries, and no need to apologize. It is not mentioned in the RFC. In a couple of comments below we discuss it, and basically most of these things were coded as part of evolving the idea, with tests etc. But you probably wonder when we’d try to get this merged into the code base. Currently we are reviewing and discussing the code. Then there are other repos that need refactoring, and additionally an overview of all tests, complementing them where necessary. If we are done before Fleming, then it would be there; if not, then it comes after.

Yes, in most cases. Not big changes though, and all of those APIs are given extra care to make them better as well. There are currently inconsistencies and inaccuracies as well as just less well crafted parts of the API, that we will take the chance to improve while we’re at it.

Thanks @dask , great suggestions. Guarded would cover many of the requirements IMO.
PrivateOCCMap falls down on the acronym, I would say.

While I see why such a naming scheme could make sense in a way, I wouldn’t consider this a good solution or practice for a couple of reasons:
A. It goes contrary to the ideas of domain driven design. I.e. we’re not killing or reviving things, we are deleting and updating, albeit leaving a trail or not.
B. The Tombstone is an implementation detail. The naming is an old standard within database programming, which is why it was chosen. It is not something to base an API on, since it is more or less irrelevant and basically an arbitrary name with no relevance to the domain, and would probably not have been used had it not become a standard. (See my reasoning in previous post, about lean and comprehensible, and using existing vocabulary in the domain.)

What we do when designing is to look at what the use cases and contexts are, and how the operations are perceived within those.
We have identified that these operations are conceived and used as deletes and updates, but some are irreversible, and some are not.

I’m totally open to not using hard_delete or hard_update for the irreversible operations, but I think we could argue that those are existing and common words used for the equivalent operations/contexts.

Yes, very understandably so. I was mentioning above that I will need to come back about this, because there is currently confusion about those, which became apparent very recently. ImmutableData, IData and Chunk are used interchangeably, both in RFCs, among MaidSafe staff, and community members, for describing same and different things.

ImmutableData has been used in wordings like “the chunks of ImmutableData”.
ImmutableData has been used (as you interpreted it) as meaning “chunk”.
IData, which is short for ImmutableData, is used interchangeably with “chunk”.

And this goes back at least to 2016. I searched the archives and looked at recent discussions, and the mixup is total and present all over. Some even change the meaning within the same sentence.

In the RFC, a Blob is meant to replace the first meaning “the chunks of ImmutableData”, with “the chunks of a Blob”.

However, due to this pervasive naming confusion, I am currently considering leaving the renaming of ImmutableData to a later time, so that we can sort that out first.
Again, I will need to come back to this, and the RFC will be updated soon with various bits collected so far.

tfa · January 11, 2020, 9:50pm

Modifying about 4400 lines of code is not a trivial change.

There are many branches and forks, and inevitably there will be merge conflicts that have to be resolved manually, which is error prone (a renaming in itself does not generate a conflict, but another modification on the same or an adjacent line does).

In addition to the consistency of the naming convention that I already mentioned, there is also the long history of established names that should be considered: terms like “chunk”, “mutable data”, “immutable data” are well known because they have been used for a long time (they even existed when safe network was coded in C++ language!).

Appendable data is not so old but already existed in the past.

I don’t want years of usage, not only in code but also in the forum, wiped out like that.

oetyng · January 12, 2020, 6:20pm

Thanks for your opinions @tfa.

You state many things as facts, while I would say it is relative.
To me the name changes are trivial, both in themselves and especially relative to all other changes in the proposal, but even more so, relative to previous and upcoming tasks.

As for what you say about git, merges and conflicts, I don’t know what makes you think this would be new information for experienced developers.
All such things require their measures to be carried out, but in no way do I conceive them to be such a hurdle as you depict, and I have perceived no such view from anyone else either (as said already).

Maybe you and I have different experience of these kinds of things?

About changing what has been in place for a long time: that happens and has happened all the time during development, and will surely keep happening, in various degrees, more or less frequently, until a live version has matured (which actually means long after release). That is development. So, it’s not a very good argument, also for a couple more reasons:

A. So far, my impression is that a majority has been very positive about the changes.

B. The current number of users is practically non-existent compared to the user base that is coming. Those are the ones to consider, and what will be best for them. We early adopters cannot be allowed to hinder improvements just because they would be inconvenient to us. It would be extremely short-sighted and selfish.

We, a small number of relative die-hards, are here for good and bad during the changes, turmoil and uncertainty of startups, but things cannot be designed for us, but for those that are yet to come. That’s my view and opinion.

“I don’t want years of usage, not only in code but also in the forum, wiped out like that.” (@tfa)

Actually, the way data types have changed, it could be quite confusing to read the archives and understand what is still relevant.
With new names, it would be much easier to filter out the history of irrelevant info, and find accurate and up to date information.

Again though, the names, as said in the RFC, I consider a bonus, and the topic is much bigger. Maybe we can proceed, now that you and I have (hopefully) agreed to disagree?

I can add that I plan to do a lot more work to improve the quality, robustness and consistency of the code base, I’ve only just started really. And that is over all the MaidSafe repositories. If I do not find support within the MaidSafe team for my ideas, then those are probably not going to show up here on the forum either. All of us engineers consider cost / benefit of all such changes, on many levels, and we actively encourage each other to voice all concerns and objections, which we debate in order to come to a common understanding on where to move.

jlpell · January 14, 2020, 4:16am

Simple word changes are very trivial if you use Geany as a text editor (which I highly recommend to anyone and everyone as IMO the best IDE). All you need to hit is Ctrl + H and you are 90% of the way there.

To your point, functional changes and code refactoring are non-trivial but not necessarily ridiculously difficult either. The best refactoring is done with the delete key.

IMO the concepts/descriptors chunk, mutable data, and immutable data will always be with us. That doesn’t mean that the use of additional pet names that help build a cohesive and readily understandable object hierarchy is bad.

oetyng · January 16, 2020, 1:31pm

Yeah, but in that case I’d suggest rephrasing that, so that after every use of the word Blob, we also tack on “(which is always immutable)”, to not raise the question “what about the mutable Blob then?”
Having an adjective before a subject (immutable Blob), would very often spur the idea that the adjective antonym is also a valid configuration (mutable Blob).

Yes, very much so. Less code (where not obviously making too much of a compromise on readability / maintainability) is so underrated.

In this proposal we are adding a couple of features, so all in all it is not less code, but fairly the same.

Yeah, I think this is true. It will be used in the docs for covering different angles of explaining (always good to use a couple of extra ways to depict or formulate things, to let people triangulate the meanings).

Chunk is not replaced or changed at all actually.

oetyng · January 16, 2020, 1:32pm

Changelog Data Types RFC

2020-01-16

Ordered the listing of old and new type names to match index of them. (thanks @david-beinn)
Added a listing with new and old name side by side.
Fixed flipped scope / concurrency (Public / Private, Guarded) in names.
Fixed flipped order of Public / Private scopes between listing and table. (thanks @david-beinn)
Removed non-existing old types from listing. (thanks @tfa)
Updated Sentried to Guarded. (thanks @JPL, @dask, @jlpell)
Removed idea of excluding Private from names, from Unresolved questions section. (thanks @jlpell)
Found antonym with different wordbase for Encrypted; RawContent, replacing NotEncrypted. (thanks @jlpell)

Note: If you strongly disagree with any of the above updates, please discuss it in the forum topic, for possible revert, or other change.

Blob

Due to the pervasive current and historical mixup between the file/object consisting of chunks and the chunks themselves, under the words ImmutableData, IData and Chunks, this part of the proposal now waits a bit in line in favor of the other parts.
There is a related proposal currently under internal review, that makes the Blob / DataMap / Chunk concept absolutely clear, and also the name change in this proposal logical, but we still have to await the internal discussions before bringing it up in full.

Mendrit · January 16, 2020, 10:38pm

Sentried → Guarded
Blob → Anything else from above like IData

happybeing · January 17, 2020, 12:59pm

Blob →

Fossil?
Fix?

JPL · January 17, 2020, 2:35pm

I like Slab but I see it’s already taken as a datatype in Rust.
How about Brick?
Edit: Or coin a new one: Immut

nevel · January 17, 2020, 5:00pm

Is it possible to have just one wrapper object in the API where you can set properties like public/private, mutable/immutable etc.? It would then translate these to the corresponding data type in SAFE.
Naming doesn’t really matter then for me as ‘user’.
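
One way such a wrapper could look, as a hedged Rust sketch. Everything here is hypothetical: the DataOptions struct, its fields and the type_name method are invented for illustration. The properties used are scope, guarded and structure rather than mutable/immutable, and the ordering scope + Guarded + structure is an assumption inferred from the PrivateOCCMap example earlier in the thread.

```rust
#[derive(Debug, Clone, Copy)]
enum Scope {
    Public,
    Private,
}

#[derive(Debug, Clone, Copy)]
enum Structure {
    Map,
    Sequence,
}

/// Hypothetical wrapper: the user sets properties, the API picks the type.
#[derive(Debug, Clone, Copy)]
struct DataOptions {
    scope: Scope,
    guarded: bool,
    structure: Structure,
}

impl DataOptions {
    /// Translates the chosen properties into a data type name
    /// (assumed ordering: scope, then Guarded, then structure).
    fn type_name(&self) -> String {
        let scope = match self.scope {
            Scope::Public => "Public",
            Scope::Private => "Private",
        };
        let guard = if self.guarded { "Guarded" } else { "" };
        let structure = match self.structure {
            Structure::Map => "Map",
            Structure::Sequence => "Sequence",
        };
        format!("{}{}{}", scope, guard, structure)
    }
}
```

A caller would only ever touch DataOptions, so the underlying type names could change without affecting them.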

oetyng · January 17, 2020, 5:14pm

Fully possible.

But, right now… I think there’s a bit much focus on alternatives to Blob.

Although if I’m to present my current thinking about it, it is this:

ImmutableData / IData / Chunk (in the meaning chunk), all should be named Chunk.
ImmutableData (in the meaning our content stored to the network), should be named Blob.

A Blob consists of Chunks.
You have content, and as you store it to the network, you can choose to store it as a Blob. That means it is chopped up in Chunks, that are spread out in the network.
When you want to access your content stored as a Blob, you look in the network for a Blob with the specific Id, and the Chunks of it are retrieved and assembled into the content.

Easy. There is no reason to invent new words here, because Blob is commonly used in similar applications (denoting a binary large object), and Chunks as well more or less.
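
The store and retrieve flow described above can be sketched in Rust with an in-memory stand-in for the network. All names here (ToyNetwork, store_blob, fetch_blob, id_of) are hypothetical, and the 64-bit DefaultHasher is only a placeholder for whatever identifiers the real network uses; nothing in this sketch reflects how chunks are actually addressed, encrypted or distributed.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Placeholder identifier derived from content (not a cryptographic hash).
fn id_of(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// In-memory stand-in for the network.
#[derive(Default)]
struct ToyNetwork {
    /// Chunk id -> chunk bytes.
    chunks: HashMap<u64, Vec<u8>>,
    /// Blob id -> ordered list of the ids of its Chunks.
    blobs: HashMap<u64, Vec<u64>>,
}

impl ToyNetwork {
    /// Chops the content into Chunks, stores them, and returns the Blob id.
    fn store_blob(&mut self, content: &[u8], chunk_size: usize) -> u64 {
        let mut chunk_ids = Vec::new();
        // max(1) avoids a panic on a zero chunk size.
        for chunk in content.chunks(chunk_size.max(1)) {
            let id = id_of(chunk);
            self.chunks.insert(id, chunk.to_vec());
            chunk_ids.push(id);
        }
        let blob_id = id_of(content);
        self.blobs.insert(blob_id, chunk_ids);
        blob_id
    }

    /// Looks up the Blob by id, retrieves its Chunks and assembles the content.
    fn fetch_blob(&self, blob_id: u64) -> Option<Vec<u8>> {
        let chunk_ids = self.blobs.get(&blob_id)?;
        let mut content = Vec::new();
        for id in chunk_ids {
            content.extend_from_slice(self.chunks.get(id)?);
        }
        Some(content)
    }
}
```

The caller only ever sees content going in and content coming out; the Chunks are a detail of how the Blob is stored.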

A Blob is simply a way to describe the data structure; that there is no specific structure.
A Map describes the structure as a relation between a set of unique keys and a set of values.
A Sequence describes the structure as a sequence of values.

These describe how we organize the content, as a big piece, or in some sort of mapping or sequence.
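
As a rough Rust sketch of these three ways of organizing content (a hypothetical Content enum, not a type from the code base):

```rust
use std::collections::BTreeMap;

/// Hypothetical illustration of the three structures.
enum Content {
    /// One big piece with no specific structure.
    Blob(Vec<u8>),
    /// A relation between unique keys and values.
    Map(BTreeMap<Vec<u8>, Vec<u8>>),
    /// A sequence of values, in order.
    Sequence(Vec<Vec<u8>>),
}

impl Content {
    /// Number of items the structure holds: always 1 for a Blob,
    /// the number of unique keys for a Map, the number of values for a Sequence.
    fn item_count(&self) -> usize {
        match self {
            Content::Blob(_) => 1,
            Content::Map(entries) => entries.len(),
            Content::Sequence(values) => values.len(),
        }
    }
}
```

Inserting the same key twice into the Map leaves a single entry, while pushing the same value twice onto the Sequence leaves two, which is the structural difference the definitions above describe.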
