[RFC] Data Hierarchy Refinement

jlpell · January 24, 2020, 12:53pm

Nice work. There is a lot to dissect here after reading through it a few times. The only way for me to give feedback in a coherent manner is to pick through it line by line as a running commentary. Although crude, it is the only way I can manage a response in the wee hours of the morning. Here goes:

Good. I would go one step further and require that the “chunk” data structure be the base unit used in the construction of ALL other datatypes in the hierarchy, including metadata, following an OOP construction-by-assembly approach.

Based on this RFC and past discussions in the other thread on Data Types Refinement you have convinced me that the term Blob is an absolutely horrible descriptor for what you are trying to accomplish. More on this below.

It is unclear here if you really mean data “type” vs. an instantiated data “object”.

Just chunk it. Chunk early and chunk often.

Interesting insight. I thought that if the local group of 8 did not have enough storage, then the nearest neighbor search radius would be expanded to include more than 8 nodes?

I agree that there is an opportunity for improved deduplication here.

I like where this is going. Viewing SAFE as a big hard drive in the sky with analogous operations/functions to a common tried and true filesystem like ext4 or xfs will help speed development IMO since you already have a stable, working and well documented model of what you are trying to accomplish at a grander scale.

If you are not careful with the definitions, this could lead to circular dependencies, since the metadata needs to be stored somewhere too. Consider as an example the EXT4 filesystem, where we have data blocks and metadata blocks. Regardless of block type (meta vs. actual data), they are all stored in fixed-size blocks on disk (typically 4 KiB to match hardware sector size). I view the EXT4 block on disk as analogous to a SAFE chunk. This indicates that your metadata should ultimately be stored as a “chunk” too if you want to keep the logical consistency and benefits of assembling a well-defined object hierarchy.
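To make the EXT4 analogy concrete, here is a minimal sketch of what I mean. Everything in it is my own assumption for illustration (the `CHUNK_SIZE` value, the `ChunkKind` tag, the function names); none of it comes from the RFC. The point is only that data and metadata go through the same chunking path and end up as the same kind of object.

```rust
/// Hypothetical chunk size, picked only to mirror a 4 KiB EXT4 block.
pub const CHUNK_SIZE: usize = 4096;

/// Tag saying what a chunk carries. Both kinds are stored identically.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChunkKind {
    Data,
    Meta,
}

/// The base storage unit. The last chunk of a payload may be shorter
/// than CHUNK_SIZE.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    pub kind: ChunkKind,
    pub bytes: Vec<u8>,
}

/// Split any payload (data or metadata) into chunks of at most
/// CHUNK_SIZE bytes. An empty payload yields no chunks.
pub fn chunk_payload(kind: ChunkKind, payload: &[u8]) -> Vec<Chunk> {
    payload
        .chunks(CHUNK_SIZE)
        .map(|part| Chunk { kind, bytes: part.to_vec() })
        .collect()
}

/// Rebuild the payload by concatenating the chunk bodies in order.
pub fn reassemble(chunks: &[Chunk]) -> Vec<u8> {
    chunks.iter().flat_map(|c| c.bytes.iter().copied()).collect()
}
```

Because metadata is chunked by the same function, there is no separate storage format that would itself need metadata, which is how the circular dependency is avoided in this sketch.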

Specific comments about terminology:

Nice. I like the differentiation here. This could also be generalized to N layers extending from core to boundary. A few synonyms that evoke different imagery for the case of N =3:

“Gateway Nodes”, “System Nodes”, “Kernel/Core Nodes” for a computer reference.
“Exterior Nodes”, “Boundary Nodes”, “Interior Nodes” for a spatial reference.
“Frontier Nodes”, “Border Nodes”, “Control Nodes” for a geographical/political reference.
“Peripheral Nodes”, “Passing Nodes”, “Principal Nodes” for a roles reference.

The term Shell is not used appropriately here or later in the document. In computing, the term ‘shell’ is synonymous with a user interface that allows access to an operating system, its programs and services. It is confusing to equate shell terminology with pure data and datatype constructs unless you are specifically building to a shell program like the SAFE CLI. I do like the simple and self-explanatory definition offered by Client Nodes.

Programming wise, if chunks form the base object in an OOP hierarchy from which other types are assembled, then your metadata should also be stored as chunks. This means that all nodes would store and retrieve chunks, but nodes dedicated to dealing with metadata would store and retrieve metadata chunks, and nodes dedicated to data would store and retrieve data chunks. For this reason I would recommend using the terms Data and Meta Data. To maintain continuity with previously employed terminology I suggest using Data Vaults and Meta Vaults here.
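A rough sketch of the Data Vault / Meta Vault split, under my own assumptions: the `Vault` and `Router` types, the `u64` chunk id, and the method names are all hypothetical and only illustrate the idea that every vault speaks "chunk" and the kind tag decides where a chunk goes.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ChunkKind {
    Data,
    Meta,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    /// Hypothetical identifier; a real network would more likely use
    /// a content hash.
    pub id: u64,
    pub kind: ChunkKind,
    pub bytes: Vec<u8>,
}

/// A vault stores and retrieves chunks and nothing else.
#[derive(Debug, Default)]
pub struct Vault {
    store: HashMap<u64, Chunk>,
}

impl Vault {
    pub fn put(&mut self, chunk: Chunk) {
        self.store.insert(chunk.id, chunk);
    }

    pub fn get(&self, id: u64) -> Option<&Chunk> {
        self.store.get(&id)
    }

    pub fn len(&self) -> usize {
        self.store.len()
    }
}

/// Sends each chunk to the vault dedicated to its kind.
#[derive(Debug, Default)]
pub struct Router {
    pub data_vault: Vault,
    pub meta_vault: Vault,
}

impl Router {
    pub fn store(&mut self, chunk: Chunk) {
        let kind = chunk.kind;
        match kind {
            ChunkKind::Data => self.data_vault.put(chunk),
            ChunkKind::Meta => self.meta_vault.put(chunk),
        }
    }
}
```

Both vaults are the same type here; only the routing differs. That is the continuity with the existing "vault" terminology that I am after.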

You seem to really like this term, but it is not a good designation for what you are trying to achieve here. This is made more evident by the picture you drew below. Really, what you are designating by your Blob and Sequence is an Unordered Set vs. an Ordered Set. The mathematical definition of an Ordered Set is essentially a Sequence (when duplicate entries are allowed), so you’ve got that one. Blob, on the other hand, evokes no intuition of a Set. So why not just keep it simple and call it a Set? You could also extend the Set terminology to a Collection where duplicate entries are allowed.
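Here is how I picture the three names, as a sketch only. The type and method names are mine, not the RFC's, and `ChunkId` is a hypothetical stand-in for whatever the entries really are.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical entry type for illustration.
pub type ChunkId = u64;

/// Set: no duplicates, and insertion order carries no meaning.
/// (BTreeSet keeps keys sorted internally, but callers should not
/// read any meaning into that order.)
#[derive(Debug, Default)]
pub struct Set {
    items: BTreeSet<ChunkId>,
}

impl Set {
    /// Returns false if the entry was already present.
    pub fn insert(&mut self, id: ChunkId) -> bool {
        self.items.insert(id)
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}

/// Sequence: insertion order is preserved and duplicates are allowed.
#[derive(Debug, Default)]
pub struct Sequence {
    items: Vec<ChunkId>,
}

impl Sequence {
    pub fn append(&mut self, id: ChunkId) {
        self.items.push(id);
    }

    pub fn get(&self, index: usize) -> Option<ChunkId> {
        self.items.get(index).copied()
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}

/// Collection: like Set, but duplicates are allowed (a multiset).
#[derive(Debug, Default)]
pub struct Collection {
    counts: BTreeMap<ChunkId, usize>,
}

impl Collection {
    pub fn insert(&mut self, id: ChunkId) {
        *self.counts.entry(id).or_insert(0) += 1;
    }

    /// How many times the entry has been inserted.
    pub fn count(&self, id: ChunkId) -> usize {
        self.counts.get(&id).copied().unwrap_or(0)
    }
}
```

Each name tells you the two properties that matter (ordered or not, duplicates or not) without needing a diagram.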

Everywhere else in the computing world this is called a File. The use of Files in the network is OK. KISS. “Everything is a File.” And like I said earlier, shell is usually reserved for user interaction with programs/services. Later on when SAFE has a computation layer, I could see “ShellNodes” as being the perfect description of an interface to this layer. These future ShellNodes would handle the running of SAFE programs and processes as part of a general SafeOS.

ClientNodes, FileNodes, DataVaults, MetaVaults.

I would be happier if you replaced the term “Shell” with “File” in this large section.

It seems inconsistent to have specialized ChunkSet and chunk_map types. Wouldn’t it be preferable to build these higher-level types from lower-level ones? That way a chunk_map is simply a Map<Chunk>.
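A minimal sketch of what building from lower-level types could look like. The aliases, the `u64` key, and `build_chunk_map` are my assumptions; I am only showing that the specialized names can fall out of generic containers instead of being separate types.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical key type for illustration.
pub type ChunkId = u64;

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Chunk {
    pub id: ChunkId,
    pub bytes: Vec<u8>,
}

/// Generic containers, defined once for the whole hierarchy.
pub type Set<T> = BTreeSet<T>;
pub type Map<T> = BTreeMap<ChunkId, T>;

/// The specialized names are now plain instantiations.
pub type ChunkSet = Set<Chunk>;
pub type ChunkMap = Map<Chunk>;

/// Index a batch of chunks by id.
pub fn build_chunk_map(chunks: impl IntoIterator<Item = Chunk>) -> ChunkMap {
    chunks.into_iter().map(|c| (c.id, c)).collect()
}
```

Any other type in the hierarchy can reuse `Set<T>` and `Map<T>` the same way, so there is one container implementation to test instead of one per datatype.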

This is a problem. Nothing in this world should ever have the indignity of being blobified at this high level of abstraction. I suspect that it would only get too large due to the owner history and permission history. Better to change these constructs from a Vec to a Sequence so that all of the #blobification can happen under the hood.

Perfectly logical and follows standard filesystem practice. However, is there a chance for a security exploit here where the refcount could be decremented maliciously?
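One possible mitigation, sketched under loud assumptions: instead of a bare counter, track which owners hold a reference, and only let a current holder release its own reference. `RefTable`, `OwnerId`, and `RefError` are hypothetical names, and I am assuming the owner id has already been authenticated (for example by a verified signature) before these methods are called.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical identifiers for illustration.
pub type ChunkId = u64;
pub type OwnerId = u64;

#[derive(Debug, PartialEq, Eq)]
pub enum RefError {
    UnknownChunk,
    NotAHolder,
}

/// Tracks the set of holders per chunk; the refcount is the set size.
#[derive(Debug, Default)]
pub struct RefTable {
    holders: HashMap<ChunkId, HashSet<OwnerId>>,
}

impl RefTable {
    /// Adding the same owner twice does not inflate the count.
    pub fn add_ref(&mut self, chunk: ChunkId, owner: OwnerId) {
        self.holders.entry(chunk).or_default().insert(owner);
    }

    pub fn refcount(&self, chunk: ChunkId) -> usize {
        self.holders.get(&chunk).map_or(0, |h| h.len())
    }

    /// Drop `owner`'s reference. Returns Ok(true) when that was the
    /// last reference, meaning the chunk may now be deleted.
    pub fn release(&mut self, chunk: ChunkId, owner: OwnerId) -> Result<bool, RefError> {
        let set = self.holders.get_mut(&chunk).ok_or(RefError::UnknownChunk)?;
        if !set.remove(&owner) {
            // Caller never held a reference, so it cannot decrement.
            return Err(RefError::NotAHolder);
        }
        let now_empty = set.is_empty();
        if now_empty {
            self.holders.remove(&chunk);
        }
        Ok(now_empty)
    }
}
```

With this shape a stranger cannot push the count toward zero, and a legitimate holder cannot decrement more than once. It costs more storage than a plain integer, so whether it is worth it is a design question for the RFC.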

You forgot one of the best features that is possible with this approach. The chunks can be encrypted again by a subtype of Gateway nodes prior to being sent to a Data Vault (your ChunkNode) and decrypted when retrieved from the vault. These keys would only be known by the Gateway layer, not the Client nor the System Layers.
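To show the layering I have in mind, here is a toy sketch. The `Gateway` type and its `seal`/`open` methods are hypothetical, and the XOR routine is a placeholder that is NOT secure; it only stands in for a real cipher so the example runs with the standard library alone.

```rust
/// Placeholder "cipher": XOR with a repeating key. NOT secure.
/// It stands in for a real authenticated cipher in this sketch.
fn xor_with_key(bytes: &[u8], key: &[u8]) -> Vec<u8> {
    assert!(!key.is_empty(), "key must not be empty");
    bytes
        .iter()
        .zip(key.iter().cycle())
        .map(|(b, k)| b ^ k)
        .collect()
}

/// A gateway-layer node holding a key that neither the client nor
/// the vault ever sees.
pub struct Gateway {
    key: Vec<u8>,
}

impl Gateway {
    pub fn new(key: Vec<u8>) -> Self {
        Gateway { key }
    }

    /// Add the gateway layer before a chunk is sent to a Data Vault.
    pub fn seal(&self, client_ciphertext: &[u8]) -> Vec<u8> {
        xor_with_key(client_ciphertext, &self.key)
    }

    /// Remove the gateway layer when a chunk comes back from the vault.
    pub fn open(&self, stored: &[u8]) -> Vec<u8> {
        xor_with_key(stored, &self.key)
    }
}
```

The client still sees exactly the bytes it uploaded, and the vault only ever holds bytes it cannot relate to the client's ciphertext without the gateway key.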

All nice to see. I also like your data flow diagrams. They really help make things easy to understand. A lot of possibilities here.
