DataStore over AppendableData design

OK, I need to clarify things that might not have been obvious in the OP.
So, the goal is to construct a datastore with indexed and searchable data.

Part 1

Goal: Indexed and searchable data.

Have a look at this chart to start with:

Comments

This falls under the very first [ NO ] in the diagram. This is not part of the problem I want to solve.

I am saying that this is not related to LSM trees, rather related to emulating mutability over immutable storage.

When we have copied the same data 1,000,000 times (and far earlier than that, IMO), then it’s quite fair to consider a few of those instances to be “garbage”. That is what I mean, at least.
You’ll see further down, exactly what this duplication of data comes from.

The problem

Let’s continue with describing the problem:

The problem I want to solve is to provide the kind of data storage (and access) that has been talked about throughout the development of SAFENetwork:

  • Autonomous cars data
  • SAFESearch
  • Token Exchange
  • Critical infrastructure (powerplants, smart electricity grid, etc.).
  • etc.

All of these go all the way down to the last [ YES ] in the chart:

  • Large amount of data - Check
  • Access data often or fast - Check
  • Continuous insert of new data - Check

So, these things seem to need indexes.

The problem we are solving is very similar to that of storing to disk. It is costly (in time or PUTs) to write, and we want to be able to access the data somehow afterwards. This is not a new problem and it has been studied in depth.

First of all, let’s go through a bit about the trees:

  1. Binary trees are not suitable for this since they have a very low fanout; they only have 2 entries per node. This gives a rapid increase in tree depth.
  2. BTrees can have n entries, which are stored at every level together with the keys. This gives slightly better read performance than B+Trees, but is more complex and requires more merging and splitting of nodes.
  3. B+Trees can have n entries, which are all stored at leaf level. This is less complex to implement, utilizes the capacity of a node better, gives a larger fanout than BTrees, and requires fewer rewrites of the tree as new data is inserted. This is the preferred solution for disks.
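The fanout argument in points 1 and 3 can be made concrete with a small calculation. This is just a sketch (the helper name is mine, and it assumes fully packed nodes): it counts how many levels a tree needs to index a given number of entries.

```python
def tree_depth(num_keys: int, fanout: int) -> int:
    """Number of levels needed to index num_keys entries,
    assuming every node holds `fanout` children (fully packed)."""
    depth, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        depth += 1
    return depth

# Binary tree (fanout 2) vs a B+Tree with 100 entries per node,
# both indexing one million entries:
print(tree_depth(1_000_000, 2))    # 20 levels
print(tree_depth(1_000_000, 100))  # 3 levels
```

Every level is a node fetch (a disk seek, or on SAFENetwork a network request), which is why the large fanout of B+Trees matters so much.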

OK, so we have found a standard approach to build an index, which is B+Trees.

So, we need to look at why the B+Tree requires mutations.

To search for data, maybe all temperatures within a given range from a component in some powerplant for example, we need range queries.
If we are only indexing on a data id which is a monotonically increasing number, then there will be no problem, since every insert by default will result in ordered data.
But temperature measurements can come in any value: 123, 456, 234, 876, 345. They don’t arrive in an ordered fashion. So every time we insert the data into a tree, if we are going to have an ordered tree, we will need to route it to its correct place in the tree. (This is also what makes point queries fast, i.e. getting the record where the data equals an exact value.) But as new values come in, what once was the correct place for a previous value will no longer be the correct place for maintaining the sequence. So as the new value should go there, we start rearranging to maintain the order.

^That there is mutation.
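That rearranging can be seen with Python’s bisect module, using the example values above. This is a toy illustration over a flat list, not an actual B+Tree; but on immutable storage, every shift of existing entries corresponds to rewriting a node.

```python
import bisect

index = []
for temperature in [123, 456, 234, 876, 345]:
    # Route each value to its correct place to keep the list ordered.
    # Inserting 234 moves 456; inserting 345 moves 456 and 876, etc.
    bisect.insort(index, temperature)

print(index)  # [123, 234, 345, 456, 876]
```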


Part 2

Emulating mutability

The simplest way to do this, is to use MDs, and just rewrite the value in an entry. Done. Problem solved.
But, we are trying to solve the problem within the confines of immutable storage, so AD then.

A way to emulate mutation is to combine the two ways we can use ADs, i.e. as a value with opaque history, or as a collection of values. When combining them, we can store a collection at a fixed address, and by appending a new pointer, we can emulate that the collection is changing. What is really happening is that we are producing a new collection (a new AD), and pointing to it.
Now, this doesn’t need to mean enormous data duplication, since the actual data need not be in that AD collection, only the pointers to it. The data is in ImDs, and that we don’t duplicate. The AD collection with pointers isn’t very big in comparison to the actual data. But still, over time, this risks building up a lot of duplicated, unused and meaningless data.
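A minimal sketch of that emulation, with hypothetical names (not the actual AD API): a fixed “address” whose current value is whatever was appended last, where each append supersedes, but never deletes, the previous collection.

```python
class AppendOnlySlot:
    """A fixed address over append-only storage: the 'current'
    collection is simply the last pointer appended."""
    def __init__(self):
        self._versions = []            # full history, never deleted

    def append(self, pointer: str):
        self._versions.append(pointer)

    def current(self) -> str:
        return self._versions[-1]      # latest pointer wins

    def history(self):
        return list(self._versions)

slot = AppendOnlySlot()
slot.append("ad_collection_v1")  # points to the first AD collection
slot.append("ad_collection_v2")  # "mutation": new collection supersedes v1
print(slot.current())            # ad_collection_v2
print(len(slot.history()))       # 2 -- v1 is now the superseded "garbage"
```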

I’m a fan of event sourcing, so I’m not the least unfamiliar with the notion of the potential value of keeping ALL history (that’s how event sourcing people generally feel). But history is one thing, duplication is another. One copy of the history is good enough. Trillions and quadrillions of copies are not needed.

Speed vs storage overhead

With or without indexing, we stand before this problem:

For those applications that don’t need indexing, this is the problem with regards to the actual data.
For those applications that will use indexing, this is the problem with regards to the indexes as well.

  • When we go to the right end (append every change), we are not duplicating any data at all, and have minimal serialization costs when writing (so, high write throughput), at the cost of ever degrading access times (slower and slower queries).

  • When we go to the left (say, when you update the email address on a Person model, you store the entire Person model again, with Name, Address, Phone numbers, Birthdate, etc., i.e. constant duplication of data), we get fast access, since you just fetch the Person with one request (especially if you store it at the hash of a unique value of the Person model, so you just calculate the hash and access the data). But this comes at the cost of tremendous duplication of data: wasted data. There is no need for more than one copy of the Phone or Address data, but we can’t avoid it. Also, writing is slower, since we need to load, deserialize, mutate, serialize and then insert large amounts of unrelated data with every little change.
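The left-end duplication can be illustrated with an immutable record type (the Person fields here are hypothetical examples): changing one field forces a complete new copy containing every unchanged field as well.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Person:                     # immutable, like data stored in an ImD
    name: str
    address: str
    phone: str
    email: str

v1 = Person("Ada", "1 Main St", "555-0100", "ada@old.example")
# Updating only the email re-stores the entire model:
v2 = replace(v1, email="ada@new.example")
print(v1.address == v2.address)   # True: unchanged data, duplicated anyway
```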

There is no other dimension to the above; that is the physical reality we are stuck in when we go the immutable path. That, at least, is what seems to me to be the case, and I ask others to show where it is wrong.
We could cheat of course, and not store the data in the network, but we are assuming here that 100% data storage in the data network is the requirement.

LSM Trees

This is a way to increase write throughput, and is not exactly the same problem as this post covers. It’s a subset of it. If we have large amounts of data to store, then we will need higher throughput.
But LSM Trees are also best served by B+Trees for indexing. The OP described an idea to implement this combination over AD, and during that work the more fundamental problem emerged: the three data storage requirements in the flow chart, i.e. mutability over immutability and its implications.
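For intuition, the LSM write path can be sketched as a toy (class and method names are mine; a real LSM tree also compacts runs and indexes them, e.g. with B+Trees): writes land in an in-memory buffer, which is flushed as an immutable sorted run when full. This is exactly what buys the higher write throughput.

```python
class TinyLSM:
    """Toy LSM sketch: writes go to an in-memory buffer (memtable);
    when full, it is flushed as an immutable sorted run."""
    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}
        self.runs = []                        # immutable sorted segments
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:  # flush: one sequential write
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # newest data first
            return self.memtable[key]
        for run in reversed(self.runs):       # then newest run first
            for k, v in run:
                if k == key:
                    return v
        return None

store = TinyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)      # memtable full: flushed as a sorted run
store.put("a", 3)      # newer value shadows the flushed one
print(store.get("a"))  # 3
print(store.get("b"))  # 2
```

Note the read path: queries get slower as runs pile up, which is why reads over the runs are usually backed by an index, and why compaction exists.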


Summary

So, if we want to be able to do these things:

  • Autonomous cars data
  • SAFESearch
  • Token Exchange
  • Critical infrastructure (powerplants, smart electricity grid, etc.).

.. it seems we have to solve this problem, i.e. efficiently emulate mutability. Or not go immutable. (Or whatever?)

We could also say that No, those things are not what SAFENetwork is supposed to do, i.e. we only want to do things like:

  • Backing up the Internet Archive to ensure it lives on.
  • Individual websites that we modify now and then.
  • etc.

But that doesn’t seem to be what most people around here believe or wish SAFENetwork will do. It sure isn’t the only thing I wish for it to do.

Decisions, decisions..

I think that if the decision is made to go over to AD, then at the very least we must assess what that means for what SAFENetwork can do, and investigate if and how we can solve the problem that arises with regards to what we have imagined implementing on the network.

Solutions

Now, if anyone out there, has a solution to the problem, I would be very interested in hearing it, so that I can go ahead and implement it.
Because that’s what I want to do! :slight_smile:
