Code with basic unit tests can be found here: https://github.com/oetyng/SAFE.DataAccess
It is a proof of concept of the algorithms and structures, using in-memory collections.
Next up, I'll add in the SafeApp connections, so as to run it with MockNetwork and Alpha-2.
So, last week I was revisiting the need for a design for efficient storage and access of data in SAFENetwork.
I have said many times that this is a topic I'd very much like to see more discussion around, and there are some older topics on it. @intrz and @neo have provided some nice input, and this time I read this nice post by @neo again before writing the code. @intrz posted a set of links to some very interesting papers here. (Honestly, I didn't base any of this work on those; this is at a much simpler level.)
I think it's almost half a year since I last worked on this. Back then I crafted a partitioning scheme, which connected to a local network as well as using MockNetwork. Work slowed down as issues indicated that there were still some garbage collection problems in the SafeApp library. But that project was quite investigative in nature, and it had quite a few flaws, in both the basic design idea and the code in general (it takes many iterations to get it clean and simple).
This time, I was looking to overcome the deficiencies of that solution, and now I have the basis for an indexed and searchable (document-ish) database, though still very rudimentary.
The main differences are:
- More or less limitless expansion.
- Storage only grows as data is added.
- Indexing on data properties, which enables search queries like `SELECT * FROM table WHERE p.Name = 'name'`.
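To illustrate what such a query can resolve to once properties are indexed, here is a minimal sketch (Python for brevity; the actual project is C#, and all names here are hypothetical stand-ins): the WHERE clause becomes a dictionary lookup rather than a scan over every stored object.

```python
class PropertyIndex:
    """Maps (property name, value) pairs to the keys of matching objects."""
    def __init__(self):
        self._entries = {}  # {property: {value: set(keys)}}

    def add(self, prop, value, key):
        self._entries.setdefault(prop, {}).setdefault(value, set()).add(key)

    def find(self, prop, value):
        # Roughly what SELECT * FROM table WHERE p.Name = 'name'
        # resolves to: a lookup instead of a scan over all objects.
        return self._entries.get(prop, {}).get(value, set())

index = PropertyIndex()
index.add("Name", "name", key="item-1")
index.add("Name", "other", key="item-2")
print(index.find("Name", "name"))  # {'item-1'}
```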
Apart from the obvious connection to the actual network, these are the next things I'll be doing:
- Use Sprache for creating a simple query parser.
- Continue to clean up and refine the structures (improve testability with inversion of control, interfaces, etc.).
- Test performance in a local network.
- Add documentation.
I’d be happy to hear opinions on how the indexing structures and so on can be designed better or differently (different approaches can give some nice thought-sparks).
The best way to understand the algos, IMO, is to just read the code, so please go ahead and read, and feel free to ask questions when it is unclear how something works. (Please remember we are not - and do not strive to be - in the same universe as ACID yet; that is a tough nut to crack.)
A short description:
- A `DataTree` is used to store data in a chain of `Md`s.
- A head `Md` is used as the top structure of the `DataTree`.
- When the head is full, a new `Md` is inserted at the top, pointing to the previous head.
- The data entry stored in the new head is in the form of a `Pointer` object.
- When data is added to the `DataTree`, it is stored in the form of a `Value` object.
- `Value` objects are stored in the leaves of the tree.
- `Pointer` objects are stored in all `Md`s between the head and the leaves.
- Info on what types exist in the `Database` is also stored in its own `DataTree`.
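The chain-of-`Md`s idea can be sketched roughly like this (a Python toy model, not the actual C# code; the capacity, the class shapes, and the leaf handling are all simplified assumptions made for illustration):

```python
MD_CAPACITY = 3  # tiny capacity so the example overflows quickly;
                 # a real Md on the network holds far more entries

class Md:
    """A mutable-data container with a fixed number of entry slots."""
    def __init__(self):
        self.entries = []

    @property
    def is_full(self):
        return len(self.entries) >= MD_CAPACITY

class Pointer:
    """Entry pointing at a lower Md (an address, in the real network)."""
    def __init__(self, md):
        self.md = md

class Value:
    """Entry holding actual data; stored in the leaves."""
    def __init__(self, data):
        self.data = data

class DataTree:
    def __init__(self):
        self.current_leaf = Md()
        self.head = Md()  # top structure of the tree
        self.head.entries.append(Pointer(self.current_leaf))

    def add(self, data):
        if self.current_leaf.is_full:
            # Start a new leaf; if the head is also full, insert a new
            # head Md at the top, pointing to the previous head.
            self.current_leaf = Md()
            if self.head.is_full:
                new_head = Md()
                new_head.entries.append(Pointer(self.head))
                self.head = new_head
            self.head.entries.append(Pointer(self.current_leaf))
        self.current_leaf.entries.append(Value(data))

    def values(self):
        """Depth-first walk collecting data from the leaves, in order."""
        def walk(md):
            for entry in md.entries:
                if isinstance(entry, Pointer):
                    yield from walk(entry.md)
                else:
                    yield entry.data
        return list(walk(self.head))

tree = DataTree()
for i in range(10):
    tree.add(i)
print(tree.values())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Note how the tree only ever grows as data is added: a full head is never rewritten, it just gets a new head inserted above it, which is what gives the more or less limitless expansion.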
- A `Database` holds any number of `DataTree`s, for the storage of data objects.
- Each type of data added to a `Database` gets its own `DataTree` instance.
- Adding data is done by passing in an object and the key with which to find it.
- Indexing on the key happens automatically.
- Indexes can be created for any public instance property on the object.
- The `Indexer` derives from `Database`, and stores the indices in its own database.
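A rough sketch of the `Database`/`Indexer` relationship described above (again Python standing in for the C# design; the per-type dicts, the key format, and all names are assumptions, not the actual implementation):

```python
from collections import defaultdict

class Database:
    """Holds one tree per data type (modeled here as plain dicts)."""
    def __init__(self):
        self._trees = {}

    def _tree(self, type_name):
        # Each type of data gets its own tree instance.
        return self._trees.setdefault(type_name, {})

    def add(self, key, obj):
        # Data is added by passing the object and the key to find it by.
        self._tree(type(obj).__name__)[key] = obj

    def get(self, type_name, key):
        return self._tree(type_name).get(key)

class Indexer(Database):
    """Derives from Database, storing its indices in its own storage."""
    def __init__(self):
        super().__init__()
        self._indexed_props = defaultdict(set)

    def create_index(self, type_name, prop):
        self._indexed_props[type_name].add(prop)

    def index(self, obj):
        # Store one index entry per indexed public property value.
        type_name = type(obj).__name__
        for prop in self._indexed_props[type_name]:
            value = getattr(obj, prop)
            self.add(f"{type_name}/{prop}/{value}", obj)

class Person:
    def __init__(self, name):
        self.Name = name

db = Database()
indexer = Indexer()
indexer.create_index("Person", "Name")

alice = Person("alice")
db.add("person-1", alice)   # indexed on the key automatically
indexer.index(alice)        # indexed on the Name property as well

print(db.get("Person", "person-1").Name)                    # alice
print(indexer.get("Person", "Person/Name/alice") is alice)  # True
```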
This is all very simple, the utmost MVP so to speak. An indexed and searchable database of decent quality requires a huge amount of work and is really complex. And that is without being distributed, or decentralized… So, there's plenty of work for all to go around.