Using SAFE as a database by querying XOR names

To use SAFE as a database, you need a way to query the data. A way this can be done today is to set the XOR name of a mutable data to something that can be used for querying and storing an index as the value of that MD.

As an example, you may have a list of persons that you want to lookup, so you make an app that indexes your person collection an add some namespacing the keys. The keys of two of the indexes might be hash(“/mydb/persons/firstname/john”) and hash(“/mydb/persons/lastname/doe”). The values of these two indexes would be a list of the keys of all persons in your database with the first name John and all persons with the last name Doe. Maybe you have the persons “John Doe”, “John Gregor” and “Carl Doe” in your database. To run the query SELECT * FROM persons WHERE firstname = 'john' AND lastname = 'doe' the app would download two lists of persons, the list of persons with the first name John and the list of persons with the last name Doe, then compare these two lists and see which person(s) can be found in both lists.

This basic pattern can be used to make indexes for pretty much any query. Spatial queries, range queries, etc can all be made by making indexes using various data structures. In many cases this is likely to be a very good solution, but sometimes the lists or other data structure your app needs for the query could get huge and the data you would suddenly need to download and process lots of data and maybe partition the indexes so you’d have to download many MDs, that might not give a good experience, especially on mobiles.

Range queries, e.g. find all persons between 25 and 50 years old and spatial queries, e.g. find all restaurants in this area are typical examples of queries that might be run more efficiently by utilizing the SAFE network itself.

The way to do this would be to implement two new functions. There should probably be a range of type tags reserved for these two functions and they should only return type tags from within that range. The type tag should refer to one specific way or algorithm used for creating the XorName and also what data to expect from the value.

The two functions would be something like this:

GetClosestByName(XorName, n, typeTag)
Get the n closest MDs to an XOR name

GetByRange(fromXorName, toXorName, typeTag)
Get all MDs within a range of keys

Nodes should know the XorName of what they store already and they should know other nodes with ids close to themselves, so it would seem to me that doing these queries shouldn’t be all that demanding on the network.

These two functions are useful because there are many algorithms and ways of transforming some value from a document, that could be useful for querying, into the key (XorName). An album with a release date could for example have the date stored in the key and a query could then be used to find all albums within a certain date range. Semantic hashing as discussed in the threads semantic search and indexed structured data is another example.

The indexing, calculating hashes etc would be done on the client side by the apps. The are several reasons not to run it on the vaults. One is that there are many different algorithms that could be useful, some of which would be application specific, another is that the computations might be a bit heavy and it’s better not to strain the vaults with this, last is that it’s not really needed as the one running the app would generally be the one who wants the data to be queryable anyway.

I found some papers on related topics that might be useful.

http://www.cslab.ece.ntua.gr/~dtsouma/index_files/comcom_chaz.pdf

7 Likes

It would seem that GetByRange actually isn’t needed as the functionality could be implemented by just using GetClosestByName and instead putting stuff under the keys inside the MD, so an app could basically implement it on top of the GetClosestByName function.

If you for example want to find all employees between 25 and 60 years old and you have the two keys in employees-20 and employees-50 Employees-20 would be an MD with all employes between 20 and 50 years old, plus the key of the next MD in the range employees-50, which would have all employees older than 50.

The query would just then be find the first range closest to 25, which would be 20, then in that MD find the first employee that’s 25 years old and then follow the link to the next MD in the range and find all employees until finishing with the ones that are 60 years old.

In reality it would be a bit more tricky than this though, since it’s an XOR distance. The XOR distance between 25 and 20 is 13 and the XOR distance between 20 and 50 is 38, so it should work in the example above. However, the XOR distance between 27 and 28 is 7, while the XOR distance between 27 and 25 is 2. In this case if your query included 27, you would want to get 28, but would instead get 25, so it might not work for all applications and the keys and ways of representing queries would need to be thought through thoroughly.

2 Likes