I don’t usually suggest things that require extensions to the core, but this is one of them.
The idea is about the semantic tagging of documents, and it requires:
- a new data type (yes, I knowww)
- that can be searched in XOR space in a fuzzy way
The idea is based on this paper: Semantic Hashing (Salakhutdinov, Hinton).
What the Heck This Is About
“Dimensionality reduction” is the method where we describe an unlimited number of documents using a limited number of categories, giving each document a certain weight on each category.
For example, let’s take the categories age (“oldness”) and sex (“maleness” or “femaleness”), and let’s take the people of the world. If we decide that 120 years is 1.00, then a 60yo geezer would get 0.5 on that category. If he’s a dude, he’d get a 1.00 on “maleness” or a 0.00 on “femaleness”, depending on which way we decided to express sex/gender. Heck, you could do both, and one could get a 0.9 male, 0.6 female rating at the same time.
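To make the arithmetic concrete, here is a minimal Python sketch of that example. The names `MAX_AGE` and `person_weights` are made up for illustration, and the single “maleness” scale is just one of the two ways of expressing sex mentioned above.

```python
MAX_AGE = 120.0  # assumption taken from the example: 120 years maps to 1.00


def person_weights(age_years, is_male):
    """Map a person to weights in [0, 1] on two handcrafted categories."""
    oldness = min(age_years / MAX_AGE, 1.0)  # cap at 1.00 for anyone older
    maleness = 1.0 if is_male else 0.0       # single-scale variant
    return {"oldness": oldness, "maleness": maleness}


print(person_weights(60, True))
```

The 60yo dude comes out as 0.5 on “oldness” and 1.0 on “maleness”, as in the text.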
The categories can be handcrafted, but they can also be automatically generated (e.g. by neural networks or support vector machines), in which case the model is not readily understood by humans, but it is better suited to similarity search.
“Semantic hashing” is the binary variation of the same idea: age becomes “young” and “old”, and then “young” can be further subdivided into “child” and “teen”, “child” into “baby” and “no longer a baby”, etc.
The resulting semantic hash is exactly like what you think: a looooong binary number. However, unlike in the case of cryptographic hashes, if the XOR distance between two numbers is small (that is, they differ in only a few bits), we can be fairly sure that the two things are also quite similar (unfortunately, it’s not always true in the opposite direction, but it’s good enough). If you bring an example of the kind of thing you’re searching for, we’ll just have to look for everything with a semantic hash within a small enough XOR radius to find similar things.
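As a rough sketch of what “within a small enough XOR radius” means, the Python below XORs two hashes and counts the differing bits. The function names and the 8-bit toy hashes are made up for illustration, and the brute-force scan only shows the matching rule, not how a network would actually index these.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def within_radius(query: int, hashes, radius: int):
    """Return every hash at most `radius` differing bits away from `query`."""
    return [h for h in hashes if hamming_distance(query, h) <= radius]


# Toy 8-bit hashes; a real semantic hash would be much longer.
hashes = [0b10011011, 0b10011111, 0b01100100]
query = 0b10011010

for h in within_radius(query, hashes, 2):
    print(format(h, "08b"))
```

With a radius of 2, the first two hashes match (1 and 2 differing bits) and the third does not (7 differing bits).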
And this is where we need help from the SAFE core.
The SAFE Way
In a nutshell:
- we need data to represent:
  - creator – the one who does the hashing, e.g. “Google” (well, anybody who wants to be the safe google)
  - version – so that the same creator can improve / change their hashing method
  - hash – a 512-bit (OVERKILL FTW) number, e.g. “100110111110011…”
  - document – the “real hash,” the address on the network of the document being described
  - signature – authentication, obv
- this is a tiny piece of data (well below 1K) and there would be a LOT of it, so it needs to be very cheap (cheaper than SD), and there should be a PtP scheme to offset even that cost: indexing services would provide a valuable (if not crucial) service
- there needs to be a way to search for (creator, version, hash, distance), where creator and version are exactly matched, and hash is matched within the given XOR distance (“Hamming ball”)
  - SAFE already searches for hashes in XOR space all the time, so that’s not something new to develop
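To pin down what I’m asking for, here is a minimal Python sketch of the record and the search. Everything in it is hypothetical: `SemanticTag`, `search`, and the field types are illustrative names, not anything that exists in the SAFE core, the signature is carried around but not verified, and the linear scan only demonstrates the matching rule.

```python
from dataclasses import dataclass

HASH_BITS = 512  # size suggested above


@dataclass(frozen=True)
class SemanticTag:
    creator: str        # whoever did the hashing
    version: int        # version of the creator's hashing method
    semantic_hash: int  # the semantic hash, held as an integer
    document: bytes     # network address of the document being described
    signature: bytes    # creator's signature (not checked in this sketch)

    def __post_init__(self):
        if not 0 <= self.semantic_hash < 2 ** HASH_BITS:
            raise ValueError("semantic hash does not fit in HASH_BITS bits")


def search(tags, creator, version, semantic_hash, distance):
    """Exact match on creator and version, fuzzy match on the hash."""
    return [
        t for t in tags
        if t.creator == creator
        and t.version == version
        and bin(t.semantic_hash ^ semantic_hash).count("1") <= distance
    ]


# Toy data with short hashes and placeholder addresses.
tags = [
    SemanticTag("alice", 1, 0b1011, b"doc-a", b""),
    SemanticTag("alice", 1, 0b0100, b"doc-b", b""),
    SemanticTag("alice", 2, 0b1011, b"doc-c", b""),
    SemanticTag("bob", 1, 0b1011, b"doc-d", b""),
]
matches = search(tags, "alice", 1, 0b1010, 1)
print([t.document for t in matches])
```

Only `doc-a` comes back: `doc-b` is 3 bits away, and `doc-c` and `doc-d` fail the exact match on version and creator.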