Safe Search is a perennial topic, so rather than bump one of several existing but 4+ year old threads that relate to earlier expectations (e.g. of DNS), I decided to start this one, prompted by a Mastodon post describing what seems to me an approach that would work well on Safe Network.
It demotes the web crawling techniques of most search engines in favour of retro techniques, updated for the current form of federated and decentralised services.
So the focus is on human-curated resources, which I think is a good approach.
I think the most effective search will be one that leverages effortless, passive curation rather than relying on a few enthusiasts to painstakingly index and share their areas of interest, though ideally both can be part of a user-focussed search index shared among different apps.
Example:
Considering the nature of the bulk of my searching, I'm beginning to think I'd like a search engine that incorporated:
3rd party curated resources (i.e. like a library)
1st party curated resources (things I think are reputable)
Federated search from trusted comrades.
Last would be some sort of webcrawler. Maybe.
I know this goes against the grain of modern life, but the idea of instant answers seems to have poisoned our minds. We tolerate wrong answers as long as they're fast.
Adding to this:
a trusted 'elf'
Proposing a concept, working name MindSafe, where an elf, i.e. one of @dirvine's LLM 'zettabiters' (see link below), prudently and selectively publishes what you stored where on the Safe network. Call it 'pushing'. And your elf, or your elves, also 'pull' specific parts from what other people are 'pushing', having learned, and still learning, what you have shown yourself likely to be interested in?
All within the constraints of its sisters' mechanics, i.e. the Safe networking principles. Knotting the push and pull channels together can safely be left to our team and our community?
To index digital content, maybe use this digital archivists' concept:
Siegfried is a signature-based file format identification tool, implementing:
the National Archives UK's PRONOM file format signatures
the Library of Congress's FDD file format signatures (beta)
Wikidata (beta)
Admittedly (by these digital archivists' own account), the tool lacks a 'classification' option apart from one based on a strict 'signature'. Why not attempt the following rather peculiar form of 'classification' to remedy this?
It uses 'ent', which dates from 1985, was built by the 'AutoCAD' man (John Walker) and measures the randomness of noise signals. It is usable on less random material too, like audio, images and what have you …
Let's say I would like to preserve the 'MaidSafe-Team-Press-Photos-Pack' for posterity by storing it on the Safe network:
At the time of uploading the files, this randomness information is recorded separately by running (Windows cmd syntax):
for %f in (*.jpg) do ent -t %f
Which results in:
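One CSV row per file, in this shape (the figures here are illustrative, not the photo pack's actual numbers):

0,File-bytes,Entropy,Chi-square,Mean,Monte-Carlo-Pi,Serial-Correlation
1,1048576,7.999839,304.633040,127.513078,3.139656,0.000426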
Later on, anyone with a need for a portrait of David Irvine, specifically one under an extended roof, would try to select it by checking his elf's calculations of the distances between the available options:
I used a 15-year-old, 40-dollar toy to see what would happen. In a way it seems to work … o/o
Does MaidSafe have any plans to implement indexing in layer one post-beta? Or is this going to be left to layer-two devs to add atop the network?
I've asked about this before and I didn't get a definitive answer.
It seems to me we need nodes that can perform indexing as an optional service on the network. For instance, upon uploading I could ask for the data to be indexed for public search and pay an additional fee.
This seems like it would be difficult for a layer-two solution, though, as fees would have to be negotiated between differing service nodes (indexers versus storage), and we all know by now that the timing of pricing is tricky as it is.
So IMO it would be great if MaidSafe would take this on as a layer-one issue. It has always seemed like a pretty critical need if the network is to grow beyond personal data storage.
It's an interesting one, as indexing centralises what is decentralised data. For public data it might be different, but trying to figure out a path there is going to be difficult. Nodes could subscribe to an indexing node to have their data indexed; perhaps it could pay them, and then folk pay for queries, etc.
These days LLM embeddings are often used for semantic search using vector databases.
Alternatively, on SAFE such embeddings could be turned into XOR addresses. For example, for a 1536-dimensional OpenAI embedding (in practice one would use one of the open-source alternatives), you'd apply a dimensionality reduction technique like PCA to get it down to 256 dimensions, then quantize it, and you'd end up with a 32-byte (256-bit) XOR address containing the embedding, with some loss of detail. Then if you want to search for something, the client does the same process and asks the network for the ten closest XOR addresses to the one it calculated from the query; these could contain the XOR addresses of the actual documents plus some metadata.
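A rough Python sketch of that pipeline (the projection matrix stands in for a PCA fitted offline on a corpus of embeddings; none of this is an actual Safe API):

import numpy as np

def embedding_to_xor_address(embedding, pca_components):
    # pca_components: a (256, 1536) projection matrix fitted offline,
    # e.g. with sklearn's PCA (a real PCA would also subtract the mean).
    reduced = pca_components @ embedding    # 1536 dims -> 256 dims
    bits = (reduced > 0).astype(np.uint8)   # 1-bit quantization per dimension
    return np.packbits(bits).tobytes()      # 256 bits -> a 32-byte XOR address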
I think it's more like context-based search. It's fascinating though.
I am searching this space extensively right now and love any input like this. It seems like you are onto something here, but I wonder if there is a way to find 'context' addresses. So the next word is going to be X, and X is a set of probabilities of the word (token) in context?
That's one part; the next is private models for private individuals, where they have their own 'local' models (local to them, still on SAFE for hardware security). Then those local models can learn and adapt from the large model that is on SAFE, as you state above?
If this is done as a layer two, then perhaps the client just needs to 'find' index nodes somehow - maybe through a plugin of some sort (using this term loosely); then it uploads & indexes in separate sequential processes, i.e. first upload the data, then index the data.
Yeah, I feel we've had these conversations before.
If we have a third-party layer-two plugin approach, then each 'indexer' will be responsible for their own search results, and maybe that is the way to go in order to maintain the 'innocence of the network'.
So for example I could create index nodes for a specific set/array of topics I want to index and publish on the forum my intention to run a search service for these topics with links to a plugin that facilitates such.
For the LLM approach, maybe this is similar? Maybe the LLM is pretrained to filter and curate, and so only accepts content that its owner accepts. That could be really useful.
I have some other lengthy ideas written down somewhere on user curation of posts in a social media app for the network, but that needs a thread of its own.
Personally I feel that developing structures for search is the best way to go for the Safe Network.
Then Safe sites can run an app over their site that produces the standard structure, and search engines can use it to build their search databases.
In other words, the Safe network doesn't use crawlers, which won't really work, and leaves it up to users to do the work of creating the data in a structured format; any search engines out there can then use that. This also allows site operators (or uploaders) to have only the pages/files indexed that they specifically want indexed.
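To make that concrete, here is one hypothetical shape such a structured index record could take; the field names are illustrative, not any agreed standard:

site_index_entry = {
    "site": "safe://example-site",        # hypothetical site address
    "path": "/articles/intro",
    "title": "Intro to Safe Search",
    "keywords": ["search", "indexing", "safe-network"],
    "summary": "A short human-written abstract of the page.",
    "indexed": True,                      # the owner opted this page in
}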
Using embeddings is actually kinda like tagging a document. If you embed a document using OpenAI's embedding API you get a 1536-dimensional vector back, where each dimension is kinda like a tag. If one of these dimensions was 'brightness', 0 could mean completely dark and 1 completely bright. So a text about something happening during the middle of the day might have a 'brightness' of 0.8, for example, and a text about something happening at night might be 0.1. So while it's similar to a tag, it's not exactly the same, since the absence of a tag usually doesn't mean the opposite of that tag. If this 'brightness' value was quantized, it would instead only have the values 0 for dark and 1 for bright.
As it happens, these dimensions typically do not correspond to easily understandable concepts and may even correspond to superpositions of multiple concepts. A 1536-dimensional vector might not actually correspond to 1536 different concepts; it could be 5000 concepts, and maybe the values 0-0.3 in one dimension correspond to one concept while the values 0.31-1 correspond to another, such that this dimension could actually be split into two. What is learned through backpropagation is complicated, kinda like evolution, so it's hard to understand and untangle.
All in all, though, it is kinda like having the LLM add a bunch of tags to the documents, then having the same LLM add a bunch of tags to the query, and then finding which documents are closest to the query, i.e. which have the most tags in common.
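As a minimal sketch of that 'closest documents' step, assuming embeddings are plain numpy vectors:

import numpy as np

def top_matches(query_vec, doc_vecs, k=10):
    # doc_vecs: (n_docs, dim) matrix of document embeddings.
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    scores = doc_vecs @ query_vec / norms   # cosine similarity per document
    return np.argsort(scores)[::-1][:k]     # indices of the k closest documents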
OpenAI's embedding API I think uses a model of around 6-7 billion parameters, something which fits easily on a GPU and can process a query within a few seconds on a CPU. There are many similar open-source varieties. The network could do it, but it may be better if the client does it as part of the upload process when the uploader sets a 'do LLM indexing' option. It would first upload all the data and, when that's done, process the data through the LLM indexer. If there's data indexed with 10 different LLMs that one wants to search through, then whoever wants to query all of these would need to embed the query with all 10 of those LLMs. If there was a standard LLM for creating index embeddings, then users could be encouraged to reindex data whenever a better embedding LLM was created, though that might not be necessary unless the LLM was very significantly changed or completely replaced with another one.
It might be best to embed on a per-paragraph basis. Each index item for a paragraph would then contain the address of the document the paragraph came from.
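A sketch of per-paragraph indexing (embed here is a stand-in for whatever embedding model or API is used):

def index_paragraphs(document_text, doc_address, embed):
    # Split on blank lines, embed each paragraph, and keep a pointer
    # back to the source document's address.
    paragraphs = [p.strip() for p in document_text.split("\n\n") if p.strip()]
    return [
        {"embedding": embed(p), "doc_address": doc_address, "paragraph": i}
        for i, p in enumerate(paragraphs)
    ]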
One issue with having a model adapt or learn is that the embedding vector it produces for a particular document may then change. You want to use the same model to embed the query as was used for embedding the index, similar to how you want to use the same analyzer for indexing and querying an Elasticsearch index. You'd probably want something like storing the index as immutable data and, in the metadata for the index, keeping a link to the LLM used for indexing.
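So the index metadata might look something like this (field names are illustrative):

index_metadata = {
    "index_address": "safe://<immutable-index>",  # the index data itself
    "embedding_model": "safe://<model-blob>",     # the exact LLM that built it
    "model_version": "v1",
    "dimensions": 256,
}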
This could be one of multiple types of index. As neo says, each site could generate indexes for itself, and various curated collections of indexes could be set up, something like mentioned in the OP: one search engine could contain a collection of indexed books and indexed curated reputable sources, and users could set up their own index collections for what they want to search.
You mean if you can get an embedding for a single particular word in a particular context?
So for the word apple, you could extract different embeddings for the company Apple and the fruit apple, then somehow use that by itself?
The embeddings of 'I'm eating apples' and 'I bought an Apple computer' will already have these different meanings.
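You can check this with any sentence-embedding model; here using the sentence-transformers package and the all-MiniLM-L6-v2 model as one example:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b = model.encode(["I'm eating apples", "I bought an Apple computer"])
# Modest cosine similarity despite both sentences containing "apple(s)".
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))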
That's an interesting idea. I'm not sure the LLM necessarily needs to be pretrained specifically for each user; you could perhaps describe in words content that you don't want to see, then embed that and use it as a negative query: if any found content is too close to the description of the content you don't want, you won't see it.
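A minimal sketch of such a negative query, with an arbitrary threshold:

import numpy as np

def filter_results(result_vecs, negative_vec, threshold=0.8):
    # Drop any result whose embedding sits too close to the embedded
    # description of unwanted content.
    keep = []
    for v in result_vecs:
        sim = v @ negative_vec / (np.linalg.norm(v) * np.linalg.norm(negative_vec))
        if sim < threshold:
            keep.append(v)
    return keep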
If these words plus context were distributed then we have something to start with
Then looking further at the embeddings/vector-DB side: imagine we could take the parameters from different models and merge those. Merge architectures and weights, or use transfer-learning type approaches (fine-tune one against another, etc.).
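For two models with identical architecture, the naive version of weight merging is just averaging the parameters ('model soup' style); merging different architectures is a much harder, open problem:

def average_weights(state_a, state_b):
    # state_a, state_b: parameter dicts (e.g. PyTorch state_dicts) with
    # identical keys and shapes; returns the elementwise average.
    return {k: (state_a[k] + state_b[k]) / 2 for k in state_a}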
This could then be interesting: if we all had local AIs that lived with us, we could meet, understand each other's viewpoints and build on each other's knowledge. Then take this to a more global approach, and who knows.
There is much, much more, but the field is moving at light speed, so it's hard to keep up. Synthetic data then enters the scene, along with LLMs training each other, then liquid networks and neuromorphic chips (love them, but how to scale?). It's all fluid, but fascinatingly fluid right now.
The vector encoding is for sure a massive part, and it is fascinating and very dependent on data quality.
I have a feeling there is a reward-function type that is aligned with feasibility, regardless of probability, and perhaps there is even something there in terms of 'don't merge what's not feasible'.
Start with the list of 50,000 words, then collect many sentences for each of these to ensure a wide variety of contexts.
Take those sentences or contexts where, for example, 'apple' clearly refers to the fruit and 'Apple' to the company. Pass these through the model and extract the embeddings from one of the higher layers in the model.
Group the vectors into clusters of similar meaning, so you'd have a cluster of vectors related to the fruit apple and a cluster related to the company Apple; for each cluster, maybe take the average of the vectors to get an embedding for that particular contextualized concept.
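The clustering step might look like this, assuming the contextual vectors for one word have already been extracted into a matrix:

import numpy as np
from sklearn.cluster import KMeans

def sense_embeddings(context_vectors, n_senses=2):
    # context_vectors: (n_sentences, dim) embeddings of one word in context.
    labels = KMeans(n_clusters=n_senses, n_init=10).fit_predict(context_vectors)
    # Average each cluster into one embedding per sense ("fruit" vs "company").
    return np.stack([context_vectors[labels == s].mean(axis=0)
                     for s in range(n_senses)])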