Safe Search 2023

Safe Search is a perennial topic, so rather than bump one of several existing but 4+ year old topics that relate to earlier expectations (e.g. of DNS), I decided to start this one, prompted by a Mastodon post which describes an approach that seems to me to work well on Safe Network.

It demotes the web-crawling techniques of most search engines in favour of retro techniques, updated for the current form of federated and decentralised services.

So a focus on human-curated resources, which I think is a good approach.

I think the most effective search will be one that leverages effortless, passive curation, rather than relying on a few enthusiasts to painstakingly index and share their areas of interest, though ideally both can be part of a user-focussed shared search index, shared among different apps.

Example:

Considering the nature of the bulk of my searching, I’m beginning to think I’d like a search engine that incorporated:

  • 3rd party curated resources (i.e. like a library)
  • 1st party curated resources (things I think are reputable)
  • Federated search from trusted comrades.
  • Last would be some sort of webcrawler. Maybe.

I know this goes against the grain of modern life, but the idea of instant answers seems to have poisoned our minds. We tolerate wrong answers as long as they’re fast.

13 Likes

Adding to this list:

  • a trusted ‘elf’

Proposing a concept, working name MindSafe, where an elf, i.e. one of @dirvine’s LLM ‘zettabiters’ (see link below), prudently and selectively publishes what you stored where on the Safe network. Call it ‘pushing’. And your elf, or your elves, would also ‘pull’ specific parts of what other people are ‘pushing’, having learned, and still learning, what you have shown yourself likely to be interested in?

All within the constraints of its sisters’ mechanics, i.e. the Safe networking principles. Knotting the push and pull channels together can safely be left to our team and our community?

To index digital content, maybe use this digital archivists’ concept:

Siegfried is a signature-based file format identification tool, implementing:

  • the National Archives UK’s PRONOM file format signatures
  • freedesktop.org’s MIME-info file format signatures
  • the Library of Congress’s FDD file format signatures (beta)
  • Wikidata (beta)

Admittedly (by these digital archivists’ own account), this tool lacks a ‘classification’ option apart from one based on a strict ‘signature’. Why not attempt using the following rather peculiar form of ‘classification’ to remedy this?

This could be done by means of “ent”, which dates from 1985, was built by the “AutoCAD” man, and measures the randomness of noise signals. It is usable on less random material, like audio, images and what have you …
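
As a rough illustration of the idea (this is not ent itself), a minimal Python sketch computing the Shannon entropy of a file’s bytes, one of the statistics ent reports:

    import math
    from collections import Counter

    def byte_entropy(path):
        # Shannon entropy in bits per byte (0.0 to 8.0). High values
        # suggest compressed/random data; lower values, structured content.
        data = open(path, "rb").read()
        counts = Counter(data)
        total = len(data)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    # e.g. byte_entropy("portrait.jpg")  <- hypothetical filename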

Let’s say I would like to preserve the “MaidSafe-Team-Press-Photos-Pack” for posterity by storing it on the Safe network:

[image: MaidSafe press photo of David Irvine]

At the time of uploading the files, this information is captured separately by running:

  rem Windows cmd: run ent in terse (CSV) mode on each JPEG
  for %f in (*.jpg) do ent -t %f

Which results in:

[image: ent output for the photos]

Later on, anyone needing a portrait of David Irvine, specifically one under an extended roof, would try to select it by checking their elf’s calculations of the distances between the available options:

[image: distances calculated between the available options]
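
A sketch of what the elf’s distance check could look like, with each photo reduced to a small vector of ent-style statistics (the numbers and filenames here are hypothetical):

    import math

    # Hypothetical ent-style statistics per photo:
    # (entropy, arithmetic mean byte, chi-square / 1000)
    stats = {
        "photo-1.jpg": (7.95, 127.3, 0.92),
        "photo-2.jpg": (7.91, 131.0, 1.10),
        "photo-3.jpg": (7.97, 125.8, 0.88),
    }

    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    reference = stats["photo-1.jpg"]
    ranked = sorted(stats, key=lambda name: distance(stats[name], reference))
    print(ranked)  # closest match first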

I used a 15-year-old, 40-dollar toy to see what would happen. In a way it seems to work …
o/o

Links:

Data, information and knowledge: does AI change the game?

Siegfried

Fourmilab / ent_random_sequence_tester

Maidsafe press kit

7 Likes

Does Maidsafe have any plans to implement indexing in layer one post-beta? Or is this going to be left to layer-two devs to add atop the network?

I’ve asked about this before and I didn’t get a definitive answer.

It seems to me we need nodes that can perform this as an optional service on the network. For instance, upon uploading, I ask for the data to be indexed for public search and pay an additional fee.

This seems like it would be difficult for a layer-two solution though, as fees would have to be negotiated between differing service nodes (indexers versus storage), and we all know by now that the timing of the price is tricky as it is.

So IMO it would be great if Maidsafe would take this on as a layer-one issue. It has always seemed like a pretty critical need for the network to grow beyond personal data storage.

7 Likes

It’s an interesting one, as indexing centralises what is decentralised data. For public data it might be different, but trying to figure out a path there is going to be difficult. Nodes could subscribe to an indexing node to have data indexed; perhaps it could pay them, and then folk pay for queries, etc.

9 Likes

These days LLM embeddings are often used for semantic search using vector databases.

Alternatively, on SAFE, such embeddings could be turned into XOR addresses. For example, for a 1536-dimensional OpenAI embedding (in reality one would use one of the open-source alternatives), you’d use some dimensionality-reduction technique like PCA to get it down to 256 dimensions, then quantize it, and you end up with a 32-byte (256-bit) XOR address containing the embedding, with some loss of detail. Then if you want to search for something, the client does the same process and asks the network for the ten closest XOR addresses to the one the client calculated from the query; these could contain the XOR addresses of the actual documents plus some metadata.
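
A minimal numpy sketch of that pipeline, assuming PCA is fitted on a corpus of embeddings and each reduced dimension is quantized to one bit by its sign (the corpus and embeddings below are random stand-ins for real ones):

    import numpy as np

    def fit_pca(corpus, k=256):
        # Fit PCA on a corpus of embeddings; return the top-k components.
        centered = corpus - corpus.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return vt[:k]  # (k, 1536) projection matrix

    def to_xor_address(embedding, components):
        # Project to 256 dims, quantize each dim to one bit by sign,
        # and pack the bits into a 32-byte (256-bit) XOR address.
        reduced = components @ embedding
        bits = (reduced > 0).astype(np.uint8)
        return np.packbits(bits).tobytes()

    corpus = np.random.randn(2_000, 1536)   # stand-in for real embeddings
    components = fit_pca(corpus)
    addr = to_xor_address(np.random.randn(1536), components)
    assert len(addr) == 32

Sign quantization throws away magnitude information, but it roughly keeps similar embeddings at small Hamming/XOR distances from each other, which is what the closest-address lookup relies on.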

6 Likes

I think it’s more like context-based search. It’s fascinating though.

I am searching this space extensively right now and love any input like this. It seems like you are onto something here, but I wonder if there is a way to find “context” addresses. So the next word is gonna be X, and X is a set of probabilities of the word (token) in context?

That’s one part. The next is private models for private individuals, where they have their own “local” models (local to them, still on SAFE for hardware security). Then those local models can adapt and learn from the large model that is on SAFE, as you state above?

6 Likes

Maybe I was being too hasty here.

If this is done as layer two, then perhaps the client just needs to ‘find’ index nodes somehow, maybe through a plugin of some sort (using this term loosely); then it uploads and indexes in separate sequential processes, i.e. first upload the data, then index the data.

Yeah, I feel we’ve had these conversations before :laughing:

If we have a third-party layer two plugin approach then each ‘indexer’ will be responsible for their search results and maybe that is the way to go in order to maintain the ‘innocence of the network’.

So for example I could create index nodes for a specific set/array of topics I want to index and publish on the forum my intention to run a search service for these topics with links to a plugin that facilitates such.

For the LLM approach, maybe this is similar? Maybe the LLM is pretrained to filter and curate, and so only accepts content that its owner accepts. That could be really useful.

I have some other lengthy ideas written down somewhere on user curation of posts on a social media app for the network, but that needs a thread of its own.

4 Likes

Personally I feel that developing structures for search is the best way to go for the Safe Network.

Then Safe sites can run an app over their site that uses the standard structure, which search engines can use to create their search databases.

In other words, the Safe Network doesn’t use crawlers, which won’t really work, and leaves it up to users to do the work of creating the data in a structured format; any search engine out there can then use that. This also allows site operators (or uploaders) to have only the pages/files indexed that they specifically want indexed.
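
For instance, a site might publish one small record per page it wants indexed; a sketch (every field name here is invented for illustration):

    # Hypothetical per-page index record a Safe site could publish;
    # any search engine may fetch it and fold it into its own database.
    index_record = {
        "title": "Safe Search 2023",
        "address": "…",              # where the page itself lives
        "keywords": ["search", "curation", "federated"],
        "summary": "Discussion of curated and federated search on Safe Network.",
    }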

5 Likes

Using embeddings is actually kinda like tagging a document. If you embed a document using OpenAI’s embedding API you get a 1536-dimensional vector back, where each dimension is kinda like a tag. If one of these dimensions was “brightness”, 0 could mean completely dark and 1 completely bright. So a text about something happening during the middle of the day might have a “brightness” of 0.8, for example, and a text about something happening at night might be 0.1. So while it’s similar to a tag it’s not exactly the same, since the absence of a tag usually doesn’t mean the opposite of that tag. If this “brightness” value was quantized it would instead only have the values 0 for dark and 1 for bright.

As it happens, these dimensions typically do not correspond to easily understandable concepts and may even correspond to superpositions of multiple concepts, so a 1536-dimensional vector might not actually correspond to 1536 different concepts: it could be 5,000 concepts, and maybe the values 0-0.3 in one dimension correspond to one concept while the values 0.31-1 correspond to another, such that this dimension could actually be split into two. What is learned through backpropagation is complicated, kinda like evolution, so it’s hard to understand and untangle.

All in all though, it is kinda like having the LLM add a bunch of tags to the documents, then having the same LLM add a bunch of tags to the query, and finding which documents are closest to the query, i.e. which ones have the most tags in common.
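
In code, that last step is just nearest-neighbour search over the embedding vectors; a small sketch with random stand-ins for the real embeddings:

    import numpy as np

    def top_k(query, doc_vectors, k=10):
        # Indices of the k documents whose embeddings have the highest
        # cosine similarity to the query embedding.
        q = query / np.linalg.norm(query)
        d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
        return np.argsort(d @ q)[::-1][:k]

    docs = np.random.randn(1_000, 1536)   # stand-ins for document embeddings
    query = np.random.randn(1536)
    print(top_k(query, docs))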

OpenAI’s embedding API uses, I think, a model of around 6-7 billion parameters, something which fits easily in a GPU and can process a query in not too many seconds on a CPU. There are many similar open-source varieties. The network could do it, but maybe it would be better if the client just did it as part of the upload process, if the uploader set an option like “do LLM indexing”. It would then first upload all the data and, when that’s done, process the data through the LLM indexer.

If there’s data indexed with 10 different LLMs one wants to search through, then whoever wants to query all of these would need to embed the query with all 10 of those LLMs. If there was a standard LLM for creating indexing embeddings, then maybe users could be encouraged to reindex data when a new, better embedding LLM was created. Though that might not be necessary unless the LLM was very significantly changed or completely replaced with another one.

It might be best to embed on a per-paragraph basis. Each index item for a paragraph would then contain the address of the document the paragraph came from.

One issue with having a model adapt or learn is that the embedding vector it produces for a particular document may then change. You want to use the same model to embed the query as was used for embedding the index, similar to how you want to use the same analyzer for indexing and querying an Elasticsearch index. You’d probably want something like storing the index as immutable data and, in the index’s metadata, keeping a link to the LLM used for indexing.
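
Pulling those two points together, each immutable index could carry metadata pinning the exact model that built it; a sketch with invented field names and placeholder addresses:

    # Sketch only: the field names and xor:// notation are invented here
    index_metadata = {
        "embedding_model": "xor://…",     # link to the LLM used for indexing
        "dimensions": 256,
        "entries": [
            {
                "embedding": "xor://…",   # quantized paragraph embedding
                "document": "xor://…",    # document the paragraph came from
                "paragraph": 12,          # position within the document
            },
        ],
    }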

This could be one of multiple types of index. As neo suggested, each site could generate indexes for themselves, and various curated collections of indexes could be set up, something like what was mentioned in the OP: one search engine could contain a collection of indexed books and indexed, curated, reputable sources, and users could set up their own index collections for what they want to search.

You mean whether you can get an embedding for a single particular word in a particular context?

So for the word apple, you could extract different embeddings for the company Apple and the fruit apple, then somehow use that by itself?

The embeddings of “I’m eating apples” and “I bought an Apple computer” will already have these different meanings.

That’s an interesting idea. I’m not sure the LLM necessarily needs to be pretrained specifically for each user; you could perhaps describe in words content that you don’t want to see, then embed that and use it as a negative query: if any found content is too close to the description of the content you don’t want, you won’t see it.
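
A sketch of that negative-query idea: embed a description of unwanted content once, then drop any result whose embedding is too close to it (the threshold and data here are illustrative):

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def filter_results(results, negative, threshold=0.8):
        # Hide any result whose embedding is too close to the embedded
        # description of content the user does not want to see.
        return [r for r in results if cosine(r, negative) < threshold]

    negative = np.random.randn(1536)   # stand-in for embed("content I don't want")
    results = [np.random.randn(1536) for _ in range(20)]
    print(len(filter_results(results, negative)))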

6 Likes

I mean more like

  • There are approx 50,000 words
  • Each word has many contexts
  • If these words plus contexts were distributed, then we have something to start with

Then, looking further at the embeddings/vector DBs etc.: imagine we could take the parameters from different models and merge those. Merge architectures and weights, or use transfer-learning-type approaches (fine-tune one against another, etc.).

This could then be interesting: if we all had local AI that lived with us, we could meet, understand each other’s viewpoints and build on each other’s knowledge. Then take this to a more global approach, and who knows.

There is much, much more, but the field is moving at light speed, so it’s hard to keep up. Synthetic data then enters the scene, along with LLMs training each other, then liquid networks and neuromorphic chips (love them, but how to scale?). It’s all fluid, but fascinatingly fluid right now.

The vector-encoding part is for sure a massive piece, and that is fascinating and very dependent on data quality.

I have a feeling there is a reward-function type that is aligned with feasibility, regardless of probability, and perhaps there is even something there in terms of “don’t merge what’s not feasible”.

:upside_down_face:

6 Likes

Something like this?

Start with the list of 50,000 words, then collect many sentences for each of these to ensure a wide variety of contexts.

Take the sentences or contexts where, for example, “apple” clearly refers to the fruit and “Apple” to the company. Pass these through the model and extract the embeddings from one of the higher layers in the model.

Group the vectors into clusters of similar meaning, so you’d have a cluster of vectors related to the fruit apple and a cluster related to the company Apple; for each cluster, maybe take the average of the vectors to get an embedding vector for that particular contextualized concept.
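
A sketch of that clustering step, assuming k-means and using random stand-ins for the real contextual vectors:

    import numpy as np
    from sklearn.cluster import KMeans

    # Random stand-ins for contextual embeddings of "apple" taken
    # from a higher layer of the model, one vector per sentence.
    contexts = np.random.randn(200, 768)

    # Group into two presumed senses (fruit vs. company), then average
    # each cluster to get one embedding per contextualized concept.
    km = KMeans(n_clusters=2, n_init=10).fit(contexts)
    sense_vectors = np.stack(
        [contexts[km.labels_ == c].mean(axis=0) for c in range(2)]
    )
    print(sense_vectors.shape)  # (2, 768)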

4 Likes

Open source search engine in Rust.

7 Likes
