Content discovery inside Autonomi

Is there really no way to find content inside Autonomi other than to learn about it somewhere else first?

Or, to put it the other way round, is there really no “public space” inside Autonomi where content could be pushed for anyone to find?

I’m thinking of some sort of “motherpage” whose address you would need to know, and to which anyone would be free to link their content. Is that a stupid or impossible idea?

3 Likes

There will be multiple ways. Folk who weren’t around at the beginning of the web in 1994 may worry more about this than I do, because we had nothing then, but we do not need a centralised DNS.

We can and will build decentralised ways to connect and share. I’ve outlined this and provided a bare-bones demo with dweb’s Names app. It’s mainly a concept right now, but it could well be one of many ways we discover and share.

19 Likes

I looked at dweb names, but I just don’t understand it well enough.

Would you mind detailing a bit how it will be possible to, for example:

  • Have “a space”, like a forum, blog, or any kind of “page”, where many different people can publish something without anyone else needing to take any action for the message to appear in that “space”? Like we do here on this forum.

But only if you have time at some point. I’m sure you have a lot of other things to do :slightly_smiling_face:

3 Likes

Just a reminder of a concept/idea I had a while back:

I expect there will be many ways. A simple one is for people to share an address at which they publish data, which others choose to follow. If I’m following you, my app would poll that address to see if it has changed. And if I follow several people, it can merge their updates into a single UI.

You can do this with different Autonomi data types already.
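As a rough sketch of that polling idea — everything below is a stand-in; in a real app `fetch_latest` would be an Autonomi client or dweb REST call, and the addresses would be real network addresses:

```python
# In-memory stand-in for the network; in a real app this would be
# a client call reading a scratchpad/pointer at `address`.
FAKE_NETWORK = {
    "addr_of_alice": "alice's latest post",
    "addr_of_bob": "bob's latest post",
}

FOLLOWED = list(FAKE_NETWORK)  # addresses of people I follow

def fetch_latest(address: str) -> str:
    return FAKE_NETWORK[address]

def poll_feeds(last_seen: dict[str, str]) -> list[tuple[str, str]]:
    """Return (address, content) pairs that changed since the last poll."""
    updates = []
    for addr in FOLLOWED:
        content = fetch_latest(addr)
        if last_seen.get(addr) != content:
            last_seen[addr] = content
            updates.append((addr, content))
    return updates

seen: dict[str, str] = {}
print(poll_feeds(seen))   # first poll: everything is "new"
print(poll_feeds(seen))   # second poll: nothing changed -> []
```

A single loop like this, run every minute or so over all followed addresses, is enough to merge several feeds into one view.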

I’ve just added support for public Scratchpads to the dweb REST API for Autonomi and was wondering about making a simple demo using that. Maybe this is the motivation!

Lists of DwebNames could be ‘followed’ in the same way and merged into your own list along with others, and an app could search all these when you start typing an address, offering you a choice of names created by yourself and others.
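A toy version of that merge-and-suggest behaviour, with made-up names and addresses standing in for real DwebNames lists:

```python
# Name lists I follow, merged with my own (all names/addresses made up).
my_names = {"anns blog": "addr1"}
followed_lists = [
    {"bobs wiki": "addr2"},
    {"anns blog": "addr1", "carol": "addr3"},
]

merged = dict(my_names)
for names in followed_lists:
    merged.update(names)   # naive merge; a real app would handle clashes

def suggest(prefix: str) -> list[str]:
    """Offer name completions as the user starts typing an address."""
    return sorted(name for name in merged if name.startswith(prefix))

print(suggest("a"))              # ["anns blog"]
print(merged[suggest("b")[0]])   # resolve "bobs wiki" -> addr2
```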

The topic is about discovery, but once we have people to follow, we can share things as with all social media. Discovery doesn’t need to be via search, and none of these things need to work the same way as the current web. That’s why I expect different approaches.

4 Likes

Where do they publish that address?

Back in the olden days, folks were encouraged to submit their sites for processing by a search engine’s crawlers.

Once the crawlers knew about a site, they would index it and make it available for searches.

Ofc, if a site was already linked from elsewhere, and the crawlers had already found it, no additional action was needed.

Crawlers are just scripts that index content and then follow any links to find new stuff to index. There’s nothing to stop this being done on Autonomi. It just needs people to do it, with the knowledge to make it scale.
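A crawler really can be that small. Here is a toy breadth-first version, with an in-memory dict standing in for network fetches and link extraction:

```python
from collections import deque

# Stand-ins: fetch() would be a network read; each page is (text, links).
PAGES = {
    "seed_site": ("welcome page", ["blog_a", "blog_b"]),
    "blog_a": ("posts about boats", ["blog_b"]),
    "blog_b": ("posts about code", []),
}

def fetch(address: str) -> tuple[str, list[str]]:
    return PAGES[address]

def crawl(seed: str) -> dict[str, str]:
    """Breadth-first crawl: index each page, then follow its links."""
    index, queue, seen = {}, deque([seed]), {seed}
    while queue:
        addr = queue.popleft()
        text, links = fetch(addr)
        index[addr] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

print(crawl("seed_site"))   # indexes all three pages
```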

4 Likes

Anywhere they like. To start, share it here. Once we have apps that can do this on Autonomi, they can be shared there. I’m building one now.

This is how the web started. There were no search engines. We emailed each other links to sites and visited them. Some of those had links. We made notes of sites we thought were interesting, or might interest someone else. Some made a page of links to share with others. Some took lists from lots of people and built bigger lists.

There were other ways too, long before search engines, and it was fun!

3 Likes

Yes, sharing outside Autonomi was already familiar to me.

This was the thing I was asking for more details about.

But as you are building the app now, I think I can as well just wait and try. Thank you, I appreciate your work a lot :heart:

2 Likes

Am I right in thinking it works like this?

  • Someone is running an indexing script on their computer.
  • Another person sends a link to them, paying the network for the data (the link).
  • The script indexes the link and updates its list of sites, paying for that update.

I’d be happy with that process if the indexing and the updating of the list could be done with all the payments and decisions made by the one sending the link to be indexed.

1 Like

The scheme I described does something similar. The person building the index will need to pay for that, but people can send the address of wherever they decide to publish their links. The index owner’s app scans all those lists every so often and builds a new list, which they publish (and pay for) at their leisure.

If they publish using a scratchpad, though, updates are free, so they won’t have to pay for them provided the combined list fits within one scratchpad (~4 MB).
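A minimal sketch of that merge-and-publish step, assuming the ~4 MB scratchpad figure above (the submitted lists and the JSON encoding are just illustrative):

```python
import json

SCRATCHPAD_LIMIT = 4 * 1024 * 1024  # ~4 MB, per the figure above

# Link lists gathered from the addresses people submitted (stand-ins).
submitted_lists = [
    ["site_a", "site_b"],
    ["site_b", "site_c"],
]

def build_combined_list(lists: list[list[str]]) -> bytes:
    """Merge, de-duplicate, and serialise the combined link list."""
    combined = sorted({link for lst in lists for link in lst})
    return json.dumps(combined).encode()

payload = build_combined_list(submitted_lists)
if len(payload) <= SCRATCHPAD_LIMIT:
    print(f"{len(payload)} bytes: fits in one scratchpad, updates are free")
else:
    print("too big: split across scratchpads or pay for immutable chunks")
```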

1 Like

Yes, pretty much. At least if we want to produce another centralised search engine, likely running in a hybrid way.

I remember submitting my site maps to google not all that long ago and they then did something very similar.

The challenge for a good Autonomi search engine will be how users can interact with the data set it produces.

It would be great if everyone could build their own indexes (using an open tool/app) and link them to a common location (EDIT: i.e. crawl their own sites, then submit their results, which could be quickly validated, etc.). Essentially, build a sort of open-source Google, where the indexes are stored in immutable data and are quick to cache, search, and append to.

As @happybeing says though, this isn’t where we need to be on day 1 and the clear net certainly wasn’t.

A fully distributed way to do search would be very cool indeed though and will be a big challenge for developers to solve.

5 Likes

Do I understand correctly?

We should quickly establish a common format for our sitemaps, then publish them as, say, robots.aut: a text file that will contain either the sitemap OR a string like “Do Not Index” (though why anybody would want that, I’m not sure).
Then someone needs to create an app that searches all known published datamaps for the robots.aut file and begins generating a search index?
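To make that concrete, here’s a guess at what a minimal robots.aut convention and parser might look like; the file name, the tab-separated fields, and the “Do Not Index” sentinel are all just the idea above, nothing standardised:

```python
# Hypothetical robots.aut contents: either a "Do Not Index" sentinel
# or one sitemap entry per line as "address<TAB>title".
ROBOTS_AUT = """\
site_root_addr\tHome
site_blog_addr\tBlog
"""

def parse_robots_aut(text: str):
    """Return None if indexing is refused, else (address, title) pairs."""
    if text.strip() == "Do Not Index":
        return None
    return [tuple(line.split("\t", 1)) for line in text.splitlines() if line]

print(parse_robots_aut(ROBOTS_AUT))
print(parse_robots_aut("Do Not Index"))
```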

just musing out loud here…

Yes, something like that, I would have thought. Now, I’m no expert in search engines (it would be great to have one here!), but I’d expect that to be the first thing a crawler does: summarise the data in a way that can be ingested into the broader database.

Key words, phrases, how much they are used, etc.

I understand Google et al. will also check how often a site is linked to, to judge how popular/correct the data is, etc. This could be part of the pass that reads in these crawler digests/outputs.
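A toy version of that digest-and-rank step, counting word frequencies and inbound links (real ranking is of course far more involved, and all the data here is made up):

```python
from collections import Counter

# Crawler output stand-ins: page text plus the pages it links to.
pages = {
    "site_a": ("sailing boats and more boats", ["site_b"]),
    "site_b": ("rust code and boats", []),
}

# Digest each page: word frequencies, the summary a crawler would submit.
digests = {addr: Counter(text.split()) for addr, (text, _) in pages.items()}

# Count inbound links as a crude popularity signal.
inbound = Counter(link for _, links in pages.values() for link in links)

def search(term: str):
    """Rank pages by term frequency, then by inbound links."""
    hits = [(addr, freq[term], inbound[addr])
            for addr, freq in digests.items() if freq[term]]
    return sorted(hits, key=lambda h: (h[1], h[2]), reverse=True)

print(search("boats"))  # site_a first: it mentions the term twice
```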

I know search is an area of academic research and engineering specialism though. Distributed search must surely be too.

I mean, some of us here could probably have a crack at something, but I bet starting on sound research would be a big help.

2 Likes

It’s a shame we don’t have another couple of weeks until IF kicks off. I think this would have been a good candidate.

As good a place as any to start learning… https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap

2 Likes

This has been suggested before by someone :slight_smile: The idea is to have an app that indexes your own site, store that data, and then submit a link to that site index. That way most of the work is done and paid for by the site owner, and the list only contains what the site owner wants in it. It could also carry permissions for the search engine, allowing it to expand the list when it crawls the site or forbidding it from crawling the site at all.

And, as someone suggested above, the one submitting the site could pay a small fee to cover current and future updates to the search engine.

But if the owner updates their site index with the app, then future updates can be paid for by the site owner.

With the scratchpad record type, the site index can be held in one of those records and updated by the site owner without the search engine needing to update its pointers/links.

2 Likes

Allow me to step back from DNS a little to general content discovery, since my comments on that are what triggered this thread. Content discovery requires some kind of collaborative data structuring if we don’t want to rely on centralised parties/authorities, with all the potential power abuse that can come with them. I’ll try to explain why I think it won’t work well with the current Autonomi datatype restrictions:

Hashtable restrictions
Fundamentally, the network is a hashtable (dictionary), allowing key/value storage and retrieval where the key is the network address. But currently it has restrictions on which keys are accepted: the key either has to be a valid cryptographic public key (because the value under that key has to be signed), or it has to be the hash of the value (in the case of chunks). There is no freedom to store an arbitrary value under an arbitrary key.

Public keypairs
To me this seems like a major obstacle to public collaborative structuring of data on the network, such as a conversational thread, which I will use as an example. The best method of achieving this under the current restrictions (that I’m aware of) is to use GraphEntries with shared or public signing keys, but it has a serious potential sabotage problem. The basic idea is that a person wishing to start a public discussion thread generates a BLS keypair for sharing, creates a GraphEntry, adds the public key of that keypair as a Descendant entry (the address of the next reply), and stores the signing key of the pair as Descendant metadata. This way anyone reading the GraphEntry knows the keypair and can upload a new GraphEntry under this shared public key.
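To make the structure visible, here is a toy model of that scheme; the class and the key strings are plain Python stand-ins for real GraphEntries and BLS keys, not the actual Autonomi API:

```python
from dataclasses import dataclass, field

@dataclass
class ToyGraphEntry:
    """Stand-in for a GraphEntry: content plus advertised reply slots."""
    address: str                 # the public key this entry lives under
    content: str
    # Each descendant: (next_reply_pubkey, shared_signing_key_as_metadata)
    descendants: list[tuple[str, str]] = field(default_factory=list)

def new_shared_keypair(n: int) -> tuple[str, str]:
    return (f"pub_{n}", f"sec_{n}")   # stand-in for BLS keygen

# The thread starter advertises where reply #1 must go, and shares its key.
pub1, sec1 = new_shared_keypair(1)
root = ToyGraphEntry("thread_root_pub", "first post", [(pub1, sec1)])

# Any reader now knows (pub1, sec1), so anyone can sign and upload the
# next entry at pub1 -- as long as they advertise slot #2 in their turn.
pub2, sec2 = new_shared_keypair(2)
reply = ToyGraphEntry(pub1, "a reply", [(pub2, sec2)])

# A saboteur instead uploads at pub1 with descendants=[]: thread locked.
```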

Sabotage
But this falls apart if there’s a single actor who wants to sabotage the discussion. That actor can simply upload a new GraphEntry using the shared keypair that doesn’t provide any new shared-keypair descendant, effectively locking the thread because no further replies are possible (or only by the attacker(s), which can be even more nefarious). Even if you initially add multiple descendants, even with keys that are not shared so that you can circumvent the lock and make a new reply yourself, it doesn’t stop the attacker from doing it again the moment you switch back to using a publicly shared signing key.

So to me it seems that only whitelisted group discussions, where every participant is trusted not to sabotage, are viable under the current restrictions. The root problem is that the potential location(s) for replies have to be specified explicitly in advance, and once these are used without providing new potential locations, the overall data structure becomes immutable and the discussion cannot continue.

Mutable types
Using a mutable datatype like ScratchPad instead of the immutable GraphEntry doesn’t solve the problem. At best it becomes a back-and-forth game between good actors and bad actors editing their previous uploads to circumvent the sabotage, where good actors have to constantly adjust their previous replies to route the reply addresses around the attacker’s sabotage.

Implicit addressing
However, if we had the ability to store data under arbitrary addresses, app builders could define protocols where the addresses of replies are derived implicitly (in contrast to being specified explicitly in advance). A basic example would be that the first reply to address X uses “address = hash(X+1)”, the second “address = hash(X+2)”, and so on (any prefix or suffix can be added to differentiate protocols/namespaces, create branches, etc.). This means that there is a practically infinite number of potential reply locations, making the aforementioned attack impossible. App protocols can simply skip and ignore any reply that doesn’t conform to their schema definitions, or that is flagged by moderators.
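A sketch of such a derivation rule (the hash choice, the separator, and the namespace string are arbitrary illustrations):

```python
import hashlib

def reply_address(thread_addr: str, n: int, namespace: str = "demo-app") -> str:
    """Derive the address of the nth reply implicitly from the thread address."""
    preimage = f"{namespace}:{thread_addr}:{n}".encode()
    return hashlib.sha256(preimage).hexdigest()

# Anyone can compute where reply #1, #2, ... of a thread must live, so no
# prior entry has to advertise the slots, and none of them can be locked.
for n in range(1, 4):
    print(n, reply_address("thread_root_addr", n))
```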

Node targeting
One of the reasons I’ve seen mentioned on these forums for not allowing this is that arbitrary addresses for data would enable different kinds of attacks on the network, since specific nodes could be targeted to store particular data.

I would argue that the current datatypes, where a public key is used as an address, are just as vulnerable as a datatype where the address would be the hash of a field with an arbitrary value. Just like hashes can be “mined” (brute-forced) to find one that ends up at a particular node, the same can be done with public keys. The algorithmic complexity of generating hashes and of generating public keys is linear in both cases. Even if the public key derivation algorithm is a bit heavier to run, the network could use a heavier hash function or require rehashing X times to derive the address.
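The grinding argument in miniature; swapping the hash for BLS keygen would change only the cost per attempt, not the linear character of the search:

```python
import hashlib
from itertools import count

def mine_prefix(target_prefix: str) -> tuple[int, str]:
    """Brute-force a nonce whose hash address starts with target_prefix."""
    for nonce in count():
        addr = hashlib.sha256(str(nonce).encode()).hexdigest()
        if addr.startswith(target_prefix):
            return nonce, addr

nonce, addr = mine_prefix("abc")   # ~16**3 tries on average
print(nonce, addr)
```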

DNS and lookup speeds
The other arguments I’ve seen relate to the typical “DNS” use case, where people tend to think of the current restrictions as stopping global “DNS” usernames and website addresses (which can be squatted and lost) from becoming dominant. I generally agree with those concerns, but stopping the network from functioning as a general-purpose hashtable (dictionary) causes a lot of collateral damage.

Generally speaking, we can never have arbitrary collections of data that provide Θ(1) search time complexity at the network level; they would have to be converted to a hashtable in local memory first, which means downloading the entire collection. This is resource-intensive because it has to be stored in local memory and redownloaded every time the app that uses it is restarted.

At best, arbitrary collections are organised and uploaded to the network in some kind of search tree structure that provides Θ(log n) average search time complexity. That is not too bad, but organising such data structures collaboratively (for example, for distributed DNS) runs into the same locking sabotage attack I explained initially. So either every user organises and uploads such structures for themselves (extra upload and compute costs per user), or they rely on another party to do this faithfully for them (a centralising force).
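To illustrate the Θ(log n) option, here’s a toy lookup over a search tree whose nodes each live at their own network address (the dict stands in for network reads):

```python
# Toy network-hosted search tree: each node is a separate "network
# record" fetched by address, so a lookup costs O(log n) fetches.
NETWORK = {
    "root": ("m", "node_l", "node_r"),   # (key, left_addr, right_addr)
    "node_l": ("f", None, None),
    "node_r": ("t", None, None),
}

def lookup(addr: str | None, key: str) -> tuple[bool, int]:
    """Walk the tree from addr; return (found, number_of_fetches)."""
    fetches = 0
    while addr is not None:
        node_key, left, right = NETWORK[addr]
        fetches += 1
        if key == node_key:
            return True, fetches
        addr = left if key < node_key else right
    return False, fetches

print(lookup("root", "t"))   # (True, 2): two reads, not the whole set
```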

Maybe I have overlooked some great solution; I really hope so. But if not, these restrictions on the network’s hashtable really don’t seem worth it to me. The main concern appears to be a fear of a global DNS system with negative aspects like squatting, but removing the aforementioned restrictions does not make it inevitable that such a system will dominate. This community seems pretty aware of the risks there, and I believe won’t naively implement such a system. As long as Autonomi doesn’t champion it either, I think it’s best to just let the different designs compete.

I wrote the above without reacting to all the other comments in this thread to keep my thoughts clear and structured. If you want me to clarify how I see the other comments/proposals here in light of the above, please do ask!

About the only real solution I can see for the beginning is to revisit what was done early on with people hosting lists of sites.

On Autonomi it is more likely there will be such lists for search engines, where “sites” are submitted to various search “engines”.

Similarly, the site name can be part of these lists.

I agree that without central authorities, and given the limitations of the data structures, traditional central control of DNS is not really feasible. Current DNS, and any version of it implemented on Autonomi, is not only open to abuse: just as the current system is abused, the Autonomi one would be as well, be it squatting on names, kicking people off certain names, and so on. The financial abuse is horrific, even if some form of financial cost is needed to help limit availability.

Added to that, people will come up with their own solutions to “DNS”, which only means we will have more than one solution in use. Maybe some countries will end up mandating one system for their citizens (unenforceable), companies mandating one system for their employees (to keep it SFW), and so on.

3 Likes

I think there are some solutions in this thread, if I understand the problem correctly: