Safe Web Crawler

anon81773980 · February 24, 2017, 10:43pm

Continuing the discussion from Good list of test sites to browse:

I thought it is impossible to do web crawl in safenet due to encryption, and chucks? And that, if one wants to share something, it needs to save through index type website. Am I wrong?

ootoovak · February 24, 2017, 11:25pm

Yes, I would be keen to read about this as well. Might be a good tutorial topic! Being able to publish your own sites is great, but sharing is even better.

I think I have yet to figure out even how to see someone else’s public website.

happybeing · February 24, 2017, 11:53pm

Crawling SAFE is no different to the clear web.

neo · February 25, 2017, 12:39am

If the public can read it then a crawler can read it.

The crawler starts off with one safesite and then finds any links in that and then crawls those safesites. Rinse and repeat.

And like now the “dark” sites rarely ever have a crawler touch them because they are not linked anywhere OR they use ports/security that crawlers obviously cannot get past. Which is the same for SAFE. Unknown SAFE sites cannot be found by a crawler OR are encrypted to the general public

anon71598723 · February 25, 2017, 4:35am

In ruby (with the ruby-safenet gem) you can read a website like this:

html = safe.dns.get_file_unauth('test1', 'www', 'index.html')['body']
# => "<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset=\"utf-8\">\n    <title>Test Safenet</title>\n  </head>\n  <body>\n    <a href=\"safe://www.test2\">Link A</a>\n    <a href=\"safe://www.test3\">Link B</a>\n  </body>\n</html>"

Then, you can use Nokogiri or URI.extract to extract the urls:

URI.extract(html)
# => ["safe://www.test2", "safe://www.test3"]

And then you can break these URLs into protocol/host/path with URI.parse and repeat the process recursively.

anon71598723 · February 25, 2017, 4:39am

Here is a more detailed scraper:

require 'safenet'
require 'uri'

def get_links(url)
  uri = URI.parse(url) # parse url
  service, domain = uri.host.split('.') # www.something -> domain = 'something', service = 'www'

  html = safe.dns.get_file_unauth(domain, service, 'index.html')['body'] # read safe://www.test1/
  html ? URI.extract(html) : [] # extract links if page exists
end

# client
safe = safenet_quick

# load list of urls (creates if doesn't exist)
urls_parsed   = JSON.parse(safe.sd.read_or_create('list_urls_parsed', [].to_json))
urls_unparsed = JSON.parse(safe.sd.read_or_create('list_urls_unparsed', ['safe://www.test1'].to_json))

# parses "safe://www.test1" recursively
while url = urls_unparsed.pop
  urls_unparsed += get_links(url)
  urls_parsed   += url

  # save on the network
  safe.sd.update('list_urls_parsed', urls_parsed.to_json)
  safe.sd.update('list_urls_unparsed', urls_unparsed.to_json)
end

Then you can put this script on cron and develop a website that reads “list_urls_parsed” and display the scraped pages. Also, you can open the unparsed list to everyone collaborate with an Appendable Data.

nice · February 25, 2017, 6:13am

I suppose we could also parse the words in each page, store an index of some form in an appendable/mutable, then an app can ask the index…
EDIT We need to write this one really well before some pain in the neck comes with tailored ads…

19eddyjohn75 · February 25, 2017, 7:48am

This ad is delivered 2 you, through eddyjohn’s mischief network

On a serious note, let’s not have one website doing the crawling, it would be nice if end users SAFE clients did the crawling decentralized. If at all possible…

nice · February 25, 2017, 8:04am

haha good one thanks

the Safe Browser could have a tick box to enable/disable contributing to the crawling effort. It would indeed help to prevent non objective selection of what is indexed or not, what results are displayed or not , in what order…and much more efficiency for pages with little or no links from outside.
You would need to be very careful not to forget to disable it while you browse your super top secret agent forum, though.
I didn’t take time to verify but I’m sure there is a topic about this somewhere.

davidpbrown · February 25, 2017, 9:26am

It is slightly different from clear web, for not having ISP and other servers in the middle on the network; so, there’s no option on a top 1 million Alexa list or similar traffic analysis available… it’s all then from client perspective and not from the network… at least as far as I understand it.

The only change to that would perhaps be some future google-analytics like data from sites using whatever that was but they would be known already and that would just be an attempt at traffic ranking.

So, the only crawl on SAFE I’ve seen is the one I’ve done, which simply has a sensible guess at urls and notes responses.

neo · February 25, 2017, 9:29am

I can see a system where SAFE sites would submit their root page to a “crawler” DB and then the crawler would crawl the site.

Also on the clear web the crawlers can receive notifications of new domain name registrations and include those on their crawl schedule.

aenemic · February 25, 2017, 1:37pm

So like, the networks crawls itself to create a search engine?

JBishop · February 25, 2017, 4:01pm

@davidpbrown: It is slightly different from clear web

@happybeing: Crawling SAFE is no different to the clear web

Think so too: Either other pages already known to crawlers link to the new page or a scan of GMail messages stumbles over links to this new page. You can also submit it yourself in their webform for this purpose. All other means of initial indexing derive from this?

About submitting pages yourself, according to @Tim87’s proposal it could be done like this for pages on the Safe-network:

I don’t usually suggest things that require extensions to the core, but this is one of them.

The idea is about the semantic tagging of documents, and it requires:

- a new data type (yes, I knowww )
- that can be searched in XOR space in fuzzy way

What the Heck This Is About

“Dimensionality reduction” is the method where we assign an unlimited number of documents to a limited number of categories, with a certain weight.

The categories can be handcrafted, but they can also be automatically generated (e.g. by neural networks or support vector machines) in which case the model is not readily understood by humans, but it is more optimal for similarity search.

“Semantic hashing” is the binary variation of the same idea: age becomes “young” and “old”, and then “young” can be further subdivided into “child” and “teen”, “child” into “baby” and “no longer a baby”, etc.

The resulting semantic hash is exactly like what you think: a looooong binary number. However, unlike in the case of cryptographic hashes, if the XOR distance between two numbers is small, we can be fairly sure that the two things are also quite similar …

In the thread David seems to like this approach, leading to “answer engines instead of search engines”.

Traktion · February 25, 2017, 8:43pm

Yup! When Google decides to point its crawlers at safe net, the search problem pretty much goes away. Of course, we may want alternatives, but they have won popularity on the clear web for good search results.

anon71598723 · February 26, 2017, 5:17am

@davidpbrown what was your guessing strategy? Did you guess based on a dictionary, on the forum usernames, on popular websites domains?

nice · February 26, 2017, 6:28am

Hum… At what price…

Traktion · February 26, 2017, 10:18am

Sure, but they will crawl safe net regardless if it suits them. They may even provide proxy access, similar to cached access on clear net too.

JBishop · February 26, 2017, 11:12am

Posted but thought to look into the following afterwards

Does Google Sniff Your Gmail to Discover URLs

So here is the question: Does Google scan Gmails to see URLs shared within them, and then does it use these to discover new content? There are many who adamantly maintain that they do.

These people, “Digital Marketing Success By Design, contact us”, “decided to put that to the test”, whether or not Google is reading (g)mail in order to see what it possibly is linking to.

Surmises: scanning is for other purposes, just not in order to point the crawler to links in the content also.

Murmur of leaks noticed continues, even at end of the above experiment’s account. Why people would think it’s possible anyway is maybe because of this sort of attention to detail:

Google just dodged a privacy lawsuit by scanning your emails a tiny bit slower

The company won’t do ad scans until after a message hits your inbox on behalf of non-Gmail users, who haven’t agreed to have their emails scanned under Google’s Terms of Service. Because Gmail’s ad-targeting system draws on every email a Gmail user receives, it inevitably catches some messages from non-Gmail addresses. Scans that take place before emails are available to the user are particularly sensitive, since they’re not yet part of Gmail’s inbox. In real terms, that gap lasts only a few milliseconds.

So data can be used any other way, as stated (in 2014):

Google admits it’s reading your emails

GOOGLE HAS UPDATED its privacy terms and conditions, eroding a little more of its users’ privacy.

Our automated systems analyse your content (including emails) to provide you personally relevant product features, such as customised search results, tailored advertising, and spam and malware detection. This analysis occurs as the content is sent, received, and when it is stored.

davidpbrown · February 26, 2017, 12:08pm

Google does no evil… by redefining good. The small evil for the greater good fallacy is just another symptom of conservative thought that leeches into every area, tempting those who can with more power and wealth.
Reasons we need SAFE to help avoiding those who ‘know’ best what is good for others.

All the above+… it’s not hard to do. Those who put up sites tend not to be trying to hide them. Naturally, I doubt that I guessed them all and I know of no sure fire way to catch everyone that exists.

nice · February 26, 2017, 12:31pm

What I meant is mostly that if Google indexes Safe ( and we can expect they will ) , then their issue with searching Safe is resolved.
The results they will serve are by design oriented for their profit, and do not necessarily serve the common benefit ( some results can purposedly be ommited, or buried deep in the ranking ).
So even if they solve Safe searching, we will still need to create a non profit oriented, decentralized search. ( just like we still need it for the clear web, btw )

Topic		Replies	Views
Google-like searches on Safe Network Apps	31	3343	January 23, 2018
Safe-Search, bringing content discovery to the SAFE network Apps	41	4941	February 16, 2018
Brainstorming decentralized search on Safe Features	28	1127	September 21, 2020
Simple Human-centric Search on SAFE Apps	13	1702	July 3, 2016
Assimilating the World Wide Web with the SAFE network Features	36	2201	March 16, 2017

Safe Web Crawler

Related topics