Performance testing upload/download using sn_httpd. Maybe related to QUIC window size

It struck me earlier that it may actually be easier to perf test via sn_httpd.

There is already a warm connection and a wealth of perf testing tools for HTTP.

Ofc, sn_httpd itself is bound to have a bunch of bottlenecks too, but wget shows I can grab files easily. Just using time and wget gets us quite far, pretty quickly.
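
For example, something as simple as this gives a rough per-file timing (the XOR address and filename here are placeholders):

Command: time wget -O /dev/null http://localhost:8080/<xor-address>/<filename>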

Gatling or some other proper testing tool would do a much more thorough job too.

I’m quite far from my router here, but Termux let me write a wee script while I am struggling to sleep! :sweat_smile:

8 Likes

Actually, much faster from my wired box:

Edit: note the download speeds may look a bit strange, as sn_httpd downloads the full file first, then squirts the whole thing out at wired LAN speed. In reality, the wait before the first byte arrives is much longer than it would be if the file was streamed. The real time is accurate though, ofc.

7 Likes

I’ve heard it is decent. Not used it yet. Will play later, as I need to do it for sn_httpd anyway. Unless anyone else wants a go? :sweat_smile:

EDIT:

K6 seems to be a decent wee app for speed testing. It does http, but you can script other things too.

10 virtual users downloading a selection of image files, across 1000 requests. Over 8 MB/s (about 65 Mbit/s), with an average latency of 530 ms.

This is over my wifi link, with a local docker instance of sn_httpd, sharing a single Autonomi client.

I’ll see if I can recompile with the different env vars to see if there is any difference. It would be interesting to see what other folks get though.

I used the latest version of K6: Release v0.56.0 · grafana/k6 · GitHub

With the following JS file:

import http from 'k6/http';

export default function () {
  http.get('http://localhost:8080/7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285/1_QdxdljdwBwR2QbAVr8scuw.png');
  http.get('http://localhost:8080/7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285/1_dH5Ce6neTHIfEkAbmsr1BQ.jpeg');
  http.get('http://localhost:8080/7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285/1_pt48p45dQmR5PBW8np1l8Q.png');
  http.get('http://localhost:8080/7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285/1_sWZ4OWGeQjWs6urcPwR6Yw.png');
  http.get('http://localhost:8080/7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285/1_ZT6qplX5Yt8PMCUqxq1lFQ.png');
  http.get('http://localhost:8080/7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285/1_SxkGLnSNsMtu0SDrsWW8Wg.jpeg');
  http.get('http://localhost:8080/7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285/1_bogEVpJvgx_gMHQoHMoSLg.jpeg');
  http.get('http://localhost:8080/7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285/1_LFEyRQMHmxRnZtJwMozW5w.jpeg');
  //http.get('http://vader.lan/7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285/st-patrick-monument.mp4');
}

Command: k6 run -u 10 -i 1000 localhost-autonomi-http.js
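
As an aside, the same load profile can be pinned inside the script itself using K6’s standard options export, instead of the -u/-i flags (a minimal sketch, added to the top of the script above):

export const options = {
  vus: 10,          // same as -u 10
  iterations: 1000, // same as -i 1000
};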

Edit 2:

I tried running the same script on 2 different hosts (local and vader). Interestingly, the total data rate was about the same, being split between the two hosts. The latency is about double for each too.

Given I’m on a 1 Gbit link, I don’t think that is the bottleneck, but rather the hosts of the chunks may be splitting their resources?

I understand we aren’t swarm caching across the network right now, which I would expect would resolve this. Until then, it looks like a DDoS on specific files could be a concern, especially if others see the same thing.

6 Likes

Using a wrapper around the “ant” binary, I’m seeing around 2 minutes per file download request, with sizes varying from 129 KB to 32 MB - obviously most of that is connecting to the network :sleeping:

That is a “huge” improvement from the start in December 2024, when it was around 9 minutes!

Also, the error rate has improved a lot: a 66% improvement in client download errors and a 40% improvement in client download network errors - although the quoting process has got worse…

From my testing, lots of parallel download requests for the same piece increase round-trip latency, and in some cases “might” cause the node to become unresponsive. That’s compounded by the node’s 50% CPU terminator, which I’m sure will be removed before the launch-node-proper-launch-not-resetting-launch.

On the concurrency, I’ve implemented a rate-limiter; 2-8 parallel downloads for the same piece seems to be “safe”. Jumping to 50-100 parallel downloads over a 5 minute period “might” mean somewhere in the world a node gets killed :sob: 1000-8000 parallel requests for the same piece is brutal and will not end well.
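
For illustration, a minimal sketch of that kind of limiter in plain JavaScript - not my actual implementation; createLimiter and the URL list are hypothetical, and it assumes Node 18+ for the global fetch:

function createLimiter(maxConcurrent) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    task().then(resolve, reject).finally(() => {
      active--;
      next();
    });
  };
  // returns a wrapper that queues the task until a slot is free
  return (task) => new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    next();
  });
}

// cap at 4 parallel downloads for the same piece - inside the 2-8 "safe" range
const limit = createLimiter(4);
// await Promise.all(urls.map(url => limit(() => fetch(url))));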

As developers, we will need to ensure in these early days that we put “workarounds” in place between the end user and the network to hide the latency and smooth out performance. The caching you have started to implement is a good option: request once, serve multiple times. If that also had async request queuing in it, it might reduce the latency on time to first page impression.
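
That queuing idea is essentially in-flight request coalescing: concurrent requests for the same chunk share a single network fetch. A minimal sketch in JavaScript (fetchChunk here is a hypothetical network call, not a real API):

const inFlight = new Map();

function getChunkOnce(address) {
  // if a fetch for this chunk is already running, piggyback on it
  if (inFlight.has(address)) return inFlight.get(address);
  const promise = fetchChunk(address) // hypothetical network fetch
    .finally(() => inFlight.delete(address));
  inFlight.set(address, promise);
  return promise;
}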

Anyhow, all of this is hypothetical from me. I now firmly believe the network needs to be put out into the wild, given where it has been pushed to, to find its feet and its target users, and evolve - don’t trust, verify :zap:

Jad

14 Likes

It would be great to see how this compares with sn_httpd. I suspect the error rate and the latency will be way down.

At about 8 MB/s, you should be getting 129 KB to 32 MB in 1-5 seconds.

Do you have a list of XORs that you use for testing to share?

I think the biggest thing being stressed by the ant CLI is the process of joining the network. The actual download is a tiny bit at the end.

The error rate for the above was 0% too.

Interesting! That tallies with what I’m seeing with K6 too.

Doing the same test run across 2 different boxes, pretty much halves the performance on both (latency doubles and throughput halves, more or less). These weren’t hugely scientific tests (both boxes were busy doing other stuff too), but it was certainly a good indicator.

When I tried to download a 100 MB file on both boxes simultaneously, with 10 concurrent requests on each, I started to get chunk errors. I suspect this is what you are seeing too - it is literally knocking nodes offline, as their CPU increases and they die.

I understand the motivation, but I think we must assume there will be bad actors who want to show how Autonomi can’t scale, how brittle it is, etc. If you or I can easily DoS a chunk, experts in the field certainly can and will.

It’s acceptable while it’s a dev network, ofc. As soon as folks start spending money to push data up to the network, then reliably retrieve it again, this sort of thing is critical though.

Is there a timeline on swarm caching (or whatever we’re calling it), so that peers retain data that has recently been requested? That would immediately require much more sophisticated DoS tactics, as the source node for the chunk would be shielded.

9 Likes

Of course - I’m running on a local branch at the moment, but these files are valid from what I’ve been able to upload - I’ll quickly merge in a few more updates of files.

4 Likes

With that file list, this is the same K6 run:

So, around 10 seconds per file or 42 seconds per run of all 4. Nearly 9 MB/s too. That’s while doing 10 concurrent requests for the same batch.

Note that vader (the 14yo laptop/server) was pretty busy during that run too! :sweat_smile: The data rate seems similar to the other tests though.

(P.S. Maybe these perf test posts need moving elsewhere, but I definitely think it will be easier to test using a persistent autonomi client, rather than one which is bootstrapped each time)

Edit: just realised it is a much bigger file than the forum showed in preview. Will try the full file next time.

1 Like

Yep, 100% agree :+1: I’ve been using the ant client, as it’s what users “might” test with. I’ve already written (but not tested) a coded client, due to an issue I experienced with the Python API, which I’m sure the team will circle back to look at when not so busy.

I’m going to put that tool on one of my dev boxes and give it a spin - looks interesting, I just need to get your HTTP proxy installed as well.

yeah think we might be going off topic :laughing:

3 Likes

So sn_httpd is good :slight_smile: I’ve been able to max out at around 148 MB/s, running two processes on ports 8080 and 8081, with 16 cores and around 16 GB mem. It’s now hitting a limit on downloads, which I’m assuming is a cap on the nodes’ ability to deliver data. My understanding is that with a max 4 MB chunk size, a chunk is only served from 1 node, instead of multiple segments of the chunk being served in parallel from all the nodes holding it (I’m sure that will come in the future). So, with the max replication of 5 nodes per chunk, we do have a max throughput cap on chunk retrieval. It will be great to see hot nodes in the future, where they store more replicated copies of a chunk to facilitate greater parallel downloads, and the ability to stream a chunk.
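
A back-of-envelope sketch of that cap (the per-node upstream figure below is an assumption for illustration, not a measured value):

// with whole chunks served by a single node, the aggregate rate for one
// hot chunk is capped at roughly replicas * per-node upstream
const replicas = 5;              // max replication per chunk, as above
const perNodeUpstreamMBps = 30;  // hypothetical per-node upstream
const chunkSizeMB = 4;           // max chunk size

console.log(`cap ≈ ${replicas * perNodeUpstreamMBps} MB/s for one hot chunk`);
console.log(`one node serves one 4 MB chunk in ≈ ${chunkSizeMB / perNodeUpstreamMBps * 1000} ms`);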

Anyway, I had a few hours spare to play with K6 more. I’m not gonna be sucked into learning another language :dotted_line_face: it’s really good though :slight_smile: the JavaScript interface has allowed me to code very basic logic. It’s now able to parse my CSV file of XOR addresses and filenames on the network, then use a random selection of those per virtual user (VU) via sn_httpd, with a monitor for files returning incorrect sizes.

Requirements are:
* sn_httpd running locally; I’ve got it on 127.0.0.1:8080
* K6 installed
* A file called “data.csv” in the same directory as the .js K6 script, containing all the files to test, from my GitHub test file

k6-ant-runner.js, if you dislike GitHub:
import { sleep } from 'k6';
import http from 'k6/http';
import { SharedArray } from 'k6/data';
import { check, fail } from 'k6';

const csvData = new SharedArray('data', function () {
    // load data.csv once and share it across VUs; drop comment lines, blanks and the header row
    return open('./data.csv').split('\n').filter(line => !line.startsWith('#') && line.trim() !== '').slice(1);
});

// set this to sn_httpd instance
const SERVER = "http://localhost:8080/";

function getRandomRow() {
    const randomIndex = Math.floor(Math.random() * csvData.length);
    return csvData[randomIndex].split(',');
}

export default function () {
    try {
        const row = getRandomRow();
        const name = row[1];    // filename column
        const address = row[2]; // XOR address column

        const url = `${SERVER}${address}/${name}`;

        const start = new Date().getTime();
        const response = http.get(url);
        const end = new Date().getTime();
        const duration = end - start;
        const downloadSize = parseInt(response.headers['Content-Length']) || 0;

        if (downloadSize < 1024) {
            console.error(`Name: ${name}, Duration: ${duration} ms, Download Size: ${downloadSize} bytes - Error: Download size is less than 1KB`);
            // flag as a failure locally so the status check below records it
            response.status = 501;
        }

        if (response.status !== 200 && response.status !== 501) {
            throw new Error(`Request failed with status: ${response.status}`);
        }

        check(response, {
            'status is 200': (r) => r.status === 200,
        });

        sleep(1);
    } catch (error) {
        console.error(`Error: ${error.message}`);
    }
}
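
For reference, the script above expects data.csv to carry the filename in its second column and the XOR address in its third, after a header row. The column names and rows below are illustrative (reusing files from the earlier test), not the real test file:

# lines starting with # are skipped
id,name,address
1,1_QdxdljdwBwR2QbAVr8scuw.png,7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285
2,1_dH5Ce6neTHIfEkAbmsr1BQ.jpeg,7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285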

load is 10 users, 1000 iterations total
running with ./k6 run -u 10 -i 1000 ./k6-ant-runner.js

jadkins@dev03:~/sn_httpd/k6-v0.56.0-linux# ./k6 run -u 10 -i 1000 ./k6-ant-runner.js

         /\      Grafana   /‾‾/
    /\  /  \     |\  __   /  /
   /  \/    \    | |/ /  /   ‾‾\
  /          \   |   (  |  (‾)  |
 / __________ \  |_|\_\  \_____/

     execution: local
        script: ./k6-ant-runner.js
        output: -

     scenarios: (100.00%) 1 scenario, 10 max VUs, 10m30s max duration (incl. graceful stop):
              * default: 1000 iterations shared among 10 VUs (maxDuration: 10m0s, gracefulStop: 30s)

ERRO[0002] Name: chume17a.jpg, Duration: 2571 ms, Download Size: 88 bytes - Error: Download size is less than 1KB  source=console
ERRO[0003] Name: CEP474.mpg, Duration: 3752 ms, Download Size: 81 bytes - Error: Download size is less than 1KB  source=console
*** More download errors...

 ✗ status is 200
      ↳  70% — ✓ 442 / ✗ 184

     checks.........................: 70.60% 442 out of 626
     data_received..................: 4.5 GB 7.4 MB/s
     data_sent......................: 101 kB 165 B/s
     dropped_iterations.............: 374    0.613676/s
     http_req_blocked...............: avg=14.94µs min=3.84µs  med=6.52µs  max=542.44µs p(90)=7.92µs  p(95)=26.56µs
     http_req_connecting............: avg=3.9µs   min=0s      med=0s      max=467.52µs p(90)=0s      p(95)=0s
     http_req_duration..............: avg=8.68s   min=1.2s    med=7.18s   max=30.11s   p(90)=17.18s  p(95)=20s
       { expected_response:true }...: avg=8.68s   min=1.2s    med=7.18s   max=30.11s   p(90)=17.18s  p(95)=20s
     http_req_failed................: 0.00%  0 out of 626
     http_req_receiving.............: avg=6.54ms  min=18.36µs med=2.15ms  max=262.91ms p(90)=15.28ms p(95)=22.72ms
     http_req_sending...............: avg=36.27µs min=12.36µs med=29.62µs max=190.2µs  p(90)=59.52µs p(95)=81.33µs
     http_req_tls_handshaking.......: avg=0s      min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=8.67s   min=1.2s    med=7.18s   max=30.08s   p(90)=17.17s  p(95)=19.99s
     http_reqs......................: 626    1.027169/s
     iteration_duration.............: avg=9.68s   min=2.2s    med=8.18s   max=31.12s   p(90)=18.18s  p(95)=21s
     iterations.....................: 626    1.027169/s
     vus............................: 2      min=2          max=10
     vus_max........................: 10     min=10         max=10

Response time, as expected, is significantly better than the default “ant” client, but as I’ve seen before, there are intermittent issues retrieving the same or different chunks from the network on download.

load is 200 users, 1000 iterations total
running with ./k6 run -u 200 -i 1000 ./jadkin-vader.js

jadkins@dev03:~/sn_httpd/k6-v0.56.0-linux# ./k6 run -u 200 -i 1000 ./jadkin-vader.js

         /\      Grafana   /‾‾/
    /\  /  \     |\  __   /  /
   /  \/    \    | |/ /  /   ‾‾\
  /          \   |   (  |  (‾)  |
 / __________ \  |_|\_\  \_____/

     execution: local
        script: ./jadkin-vader.js
        output: -

     scenarios: (100.00%) 1 scenario, 200 max VUs, 10m30s max duration (incl. graceful stop):
              * default: 1000 iterations shared among 200 VUs (maxDuration: 10m0s, gracefulStop: 30s)

     data_received..................: 49 GB  148 MB/s
     data_sent......................: 837 kB 3.4 kB/s
     http_req_blocked...............: avg=4.72ms min=1.28µs   med=6.6µs   max=92.01ms p(90)=372.77µs p(95)=68.93ms
     http_req_connecting............: avg=4.16ms min=0s       med=0s      max=89.83ms p(90)=86.96µs  p(95)=65.97ms
     http_req_duration..............: avg=16.34s min=361.21ms med=11.83s  max=1m0s    p(90)=38.68s   p(95)=44.53s
       { expected_response:true }...: avg=15.99s min=361.21ms med=11.83s  max=59.88s  p(90)=37.85s   p(95)=43.45s
     http_req_failed................: 8.60%  172 out of 2000
     http_req_receiving.............: avg=1.85s  min=0s       med=1.12ms  max=39.6s   p(90)=7.04s    p(95)=14.93s
     http_req_sending...............: avg=1.08s  min=5.12µs   med=20.92µs max=24.26s  p(90)=4.24s    p(95)=6.15s
     http_req_tls_handshaking.......: avg=0s     min=0s       med=0s      max=0s      p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=13.39s min=361.15ms med=9.33s   max=1m0s    p(90)=28.71s   p(95)=36.87s
     http_reqs......................: 2000   16.336018/s
     iteration_duration.............: avg=1m5s   min=19.31s   med=1m7s    max=2m7s    p(90)=1m27s    p(95)=1m39s
     iterations.....................: 1000   4.084005/s
     vus............................: 4      min=4           max=200
     vus_max........................: 200    min=200         max=200

Seems the speed is there - 148 MB/s - if the node you are connected to has good upstream. The concurrency, as expected, directly impacts the download speed on the same file. Given the way de-dupe has been implemented, we are going to see some concurrency issues unless developers put heavy caching between the user and the network - as one chunk could be de-duped across 1, 10, or 100s of files, and when requested, that chunk could become inaccessible.

If it’s a poor connection, it’s very easy to identify due to latency and response. In the future, it would be great to see the close group perform a consensus download check on the group, while also being able to request outside the group, so that slow nodes can be shunned by distributed consensus.

It would be interesting if others want to take sn_httpd and K6 for a spin and see what they can achieve :+1:

9 Likes

148 MB/s is mighty impressive and I’m sure it will only improve as the network matures! Great to see sn_httpd holding up to the battering too, tbh - this is probably its biggest test so far! :sweat_smile:

Interesting points re the de-duped chunk holders too. The same goes for popular chunks, due to popular files.

I’d love to see LRU caches between the user and the chunk too. Repeated requests for the same chunk could then be very rapid and at least avoid intentional/accidental DoS.
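
Something like this, as a minimal JavaScript sketch - fetchChunk is hypothetical, and a real cache would bound memory by bytes rather than entry count:

// Minimal LRU using Map's insertion order; repeat requests for a hot
// chunk never touch the network
class LruCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key); // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // evict the least recently used entry (first in insertion order)
      this.map.delete(this.map.keys().next().value);
    }
  }
}

const cache = new LruCache(256);
async function getChunkCached(address) {
  let chunk = cache.get(address);
  if (chunk === undefined) {
    chunk = await fetchChunk(address); // hypothetical network fetch
    cache.set(address, chunk);
  }
  return chunk;
}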

We could do with an update of the Safe Network Primer, with a deep dive into the current architecture. I know a lot has changed over the last year or two and it would be good to get an up-to-date picture of reality again.

I’ve not had a chance to play with advanced scripting in K6 yet, but I like how simple it is to point and squirt at a box, then get some great metrics back. So much simpler than some of the older/established perf test tools. I like it!

5 Likes

The primer is much needed; that said, documentation always lags when the dev team is in the pressure cooker trying to hit deadlines…

3 Likes

That was officially superseded by the Docs website, as a replacement and more.

3 Likes

I noticed that performance has dipped with the current live network, so I thought I’d re-run the above.

Using AntTP 0.3.21 and the latest autonomi libs.

It’s possible there is a regression in AntTP, but it has mostly been evolution rather than revolution. It feels like the network is just responding more slowly, despite it being pretty empty.

From 8 MB/s and around 500 ms download time, to around 0.5 MB/s and 8500 ms for the same batch of files. That’s quite a large shift.

@Jadkin, it would be interesting to see if you get similar or different results to before too. It could also be my setup/connection, etc.

3 Likes

Sure, I can set that up again to run, will be interesting to see what we get.

Are you able to share the K6 test file? I can try running the same on my machine, to rule out any local issues.

Jad

3 Likes

Good idea - see below:

import http from 'k6/http';

export default function () {
  http.get('http://localhost:8080/cec7a9eb2c644b9a5de58bbcdf2e893db9f0b2acd7fc563fc849e19d1f6bd872/1_QdxdljdwBwR2QbAVr8scuw.png');
  http.get('http://localhost:8080/cec7a9eb2c644b9a5de58bbcdf2e893db9f0b2acd7fc563fc849e19d1f6bd872/1_dH5Ce6neTHIfEkAbmsr1BQ.jpeg');
  http.get('http://localhost:8080/cec7a9eb2c644b9a5de58bbcdf2e893db9f0b2acd7fc563fc849e19d1f6bd872/1_pt48p45dQmR5PBW8np1l8Q.png');
  http.get('http://localhost:8080/cec7a9eb2c644b9a5de58bbcdf2e893db9f0b2acd7fc563fc849e19d1f6bd872/1_sWZ4OWGeQjWs6urcPwR6Yw.png');
  http.get('http://localhost:8080/cec7a9eb2c644b9a5de58bbcdf2e893db9f0b2acd7fc563fc849e19d1f6bd872/1_ZT6qplX5Yt8PMCUqxq1lFQ.png');
  http.get('http://localhost:8080/cec7a9eb2c644b9a5de58bbcdf2e893db9f0b2acd7fc563fc849e19d1f6bd872/1_SxkGLnSNsMtu0SDrsWW8Wg.jpeg');
  http.get('http://localhost:8080/cec7a9eb2c644b9a5de58bbcdf2e893db9f0b2acd7fc563fc849e19d1f6bd872/1_bogEVpJvgx_gMHQoHMoSLg.jpeg');
  http.get('http://localhost:8080/cec7a9eb2c644b9a5de58bbcdf2e893db9f0b2acd7fc563fc849e19d1f6bd872/1_LFEyRQMHmxRnZtJwMozW5w.jpeg');
  //http.get('http://vader.lan/cec7a9eb2c644b9a5de58bbcdf2e893db9f0b2acd7fc563fc849e19d1f6bd872/st-patrick-monument.mp4');
}

2 Likes

Basically seeing the same issues as you; the numbers are within the same ranges. It timed out at the 10 minute mark, with 90 iterations out of the 1000 complete.

I will test my code, once I can get an upload to work :dotted_line_face:

Is it a possible regression? I can skip back a few releases if that helps?

Edit: I tried to roll back a few versions, and the network versions from the Beta build aren’t compatible with the Live network - I’ll try fiddling with the dependency versions; hopefully not too many API changes :thinking:

WARN  ant_bootstrap::contacts] Network version mismatch. Expected: 1_0.3, got: 1_1.0. Skipping.

2 Likes

It’s not impossible, but it’s mostly just been some refactors since beta.

I did do an off-the-cuff test last week and it was about twice as fast then. So, there may be a trend forming.

I can do some more tracing too, but I think most of the time is spent waiting on the autonomi library to retrieve the data.

2 Likes

We will have a release coming out very soon that should address various performance issues we have seen, and hopefully make for a better situation with uploads too.

Now that we have so many nodes, I do wonder how long the update will take to propagate through the network, but hopefully, if you run these tests again in a week or so, you will get better results.

6 Likes

Sounds great! We can do a few more runs over the coming weeks and see how it progresses.

3 Likes

Perhaps the distribution of XOR addresses is such that the antnodes in the close group servicing the upload are now more geographically dispersed in general? This would account for slower response times, with additive latency through more IP network hops.

That said, the XOR address assigned to an antnode at boot is random,

so it could be partly luck of the draw (lots of geographically far-away, dispersed XOR addresses). However, if the node-launchpad is set to HOME,

my understanding is that all booted antnodes work through a paired relay node (close in XOR address, non-HOME), which could also be geographically far away, with lots of latency because of multi-network hops, albeit close in XOR address terms,

or maybe I am smoking something and don’t get it?
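
For anyone unfamiliar with the idea being poked at here: XOR distance is purely numeric, so two addresses can be neighbours in XOR space while the nodes holding them sit on opposite sides of the planet. A quick JavaScript illustration, using two XOR addresses from earlier in the thread:

// XOR distance between two 256-bit addresses (hex strings);
// says nothing at all about geographic distance or network hops
function xorDistance(hexA, hexB) {
  return BigInt('0x' + hexA) ^ BigInt('0x' + hexB);
}

const a = '7ca488701eb318c05ecfea806245199f85d7987d5f73f7afea4a68b2437e5285';
const b = 'cec7a9eb2c644b9a5de58bbcdf2e893db9f0b2acd7fc563fc849e19d1f6bd872';
console.log(xorDistance(a, b).toString(16));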

1 Like