Are there specific ports to port-forward (udp/tcp)?
For now I will just try to use the hard-coded default ports 12000/tcp and 12000/udp?
I think you might be best to wait for testnet V6.1 which sounds like it could arrive tomorrow.
Is there a mechanism in place to handle uploads when a file transfer is stopped halfway through?
E.g. I can imagine some people uploading a file of several GBs and then cancelling because it takes too long.
Will this data be floating around or will the elders know to issue a timeout and then remove it from their storage/memory usage?
It would be best if any nodes that are handling it get paid for their work nonetheless, to prevent spam attacks and put the responsibility on the user for maintaining a stable connection until the file is uploaded.
Option to pay more for higher priority? People who are more patient could pay less, similar to BTC fees (fast/normal/slow).
At the moment each chunk is paid for and stored forever.
This is what we do, but we are looking to have storage contracts. So once a contract is created you can upload the data at any time, even repeatedly, and the contract itself is proof of payment. Then we can batch that really nicely.
Some context for future reference: it looks like there are two main commits for this.
Is there anything else that's worth pointing out for changes to logging?
I tried running a local network with RUST_LOG=off and no logs were generated (not even an empty file), but the performance was not noticeably different to running with logs. Is there anything particular that should be done to the environment, or are all the logging benefits already there in the code?
Actually, I think we just need the udp protocol.
During the previous testnet, with IGD, I saw port forwarding on the udp protocol only.
The log issue was more like a memory leak, so over time the logs grew unbounded. The biggest improvements in memory have been getting rid of unbounded containers and reducing the number of messages required. Right now the whole messaging is again being simplified (a lot), to the point of telling the story or describing the algorithms. The hope is that looking at the messages will allow anyone to see all data flows easily. Especially devs.
The mempool concept with an upload queue (highest fee is next in line after the current action) would stimulate the use of Safe Network Tokens.
Basically FIFO, unless there is a higher attached fee in the current queue.
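For what it's worth, a minimal sketch of that kind of fee-ordered queue (purely illustrative; the types and fields here are made up, not anything in the actual codebase):

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Hypothetical pending upload entry; fee and seq are illustrative fields only.
#[derive(PartialEq, Eq)]
struct PendingUpload {
    fee: u64,           // attached fee in tokens
    seq: u64,           // arrival order, to keep FIFO among equal fees
    chunk_id: [u8; 32],
}

impl Ord for PendingUpload {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher fee wins; on a tie, the earlier arrival (lower seq) wins.
        self.fee
            .cmp(&other.fee)
            .then_with(|| other.seq.cmp(&self.seq))
            .then_with(|| self.chunk_id.cmp(&other.chunk_id))
    }
}

impl PartialOrd for PendingUpload {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// BinaryHeap is a max-heap, so pop() returns the highest-fee (then oldest) upload.
fn next_upload(queue: &mut BinaryHeap<PendingUpload>) -> Option<PendingUpload> {
    queue.pop()
}
```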
Yes, just udp, and you can use any ports you want to port forward; just make sure the router is set with the same ports and IP as your computer. The port numbers can be the same or different on the router, but the combination of internal and external ports must be the same on the router and the computer.
i.e. These are all Ok
local_ip::12000 : remote_ip::12000
local_ip::20000 : remote_ip::23333
As long as both router and computer are set with the same mapping of local/external it is ok.
It would be interesting to know more details here; it seems very strange for logging to make that much difference.
I tried uploading a 10MiB file, 5 tests with and without logging, new testnet for each test (so it never got very large logs). Maybe this particular test doesn't reproduce the test where you saw the improvements?
With logging (seconds for upload to complete)
38.333
37.182
37.360
38.614
38.768
Without logging - RUST_LOG=off
37.659
29.017 (yes 29, not a typo)
38.064
38.971
38.677
Does appending to large files on disk cause slowdown over time? Seems like it would be a very quick operation, not one that would slow down as the file grows. Would be interesting to get more details and reproduce the no-logging speedup again. 2x-3x is a big improvement.
I had a quick look around to see if there was anyone else with similar experience, didn't find much. The Rust Performance Book - Logging and Debugging doesn't have much to say. Anyone know of any other third-party resources for logging performance (rust or otherwise)?
This reddit thread on env_logger was my source for log-related slowdowns there (they talk about env_logger being sync and needing to retake the write lock for each log write). It seems we're already using `flexi-logger` in the node, though I'm not sure how it stacks up on this front.
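To make the concern concrete, here's a toy sketch (not env_logger's or flexi-logger's actual internals) of the difference between flushing every line to disk under a lock versus letting a buffer soak up the writes:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::sync::Mutex;

// Toy "sync" logger: every log call takes the lock and does a write syscall.
struct PerLineLogger {
    file: Mutex<File>,
}

impl PerLineLogger {
    fn log(&self, line: &str) {
        let mut f = self.file.lock().unwrap(); // lock retaken on every call
        writeln!(f, "{}", line).unwrap();      // unbuffered write per line
    }
}

// Toy buffered logger: log calls land in an in-memory buffer; the file is
// only hit when the buffer fills (or on flush), amortising the I/O cost.
struct BufferedLogger {
    writer: Mutex<BufWriter<File>>,
}

impl BufferedLogger {
    fn log(&self, line: &str) {
        let mut w = self.writer.lock().unwrap();
        writeln!(w, "{}", line).unwrap();
    }
}
```

Both toys still take a lock per call; the difference is whether each call also pays for a write syscall. An async/offloaded writer (as mentioned with tracing further down) removes even the lock contention from the hot path.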
What we were seeing in general with prior internal testnets (and thus what may have happened here) was the network getting progressively slower. And then we would also be generating massive logfiles (100-200 MB of logs sometimes, depending on log levels). It's still not clear if this is definitely something that's been an issue, but it seems to have had an impact.
My 2-3x number comes from upping a droplet network with no logging turned on (removing -vvvvv from our droplet update) and I saw the increase in speed. Before I was seeing all client tests (cargo test --release in sn_client on a 51 node, post-split network) take ~200-300s on a fresh network; with this change I was getting a reasonably consistent ~100s-130s.
The log rotation used on our last few internal networks has felt more stable over long runs (though without such a wild improvement).
Hopefully we'll see soon enough when we get 6.1 up!
edit: it may also be worth noting that when testing on DO, we'd normally be hitting it from a few computers simultaneously. It's difficult to gauge the numbers / impact here as it's not (yet) super scientific (there would be some nice tests to get in… running a set of tests with "background use" to get more standardised results).
We're slowly building up some more standard checks which will run against standalone networks. I'll hopefully get to looking at hooking our churn test script into the CI flow this week, and then we can start expanding that to get more "real world" (WAN) confidence for each PR (something we couldn't realistically have set up when we had 5/6/7 different repos for what is going into the mono repo now).
And beyond that we can hopefully start to look at other testing frameworks (not sure what will be most relevant here).
It's fun stuff trying to figure out how to test this network tbh!
another edit: our current churn tests, for the curious. Super simple: we start a small network of 11 nodes, upload some data (1-7 MB files here), then increase the node count to 50 or so nodes, and we verify we can retrieve our data.
We then do a few loops dropping two nodes at a time (2, as they could theoretically be elders, so we can't lose more than 2 at one time), checking at each stage for our data again.
It's not the most elegant setup, but it increases our confidence in data retention over churn.
And as we get this hooked up we'll start expanding and testing more scenarios.
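For anyone who just wants the shape of it at a glance, here's a rough, hypothetical outline of that loop (the helper functions are stand-ins that only print, not the real sn_testnet_tool commands):

```rust
// Stand-in helpers; in reality these steps are testnet scripts, not Rust functions.
fn start_network(nodes: usize) { println!("starting {} nodes", nodes); }
fn upload_test_data() { println!("uploading 1-7 MB test files"); }
fn grow_network_to(nodes: usize) { println!("growing network to {} nodes", nodes); }
fn drop_nodes(count: usize) { println!("dropping {} nodes", count); }
fn verify_data() -> bool { println!("fetching and verifying the test files"); true }

fn main() {
    start_network(11);
    upload_test_data();
    grow_network_to(50);
    assert!(verify_data());

    // Drop two nodes at a time (they could both be elders), and check the
    // data is still retrievable after each round of churn.
    for _round in 0..5 {
        drop_nodes(2);
        assert!(verify_data());
    }
}
```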
Jepsen and other simulation frameworks have been mentioned in the last week, but I'm not super familiar with such things yet. But it's all interesting stuff!
Ah, very very interesting, thanks for clarifying!
This has really piqued my curiosity, I want to have a crack at reproducing when I get some time.
I did some early poking around on my laptop: 4-core hyperthreaded, 16 GB memory, 256 GB SSD.
Start 11 node baby-fleming.
Then run:
sn_client (v0.61.1) $ time cargo test --release
Took 5m32s
After the test the node log sizes were between 300 and 800 KiB
Using RUST_LOG='sn_node=debug'
Repeat with RUST_LOG=off
Took 5m36s
No node log files were created
Not reading any meaning into the test, it was just a first toe-in-the-water type of thing. For example, during the test my CPU was at 100% the whole time, so the results could possibly be misleading due to the node:core ratio.
Creating a 300 MiB file took just over 8s on my laptop:
dd if=/dev/urandom of=/tmp/data300.bin bs=1M count=300
8s is not long in the scale of a 5 minute test, and in that test period the logs were not significant, nowhere near 300 MiB total.
I'm not really seeing how logs of hundreds of megabytes could cause significant slowdown. Although, as you say, repeatedly taking a write lock could be an issue.
I came across this java logging performance post where they're regularly talking about millions of messages per second: Log4j – Performance
Anyways… really fascinating stuff, and I'll fire up my DO account when I get a chance and see if I can reproduce it, because it's very curious to me that logging would have that much effect.
Massive logging may create a large number of small objects, which will slow down future memory allocations even after they are freed (just a guess).
I do not believe large log files are the cause either. Unless, of course, the whole file is read, appended to in memory and then written back on each log operation.
The problem is that the lags keep increasing in duration from network start. So quadratic/exponential complexity is hiding somewhere.
Aye. It may not be the logging itself at all. Could be a red herring indeed.
We'll see with the log rotation working properly soon enough (which is assuming flexi works like env logger). The 'tracing' lib seems to be the gold standard these days in Rust and is built for async code, so switching the node code over will likely help (I believe routing was already using this).
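As a rough sketch of why tracing can help here (assuming the tracing-subscriber and tracing-appender crates; this isn't the node's actual setup): log lines go through a non-blocking writer backed by a dedicated worker thread, so the hot path just hands the line off rather than holding a file lock for the write.

```rust
fn main() {
    // Roll the log file daily; the path and file name here are illustrative.
    let file_appender = tracing_appender::rolling::daily("./logs", "sn_node.log");

    // Writes are pushed to a background worker thread; keep the guard alive
    // so buffered lines are flushed when the program exits.
    let (writer, _guard) = tracing_appender::non_blocking(file_appender);

    tracing_subscriber::fmt().with_writer(writer).init();

    tracing::info!("node started");
}
```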
Sounds like music!
Ditto to that. As long as it sounds positive it pacifies a little. But it's just the two of us, so we're the minority here.
Adding a bit more context to the -vvvvv flag:
The -vvvvv log level for nodes is defined in sn_testnet_tool provider L37 and used in sn_testnet_tool node L53 as a flag to the sn_node binary.
sn_node uses the verbosity flag in sn_node utils L112 to create a RUST_LOG filter of sn_node=trace.
The deployment to DigitalOcean uses terraform to start the nodes, not safe node run-baby-fleming.
The equivalent local command to what is running on DO (from a logging point of view) is
export RUST_LOG=sn_node=trace; safe node run-baby-fleming
When I tried running baby-fleming using RUST_LOG=sn_node=trace I saw only sn_node logs, up to the trace level (as expected). All other libraries (e.g. routing, quinn) do not log (as expected). This is important to clarify since quinn is especially verbose at trace level, and it's important that quinn is never set to trace level logging because it really destroys the performance. I tried running baby-fleming with RUST_LOG=trace (i.e. all libraries logging at trace level) and quinn accounts for more than 99% of the log messages.
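One small note on filters: the comma-separated directive syntax lets you combine levels per crate, so you could keep sn_node noisy while pinning quinn down, e.g. RUST_LOG=sn_node=trace,quinn=warn. The same directives work with tracing's EnvFilter (tracing-subscriber with the env-filter feature) if/when the node moves there; a sketch, not the node's actual config:

```rust
use tracing_subscriber::EnvFilter;

fn init_logging() {
    // Same directive syntax as RUST_LOG: sn_node at trace, quinn capped at warn
    // so it can never flood the logs the way quinn=trace does.
    let filter = EnvFilter::new("sn_node=trace,quinn=warn");
    tracing_subscriber::fmt().with_env_filter(filter).init();
}
```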
I did a quick run of sn_client tests with nodes logging set to sn_node=trace and it took 5m22s (compared to ~5m30s with no logging), so it looks like I'm not getting a big slowdown from trace level logging (or speedup from disabling logging).
Still only looking at the surface here, but itās good to keep track of the details along the way.
Yep, good info there.
Also worth mentioning that prior to Friday (just got a fix in on the eve there), the droplet deploy script was writing to stdout and being piped to a logfile, not using the nodes' baked-in logging. I'm not sure if that may account for something there or not.
I was trying a few other testnet things yesterday. I didn't get any wild perf results, but I was able to (I think) see a few issues that may have contributed to our Put/Get woes. Starting some deeper log analysis this morning.
Ah yeh, forgot to highlight this yesterday.
This is definitely something that's been heavy in the network. Before we squashed a lot of unnecessary messages, WireMsg::serialize was our source of memory growth. Massively. Alongside the reduction in messages there were some perf tweaks here (we were also re-serialising the whole message to send essentially the same thing to a different node; now we use a separated-out payload + header), which we already got in and which had a huge impact on per-node mem-usage. There are more changes coming now we're in a mono-repo, unifying messaging across crates and adding a few more useful bits to the header, so we can use this more often.
This should hopefully avoid even more unnecessary deserialize calls.
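A rough illustration of the payload + header idea described above (the names here are made up, not the actual WireMsg API): serialize the payload once into a Bytes handle, then only rebuild the small per-peer header, since cloning Bytes is a reference-count bump rather than a copy.

```rust
use bytes::Bytes;

// Hypothetical outgoing message: a small per-peer header plus a shared payload.
struct OutgoingMsg {
    header: Vec<u8>,
    payload: Bytes, // Bytes::clone just bumps a refcount; the bytes aren't copied
}

fn build_for_recipients(payload: Bytes, recipients: &[&str]) -> Vec<OutgoingMsg> {
    recipients
        .iter()
        .map(|peer| OutgoingMsg {
            // Only the cheap per-peer header is rebuilt for each destination.
            header: format!("dst={}", peer).into_bytes(),
            payload: payload.clone(),
        })
        .collect()
}
```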
I wish there was a way I could tell whether e.g. a put has hung or is still working. Is there?
ps aux | grep safe
isn't enough.