Late to the party tonight - excellent update, makes a lot of stuff clearer and loving the simplifications/optimisations.
I can report that on the last release I can consistently crash all sn_node processes with a single upload of a 1.9 GB file.
When I try to upload directories of smaller files, around 300 MB each, the sn_node processes appear to stall rather than crash.
By this I mean that I regard any sn_node adult process whose disk writes do not exceed 4kb/sec as stalled. The adults are the processes that have the largest Disk Write totals, i.e. the top four in the screenshot below
and here are the CPU and memory graphs for uploading single ~300M files, then a directory of similarly sized files. The final spike is me running the log collection process
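For anyone who wants to apply the same stall rule without eyeballing a system monitor, here is a rough Python sketch. It assumes you have already collected (timestamp, cumulative bytes written) samples per process, for example from the `write_bytes` field of `/proc/<pid>/io` on Linux. The function names and the reading of "4kb/sec" as 4 KiB/s are my assumptions, not anything from sn_node itself.

```python
from typing import Dict, List, Tuple

# Assumption: "4kb/sec" is read as 4 KiB per second.
STALL_THRESHOLD_BYTES_PER_SEC = 4 * 1024

Sample = Tuple[float, int]  # (timestamp in seconds, cumulative bytes written)


def write_rate(samples: List[Sample]) -> float:
    """Average disk write rate in bytes/s between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    return (b1 - b0) / (t1 - t0)


def stalled_processes(
    samples_by_pid: Dict[int, List[Sample]],
    threshold: float = STALL_THRESHOLD_BYTES_PER_SEC,
) -> List[int]:
    """Return the pids whose write rate does not exceed the threshold."""
    return sorted(
        pid
        for pid, samples in samples_by_pid.items()
        if write_rate(samples) <= threshold
    )
```

This only classifies the samples it is given; picking out which pids are adults (the ones with the largest write totals) is left to whoever collects them.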
I want to replicate the real world case of folk uploading their favourite videos/photos/music, whatever. These will likely be large directories of perhaps 1 GB+ files.
Am I making life hard for myself by doing this on a local baby-fleming? Cos in real life this will be done on a machine running ONE node in almost every case. So I'm not sure if I/we are poking the correct corners?
But if it results in more efficient code then I suppose it's worthwhile. So major thanks to @joshuef and the rest of the devs.
I just tried a local testnet with 30 nodes and large files (a folder of 1.5-2.5 GB files). It stalled after some time, but none of the nodes crashed.
Before it stalled I was seeing a steady 10 Gbit of traffic on the loopback network interface. I wonder if that is just a coincidence or whether I was hitting a limit somewhere.
This is a huge speedup. Does anyone know the throughput for other large scale storage networks? Just trying to get a feel for how this ranks in a broader context. Probably hard to compare since the context is more important than the number. Any ideas?
This is why I was desperate for someone else to try this. Is there something weird with my setup on this one box or is there a major problem with the code itself?
Now we have two reports, but it's still a crazy small sample; hopefully a few others will try and report back.
I'm glad to see you can replicate the stalling. If you have the logs please put them on filebin.
Remember, this is limited to sending the same msg to many peers. So in reality you won’t be doing more than 7 normally. Sending to clients could be a lot more for popular data (and in fact I think we can ditch the Dst altogether there tbh; it’s not needed for clients).
There will be some tweaking to be done to try and optimise client sends around this. Maybe some batching or some such eg, or the cache stores the encoded bytes as opposed to the data
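To make the "cache stores the encoded bytes" idea a bit more concrete, here is a toy Python sketch of encoding a message once and reusing the same bytes for every recipient. Everything in it is hypothetical: `EncodedCache`, `fan_out` and the use of JSON as the wire format are stand-ins for illustration and are not how sn_node actually serialises messages.

```python
import json
from typing import Dict, List, Tuple


class EncodedCache:
    """Toy cache that stores encoded bytes per message id, not the original data."""

    def __init__(self) -> None:
        self._encoded: Dict[str, bytes] = {}
        self.encode_calls = 0  # counts how often we actually pay for encoding

    def get_or_encode(self, msg_id: str, payload: dict) -> bytes:
        cached = self._encoded.get(msg_id)
        if cached is None:
            # JSON is only a stand-in for whatever serialisation is really used.
            self.encode_calls += 1
            cached = json.dumps(payload, sort_keys=True).encode("utf-8")
            self._encoded[msg_id] = cached
        return cached


def fan_out(
    cache: EncodedCache, msg_id: str, payload: dict, peers: List[str]
) -> List[Tuple[str, bytes]]:
    """Pair every peer with the same encoded bytes; encoding happens at most once."""
    wire = cache.get_or_encode(msg_id, payload)
    return [(peer, wire) for peer in peers]
```

With this shape, sending popular data to many clients costs one encode plus N sends rather than N encodes, which is roughly the saving being discussed.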
I tried to upload a 2GB file on an 11 node network locally (commit 6884257).
The upload failed (process killed) but all the nodes are still running. The cli process was killed after the big peak to the top of the chart on the right. Ran out of memory after 42s of uploading.
Ah interesting. Maybe the client is killing us then. Could well be. @bzee has been poking about wanting to refactor there the last week; so this’ll be another good angle to consider there.
I'm getting this consistently with the latest release BUT the sn_node processes are not stalled and I can continue to upload other files - for now anyway.
willie@gagarin:~$ ls -lh Videos/League_Of_Gentlemen_Xmas_Special_2000_DVDRip_XviD.avi
-rw-r--r-- 1 willie willie 701M Jul 4 2015 Videos/League_Of_Gentlemen_Xmas_Special_2000_DVDRip_XviD.avi
willie@gagarin:~$ safe files put Videos/League_Of_Gentlemen_Xmas_Special_2000_DVDRip_XviD.avi
FilesContainer created at: "safe://hyryyryuk61uid4wkahfmeh85nj3dyhf3ds344jjt711gyect9aobc7zq5cnra?v=hnrijhg7qx435iejzzjm1i5oc3qgw3s3b36ej9r7wwe8nmgxjnfey"
+---+--------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| E | Videos/League_Of_Gentlemen_Xmas_Special_2000_DVDRip_XviD.avi | <Content may have been correctly stored on the network, but verification failed: safe://hy8ayyyx8nubohs8c3wygxuc7n7pdwgh1yobo3naq4bz3f848on9jitwrby> |
+---+--------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
@mods - should these memory/performance related discussions have their own thread? Or even move to the dev forum?
I’d say: keep nodes minimal, 11 nodes is 7+4 adults, so that’s our count there I think.
Then do increasing file sizes up until fail. And if you can kill the client, do that and see if the nodes are living or no? The idea would be to see if it’s the client that causes the mem issues rather than the nodes.
A pal has just arrived to help me with the garage roof. I will be AFK for at least a couple of hours.
Can I suggest folk work down from 750 MB to find the largest file that can be reliably stored?
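If anyone wants to automate the hunt rather than stepping through sizes by hand, a bisection between a size known to work and a size known to fail gets there in a handful of uploads. Below is a minimal Python sketch. `upload_ok` is a hypothetical callback you would write yourself (for instance, generate a file of that size, run `safe files put` on it and check the result), and the sketch assumes that failures are monotonic, i.e. if a size fails then every larger size fails too.

```python
from typing import Callable


def largest_reliable_size(
    upload_ok: Callable[[int], bool],
    low_mb: int,
    high_mb: int,
    step_mb: int = 10,
) -> int:
    """Return the largest size in MB (to within step_mb) for which upload_ok passes.

    Assumes upload_ok(low_mb) passes and that larger files never succeed
    where smaller ones fail.
    """
    if not upload_ok(low_mb):
        raise ValueError("lower bound does not upload; pick a smaller low_mb")
    if upload_ok(high_mb):
        return high_mb  # nothing failed in the range
    # Invariant: low_mb is known to pass, high_mb is known to fail.
    while high_mb - low_mb > step_mb:
        mid_mb = (low_mb + high_mb) // 2
        if upload_ok(mid_mb):
            low_mb = mid_mb
        else:
            high_mb = mid_mb
    return low_mb
```

One caveat: a failed upload may leave the local network in a bad state, so it is probably safest for `upload_ok` to start a fresh testnet for each attempt, and to repeat each size a few times if "reliably" matters.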
I got the “verification failed” message only once with the bigger file and after that the network was stalled.
Btw in the logs I keep seeing these errors even when it works:
[2022-09-02T13:55:02.519243Z ERROR sn_node::node::messaging::agreement] Failed to generate proof chain for a newly received SAP: FailedSignature
[2022-09-02T13:55:02.522048Z ERROR sn_node::node::messaging/validate_msg] Service msg MsgId(3ba5..a40b) from 127.0.0.1:59737 has been dropped since there is no dst address.
And this was on one node when network was stalling:
2022-09-02T14:07:44.624609Z ERROR sn_node::comm/send] Sending message (msg_id: MsgId(a7a5..ec0e)) to 127.0.0.1:34622 (name be854f(10111110)..) possibly failed, as monitoring of the send job was aborted
I restarted the network and did some more testing. Machine is AMD 5900X, 64 GB RAM, server-grade SSD and I am running NODE_COUNT=11 RUST_LOG=sn_node=debug
For ~280 MB files all works consistently. I put 6 files with various pauses between, all had average upload speed 2.3 MB/s (±10%). I got the “verification failed” message on 5/6 files.
Memory usage went up during the first two files and then stayed the same:
For ~410 MB files: I did only two, first file jumped memory usage higher, stayed same after second. Avg. speed 2.7 MB/s, verification failed on one of them.
For a 432 MB file, the upload didn’t finish (I killed it after 30 minutes); memory usage was only a little higher.
After that no upload finishes, not even 10 kB.
Downloads still work (5 MB/s and 30MB/s).
Files that uploaded with the “verification failed” have missing chunks.
No apparent errors in the logs like before. These are examples of the only type of lines with the word “error” that I see in the logs, but they have been appearing since I started this testnet.
2022-09-02T20:07:36.333074Z DEBUG sn_node::comm/send] Transient error when sending to peer b6b048.. at 127.0.0.1:45892: Send(ConnectionLost(Closed(Application { error_code: 0, reason: b"Connection expired." })))
2022-09-02T20:08:21.630443Z DEBUG sn_node::comm/send] Transient error when sending to peer 7af05c.. at 127.0.0.1:44490: Connection(TimedOut)
2022-09-02T20:07:40.625707Z WARN sn_node::comm::peer_session] Transient error while attempting to send, re-enqueing job Send(ConnectionLost(TimedOut))
2022-09-02T20:07:48.559455Z WARN sn_node::comm::peer_session] Transient error while attempting to send, re-enqueing job Connection(TimedOut)
EDIT: I am running from source compiled a few hours back, which should be the latest if I understand it right.
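For comparing logs between testers, a quick tally of "error" lines by level and module is easier to read than a raw grep. Here is a small Python sketch that does that. The regex is my guess at the line layout based only on the samples pasted in this thread (optional leading `[`, timestamp, level, module, `]`, message), so treat it as an assumption rather than the official log format.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Layout inferred from the log lines quoted in this thread (an assumption):
#   [timestamp LEVEL module] message      (the leading "[" is sometimes missing)
LINE_RE = re.compile(
    r"^\[?(?P<ts>\S+)\s+"
    r"(?P<level>ERROR|WARN|INFO|DEBUG|TRACE)\s+"
    r"(?P<target>\S+?)\]\s*(?P<msg>.*)$"
)


def tally_error_lines(lines: Iterable[str]) -> "Counter[Tuple[str, str]]":
    """Count lines that are ERROR level or mention 'error', keyed by (level, module)."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a log line in the layout we expect
        if match["level"] == "ERROR" or "error" in match["msg"].lower():
            counts[(match["level"], match["target"])] += 1
    return counts
```

Feeding it an open log file, e.g. `tally_error_lines(open(path))` with `path` pointing at one node's log, gives a per-module count that can be pasted here alongside the file sizes tested.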