Things are moving a bit too fast for me with this automated start up stuff. I would like to help you guys out, but not sure how at this point.
I do see the --do-not-start flag on the upgrade, which is nice to have. That will be useful for some safenode operators!
I’m just wondering here if the staggered start-up thing is possibly a bit of a red herring. Is the problem not really that the node shouldn’t be doing so much when it starts? I could be wrong about that and will need input from the other guys. But I’m not sure we would be expecting the node process to need an enormous amount of CPU time when it starts up.
I’m seeing that on my AWS based node that has been running since the beginning:-
Or could it be that the node is broken in some way? It’s still on
safenode cli 0.105.2
but it should still be working, shouldn’t it? I don’t think it’s doing a lot, and there are a lot of messages like this in safenode.log:-
[2024-04-10T20:14:32.185516Z WARN sn_networking::event] RequestResponse: OutboundFailure for request_id: OutboundRequestId(422592) and peer: PeerId("12D3KooWJ9G1pteFCXrtE6Cb5Jco2mGRR2e8YPChL6KMrXxUvPTd"), with error: UnsupportedProtocols
That’s all it’s really doing, just logging that.
I started up 20 nodes, and will be starting another 15 pretty soon, hopefully. I’ve gotten some records, but no nanos, yet.
I’ve given up on using safenode-manager for now and am starting up 20 nodes the old fashioned way. They are still on 0.105.2 and some are instantly filling. I had to specify my own peer:-
export SAFE_PEERS="/ip4/44.214.100.193/udp/12001/quic-v1/p2p/12D3KooWFo7KPU3cf2XuBqbgwDmZJWg33oTnM2wUx4fzijJkgAQi"
Letting them use the default Maidsafe peers didn’t work, I’m guessing because of the compatibility problem.
So if you want to get some nodes running and collecting records until Maidsafe resolve the issue, feel free to use that as the --peer.
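For anyone wanting to do the same, it’s roughly this for each node (just a sketch; the data dir and port are example values, and the flag names may differ on your safenode version, so check safenode --help):-
# assumes SAFE_PEERS is exported as above; dir and port are examples
mkdir -p ~/safenode-data/node1
nohup ./safenode --root-dir ~/safenode-data/node1 --port 12101 > ~/safenode-data/node1.out 2>&1 &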
My nodes are gathering records nicely and the prices are all either 0 or 10. So far, so good!
However, uploads are struggling. I can get the odd chunk here or there to upload, but mostly they’re failing with warnings like this one:
[2024-04-10T21:03:35.841459Z INFO sn_networking::event] Query task QueryId(196) NotFound record 0231dc(1715b9b2780cd8121264e6ed83b032c7f4b7a3e773823ef92de34514cee8deee) among peers [PeerId("12D3KooWGkYeVory41jdnfNdSwCWfFgEiAisXa4iRqdkSdUS73vB"), PeerId("12D3KooWSm67PsTffQu8gb4mL1XCAWCxmzbHeSfSxQ2BAxnJRVcP"), PeerId("12D3KooWHPW4rztzRekjHRW5xgVnzH1DveFLFdfPo98HGK5ZF2XE"), PeerId("12D3KooWRAL82BLfPp1hSt42w7BGNc2sGBKroovrLvKnA1ME5Zpy"), PeerId("12D3KooWJtcAZYBbRtCAhN25MHMkdwiFBaQkoiuMziRKhZXyeAk8")], QueryStats { requests: 38, success: 38, failure: 0, start: Some(Instant { tv_sec: 228023, tv_nsec: 309763040 }), end: Some(Instant { tv_sec: 228027, tv_nsec: 786282270 }) } - ProgressStep { count: 1, last: true }
[2024-04-10T21:03:35.841545Z DEBUG sn_networking::get_record_handler] Get record task QueryId(196) failed with {PeerId("12D3KooWJtcAZYBbRtCAhN25MHMkdwiFBaQkoiuMziRKhZXyeAk8"), PeerId("12D3KooWGkYeVory41jdnfNdSwCWfFgEiAisXa4iRqdkSdUS73vB"), PeerId("12D3KooWHPW4rztzRekjHRW5xgVnzH1DveFLFdfPo98HGK5ZF2XE"), PeerId("12D3KooWRAL82BLfPp1hSt42w7BGNc2sGBKroovrLvKnA1ME5Zpy"), PeerId("12D3KooWSm67PsTffQu8gb4mL1XCAWCxmzbHeSfSxQ2BAxnJRVcP")} expected holders not responded, error NotFound { key: Key(b"\x021\xdc\xcbL\xd7-\x9d;U\xeb\xf9\xb5lLVCKk\xcc\x95\t \xb6\x19\xe0\x80\xd2}\xa1\xe6@"), closest_peers: [PeerId("12D3KooWGkYeVory41jdnfNdSwCWfFgEiAisXa4iRqdkSdUS73vB"), PeerId("12D3KooWSm67PsTffQu8gb4mL1XCAWCxmzbHeSfSxQ2BAxnJRVcP"), PeerId("12D3KooWHPW4rztzRekjHRW5xgVnzH1DveFLFdfPo98HGK5ZF2XE"), PeerId("12D3KooWRAL82BLfPp1hSt42w7BGNc2sGBKroovrLvKnA1ME5Zpy"), PeerId("12D3KooWJtcAZYBbRtCAhN25MHMkdwiFBaQkoiuMziRKhZXyeAk8")] }
[2024-04-10T21:03:35.841682Z WARN sn_networking] No holder of record '0231dc(1715b9b2780cd8121264e6ed83b032c7f4b7a3e773823ef92de34514cee8deee)' found.
[2024-04-10T21:03:45.325126Z INFO sn_networking] Getting record from network of 0231dc(1715b9b2780cd8121264e6ed83b032c7f4b7a3e773823ef92de34514cee8deee). with cfg GetRecordCfg { get_quorum: Majority, retry_strategy: Some(Balanced), target_record: 0231dc(1715b9b2780cd8121264e6ed83b032c7f4b7a3e773823ef92de34514cee8deee), expected_holders: {PeerId("12D3KooWJtcAZYBbRtCAhN25MHMkdwiFBaQkoiuMziRKhZXyeAk8"), PeerId("12D3KooWGkYeVory41jdnfNdSwCWfFgEiAisXa4iRqdkSdUS73vB"), PeerId("12D3KooWHPW4rztzRekjHRW5xgVnzH1DveFLFdfPo98HGK5ZF2XE"), PeerId("12D3KooWRAL82BLfPp1hSt42w7BGNc2sGBKroovrLvKnA1ME5Zpy"), PeerId("12D3KooWSm67PsTffQu8gb4mL1XCAWCxmzbHeSfSxQ2BAxnJRVcP")} }
[2024-04-10T21:03:45.325236Z DEBUG sn_networking::cmd] Record 0231dc(1715b9b2780cd8121264e6ed83b032c7f4b7a3e773823ef92de34514cee8deee) with task QueryId(197) expected to be held by {PeerId("12D3KooWJtcAZYBbRtCAhN25MHMkdwiFBaQkoiuMziRKhZXyeAk8"), PeerId("12D3KooWGkYeVory41jdnfNdSwCWfFgEiAisXa4iRqdkSdUS73vB"), PeerId("12D3KooWHPW4rztzRekjHRW5xgVnzH1DveFLFdfPo98HGK5ZF2XE"), PeerId("12D3KooWRAL82BLfPp1hSt42w7BGNc2sGBKroovrLvKnA1ME5Zpy"), PeerId("12D3KooWSm67PsTffQu8gb4mL1XCAWCxmzbHeSfSxQ2BAxnJRVcP")}
[2024-04-10T21:03:45.325252Z INFO sn_networking::cmd] We now have 1 pending get record attempts and cached 0 fetched copies
[2024-04-10T21:03:50.244050Z WARN sn_networking::record_store_api] Calling set_distance_range at Client. This should not happen
The CLI keeps retrying over and over, but seems pretty stuck at the moment.
What is the problem you experienced with the node manager? If you start up 20 nodes outwith the node manager, what would you expect to be different? It isn’t the node manager that is causing safenode to use a lot of CPU time when it starts up. Or at least, I don’t see how it could be.
Did you have a specific problem with 0.105.6?
It had upgraded the nodes to the latest version, and that seemed not to work when specifying my node in AWS, which is on 0.105.2. And we have to specify a peer when using safenode-manager, so I couldn’t just let it target the Maidsafe ones.
It isn’t the node manager that is causing safenode to use a lot of CPU time when it starts up
I think I’m getting away with a 2 min delay between starting up nodes now. But in the past what I’ve seen is that starting more than a few at a time causes more traffic than my terrible 100 Mb/s down / 20 Mb/s up link can handle. Then nodes get stopped because they aren’t responding quickly enough. I think the CPU being really busy wasn’t helping either. If I stagger them I can get away with 20 or more, just not starting all at once.
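For what it’s worth, the stagger is basically just a loop with a sleep in it, something like this (a sketch; paths, ports and flag names are example values, and it assumes SAFE_PEERS is exported as earlier):-
STAGGER=120   # seconds between node starts; 2 minutes seems enough for my link
for i in $(seq 1 20); do
    mkdir -p ~/safenode-data/node$i
    nohup ./safenode --root-dir ~/safenode-data/node$i --port $((12100 + i)) > ~/safenode-data/node$i.out 2>&1 &
    sleep $STAGGER
done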
Did you have a specific problem with 0.105.6
Just that it seemed they wouldn’t connect to my peer that is on 0.105.2
The question I have about the node manager is: why is it installing a copy of the binary for every node I start up? Seems inefficient.
OK, thanks for all the feedback.
I think we’re definitely uncovering issues here when attempting to start many nodes at the same time. On the bandwidth side, I don’t know if we could do much about that, but on the CPU side, I’m pretty convinced that we will be able to solve that one. I don’t think we would be expecting safenode to excessively use the CPU at any point in time. So, maybe there needs to be a bit of a mix between fixing stuff in the node and using some staggering mechanism in the node manager. I’m gonna bring it up at the next team meeting.
I’m not sure exactly what you’re referring to here, but you don’t need to specify a peer when you upgrade. The command operates over all the nodes.
I think we still have to specify a peer when first installing nodes using safenode-manager. I wasn’t using it before tonight, so I had no nodes to upgrade and had to start them by specifying a peer. Now I think about it, I could have got one from the network contacts file though. Doh!
It’s possible that nodes could be on different versions. It’s just easier to keep them isolated that way, when the binary is not shared. Plus, for an upgrade, if every service shared the same binary, all the services would need to be stopped before you could safely change the binary. It’s just far easier to manage with each service having an isolated binary. The disk space is pretty insignificant.
Ah ok, sorry, I see what you mean! That won’t be required in the next release.
This is just one of my nodes’ current logs.
There is a lot of repetitive blocking of already blocked peers.
Quite a lot of shunning.
Tough life being a node these days.
wyse@wyse0:~$ grep "blocking it" /var/log/safenode/safenode1/safenode.log | awk '{print $5}' | sort | uniq | wc -l
26
wyse@wyse0:~$ grep -c "blocking it" /var/log/safenode/safenode1/safenode.log
1071
wyse@wyse0:~$ grep CloseNodesShunning /var/log/safenode/safenode1/safenode.log | awk '{print $5}' | sort | uniq | wc -l
91
wyse@wyse0:~$ grep -c CloseNodesShunning /var/log/safenode/safenode1/safenode.log
197
That’ll all be initial connection work, I think. We populate the routing table, which means hundreds of initial connections, which is the most CPU-intensive thing we do.
Down the line we could look to limit connections or CPU usage… Or just slow down initial activity.
@joshuef @chriso Is it possible, when nodes start from a system startup, for them not to all instantly start doing their work, but for each to have a delay period according to its PID? Using the PID gives the node an indication of order.
Each node would look at the other node processes running, use their PIDs to work out its position, and delay by, say, “X” seconds times its position. It seems X should be something like 120 seconds or more, but I’m not sure on this.
So on startup with the init process, each node would wait a second, then look at the nodes already started, determine its position in that list, work out its delay from that position, and then wait that length of time. It would emit a log message (if enabled) stating this, so the user can go back and check everything is OK.
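A wrapper script sketches the kind of thing I mean, counting the safenode processes already running rather than ranking raw PIDs (just an illustration; the real version would live inside safenode itself, and it assumes nodes come up one after another rather than all in the same instant):-
#!/bin/bash
# hypothetical wrapper: delay startup by X seconds per safenode already running
X=120
POSITION=$(pgrep -c -x safenode)    # nodes started before this one
DELAY=$((POSITION * X))
echo "safenode wrapper: position $POSITION, waiting ${DELAY}s before starting"
sleep "$DELAY"
exec ./safenode "$@"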
Could be. Not sure if that’s the best bet here vs some CPU/connection limiting as something more general we may need…
(Could certainly be an easy enough way round this issue though.)
Happy cake day, @neo !
On lower-end machines this might even (probably will) cause a first-time bad node indicator, with nodes not responding fast enough.
But my opinion is that we should get nodes to wait for the previous one to finish its initialisation (RT build, etc.) and for the CPU load to reduce before starting the next. I suggested such a thing for the node manager when starting/upgrading nodes.
EDIT: maybe look at the previous node and wait for its load to reduce, rather than the total system load.
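As a rough sketch of what I mean, assuming the start script or node manager knows the previous node’s PID (the threshold and poll interval are just example values, and ps’s %cpu is a lifetime average, so a real version would probably sample /proc/PID/stat instead):-
#!/bin/bash
# hypothetical: wait until the previous node's CPU usage settles before starting the next
wait_for_node_to_settle() {
    local prev_pid="$1" threshold="${2:-20}"   # threshold in % CPU, example value
    while :; do
        cpu=$(ps -o %cpu= -p "$prev_pid" | tr -d ' ')
        [ -z "$cpu" ] && break                 # previous node has exited
        awk -v c="$cpu" -v t="$threshold" 'BEGIN{exit !(c < t)}' && break
        sleep 10
    done
}
# usage, after starting a node in the background:
#   wait_for_node_to_settle "$!" 20
# then start the next node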



