Node Manager UX and Issues

I had that thought too, but we’ve had this issue for many months so it might be that, or it could be something more fundamental.

We need to know what it is so we can understand how it may affect the network and whether it needs addressing. Unfortunately there has been very little response or explanation of this. Qi engaged not that long ago but I don’t think we shed much light.

@chriso not 100% sure this belongs here, but twice now I have had a host running fine, only to come back to it later and find all nodes stopped.

I believe both occurrences were on ARM. (Sorry, I did not pay much attention the first time, I thought it was a fluke.) I am 99% sure it was ARM the previous time too. Both times were on this new network.

It ran fine for several hours and even went through a log rotation.

I saw mention of this on Discord too, but they were blaming screen, which I don’t get. It was running as a service, without --auto-restart obviously.

I run on UPS so unlikely that it was power related and other ARM devices on the same power source continue without issue.

Edit: it is unlikely to be an ARM issue. I just remembered that all my x64 hosts run with --auto-restart, so I would not see this there.

rock64@four:/var/log/safenode/safenode3$ sudo safenode-manager status
╔═══════════════════════╗
║   Safenode Services   ║
╚═══════════════════════╝
Refreshing the node registry...
Service Name       Peer ID                                              Status  Connected Peers
safenode1          12D3KooWMw64rHTVegcwQchbMtzDjvxStcm4wD3pZT6h4qy7Kxgh STOPPED               -
safenode2          12D3KooWFprBycPZ25F7VfHn4CWUm9n4kw8CDd66gtzTuzXFgKqA STOPPED               -
safenode3          12D3KooWRawNaavF6mFENxWzZGrnnKrhsvGMDZjmAoKpmowxXTAf STOPPED               -
safenode4          12D3KooWSFB1VBfj5QbPt3d34HMaCCAHbdqjq25fSmhUyKaL8vZS STOPPED               -
safenode5          12D3KooWSTtebV7CXDAZifmfaH6VBVFTmw3o73r1UGRYA4b4naeZ STOPPED               -
safenode6          12D3KooW9wdpfoiDCDkFYrH9u6HWtmVFqjbJgqjC6p8evfPm38pX STOPPED               -
safenode7          12D3KooWAJDn9L5fw14J1Xb7MW1rSXzgjoSHtfDoCQ31BUs51qYm STOPPED               -
safenode8          12D3KooWHVqMSVpMD8io4qVTYBUDS3V4e21vtcvPrWbXjXZJLbWB STOPPED               -
safenode9          12D3KooWKBxjTFhVzzFHf8R8Xvjw8AHZkTibVmvRL3Y4TRXxPgT7 STOPPED               -
safenode10         12D3KooWKBxjTFhVzzFHf8R8Xvjw8AHZkTibVmvRL3Y4TRXxPgT7 STOPPED               -
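If more reports like this come in, a quick way to summarise a pasted status table could look like the sketch below. It only parses text in the column layout shown above (service name, peer ID, status, connected peers); the function names are made up for illustration and it does not call safenode-manager itself. Note that in the output above safenode9 and safenode10 show the same Peer ID, which may just be a copy/paste slip, but the sketch flags it either way.

```python
from collections import Counter, defaultdict

# Three rows copied from the status output above, used as sample input.
SAMPLE = """\
safenode8          12D3KooWHVqMSVpMD8io4qVTYBUDS3V4e21vtcvPrWbXjXZJLbWB STOPPED               -
safenode9          12D3KooWKBxjTFhVzzFHf8R8Xvjw8AHZkTibVmvRL3Y4TRXxPgT7 STOPPED               -
safenode10         12D3KooWKBxjTFhVzzFHf8R8Xvjw8AHZkTibVmvRL3Y4TRXxPgT7 STOPPED               -
"""


def parse_status(text):
    """Return one dict per service row; header and banner lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A service row has exactly four whitespace-separated fields.
        if len(fields) == 4 and fields[0].startswith("safenode"):
            name, peer_id, status, peers = fields
            rows.append({"name": name, "peer_id": peer_id,
                         "status": status, "peers": peers})
    return rows


def duplicate_peer_ids(rows):
    """Map each peer ID that appears more than once to its service names."""
    by_peer = defaultdict(list)
    for row in rows:
        by_peer[row["peer_id"]].append(row["name"])
    return {pid: names for pid, names in by_peer.items() if len(names) > 1}


rows = parse_status(SAMPLE)
status_counts = Counter(row["status"] for row in rows)
print(status_counts)
print(duplicate_peer_ids(rows))
```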

It’s official! I earned my first 10 nanos!!! Now I know what my somewhat obvious issue was, I’m going to attempt getting my MacBook hooked on Amphetamine.

Sorry, because you mentioned --auto-restart, I’m not completely clear what the problem is here.

Are you saying that the nodes were not coming up after the machine was reset?

Or there is just a random failure that is unrelated to anything to do with restarts?

Neither.

The short of it is: twice now (since the restart of the network), and never before, this has happened to me.

All nodes are mysteriously stopped for no reason I know of.

Also saw another user on Discord mention this too.

So it occurred at least 3 times in two days. Seems fishy to me.

I know there is not much to go on here, I am just noting it so that if more reports come in, we know something is wrong.

This could be part of MaidSafe’s testing. It could even be a deliberate attempt by them to fill up a certain area, just to see what the effect is and whether the network can cope.

Right ok, thanks. I do recall actually seeing one other person say this, I think. It was on Windows.

Is there anything in the node logs?

Nothing that jumps out at me. I only glanced at the last few lines; they were seemingly just going about regular node business.

System uptime is ok? Anything interesting in syslog?

Last restart was me when I set the nodes up. So that eliminates a power issue. Not seeing anything notable in syslog.

I notice my memory consumption per node has grown steadily and has doubled since Monday’s restart on the new code. This pushed my whole MS Win 11 system into paging to disk via VRAM. The problem might be related: as a remedy you may need to expand the VRAM allocation in combination with /swap, given the number of nodes you are running. Where the peers per node in your close group are large, the aggregate peer count is really high, so perhaps there is a lot more spiky consumption of RAM to handle the relays of chunks?

What does your network bandwidth consumption look like? Is it spiky, close to the ‘practical’ BW I/O GE maximums, given the cores/threads are all busy with the above?

As you can see, I used to have 2+GB of headroom to run a browser with Gmail and an additional tab without paging to VRAM. Now the system pages to VRAM on disk all the time with only 1GB of RAM headroom, and consumes more electricity in the process, writing to the SSD and running the fan, because the processor cores and threads are really busy, with nodes handling on average 230 peers each.

Hello guys :)

I can’t connect my nodes since Monday’s update…

My nodes stay in the “started” state and I get these errors:

[2024-07-11T16:17:21.566148Z DEBUG safenode::rpc_service] RPC request received at 127.0.0.1:45411: NetworkInfoRequest

[2024-07-11T16:17:23.732379Z WARN sn_networking::event::swarm] OutgoingConnectionError to PeerId("12D3KooWFTMtaqu24ddDSXk9v5YxnuhJmTLFRunER1CG4wZ2XLUU") on ConnectionId(1) - Transport([("/ip4/139.59.168.228/udp/56309/quic-v1/p2p/12D3KooWFTMtaqu24ddDSXk9v5YxnuhJmTLFRunER1CG4wZ2XLUU", Other(Custom { kind: Other, error: Right(Custom { kind: Other, error: Custom { kind: Other, error: HandshakeTimedOut } }) }))])

[2024-07-11T16:17:23.732482Z ERROR sn_networking::event::swarm] Dial errors len : 1

[2024-07-11T16:17:23.732492Z ERROR sn_networking::event::swarm] OutgoingTransport error : Other(Custom { kind: Other, error: Right(Custom { kind: Other, error: Custom { kind: Other, error: HandshakeTimedOut } }) })

[2024-07-11T16:17:23.732535Z WARN sn_networking::event::swarm] OutgoingConnectionError: On bootstrap peer PeerId("12D3KooWFTMtaqu24ddDSXk9v5YxnuhJmTLFRunER1CG4wZ2XLUU"), while still in bootstrap mode, ignoring

I started my nodes with:

safenode-manager add --owner "XXX" --count "XXX" --node-port "XXX" --peer /ip4/139.59.168.228/udp/56309/quic-v1/p2p/12D3KooWFTMtaqu24ddDSXk9v5YxnuhJmTLFRunER1CG4wZ2XLUU

Any idea?

It used to work very well that way with last week’s network.

15 safenodes with an average of 230 peers each, times 2 for in/out, = 6900 peer connections, and that is just 15 safenodes… On this little 8GB RAM Win 11 system it’s consuming 2GB+ of RAM (25%), so in practice 15 nodes is the upper limit with the current beta wave 2 version, methinks.
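The back-of-envelope figure above can be reproduced with a few lines. This is only a sketch of the post’s own arithmetic: the function name is made up, the factor of 2 for in/out is the poster’s assumption rather than anything measured from safenode, and the per-node RAM figure simply divides the quoted 2GB across 15 nodes.

```python
def estimated_connections(nodes: int, peers_per_node: int, directions: int = 2) -> int:
    """Rough connection count: nodes x peers per node x directions (in/out)."""
    return nodes * peers_per_node * directions


def ram_per_node_mib(total_ram_gib: float, nodes: int) -> float:
    """Average RAM per node in MiB, given the total used by the whole fleet."""
    return total_ram_gib * 1024 / nodes


connections = estimated_connections(nodes=15, peers_per_node=230)
print(connections)  # 6900

# "2GB+" is a lower bound, so this is a lower bound per node as well.
per_node = ram_per_node_mib(total_ram_gib=2, nodes=15)
print(round(per_node, 1))
```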

How many nodes, on what hardware and OS, and how much RAM is available to the fleet of safenodes?

I only have 3 of these little 4-core SBCs running 12 nodes each.

They only have 2GB of RAM but are not loaded; they are not even using any swap.

top - 16:42:27 up 13:39,  1 user,  load average: 2.85, 3.44, 3.63
Tasks: 148 total,   1 running, 147 sleeping,   0 stopped,   0 zombie
%Cpu(s): 37.1 us, 12.9 sy,  0.0 ni, 46.4 id,  0.1 wa,  0.0 hi,  3.5 si,  0.0 st
KiB Mem :  2037200 total,   180708 free,   754208 used,  1102284 buff/cache
KiB Swap:  1018592 total,  1018592 free,        0 used.  1256664 avail Mem

I have now started the nodes with auto-restart. I am not going to worry myself with it too much; the SBCs are just an afterthought to my actual hosts.

My RAM use seems a stark contrast to yours? Take away buffers/cache and I am using less than 1GB.
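To put numbers on that, here is a small sketch that works from the `top` figures pasted above. The values are copied from that output; the per-node figure assumes all 12 nodes on the board share the "used" memory evenly and ignores everything else running on the system, so treat it as a rough upper bound per node rather than a measurement.

```python
KIB_PER_MIB = 1024
KIB_PER_GIB = 1024 * 1024

# Values in KiB, copied from the "KiB Mem" line of the top output above.
mem_kib = {
    "total": 2037200,
    "free": 180708,
    "used": 754208,
    "buff_cache": 1102284,
}
NODES_ON_HOST = 12  # from the post: 12 nodes per SBC

# Sanity check: top's three buckets add up to the total.
assert mem_kib["free"] + mem_kib["used"] + mem_kib["buff_cache"] == mem_kib["total"]

used_gib = mem_kib["used"] / KIB_PER_GIB
used_per_node_mib = mem_kib["used"] / KIB_PER_MIB / NODES_ON_HOST

print(f"used (excluding buffers/cache): {used_gib:.2f} GiB")
print(f"upper bound per node: {used_per_node_mib:.0f} MiB")
```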

@Josh Yes, I think this is the MS Windows tax at work. MS Defender gets really busy checking everything, which also uses about 10% of the CPU on this 10th-gen Intel Aspire notebook, plus some additional RAM…

IMO it might soon be time to crunch the Rust code into WASM runtime binaries. Looking forward to the wave 3 beta with the GUI, non? We also need more config advice on the number of nodes versus the type of system and the available core, thread, RAM, and bus (network) resources, to help users optimally set up their safenodes in shared or dedicated mode… A simple gsheet calculator is likely in order…

@Josh Also, the number of peers a node accepts into its close group likely needs to be limited by a discovered configuration (a YAML/JSON config file) that safenodes reference before they boot? Configuring your fleet of safenodes is pretty wild west at the moment… Everybody has their own angle and formula, which makes it really hard to debug the network as it grows…

I’m running 3 nodes on an 8 vCPU, 16GB RAM, 2000GB computer. Ubuntu.

Still no connection to the network. Can’t understand why it used to work before the update, and not since…

Is there any problem with the initial peer I configured?

I’m using

/ip4/139.59.168.228/udp/56309/quic-v1/p2p/12D3KooWFTMtaqu24ddDSXk9v5YxnuhJmTLFRunER1CG4wZ2XLUU
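One thing that can help when checking a bootstrap peer is to split the address into its parts, so the IP, port and peer ID can be compared against whatever the current network publishes. The sketch below is a deliberately tiny parser that only understands the few protocols appearing in this address; it is not a full multiaddr implementation and the function name is made up. Keep in mind the address uses QUIC over UDP, so a plain TCP port check against 56309 would not tell you much.

```python
# Protocols that carry a value after them, and protocols that stand alone.
VALUE_PROTOCOLS = {"ip4", "udp", "tcp", "p2p"}
FLAG_PROTOCOLS = {"quic-v1"}


def parse_multiaddr(addr: str) -> dict:
    """Split a simple multiaddr string into a {protocol: value} dict."""
    parts = [p for p in addr.split("/") if p]
    parsed = {}
    i = 0
    while i < len(parts):
        proto = parts[i]
        if proto in VALUE_PROTOCOLS:
            if i + 1 >= len(parts):
                raise ValueError(f"missing value for {proto}")
            parsed[proto] = parts[i + 1]
            i += 2
        elif proto in FLAG_PROTOCOLS:
            parsed[proto] = True
            i += 1
        else:
            raise ValueError(f"unsupported protocol: {proto}")
    return parsed


peer = parse_multiaddr(
    "/ip4/139.59.168.228/udp/56309/quic-v1/p2p/"
    "12D3KooWFTMtaqu24ddDSXk9v5YxnuhJmTLFRunER1CG4wZ2XLUU"
)
print(peer["ip4"], peer["udp"], peer["p2p"])
```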