New error introduced with v.0.3.9 (not just Windows!)

Bug Report: “Node events channel closed!” in v0.3.9 – Affects early-started nodes (Windows)

Summary:
After upgrading to v0.3.9, a significant portion of my nodes (typically ~70–80 out of 600) shut down shortly after launch with the error:

Node events channel closed!

This issue did not occur in v0.3.8 using the exact same configuration. The problem appears specific to Windows, as Linux node-runners using --interval 60000 report no similar failures.

System Info:

  • OS: Windows 11 Pro, i9 CPU, 64GB RAM
  • Autonomi version: v0.3.9 (v0.3.8 works fine)
  • Launch: 600 nodes with --interval 150000
  • UDP ports: unique per node, no conflicts

Observed Behavior:

  • Nodes ≤ antNode367 commonly shut down
  • Later-started nodes (> antNode367) stay running
  • Logs show:
    • QUIC ConnectionClosed with empty reason
    • Slowing of RT discovery
    • Then: NetworkEvent channel is closed and forced shutdown

Notes:

  • Not caused by high interval; happens early in startup
  • No resource exhaustion (CPU/mem/ports) observed
  • Likely a regression in v0.3.9, possibly thread/channel related on Windows

A note: usually, when I restart the nodes, the stopped nodes will restart without issues. Here is a snippet of the log from one of the nodes that stopped with this error:

    [2025-04-08T11:50:47.731067Z INFO ant_networking::driver 916] Set responsible range to Distance(18092513943330655534932966407607485602073435104006338131165247501236426506235)(Some(253))
    [2025-04-08T11:51:02.722565Z INFO ant_networking::driver 916] Set responsible range to Distance(18092513943330655534932966407607485602073435104006338131165247501236426506235)(Some(253))
    [2025-04-08T11:51:14.016478Z ERROR ant_networking::event::swarm 478] IncomingConnectionError Valid from local_addr:?/ip4/0.0.0.0/udp/58959/quic-v1, send_back_addr /ip4/65.109.92.166/udp/40023/quic-v1 on ConnectionId(2968) with error Transport(Other(Custom { kind: Other, error: Right(Custom { kind: Other, error: Custom { kind: Other, error: Connection(ConnectionError(ConnectionClosed(ConnectionClose { error_code: APPLICATION_ERROR, frame_type: None, reason: b"" }))) } }) }))
    [2025-04-08T11:51:17.728505Z INFO ant_networking::driver 916] Set responsible range to Distance(18092513943330655534932966407607485602073435104006338131165247501236426506235)(Some(253))
    [2025-04-08T11:51:21.066693Z ERROR ant_networking::event::swarm 478] IncomingConnectionError Valid from local_addr:?/ip4/0.0.0.0/udp/58959/quic-v1, send_back_addr /ip4/74.81.33.41/udp/7946/quic-v1 on ConnectionId(2969) with error Transport(Other(Custom { kind: Other, error: Right(Custom { kind: Other, error: Custom { kind: Other, error: Connection(ConnectionError(ConnectionClosed(ConnectionClose { error_code: APPLICATION_ERROR, frame_type: None, reason: b"" }))) } }) }))
    [2025-04-08T11:51:28.882587Z ERROR ant_networking::event::swarm 478] IncomingConnectionError Valid from local_addr:?/ip4/0.0.0.0/udp/58959/quic-v1, send_back_addr /ip4/135.181.132.16/udp/35529/quic-v1 on ConnectionId(2970) with error Transport(Other(Custom { kind: Other, error: Right(Custom { kind: Other, error: Custom { kind: Other, error: Connection(ConnectionError(ConnectionClosed(ConnectionClose { error_code: APPLICATION_ERROR, frame_type: None, reason: b"" }))) } }) }))
    [2025-04-08T11:51:32.724047Z INFO ant_networking::driver 916] Set responsible range to Distance(18092513943330655534932966407607485602073435104006338131165247501236426506235)(Some(253))
    [2025-04-08T11:51:47.723508Z INFO ant_networking::driver 916] Set responsible range to Distance(18092513943330655534932966407607485602073435104006338131165247501236426506235)(Some(253))
    [2025-04-08T11:51:54.878050Z ERROR ant_networking::event::swarm 478] IncomingConnectionError Valid from local_addr:?/ip4/0.0.0.0/udp/58959/quic-v1, send_back_addr /ip4/136.243.94.178/udp/25301/quic-v1 on ConnectionId(2971) with error Transport(Other(Custom { kind: Other, error: Right(Custom { kind: Other, error: Custom { kind: Other, error: Connection(ConnectionError(ConnectionClosed(ConnectionClose { error_code: APPLICATION_ERROR, frame_type: None, reason: b"" }))) } }) }))
    [2025-04-08T11:51:57.720547Z INFO ant_node::log_markers 69] IntervalReplicationTriggered
    [2025-04-08T11:52:02.446407Z INFO ant_networking::network_discovery 263] It has been 180s since we last added a peer to RT. Slowing down the continuous network discovery process. Old interval: 390s, New interval: 527s
    [2025-04-08T11:52:02.464460Z INFO ant_networking::network_discovery 388] With min_full_bucket_index of Some(247), targeting buckets of [247, 246, 245, 244, 243, 242, 241, 240, 239, 238]
    [2025-04-08T11:52:02.464505Z INFO ant_networking::network_discovery 128] Going to undertake 9 get_closest queries for non_full_buckets
    [2025-04-08T11:52:02.473037Z ERROR ant_node::node 357] The NetworkEvent channel is closed
    [2025-04-08T11:52:02.473400Z INFO antnode 450] Node is stopping in 1s…
    [2025-04-08T11:52:03.480045Z ERROR antnode 459] Node stopped with error: Node events channel closed!
4 Likes

Me Too!
On 2 out of 50 nodes started on a RPi4 this morning. I can’t see a good reason for it. CPU hasn’t gone high. Network has been up the whole time.

So not just Windows.

One difference this time is that I’m using NTracking so had used the --metrics-port setting and this message about ‘Node events channel closed’ sounds like it could be related…

Upon further investigation, I believe these errors show up whenever the router’s CPU is overloaded and starts hiccupping.

1 Like

Really? I am sceptical that is the cause in my case. Router CPU is definitely not overloaded at 1%. The UDP session count is about 1,500. That’s about twice the max I was seeing with 50 nodes on the previous version. The amount of traffic is more than double as well.

But all these things are well under what I’ve seen in the past. I think it’s something that isn’t load related.

These are the last few entries in the logs for both nodes I found were stopped:-

safe@sn-test-02:~ $ tail -10 .local/share/autonomi/node/antnode14/logs/antnode.log
[2025-04-09T08:07:42.401617Z INFO ant_networking::driver 916] Set responsible range to Distance(35336941295567686591665950014858370316549677937512379162432124025852395515)(Some(244))
[2025-04-09T08:07:56.401474Z INFO ant_node::log_markers 69] IntervalReplicationTriggered
[2025-04-09T08:07:57.400660Z INFO ant_networking::driver 916] Set responsible range to Distance(35336941295567686591665950014858370316549677937512379162432124025852395515)(Some(244))
[2025-04-09T08:08:12.402041Z INFO ant_networking::driver 916] Set responsible range to Distance(35336941295567686591665950014858370316549677937512379162432124025852395515)(Some(244))
[2025-04-09T08:08:17.178817Z INFO ant_networking::network_discovery 275] More peers have been added to our RT!. Slowing down the continuous network discovery process. Old interval: 193.614636456s, New interval: 196.229458832s
[2025-04-09T08:08:17.187971Z INFO ant_networking::network_discovery 388] With min_full_bucket_index of Some(242), targeting buckets of [242, 241, 240, 239, 238, 237, 236, 235, 234, 233]
[2025-04-09T08:08:17.188035Z INFO ant_networking::network_discovery 128] Going to undertake 1 get_closest queries for non_full_buckets []
[2025-04-09T08:08:17.226874Z ERROR ant_node::node 357] The `NetworkEvent` channel is closed
[2025-04-09T08:08:17.227495Z INFO antnode 450] Node is stopping in 1s...
[2025-04-09T08:08:18.228876Z ERROR antnode 459] Node stopped with error: Node events channel closed!
safe@sn-test-02:~ $ 
safe@sn-test-02:~ $ 
safe@sn-test-02:~ $ tail -10 .local/share/autonomi/node/antnode41/logs/antnode.log
[2025-04-09T10:05:22.360282Z INFO ant_networking::driver 916] Set responsible range to Distance(17668470647783843295832975007429185158274838968756189581216062012926197755)(Some(243))
[2025-04-09T10:05:37.361159Z INFO ant_networking::driver 916] Set responsible range to Distance(17668470647783843295832975007429185158274838968756189581216062012926197755)(Some(243))
[2025-04-09T10:05:47.360701Z INFO ant_node::log_markers 69] IntervalReplicationTriggered
[2025-04-09T10:05:52.361466Z INFO ant_networking::driver 916] Set responsible range to Distance(17668470647783843295832975007429185158274838968756189581216062012926197755)(Some(243))
[2025-04-09T10:05:57.204676Z INFO ant_networking::network_discovery 263] It has been 180s since we last added a peer to RT. Slowing down the continuous network discovery process. Old interval: 218.460224085s, New interval: 561s
[2025-04-09T10:05:57.209020Z INFO ant_networking::network_discovery 388] With min_full_bucket_index of Some(241), targeting buckets of [241, 240, 239, 238, 237, 236, 235, 234, 233, 232]
[2025-04-09T10:05:57.209067Z INFO ant_networking::network_discovery 128] Going to undertake 1 get_closest queries for non_full_buckets []
[2025-04-09T10:05:57.223818Z ERROR ant_node::node 357] The `NetworkEvent` channel is closed
[2025-04-09T10:05:57.224088Z INFO antnode 450] Node is stopping in 1s...
[2025-04-09T10:05:58.226616Z ERROR antnode 459] Node stopped with error: Node events channel closed!

The common factor seems to be:-

[2025-04-09T10:05:57.209067Z INFO ant_networking::network_discovery 128] Going to undertake 1 get_closest queries for non_full_buckets []
[2025-04-09T10:05:57.223818Z ERROR ant_node::node 357] The `NetworkEvent` channel is closed
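For anyone wanting to check whether their own stopped nodes show the same pattern, here is a minimal sketch that pulls out the lines logged just before the channel-closed error. The function names are made up for illustration, and it only assumes the log line layout visible in the snippets above (`[<timestamp> <LEVEL> <module> <line>] message`); it is not part of antnode or antctl.

```python
import re

# Matches the ERROR line with or without backticks around NetworkEvent,
# since both forms appear in the logs pasted in this thread.
CHANNEL_CLOSED = re.compile(r"The `?NetworkEvent`? channel is closed")


def lines_before_channel_close(log_text, context=2):
    """Return up to `context` log lines immediately preceding the first
    'NetworkEvent channel is closed' error, or [] if it never appears."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if CHANNEL_CLOSED.search(line):
            return lines[max(0, i - context):i]
    return []


def module_of(line):
    """Extract the module path (e.g. 'ant_networking::network_discovery')
    from a line shaped like '[<timestamp> <LEVEL> <module> <line>] msg'.
    Returns None if the line does not match that assumed layout."""
    m = re.match(r"\[\S+ \w+ (\S+) \d+\]", line)
    return m.group(1) if m else None
```

Running `lines_before_channel_close` over the log of each stopped node and comparing `module_of` for the returned lines should show quickly whether the error always follows a `network_discovery` entry, as it does in the two logs above.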

So I’ve changed my mind about it maybe being something to do with node-metrics and instead think it is something to do with node communication or buckets.

1 Like

Yeah, now I’ve changed my mind about the router being the issue also. Funny thing is “antctl status” might show all nodes running without anomalies, but then when you check a few hours later, you might get 40-50% that show “Node events channel closed!”. Definitely seems to be caused by something outside of my system, maybe related to peer discovery.
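Since the stopped nodes only show up hours later, a periodic check of the log files can help. Below is a rough sketch that lists nodes whose log currently ends with the fatal line. It assumes the Linux layout seen in the `tail` commands earlier in this thread (`<root>/antnodeN/logs/antnode.log`) and that `antnode.log` is the active log file; the function name is hypothetical and paths on Windows will differ.

```python
from pathlib import Path

# Final line written by a node that hit this error (taken from the logs above).
FATAL = "Node stopped with error: Node events channel closed!"


def stopped_nodes(root):
    """Return the names of node directories under `root` whose
    logs/antnode.log has FATAL in its last non-empty line."""
    stopped = []
    for log in sorted(Path(root).glob("*/logs/antnode.log")):
        # Reads the whole file for simplicity; fine for a sketch,
        # though very large logs would be better read from the end.
        lines = [l for l in log.read_text(errors="replace").splitlines() if l.strip()]
        if lines and FATAL in lines[-1]:
            stopped.append(log.parent.parent.name)
    return stopped
```

On a setup like the RPi4 one above this could be called as `stopped_nodes(Path.home() / ".local/share/autonomi/node")`. A node that has since been restarted will no longer end with the fatal line, so it only catches nodes that are still down.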

2 Likes

I’ve experienced similar

Confirmed it’s not caused by metrics as I’ve had it again for a node after starting 50 without using the --metrics-port option.

Yeah, I don’t use metrics at all.

This failure is actually caused by a bug that only occurs in an edge case.
A fix has been pushed and will hopefully be included in the next release.

11 Likes