It’s still undecided whether this is really wanted, but we wanted to generate and log it in the quote system to get a read on things.
This was added to allow new nodes to see progress faster, $$ being the main metric. Its removal has been discussed and is currently planned, but at the moment it’s still used as a proxy for liveness by many users.
A lower quote will be chosen more, and so paid more, by standard clients atm. (Should clients always choose the cheapest? Feels like what they might end up doing in the real world, or be patched to do.)
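For illustration only, here’s what “standard clients choose the cheapest quote” could look like. This is a sketch, not the real client API; `Quote` and `choose_quote` are made-up names:

```rust
// Hypothetical sketch of cheapest-quote selection by a client.
// The real types/APIs differ; this just shows the selection rule.

#[derive(Debug, Clone)]
struct Quote {
    peer: String,    // illustrative stand-in for a PeerId
    cost_nanos: u64, // quoted store cost, in nano tokens
}

/// Pick the cheapest quote; ties broken by peer id for determinism.
fn choose_quote(mut quotes: Vec<Quote>) -> Option<Quote> {
    quotes.sort_by(|a, b| {
        a.cost_nanos
            .cmp(&b.cost_nanos)
            .then(a.peer.cmp(&b.peer))
    });
    quotes.into_iter().next()
}

fn main() {
    let quotes = vec![
        Quote { peer: "peerA".into(), cost_nanos: 120 },
        Quote { peer: "peerB".into(), cost_nanos: 90 },
        Quote { peer: "peerC".into(), cost_nanos: 90 },
    ];
    // The cheapest quoter wins the payment, so cheap nodes get chosen
    // (and paid) more often.
    let chosen = choose_quote(quotes).unwrap();
    println!("paying {} {}", chosen.peer, chosen.cost_nanos);
}
```

Under this rule, consistently cheap nodes accumulate most of the payments, which is the pricing pressure being discussed.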
Atm we’re testing what a quote validation system looks like, and that was one part of the current proposal. It’s not planned to be permanent as yet… so what’s been said still stands for now.
If it were to prove useful, then we might propose changing it. I think the jury is still out on that (and on what effect this really has on pricing, vs whether it’s useful in the quote validation).
Either way though, I’ll try to get any economic changesets into the testnet updates.
I wonder if it’s autonat related (or lack of).
Probably! That’s a node scrabbling to connect if I ever saw one. I think we can gracefully kill this, but I’ll check with @chriso first before we dive in there.
Aha, that’s interesting…
One for @bochaco (@anon26713768 if you could make a github issue with repro steps, that’d be lovely!)
Yes.
Eventually there could be. Right now we have an estimate command that could be used…
Might not be possible if the nodes close to your data ALL have a lot of data. This is where you might want to spin up more nodes to help lower the price (and earn some $!)
We want to work it so it does feel great! Hence some experimentation on the economic points.
Feels like we’re being hindered by connectivity issues/autonat’s absence atm though.
We’ve the folders cmd, and are building out a local account packet just now; that will provide this ability soon.
I’m not sure. It seems like nodes are over 1k records locally, and atm we’d want to see nodes coming online to get more rewards… We’re not seeing that (because it’s all chocolate money so far?). I’ll be spinning up more maidsafe nodes soon, I think.
Providing an update here: I tried restarting safenode pids after they had been running with connected peers = 0 for minutes on end (either due to DialPeerConditionFalse(NotDialing) on all 50 peers right away, or HandshakeTimedOut)… it’s hit or miss whether they reconnect properly.
I ended up spending too many hours hoping to get all 400 safenode pids connected with peer count > 0… it felt futile without a persistent cron job doing the needful over many hours.
Below is the current dashboard: 400 pids registered, of which ~350+ are currently running, of which < 50% have peer counts > 0.
Note: I also rebuilt sn_node with libp2p 0.53.2 (incremented from 0.53.0). It did not solve the issue.
I think the port changing (external address candidate) was due to my misconfiguration on the router. So feel free to ignore this specific case. I don’t think it was lying.
🔗 Connected to the Network
Chunking 625 files...
"/fgfs/Aircraft/A340-313X/" will be made public and linkable
Splitting and uploading "/fgfs/Aircraft/A340-313X/" into 2547 chunks
Error:
0: Not enough balance in wallet to pay for chunk.
We have NanoTokens(3916403394) but need NanoTokens(6443184325603) to pay for the chunk
I think they’ll stay that way until we have a new node release, as nodes are not yet factoring in only close nodes (I forgot that wasn’t merged yet; sorry for the false hopes!)
So when we have a new release, that should hopefully reduce pricing here.
So I’m launching local nodes here, and can generally connect to the network-contacts. I’m seeing HandshakeTimedOut consistently for some peers; we’re not marking them as bad at the moment, but it seems like we should be.
Eg, I can see a looot of peers on 65.108.236.166 that are failing, which looks to me like misconfigured port-forwarding perhaps… And we get that repeatedly. (And we do do checks on non-full buckets for bad peers here… which might well have been where a lot of bad node activity was coming from before).
So in my case, will I then basically get blocked by the network-contacts for HandshakeTimedOut (even though my NAT port forwarding is set up properly now), and not get a valid reason back saying I have been blocked… for a connection attempt?
I almost feel we should use the data point of ‘new external address candidate’: if more than one port is found for the same public IP for the same remote peer id over a certain time window, then you know their NAT port forwarding is off (it’s rotating its ports). Block them as a first pass, get that change into another testnet… and see how much the HandshakeTimedOut noise dies down?
But at this stage, blocking out anything with a HandshakeTimedOut might be very aggressive, as it’s not solving the root cause of why this is happening but is more of a brute-force, outright block. (I.e. there could be more than one reason for this error message; NAT port forwarding is just one root cause, and another might be more prevalent here.)
FWIW, that’s not my WAN IP. It is also interesting to me that it’s transient with the network-contacts, not 100% of the time, so it’s even scarier to get blocked right away when NAT port forwarding is set up properly.
Had success up- and downloading a text document; half a success uploading a 100 MB video file.
Error:
0: Failed to upload chunk batch: The maximum specified repayments were made for the address: 1530cf(00010101)…
Tried to re-upload but got:
Error:
0: Failed to upload chunk batch: Too many sequential payment errors reported during upload
More like, you’d be labelling these peers as bad locally. At least that’s what I’ve drafted up. (though maybe what you describe is a side effect of that?)
Has to be per peer id.
But I’m not sure it matters? If the peer is cycling ports because NAT is off, we don’t want to keep trying to talk to it, no? (I mean, it’s a question of tolerance, right?)
It’s not a block on anything but peers repeatedly hitting this inside a timeframe. Which may well still be harsh, but I’m not sure what else we should be doing for repeatedly failing peers? I had 25 msgs for the same peer failing to handshake, eg. That’s a lot of wasted time/cycles etc. (And my node was only up for a couple of mins.)
Aha, interesting. So if that’s set up and still has issues, and those are not new nodes, we’ve a bug elsewhere, as they should really have been purged from routing tables by now. (Or are you also starting more nodes while still trying to correct the NAT setup?)
Yes, I was wondering: say HandshakeTimedOut happens from my node against the remote, so it’s labelled as bad locally. If the remote later sends an outgoing request to me, and I need to respond back but now can’t because it’s been labelled as bad, does the remote peer then mark me as bad locally as well, as a side effect of not getting the response back?
Yes, maybe I didn’t describe it properly. I meant the block is on a peer ID whose public IP stays the same but whose port is constantly changing; if and only if that is happening, add it as another condition to mark it as bad locally.
So far on this testnet, based on the logs others have reported, healthy nodes only have one public IP/port detected by their remote peers for a given peer id. In my case it was going into the 100s if not 1000s, because the router misconfiguration made every new outgoing communication generate a new outbound port, while the inbound port on the router remained the same.
The tolerance rule you are describing, applied to repeats against the same failed peers with HandshakeTimedOut, will reduce noise (I agree). However, I was wondering if the initial condition to ‘limit or temporarily label them’ as bad locally could be based on the local peer noticing that its perceived external address has changed more than X times for its peer id in a given time frame. If so, simply stop communicating with the network (from the local peer’s perspective: throw out a message saying the NAT is misconfigured and disconnect from the network). Or, if the remote peers decide that a given peer’s perceived external IP/port has changed enough times within a window, it is then considered a ‘bad peer locally’.
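As a rough sketch of that condition (entirely illustrative: the struct name, window, and threshold are all made up; this is not the sn_networking code):

```rust
// Hypothetical detector for the rule discussed above: if observations of a
// peer's external address show ONE stable public IP but MORE than MAX_PORTS
// distinct ports inside a time window, its NAT port forwarding is likely
// misconfigured (rotating ports), so label it bad locally.
use std::collections::{HashMap, HashSet};
use std::net::IpAddr;
use std::time::{Duration, Instant};

const WINDOW: Duration = Duration::from_secs(300); // illustrative window
const MAX_PORTS: usize = 3;                        // illustrative tolerance

#[derive(Default)]
struct AddressTracker {
    // peer id -> observed (ip, port, when) address candidates
    seen: HashMap<String, Vec<(IpAddr, u16, Instant)>>,
}

impl AddressTracker {
    /// Record a "new external address candidate" observation.
    /// Returns true when the peer should be marked bad locally.
    fn record(&mut self, peer: &str, ip: IpAddr, port: u16, now: Instant) -> bool {
        let entries = self.seen.entry(peer.to_string()).or_default();
        entries.push((ip, port, now));
        // Drop observations that fell out of the window.
        entries.retain(|&(_, _, t)| now.duration_since(t) <= WINDOW);
        let ips: HashSet<IpAddr> = entries.iter().map(|e| e.0).collect();
        let ports: HashSet<u16> = entries.iter().map(|e| e.1).collect();
        // Same IP, too many distinct ports => NAT is rotating ports.
        ips.len() == 1 && ports.len() > MAX_PORTS
    }
}

fn main() {
    let mut tracker = AddressTracker::default();
    let ip: IpAddr = "203.0.113.7".parse().unwrap();
    let now = Instant::now();
    // Same IP, rotating ports (as in the logged candidates): the 4th
    // distinct port within the window trips the rule.
    for port in [12008u16, 3981, 8021, 50765] {
        let bad = tracker.record("examplePeer", ip, port, now);
        println!("port {port}: bad = {bad}");
    }
}
```

The same check could run on either side: locally (shut down with a “misconfigured NAT” message) or at remote peers (mark the rotating peer as bad).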
Just to clarify: once a peer is marked as bad locally, is this at all broadcast to its surrounding peers or close group, or does it just remain within the local peer’s list of bad peers?
The side effect of addressing ‘multiple external address candidates’ (whether that can be done locally, at the remote peer at the time this evaluation is carried out, or at both the sender and receiver level) is no more HandshakeTimedOut for that node against remote nodes, and vice versa.
Basically, we have three separate scenarios here I think:
1. Failed to dial initially on bootstrap, which then leads to more HandshakeTimedOuts against the same network-contact peer ids (at least in the local peer’s logs).
2. After successfully bootstrapping to the network-contacts: multiple external addresses for a single peer id (as perceived by remote peers), received as FYI messages by the local peer. When that keeps changing, it causes future HandshakeTimedOuts (basically all the time), so in my opinion the node should simply stop communicating with the network.
3. After successfully bootstrapping to the network-contacts: nodes that produce a HandshakeTimedOut but have their NAT port forwarding set up right (i.e. not receiving changing FYI messages about their perceived external addresses).
Here we should confirm: if local peer X continues to have outgoing issues with peer Y and marks peer Y as bad locally, will that produce the reverse situation of HandshakeTimedOut on the remote peer? (Peer Y’s outgoing connection to peer X may go through, but it no longer gets a response back due to being marked bad locally on peer X, which then leads to yet another HandshakeTimedOut message on peer Y’s side.) If so, then basically both peers, over a certain time and within a given tolerance, would have labelled each other as bad temporarily?
My suggestion is to focus on scenario #1 and scenario #2 first and see how much noise is reduced before considering blocking on HandshakeTimedOut locally; by block I mean significantly reducing communication activity for a brief amount of time by labelling certain peer ids as bad.
I think once scenarios #1 and #2 are addressed, then, for whatever is left, folks will need to dive deeper to figure out what’s causing it. Ideally we get to the bottom of the root cause of HandshakeTimedOut, as opposed to temporary blocks in code for scenario #3?
I am now in the scenario #1 and scenario #3 buckets, as scenario #2 is no longer the case with my nodes. I have fixed all 400 NAT port forwarding entries in the router properly, so I am being careful not to spin up more than 400 nodes. If a node doesn’t connect, I stop it and start it up again. I assume it gets a different peer id here; however, scenario #2 doesn’t apply for these restarts, as the router is already configured properly for the safenodeX service on port A on LXC Y with WAN IP Z (which remains the same for the safenodeX service regardless of the number of restarts).
Please let me know if I made any incorrect assumptions here.
Thanks again for taking a further look at this issue.
Mhmm, with you. This would be in the startup phase perhaps, so maybe a naive solution here is to only label peers as bad once we have sufficient peers in our RT. Ie, allow more error tolerance in the bootstrap phase.
So we could do this aye. We definitely need some good “shutdown” conditions where we can be sure the local node is having a bad and unrecoverable time.
It’s broadcast upon request to peers. (Each node in the network passively samples its routing table to check what the nodes close to a given peer think of it.)
This is the case that might require more tolerance from the local node, as it’s bootstrapping (potentially into a busy network), and so it may be unwise to label any initial contact nodes as bad yet?
This would be one “I should shutdown” signal, which would be good to have, for sure.
permanently (at the moment).
Sounds sensible aye
I don’t think so (beyond the clarifications above).
I think we might be able to get 1) on the go with some peer-count threshold, eg.
For 2) was it DialPeerConditionFalse(NotDialing) you were seeing, or some other logs? If you can clarify that I’ll pin down where we need to track + initiate shutdown.
And then w/r/t 3), atm we’ve just added the bad tracking from conn issues. I’ll relax that for handshake timeouts so we can assess 1/2 as suggested.
Thanks for taking the time here to dig in @Shu (and @happybeing I know I haven’t responded to your logs, but I think what @Shu ’s seen here with the DialCondition is the same issue, so hopefully this approach sorts things for us there, too!)
I would think that you’d have a measure of observed badness:
Reduce it a little periodically, since that node was probably being good during that period. If it’s still bad, the score rises a lot more than the reduction.
Small RT: then you tolerate a higher badness.
Lots in the RT: then tolerate much less.
Make a score out of badness vs RT size and use that to determine whether to remove the peer or not.
There still needs to be a point where high badness is never accepted and small badness is tolerated (small badness is something like taking too long to respond once or twice in recent history).
At no time should you accept full badness, since that will only make one’s life a pain.
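A toy sketch of that scoring rule. Every number here is illustrative, and the function names are made up; it just encodes the shape of the idea (penalties rise faster than decay, tolerance scales with RT size, full badness is never accepted):

```rust
// Hypothetical badness scoring for peers, per the rule sketched above.

/// Penalty per observation: a connection-level failure counts far more
/// than merely being slow once.
fn penalty(failed_handshake: bool, slow_response: bool) -> u32 {
    if failed_handshake { 10 } else if slow_response { 1 } else { 0 }
}

/// Periodic decay: reduce a little, since the node was probably being
/// good during that period. New badness rises much faster than it decays.
fn decay(score: u32) -> u32 {
    score.saturating_sub(1)
}

/// Tolerance scales with routing-table size: a small RT (e.g. during
/// bootstrap) tolerates a higher badness; a full RT tolerates much less.
/// Full badness (the hard cap) is never accepted at any RT size.
fn should_evict(score: u32, rt_size: usize) -> bool {
    const HARD_CAP: u32 = 50;
    if score >= HARD_CAP {
        return true;
    }
    let threshold = if rt_size < 20 { 30 } else { 10 };
    score >= threshold
}

fn main() {
    let mut score = 0u32;
    // Two failed handshakes, then one quiet period of decay.
    score += penalty(true, false);
    score += penalty(true, false);
    score = decay(score);
    // The same score is tolerated in a small RT but not a full one.
    println!("small RT evict: {}", should_evict(score, 5));
    println!("full RT evict:  {}", should_evict(score, 200));
}
```

This also gives the bootstrap tolerance discussed below for free: while the RT is small, the eviction threshold is simply higher.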
DialPeerConditionFalse(NotDialing) is happening on and off (especially for scenario #1). In most of the logs I only see the failed dial + DialPeerConditionFalse(NotDialing) initially; if it still fails after, say, 2 batch attempts of 50 network-contacts each, I don’t see it trying again: the node is basically idle and dead… At other times it somehow manages to add a peer to its RT and then bootstraps with more than 0 peers.
For #2: having got past DialPeerConditionFalse(NotDialing) (regardless of which 1 to 49 of the original network-contacts it had issues with), once the node does connect to the network (1 or more peers), it receives multiple ‘external address candidate’ messages over its lifetime. Looking at the messages right before and after them shows HandshakeTimedOut on OutgoingConnectionError starting to happen at high frequency.
Looking at my earlier posts, it rapidly starts after the first proper port/IP combination that was originally mapped (in this case port 12008; after that the outgoing ports went random):
[2024-03-28T19:25:22.270606Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/12008/quic-v1
[2024-03-28T19:25:23.483660Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/12008/quic-v1/p2p/12D3KooWPMpYUsdVS5txq4jKyTVkMgEvditEMdidjJgrdcWG48Gs
[2024-03-28T19:25:23.717794Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/3981/quic-v1
[2024-03-28T19:25:24.490792Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/8021/quic-v1
[2024-03-28T19:25:24.518781Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/50765/quic-v1
[2024-03-28T19:26:47.904880Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/37799/quic-v1
[2024-03-28T19:27:45.331681Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/14265/quic-v1
[2024-03-28T19:27:50.334862Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/64204/quic-v1
[2024-03-28T19:33:16.528309Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/14265/quic-v1/p2p/12D3KooWPMpYUsdVS5txq4jKyTVkMgEvditEMdidjJgrdcWG48Gs
[2024-03-28T19:37:19.949118Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/3981/quic-v1/p2p/12D3KooWPMpYUsdVS5txq4jKyTVkMgEvditEMdidjJgrdcWG48Gs
[2024-03-28T20:08:04.169718Z INFO sn_networking::event] external address: new candidate address=/ip4/A.B.C.D/udp/64204/quic-v1/p2p/12D3KooWPMpYUsdVS5txq4jKyTVkMgEvditEMdidjJgrdcWG48Gs
You can see the time delta between those port changes, but yeah: once external address messages with a different port but the same IP and peer ID start popping up after the original, you end up getting HandshakeTimedOut with the vast majority of peers, constantly.
I would need to change my router settings to re-trigger this environment again. If still needed, let me know, and I can try to reproduce this (scenario #2).
Also, just curious: are there specific timeouts (in milliseconds) currently set in the Safe Network codebase, or is it all default libp2p settings? For instance, my ping to Safe Network IPs is > 100ms (I wonder whether distance/latency itself is somehow exceeding an allowed wait time for the handshake to complete, i.e. for scenario #3), though I am not sure if the concept of keep-alive is at play with QUIC etc. A lack of activity on the router would also kill the connection. I am going to investigate more NAT-related settings and timeouts on my router side as well, in regards to #3.