I wonder if there might be a bit of an effect from running ‘too many’ nodes on some computers.
Say you have 200 nodes running on a server and everything seems hunky-dory: there is enough CPU and RAM to run them, and the network connection is running at the maximum you can get for it, so everything looks really efficient and nice.
But what if the nodes are constantly unable to quite keep up with what is being demanded of them because the network connection is constantly maxed out? Or, slightly less bad, it's fine 99% of the time, but every couple of minutes there is not enough bandwidth, or a lot of the nodes suddenly get an influx of data or a demand for reads and there is not quite enough bandwidth for them all?
Then those 200 nodes might be throttled, causing issues for all the nodes they are connected to, and the whole network suffers.
I can see that when I start 5 nodes within a few seconds of each other, my 100Mbps down / 20Mbps up connection is maxed out. What effect is that having on the network?
Also, when you get close to maxing out your connection, the lag time escalates. It's like a congested roadway: at 90% full the flow is good, but as you get closer to 100% it all slows down, making the situation worse for all.
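A rough way to see why, if we idealise the link as a simple M/M/1 queue (purely an illustration, not a model of the real transport): with arrival rate $\lambda$ and service rate $\mu$, the average time spent waiting plus being served is

$$W = \frac{1}{\mu - \lambda}$$

so at 90% utilisation ($\lambda = 0.9\mu$) the delay is $10/\mu$, while at 99% it is $100/\mu$: a tenfold jump for the last 9% of load.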
As suggested, it would be good to test this. One good idea would be for the node manager to monitor the bandwidth situation, especially the lag time, as this is a simple and reliable indication of the internet connection being maxed out.
@joshuef I’m not sure who is looking after the node manager at this time, but it would be simple to include code that checks lag every 10 seconds or so, watches for an increase in lag time, and warns the operator that they have too many nodes for their connection to handle. This would work for any speed of connection, since it compares the connection against itself and looks for an overall rise in lag times: measure lag before starting nodes, then use that as a baseline. For instance, the person running nodes may have other people/computers using the internet connection, so one 100Mbps link may not support the same number of nodes as another 100Mbps link. This simple lag testing would highlight an important metric, because anyone connecting to overloaded nodes will see problems ranging from slow responses to lost requests.
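A minimal sketch of what such a check could look like, written as a standalone program rather than as part of the actual node manager (the probe endpoint, the 10-second interval, the five baseline samples and the 3x warning threshold are all assumptions for illustration):

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Time a single TCP connect to a probe host as a crude "lag" measurement.
/// The probe address is an assumption; any reliable nearby endpoint would do.
fn measure_lag(addr: &str) -> Option<Duration> {
    let target = addr.to_socket_addrs().ok()?.next()?;
    let start = Instant::now();
    TcpStream::connect_timeout(&target, Duration::from_secs(5)).ok()?;
    Some(start.elapsed())
}

fn main() {
    let probe = "one.one.one.one:443"; // hypothetical probe endpoint
    let interval = Duration::from_secs(10);

    // Establish a baseline before the nodes put load on the connection.
    let samples: Vec<Duration> = (0..5).filter_map(|_| measure_lag(probe)).collect();
    if samples.is_empty() {
        eprintln!("could not establish a lag baseline");
        return;
    }
    let baseline = samples.iter().sum::<Duration>() / samples.len() as u32;
    println!("baseline lag: {:?}", baseline);

    loop {
        sleep(interval);
        match measure_lag(probe) {
            // Warn once lag is well above the operator's own baseline;
            // the 3x factor is arbitrary and would need tuning.
            Some(lag) if lag > baseline * 3 => println!(
                "WARNING: lag {:?} is over 3x baseline {:?}; the connection may be saturated, consider running fewer nodes",
                lag, baseline
            ),
            Some(lag) => println!("lag {:?} (baseline {:?})", lag, baseline),
            None => println!("WARNING: lag probe failed; connection may be saturated"),
        }
    }
}
```

Because the comparison is against the operator's own baseline rather than an absolute number, the same check works on a shared 100Mbps link and a dedicated one.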
A couple of points I would like to clarify:
1. For the 5 nodes reported as not giving the expected ChunkProof:
This doesn’t mean that all 5 of those nodes are in trouble.
Only one node is picked as the pivotal node to replicate the chunk to the others,
so in most cases a failure only means that particular node is in trouble.
The percentage of failing nodes across the whole testnet is therefore much smaller than the estimate.
2. Regarding why the failing chunks are deterministic across multiple retries:
That’s because the algorithm for selecting the pivotal node is currently deterministic, i.e. closeness + store_cost_charge (see the sketch after this list).
We choose to keep it deterministic so that node-side issues are exposed explicitly via client-side activity and are easier to track down.
Removing this determinism would hide node-side issues and make them difficult to track.
In the long run, once we have full confidence in node behaviour under a given workload pressure, we may deploy another approach so that a particular node cannot deterministically block certain chunks.
3. With fast uploads now available, we see the average number of records kept per node rise into the thousands.
This does cause a burden when a new node starts up, since it has to receive a large number of records in a short time to fill up.
And this burden falls not only on the new node itself but also on the existing nodes (which have to provide the record copies).
Even worse, if multiple nodes start up (or are already running) at the same time on limited bandwidth, this will choke the traffic for a while.
We will think about how to ease this; meanwhile, please try not to start up a cluster of nodes within a short period.
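For anyone wondering what deterministic "closeness + store_cost_charge" selection looks like in practice, here is a rough sketch; the struct, the field names and the exact combining rule are assumptions for illustration, not the real safe_network code:

```rust
/// A candidate holder for a chunk, as seen from the client side.
/// Field names are illustrative only.
struct Candidate {
    xor_distance: u128, // closeness of the node's ID to the chunk address
    store_cost: u64,    // the store_cost_charge quoted by the node
}

/// Pick the pivotal node that will hold the chunk and replicate it to the
/// other close nodes. Because the ordering depends only on closeness and
/// quoted cost, the same inputs always select the same node, so a
/// misbehaving node keeps being picked for the same chunk on every retry,
/// which is what makes node-side problems visible from client-side activity.
fn pick_pivotal(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .min_by_key(|c| (c.store_cost, c.xor_distance))
}
```

Adding randomness to this choice would spread the retries across different nodes and hide which one is actually at fault, which is the trade-off described above.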
Lastly, I noticed there are two nodes having trouble. If you happen to own them, please let me know their status and share the logs you have. Thank you very much.
My node has done nothing since yesterday afternoon, despite being seemingly OK in Vdash.
How about others with the new node version? I wonder if this version is more sensitive to connection problems. At least a few testnets ago nodes used to bounce back into action after a reboot of the router, chunks coming and going, but not now. It actually does something, but there is no chunk traffic.
Hmhmm - once node kickout is implemented, a node could (given a sufficient balance of earnings) automatically upload its logs to Safe and somehow communicate the location of the logs, maybe.
The idea was just not to lose too many error cases.
The network has to be able to deal with that. We cannot expect operators to follow recommendations. It is also an attack vector for intentionally bad actors.
What if a node goes down or churn happens between retries?
All seems fine. I looked at the logs of a few nodes and there are some chunks coming and going.
Now most, but not all, of my previous uploads are throwing errors when I try to download them. At least all of the ones that were in a folder I uploaded.
Chunks error Chunk could not be retrieved from the network:
I’ve noticed two VPSs are not receiving as many records as the rest. Interestingly, of the two nodes @qi_ma mentioned above that were misbehaving and are now terminated, one was on each of these VPSs.
The VPSs show up in my record count as the black gaps in this bar graph.
Just considering shutting down those VPSs. Any thoughts @joshuef @qi_ma?
To be fair, I do hold a large enough portion of the nodes for it to be a coincidence. But I wish I had kept notes from previous testnets, as I’m sure vps-6 and vps-15 have had bad nodes before.
When @dirvine said “more nodes = more better”, or something to that effect, I think I was the only one listening.
I think everyone else needs to put their backs into it and run more nodes or I will