I wonder if there might be a bit of an effect from running ‘too many’ nodes on some computers.
Say you have 200 nodes running on a server and everything seems hunky-dory: there is enough CPU and RAM to run them, and the network connection is running at the maximum you can get for it, so everything looks really efficient and nice.
But what if the nodes are constantly unable to quite keep up with what is being demanded of them because the network connection is constantly maxed out? Or, slightly less bad, it's fine 99% of the time, but every couple of minutes there is not enough bandwidth, or a lot of the nodes suddenly get an influx of data or a demand for reads and there is not quite enough bandwidth for them all?
Then those 200 nodes might be throttled, causing issues for all the nodes they are connected to, and the whole network suffers.
I can see that when I start 5 nodes within a few seconds of each other, my 100Mbps down / 20Mbps up connection is maxed out. What effect is that having on the network?
Also, when you get close to maxing out your connection, the lag time escalates. It's like a congested roadway: at 90% full the flow is good, but as you get closer to 100% it all slows down, making the situation worse for all.
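A rough way to see why, if we idealise the link as a simple M/M/1 queue (purely an illustration, not a model of the real transport): with arrival rate $\lambda$ and service rate $\mu$, the average time spent waiting plus being served is

$$W = \frac{1}{\mu - \lambda}$$

so at 90% utilisation ($\lambda = 0.9\mu$) the delay is $10/\mu$, while at 99% it is $100/\mu$: a tenfold jump for the last 9% of load.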
As suggested, it would be good to test this. One good idea would be for the node manager to monitor the bandwidth situation, especially the lag time, as this is a simple and reliable indication of the internet connection being maxed out.
@joshuef I’m not sure who is looking after the node manager at this time, but it would be simple to include code that checks lag every 10 seconds or so, watches for an increase in lag time, and warns the operator that they have too many nodes for their connection to handle. This would work for any speed of connection, since it compares the connection against itself and looks for an overall rise in lag times: measure lag before starting nodes, then use that as a baseline. For instance, the person running nodes may have other people/computers using the internet connection, so one 100Mbps link may not support the same number of nodes as another 100Mbps link. This simple lag testing would highlight an important metric, because anyone connecting to overloaded nodes will see problems ranging from slow responses to lost requests.
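A minimal sketch of what such a check could look like, written as a standalone program rather than as part of the actual node manager (the probe endpoint, the 10-second interval, the five baseline samples and the 3x warning threshold are all assumptions for illustration):

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Time a single TCP connect to a probe host as a crude "lag" measurement.
/// The probe address is an assumption; any reliable nearby endpoint would do.
fn measure_lag(addr: &str) -> Option<Duration> {
    let target = addr.to_socket_addrs().ok()?.next()?;
    let start = Instant::now();
    TcpStream::connect_timeout(&target, Duration::from_secs(5)).ok()?;
    Some(start.elapsed())
}

fn main() {
    let probe = "one.one.one.one:443"; // hypothetical probe endpoint
    let interval = Duration::from_secs(10);

    // Establish a baseline before the nodes put load on the connection.
    let samples: Vec<Duration> = (0..5).filter_map(|_| measure_lag(probe)).collect();
    if samples.is_empty() {
        eprintln!("could not establish a lag baseline");
        return;
    }
    let baseline = samples.iter().sum::<Duration>() / samples.len() as u32;
    println!("baseline lag: {:?}", baseline);

    loop {
        sleep(interval);
        match measure_lag(probe) {
            // Warn once lag is well above the operator's own baseline;
            // the 3x factor is arbitrary and would need tuning.
            Some(lag) if lag > baseline * 3 => println!(
                "WARNING: lag {:?} is over 3x baseline {:?}; the connection may be saturated, consider running fewer nodes",
                lag, baseline
            ),
            Some(lag) => println!("lag {:?} (baseline {:?})", lag, baseline),
            None => println!("WARNING: lag probe failed; connection may be saturated"),
        }
    }
}
```

Because the comparison is against the operator's own baseline rather than an absolute number, the same check works on a shared 100Mbps link and a dedicated one.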
A couple of points I would like to clarify:
1. For the 5 nodes reported as not giving the expected ChunkProof:
This doesn’t mean that all 5 of those nodes are in trouble.
Only one node is picked as the pivotal node to replicate the chunk to the others,
so in most cases a failure only means that particular node is in trouble.
The percentage of failing nodes across the whole testnet is therefore much smaller than the estimate.
2. Regarding why the failing chunks are deterministic across multiple retries:
That’s because the algorithm for selecting the pivotal node is currently deterministic, i.e. closeness + store_cost_charge (see the sketch after this list).
We choose to keep it deterministic so that node-side issues are exposed explicitly via client-side activity and are easier to track down.
Removing this determinism would hide node-side issues and make them difficult to track.
In the long run, once we have full confidence in node behaviour under a given workload pressure, we may deploy another approach so that a particular node cannot deterministically block certain chunks.
3. With fast uploads now available, we see the average number of records kept per node rise into the thousands.
This does cause a burden when a new node starts up, since it has to receive a large number of records in a short time to fill up.
And this burden falls not only on the new node itself but also on the existing nodes (which have to provide the record copies).
Even worse, if multiple nodes start up (or are already running) at the same time on limited bandwidth, this will choke the traffic for a while.
We will think about how to ease this; meanwhile, please try not to start up a cluster of nodes within a short period.
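For anyone wondering what deterministic "closeness + store_cost_charge" selection looks like in practice, here is a rough sketch; the struct, the field names and the exact combining rule are assumptions for illustration, not the real safe_network code:

```rust
/// A candidate holder for a chunk, as seen from the client side.
/// Field names are illustrative only.
struct Candidate {
    xor_distance: u128, // closeness of the node's ID to the chunk address
    store_cost: u64,    // the store_cost_charge quoted by the node
}

/// Pick the pivotal node that will hold the chunk and replicate it to the
/// other close nodes. Because the ordering depends only on closeness and
/// quoted cost, the same inputs always select the same node, so a
/// misbehaving node keeps being picked for the same chunk on every retry,
/// which is what makes node-side problems visible from client-side activity.
fn pick_pivotal(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .min_by_key(|c| (c.store_cost, c.xor_distance))
}
```

Adding randomness to this choice would spread the retries across different nodes and hide which one is actually at fault, which is the trade-off described above.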
Lastly, I noticed there are two nodes having trouble. If you happen to own them, please let me know their status and share the logs you have. Thank you very much.
My node has done nothing since yesterday afternoon, despite being seemingly OK in Vdash.
How about others with the new node version? I wonder if this version is more sensitive to connection problems. At least a few testnets ago nodes used to bounce back into action after a reboot of the router, chunks coming and going, but not now. It actually does something, but there is no chunk traffic.
Hmhmm - once node kickout is implemented, a node could (given a sufficient balance of earnings) automatically upload its logs to Safe and somehow communicate the location of the logs, maybe.
The idea was just not to lose too many error cases.
The network has to be able to deal with that. We cannot expect operators to follow recommendations. It is also an attack vector for intentionally bad actors.
What if a node goes down or churn happens between retries?
All seems fine. I looked at the logs of a few nodes and there are some chunks coming and going.
Now most, but not all, of my previous uploads are throwing errors when I try to download them. At least all of the ones that were in a folder I uploaded.
Chunks error Chunk could not be retrieved from the network:
I’ve noticed two VPSs are not receiving as many records as the rest. Interestingly, of the two nodes @qi_ma mentioned above that were misbehaving and are now terminated, one was on each of these VPSs.
The VPSs show up in my record count as the black gaps in this bar graph.
Just considering shutting down those VPSs. Any thoughts @joshuef @qi_ma?
To be fair, I do hold a large enough portion of the nodes for it to be a coincidence. But I wish I had kept notes from previous testnets, as I’m sure vps-6 and vps-15 have had bad nodes before.
When @dirvine said “more nodes = more better”, or something to that effect, I think I was the only one listening.
I think everyone else needs to put their backs into it and run more nodes or I will