Discussion on low-level data flow and home networks. Found a solution to allow a 4MB max chunk size to flow as smoothly as 1/2 MB - it's a setting in QUIC

You might be able to use a MikroTik router as well:

“To measure dropped packets on a MikroTik router, you can use the stats or stats-detail commands to print the interface counters. The value for the number of dropped packets by the interface queue is indicated by tx-queue-drop.”

Interface stats and monitor-traffic - RouterOS - MikroTik Documentation.
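For example, on RouterOS that means something like `/interface print stats` or `/interface print stats-detail` to read the counters (watching tx-queue-drop climb), and `/interface monitor-traffic ether1` for a live view; adjust the interface name to your setup.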

3 Likes

Of course, quality routers don't always mimic the woeful routers ISPs supply, unfortunately, but they may show the issues as well. Just be mindful that quality routers are only an indicator.

2 Likes

@joshuef I noticed this network is surviving much better. A couple of observations though, which partly result from not checking performance in a better way (a better check takes about twice the development effort of the rough CPU %age one):

  • the network is a lot smaller
  • the CPU %age test excludes lower-performance node devices (see below)
  • the network is mainly larger VPSes and “power” users, who have higher-performance machines, typically better routers, and sometimes very good internet

People have mentioned that the registry refresh the node manager performs is causing many already-running nodes to die, because CPU usage stays high for an extended period while the refresh runs. This limits the number of nodes that can run per device more than intended. One person reported their 12-core device suffers this: their nodes collectively use very little CPU, but as the manager adds more nodes the refresh spikes the CPU for too long and many already-running nodes die.

This means those using their own scripts to start nodes (the “power” users with up to 28K nodes) are the ones capable of running a lot of nodes.

Overall, this shift of nodes to higher-performance equipment is masking the instability that comes with home setups, since not so many of them are in the network.

EDIT: And while there will be home nodes, I expect there will be a lot fewer nodes on those machines and thus their routers will be much happier. But once the majority of nodes out there are home nodes, the network is only one step away from noticeable instability again.

We cannot take this as a sign that the QUIC window size change is unnecessary; in my experience it is needed, and it will also provide a much smoother experience for node operators at home and for the network as a whole.

15 Likes

At initial startup, Launchpad really needs to query the specs of the user’s setup, either automatically or via a manual “questionnaire”, and then give the user a conservative estimate of the maximum number of nodes they can safely run on their machine and internet connection. Otherwise, everyone is shooting in the dark.

Machine specs would include the amount of available RAM, the number of CPU cores and threads, and available hard drive space. Internet and router specs would be bandwidth and the maximum number of concurrent connections (hard to find, granted, but LP could use a common ISP maximum of, say, 30,000 as a default).
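A rough sketch of what that conservative estimate could look like; every per-node budget in here is an invented placeholder for illustration, not a measured Autonomi requirement:

```rust
/// Hypothetical sizing heuristic of the sort Launchpad could run after a specs
/// questionnaire. All per-node budgets below are illustrative assumptions.
struct Specs {
    ram_gb: f64,
    cpu_threads: u32,
    disk_gb: f64,
    upload_mbps: f64,
    router_max_connections: u32,
}

fn conservative_node_estimate(s: &Specs) -> u32 {
    let by_ram = (s.ram_gb / 0.5) as u32;          // assume ~0.5 GB RAM per node
    let by_cpu = s.cpu_threads * 2;                // assume ~2 nodes per CPU thread
    let by_disk = (s.disk_gb / 35.0) as u32;       // assume ~35 GB disk per node
    let by_bw = (s.upload_mbps / 1.0) as u32;      // assume ~1 Mbps upstream per node
    let by_conn = s.router_max_connections / 400;  // assume ~400 connections per node
    let tightest = [by_ram, by_cpu, by_disk, by_bw, by_conn]
        .into_iter()
        .min()
        .unwrap_or(0);
    // Leave 20% headroom so the rest of the household still has a usable connection.
    (tightest as f64 * 0.8) as u32
}

fn main() {
    let home_pc = Specs {
        ram_gb: 16.0,
        cpu_threads: 8,
        disk_gb: 1000.0,
        upload_mbps: 50.0,
        router_max_connections: 30_000, // the default suggested above
    };
    println!("suggested max nodes: {}", conservative_node_estimate(&home_pc));
}
```

The point is simply to take the tightest of the constraints and back off from it, rather than letting the user guess.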

4 Likes

I’ve just had a thought: has anyone looked at the interaction between the large installations of thousands of nodes and whether they are relays or not?

I’m guessing the large sites with thousands of nodes were set up with port forwarding, so they end up being relays rather than piggybacking off existing relays.

What if part of the problems with the last network were due to the loss of a lot of relay nodes? I’m thinking that suddenly losing a couple of thousand nodes isn’t great, but the network’s ability to recover from that could be hampered if a lot of ‘home-network’ nodes were using them as relays. Those home nodes then can’t take on new records, or send them to other nodes which request them, until they connect to a new relay.

Furthermore, the remaining relays (which may also be in large sites) suddenly get a lot busier, just when they are already becoming busier anyway simply because, as nodes, they have to take on some of the new records.

7 Likes

You gotta run a ghetto machine to qualify :wink:

More tools for analysis of QUIC traffic, for those so inclined

QUIC Traffic Analysis Tools

  • Locust: a Python-based open-source load testing tool that allows developers to write user behavior in code, offering high flexibility and customization, which can be used to simulate QUIC traffic and analyze its performance.
  • Icinga: an intelligent network monitoring tool that allows you to manage devices from multiple vendors using a single interface, providing insights on how network devices are being used, which can be used to monitor QUIC traffic.
  • Graphite: an open-source tool that tracks time-series data, such as network performance, and displays this data for users on a dashboard, simplifying network management considerably, which can be used to monitor QUIC traffic behavior.
  • Zabbix: a free network monitoring system that checks all infrastructure, including networks, providing performance insights into packet loss, network mode, connections, CPU, memory utilization, and bandwidth, which can be used to monitor QUIC traffic.
  • Nagios Core: an open-source infrastructure monitoring tool with enhanced monitoring capabilities for networks, systems, and servers, which can be used to monitor QUIC traffic behavior and performance.



Sources:

  • walseisarel.medium.com: Top 12 Free Open-Source Performance Testing Tools | by Walse Isarel | Medium
  • dnsstuff.com: Top 8 Free Network Monitoring Software
  • enterprisenetworkingplanet.com: 7 Best Open Source Network Monitoring Tools in 2023 | ENP
  • geekflare.com: Top 15 Open-Source Monitoring Tools
  • auvik.com: What is QUIC? Everything You Need to Know | Auvik
  • metricfire.com: 9 Best Open Source Network M

4 Likes

Plus a bonus high-level post describing QUIC, circa Oct 23 2024, from network analysis software vendor AUVIK (the auvik.com link above); not an endorsement at all, it’s just a decent post…

2 Likes

I’m having a lot of problems with uploads from ant CLI freezing my network for other requests.

The memory stats on my router don’t look critical, but I’m convinced there is networking congestion causing issues.

Indeed, I suspect the record data being uploaded is competing with message data, with the latter losing out, and that this could be why quotes for storage are not getting through.

Do we know if it is possible to change the max_stream_data value for just the ant CLI? I’ve just been using the binary provided so far, but it would be easy for me to test (as I simply can’t upload a directory with a few files in it on this network).
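For anyone wanting to experiment, here is a minimal sketch of where that knob lives, assuming the client builds its QUIC transport through rust-libp2p’s `quic` module; whether the ant CLI actually exposes this today is exactly the open question, so treat the field names as libp2p’s, not Autonomi’s:

```rust
// Sketch only: assumes the QUIC transport is configured via rust-libp2p's
// `quic` module; where ant-networking actually constructs this may differ.
use libp2p::{identity::Keypair, quic};

fn smaller_window_quic_config(keypair: &Keypair) -> quic::Config {
    let mut cfg = quic::Config::new(keypair);
    // libp2p overrides quinn's default per-stream receive window up to ~10 MiB.
    // The numbers below are just experimental values from this thread,
    // not recommended settings.
    cfg.max_stream_data = 512 * 1024;          // per-stream receive window: 512 KiB
    cfg.max_connection_data = 2 * 1024 * 1024; // connection-wide cap, a few times larger
    cfg
}
```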

3 Likes

Fwiw, I tried changing this setting in ant-networking and it made no difference to my quote issues (with ant file upload). I didn’t experiment enough to know whether it left my network less congested yet.

2 Likes

Thanks for trying that. I am stuck in a huge PR so lack bandwidth. I hope to get to our networking stack soon though, and I am certain we can make it more efficient.

8 Likes

Not sure if a sender having a smaller window size will be accepted or not. Maybe everything has to use the same MAX window size, maybe not - buffer allocation space within the programs, etc.

But you used to be able to set the block size for each run of the CLI. This would limit the number of chunks in each block being uploaded. If it’s gone then it needs to be brought back.

3 Likes

Yes, the chunk size used to be an argument, but it’s gone now.

However, you can build the ant cli with an env variable to set it at build time. I tried it set to 1 (default is 8) to see if that helped, but it just gave me the quote error more slowly! :sweat_smile:

So, I think it is something else causing the quote issues. Maybe the quic tuning can help with network congestion either way though. It’s probably easier to test with ant cli uploads than nodes, tbh. Pretty easy to rack up a big upload and see what it does to the rest of the network traffic.

3 Likes

Someone shared this with me: Stream receive window setting in `quinn` is changed by libp2p from 1.25MiB to 10MiB · libp2p/rust-libp2p · Discussion #5799 · GitHub @bzee

And it brought up a couple of thoughts on the matter

Even 1.25MB is a little big for Autonomi.

Most applications using libp2p are not trying to hold thousands of connections across many nodes, any of which could be asked to send a chunk at any time. A home router could be behind anywhere from a few nodes to hundreds, and thus have many thousands of connections running through it. This is extremely unlike typical QUIC applications, where a single application may only have a couple of connections open at any one time.

With so many connections, the higher the window size, the greater the chance that multiple blocks that cannot be interrupted are being sent at once. And the higher the value, the bigger the buffers have to be - a double whammy. This results in a higher chance of multiple data blocks flowing before QUIC has a chance to adjust the window size down.

If devs had 20 or 30 years of experience with serial data transmission from the early days, it would just be an absolute no-brainer to reduce the max window size to something rather small. I will always refer back to the fact that TCP, still, after 40 years, uses a max packet size of around 1536 bytes and a max of 7 packets without an ACK.

7 Likes

Keep in mind you still need to cater to 10Gbps uplinks for businesses, partners, and organizations that have the ability to upload PBs of data to Autonomi. It’s not just the small node operators you want to optimize for here. If those target groups can’t make use of their full bandwidth, then that’s not ideal either.

While I’d like to hear more details on their decision for the 10MB, the original post did make sense to me based on the raw formula in Bzee’s original post on the libp2p repository:

10MiB seems like a good middle ground. That's 100MiB/s on a 100ms latency. That's 1GiB/s on links with 10ms latencies (most places that care about really fast throughput).

A number too small will favor one group, a number too large will favor another, etc. (or so it seems), since different groups have different latency and throughput abilities.

Plus, I think this stream setting is now tunable, so folks can experiment with it (at least on the client side).
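For reference, the numbers in that quote fall straight out of the flow-control relation: a single stream can move at most about window / RTT before it stalls waiting for window updates. A quick back-of-envelope check of the sizes discussed in this thread (my own rough numbers, nothing measured):

```rust
fn main() {
    let rtt_s = 0.100; // 100 ms round trip, as in the quoted example
    let windows = [
        ("128 KiB", 128.0 * 1024.0),
        ("1.25 MiB", 1.25 * 1024.0 * 1024.0), // quinn's default
        ("10 MiB", 10.0 * 1024.0 * 1024.0),   // libp2p's override
    ];
    for (label, window_bytes) in windows {
        // Ceiling per stream ≈ window / RTT, ignoring loss and congestion control.
        let mib_per_s = (window_bytes / rtt_s) / (1024.0 * 1024.0);
        println!("{label} window -> ~{mib_per_s:.2} MiB/s per stream at 100 ms RTT");
    }
}
```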

3 Likes

Did I mention I’m back to 0 home nodes? Because on every chunk arriving my streaming would cut out for 2 seconds, and Teams meetings where everything freezes for some seconds from time to time are simply unacceptable when working from home.

Average speed wasn’t an issue at all lately… But when everything breaks every couple of minutes, that’s not just ‘not ideal’ but simply a no-go.

5 Likes

Yes true, but if you do not optimise for small users then you will eventually be excluding them.

Also, even the suggested 128KB size does not affect the 10Gbps uplinks as much as you’d think. Interleaving will be occurring if those links are indeed being heavily utilised for Autonomi uploads: while one channel is waiting for an ACK, another (or many others) is utilising the link. If the link is not used that much, then there is no loss anyhow.
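To put a rough number on that interleaving point (my own back-of-envelope, using the 10Gbps, 10ms and 128KB figures already mentioned in this thread):

```rust
fn main() {
    // Back-of-envelope only: how many concurrent streams would a big uploader
    // need to keep a fat pipe full with a small per-stream window?
    let link_bytes_per_s = 10_000_000_000_f64 / 8.0; // 10 Gbps uplink
    let rtt_s = 0.010;                               // 10 ms round trip
    let window_bytes = 128.0 * 1024.0;               // 128 KiB per-stream window
    let per_stream_ceiling = window_bytes / rtt_s;   // bytes/s one stream can carry
    let streams_needed = link_bytes_per_s / per_stream_ceiling;
    println!("~{streams_needed:.0} concurrent streams to saturate the link");
}
```

That works out to roughly a hundred parallel streams, and a bulk uploader pushing thousands of chunks is running far more than that anyway, which is the point.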

All I am saying in my post is to consider the reasons why Autonomi is different to all the others, so that it is not forgotten. The sheer fact that Autonomi is quite a unique case means these factors must be considered too.

And I am in favour of gaining as much insight as possible.

To be clear, if we do not account for low-end (not the worst) devices, then eventually we will be excluding them because “we need to not slow down the huge guys”, which we are not targeting anyhow. We are targeting the home user, and if the datacentres miss out on that 0.5% extra performance due to the ACK/NACK cycle then so be it.

4 Likes

Is that there now? How long before it filters into the node build?

Also, if I set mine to 100KB, will that cause my nodes to fall over when they receive 4MB in one go, due to buffer overflows (either rejecting bytes beyond 100KB or an actual overflow)?

3 Likes

At the same time, if businesses and consumers, say, can’t upload their data to the network because it would take weeks or months, since they simply can’t use their 1Gbps or even 10Gbps (or more) links, that would not be good for the network either.

Internally, I suggested that if this setting has to be changed, we opt for an incremental reduction to say 2.5MB or 5MB instead of 10MB (which still has broad appeal in terms of bandwidth/latency for both ends of the spectrum), and not make a huge reduction in one go, so as to still allow high-bandwidth folks the ability to use their entire pipeline against the network, while still catering to the low-bandwidth folks’ situation.

The 2.5MB or 5MB seems to me more of a middle ground if knobs have to be adjusted, rather than going to the extreme end of the spectrum (orders of magnitude less than the current 10MB), for one or more reasons.

Overall, I am really curious what the libp2p team has to say about it (if they can recall the answers).

4 Likes

Just not happening. If they are truly using their 10Gbps uplink for uploading then the loss in performance is like 1/2 of a percent. Multiple upload streams will cover the time loss due to ACK/NACK times.

In fact, it is the smaller user with fully capable links and routers who will experience the greatest effects, with only a few chunks going up at a time; they do not benefit as much from the multiple channels.

I will be too. But I realise that I HAVE TO consider the huge difference in the number of connections involved.

3 Likes