Routers for advanced setups

I’ll see where that takes me, thanks @neo.

I thought for a while it may be broadcast storms, but I am no longer thinking that.

I have set up multiple VLANs, each on a separate bridge with its own IP pool, to segregate traffic. That was my 12-hour day. I was convinced it was the cure; sadly not.

I just can’t see anything anywhere that is causing the sporadic slowdown. It does not happen at all below 600 nodes which is why I started suspecting my ISP.

Hopefully your lead helps me past this!

I might add that during the drop, CPU load drops too; the router doesn’t seem to be working through anything. I can watch the CPU and know the issue is there the moment it drops.

Upgrading from UDM PRO to PRO MAX does exactly nothing, despite the faster CPU and double the memory. Increasing the table size leads to periodic drops and outages.

Ubiquiti support is useless*.
*) I’ll update that if they actually manage to answer my technical question.
*) EDIT: useless → useful:

In the latest version of the UDM-Pro-Max, v4.0.X, the size of the conntrack table has been increased to 524288. For the UDM-Pro, the size of the conntrack table is 65536.

I will need to upgrade to the early adopter v4.0.X software to test this out, but it is promising.
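To put the two table sizes quoted above into perspective, here is a rough Python sketch. It assumes the router exposes the standard Linux netfilter counters under `/proc/sys/net/netfilter/`, which I have not confirmed on UniFi firmware, and the 600-node figure is simply the node count discussed in this thread.

```python
from pathlib import Path

# Standard Linux netfilter counters. Whether they are exposed on a given
# UniFi firmware build is an assumption, so missing files are tolerated.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count: int, maximum: int) -> float:
    """Fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def entries_per_node(table_size: int, nodes: int) -> float:
    """Average number of table entries available to each node."""
    return table_size / nodes


def live_usage():
    """Return (count, maximum) from procfs, or None if unavailable."""
    try:
        return int(COUNT_PATH.read_text()), int(MAX_PATH.read_text())
    except (OSError, ValueError):
        return None


# Per-node budget at the two table sizes quoted above, for 600 nodes.
for size in (65536, 524288):
    print(f"{size} entries / 600 nodes = {entries_per_node(size, 600):.0f} per node")

live = live_usage()
if live is not None and live[1] > 0:
    print(f"in use: {live[0]}/{live[1]} ({conntrack_usage(*live):.0%})")
```

At 600 nodes that works out to roughly 109 entries per node with the smaller table and roughly 874 with the larger one, which is why the bigger default looks promising.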

2 Likes

What are you managing to run stable on that?

I wonder if we are experiencing the same issue.

@Erwin recommended someone who specializes in MikroTik routers. That is not much use to you, but I have a meeting with him tomorrow. Perhaps if he can solve my issue it will shed some light on yours.

2 Likes

For UDM PRO I wrote:
[quote="drirmbda, post:175, topic:39715"]
200/200Mbps ISP: 500-600 nodes
[/quote]
with increased max. table size.

For PRO MAX and faster Internet: Fewer! Increasing table size leads to outages. The default max. table size is stable but limits node count. It’s the router (kernel and modules), not the ISP.

I am surprised that there is barely anything on the Internet about pushing the limits of NAT in Linux and FreeBSD. I would not be surprised if there are hidden or lesser-known limits in these approaches. These are DIY, not enterprise grade.

3 Likes

Two options here:

  1. Your ISP is doing something that cannot deal with so many connections from one IP. It could be the mandatory connection tracking for police and other state agencies, or something else. Having a public IP doesn’t mean they aren’t doing something with your packets that keeps a connection table.
  2. A load distribution problem in that CCR. Look at the load of the individual cores; my bet is one is maxing out while the others are doing mostly nothing. On Cisco boxes load balancing is done via a hash table and you can fine-tune how the hash is created (source IP, dest. IP, source+dest. IP, IP+port, IP+MAC, …). I suppose MikroTik does something similar, but I don’t know if it is user configurable.
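The hash-input point in option 2 can be illustrated with a small Python sketch. This is not how MikroTik or Cisco actually distribute packets; CRC32 is just a stand-in hash, and the addresses, ports and flow count are made up for illustration.

```python
import zlib
from collections import Counter

CORES = 16  # assumed core count, matching a 16-core CCR


def pick_core(fields: tuple, cores: int = CORES) -> int:
    """Map a tuple of header fields to a core index via a CRC32 hash."""
    key = "|".join(str(f) for f in fields).encode()
    return zlib.crc32(key) % cores


def core_histogram(flows, use_ports: bool, cores: int = CORES) -> Counter:
    """Count how many flows land on each core for a given hash input."""
    hist = Counter()
    for src_ip, src_port, dst_ip, dst_port in flows:
        if use_ports:
            fields = (src_ip, src_port, dst_ip, dst_port)
        else:
            fields = (src_ip, dst_ip)
        hist[pick_core(fields, cores)] += 1
    return hist


# Hypothetical traffic: one node host talking to one peer over many ports.
flows = [("192.0.2.10", 40000 + i, "198.51.100.7", 12000) for i in range(1000)]

ip_only = core_histogram(flows, use_ports=False)
with_ports = core_histogram(flows, use_ports=True)
print("IP-only hash uses", len(ip_only), "core(s)")
print("IP+port hash uses", len(with_ports), "core(s)")
```

With an IP-only hash every flow between the same two addresses lands on one core, while including the ports spreads them out. That is the kind of imbalance the per-core load figures would reveal.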

I have seen one 1U server running Linux doing CGNAT for a whole ISP network with about 10k customers. It was a few years back, but you get the idea. You need Ethernet interfaces with good HW offloading, load balancing between CPU cores, and raised entry-count limits for some kernel parameters. You will find enough info on the internet, but don’t expect to find it in one simple cookbook.

5 Likes

Looks pretty good to me on the CPU front, but please keep ideas coming, I am all out of them!!

> /system resource cpu print 
Columns: CPU, LOAD, IRQ, DISK
 #  CPU    LOAD  IRQ  DISK
 0  cpu0   15%   15%  0%  
 1  cpu1   19%   19%  0%  
 2  cpu2   17%   17%  0%  
 3  cpu3   15%   15%  0%  
 4  cpu4   20%   20%  0%  
 5  cpu5   17%   17%  0%  
 6  cpu6   20%   20%  0%  
 7  cpu7   10%   10%  0%  
 8  cpu8   15%   15%  0%  
 9  cpu9   9%    8%   0%  
10  cpu10  14%   14%  0%  
11  cpu11  10%   8%   0%  
12  cpu12  10%   10%  0%  
13  cpu13  16%   16%  0%  
14  cpu14  17%   17%  0%  
15  cpu15  14%   14%  0%
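For anyone who wants to check a printout like this without eyeballing it, here is a minimal Python sketch that parses the LOAD column and compares the busiest core against the average. The parsing pattern is my own assumption based on the layout above, and the sample is only the first four rows.

```python
import re

# Matches rows such as " 0  cpu0   15%   15%  0%" and captures name and LOAD.
ROW = re.compile(r"^\s*\d+\s+(cpu\d+)\s+(\d+)%")


def parse_loads(text: str) -> dict:
    """Extract {cpu_name: load_percent} from RouterOS-style output."""
    loads = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            loads[match.group(1)] = int(match.group(2))
    return loads


def imbalance(loads: dict) -> float:
    """Ratio of the busiest core to the mean; 1.0 means perfectly even."""
    values = list(loads.values())
    return max(values) / (sum(values) / len(values))


sample = """
 #  CPU    LOAD  IRQ  DISK
 0  cpu0   15%   15%  0%
 1  cpu1   19%   19%  0%
 2  cpu2   17%   17%  0%
 3  cpu3   15%   15%  0%
"""

loads = parse_loads(sample)
print(loads, f"imbalance {imbalance(loads):.2f}")
```

A ratio close to 1 means the load is spread evenly; a single hot core would push it well above that.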

I really hope this is not the case, as it is then out of my hands. If it is, though, then multiple static IPs are an easy way to skirt the problem, but then I have a router that can run the neighborhood for nothing.

2 Likes

btw @anon26713768 have you tried the “play dumb” approach with the ISP and escalated your issues?

1 Like

No, speaking to their tech support is about as effective as asking my mother. Especially when you don’t use their equipment.

This is why I hope to speak directly with a tech at a central office.

As mentioned above, I do contract work for my ISP. I know exactly where my connection is coming from and have access to the building; I can walk right in and go find the tech who runs it.

The only problem is… it’s somewhat inappropriate :rofl:. If I can find a legitimate reason to go and bump into him, then start a casual conversation, that is a different story.

4 Likes

Ok, CPU is fine. What about interface errors or resets? A bad cable or connector can sometimes cause really weird behavior.

What are the port speeds along the path? Is flow control enabled or disabled? Sometimes you can get packet loss going from a faster interface to a slower one even if you are not hitting the maximum speed of the slower interface. There can be microbursts which temporarily overflow the buffers on the device that sits between the different-speed interfaces.
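A quick back-of-the-envelope calculation shows why microbursts can hurt at low average load. This is a Python sketch with assumed numbers (10G in, 1G out, 1 MiB of buffer, one burst every 100 ms); real switch and router buffers vary a lot.

```python
def seconds_until_overflow(in_bps: float, out_bps: float, buffer_bytes: int) -> float:
    """How long a line-rate burst can last before the egress buffer is full."""
    if in_bps <= out_bps:
        return float("inf")  # buffer drains at least as fast as it fills
    return buffer_bytes * 8 / (in_bps - out_bps)


def average_utilisation(in_bps: float, out_bps: float,
                        burst_s: float, period_s: float) -> float:
    """Average load on the slow link if one such burst arrives every period."""
    return (in_bps * burst_s / period_s) / out_bps


# Assumed numbers for illustration only.
t = seconds_until_overflow(10e9, 1e9, 1024 * 1024)
print(f"buffer full after {t * 1000:.2f} ms of line-rate burst")
print(f"average load on the 1G link: {average_utilisation(10e9, 1e9, t, 0.1):.1%}")
```

Under these assumptions the buffer fills in under a millisecond, while the slow link averages below 10% utilisation, so packets can be dropped even though the graphs show plenty of headroom.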

EDIT:
Btw, I upgraded my connection to 1/1 Gbit. Currently I am running 550 nodes (need to upgrade RAM now :smiley: ) and my MikroTik hAP ax2 is fine with that, with CPU load around 80% on all 4 cores.

3 Likes

I haven’t been able to squeeze more than 550 nodes onto any CPU I tried.

1 Like

They are not on one machine. The biggest one is 370 nodes, limited by RAM. The CPU is a Ryzen 5900X and it shows only around 60% load.

1 Like

Not sure what you mean?

I have it disabled now, but it was enabled before. I restored from a backup and must have had it disabled when I created the backup.

Thing is, it behaved the same way when I had the 5Gbps connection and when I had the CCR2016, before VLANs and after, before queues and after. I have run up to 1300 nodes via the router and resource-wise it just chugs along, a walk in the park.
This periodic drop happens regardless of any of that once I am past the 600 mark.
Might I state the obvious: it is driving me nuts.

1 Like

What are the negotiated interface speeds between the ISP, the router and the PCs running nodes? Is it 10G-10G-10G, or for example 10G-2.5G-1G?

Ahh, Auto Negotiation, Advertise 10G. Not sure if what it advertises matters when Auto is enabled; guessing not?

I am no network expert, but in my experience it’s always better to set the interface to the speed at which it will actually connect to the other device. Strange and bad things seem to happen sometimes when Auto Negotiation is used. It was a classic thing on a backup-to-tape product I had the dubious pleasure of administering.

I have seen problems both ways, some devices not liking auto negotiation, some not liking manual config. Same with flow control.

Auto negotiation not working correctly is also often an indication of something bad on the physical layer: a bad or overheating SFP module, or an out-of-spec cable.

2 Likes

At 725,000 state table entries now on the router with 375-500 Mbps symmetric traffic in steady state… crossed 100TB+ of traffic already for the month of June (most of it in the past 5 days).

Plenty of bandwidth left… only utilizing up to 10% of it at the moment.

I don’t think I have pushed my router this hard on the state table count ever before, :scream_cat: .

Just curious, is anyone here pushing over 1M state table entries from a single router (at home), specifically for safenodes?

5 Likes

That is a lot. I have 270k connections with 250-350 Mbps of node traffic. Have you played with the connection timeouts or are you using the defaults?
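The reason timeouts matter so much can be sketched with Little’s law: the number of entries in the table is roughly the rate of new connections times how long each entry lives. The Python below is only an illustration; the 1500 new connections per second is a made-up figure, not a measurement from either router.

```python
def steady_state_entries(new_conn_per_s: float, avg_lifetime_s: float) -> float:
    """Little's law: table entries = arrival rate x time each entry stays."""
    return new_conn_per_s * avg_lifetime_s


def implied_lifetime(entries: float, new_conn_per_s: float) -> float:
    """Average entry lifetime implied by a table size and an arrival rate."""
    return entries / new_conn_per_s


RATE = 1500.0  # hypothetical new connections per second, same for both cases

for entries in (270_000, 725_000):
    life = implied_lifetime(entries, RATE)
    print(f"{entries} entries at {RATE:.0f} conn/s -> ~{life:.0f} s average lifetime")

# Halving the average lifetime (e.g. shorter idle timeouts) halves the table.
print(steady_state_entries(RATE, 90))
```

At the same connection rate, the difference between 270k and 725k entries would come purely from how long idle entries are kept, which is why the timeout settings are worth comparing.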

2 Likes

I am just starting to look at setting up my router.

I noticed there is a scheduler for various tasks. I am wondering if you have one set up, and whether once you go over that figure some scheduled task takes too much time and runs at the same frequency as the “pause” you see.

Another one is the logs. Are they overflowing, with some sort of deletion happening every so often? Is the log only filling fast because of the number of nodes? If the ISP canning connections shows up as errors in the log, it could be filling fast, and every so often the router takes time to delete the oldest logs to make room.

Have you got graphing on, and is it filling your “disk”? If so, every so often the router deletes the oldest data to make room, only to have it fill up again. And only when you go over that number of nodes does it take long enough to actually slow down routing.

Just a couple of possibilities

And of course your router logs could hold the answer to your pauses

1 Like

Hey Rob, no, none of that. I spent 2 hours last night with a guy who knows MikroTik inside out. He went through everything from the logs to even my switch config.

Nothing exactly stood out that would cause my issue.
However, there are some optimizations and simplifications that can be done.

For instance, apparently if you create more than one bridge on MikroTik, only the first uses the switch chip; the rest use the CPU.

Then there were a few things not done exactly as MikroTik documentation would suggest.

Tonight he will make some changes to my configuration to be sure everything is optimized and exactly as the docs want it.

No promises as yet because he is not sure why it is happening either.

Going to be a long day of bated breath.

5 Likes