For a while I thought it might be broadcast storms, but I no longer think so.
I have set up multiple VLANs, each on a separate bridge with its own IP pool, to segregate traffic. That was my 12-hour day. I was convinced it was the cure; sadly not.
I just can’t see anything anywhere that is causing the sporadic slowdown. It does not happen at all below 600 nodes, which is why I started suspecting my ISP.
Hopefully your lead helps me past this!
I might add that during the drop, CPU drops too, it doesn’t seem to be working through anything. I can watch the CPU and know the issue is there the moment it drops.
Upgrading from UDM PRO to PRO MAX does exactly nothing, despite the faster CPU and double the memory. Increasing the table size leads to periodic drops and outages.
Ubiquiti support is useless*.
*) I’ll update that if they actually manage to answer my technical question.
*) EDIT: useless → useful:
In the latest version of the UDM-Pro-Max, v4.0.X, the size of the conntrack table has been increased to 524288. For the UDM-Pro, the size of the conntrack table is 65536.
I will need to upgrade to the early adopter v4.0.X software to test this out, but it is promising.
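For anyone who wants to watch how close a Linux-based router gets to those limits, here is a minimal sketch. It assumes shell access to the router and that the standard netfilter entries `nf_conntrack_count` and `nf_conntrack_max` are exposed under `/proc/sys/net/netfilter`; whether a given UDM firmware exposes them is an assumption, not something confirmed in this thread. The helper names are made up for illustration.

```python
from pathlib import Path

# Standard Linux netfilter location; availability on a given router is an assumption.
PROC_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(count, maximum):
    """Return the fraction of the conntrack table in use (0.0 to 1.0+)."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def read_conntrack(proc_dir=PROC_DIR):
    """Read (current entries, table limit) from the kernel's proc files."""
    count = int((proc_dir / "nf_conntrack_count").read_text())
    maximum = int((proc_dir / "nf_conntrack_max").read_text())
    return count, maximum


if __name__ == "__main__":
    if (PROC_DIR / "nf_conntrack_count").exists():
        count, maximum = read_conntrack()
        print(f"{count}/{maximum} entries, {conntrack_usage(count, maximum):.1%} used")
    else:
        print("conntrack proc entries not found on this system")
```

Logging this once a minute alongside node count would show whether the drops line up with the table filling, for example the same 65536 entries that fill the UDM-Pro table completely would only be 12.5% of the 524288 limit quoted for the Pro Max.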
@Erwin recommended someone who specializes in MikroTik routers. Not much use to you, but I have a meeting with him tomorrow. Perhaps if he can solve my issue, it will shed some light on yours.
For UDM PRO I wrote: [quote=“drirmbda, post:175, topic:39715”]
200/200Mbps ISP: 500-600 nodes
[/quote] with increased max. table size.
For PRO MAX and faster Internet: Fewer! Increasing table size leads to outages. The default max. table size is stable but limits node count. It’s the router (kernel and modules), not the ISP.
I am surprised that there is barely anything on the Internet about pushing the limits of NAT in Linux and FreeBSD. I would not be surprised if there are hidden or lesser-known limits in these approaches. These are DIY, not enterprise grade.
Your ISP is doing something that cannot deal with so many connections from one IP. It could be the mandatory connection tracking for police and other state agencies, or something else. A public IP doesn’t mean they aren’t doing something with packets that keeps a connection table.
Load distribution problem in that CCR. Look at the load of the individual cores; my bet is one is maxing out while the others are doing mostly nothing. On Cisco boxes load balancing is done via a hash table and you can fine-tune how the hash is created (source IP, dest. IP, source+dest. IP, IP+port, IP+MAC, …). I suppose MikroTik does something similar, but I don’t know if it is user configurable.
I have seen one 1U server with Linux doing CGNAT for a whole ISP network with about 10k customers. It was a few years back, but you get the idea. You need ethernet interfaces with good HW offloading, load balancing between CPU cores, and changed entry count limits for some kernel parameters. You will find enough info on the internet, but don’t expect to find it in one simple cookbook.
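To illustrate why the choice of hash fields matters, here is a toy sketch. It is not how Cisco or MikroTik actually implement it: the CRC32 hash, the field names, the core count, and the made-up flows (one LAN host opening 1000 connections) are all assumptions for illustration.

```python
import zlib
from collections import Counter


def pick_core(flow, fields, n_cores):
    """Map a flow to a core by hashing only the selected header fields."""
    key = "|".join(str(flow[f]) for f in fields).encode()
    return zlib.crc32(key) % n_cores


def core_load(flows, fields, n_cores):
    """Count how many flows land on each core for a given hash key."""
    counts = Counter(pick_core(f, fields, n_cores) for f in flows)
    return [counts.get(core, 0) for core in range(n_cores)]


# Hypothetical traffic: a single node host talking to many peers.
flows = [
    {
        "src_ip": "192.168.1.10",
        "dst_ip": f"203.0.113.{i % 250 + 1}",
        "src_port": 20000 + i,
        "dst_port": 443,
    }
    for i in range(1000)
]

# Hashing on source IP alone pins every flow to the same core.
print(core_load(flows, ("src_ip",), 4))
# Hashing on the full tuple spreads the same flows across cores.
print(core_load(flows, ("src_ip", "dst_ip", "src_port", "dst_port"), 4))
```

If the per-core graphs on the router look like the first case, the hash key (or whatever the device uses in its place) is the thing to look at, not total CPU.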
I really hope this is not the case, as it would then be out of my hands. If it is, though, multiple static IPs are an easy way to skirt the problem, but I have a router that can run the neighborhood for nothing.
No, speaking to their tech support is about as effective as asking my mother. Especially when you don’t use their equipment.
This is why I hope to speak directly with a tech at a central office.
As mentioned above, I do contract work for my ISP. I know exactly where my connection is coming from and have access to the building; I can walk right in and go find the tech who runs it.
The only problem is… it’s somewhat inappropriate. If I can find a legitimate reason to go and bump into him, then start a casual conversation, that is a different story.
Ok, CPU is fine. What about interface errors or resets? A bad cable or connector can sometimes cause really weird behavior.
What are the port speeds on the way? Flow control enabled or disabled? Sometimes you can get packet loss going from a faster interface to a slower one even if you are not hitting the maximum speed of the slower interface. There can be microbursts which temporarily overflow the buffers on the device dealing with the different speed interfaces.
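A toy model of the microburst effect, as a rough sketch only: a tail-drop queue that receives packets per tick and drains at a fixed slower rate. The buffer size, rates, and tick granularity are invented numbers, not measurements from any real switch.

```python
def simulate_microburst(arrivals, drain_per_tick, buffer_size):
    """Run a simple tail-drop queue; return (forwarded, dropped, left_in_queue)."""
    queue = 0
    forwarded = 0
    dropped = 0
    for arriving in arrivals:
        # Accept what fits in the buffer and tail-drop the rest.
        accepted = min(arriving, buffer_size - queue)
        dropped += arriving - accepted
        queue += accepted
        # The slower egress port drains a fixed amount per tick.
        sent = min(queue, drain_per_tick)
        queue -= sent
        forwarded += sent
    return forwarded, dropped, queue


# Same average load (10 packets/tick), well under the drain rate of 20/tick.
smooth = [10] * 10
bursty = [100] + [0] * 9

print(simulate_microburst(smooth, drain_per_tick=20, buffer_size=50))
print(simulate_microburst(bursty, drain_per_tick=20, buffer_size=50))
```

Both inputs average half of what the egress can carry, yet only the bursty one loses packets, which is why average utilization graphs can look fine while drops still happen.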
EDIT:
Btw I upgraded my connection to 1/1 Gbit, currently I am running 550 nodes (need to upgrade RAM now) and my MikroTik hAP ax2 is fine with that. CPU load around 80% on all 4 cores.
I have it disabled now, but it was enabled before. I restored from a backup and must have had it disabled when I created the backup.
Thing is, it behaved the same way when I had the 5Gbps connection, and I had the same issue with the CCR2016, before VLANs and after, before queues and after. I have run up to 1300 via the router and resource-wise it just chugs along, a walk in the park.
The periodic drop happens past the 600 mark regardless of any of it.
Might I state the obvious, it is driving me nuts.
I am no network expert but in my experience it’s always better to set the interface to the speed it will actually connect to the other device. Strange and Bad things seem to happen sometimes when Auto Negotiation is used. It was a classic thing on a backup to tape product I had the dubious pleasure of administering.
At 725,000 state table entries now on the router with 375-500 Mbps symmetric traffic in steady state… crossed 100TB+ of traffic already for the month of June (most of it in the past 5 days).
Plenty of bandwidth left… only utilizing up to 10% of it at the moment.
I don’t think I have ever pushed my router this hard on the state table count before.
Just curious, is anyone here pushing over 1M state table entries from a single router (at home), specifically for safenodes?
I am just starting to look at setting up my router.
Noticed there is a scheduler for various tasks. Wondering if you have one set up, and once you go over that figure some scheduled task takes too much time, and it is set at the frequency you see the “pause”.
Another one is the logs. Are they overflowing, with some sort of deletion happening every so often? Is the log only filling fast because of the number of nodes? If the ISP canning connections shows up as errors in the log, it could be filling fast, and every so often the router takes time to delete the oldest logs to make room.
Have you got graphing on, and it’s filling your “disk”? Then every so often the router deletes the oldest data to make room, only to have it fill up again. And only when you go over that number of nodes does it take long enough to actually slow down routing.
Just a couple of possibilities
And of course your router logs could hold the answer to your pauses
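One rough way to test the scheduled-task idea is to note the time of each pause and check whether the gaps are regular. This is only a sketch: the timestamps are whatever you record by hand (in seconds), and the 10% tolerance is an arbitrary assumption.

```python
from statistics import mean, pstdev


def pause_intervals(timestamps):
    """Return the gaps (in seconds) between consecutive pause timestamps."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def looks_periodic(timestamps, tolerance=0.1):
    """True if the gaps vary by no more than `tolerance` relative to their mean."""
    gaps = pause_intervals(timestamps)
    if len(gaps) < 2:
        return False  # not enough data to judge
    average = mean(gaps)
    if average <= 0:
        return False
    return pstdev(gaps) / average <= tolerance


# Hypothetical observations, seconds since the first pause.
print(looks_periodic([0, 300, 601, 899, 1200]))   # roughly every 5 minutes
print(looks_periodic([0, 40, 500, 520, 1900]))    # irregular
```

Regular gaps would point toward something on a timer (scheduler, log rotation, graph cleanup); irregular gaps would point more toward load or something upstream.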
Hey Rob, no, none of that. I spent 2 hours last night with a guy that knows MikroTik inside out. He went through everything from the logs to even my switch config.
Nothing exactly stood out that would cause my issue.
However, there are some optimizations and simplifications that can be done.
For instance, apparently if you create more than one bridge on MikroTik, only the first uses the switch chip; the rest use the CPU.
Then there were a few things not done exactly as MikroTik documentation would suggest.
Tonight he will make some changes to my configuration to be sure everything is optimized and exactly as the docs want it.
No promises as yet because he is not sure why it is happening either.