Oh yeah!! Looking forward to that feature flag.
There’s a new stable release coming, though it’s effectively just the alpha-punch code (before anyone gets overexcited there). We’ve run into a situation in the release flows where the alpha versions were getting tricky to manage because they were all being pushed to crates.io. So this should get things updated, and it adds some tweaks to avoid pushing there by default.
That will hopefully give us the flexibility we need to do prereleases a bit more freely, without bogging us down in maintaining a version history for non-stable code.
(So only stable-branch prereleases would make it out to crates.io; the rest would only get a GitHub release, all being well.)
libp2p AutoNATv2 merged!
Not yet?
Ah, merged into another PR?
Quite the contrary… Would be nice to have this before June.
The latest turns in the AutoNAT saga:
By chance I stumbled into this site talking about networking over QUIC. Way over my head, but I thought someone else here might be interested.
Who has been 007’ing github? What are we expecting?
There’s little sign of work on the issues in play as far as I can see, and no sign of a release.
Unless they’ve taken it all private. Not much seems to have happened.
I just tried to spy on @joshuef’s repo, but he has so many branches it’s a bit of a needle-in-a-haystack situation.
Not much there either:
https://github.com/joshuef/safe_network/activity
Surely we are not all following an open source project that is developed behind closed doors.
What would be the benefit or motivation behind that?
Rust’ling heard in the code forest,…
EDIT:
I just found this great article on hole punching,
Integrating it all with ICE
We’re in the home stretch. We’ve covered stateful firewalls, simple and advanced NAT tricks, IPv4 and IPv6. So, implement all the above, and we’re done!
Except, how do you figure out which tricks to use for a particular peer? How do you figure out if this is a simple stateful firewall problem, or if it’s time to bust out the birthday paradox, or if you need to fiddle with NAT64 by hand? Or maybe the two of you are on the same Wi-Fi network, with no firewalls and no effort required.
Early research into NAT traversal had you precisely characterize the path between you and your peer, and deploy a specific set of workarounds to defeat that exact path. But as it turned out, network engineers and NAT box programmers have many inventive ideas, and that stops scaling very quickly. We need something that involves a bit less thinking on our part.
Enter the Interactive Connectivity Establishment (ICE) protocol. Like STUN and TURN, ICE has its roots in the telephony world, and so the RFC is full of SIP and SDP and signalling sessions and dialing and so forth. However, if you push past that, it also specifies a stunningly elegant algorithm for figuring out the best way to get a connection.
Ready? The algorithm is: try everything at once, and pick the best thing that works. That’s it. Isn’t that amazing?
Let’s look at this algorithm in a bit more detail. We’re going to deviate from the ICE spec here and there, so if you’re trying to implement an interoperable ICE client, you should go read RFC 8445 and implement that. We’ll skip all the telephony-oriented stuff to focus on the core logic, and suggest a few places where you have more degrees of freedom than the ICE spec suggests.
To communicate with a peer, we start by gathering a list of candidate endpoints for our local socket. A candidate is any ip:port that our peer might, perhaps, be able to use in order to speak to us. We don’t need to be picky at this stage, the list should include at least:
IPv6 ip:ports
IPv4 LAN ip:ports
IPv4 WAN ip:ports discovered by STUN (possibly via a NAT64 translator)
IPv4 WAN ip:port allocated by a port mapping protocol
Operator-provided endpoints (e.g. for statically configured port forwards)
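For anyone who thinks better in code, the candidate list above could be represented roughly like this in Rust. This is only a sketch: the `Source`, `Candidate`, and `gather` names are made up for illustration and do not come from the article, libp2p, or any real crate. Actually discovering the addresses (STUN, port mapping and so on) is left out; the sketch only merges whatever each source reports and drops duplicate addresses.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Where a candidate endpoint came from (hypothetical names, mirroring the list above).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Source {
    Ipv6,
    LanV4,
    Stun,
    PortMapping,
    Operator,
}

/// One ip:port that the peer might be able to reach us on.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Candidate {
    addr: SocketAddr,
    source: Source,
}

/// Merge the addresses reported by every source into one list.
/// We are deliberately not picky: the only filtering is dropping an
/// address that an earlier source already reported.
fn gather(sources: &[(Source, Vec<SocketAddr>)]) -> Vec<Candidate> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for (source, addrs) in sources {
        for addr in addrs {
            if seen.insert(*addr) {
                out.push(Candidate { addr: *addr, source: *source });
            }
        }
    }
    out
}
```

The resulting `Vec<Candidate>` is what would be sent to the peer over the side channel.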
Then, we swap candidate lists with our peer through the side channel, and start sending probe packets at each others’ endpoints. Again, at this point you don’t discriminate: if the peer provided you with 15 endpoints, you send “are you there?” probes to all 15 of them.
These packets are pulling double duty. Their first function is to act as the packets that open up the firewalls and pierce the NATs, like we’ve been doing for this entire article. But the other is to act as a health check. We’re exchanging (hopefully authenticated) “ping” and “pong” packets, to check if a particular path works end to end.
Finally, after some time has passed, we pick the “best” (according to some heuristic) candidate path that was observed to work, and we’re done.
The beauty of this algorithm is that if your heuristic is right, you’ll always get an optimal answer. ICE has you score your candidates ahead of time (usually: LAN > WAN > WAN+NAT), but it doesn’t have to be that way. Starting with v0.100.0, Tailscale switched from a hardcoded preference order to round-trip latency, which tends to result in the same LAN > WAN > WAN+NAT ordering. But unlike static ordering, we discover which “category” a path falls into organically, rather than having to guess ahead of time.
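To make the difference between the two heuristics concrete, here is a small Rust sketch of "pick the best thing that works". It assumes the probing has already happened and each candidate carries a measured round-trip time (or `None` if no pong came back). The `Kind`, `ProbeResult`, `best_by_static_rank`, and `best_by_rtt` names are invented for this example and are not taken from ICE, Tailscale, or libp2p.

```rust
use std::net::SocketAddr;
use std::time::Duration;

/// Static preference classes. The derived ordering follows declaration
/// order, so Lan < Wan < WanNat, and "smaller" means "preferred".
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Kind {
    Lan,
    Wan,
    WanNat,
}

/// Outcome of probing one candidate endpoint.
#[derive(Debug, Clone, Copy)]
struct ProbeResult {
    addr: SocketAddr,
    kind: Kind,
    /// Measured round trip, or None if no pong was received.
    rtt: Option<Duration>,
}

/// ICE-style choice: among paths that answered, take the best static class.
fn best_by_static_rank(results: &[ProbeResult]) -> Option<SocketAddr> {
    results
        .iter()
        .filter(|r| r.rtt.is_some())
        .min_by_key(|r| r.kind)
        .map(|r| r.addr)
}

/// Latency-based choice: among paths that answered, take the lowest RTT.
fn best_by_rtt(results: &[ProbeResult]) -> Option<SocketAddr> {
    results
        .iter()
        .filter_map(|r| r.rtt.map(|rtt| (rtt, r.addr)))
        .min_by_key(|(rtt, _)| *rtt)
        .map(|(_, addr)| addr)
}
```

Both functions return `None` when nothing answered, which is the case where you would stay on the relay.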
The ICE spec structures the protocol as a “probe phase” followed by an “okay let’s communicate” phase, but there’s no reason you need to strictly order them. In Tailscale, we upgrade connections on the fly as we discover better paths, and all connections start out with DERP preselected. That means you can use the connection immediately through the fallback path, while path discovery runs in parallel. Usually, after a few seconds, we’ll have found a better path, and your connection transparently upgrades to it.
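The "start on the relay, upgrade as better paths turn up" idea can be sketched as a tiny state holder. This is a guess at the shape of the logic, not Tailscale's actual code: `Path`, `Connection`, and `on_pong` are hypothetical names, and the sketch assumes that any working direct path beats the relay and that a lower round-trip time means a better direct path.

```rust
use std::net::SocketAddr;
use std::time::Duration;

/// The path currently carrying traffic.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Path {
    /// Fallback relay, usable immediately.
    Relay,
    /// A direct path that answered a probe with the given round trip.
    Direct { addr: SocketAddr, rtt: Duration },
}

struct Connection {
    current: Path,
}

impl Connection {
    /// Every connection starts out on the relay.
    fn new() -> Self {
        Connection { current: Path::Relay }
    }

    /// Call whenever a probe to `addr` gets a pong back.
    /// Switches to the new path if it beats the current one and
    /// returns true when a switch happened.
    fn on_pong(&mut self, addr: SocketAddr, rtt: Duration) -> bool {
        let better = match self.current {
            // Assumption: any direct path is preferred over the relay.
            Path::Relay => true,
            Path::Direct { rtt: current_rtt, .. } => rtt < current_rtt,
        };
        if better {
            self.current = Path::Direct { addr, rtt };
        }
        better
    }
}
```

Traffic can flow over `Path::Relay` from the first packet, while pongs arriving later only ever move `current` to something faster.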
One thing to be wary of is asymmetric paths. ICE goes to some effort to ensure that both peers have picked the same network path, so that there’s definite bidirectional packet flow to keep all the NATs and firewalls open. You don’t have to go to the same effort, but you do have to ensure that there’s bidirectional traffic along all paths you’re using. That can be as simple as continuing to send ping/pong probes periodically.
To be really robust, you also need to detect that your currently selected path has failed (say, because maintenance caused your NAT’s state to get dumped on the floor), and downgrade to another path. You can do this by continuing to probe all possible paths and keep a set of “warm” fallbacks ready to go, but downgrades are rare enough that it’s probably more efficient to fall all the way back to your relay of last resort, then restart path discovery.
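The cheaper downgrade option described above (fall back to the relay, then restart discovery) might look something like this. It is a minimal sketch under my own assumptions: a path counts as failed when no pong has arrived within a fixed timeout, and `PathMonitor`, `on_pong`, and `tick` are illustrative names rather than anything from the article or an existing library.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Path {
    Relay,
    Direct,
}

/// Watches the selected direct path and falls back to the relay on silence.
struct PathMonitor {
    current: Path,
    last_pong: Instant,
    timeout: Duration,
}

impl PathMonitor {
    /// Start monitoring a direct path that was seen working at `now`.
    fn new(now: Instant, timeout: Duration) -> Self {
        PathMonitor { current: Path::Direct, last_pong: now, timeout }
    }

    /// Record a pong received on the current path.
    fn on_pong(&mut self, now: Instant) {
        self.last_pong = now;
    }

    /// Call periodically. If the direct path has been silent for longer
    /// than the timeout, drop to the relay and return true to signal
    /// that path discovery should be restarted.
    fn tick(&mut self, now: Instant) -> bool {
        if self.current == Path::Direct
            && now.duration_since(self.last_pong) > self.timeout
        {
            self.current = Path::Relay;
            true
        } else {
            false
        }
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside, keeps the timeout logic easy to test.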
Finally, we should mention security. Throughout this article, I’ve assumed that the “upper layer” protocol you’ll be running over this connection brings its own security (QUIC has TLS certs, WireGuard has its own public keys…). If that’s not the case, you absolutely need to bring your own. Once you’re dynamically switching paths at runtime, IP-based security becomes meaningless (not that it was worth much in the first place), and you must have at least end-to-end authentication.
If you have security for your upper layer, strictly speaking it’s okay if your ping/pong probes are spoofable. The worst that can happen is that an attacker can persuade you to relay your traffic through them. In the presence of e2e security, that’s not a huge deal (although obviously it depends on your threat model). But for good measure, you might as well authenticate and encrypt the path discovery packets as well. Consult your local application security engineer for how to do that safely.
Do we expect a fix for CPU creep in the new release?
Has it been identified as an issue? I am at the point where I’d gladly lose the benefit of uptime for a restart and better CPU.
every current setting will go out the window with node size change anyway xD
…but even worse with that cpu creep you mention - yes …
Do you get cpu creep on all machines or just the same ones every time?
This was not mentioned in Tuesday stages or in the summary. Are we sure it is part of the Monday upgrade?
What I meant up above was are the team aware of it as a problem?
For a user to give up the uptime incentive because of it is a rather big problem.
On all machines given enough time.
I find it creeps up to a point and doesn’t go higher, then goes down and back up.
vdash uses significant CPU when you have 40 or 50 nodes, and that also seems to increase up to a point.
I am guessing that both have to do with the size of the log files, as both go up as the log files grow, and nodes use more CPU with more gets/puts/etc.