Just a quick update on where we are this evening. We are testing on live networks via the droplet deployer, and we ran the network several times today.
Good news:
- With TCP we seem to be punching holes in routers, so yes, much easier for folks to use (a big well done to the crust team). [By the way, uTP is there, but it is very slow just now, so it is disabled for this test.]
- Routing behaves extremely well and quietly recovers, via the vault layer, from routing table issues.
- Generally, all working as expected… but (there is a but).
More work news:
- The network behaved well up to the point where we stressed it by randomly starting vaults from behind routers; all good there.
- Then, churning vaults a lot while creating accounts and websites etc., we saw some disconnects.
- We could push to the stage where logins were delayed or seemingly lost.
- Later this afternoon we looked further into the traffic and noticed quite large amounts in the vaults.
- This traffic can grow to such an extent that nodes miss heartbeat signals and disconnect (the heartbeat may be too aggressive); a rough sketch of this failure mode follows below.
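
To illustrate the heartbeat point, here is a minimal, hypothetical sketch (made-up types and deadline value, not the actual crust/routing code): a simple deadline check like this will drop a peer whose heartbeats are merely stuck behind heavy data traffic.

```rust
use std::time::{Duration, Instant};

// Hypothetical illustration, not the crust/routing API: if heavy data
// traffic delays heartbeat messages past the deadline, a busy-but-alive
// peer is treated as dead and disconnected.
struct Peer {
    last_heartbeat: Instant,
}

// Assumed example value; the real interval and mechanism differ.
const HEARTBEAT_DEADLINE: Duration = Duration::from_secs(5);

fn should_disconnect(peer: &Peer, now: Instant) -> bool {
    // Under load, heartbeats queue behind large data messages, so this
    // check can fire for peers that are actually still reachable.
    now.duration_since(peer.last_heartbeat) > HEARTBEAT_DEADLINE
}

fn main() {
    let peer = Peer { last_heartbeat: Instant::now() };
    assert!(!should_disconnect(&peer, Instant::now()));
}
```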
Actions:
Tomorrow morning we will alter the vaults so that they no longer send multiple copies of data down multiple routes, but instead trim this to sending one copy. This is a big reduction in Get traffic, and thankfully not a long job to change.
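
Roughly, the idea looks like the sketch below (hypothetical types and function names purely for illustration; this is not the vault implementation itself):

```rust
// Hypothetical sketch, not the actual vault code: rather than pushing a
// copy of the requested data down every parallel route, reply down one
// route only.
struct Route(u8);          // stand-in for a routing path identifier
struct Chunk(Vec<u8>);     // stand-in for a piece of stored data

fn send_on_route(route: &Route, chunk: &Chunk) {
    // In the real vaults this would hand the message to the routing layer.
    println!("sending {} bytes via route {}", chunk.0.len(), route.0);
}

// Before: one copy of the chunk per parallel route.
fn reply_all_routes(routes: &[Route], chunk: &Chunk) {
    for route in routes {
        send_on_route(route, chunk);
    }
}

// After: trim to a single copy, cutting Get reply traffic roughly by the
// number of parallel routes previously used.
fn reply_one_route(routes: &[Route], chunk: &Chunk) {
    if let Some(route) = routes.first() {
        send_on_route(route, chunk);
    }
}

fn main() {
    let routes = [Route(0), Route(1), Route(2)];
    let chunk = Chunk(vec![0u8; 1024]);
    reply_all_routes(&routes, &chunk); // old behaviour: 3 copies
    reply_one_route(&routes, &chunk);  // new behaviour: 1 copy
}
```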
We will also reduce the GROUP_SIZE and PARALLELISM counts to lighten the load there. This gives us a quick way to cut traffic and get on with tests in the community; it will take 1-2 days (or more) to fix things more properly, and we do not want to wait that long.
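
For context, these are tuning constants along these lines (the values here are assumed examples for illustration, not what ships in the crates):

```rust
// Illustrative only: hypothetical values, not the constants in the
// routing/vault crates. Lowering them means fewer nodes handle each
// request and fewer parallel copies of a message are in flight at once.
const GROUP_SIZE: usize = 8;   // assumed example: nodes in a close group
const PARALLELISM: usize = 2;  // assumed example: parallel routes per message

fn main() {
    println!("group size {}, parallelism {}", GROUP_SIZE, PARALLELISM);
}
```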
We also have a couple of less important changes we can make in routing, which are too much to go into here. Basically there are a few loose ends; we will try to get through them and get test2 stable quickly.
So bear with us; we are going like the clappers right now to push through this. We do need to fiddle with some knobs, but that’s all it is right now. There is just an awful lot of them.
The network we test on is only 100 nodes, when really it should be several thousand. That sounds bad, but in fact it is great: with so few nodes we get much less stability and we see faults that would be masked in much larger networks. In this way we can drive out edge cases quickly, which is nice to be able to do before a larger network exists. So it actually suits us to push on these edge case errors right now.
Then we will run up the network again to check aggressive churn. If all is well we should have some binaries in your hands again very soon, but we will not know whether these quick changes are enough until we measure again. We are not looking for perfection or engineering excellence here, just for things to be stable enough that larger community tests can help. They do help tremendously, and with Rust we can now make changes quickly when we spot issues, which is great.
You may think this all sounds weird, that surely these tests have been done before, but no: we have not had a live network working like this to measure until now. This is really the stage where the car is on the track and we need to adjust some valves etc. No big issue, but we are doing it as much in the open as we can.
One issue from Thursday that we have only partly cleaned up is the error messages. They are better now, but still very developer-like. They will not yet be good enough as user-facing messages, though they will be a bit better. The next test will have much better user-facing logs and much less detail.
I have seen some frustration on the forum, so I will try to communicate these points as we go along. Many of the devs do not read it, which is good in a way, as they are not under as much pressure at the moment. All eyes on testing is really what we want.
I will try to give mini updates over the next few days as we move on. That should keep everyone a bit more in the loop.
Exciting days, but with a wee bit of pressure thrown in. All good.