Testing Maximum Vaults Per Machine
I’ve been working on a tool to deploy vaults to various cloud platforms so I can run tests on very large, geographically diverse safenetworks.
The question for this post is: how many vaults should I run on each cloud vm?
A test needs to be done to see how the number of vaults and clients impacts performance.
This post outlines the format of the test and results on a local machine to give a point of reference to cloud vms in future tests.
Scenario
Someone is going to start some vaults on a cloud vm.
They must decide
- how much cpu / ram that vm has. This is chosen based on predefined cloud provider options, eg aws instance types or digitalocean droplet sizes.
- how many vaults to run on that vm. If there are only a few clients the load will be low, so lots of vaults can be started; but if there are many clients the load will be high, which may hurt performance.
Measurement
The point is to be able to supply as many resources as possible to the safenetwork while still satisfying some specific performance requirement.
eg I want my machine to be part of a safenetwork that can supply 100 clients with an upload rate of 20ms per chunk; how many vaults can I run on my machine before performance drops too much?
The client_stress_test uploads 100 immutable and 100 mutable chunks to the safenetwork, then fetches them back.
The total time for that test to run gives an indication of the performance of the safenetwork (in this case the entire safenetwork runs on one machine).
By starting X vaults on a machine and running Y simultaneous client stress tests, we get a concrete measure of the impact of vaults and clients on performance.
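For concreteness, here’s a minimal sketch of that harness in Rust, assuming hypothetical binary paths ./safe_vault and ./client_stress_test, and glossing over the network setup and any flags the real binaries need:

```rust
// A minimal sketch of the test harness: start X vaults, run Y simultaneous
// client stress tests, and time how long the whole batch takes.
// Binary paths are assumptions; real runs would need flags and network setup.
use std::process::{Child, Command};
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let num_vaults = 32;
    let num_clients = 10;

    // Start X vaults on this machine.
    let vaults: Vec<Child> = (0..num_vaults)
        .map(|_| Command::new("./safe_vault").spawn())
        .collect::<Result<_, _>>()?;

    // Launch Y simultaneous client stress tests and time the whole batch.
    let start = Instant::now();
    let clients: Vec<Child> = (0..num_clients)
        .map(|_| Command::new("./client_stress_test").spawn())
        .collect::<Result<_, _>>()?;
    for mut client in clients {
        client.wait()?; // block until every stress test finishes
    }
    println!("{} clients finished in {:?}", num_clients, start.elapsed());

    // Tear the vaults down again.
    for mut vault in vaults {
        vault.kill()?;
    }
    Ok(())
}
```

Repeating this over a grid of X and Y values gives the data for the charts below.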
The operator can then guess their likely client load, choose the safenetwork performance they desire, and from those two parameters know how many vaults to run on their machine.
If they run too many vaults they risk being punished by the safenetwork for poor performance, losing their ability to earn safecoin.
If they run too few vaults they’re paying for cloud resources that are not fully utilised.
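To make that tradeoff concrete, here’s a toy sketch of the decision, assuming a table of measured results has already been collected; all the numbers in it are illustrative placeholders, not real measurements:

```rust
// A toy sketch of the operator's decision: given measured results as
// (vaults, simultaneous clients, stress test duration in seconds) tuples,
// pick the largest vault count that still meets the performance target.
fn max_vaults(measurements: &[(u32, u32, f64)], clients: u32, max_secs: f64) -> Option<u32> {
    measurements
        .iter()
        .filter(|&&(_, c, secs)| c == clients && secs <= max_secs)
        .map(|&(v, _, _)| v)
        .max()
}

fn main() {
    // Illustrative placeholder data, not real measurements.
    let measurements = [
        (8, 10, 160.0),
        (16, 10, 161.0),
        (32, 10, 164.0),
        (64, 10, 210.0),
    ];
    // Expecting 10 simultaneous clients and requiring the test to finish
    // within 180 seconds:
    println!("{:?}", max_vaults(&measurements, 10, 180.0)); // Some(32)
}
```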
This particular test is constrained by cpu, not memory or network factors.
Results
The results below are from running a safenetwork on an Intel i7-7700 desktop cpu at 3.60GHz.
The vault version is alpha2 (0.17.2).
Performance
How is the safenetwork performance affected by number of vaults and number of clients?
The chart below shows the stress test takes 164s (just under 3m) to run for a safenetwork of 32 vaults and 10 simultaneous clients. This improves to 77s (just over 1m) if only 1 client is on the safenetwork.
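As a rough back-of-the-envelope, treating uploads and fetches as equally expensive: each stress test performs 400 operations (200 uploads plus 200 fetches), so from an individual client’s point of view that’s about 164/400 ≈ 0.4s per operation with 10 clients, versus 77/400 ≈ 0.2s with 1 client.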
Slowing
As a slight tangent: in previous posts in this thread the safenetwork became slower as the file was uploaded. Does this test also slow down as the safenetwork accumulates chunks? The chart below shows that it doesn’t. Note this test uses client_stress_test, whereas previous tests used the nodejs apps to upload files.
But looking at the upload chart with more clients, some chunks take a very long time even though most chunks fall in a reasonable performance range. I’m not sure what to take away from this, but it may indicate room for further investigation and optimisation.
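For what it’s worth, both questions (is there a slowdown trend, and how bad is the long tail) can be quantified from the raw per-chunk timings. Here’s a minimal sketch, assuming the timings have already been extracted into a list of durations in milliseconds (the values below are made up):

```rust
// A minimal sketch of checking per-chunk timings for slowing and for
// long-tail outliers. The durations here are made-up example values.
fn main() {
    let chunk_ms: Vec<f64> = vec![380.0, 410.0, 395.0, 2900.0, 402.0, 388.0];

    // Least-squares slope of duration against chunk index: a clearly
    // positive slope would mean the network slows as it accumulates chunks.
    let n = chunk_ms.len() as f64;
    let mean_x = (n - 1.0) / 2.0;
    let mean_y = chunk_ms.iter().sum::<f64>() / n;
    let (mut cov, mut var) = (0.0, 0.0);
    for (i, &y) in chunk_ms.iter().enumerate() {
        let dx = i as f64 - mean_x;
        cov += dx * (y - mean_y);
        var += dx * dx;
    }
    println!("slope: {:.2} ms per chunk", cov / var);

    // Flag the long-tail chunks, here anything over 3x the mean duration.
    for (i, &y) in chunk_ms.iter().enumerate() {
        if y > 3.0 * mean_y {
            println!("chunk {} took {} ms", i, y);
        }
    }
}
```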
Client Luck
Are some clients really unlucky and have unusually slow tests, or are most clients roughly equal? As the chart below shows, some clients are fairly lucky, but no client is especially unlucky.
Points Of Interest
The safenetwork performance decreases as more clients use it, which is in line with intuition. Doubling the number of clients almost doubles the test duration.
The safenetwork performance is basically unaffected by safenetwork size: doubling the number of vaults barely changes the test duration. I was surprised by this result. Presumably at some point vaults will begin failing, but the ease with which the safenetwork scales up in size is pretty impressive.
It would seem that safenetwork performance depends mainly on the number of simultaneous clients.
Caveats
This test runs the vaults and the stress test on the same machine, so the load comes from both. Ideally the load would come only from the vaults, but the cpu load from the client stress test is low (less than 1%), so I’m not too worried about its impact.
The test ignores network effects like bandwidth and latency. This is deliberate: it aims to be a comparable test between different vault versions, to see if vault performance improves with each new release. It’s testing vault performance, not network performance. In reality a vault operator will probably choose the number of vaults based on network performance rather than cpu performance, but for some low powered devices cpu may be the bottleneck, in which case this test becomes very useful.
Future Work
I want to run this test on the different cloud vms to see whether there’s a plateau in performance, ie a point where greater resources stop providing greater benefit. This tests cpu constraints.
I want to run the test on a small but globally distributed safenetwork with the ‘cpu optimum’ configuration and compare how much network factors like latency and bandwidth affect performance. This tests network constraints.
I want to run a very large globally distributed safenetwork to see how much effect the extra hops have compared with a small global safenetwork. This tests safenetwork size constraints.
It’s not very useful to test ‘advanced’ constraints without having some existing knowledge of the ‘simpler’ constraints as a point of comparison.