Profiling node performance

Hi, @mav,

Further to your profiling result, the vaults’ CPU consumption pattern is a bit clustered because:
1. When a client connects to the network, all its messages go through its proxy.
2. A put request goes to the proxy first, then to the ClientManagers, then to the DataManagers.
3. There is an accumulator in each group (ClientManagers or DataManagers) which has the additional work of collecting signatures and swarming messages.
The accumulator of the ClientManagers is always the node closest to client_name, while the accumulator of the DataManagers is the node closest to data_name (a small illustrative sketch of this closest-name selection follows below).
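
To make “closest” concrete: closeness here is XOR distance between names, so the accumulator is simply the group member whose name XORs to the smallest value against the target name. A minimal illustrative sketch (not the actual vault code; the 32-byte name size and the helper names are assumptions):

// Illustrative only: the accumulator of a group is the member whose name is
// XOR-closest to a target name (client_name for the ClientManager group,
// data_name for the DataManager group).
type XorName = [u8; 32];

// XOR distance from `name` to `target`; comparing the byte arrays
// lexicographically is the same as comparing the distances as 256-bit
// big-endian integers.
fn xor_distance(name: &XorName, target: &XorName) -> XorName {
    let mut dist = [0u8; 32];
    for i in 0..32 {
        dist[i] = name[i] ^ target[i];
    }
    dist
}

// Pick the accumulator: the group member closest to `target` in XOR space.
fn accumulator<'a>(group: &'a [XorName], target: &XorName) -> Option<&'a XorName> {
    group.iter().min_by_key(|name| xor_distance(name, target))
}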

Hence, the proxy node and the accumulator of the ClientManagers share the highest workload (50_4 at 544.41 and 51_3 at 481.35), and the GROUP_SIZE nodes acting as ClientManagers share a higher workload than the ordinary DataManagers.

The total number of copies you got in the network is 848; divided by 8 that means 106 chunks, which means there are 9 chunks for account or directory info. These 9 chunks are small and quick to handle. There is a chance that the nodes holding only 16 copies hold most of those 9 small chunks, and hence have a very light share of the workload.

However, I don’t see why the two nodes 50_2 (416.57) and 50_3 (411.70) share almost the same workload as the proxy or the accumulator.

Also, I would like to confirm with you that the node’s CPU time is measured by calculating from the logged timestamps, right? And are those 28 vaults running on just one machine?

Cheers,

btw, is it possible for you to send me the logs you have for that profiling test? Thank you very much.

12 Likes

28 vaults are running across 7 quad-core pine64s, so each vault should have a full CPU core to work with at all times.

CPU time is measured using precise_time_ns() together with defer!, in the following way:

fn add_entry(&mut self) -> Result<(), MutationError> {
    println!("{} rprof_start maid_manager.rs:62:add_entry", precise_time_ns()); // auto added by rprof
    defer!(println!("{} rprof_end maid_manager.rs:62:add_entry", precise_time_ns())); // auto added by rprof
    if self.space_available < 1 {
    ...
}

CPU time for each method is the time spent executing that method. Time spent executing other timed methods called inside it is not included in that method’s cputime, e.g.

fn do_a_thing() {
    call_to_start_timing;
    call_to_defered_end_timing;
    other_untimed_function();  // 3s, counted towards do_a_thing
    other_timed_function();    // 4s, counted towards itself, not do_a_thing
    other_untimed_function();  // 5s, counted towards do_a_thing
}

The cputime profile would look like:

8 do_a_thing
4 other_timed_function
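
For anyone curious, here is a hedged sketch of how the paired rprof_start / rprof_end lines could be turned into per-method self time (nested timed calls subtracted from their parent, matching the convention above). It assumes a single vault’s log with no interleaving between threads, and is not necessarily how the analysis script linked below works:

use std::collections::HashMap;

// Accumulate per-method "self" time from lines of the form
// "<ns> rprof_start <id>" / "<ns> rprof_end <id>"; other log lines are skipped.
fn self_times(log: &str) -> HashMap<String, u64> {
    let mut totals: HashMap<String, u64> = HashMap::new();
    // Stack of (method id, start ns, ns spent in timed children so far).
    let mut stack: Vec<(String, u64, u64)> = Vec::new();

    for line in log.lines() {
        let mut parts = line.split_whitespace();
        let (Some(ts), Some(kind), Some(id)) = (parts.next(), parts.next(), parts.next()) else {
            continue;
        };
        let ts: u64 = match ts.parse() {
            Ok(t) => t,
            Err(_) => continue,
        };
        match kind {
            "rprof_start" => stack.push((id.to_string(), ts, 0)),
            "rprof_end" => {
                if let Some((name, start, child_ns)) = stack.pop() {
                    let total = ts - start;
                    // Self time excludes time spent in nested timed calls.
                    *totals.entry(name).or_insert(0) += total - child_ns;
                    // The whole call counts as child time for the caller.
                    if let Some(parent) = stack.last_mut() {
                        parent.2 += total;
                    }
                }
            }
            _ => {}
        }
    }
    totals
}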

Files:

  • logs_vault_0-13-1_timing.zip test15 / mega
  • rprof (a tool for adding timing to every method, ‘rust profiler’) test15 / mega
  • log analysis script test15 / mega
8 Likes

Hi, @mav,

Thank you very much for the logs. It does help me a lot.
And here is my understanding of why the CPU usage time is in such a clustered pattern:
1. 50_4 is confirmed to act as the accumulator of the ClientManager group.
It handled over 35k routing messages with a total size of 470MB, and stored 35 copies.
2. 51_3 is confirmed to act as the proxy to the client, and also as a member of the CM group.
It handled over 20k routing messages with a total size of 658MB, and stored 38 copies.
3. 50_2, 50_3, 53_2, 54_2, 54_3 and 55_1 are the other members of the CM group.
They each handled over 15k routing messages, with total sizes ranging from 4.5MB to 12MB, and stored 32 - 39 copies.
4. The other vaults handled fewer than 5k routing messages, and some stored only 16 copies.

The heavy load on 50_4, together with the two medium-duty nodes 50_2 and 50_3, means they affect each other (the disk and network adaptor are still shared, and their names are close so they handle almost the same chunks at the same time). Hence their performance is much slower than that of the other nodes.
For example, a chunk’s put operation took around 0.0134s to complete on 54_2, but the same chunk took 0.54s on 50_2, 0.54s on 50_3 and 0.42s on 50_4.

This explains why the two nodes 50_2 and 50_3 required almost the same CPU time as 50_4 and 51_3, and why the vaults sitting on machine 51 ranked higher than the others.

Regarding the chunk_store::put method taking a long time to complete, a task ([MAID-2033] - JIRA) has now been raised and there will be some work done soon to address this.

Regarding the proxy node being a bottleneck, some thoughts have been discussed. However, the possible solutions raise concerns around the bootstrap flow and security, so it will take some time before any conclusion can be reached.

And thanks again for your profiling work; it really does help us a lot.

Cheers,

13 Likes

[image]

What is this picture? Is it a well-known programming image or something?

I remembered it from @mav but recently saw it on the profile of a bitcoin wallet dev I follow on GitHub.

Made me curious. Same profile pic, but I doubt you guys are the same person.

1 Like

It’s a picture of me. Same person here and on github :slight_smile: I am probably best known for the bip39 tool.

15 Likes


Practically everyone who has been in the crypto world for years has used, at one time or another, Ian Coleman’s bip39 tool.
It is a pleasure and a privilege to have him here.

9 Likes

Scaling and Performance

This test is for the impact of scale on vault performance. How does the network perform with various numbers of vaults?

Summary

Here’s a summary of the result for those with an aversion to long reads:

Larger network size does not seem to affect overall performance. This is definitely not what I expected.

Setup

The network uses AWS EC2 t2.micro instances with 8 vaults running per instance.

This differs from prior tests which used a cluster of 7 pine64s.

Vault version is 0.17.2 and min_section_size is 8 vaults (same as alpha2).

Method

The smallest network tested was 8 vaults (1 instance * 8 vaults per instance).

The largest was 160 vaults (20 instances).

Each network size was tested five times with the client_stress_test (with the default 100 immutable and 100 mutable chunks). The test was timed using the Linux time command.

It’s worth pointing out that I mainly intended to explore how the number of routing hops affects performance. I expected the main performance impact of scaling up to come from the increased number of hops required, as outlined below:

Nodes                    8   16   24    40    80    160    ...   8K    8M   8B
Sections                 1    2    3     5    10     20          1K    1M   1B
Avg hops log2(sections)  0    1    1.5   2.3   3.3    4.3  ...  ~10   ~20  ~30

Even though the largest network has 20 times more nodes than the smallest, the performance should only be expected to be around 4.5 times worse (if hops were the only factor in performance).
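
As a quick sanity check on those figures, here is a tiny sketch of the back-of-envelope estimate behind the table (sections ≈ nodes / min_section_size, average hops ≈ log2(sections); the min_section_size of 8 comes from the setup above):

// Reproduce the rough hop expectations from the table above.
fn expected_avg_hops(nodes: f64, min_section_size: f64) -> f64 {
    let sections = nodes / min_section_size;
    sections.log2().max(0.0)
}

fn main() {
    for nodes in [8.0, 16.0, 24.0, 40.0, 80.0, 160.0] {
        println!("{:>4} nodes -> ~{:.1} avg hops", nodes, expected_avg_hops(nodes, 8.0));
    }
    // Prints roughly 0, 1, 1.6, 2.3, 3.3 and 4.3, in line with the table.
}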

The first test was conducted using the largest network size, which was then reduced for the following tests down to the smallest network. This may affect the number of sections, since the merge and split rules have some leeway.

Results

The table shows the total seconds for client_stress_test to complete.

Netsize  Median | Test1  Test2  Test3  Test4  Test5
      8     349 |   350    340    344    356    349
     16     303 |   294    303    303    350    299
     24     289 |   283    289    289    289    288
     40     293 |   403    333    277    285    293
     80     316 |   338    297    297    316    351
    160     277 |   278    302    277    274    275

It’s pretty clear that network size does not significantly affect performance, at least not under these test conditions. Maybe it needs a much larger network before network size becomes a factor. Maybe it never becomes a factor and some other component besides the hops is the primary bottleneck.

This is a very surprising result to me, as I expected a 20-times larger network to have significantly more routing overhead than the smaller network. But in the end the networks were roughly comparable in their performance.

A pleasant surprise indeed.

Observations

The largest network size was estimated to be 148 vaults:

| Node(94916b..(100)) - Routing Table size:  73 |
| Estimated network size: 148                   |

I didn’t verify from the vault logs how many of the 160 never connected, so there may have been slightly fewer connected vaults per test than the number of running vaults.


This was a relatively basic test, so take it with a grain of salt. I just wanted to do a very basic scaling test that took no more than a day to run. It would be fun to do a deeper test at larger scale but that takes a lot more preparation and tooling. Since this basic test has piqued my curiosity I’m hoping to do a more detailed one in the future.


The vaults were all running in Northern Virginia, US, with the client doing the stress testing from Melbourne, AU. This means the client-vault latency and bandwidth may have been a factor in the performance, but vault-vault latency and bandwidth should not. In future it would probably be better to run the stress test in the cloud as well, to reduce latency and bandwidth variation from the client.


The use of t2.micro instances was arbitrary and may not have been the best choice for this test. A more powerful machine running the vaults might affect network performance more than the impact of scale does.

However, the CPU load never exceeded 25%, so this test did not appear to be CPU-bound.

Likewise, memory usage never exceeded 25%.

During churn there were some big spikes in CPU load, but these happened between tests so they don’t affect the results.


A big thank you to the MaidSafe team for making it easier to run local networks.

Conclusion

Network scale does not seem to affect the overall performance.

There are a lot of factors at play in this particular test. Hops due to routing may simply not be significant, especially at this relatively small scale. I’m pretty wary of drawing conclusions from this test.

25 Likes

Awesome work @mav! Interesting results too and rather positive!

12 Likes

Brilliant work @mav and very interesting results. I love your performance tests! These thoughts occur to me:

  • something other than hops is dominating performance in your tests, such as message processing, vault storage access, vault computations, or even client operations, so to understand the results it would help to know what this is and why. It also seems possible that it was the client-cloud latency, although I’m not sure why performance would improve slightly with more vaults in that case.
  • the actual network experience will include, or even be dominated by, data access, whereas you are testing upload, so a mixed test, or a separate download-only test (or both!), would also be interesting.

Thanks for sharing this :slight_smile:

14 Likes

Interesting. Love these tests that you do @mav. I’m sure you probably do them to satisfy your own curiosity, but it is very interesting for most of us too (even the less technical folks), so making the effort to write it up for us is really appreciated.

10 Likes

The stress test does both PUT and GET, but I agree it would be interesting to do a more detailed test of GETs. This would also hopefully show how caching affects the performance.

I wanted to test whether the client had impacted the results, and whether running the client closer to the vaults would better isolate the performance impact of routing hops. The results below seem to suggest the client-vault connection between AU and US was the main bottleneck in the prior tests (specifically the latency of that connection).

The stress test was run with various client cpu and network conditions:

#       Vaults         Client   Conn | Median | Test1  Test2  Test3  Test4  Test5  | cpu %
1   Local (AU)   i7-7700 (AU)  Local |     74 |    75     76     74     74     74  | 35
2   Local (US)  t2.micro (US)  Local |    151 |   151    150    154    152    151  | 93
3  Remote (US)  t2.micro (US)    LAN |    145 |   142    146    145    148    145  | 88
4  Remote (US)  t2.micro (AU)   VLAN |    299 |   297    299    298    420   1109  | 41
5  Remote (US)   i7-7700 (AU)   ADSL |    349 |   350    340    344    356    349  | ??

The important tests to compare are

  • 3 vs 4 - the only difference should be latency from AU to US, which greatly reduces the performance of the stress test. Latency seems to be a big factor in performance, more than doubling the time to complete the test.

  • 4 vs 5 - the only difference should be bandwidth, since ADSL has lower bandwidth than the VLAN. Lower bandwidth does slow the test further, but latency had a much greater impact.

And of lesser importance but equal interest

  • 1 vs 2 shows the cpu of t2.micro is not beefy enough for some loads, maxing out the cpu.

  • 2 vs 3 shows that reducing the load on the t2.micro does speed the test up, and that the CPU is the bottleneck, not the network (since the test got faster despite the added network overhead).

The Cloud really throws a lot of extra factors into the testing mix. Performing a ‘basic’ test using the cloud is less compelling since the fine details matter much more. The pine64s were slow and crummy but they definitely isolated a lot of factors.

These results have motivated me to conduct a more precise test to accurately profile the bottlenecks. Currently I don’t feel comfortable pointing to any particular area as the bottleneck, nor saying where efforts toward improvement should be directed.

18 Likes

Thanks @mav. If I understand correctly, it is very interesting to:

  • confirm that client-network latency is important and in certain tests can be a dominant factor. This shows how crucial the test setup can be, and will make the community vault tests all the more important when they resume. :slight_smile:

  • that (again, in certain configurations) CPU can be dominant over hops, although I think we can expect there is room for significant code optimisation. So this effect may not remain in the final system.

3 Likes

Always such a pleasure to read your posts @mav. Great stuff. :slight_smile:
I agree that it would be very interesting with separate profiling for GETs and PUTs.

8 Likes

Testing Maximum Vaults Per Machine

I’ve been working on a tool to deploy vaults to various cloud platforms so I can do tests on very large and geographically diverse safenetworks.

The question of this post is how many vaults should I run on each cloud vm?

A test needs to be done to see how the number of vaults and clients impacts performance.

This post outlines the format of the test and results on a local machine to give a point of reference to cloud vms in future tests.

Scenario

Someone is going to start some vaults on a cloud vm.

They must decide

  • how much cpu / ram that vm has. This is chosen based on predefined cloud provider options, eg aws instance types or digitalocean droplet sizes.
  • how many vaults to run on that vm. If there are not many clients the load will be low so lots of vaults can be started, but if there are many clients the load will become high and this may impact performance.

Measurement

The point is to be able to supply as many resources as possible to the safenetwork while still satisfying some specific performance requirement.

eg I want my machine to be part of a safenetwork that can supply 100 clients with an upload rate of 20ms per chunk; how many vaults can I run on my machine before performance drops too much?

The client_stress_test uploads then fetches 100 immutable and 100 mutable chunks to the safenetwork.

The total time for that test to run gives an indication of the performance of the safenetwork (in this case the entire safenetwork runs on one machine).

By starting X vaults on a machine and running Y simultaneous client stress tests there can be a definite measure of the impact of vaults and clients on performance.
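
To illustrate the measurement, here is a minimal sketch (the ./client_stress_test path is an assumption, and the real harness may launch and time the clients differently) of running Y stress tests simultaneously against an already-running network and timing them:

use std::process::Command;
use std::thread;
use std::time::Instant;

// Launch Y stress-test clients at the same time and report how long each
// takes, plus the overall wall-clock time.
fn main() {
    let clients = 10; // Y simultaneous clients
    let start = Instant::now();

    let handles: Vec<_> = (0..clients)
        .map(|i| {
            thread::spawn(move || {
                let status = Command::new("./client_stress_test")
                    .status()
                    .expect("failed to run client_stress_test");
                println!("client {i} finished after {:?} ({status})", start.elapsed());
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    println!("all {clients} clients done in {:?}", start.elapsed());
}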

The operator can then guess their likely client load, choose the safenetwork performance they desire, and from those two parameters know how many vaults to run on their machine.

If they run too many vaults they risk being punished by the safenetwork for not being performant and thus lose their ability to earn safecoin.

If they run too few vaults they’re paying for cloud resources that are not fully utilised.

This particular test is constrained by cpu, not memory or network factors.

Results

These results below are for running a safenetwork on an intel i7-7700 desktop cpu at 3.60GHz.

The vault version is alpha2 (0.17.2)

Performance

How is the safenetwork performance affected by number of vaults and number of clients?

The chart below shows the stress test takes about 3 minutes (164s) to run for a safenetwork of 32 vaults and 10 simultaneous clients. This improves to about 1 minute (77s) if only 1 client is on the safenetwork.

[chart: vault_client_performance]

Slowing

As a slight tangent: in previous posts in this thread the safenetwork became slower as a file was uploaded. Does this test also slow down as the safenetwork accumulates chunks? The chart below shows that it doesn’t. Note this test uses client_stress_test, whereas the previous tests used the nodejs apps to upload files.

But looking at the upload chart with more clients, there seem to be some chunks that take a very long time even though most chunks fall within a reasonable performance range. I’m not sure what to take away from this, but it may indicate further room for investigation and optimisation.

Client Luck

Are some clients really unlucky and have unusually slow tests, or are most clients roughly equal? As the chart below shows, some clients are fairly lucky, but no client is especially unlucky.

Points Of Interest

The safenetwork performance decreases as more clients use it, which is in line with intuition. Doubling the clients almost doubles the slowness.

The safenetwork performance is basically unaffected by safenetwork size. Doubling the safenetwork size does not affect slowness. I was surprised by this result. Presumably at some point vaults will begin failing, but the ease with which the safenetwork scales up in size is pretty impressive.

It would seem that safenetwork performance depends mainly on the number of simultaneous clients.

Caveats

This test runs the vaults and the stress test on the same machine, so the load comes from both. Ideally the load should only come from the vaults, but the CPU load from the client stress test is not high (less than 1%), so I’m not too worried about its impact.

The test ignores network effects like bandwidth and latency. This is because it aims to be a comparable test between different vault versions, to see if the vault performance improves with each new release. It’s testing vault performance, not network performance. In reality, a vault operator will probably choose the number of vaults based on network performance rather than cpu performance, but for some low powered devices cpu may be the bottleneck in which case this test becomes very useful.

Future Work

I want to run this test on the different cloud vms to see if there’s a plateau in performance and whether greater resources stop providing greater benefit. This tests cpu constraints.

I want to run the test on a small but globally distributed safenetwork with the ‘cpu optimum’ configuration and compare how much network factors like latency and bandwidth affect performance. This tests network constraints.

I want to run a very large globally distributed safenetwork to compare how much effect hops have compared with a small global safenetwork. This tests safenetwork size constraints.

It’s not very useful to test ‘advanced’ constraints without having some existing knowledge of the ‘simpler’ constraints as a point of comparison.

39 Likes

Great work @mav and promising results too!

5 Likes

Hi @mav – very nice work and findings. Question for you re: “The safenetwork performance decreases as more clients use it, which is in line with intuition. Doubling the clients almost doubles the slowness.
The safenetwork performance is basically unaffected by safenetwork size. Doubling the safenetwork size does not affect slowness.”
Does the latter still hold true when doubling the safenetwork size after the slowness has already been doubled by doubling the clients? I wonder if there’s a clients-to-network-size density threshold where the slowness kicks in (forgive me if this has previously been discussed/investigated).

5 Likes

Yes.

Consider the example of 16 vaults and 5 clients, which takes 95s to complete.

Doubling just the clients to 10 almost doubles the time, to 167s.

Doubling just the network size to 32 keeps roughly the same time, 92s.

Doubling both the network size to 32 and the clients to 10 also roughly doubles the time, to 164s.

So network size has very little effect (at least for this test) on the time to run the stress test.

This is a good question and I’ll try running some much larger networks to see if I can get a doubling of time just from increasing network size. Some of the AWS instances I’ve been testing definitely showed some interesting threshold behaviour; I’ll hopefully have those tests completed soon.

8 Likes

Very interesting indeed. Thanks for the additional color.

1 Like

Hi Mav,

I am curious to know if this: [quote=“Bogard, post:83, topic:10331”]
The safenetwork performance decreases as more clients use it, which is in line with intuition. Doubling the clients almost doubles the slowness.
[/quote]

is something that was expected and/or can be easily fixed, or not? It reads to me like a serious problem.

Thanks for clarifying!
André

2 Likes