I think it’s the RPC calls that are causing the safenode-manager status
to be resource intensive, but I still don't know why. Hopefully I can get some time soon to look at some node manager stuff again.
At the moment I am running 400 nodes on a 24-thread dedicated machine with load averages of around 2 per core. So safenode-manager status will complete, but because of the number of nodes it has to get through, load averages skyrocket.
Could a small file be written alongside the one containing the pid in the node folder when the node starts?
Then no need to touch the metrics?
I'm pretty sure @chriso has mentioned before that safenode-manager was never conceived to run hundreds of nodes.
It may make more sense to expose a <port>/metadata
or <port>/metrics/metadata
(basically a different URL on that same port that hosts the OpenMetrics output) for these non-core metrics, in terms of bucketing or classification, especially as type INFO, where the value doesn't change during the lifetime of the pid.
The user can then hit either or both of these endpoints, if desired.
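For illustration only, a minimal sketch of what such an info-style metric could look like; Python's prometheus_client here is just a stand-in for whatever the node actually uses, and the metric/label names are made up:

```python
# Hypothetical sketch only: an OpenMetrics "info"-style metric whose label
# values never change during the lifetime of the pid. Metric and label names
# are made up; prometheus_client stands in for whatever the node really uses.
from prometheus_client import Info, start_http_server
import time

node_metadata = Info("safenode_metadata", "Static metadata for this node instance")
node_metadata.info({
    "peer_id": "12D3KooWExample",      # placeholder values
    "version": "0.1.0",
    "pid": "12345",
    "root_dir": "/var/safenode/node1",
})

if __name__ == "__main__":
    start_http_server(9100)            # serves safenode_metadata_info{...} 1 on /metrics
    while True:
        time.sleep(60)
```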
That would be fantastic and very much appreciated if it were possible.
Location here means xor address, not geolocation. Chunking gives pseudo-random xor addressing, so geolocation doesn't matter.
I am working on testing this. My previous test used files containing a text counter padded with leading spaces. It produced 3 chunks per number, and one chunk was the same for all of them. But it showed a really skewed choice of nodes in a 10-node local network, with 3 nodes getting the payments.
But an initial test run of the testing scripts showed, for a 50-node local network, that after a few hundred file uploads all the nodes had earned reasonable amounts, with some much higher, so the effect was smaller but still there. This time the file contents were still based on a counter, but the count seeded the contents to produce unique file contents and, more importantly, unique chunks (verified for the first 5 million counts).
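For illustration, a rough sketch of that seeding idea (not the actual test script; the hashing, sizes and paths are arbitrary choices):

```python
# Rough sketch of counter-seeded test files: each counter value yields unique
# contents, so every chunk is unique. Sizes and paths are arbitrary choices.
import hashlib
import os

def make_test_file(path, counter, size=2 * 1024 * 1024):
    """Fill `path` with `size` bytes derived from `counter` via a hash stream."""
    block = hashlib.sha256(str(counter).encode()).digest()
    with open(path, "wb") as f:
        written = 0
        while written < size:
            block = hashlib.sha256(block).digest()  # extend the pseudo-random stream
            f.write(block)
            written += len(block)

if __name__ == "__main__":
    os.makedirs("test_files", exist_ok=True)
    for i in range(100):
        make_test_file(f"test_files/file_{i:06}.bin", i)
```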
I have an issue with safenode-manager borking on registry refresh when I have hundreds of nodes. The machine doesn't break a sweat with 300 local nodes, but node manager does, and so does interconnectivity. Today I am changing over to scripting the node starting and the genesis event myself, with the debugging; it's not quite as clean as I'd want, for various reasons.
When I get the test running we can crawl through the output, which will show all the metrics for quoting and the wallet balances for the nodes in response to the chunks uploaded. I estimate that a 1000-node testnet on this machine (48 threads, 256 GB mem) will be fine. Testnet nodes use a whole lot less CPU than live nodes.
Although initial testing suggests the more diverse the uploaded data, the more random the payments.
The problem is utilisation: dividing up the sources of information means the node has to respond two or more times to a script trying to obtain enough information to do whatever it's trying to do.
Like I want to run hundreds of nodes in a local network and the script wants information, so I have to use 2 or 3 calls to the node using 2 or 3 different interfaces to get it. This requires more coding and, worse, more CPU resources to get the info. For hundreds of nodes this is a lot of wasted time.
What would be better than just version/peerID is an information block on the main parameters for the node, e.g. peerID, xor address, version, IP address, port number, etc.
@aatonnomicc what I do is capture the text from the node starting, which contains that info, and store it in a file containing one line per node. The line holds the peerID, the ports in use (whatever I started the node with), the node number or identifier, and so on.
Understanding this, would it be possible to have a separate version obtained the same way as 127.0.0.1:port#/metrics, but instead of using metrics as the page, use information or whatever? So basically it's the metrics in a form more suitable for reading in a script, which also contains the major parameters the node is currently using. Almost a debug output.
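In the meantime, polling every node's /metrics endpoint from a script looks something like this sketch (the port range is an assumption; substitute whatever ports the nodes were actually started with):

```python
# Sketch: poll the /metrics endpoint of many local nodes directly, skipping
# safenode-manager. The port range is an assumption; substitute the metrics
# ports your nodes were actually started with.
import concurrent.futures
import urllib.request

PORTS = range(13000, 13300)   # hypothetical metrics ports for 300 local nodes

def scrape(port):
    url = f"http://127.0.0.1:{port}/metrics"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return port, resp.read().decode()
    except OSError:
        return port, ""        # node not running or endpoint unreachable

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        for port, body in pool.map(scrape, PORTS):
            for line in body.splitlines():
                if line and not line.startswith("#"):   # skip HELP/TYPE comments
                    print(port, line)
```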
On the local testnet I ran up, updating the registry took well over an hour and the CPU was never at 100%; it just cycled high and low till it finished and output the result. I only had 300 nodes; I wanted 1000, but safenode-manager would spit the dummy before then, usually with a "cannot get rpc response". Some race condition that happens occasionally, I bet.
Personally I have done workarounds and don't even use metrics now. Too slow anyhow. The info I want is in the logs, and I keep separate data files, built from startup, on the basic node info I need. The data mining skills I honed contracting to big businesses pay off here, since the logs have a wealth of information.
There is a pid file in the node's data directory already, created when the node is starting.
Yes I understand that much, but what I clearly am not understanding is why it will be random and evenly spread out when only a handful of nodes are uploading to a network this size.
Why is it not going to be concentrated in certain sections?
Not sure what you are saying. The theory is:
Peers have random xor addresses and should be evenly spread across the xor space.
The chunks produced are fairly random due to the self encryption
The location (xor or geo) of the uploader has no bearing on the chunk's xor address, nor on the xor addresses of the nodes in the network. The uploader's xor address is immaterial in relation to uploading.
Now that was the theory. In actuality there would need to be a near-infinite number of nodes and a near-infinite number of chunks uploaded before we see true randomness, barring faults in SE.
In practice there will be a non-even spread of nodes across xor space, but it should be good enough that it's more of an observation than a cause of issues. Same for SE, but it is SE that might have biases more than node addresses.
There is nothing in the upload process by which having the uploaders in one location (xor or geo) affects where the chunks are stored. Random* chunk xor addresses are generated and stored in a network with random node xor addresses.
In case you are still thinking there are sections of any sort: there are no sections of any kind in the network. Yes, a chunk has its 5 closest nodes, but that is because of the xor addresses of the chunk and those 5 nodes.
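To make the theory concrete, a toy simulation along these lines (random 256-bit addresses, each chunk stored on the 5 xor-closest nodes; the node and chunk counts are arbitrary) shows the chunks spreading across nodes rather than clustering:

```python
# Toy simulation: random node addresses, random chunk addresses, each chunk
# stored on the 5 nodes xor-closest to it. 50 nodes / 10,000 chunks are
# arbitrary numbers, just to show the spread.
import random
from collections import Counter

random.seed(0)
nodes = [random.getrandbits(256) for _ in range(50)]
stores = Counter()

for _ in range(10_000):
    chunk = random.getrandbits(256)
    for node in sorted(nodes, key=lambda n: n ^ chunk)[:5]:
        stores[node] += 1

counts = sorted(stores.values())
print("chunks per node - min:", counts[0],
      "median:", counts[len(counts) // 2],
      "max:", counts[-1])
```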
I think at least part of the problem is the way in which I am communicating my question, compounded by my not fully understanding how it works. Recipe for disaster.
You kinda answered my question above though.
Here is another grey area for me.
Not all nodes know all other nodes at once; we have the concept of close nodes/groups, right? Doesn't that translate to a section of sorts?
Sections suggest they are grouped together in some way.
No, it's all an individual view, be it how the node views things or how the chunk views things.
Chunk - it sees the 5 nodes with xor addresses closest to its own xor address. The nodes only see that the chunk is closer to them than to the rest of the network, barring 4 other nodes.
Nodes - each sees a neighbourhood that it is the centre of, about 70 nodes from what they say. But each neighbour sees a slightly different neighbourhood.
Nodes - also see a wider view, but it is more random nodes in that wider view (peer count - ~70).
This is what I am getting at: if only Maidsafe are uploading, and with little churn, why will the data travel very far from the neighbourhood in which they are uploading?
If you answered it already, sorry; it is still evading me.
Would the suggestion here of a different URL in OpenMetrics format work in this case? See Update 25th July, 2024 - #23 by Shu.
Telegraf works great to consistently parse X number of nodes across Y VMs with just two configuration files placed in its configuration folders (one that comes by default with telegraf and one custom-designed for safenode endpoints), including calling safenode-manager (at a much slower frequency) and the /metrics endpoint per safenode service hosted within a single LXC, VM, or physical host.
It removes the headache of having to call various endpoints via tons of custom code (shell scripts, temporary files, cron jobs, etc.), handles parallelization and configurable timeouts on a per-endpoint/format basis, and retries delivery of the gathered data points to the final destination.
It's a one-step operation to configure telegraf to parse the /metrics endpoint (all of it) and pass it to a database or another forwarder of your choice that telegraf supports.
All my prior dashboards at home simply used a single telegraf configuration file plus a single LXC hosting the grafana dashboard built off a single JSON config. Every LXC container had access to the telegraf configuration file, and it works beautifully for monitoring dozens of safenodes per LXC per physical or virtual host, with zero shell scripts or log parsing required.
Anyhow, everyone will have a different view of what kind of data they are after and how they are going to piece it together. The above was my solution for my home nodes, purely for metrics (not for log aggregation; that is a separate solution).
I thought about trying that solution out, but since I'm trying to make the community dashboard as easy as possible for newcomers to set up, I'm unsure how I could do that automatically for them.
Sorry, I worded that badly. I meant another file like that, containing the version number, peer ID, etc.,
so it's easy to get at.
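Something like this hypothetical sketch is all it would need (the node doesn't currently write such a file; the filename and field names are made up):

```python
# Hypothetical only: a small per-node info file written alongside the pid file
# at startup, plus a helper to read it back. The node does not currently write
# such a file; the filename and field names are made up.
import json
from pathlib import Path

def write_node_info(node_dir, peer_id, version, port):
    info = {"peer_id": peer_id, "version": version, "port": port}
    Path(node_dir, "node_info.json").write_text(json.dumps(info))

def read_node_info(node_dir):
    return json.loads(Path(node_dir, "node_info.json").read_text())
```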
Thank you @neo for putting things much more coherently than I can manage. I fully agree machines are doing fine with hundreds of nodes, but if I start using safenode-manager status the load averages increase drastically.
@shu when I first started playing with grafana, graphing was very difficult, jaggy and erratic for the same time point when adding things up across machines, unless the influx time was exactly the same. How do you get round that if it's pulling data direct from the node metrics?
They are uploading to the network. There is no local section/neighbourhood or anything like that for Maidsafe.
If the chunk’s address is 1234 then the 5 nodes in the whole network closest to xor address 1234 will be storing that chunk
Also it is not nodes that upload. The client is uploading to the whole network. Maidsafe is not uploading to their nodes.
As an aside: if you have a network with 50 nodes, all half full, and then 50 other nodes join from around the world, the 100 nodes would end up a quarter full, as churning will spread the chunks out.
The suggestion is that you have the “metrics” page for OpenMetrics,
PLUS another page with a slightly different format of parameter: value.
The other page leverages the work done for the “metrics” page to populate 90% of it, and adds in other parameters from the workings of the node. It would provide a lot more of the internal state of the node, and in the process provide what @aatonnomicc is after and what might help me in testing on a local network.
So yes 2 URLs (and yes I should have looked at the link first LOL)
And yes along those lines, but provide a lot of dynamic values as I suggested in previous paragraphs
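As a rough model of that second page, the existing OpenMetrics text could be turned into parameter: value lines with something like this (deliberately simplistic parsing, purely illustrative; the sample metric name is made up):

```python
# Purely illustrative: derive simple "parameter: value" lines from the
# existing OpenMetrics text, as a rough model of what the second page could
# serve. The parsing and the sample metric name are deliberately simplistic.
def metrics_to_parameters(metrics_text):
    out = []
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue                        # skip HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        out.append(f"{name}: {value}")
    return "\n".join(out)

if __name__ == "__main__":
    sample = "# TYPE example_records_stored gauge\nexample_records_stored 42"
    print(metrics_to_parameters(sample))    # -> example_records_stored: 42
```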
The work I am doing is with the local network, and it seems the local network is meant to run on the same machine. In any case that is what I am doing: running the nodes on the same machine, as that is easiest to script.
The code is basically so simple that scripting it is easier than coding it in a program. If speed becomes an issue then I will, but so far Linux has such great (for this purpose) text processing that scripts are very easy when everything is on the same machine.
I’ll have a look at telegraf once my scripts exceed the 20-50 lines of actual work.
I just parse the text output by safenode-manager when adding the nodes. So an automated process is to wrap safenode-manager with a script and pipe its output into sed or similar to extract the peerID; the script sets the ports, so those can be recorded too. All automatically.
By writing a file with that info, it's available to any other program/script.
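For example, a sketch of that wrapping idea in Python; the subcommand, flags and output format are assumptions and would need adjusting to whatever your version actually prints:

```python
# Sketch of that wrapping idea: run a safenode-manager command and record any
# libp2p peer ids found in its output. The subcommand and flags below are
# placeholders; adjust the call and regex to what your version actually prints.
import re
import subprocess

def record_peer_ids(args, record_path="nodes.txt"):
    result = subprocess.run(
        ["safenode-manager", *args],
        capture_output=True, text=True, check=True,
    )
    peer_ids = re.findall(r"12D3Koo\w+", result.stdout)   # typical libp2p peer id prefix
    with open(record_path, "a") as f:
        for i, peer_id in enumerate(peer_ids, start=1):
            f.write(f"{i} {peer_id}\n")
    return peer_ids

if __name__ == "__main__":
    record_peer_ids(["add", "--count", "10"])   # placeholder arguments
```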
ICYMI @joshuef
About this, a discussion and call for help:
@chriso, there is the source of my confusion.
I guess you are saying this is not a network reset, and that a node update is what’s happening.
But it does mention a reset via launchpad, which is what led me to think this was a network reset.
Yeah, that’s my understanding of things. Although, I have posted a question in our Slack channel just to make sure we are all on the same page. My plan is to upgrade the nodes that we are hosting on DO, not reset them.
What should go out in the communication to users is that people who are using safenode-manager can use safenode-manager upgrade, without requiring a reset, but node-launchpad users will need to run the Reset command within the launchpad.
It is unfortunate that the node launchpad does not support an upgrade yet, so the only mechanism it has to get on a new version is to do a reset.
It should be recoverable. We need to invest some time here to get all the edge cases. There are a few things in flight that should help us here, as well as some unreleased fixes that are in main. I'd be keen to see if you're still seeing this after the next release.
(Either way though, it should all be fixable; we just need to get the time to get in there properly.)
This is the plan yep!
Thanks @joshuef I will test this out.
I will need the git tag to build against - I usually guess this.
Will I also need new PKs?