Looks like we still have some memory leaks. After 12 days of running, my 32 GB RAM machines are barely managing 100-130 nodes.
It's a problem I highlighted a while ago. I found that as I start up nodes, each new one uses more and more memory.
Also, since I have over 100 GB of memory, the first nodes are using something like 300-400 MB, yet on my SBC with 8 GB the nodes are only using between 100 and 200 MB (closer to 100 MB) each.
Back then we worked out that the most likely reason the first nodes on the large machine were using 300-odd MB was that with more overall system RAM, the initial allocation of RAM by the OS (or the Rust startup) was higher, significantly higher.
The reason the memory used increased when there were many nodes already running was never worked out; it was something they were going to look into after everything else was sorted out. We are talking first nodes around 300 MB, and by node 150 it's 800 MB.
Then a few versions ago there was a bug where there was a real memory leak and occasionally a node would start using well over 1 GB of memory. I have not seen that bug for a while now.
I know about those problems, but this is different. It was fine for a few days after startup; then I think nodes grow on network activity peaks and never shrink back, but I don't have enough data to document it.
All the more reason to have a proper test schedule before TGE.
If all looks fine and then 3-4 weeks in we start running out of memory, it's not going to look good at all.
In fact its going to look downright amateurish.
I have been running the safenodes since at least Nov 18th (400 nodes) on a 512 GB RAM Alpine LXC.
Memory usage has stayed flat between 350 GB and 365 GB, and so have CPU and network traffic.
Note: However, I do switch out the default memory allocator within the safenode binary and switch to mimalloc
(GitHub - microsoft/mimalloc: mimalloc is a compact general purpose allocator with excellent performance.) for home use, by recompiling the source code off the production branch with each release (personal preference so far).
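For anyone wanting to try the same thing, here is a minimal sketch of the allocator swap (assuming the mimalloc crate; the actual safenode patch may look different):

```rust
// Cargo.toml (assumed): mimalloc = "0.1"
use mimalloc::MiMalloc;

// Replace Rust's default system allocator with mimalloc for the whole binary.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Every heap allocation below now goes through mimalloc.
    let buf: Vec<u8> = vec![0; 1024 * 1024];
    println!("allocated {} bytes via mimalloc", buf.len());
}
```

In principle, adding the crate to Cargo.toml and dropping those few lines into main.rs is all the recompile should need.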
I also haven't updated to the latest production branch at home yet, though I should; I just haven't had time to do so.
Memory leak in Rust? Hmmm, typically such malloc programming errors are a C (or C++) thing… I get the explicit memory allocation thing; just leaving it up to Rust or the OS, well, both are piggish in default mode.
It'd be more like a subprocess not terminating itself and keeping the memory it was allocated.
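As a purely hypothetical illustration (not actual safenode code), memory can grow without bound in perfectly safe Rust when a long-lived task keeps everything it has ever buffered reachable:

```rust
use std::{thread, time::Duration};

fn main() {
    // Hypothetical background worker: it buffers data on every "event"
    // and never trims or exits early, so nothing is technically leaked --
    // the Vec is still reachable -- but resident memory only ever grows.
    let worker = thread::spawn(|| {
        let mut backlog: Vec<Vec<u8>> = Vec::new();
        for _ in 0..1_000 {
            backlog.push(vec![0u8; 64 * 1024]); // +64 KiB per "event"
            thread::sleep(Duration::from_millis(10));
        }
        backlog.len() // everything stays held until the thread finally ends
    });
    println!("events buffered: {}", worker.join().unwrap());
}
```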
This is a huge surprise to me. I can't speak to RAM, because I need to routinely restart nodes due to CPU creep long before I would need to be concerned with RAM.
You should limit the number of monitors to 3. The monitors are constantly fighting for consensus, and having that many monitors causes significant CPU and networking overhead. It will get exponentially worse when the cluster is in trouble.
CephFS is not HA. When the MDS fails, all your CephFS filesystems will be down until the newly elected MDS has combed through the filesystem and loaded all metadata into RAM.
You'd be better off mounting RBD images. They're automatically sort-of thin-provisioned.
I disagree. If the 3 of the 13 machines that contained the fixed monitors go down, then there are no monitors running.
This is why there is more than 1 MDS, including one or more on standby.
HA in the sense that as long as the majority of machines remain up (with manager/metadata daemons and monitors), Ceph as a service will remain up, including RBD and FS access (over the long run).
Obviously, if I am hit with a power outage, and battery backups don't last forever, then there is no HA. The idea was to be able to horizontally scale capacity if needed and handle machine-level failures, which is why I went with Ceph.
Agreed, 13 monitors is too many; 3 would give me sleepless nights; 5 is the sweet spot on that few hosts. New releases of Ceph will refuse to upgrade with 3 - you have to run it in non-production mode.
Also, the best way to manage mons is to use 'tags' on hosts - you tag a host as a 'mon' target, then you write a rule that says 'max 5 mons on mon:target', and Ceph will manage the placement dynamically for you.
Same for managers: you need a master and a backup, and using tags is the way there again…
never again
Sure, somewhere between 5 and 13 I am sure there is a sweet spot. It's easy to scale down, but I am on an older version at the moment, and where they run needs to be picked ahead of time. I also have to ensure enough run on different circuit breakers (physically), in case one circuit breaker trips and wipes out a third of the machines even with battery backup.
Thanks for the suggestion on the tags; however, I'm not too worried about the current assignments of the different daemons to different machines ahead of time.
The setup has been working well for me for many years.
Anyhow, let's stick to discussing the different setups folks have for safenodes on this thread, as opposed to Ceph optimizations.
What is the current procedure for specifying an external drive for running a node?
I am on older Intel E5 v4 CPUs and am getting about 325 watts for 585 antnodes on one host, and 350 watts for 400 antnodes on another; the second host is identical in hardware and OS configuration but is running a few other workloads too.
So basically, 325 / 585 = 0.55 watts per antnode (that's the lowest I have been able to get to), and I'm pretty satisfied with those statistics at the moment given the CPU dates back a few years.
Just curious: if any folks are running on the newer 5 nm CPUs such as the Ryzen or EPYC CPUs, what's the watts per antnode like on those systems?
Made a quick chart of power usage (watts) vs antnodes (# running) on a per-host basis for home use.
Update:
Managed to make more tweaks and confirmed a steady state at 335 watts for ~600 nodes = 0.55 watts per node.
Though without any antnodes or anything else running on the system, it already consumes 235 watts as is (I will need to look into disabling other features in the hardware to reduce this a bit more, TBD).
Basically, with antnodes running, the watts used jump from 235 to 335 (+100 watts) to support the 600 antnodes.
Early on-paper calcs on our 2x Geekom Intel Core i7 12th-gen lab NUCs suggest it's possible, in a 64 GB RAM setup with 4x 512 GB NVMe SSDs, to get below 0.4 watts per antnode… running 288 nodes with a claimed peak draw of 110 watts on Ubuntu Linux 24.04, without the onboard Intel GPU doing anything (running headless, SSH-in config).
We are going to fire this config up next week and see what is actually happening with an in-wall watt meter, to see if that sort of <0.4 watts/antnode level is achievable while keeping all nodes healthy/unshunned.
We are still working on some related LXC container provisioning and backup configuration/integration stuff for what is a headless-access test build connected to the ISP router/modem at 1 GbE via a 4-port local 1 GbE switch, plus some other ISP and IP address related stuff, which includes our own 'low write amp' in-memory FTL LKM…
It's a journey; we will post results here as it happens…
Very cool!
I am hitting about 0.49 watts per antnode now (345 watts for 700 nodes), and this is on an older CPU (14 nm) than your target Intel i7 12th gen (10 nm Enhanced SuperFin, aka Intel 7), so I suspect you should be able to hit that target!
Roughly 4-5% of my 700 nodes have a shunned count > 0 (no idea why yet), but I'm keeping an eye on it.
So we are talking about 167 mW of incremental power to run a node on your system (the extra 100 watts divided by 600 nodes).
If we use that for a family's PC running 5 nodes, it's still under 1 watt, or about 1,200 hours (50 days) per kWh. Even at today's prices that is like 1 token every 50 days to pay the incremental costs on older h/w.
Now I am going to have to measure the current going into my SBCs when I get them up again (doing a lot of house rearrangements ATM) and see what the incremental increase is. The SBCs are also good for streaming services or general browsing/word processing etc., so they do not need to be dedicated to node running, and will likely run HA in addition to nodes. Seeing as they draw <7.5 W running nodes, I suspect they too will be low incremental wattage for 10 nodes running. If things improve, then maybe 20 nodes for that power.