Update 15 August, 2024

Following on from a thoroughly enjoyable Discord Stage on Tuesday, we thought we’d tackle some community questions this week.

Let’s start with the big one :elephant::

Why are home nodes not earning?

OK, this is a bug. It occurs primarily with Launchpad, because Launchpad automatically flags nodes as home nodes with --home-network, but it could happen with nodes launched in other ways too. The basic problem is that nodes are adding too many listening addresses, and inactive addresses aren’t being cleared out properly, which clogs up communications. A fix is now in place and will make it into the next build, planned for next week. A second issue is that we are seeing more incoming connection problems than expected, and we’re digging into that one now. We are also looking to clarify the different connection options (e.g. UPnP, port forwarding) so we can give better instructions there.

Node numbering

This was an issue flagged by @southside, who said the existing convention of numbering nodes from safenode1 upwards does not lend itself to optimal display and sorting. This might seem like a trivial issue, but we’ve built some infrastructure around it, e.g. monitoring, which would break if we changed the scheme. In the longer term, we could make this a user-configurable feature (because people might prefer different padding lengths), but it is not a priority at the moment.
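A minimal Python sketch of the sorting problem, for illustration only: plain string sorting puts safenode10 before safenode2, while either a numeric-aware sort key or zero-padded names restore the expected order. The `natural_key` and `padded` helpers are hypothetical and not part of safenode-manager or vdash; `width=3` (two leading zeroes for single-digit numbers) is an assumed default.

```python
import re

def natural_key(name: str):
    """Sort key that compares embedded numbers numerically, so safenode2 < safenode10."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]

def padded(name: str, width: int = 3) -> str:
    """Zero-pad a trailing node number, e.g. safenode7 -> safenode007 (hypothetical helper)."""
    match = re.fullmatch(r"(\D*)(\d+)", name)
    if match is None:
        return name  # names not of the form <prefix><number> are left unchanged
    prefix, number = match.groups()
    return f"{prefix}{int(number):0{width}d}"

names = [f"safenode{i}" for i in (1, 2, 10, 11, 9)]
print(sorted(names))                     # ['safenode1', 'safenode10', 'safenode11', 'safenode2', 'safenode9']
print(sorted(names, key=natural_key))    # ['safenode1', 'safenode2', 'safenode9', 'safenode10', 'safenode11']
print(sorted(padded(n) for n in names))  # ['safenode001', 'safenode002', 'safenode009', 'safenode010', 'safenode011']
```

Padding changes the names themselves, which is why it would touch existing tooling; a numeric-aware sort key only changes how a display tool orders the names.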

Health metrics in log files

@happybeing raised an issue asking whether the node health metric will be in the log files. We understand that for vdash it would be ideal to ingest the /metrics endpoint (available now, with more metrics added in 2024.8.1) and the /metadata endpoint (likely in the 2024.8.2 release) programmatically for most of the program’s display columns. We don’t want to increase the frequency of data in the logs to make vdash more real-time, as this would create brittleness and use more disk space. However, we are open to suggestions as to which metrics folk would like to see in the codebase, and why. vdash is a fantastic asset, and we are taking inspiration from it as we build out features in the Launchpad :point_down:.
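For anyone wanting to consume /metrics programmatically rather than parse logs, here is a minimal Python sketch. It assumes the endpoint serves the Prometheus/OpenMetrics text format; the metric names in `sample` are hypothetical placeholders, and since the node’s metrics port is not stated here, `fetch_metrics` simply takes whatever URL your node exposes.

```python
from urllib.request import urlopen

def parse_metrics(text: str) -> dict:
    """Parse Prometheus/OpenMetrics-style text into {series: value}.

    Minimal sketch: skips blank and comment lines (# HELP, # TYPE, # EOF) and
    keeps the series name, including any {labels}, as the key. It does not
    handle spaces inside label values or escaped characters.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.split()[:2]  # an optional timestamp may follow
        metrics[series] = float(value)
    return metrics

def fetch_metrics(url: str, timeout: float = 5.0) -> dict:
    """Fetch and parse a node's /metrics endpoint; the URL is an assumption you supply."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))

# Hypothetical metric names, for illustration only.
sample = """\
# HELP example_records_stored Hypothetical gauge.
# TYPE example_records_stored gauge
example_records_stored 42
example_peers_connected{kind="relay"} 3
# EOF
"""

print(parse_metrics(sample))
```

For anything beyond a quick check, an existing parser such as `text_string_to_metric_families` from the prometheus_client Python package is more robust than this line-splitting approach.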

Misreporting Stopped Nodes

On @loziniak’s issue about nodes being misreported as stopped by safenode-manager: this only applies to a local network. There appears to be some incongruity between local network status and service network status. This should sort itself out once we are back in sync, but it’s not currently a priority.

API Issues

As for @jadkins’ three issues about the API, these are all covered by current work in progress, so you should see them dealt with shortly.

Triage

Finally, there was a discussion led by @southside on Discord about how the community can help triage issues by understanding the logs better. This is a little more involved, but we are working on a list of strings to watch for when grepping the logs.

As mentioned last week, “consider us as BAD” is the sign that a node has been shunned. In addition, “kBucketTable has” should tell you about the status of the routing table, while “outdated live connections, still have” logs the number of open connections (though it is not logged frequently).

And of course there are the existing docs on HandShakeTimeouts and CircuitReqDenied in our troubleshooting guide.
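Until that list is published, here is a minimal Python sketch for counting how often these strings show up in a node log. The strings are the ones named above; matching is case-insensitive because the exact casing and wording in real log lines is an assumption, and HandShakeTimeout is matched as a substring so plural forms are counted too. No particular log path is assumed.

```python
from collections import Counter
from pathlib import Path

# Strings named in this update and the troubleshooting guide. Exact wording and
# casing in real log lines may differ between releases, so matching is case-insensitive.
WATCH_STRINGS = (
    "consider us as BAD",                     # node has been shunned
    "kBucketTable has",                       # routing table status
    "outdated live connections, still have",  # number of open connections
    "HandShakeTimeout",                       # also matches "HandShakeTimeouts"
    "CircuitReqDenied",
)

def count_watch_strings(lines, watch=WATCH_STRINGS) -> Counter:
    """Count how many lines contain each watched string (case-insensitive)."""
    lowered = [(s, s.lower()) for s in watch]
    counts = Counter({s: 0 for s in watch})
    for line in lines:
        text = line.lower()
        for original, needle in lowered:
            if needle in text:
                counts[original] += 1
    return counts

def scan_log_file(path: Path) -> Counter:
    """Stream a log file line by line so large logs are not loaded into memory."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return count_watch_strings(handle)
```

Call `scan_log_file` with the path to one of your node’s log files; a non-zero count for “consider us as BAD” suggests the node has been shunned by at least one peer.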

Explaining how we can use these to diagnose issues is something for another, perhaps the next(?), update.

General progress

As well as fielding community questions, @chriso has been working on an extension to testnet-deploy that allows different machine sizes per environment type, modifying the tool and the related workflows.

@bzee has been busy (pun intended) on the API, getting a solid PUT method with all the inner stuff working correctly.

@mick.vandijke has been on wallets, specifically merging hot wallets and watch-only wallets, and a new wallet file format, similar to Bitcoin’s, that can be backed up and transferred more easily than the wallet file + folders we have now.

And @qi_ma has joined @joshuef, @anselme and @dirvine on the Sybil-resistance work, raising a PR in the libp2p repo to support range-based search. Range-based searching allows us to interrogate peers that are further away from the target address, to prevent an attacker from monopolising the search space.
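To illustrate the idea, here is a toy Python sketch contrasting a classic k-closest lookup with a range-based one, using Kademlia-style XOR distance on 4-bit IDs. It is a simplified illustration under our own assumptions, not the API or algorithm in the libp2p PR: if an attacker places IDs right next to the target, they can fill every one of the k closest slots, while a query over a wider distance range also reaches honest peers further out.

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance between two IDs: their bitwise XOR."""
    return a ^ b

def k_closest(peers, target, k):
    """Classic lookup result: the k peers nearest to the target."""
    return sorted(peers, key=lambda p: xor_distance(p, target))[:k]

def within_range(peers, target, max_distance):
    """Range-style result: every peer within max_distance of the target, nearest first."""
    in_range = [p for p in peers if xor_distance(p, target) <= max_distance]
    return sorted(in_range, key=lambda p: xor_distance(p, target))

target = 0b0000
sybils = [0b0001, 0b0011]          # attacker IDs placed right next to the target
honest = [0b0111, 0b1000, 0b1100]  # honest peers at distances 7, 8 and 12
peers = sybils + honest

print(k_closest(peers, target, k=2))                # [1, 3]: only attacker IDs
print(within_range(peers, target, max_distance=8))  # [1, 3, 7, 8]: honest peers included
```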

@anselme has been writing documentation for the currency system, as needed by our community, our partners and our internal devs. Documentation is one of those jobs that’s always needed, but hard to achieve when things are moving fast, so it’s a sign that we’ve arrived at a workable solution here.

Back from a family holiday, @jimcollinson has been working on node launchpad UX upgrades, node health indicators, and node port forwarding and configurations for home users.

@roland raised a PR to expose a /metadata endpoint via the node’s metrics port. This is used to return static info about the node. He also worked on a system to pull the Beta uploaders’ metrics into a single place, to make it easier to check for errors, and looked at errors caused by high bandwidth.

@mazzi has completed the Launchpad upgrade and is now testing it on different platforms.

And @rusty.spork hosted this week’s Discord Stages, chucking a few tasty curveballs at @bux, who of course handled them with aplomb (summary here). He also worked with some community members to obtain bandwidth metrics and requirements that partners have asked for. And he was responsible for leaking Jim’s node stats as part of showing off @mazzi’s new Launchpad while Jim was on holiday (that’ll teach him).


First, thanks for the update, and thanks for all the work it has taken to get us here.

Best news is to see @dirvine back in action. As said elsewhere, please take it easy and don’t rush back into 24/7 work, David.


Second, methinks

In /metrics it would be good to get the active record balance, not just the total records as now.

In /metadata it would be good to have the address info:
  • the peer ID and XOR address
  • port info (all ports)

Using RPC is just too slow, and not as easy as grabbing /metrics and /metadata.


To help users monitor nodes you don’t need to make vdash anything like real time, and I wasn’t asking for all the metrics, just two small tweaks.

What would help a lot is adding relevant records and node health to one of the existing messages that happen more frequently than, say, a Store Cost quote, which is typically every few days! At the moment that means people can’t tell whether they are getting records for a day or longer.

If you could report relevant records and node health in any of the other more frequent messages, that would be fine. Or, if that’s not satisfactory, a new message that’s output once per hour.

This wouldn’t have a significant effect on brittleness or disk space taken by logs but would help lots of users during beta.

Especially new arrivals and anyone not seeing nanos.


I won’t for sure; it will be sporadic for a few months yet, I think. I’ll try though :smiley: :smiley:


Aye well, don’t try too hard :slight_smile:


Thx 4 the update Maidsafe devs

Good to see you again sir @dirvine

:clap: :clap: :clap: :clap: @southside @happybeing @loziniak @jadkins

Can you make this back up directly to the Network?

Not super relevant, but how do you prevent someone from changing your Discord Username on your Launchpad?

Launchpad looks super informative

Keep coding/hacking/testing super ants


Nice update team and good work this week. Many thanks to project supporters too – big thanks to @Southside for all of his contributions this week.

Out of curiosity:

Is this a TCP versus UDP thing? I don’t know if we are still using TCP or not.

Great to see you back on the forum @dirvine, may your recovery be swift!

Cheers all :beers:


No, it’s just a thing. We use QUIC these days AFAIK. libp2p have some serious infrastructure PRs in flight. I would let them settle and check this again.


Reluctantly conceding this point - it’s a deal if the default is 2 leading zeroes – OK?

Yep, if there is work that needs undoing first, then just quietly skip this. It’s hardly an insurmountable problem, just a PITA. But it does make vdash a lot less useful than it otherwise could be… Cos vdash and safenode-manager have entirely different ideas about which node is which once we have >9 nodes.


I’m missing some kind of information about how the network performs:

  • Does it keep data well?
  • How fast are the uploads and downloads?
  • What kind of errors happen during the up/downloading?

Before Beta we used to get first-hand insight into the things I mentioned, and I miss that. It would be nice to get some kind of report.


And just to clarify (particularly for @Southside, as I know you’ve been asking about it), it’s better to use /metrics in the first instance for direct feedback on node health, rather than relying on logs. Logs are far more useful for debugging after the fact than as an indicator in the moment.


I’d respond to this but I’m far too busy RTFMing /metrics :slight_smile:


How do we ensure this when 99% of us are running the specified binaries for the beta and no longer building from source, where we could specify which features to build?

From safe_network/sn_metrics at main · maidsafe/safe_network · GitHub:

1. **Safe Node Configuration:** When running your Safe nodes, ensure they are started with the --feature=open-metrics flag.

To be clear, my interest in the logs is to make it easier to extract info for debugging newbie probs (like no nanos for home users) rather than up to the minute performance monitoring. Which itself is important but not my focus right now.

The list of the most critical log messages will be most welcome, we got some already, thanks :slight_smile:

If anyone has the time, then a line or two about what causes these errors (if not bleedin’ obvious) and their likely effect would be super-informative as well.


Thanks so much to the entire Autonomi team for all of your hard work! :muscle: :muscle: :muscle:

And good luck to everyone with your beta rewards! :four_leaf_clover:


The team is planning to get this out as well after the next release. It’s in the pipeline.

The team is planning to add more quoting related metrics after the next release in the /metrics endpoint as well (ongoing discussion).


Thanks @Shu but my comment concerns what’s in the logs - which can then be used by vdash.


I think the priority will be on getting metrics endpoint as up to date and relevant as possible.

The team did try to provide general guidance in the OP on using logs to generate metrics vs. consuming the metrics endpoints directly.

David’s response on a different post regarding the use of logs vs. metrics is similar:

I understand it doesn’t answer your questions, or resolve your request for the additional PRs required to make further log file changes, at this time.


I also understand the issues wrt logs v metrics. My request is pragmatic: the pair of tweaks I’ve requested were rejected for what seem to me invalid reasons (“brittleness and logfile size”), and these small changes would help those coming to the beta and using vdash rather than Launchpad. That’s all.


I wonder. To me it seems vdash is a bunch of really neat work, done when we had no metrics, and the community has benefitted from it an awful lot. Now there is work internally to make Launchpad cover some of what vdash did.

So I have 2 points to throw in, to see what folk think:

  1. The trivial cases @happybeing has asked for should happen. It’s a simple PR AFAIK.
  2. We need to focus on updating vdash to use metrics, or on incorporating vdash into Launchpad.

The last point is where discussion would be good, but I feel as a community we must value past contributions, especially those that have been so helpful. Never mind awe, which looks great, as well as the work others are doing. We need community focus on client API goodness to happen now.

I will post this link internally as well, to get as many involved as possible.
