PunchBowl [Testnet 09/05/2024] [Offline]

Small file worked OK here

Gaze upon my handsome features, you lucky people!!!


"willie-holiday.png" 681b0f64e82a186dbc82a7e387a76983d12cedccc60b5a8096654b4a2dc9246f

Don’t shout though, I have lost my ears. Dunno why the hat is not over my eyes…

One of my nodes is charging slightly above average :thinking: so if you have 25 coin, feel free to go for an upload :tada:
oooooh - just 0.025 coin needed! More or less a bargain to upload chunks to me!

image

edit/ps:
hmhmmm… it happened right after the record count went from 1478 to 1479

Pps:
Hmmm - but maybe it's just a coincidence (when I look at the next record increase, there's a store cost increase there too). Precise timing is not the strength of my monitoring, I guess.

1 Like

I got this new (to me) set of errors when trying to upload from one of my VPS boxes


safe@snawthisyineither:~$ safe files upload -p cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb 
Logging to directory: "/home/safe/.local/share/safe/client/logs/log_2024-05-11_17-02-01"
safe client built with git version: 16f3484 / stable / 16f3484 / 2024-05-09
Instantiating a SAFE client...
Connecting to the network with 49 peers
🔗 Connected to the Network
Chunking 1 files...
"cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb" will be made public and linkable
Splitting and uploading "cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb" into 6586 chunks
Error: 
   0: Failed to upload chunk batch: Wallet Error MsgPack deserialisation error:: invalid length 2, expected struct Output with 3 elements.

Location:
   /home/runner/work/safe_network/safe_network/sn_cli/src/files/files_uploader.rs:171

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
safe@snawthisyineither:~$ safe wallet balance
Logging to directory: "/home/safe/.local/share/safe/client/logs/log_2024-05-11_17-03-24"
safe client built with git version: 16f3484 / stable / 16f3484 / 2024-05-09
Error: 
   0: MsgPack deserialisation error:: invalid length 2, expected struct Output with 3 elements
   1: invalid length 2, expected struct Output with 3 elements

Location:
   sn_cli/src/bin/subcommands/wallet.rs:32

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
safe@snawthisyineither:~$ safe wallet address
Logging to directory: "/home/safe/.local/share/safe/client/logs/log_2024-05-11_17-03-34"
safe client built with git version: 16f3484 / stable / 16f3484 / 2024-05-09
Error: 
   0: Wallet Error MsgPack deserialisation error:: invalid length 2, expected struct Output with 3 elements.
   1: MsgPack deserialisation error:: invalid length 2, expected struct Output with 3 elements
   2: invalid length 2, expected struct Output with 3 elements

Location:
   sn_cli/src/bin/subcommands/wallet/hot_wallet.rs:131

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
safe@snawthisyineither:~$ safe wallet get-faucet 188.166.171.13:8000
Logging to directory: "/home/safe/.local/share/safe/client/logs/log_2024-05-11_17-03-46"
safe client built with git version: 16f3484 / stable / 16f3484 / 2024-05-09
Instantiating a SAFE client...
Connecting to the network with 49 peers
🔗 Connected to the Network
Error: 
   0: Wallet Error MsgPack deserialisation error:: invalid length 2, expected struct Output with 3 elements.
   1: MsgPack deserialisation error:: invalid length 2, expected struct Output with 3 elements
   2: invalid length 2, expected struct Output with 3 elements

Location:
   sn_cli/src/bin/subcommands/wallet/helpers.rs:52

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
safe@snawthisyineither:~$
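For what it's worth, "invalid length 2, expected struct Output with 3 elements" is the classic serde shape mismatch you get when data written under an older struct definition is read back under a newer one. A minimal Rust sketch that reproduces the same message (OldOutput/Output here are hypothetical stands-ins, not the real sn_transfers types):

use serde::{Deserialize, Serialize};

// Hypothetical 2-field struct: stands in for whatever an older client
// build wrote to the wallet file. rmp_serde serialises structs as
// arrays, so this becomes a 2-element MessagePack array on disk.
#[derive(Serialize)]
struct OldOutput {
    id: u64,
    amount: u64,
}

// Hypothetical 3-field struct: stands in for the newer definition the
// current client expects to read back.
#[derive(Deserialize, Debug)]
struct Output {
    id: u64,
    amount: u64,
    kind: u8,
}

fn main() {
    let bytes = rmp_serde::to_vec(&OldOutput { id: 1, amount: 42 }).unwrap();
    let err = rmp_serde::from_slice::<Output>(&bytes).unwrap_err();
    // Prints: invalid length 2, expected struct Output with 3 elements
    println!("{err}");
}

If that's the cause, a wallet file written by an earlier client build would simply be unreadable by this one, which would also explain why wallet balance and wallet address fail the same way; resetting the client's wallet directory would likely clear it.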
1 Like

I started reading up on LibP2P docs to better understand the layers involved and get more acquainted with the terminology and the processes involved here.

Below is a summary (feel free to revise or correct me if I incorrectly stated any items):

For Hole Punching:

Phase 1: Preparation
  • AutoNAT (determine whether a node is Public or Private based on its peers' dial attempts/responses against its assumed public address)

  • AutoRelay (Discover & Bind to closest Relay Nodes on the Network)

  • Circuit Relay (Connect To & Request Reservations with Discovered Relay Nodes)

Phase 2: Hole Punching
  • Circuit Relay (Establish Connection via Relay to Remote Peer Node)

  • DCUTR (a successful coordinated simultaneous dial by nodes A & B results in a successful hole punch; attempts are repeated via the relay until DCUTR takes place properly) (see the toy sketch after this list)
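To make the ordering explicit, here is a toy Rust state machine of the two phases above. This is just my reading of the libp2p docs, not the rust-libp2p API or the safenode implementation:

#[derive(Clone, Copy, Debug, PartialEq)]
enum HolePunch {
    ProbeNat,  // Phase 1: AutoNAT decides public vs private
    FindRelay, // Phase 1: AutoRelay discovers nearby relay nodes
    Reserve,   // Phase 1: Circuit Relay reservation on a relay
    Relayed,   // Phase 2: relayed connection to the remote peer
    Dcutr,     // Phase 2: coordinated simultaneous dial
    Direct,    // success: direct A <=> B connection
}

// advance() takes the step we just attempted and whether it worked.
fn advance(state: HolePunch, step_ok: bool) -> HolePunch {
    use HolePunch::*;
    match state {
        ProbeNat => FindRelay, // private node: go find relays
        FindRelay => if step_ok { Reserve } else { FindRelay },
        Reserve => if step_ok { Relayed } else { FindRelay },
        Relayed => if step_ok { Dcutr } else { Reserve },
        // A failed DCUTR round falls back to the still-open relayed
        // connection and retries (safenode logs show 3 attempts max).
        Dcutr => if step_ok { Direct } else { Relayed },
        Direct => Direct,
    }
}

fn main() {
    let mut s = HolePunch::ProbeNat;
    for ok in [true, true, true, true, false, true] {
        s = advance(s, ok);
        println!("{s:?}");
    }
}

The detail the sketch tries to capture is that the relayed connection stays usable while DCUTR retries, which is relevant to the questions further down.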

Statement quoted from an IPFS website:

Note: There are situations in which hole punching will not work, most notably when one of the nodes is behind a symmetric NAT. In such cases, nodes can instead explicitly add port mappings, either manually or by using UPnP. As a last resort, nodes can leverage external relay nodes.


It seems from earlier charts that for the last phase of this hole punching, DCUTR, I am seeing only a 1-3% success rate for safenodes with the --home-network flag (i.e. going through the whole Phase 1 & Phase 2 steps to a successful hole punch).

Granted, AutoNAT is not in play yet, and therefore I am not sure if it's already smart enough to attempt to connect only to public nodes that can act as relay nodes, or is attempting to reach out to both types of nodes (public and private). Is the component 'AutoRelay' or equivalent already in play, and does it work without AutoNAT integrated successfully?

For reasons as yet unknown, compared with Group A (--home-network flag), the alternative routes, i.e. manual NAT port forwarding (Group B) and UPnP (Group C), are producing far fewer logged errors and overall fewer connection errors thus far.

Update:

Questions:

If a DCUTR is not successful, is communicating via a relay node using 'Circuit Relay' from Step 1 of Phase 2 good enough for safenode processes (continuing to use the relay as a proxy, i.e. a bi-directional channel between nodes A & B via the relay), without the final direct A <=> B link?

At that point, is it really a successful hole punch, or is it successful communication maintained by relays only?

It seems being dependent on relay nodes (lacking static NAT port forwarding, unable to do UPnP, and also not getting to the final successful DCUTR stage of the hole punch sequence) would be the last resort (not that it's a bad option, and it is an important option to have!), at least in terms of order of preference according to the IPFS statement :thinking: .

Notes:

I wanted to ensure the objective for a given testnet, and the terminology being used, is properly understood by all of us (if possible) in the simplest terms, hence this post :smile:.

I know the phases are evolving as the team continues to work in the background on AutoNAT, UPnP, and other integrations/components to give us the smoothest experience possible :clap: .

8 Likes

Just chucked up a 3GB Mint image :slight_smile:

Took two attempts due to the dreaded
"Failed to upload chunk batch: The maximum specified repayments were made for the address: ec2881(11101100)" error

🔗 Connected to the Network
"linuxmint-21.2-mate-64bit.iso" will be made public and linkable
Splitting and uploading "linuxmint-21.2-mate-64bit.iso" into 2260 chunks
**************************************
*          Uploaded Files            *
**************************************
"linuxmint-21.2-mate-64bit.iso" d239d34f77f73f4dc60ef3e8a12ffc87ded440dd886536d5aaaea0b2c15e2963
Among 2260 chunks, found 45 already existed in network, uploaded the leftover 2215 chunks in 11 minutes 30 seconds
**************************************
*          Payment Details           *
**************************************
Made payment of NanoTokens(270446) for 2215 chunks
Made payment of NanoTokens(46628) for royalties fees
New wallet balance: 0.999682926
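As a quick sanity check on those figures (my arithmetic, not output from the tool): 270,446 nanos / 2,215 chunks ≈ 122 nanos average store cost per chunk; the 46,628 nano royalty is roughly 17% on top of the storage payment; and 270,446 + 46,628 = 317,074 nanos ≈ 0.000317 of a token, which matches the wallet dropping from 1.0 to 0.999682926.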

5 Likes

What is the cause of this error?

3 Likes

Pretty sure it’s trying to pay a node that you can’t connect to.

6 Likes

In my case it is impossible to upload such large files without getting the error (maximum specified repayments) even if I reset the wallet.

For example, I deleted the safe directory and started completely from scratch by uploading larger and larger files.
This is the result:

1. File 216KB (4 chunks) -> OK
2. File 3.2MB (8 chunks) -> OK
3. File 28.9MB (57 chunks) -> OK (stuck on one chunk for about three minutes)
4. File 44.2MB (86 chunks) -> OK (stuck for a couple of minutes on two chunks)
5. File 84.1MB (162 chunks) -> Error: maximum specified repayments (1 chunk not uploaded)
New attempt to upload the last file -> Error (maximum specified repayments)

After a reset I try to upload the last file again. This time I can finish, even though the only pending chunk takes about four minutes.

6. File 149.3MB (285 chunks) -> Error (maximum specified repayments, 5 chunks not uploaded)
New attempt -> Error (maximum specified repayments)

New reset and new attempt -> Error (maximum specified repayments). I try with another file.

7. File 289.4MB (553 chunks) -> Error (maximum specified repayments). I try several times and always get the error, even after resetting the wallet.

It is as if it were impossible to connect correctly to some groups. In the previous testnet the same thing happened to me.

Did you upload from home or from a VPS?
If it is from home, do you have open ports?

2 Likes

I’m uploading from home. No open ports for the client, and they shouldn’t be needed for the client. I do have open ports for running safenodes, but I’m not using them just now because I have the 40 nodes set to use ‘--home-network’.

Are you running safenodes as well?

No. Only client.

There’s probably nothing wrong with your setup then if you can get chunks to upload at all. It could be the issue we’ve seen before, where some nodes that would have a chunk destined for them can’t take the payment for some reason, so those chunks keep failing.
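A toy sketch of how that failure mode would produce the "maximum specified repayments" message, assuming the client retries by paying the target node again up to a fixed cap (the names and the cap of 3 are my assumptions, not the sn_client API):

// Assumed per-address repayment cap; the real value is whatever the
// client's "maximum specified repayments" setting is.
const MAX_REPAYMENTS: u32 = 3;

fn upload_chunk(addr: &str, pay_and_store: impl Fn(&str) -> bool) -> Result<(), String> {
    for attempt in 1..=MAX_REPAYMENTS {
        // Each retry pays the destination node again before re-sending
        // the chunk; if the node never takes the payment, every attempt
        // burns one repayment.
        if pay_and_store(addr) {
            return Ok(());
        }
        eprintln!("store failed for {addr} (attempt {attempt})");
    }
    Err(format!(
        "The maximum specified repayments were made for the address: {addr}"
    ))
}

fn main() {
    // Simulate a destination node that never accepts the payment/chunk.
    let result = upload_chunk("ec2881(11101100)", |_| false);
    println!("{result:?}");
}

If that's roughly right, an unreachable or non-paying destination node would stall the same chunk on every run, matching the "same file keeps failing" pattern above.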

3 Likes

Tried running nodes with “home network” and with good old port forwarding. One thing I noted: in the last hour or so, the number of GETs in Vdash was rising really quickly on the port-forwarded nodes. Much slower with “home network”.

2 Likes

I started looking at the logs to see if there are any patterns in Peer IDs vs DCUTR success/failure, and in the frequency of those successes or failures, for a single safenode service, in this case safenode1 (part of Group A).

Typical sequence logged for a non-successful DCUTR ERROR:

Notes:

Attempt #1:

  • Peer is added to routing table
  • Peer’s distance to us is logged
  • Peer ID is updated within KAD routing table
  • An outgoing connection to Peer is Attempted (which fails due to HandshakeTimedOut)
  • Issue regarding this remote peer is now being tracked and is cleared out

Attempt #2

  • Peer’s distance is logged
  • Peer is successfully removed from routing table
  • An outgoing connection to Peer is Attempted (which fails due to HandshakeTimedOut)
  • Issue regarding this remote peer is now being tracked and is cleared out

Attempt #3

  • An outgoing connection to Peer is Attempted (which fails due to HandshakeTimedOut)
  • Issue regarding this remote peer is now being tracked and cleared out

Result:

  • DCUTR status with remote peer is deemed as ERROR (3 Attempts Exceeded)

Additional Note:

  • I am trying to understand, in the sequence of 3 attempts based on the logging above (compared to a successful sequence, as noted below), where to draw the boundaries between attempts 1 / 2 / 3.

Typical sequence logged for a successful DCUTR OK:

Note:

  • Peer is added to routing table
  • Peer’s distance to us is logged
  • Peer ID is updated within KAD routing table
  • DCUTR status with remote peer is deemed as OK (successful direct connection)

Additional Note:

What’s especially interesting in the example above is that the Peer ID is the same Peer ID that failed to do a DCUTR literally 2 mins prior to a proper success, even though on the failed round it had made 3 attempts through the whole workflow trying to get to a successful DCUTR (OK) :thinking: .


Reflecting over an 18-hour timeline for a single safenode1 service with the --home-network flag:

  • There were 20,481 DCUTR (ERR) log entries with ‘3 attempts exceeded’ messages

  • There were only 201 DCUTR (OK) log entries

  • Pivoting by Peer IDs across the 20,682 messages of DCUTR attempts (OK vs ERR) (a counting sketch follows below):

    Total Unique Peers: 3393
    Total Unique Peers with ERR: 3377
    Total Unique Peers with at least 1 OK: 104
    Total Unique Peers with at least 1 OK and with ERR: 88
    Total Unique Peers with at least 1 OK and without ERR: 16

  • Roughly 45% of all unique peers that performed a DCUTR had an ERR logged only once
  • Roughly 40% of all unique peers that performed a DCUTR had an ERR status logged between 2 and 10 times (each DCUTR ERR log entry itself represents 3 exceeded attempts)
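For anyone who wants to reproduce the pivot, a minimal sketch of the counting, assuming each relevant log line carries the peer id plus either an OK marker or the ‘attempts exceeded’ text (the matching strings are illustrative, not the exact safenode log format):

use std::collections::HashMap;
use std::fs;

fn main() {
    let log = fs::read_to_string("safenode.log").expect("log file");
    // peer id -> (ok count, err count)
    let mut stats: HashMap<&str, (u32, u32)> = HashMap::new();
    for line in log.lines() {
        if !line.contains("dcutr") {
            continue;
        }
        // Ed25519 libp2p peer ids start with "12D3Koo"; grab the first
        // token that looks like one.
        let Some(peer) = line.split_whitespace().find(|w| w.contains("12D3Koo")) else {
            continue;
        };
        let entry = stats.entry(peer).or_default();
        if line.contains("exceeded") {
            entry.1 += 1;
        } else {
            entry.0 += 1;
        }
    }
    let with_ok = stats.values().filter(|(ok, _)| *ok > 0).count();
    let with_err = stats.values().filter(|(_, err)| *err > 0).count();
    let both = stats.values().filter(|(ok, err)| *ok > 0 && *err > 0).count();
    println!("unique peers: {}", stats.len());
    println!("with OK: {with_ok}, with ERR: {with_err}, with both: {both}");
}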

I find it interesting that there are a few buckets of different combinations of my safenode1 attempting DCUTR against the same remote peer IDs, with 1 or more attempts and different outcomes over the 18 hours:

  • Bucket 1: On very first attempt it succeeds, and no further attempts or errors are generated
  • Bucket 2: All attempts result in ERR status with DCUTR
  • Bucket 3: A mix of ERR and OK (i.e. starts with ERR attempts, then goes to OK, then eventually goes back to ERR, etc. (flip-flopping)) over a long time horizon… Hmmm :thinking: .

FWIW, DCUTR may not be the primary focus for the team if UPnP is a huge success and external relay communication continues to work even without a high DCUTR success rate (the final phase of a successful hole punch, I believe) (all TBD).

9 Likes

cc @joshuef
I found that port forwarding has far fewer errors than home-network, on the order of 5-10 times fewer.

These errors may also slow down the rate of successful GETs and PUTs.

6 Likes

So the short answer:

If it’s being flagged, it’s not considered normal.

ConnectionIssue could be any number of things; essentially your node did not respond fast enough to a message (so the node could be overwhelmed/CPU-starved, for example). Or it could be normal dropped packets.

“What is normal” is something we need to dial in on. So if we’re seeing a lot of people reporting healthy nodes being flagged, we’ll need to be more tolerant. This is an ongoing process. So we’ll have to see what’s happening here and make some proposals and try them out I think. (Suggestions welcome!).

This should be a UTC timestamp, and we’re not being super accurate here, are we? We’re measuring to ~10 mins if I recall?

I suspect we can set that as a default for folks to use, and then the need for home-network should be even less, really.

If you can open a bug report with the details, that’d be awesome :bowing_man:

(It may be related to something @qi_ma is looking at, but I’m not sure.)


I’ll cede to @bzee on this stuff!

Looks like we’re being too intolerant of chunk failure (as part of the larger process) and bailing too early (we can try again later, sort of thing), and that’s more likely to bring down larger uploads, really.

That’s being looked at now.

cc @qi_ma this might well be something we’re seeing with the added autonat/holepunching etc complexities!

Aye, hopefully it’s essentially a last resort (for non-technical peeps).

7 Likes

I think you nailed it here; working out the correct number of nodes a machine can handle is not easy. X nodes will run fine for hours, then the next thing you know the load average is nearly double what the CPU can handle.

It happens in bursts, presumably when the network gets busy, as it will typically affect several nodes at a time.

I have been running enough nodes to be just on the verge of a queue; I suspect that when the CPU gets behind, these complaints come through.

You are likely right on the money. I don’t get many bad reports, but I am pushing the limits and occasionally going over.

4 Likes

Same results here; trying to get a number is very difficult.

And the way it is will lead to lots of bad nodes on the network; with the node Olympics coming up, everyone is going to be trying to get as many nodes running as they can :frowning:

3 Likes

Yeah, I thought about this today; those who run too many will suffer the consequences.

I am going to aim to run at 80% instead of 100%. Hopefully that gets me GOOD instead of BAD reports :grin:

Kid you not, having a log message telling people they have GOOD nodes may just be the incentive we need.

2 Likes

That’s what I’m aiming for, but it’s nearly impossible to get it to stay at 80%.

Would be cool if safenode-manager could terminate or add nodes depending on the available resources.
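A rough sketch of that idea for a Linux box, comparing the 1-minute load average to the core count and deciding whether to add or stop a node (the thresholds and the safenode-manager invocations in the comments are my assumptions, not confirmed commands):

use std::fs;
use std::thread;

fn main() {
    // The first field of /proc/loadavg is the 1-minute load average.
    let loadavg = fs::read_to_string("/proc/loadavg").expect("Linux only");
    let one_min: f64 = loadavg
        .split_whitespace()
        .next()
        .unwrap()
        .parse()
        .unwrap();
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1) as f64;
    let utilisation = one_min / cpus;

    if utilisation > 0.8 {
        // Over budget: shell out to stop a node here, e.g. something like
        // `safenode-manager stop --service-name safenodeN` (hypothetical).
        println!("load {utilisation:.2}: over 80%, stop a node");
    } else if utilisation < 0.6 {
        // Headroom: add one, e.g. `safenode-manager add` then start it
        // (again hypothetical).
        println!("load {utilisation:.2}: headroom, add a node");
    } else {
        println!("load {utilisation:.2}: hold steady");
    }
}

Run it from cron every few minutes and it would roughly approximate the "aim for 80%" target discussed above, without a human watching Vdash.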

2 Likes

Just to be clear, I meant to run at 80% of what I consider possible without the spikes.

Just in case someone read that as running at 100% cpu :rofl:

3 Likes