[Offline] 50GB Static Testnet

4 parallel downloads, all hashes verified correctly!!!

I think we filled the space or killed the uploads

2022-12-22T02:20:36.233466Z DEBUG                 main publish_register_ops{wal=[Register(Create { cmd: SignedRegisterCreate { op: CreateRegister { name: 2945dd(00101001).., tag: 1100, policy: Policy { owner: Key(Bls(PublicKey(1627..faf0))), permissions: {Key(Bls(PublicKey(1627..faf0))): Permissions { write: Some(true) }} } }, auth: ClientAuth { public_key: Bls(PublicKey(1627..faf0)), signature: Bls(Signature(1824..1328)) } }, section_sig: SectionSig(PublicKey(1255..6d15)) })]}:client-api send cmd:session send cmd{dst_address=2945dd(00101001)..}: sn_client::sessions::messaging: Insufficient CmdAcks returned for MsgId(61bb..66a5): 5/7. Missing Responses from: []
2022-12-22T02:20:36.233480Z TRACE                 main publish_register_ops{wal=[Register(Create { cmd: SignedRegisterCreate { op: CreateRegister { name: 2945dd(00101001).., tag: 1100, policy: Policy { owner: Key(Bls(PublicKey(1627..faf0))), permissions: {Key(Bls(PublicKey(1627..faf0))): Permissions { write: Some(true) }} } }, auth: ClientAuth { public_key: Bls(PublicKey(1627..faf0)), signature: Bls(Signature(1824..1328)) } }, section_sig: SectionSig(PublicKey(1255..6d15)) })]}:client-api send cmd: sn_client::api::cmds: Failed response on Register(Create { cmd: SignedRegisterCreate { op: CreateRegister { name: 2945dd(00101001).., tag: 1100, policy: Policy { owner: Key(Bls(PublicKey(1627..faf0))), permissions: {Key(Bls(PublicKey(1627..faf0))): Permissions { write: Some(true) }} } }, auth: ClientAuth { public_key: Bls(PublicKey(1627..faf0)), signature: Bls(Signature(1824..1328)) } }, section_sig: SectionSig(PublicKey(1255..6d15)) }), response: Err(InsufficientAcksReceived { msg_id: MsgId(61bb..66a5), expected: 7, received: 5 })
Error: 
   0: ClientError: Did not receive sufficient ACK messages from Elders to be sure this cmd (MsgId(61bb..66a5)) passed, expected: 7, received 5.
   1: Did not receive sufficient ACK messages from Elders to be sure this cmd (MsgId(61bb..66a5)) passed, expected: 7, received 5.

Location:
   sn_cli/src/subcommands/files.rs:213

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

Same here now, uploads are failing repeatedly for new random 10MB files (around 01:50+ UTC):

I have stopped the upload script from my end.

safe cat safe://hyryyrysg7td9ozp5cmtn3u64qezgn371gcb1uho1b8f76rra8apuy8azgynra/The%20Rolling%20Stones/05.%20Out%20Of%20Our%20Heads%20-%201965b/10%20Play%20With%20Fire.mp3 > 10.mp3

is stalled for me.

I’ll rerun with trace logging but I think this one has ceased to be :cry:

Nah… RIP 50GB Static Testnet - your name will be spoken in hushed tones…

Good suggestion; perhaps something to impl/test in a future testnet once this one is considered a PASS.

4 Likes

Might need to wait for that :cry:

Let’s see what the logs say in the morning.

2 Likes

My hope is that the devs will implement a special message for when a node runs out of space, something like “Out of storage space, try again later.” Generally speaking, the network should still serve GETs even if it’s temporarily out of space for new PUTs. It’s a rare edge case, but a graceful pause may help with early growing pains.
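Purely as a sketch of what that could look like on the client side (the variant name, fields and retry behaviour here are all hypothetical, not the actual sn_client API):

// Hypothetical sketch only: a dedicated "out of space" response alongside the
// existing ACK failure, so clients can pause PUTs gracefully while GETs keep working.
use std::time::Duration;

#[derive(Debug)]
enum PutError {
    // Hypothetical variant suggested in the post above.
    OutOfStorageSpace { retry_after: Duration },
    // Shape of the failure we're currently seeing.
    InsufficientAcks { expected: usize, received: usize },
}

fn handle_put_result(result: Result<(), PutError>) {
    match result {
        Ok(()) => println!("chunk stored"),
        Err(PutError::OutOfStorageSpace { retry_after }) => {
            // A graceful pause rather than a hard failure.
            println!("section has no space for new PUTs, retrying in {retry_after:?}");
        }
        Err(PutError::InsufficientAcks { expected, received }) => {
            println!("only {received}/{expected} ACKs received, treating as failed");
        }
    }
}

fn main() {
    handle_put_result(Err(PutError::OutOfStorageSpace {
        retry_after: Duration::from_secs(60),
    }));
    handle_put_result(Err(PutError::InsufficientAcks { expected: 7, received: 5 }));
}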

3 Likes

Same here:

bash>> filename:4c52033bb341974a – 6587050
Error:
   0: ClientError: Did not receive sufficient ACK messages from Elders to be sure this cmd (MsgId(b30f..b2d9)) passed, expected: 7, received 5.
   1: Did not receive sufficient ACK messages from Elders to be sure this cmd (MsgId(b30f..b2d9)) passed, expected: 7, received 5.

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

Might be full now.

1 Like

I think it’s the upload rate from folk. Nodes do not appear stressed. We reached 23GB pretty fast in the last 50GB network. I’ve also had a node (in DO) uploading on a loop previously, and that can increase the uploaded data pretty swiftly.

But I’d say that’s all fine. We get to test some different conditions!


To break it down further we’d need to know the “why” of this desirability. What are the properties we want?

There are lots of ways to increase uniformity of data across a section:

  • changing quantity of data
  • changing size of data put (chunks are not necessarily all the same size… registers either)
  • changing the size of a chunk (more chunks probably meaning a more even distribution)
  • quantity of nodes
  • force certain xornames (as @jlpell suggests)
  • increase data replication count

But these themselves have knock-on effects. We’d need to see how it all plays out with various:

  • sections sizes
  • elder counts
  • node specs
  • client specs

And what sort of knock-on effect that all has for performance of the network in general (e.g. we might expect smaller chunks to be more intensive to route?).

There are probably more variables I’m missing there too :coffee:.

It’s something that, as the post I linked to above shows, we can model statistically. But it’s also something we’ll probably need to test in the real world, on various machines etc. I don’t think there’s anything fundamental in the network design which prohibits increasing data uniformity across a section. But what we would be aiming for and “why” are just as pertinent as “can we do it?”.
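For a rough feel of how chunk count and replication interact with spread, here’s a minimal Monte Carlo sketch. It is not the real sn_node placement code; the 32-byte xornames, the “store at the k closest nodes by XOR distance” rule, and the node/chunk/replication counts are all illustrative assumptions.

// Minimal Monte Carlo sketch of data spread across nodes (assumptions as above).
use rand::RngCore;

const NODES: usize = 24;
const CHUNKS: usize = 10_000;
const REPLICATION: usize = 4;

type XorName = [u8; 32];

fn random_name(rng: &mut impl RngCore) -> XorName {
    let mut name = [0u8; 32];
    rng.fill_bytes(&mut name);
    name
}

// XOR distance, compared lexicographically, as in a Kademlia-style address space.
fn xor_distance(a: &XorName, b: &XorName) -> XorName {
    let mut d = [0u8; 32];
    for i in 0..32 {
        d[i] = a[i] ^ b[i];
    }
    d
}

fn main() {
    let mut rng = rand::thread_rng();
    let nodes: Vec<XorName> = (0..NODES).map(|_| random_name(&mut rng)).collect();
    let mut stored = vec![0usize; NODES];

    for _ in 0..CHUNKS {
        let chunk = random_name(&mut rng);
        // Each chunk goes to the REPLICATION closest nodes to its address.
        let mut by_distance: Vec<usize> = (0..NODES).collect();
        by_distance.sort_by_key(|&i| xor_distance(&nodes[i], &chunk));
        for &i in by_distance.iter().take(REPLICATION) {
            stored[i] += 1;
        }
    }

    let min = stored.iter().min().unwrap();
    let max = stored.iter().max().unwrap();
    println!("chunks per node: min {min}, max {max}");
}

Bumping CHUNKS up (or shrinking chunk size so there are more of them) narrows the min/max gap, which is the intuition behind the “more, smaller chunks means a more even distribution” point in the list above.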


Current data stored:

node-23: 24K total
node-1:  32K total
node-16: 24K total
node-24: 24K total
node-15: 24K total
node-3:  24K total
node-21: 24K total
node-22: 23G total
node-19: 22G total
node-9:  22G total
node-5:  34G total
node-11: 25G total
node-18: 29G total
node-7:  25G total
node-6:  34G total
node-8:  33G total
node-10: 34G total
node-17: 25G total
node-25: 34G total
node-2:  34G total
node-4:  34G total
node-14: 34G total
node-20: 34G total

So it looks like ~6 hours ago we had a spike in throughput to some nodes, which is probably indicative of some issue. Though metricbeat seems fine and I don’t see any broken nodes :thinking:

At the moment I’m able to download some test-data, but it’s going much more slowly than earlier.


This is what it should do, yep. At the moment, on some smaller node tests (since we started this testnet), I’ve seen some failure at capacity as we try to route chunks to new nodes, but we’ve not removed the full node from our section (and so we’d loop over this somewhat), which may be a deadly scenario. There’s some dysfunction tracking missing there that’d be needed to avoid that.
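To illustrate the loop hazard being described (a toy model only; real routing uses XOR closeness and the names here are made up): if the full node is never dropped from the candidate set, every retry picks it again.

// Toy model of "full node never removed from the section", not actual sn_node routing.
use std::collections::HashSet;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct NodeId(u8);

// Stand-in for "the k closest nodes by XOR distance": here just the first k candidates.
fn closest_nodes(candidates: &[NodeId], k: usize) -> Vec<NodeId> {
    candidates.iter().take(k).cloned().collect()
}

fn main() {
    let candidates = vec![NodeId(1), NodeId(2), NodeId(3)];
    let full_nodes: HashSet<NodeId> = [NodeId(1)].into_iter().collect();

    for attempt in 1..=3 {
        let targets = closest_nodes(&candidates, 2);
        let rejected: Vec<_> = targets.iter().filter(|&n| full_nodes.contains(n)).collect();
        // Because the full node stays in `candidates`, each retry targets it again,
        // so storing this chunk at the required number of nodes never succeeds.
        println!("attempt {attempt}: targets {targets:?}, rejected by full nodes {rejected:?}");
    }
}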

But let’s see if we get there. At the moment things look unputtable. But I won’t bring it down just yet.

7 Likes

I should note: we’re actively testing out smaller nodes, which would increase node count and data spread. We’re testing this to see how a reduced amount of data per node affects the churn process (which is looking pretty positive thus far).

3 Likes

There could also be reasons that it’s undesirable. Some nodes in the real world may not be as responsive or may have longer lag times. I’m not sure if the network currently takes that into account.

Seems to me there needs to be (supposing there isn’t already) a weighting system here with regard to remaining space available and quality of space available. It could be useful in determining how much to charge for space too. But again, IDK what’s currently in the code; perhaps this is already a thing.

This would be node age and its knock-on effects, I think. That exists, though the whole knock-on process doesn’t yet. But basically older nodes (which should be more reliable) will likely get paid more (and so stick around).

We don’t factor bandwidth/response time in yet, but that could happen if we wanted it to.
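Purely as an illustration of what such a weighting could look like (the function shape, the log term and the numbers are all assumptions, not what’s in the code):

// Hypothetical weighting of a storage node by age and remaining space.
// Not how sn_node actually scores nodes; illustration only.
fn node_weight(node_age: u8, remaining_bytes: u64, capacity_bytes: u64) -> f64 {
    let free_ratio = remaining_bytes as f64 / capacity_bytes as f64;
    // Older nodes are presumed more reliable; more free space is preferred.
    (node_age as f64).ln_1p() * free_ratio
}

fn main() {
    // e.g. an older node with half its space left vs a young, nearly full one.
    println!("old, half full:     {:.3}", node_weight(20, 25_000_000_000, 50_000_000_000));
    println!("young, nearly full: {:.3}", node_weight(5, 2_000_000_000, 50_000_000_000));
}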

4 Likes

Morning! Looks like the network has stopped replying, for both PUT and GET? It just hangs, with no exception and no timeout…

I was able to grab the test-data an hour or so ago, albeit very slowly.

It’s still not clear to me what’s up at the moment. But it may well be on its last legs.

1 Like

Not uploading for me either. It had a good run though. Hopefully something interesting pops up as a result. :clap:

edit: it may be uploading very slowly. Definitely trying to do something, but not sure how successful.

@joshuef - no chance that we’re being restricted by the droplet cloud (Amazon)? Like they’re perceiving it as some attack on their cloud, so they have automatically restricted bandwidth?

The response I’m getting:

Error: 
   0: ClientError: Did not receive sufficient ACK messages from Elders to be sure this cmd (MsgId(a5f0..7944)) passed, expected: 7, received 5.
   1: Did not receive sufficient ACK messages from Elders to be sure this cmd (MsgId(a5f0..7944)) passed, expected: 7, received 5.

Location:
   sn_cli/src/subcommands/files.rs:213

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

Ditto, downloading is very slow, and uploads fail:

Mac-Pro:keys user1$ time safe files put ../LICENSE-MIT 
Error: 
   0: ClientError: Did not receive sufficient ACK messages from Elders to be sure this cmd (MsgId(5977..d131)) passed, expected: 7, received 5.
   1: Did not receive sufficient ACK messages from Elders to be sure this cmd (MsgId(5977..d131)) passed, expected: 7, received 5.

Location:
   sn_cli/src/subcommands/files.rs:213

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

real    2m24.130s
user    0m0.068s
sys     0m0.057s

1 Like

Same here, even a small 88KB file fails to put with the same message (expected: 7, received 5).

You might be the new hammer lord!

1 Like

Yeh, it looks like a couple of elders have become unreachable (or effectively so) for some reason.

We’ve not gotten deep into the dysfunction checking, which in theory should drop these nodes (as they should also not be failing in these controlled circumstances).

So now I’m trying to see if it’s a consistent pair, or just everything being a bit slow orrrrr???


edit: looks like some nodes may have deadlocked or something. Still running, but not producing any more logs :thinking:

12 Likes

Aha, okay, so it looks like we’re deadlocking around the node storage-level reporting. Many updates come in, and we’re attempting to lock for each. I’m not sure exactly where the issue is; I think it’s around updating many nodes’ storage levels at once. But I saw something similar on another test I was doing and made chore: only track full nodes as/when joins are not already allowed · maidsafe/safe_network@ea50066 · GitHub to alleviate it.
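For clarity, a rough sketch of the shape of that mitigation, under assumed/hypothetical types (not the actual sn_node code): check a cheap flag first, and only take the shared lock when tracking is actually needed, keeping the critical section free of awaits.

// Rough sketch only, assumed types: skip the shared-state lock unless fullness
// actually needs tracking (i.e. joins aren't already allowed), and keep the
// critical section short with no .await while the guard is held.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use tokio::sync::RwLock;

#[derive(Default)]
struct FullNodeTracking {
    full_nodes: Vec<String>, // hypothetical: adults that reported being full
}

struct Elder {
    joins_allowed: Arc<AtomicBool>,
    tracking: Arc<RwLock<FullNodeTracking>>,
}

impl Elder {
    // Called for every storage-level report coming in from adults.
    async fn handle_storage_level_report(&self, reporter: String, is_full: bool) {
        // Cheap check first: if joins are already allowed there's nothing to do,
        // so don't contend on the lock for every single report.
        if !is_full || self.joins_allowed.load(Ordering::Relaxed) {
            return;
        }
        let mut tracking = self.tracking.write().await;
        if !tracking.full_nodes.contains(&reporter) {
            tracking.full_nodes.push(reporter);
        }
    }
}

#[tokio::main]
async fn main() {
    let elder = Elder {
        joins_allowed: Arc::new(AtomicBool::new(false)),
        tracking: Arc::new(RwLock::new(FullNodeTracking::default())),
    };
    elder.handle_storage_level_report("node-22".into(), true).await;
    println!("tracked full nodes: {:?}", elder.tracking.read().await.full_nodes);
}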

But, we’re looking to remove the need for this reporting entirely and rely on elders storing data (as many folk noted, they’re conspicuously empty), which will a) give us more nodes and b) allow elders to use their own storage as an estimate of the section saturation, without the need for messages.

It removes some state management and any need to sync on “full”, as well as any need to track “full” adults, simplifying it down to “can you give us what we ask for or not?”. Which is all quite nice.

It does give elders a bit more to do, but they do not appear to be a bottleneck at the moment.
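And a minimal sketch of the “use your own storage as the estimate” idea (the threshold and names are illustrative assumptions, not what’s planned in code):

// Illustrative only: an elder inferring section saturation from its own stored data,
// rather than from adults' storage-level report messages. Threshold is an assumption.
fn local_saturation(used_bytes: u64, capacity_bytes: u64) -> f64 {
    used_bytes as f64 / capacity_bytes as f64
}

fn main() {
    let estimate = local_saturation(34 * 1024 * 1024 * 1024, 50 * 1024 * 1024 * 1024);
    // e.g. start allowing new joins once the section looks roughly 80% full.
    let allow_joins = estimate > 0.8;
    println!("saturation ~{:.0}%, allow new joins: {allow_joins}", estimate * 100.0);
}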

11 Likes

Just in case it could mean something: last night, about 1am, I ran the install script to set up another machine and started an upload. Could a newer version being used have caused any problems, as I see the GitHub repo has been busy?

2 Likes