okay - I didn’t know data integrity was being questioned. The problem I had earlier with data was my own fault (my proxy was adding crap) … well, I can add a data check to the script.
You don’t have to send checksums - just create one, then get the file back and check it against the original.
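Something like this is all the check needs to be - a minimal Rust sketch, assuming the sha2 and hex crates, with placeholder file paths:

```rust
// Minimal sketch of the "checksum before upload, re-check after download" idea.
// Assumes the `sha2` and `hex` crates; the file paths are hypothetical.
use sha2::{Digest, Sha256};
use std::fs;

fn sha256_hex(path: &str) -> std::io::Result<String> {
    let bytes = fs::read(path)?;
    Ok(hex::encode(Sha256::digest(&bytes)))
}

fn main() -> std::io::Result<()> {
    // Hash the local file before `safe files put`...
    let before = sha256_hex("upload/report.pdf")?;
    // ...and hash the copy retrieved later with `safe files get`.
    let after = sha256_hex("download/report.pdf")?;
    // Any mismatch means the bytes changed somewhere along the way.
    assert_eq!(before, after, "data integrity check failed");
    println!("checksums match: {before}");
    Ok(())
}
```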
check the posts from @stout77 and @nevel
It’s probably not a SAFE issue as such - IIUC it’s more an implementation issue with PowerShell, but lots of folk will still use PS, so it remains as some sand in the vaseline.
In any case, there’s no point in a perpetual web unless we are very sure about data integrity.
What happened that the network is offline? Thanks.
Something is still there. I restarted my node a few minutes ago and it is trying to join, not giving an error. I assume if the network was offline it would give an error.
I’ve done this a few times as it seemed to crash after a few hours, never succeeding in joining. I think the crash is due to a memory bug that was identified above.
Ditto.
Network sections information for default network:
Read from: /home/topi/.safe/network_contacts/default
Genesis Key: PublicKey(0e66..769d)
Sections:
Prefix ''
----------------------------------
Section key: PublicKey(1304..3f35)
Section keys chain: [(PublicKey(0e66..769d), 18446744073709551615), (PublicKey(07c8..c79a), 5), (PublicKey(1876..d44f), 3), (PublicKey(01cb..cd86), 0), (PublicKey(1304..3f35), 1), (PublicKey(1422..77ee), 2)]
Elders:
| XorName | Age | Address |
| 1364f4.. | 5 | 161.35.42.123:44021 |
| 38608b.. | 5 | 142.93.38.111:35572 |
| 6790a7.. | 5 | 206.189.19.110:36495 |
| 758294.. | 5 | 142.93.44.126:39670 |
| a1dcf8.. | 5 | 104.248.167.4:43678 |
| b62694.. | 5 | 161.35.42.143:44107 |
| 56e8c3.. | 255 | 143.110.168.239:12000 |
@Toivo Any put/get success?
BTW - When you are unsure about any command like pkill, just prefix it with man for its manual page.
man pkill
Sadly, pkill is not the best example for a first-time man user, as its page covers three closely related commands. But in general, if you’re unsure, look at the man page, or just run the command with the help flag: -h or --help will almost always give you a hint and save asking others. You then get that nice warm “I worked that out” feeling.
Do not let any of that stop you asking questions though
but if you look at man or --help first, you can ask more targeted Qs
@happybeing Correct it is not dead, but is effectively in a coma AFAICS
From my Hetzner node, here is the log of:
RUST_LOG=sn_client=trace safe files put ~/.bash_logout
No, just “Failed to connect…” when I try either put or get.
Hey @joshuef, @davidrusu, do you want our logs?
Hmm, maybe that one didn’t respond here. Sometimes the ssh call fails.
So we’ve removed the unreliable upnp code for the moment, and it looks like folk may be able to join from behind a NAT, but those nodes aren’t then actually connectable. So we’ll be disabling that for now.
@stout77 I’ve made an issue to track the Windows corruption issue. If you or anyone else has more details or repro steps, please fire them up there.
It should create a fresh/random endpoint each time yep. But the old ones should be closed. So that’s a bug.
Also just to reiterate, I don’t think anyone is actually joining the network. @happybeing were you seeing chunks stored? It should not have been allowed to join until we had muuuch more data stored.
Seems like we’re consuming ports on the retry loop as tokio’s not being killed before restart now. I’d wager that’s the main issue w/r/t mem consumption before join.
Ah, lovely stuff.
Sorry, I had a draft Saturday morning then had to run out. At ~7am CET on Saturday, data storage was ~3-6 GB per node. Things were healthy in general.
Right now:
node-17: 24K total
node-1: 32K total
node-11: 24K total
node-10: 24K total
node-7: 24K total
node-4: 24K total
node-12: 24K total
node-16: 18G total
node-15: 23G total
node-6: 18G total
node-13: 12G total
node-9: 21G total
node-8: 21G total
node-2: 23G total
node-3: 23G total
node-20: 23G total
node-19: 12G total
If you had a node join, that would be cool aye.
If you have some failed Get that was previously working, the tail end of the logs - the messages not received, with their MsgIds - is the most pertinent part.
I see this has already been marked as offline. Thanks everyone for poking at this. I can see a few supposedly dead nodes, so going to start some morning log diving before taking a look at the join leak.
Thanks a lot for pointing this out. This is called cancellation safety in async Rust. It’s basically an unresolved problem in the Rust community and causes very nasty, hard-to-catch bugs. I think there’s a real chance it affects Safe code too, but the question is how common it is.
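To make the pitfall concrete, here’s a toy sketch (not Safe code; just assuming tokio): read_exact is documented as not cancellation safe, so whenever the timeout branch of select! wins, the bytes the in-flight read had already consumed are silently thrown away and the framing breaks.

```rust
// Toy illustration of an async-Rust cancellation-safety bug (not Safe code).
use tokio::io::{duplex, AsyncReadExt, AsyncWriteExt};
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let (mut tx, mut rx) = duplex(64);

    // Writer: sends two 4-byte "frames", one byte at a time, slowly.
    tokio::spawn(async move {
        for b in 1u8..=8 {
            tx.write_all(&[b]).await.unwrap();
            sleep(Duration::from_millis(30)).await;
        }
    });

    let mut received = Vec::new();
    loop {
        let mut frame = [0u8; 4];
        tokio::select! {
            res = rx.read_exact(&mut frame) => {
                match res {
                    Ok(_) => received.extend_from_slice(&frame),
                    Err(e) => {
                        // EOF shows up "early" because partial reads were lost.
                        println!("read error: {e}");
                        break;
                    }
                }
            }
            _ = sleep(Duration::from_millis(50)) => {
                // The read future is dropped here: any bytes it had already
                // consumed from the stream are discarded, corrupting the framing.
                println!("timeout fired, partial read discarded");
            }
        }
    }
    // Typically prints fewer than 8 bytes, in broken frames.
    println!("received: {received:?}");
}
```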
Comment added to the issue with steps to reproduce.
Btw, my node, on an Oracle Cloud instance, eventually joined at 2022-12-17T07:26:08 GMT and stored:
95M chunks/
992K register/
I can try and find the logs if they can be of any use.
OK, I’ll DM those to you. And I did get chunks, 196MB of them.
I don’t think I have those, or are they automatically logged somewhere?
Initial log poking looks to me like we’ve had nodes die from attempting to send too many chunks out for replication.
The nodes that died appeared to have a big ole mem-spike before they died. Which also correlates to some v high network throughput.
Last logs for all nodes lie here:
pub-logs.zip (7.4 MB)
our dead nodes look to be:
node-13__104.248.171.248
node-14__161.35.42.132
node-3__209.97.135.17
node-5__165.232.36.110
node-6__157.245.38.232
node-20__138.68.140.11
node-21__138.68.135.250
node-19__161.35.42.122
node-8__206.189.17.185
node-9__165.22.119.34
node-2__161.35.42.139
node-18__161.35.40.161
So for me, I’m trying to see if there’s a simple way to throttle this throughput (what would be a sensible limit?)
And see if there are any obvious bugs there…
But I’m also wondering: how did we get churn here? I’ve some logs from @Toivo I’ll be looking at to see how/why/when the node was accepted.
With those things in place we can look at another testnet. Since we appeared to fall down w/ churn it may make sense to look at more churn next… If any folk have thoughts on what they’d like to see tested next, suggestions are very welcome.
I think normal data usage. Share data, but real data as though we paid for it. No ability to run a node right now, so remove the frantic upload of tons of stuff to bypass any join limit. Just use it as a normal user would expect to see Safe.
Also test DBCs like mad.
Then we need to work hard on replication and churn. There will be limits, as we are trying to test real life but with no consequences. I expect churn every 5 minutes or so over many nodes (old nodes), but here we are compressing time as we bash this (which is good). I feel we need to be able to handle churn under this level of mad bashing as much as we can; I’d be happy for that to go past realistic levels.
Then node/section and data recovery. A node does not know it stopped, but a helper process would, and could restart it.
Then we add pay-for-data and farming rewards to slow things down, closer to real life, knowing that if it went mental we can recover nodes and data etc.
@dirvine this is a bit catch 22-y I think. If you didn’t earn $$ (via nodes), then we don’t have that ‘normal’ usage. So we need joins there.
If we implement some kind of genesis, folk could perhaps have test SNT sent to their addresses… though that’d also be a fair whack of $$, so not real… (or we have to manage storage algos and test those).
oorrrr, we have to implement a faucet? Or do manual distribution from the genesis.
I agree getting towards reality is a good aim, but the shortest path to it is not super clear.
One thing I was thinking of was lowering the “full” threshold, to get more churn but theoretically keep the total amount we could store high - e.g. we accept new nodes once existing ones are at 50% capacity.
I’m not sure how to understand this. Do they determine what chunks need to be replicated and then try to send them all at once? Or what are the steps in the process?
On churn, adults say “I have this data… what am I missing?” to all other nodes in the section. The others then check what they hold that our adult should be holding but is not, and send it out in batches.
Right now they batch to a limit of 50 MB per message. But they loop over this. So for 25 GB of data, this could be… a lot. And it’s happening for each adult, at each adult… So there’s a big mem spike if it’s not cleared fast enough.
It looks like it’s not being cleared fast enough, as we were attempting to connect to adults for each batch, so any broken connection currently results in batches.len() reconnect attempts at the same time.
Sooooo, we’re looking to limit the throughput of replication data to a given adult somewhat - either via a hard limit of MB/s per adult, and/or deduping the data to be replicated, etc.
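For what it’s worth, a rough sketch of those two ideas together might look like the below - dedupe the queue for one adult, then pace the batches with a crude per-adult byte-rate cap. The names (Chunk, send_batch) and the 50 MB / 5 MB/s figures are illustrative assumptions, not the actual sn_node code:

```rust
use std::collections::HashSet;
use std::time::Duration;

#[derive(Clone, Hash, PartialEq, Eq)]
struct ChunkAddress([u8; 32]);

struct Chunk {
    addr: ChunkAddress,
    bytes: Vec<u8>,
}

const MAX_BATCH_BYTES: usize = 50 * 1024 * 1024; // one replication message
const MAX_BYTES_PER_SEC: usize = 5 * 1024 * 1024; // per-adult cap (made up)

async fn send_batch(batch: &[Chunk]) {
    // Placeholder for the real "send replication message to the adult" call.
    println!("sending batch of {} chunks", batch.len());
}

async fn replicate_to_adult(mut pending: Vec<Chunk>) {
    // Dedupe: never queue the same chunk twice for this adult.
    let mut seen = HashSet::new();
    pending.retain(|c| seen.insert(c.addr.clone()));

    while !pending.is_empty() {
        // Fill one batch up to the per-message size limit.
        let mut batch = Vec::new();
        let mut batch_bytes = 0usize;
        while let Some(next_len) = pending.last().map(|c| c.bytes.len()) {
            if !batch.is_empty() && batch_bytes + next_len > MAX_BATCH_BYTES {
                break;
            }
            let chunk = pending.pop().unwrap();
            batch_bytes += chunk.bytes.len();
            batch.push(chunk);
        }

        send_batch(&batch).await;

        // Pace the next batch so this adult never sees much more than
        // MAX_BYTES_PER_SEC of replication traffic from us.
        let pause = batch_bytes as f64 / MAX_BYTES_PER_SEC as f64;
        tokio::time::sleep(Duration::from_secs_f64(pause)).await;
    }
}

#[tokio::main]
async fn main() {
    // Three dummy 1 KiB chunks, one duplicated, just to exercise the sketch.
    let chunk = |id: u8| Chunk { addr: ChunkAddress([id; 32]), bytes: vec![0u8; 1024] };
    replicate_to_adult(vec![chunk(1), chunk(1), chunk(2)]).await;
}
```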
I don’t think we can simulate normal use cases here, because normally I wouldn’t share any data with anyone here. I share some data with some friends and colleagues very irregularly. It’s actually quite difficult to even get an idea of how much and what kind of data you share, let alone to replicate that in a completely different social environment.
Maybe if we could get an idea of statistically average data use and someone could write a script for it that everyone could use?
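For the sake of argument, such a script could start as simple as the sketch below - the size and interval distributions are made-up placeholders rather than real usage statistics, and it assumes the rand crate plus the safe CLI on PATH:

```rust
use rand::Rng;
use std::fs;
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let mut rng = rand::thread_rng();
    for i in 0.. {
        // Mostly small files (docs, photos), occasionally something big.
        let size_kb: u64 = if rng.gen_bool(0.9) {
            rng.gen_range(10..2_000)
        } else {
            rng.gen_range(10_000..200_000)
        };
        let path = format!("/tmp/sim_upload_{i}.bin");
        let data: Vec<u8> = (0..size_kb * 1024).map(|_| rng.gen()).collect();
        fs::write(&path, &data)?;

        // Push it with the normal CLI, as a user would.
        let status = Command::new("safe").args(["files", "put", path.as_str()]).status()?;
        println!("uploaded {size_kb} KB, exit: {status}");
        fs::remove_file(&path)?;

        // Wait a "human-ish" random interval before the next upload.
        sleep(Duration::from_secs(rng.gen_range(60..1_800)));
    }
    Ok(())
}
```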
Just a quick thought: what about adding a random delay on reconnect? Something like CSMA/CD does on Ethernet.
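Something along these lines, maybe - a sketch assuming tokio and the rand crate, where try_connect stands in for the real reconnection call: an exponential window with a uniformly random delay picked inside it, like Ethernet picking a random slot after a collision.

```rust
use rand::Rng;
use tokio::time::{sleep, Duration};

// Stand-in for the real reconnection attempt; pretends to succeed on try 5.
async fn try_connect(attempt: u32) -> bool {
    attempt >= 4
}

#[tokio::main]
async fn main() {
    let base = Duration::from_millis(100);
    for attempt in 0..8u32 {
        if try_connect(attempt).await {
            println!("connected on attempt {attempt}");
            return;
        }
        // Exponential backoff window, capped at ~6.4 s.
        let window = base * 2u32.pow(attempt.min(6));
        // Pick a random delay inside the window so retries from many nodes
        // don't all land at the same instant.
        let jitter = rand::thread_rng().gen_range(0..window.as_millis() as u64);
        let delay = Duration::from_millis(jitter);
        println!("attempt {attempt} failed, backing off for {delay:?}");
        sleep(delay).await;
    }
    println!("giving up");
}
```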
