okay - I didn’t know data integrity was being questioned. The problem I had earlier with data was my own fault (my proxy was adding crap) … well, I can add a data check to the script.
You don’t have to send checksums - just create one, then get the file back and check it against the original.
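Something like this is all the check needs to be - a minimal Rust sketch, assuming the sha2 and hex crates, with placeholder file paths:

```rust
// Minimal sketch of the "checksum before upload, re-check after download" idea.
// Assumes the `sha2` and `hex` crates; the file paths are hypothetical.
use sha2::{Digest, Sha256};
use std::fs;

fn sha256_hex(path: &str) -> std::io::Result<String> {
    let bytes = fs::read(path)?;
    Ok(hex::encode(Sha256::digest(&bytes)))
}

fn main() -> std::io::Result<()> {
    // Hash the local file before `safe files put`...
    let before = sha256_hex("upload/report.pdf")?;
    // ...and hash the copy retrieved later with `safe files get`.
    let after = sha256_hex("download/report.pdf")?;
    // Any mismatch means the bytes changed somewhere along the way.
    assert_eq!(before, after, "data integrity check failed");
    println!("checksums match: {before}");
    Ok(())
}
```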
check the posts from @stout77 and @nevel
It’s probably not a SAFE issue as such - IIUC it’s more an implementation issue with PowerShell, but lots of folk will still use PS, so it remains as some sand in the vaseline.
In any case, there’s no point in a perpetual web unless we are very sure about data integrity.
What happened that the network is offline? Thanks.
Something is still there. I restarted my node a few minutes ago and it is trying to join, not giving an error. I assume if the network was offline it would give an error.
I’ve done this a few times as it seemed to crash after a few hours, never succeeding in joining. I think the crash is due to a memory bug that was identified above.
Ditto.
Network sections information for default network:
Read from: /home/topi/.safe/network_contacts/default
Genesis Key: PublicKey(0e66..769d)
Sections:
Prefix ''
----------------------------------
Section key: PublicKey(1304..3f35)
Section keys chain: [(PublicKey(0e66..769d), 18446744073709551615), (PublicKey(07c8..c79a), 5), (PublicKey(1876..d44f), 3), (PublicKey(01cb..cd86), 0), (PublicKey(1304..3f35), 1), (PublicKey(1422..77ee), 2)]
Elders:
| XorName | Age | Address |
| 1364f4.. | 5 | 161.35.42.123:44021 |
| 38608b.. | 5 | 142.93.38.111:35572 |
| 6790a7.. | 5 | 206.189.19.110:36495 |
| 758294.. | 5 | 142.93.44.126:39670 |
| a1dcf8.. | 5 | 104.248.167.4:43678 |
| b62694.. | 5 | 161.35.42.143:44107 |
| 56e8c3.. | 255 | 143.110.168.239:12000 |
@Toivo Any put/get success?
BTW - When you are unsure about any command like pkill, just prefix it with man for its manual page.
man pkill
Sadly, pkill is not the best example for a first-time man user, as its page covers three closely related commands. But in general, if you’re unsure, look at the man page, or just run the command with the help flag: -h or --help will almost always give you a hint and save asking others. You then get that nice warm “I worked that out” feeling.
Do not let any of that stop you asking questions though
but if you look at man or --help first, you can ask more targeted Qs
@happybeing Correct it is not dead, but is effectively in a coma AFAICS
From my Hetzner node, here is the log of:
RUST_LOG=sn_client=trace safe files put ~/.bash_logout
No, just “Failed to connect…” when I try either put or get.
Hey @joshuef, @davidrusu, do you want our logs?
Hmm, maybe that one didn’t respond here. Sometimes the ssh call fails.
So we’ve removed the unreliable upnp code for the moment, and it looks like folk may be able to join from behind a NAT, but those nodes aren’t then actually connectable. So we’ll be disabling that for now.
@stout77 I’ve made an issue to track the Windows corruption issue. If you or anyone else has more details or repro steps, please fire them up there.
It should create a fresh/random endpoint each time yep. But the old ones should be closed. So that’s a bug.
Also just to reiterate, I don’t think anyone is actually joining the network. @happybeing were you seeing chunks stored? It should not have been allowed to join until we had muuuch more data stored.
Seems like we’re consuming ports on the retry loop as tokio’s not being killed before restart now. I’d wager that’s the main issue w/r/t mem consumption before join.
Ah, lovely stuff.
Sorry, I had a draft Saturday morning then had to run out. At ~7am CET on Saturday, data storage was ~3-6 GB per node. Things were healthy in general.
Right now:
node-17: 24K total
node-1: 32K total
node-11: 24K total
node-10: 24K total
node-7: 24K total
node-4: 24K total
node-12: 24K total
node-16: 18G total
node-15: 23G total
node-6: 18G total
node-13: 12G total
node-9: 21G total
node-8: 21G total
node-2: 23G total
node-3: 23G total
node-20: 23G total
node-19: 12G total
If you had a node join, that would be cool aye.
If you have some failed Get that was previously working, the tail end of the logs - the messages not received, with their MsgIds - is the most pertinent part.
I see this has already been marked as offline. Thanks everyone for poking at this. I can see a few supposedly dead nodes, so going to start some morning log diving before taking a look at the join leak.
Thanks a lot for pointing this out. This is called cancellation safety in async Rust. It’s basically an unresolved problem in the Rust community and causes very nasty, hard-to-catch bugs. I think there’s a real chance it affects Safe code too, but the question is how common it is.
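To make the pitfall concrete, here’s a toy sketch (not Safe code; just assuming tokio): read_exact is documented as not cancellation safe, so whenever the timeout branch of select! wins, the bytes the in-flight read had already consumed are silently thrown away and the framing breaks.

```rust
// Toy illustration of an async-Rust cancellation-safety bug (not Safe code).
use tokio::io::{duplex, AsyncReadExt, AsyncWriteExt};
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let (mut tx, mut rx) = duplex(64);

    // Writer: sends two 4-byte "frames", one byte at a time, slowly.
    tokio::spawn(async move {
        for b in 1u8..=8 {
            tx.write_all(&[b]).await.unwrap();
            sleep(Duration::from_millis(30)).await;
        }
    });

    let mut received = Vec::new();
    loop {
        let mut frame = [0u8; 4];
        tokio::select! {
            res = rx.read_exact(&mut frame) => {
                match res {
                    Ok(_) => received.extend_from_slice(&frame),
                    Err(e) => {
                        // EOF shows up "early" because partial reads were lost.
                        println!("read error: {e}");
                        break;
                    }
                }
            }
            _ = sleep(Duration::from_millis(50)) => {
                // The read future is dropped here: any bytes it had already
                // consumed from the stream are discarded, corrupting the framing.
                println!("timeout fired, partial read discarded");
            }
        }
    }
    // Typically prints fewer than 8 bytes, in broken frames.
    println!("received: {received:?}");
}
```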
Comment added to the issue with steps to reproduce.
Btw, my node, on an Oracle Cloud instance, eventually joined at 2022-12-17T07:26:08 GMT and stored:
95M chunks/
992K register/
I can try and find the logs if they can be of any use.
OK, I’ll DM those to you. And I did get chunks, 196MB of them.
I don’t think I have those, or are they automatically logged somewhere?
Initial log poking looks to me like we’ve had nodes die from attempting to send too many chunks out for replication.
The nodes that died appeared to have a big ole mem-spike before they died. Which also correlates to some v high network throughput.
Last logs for all nodes lie here:
pub-logs.zip (7.4 MB)
our dead nodes look to be:
node-13__104.248.171.248
node-14__161.35.42.132
node-3__209.97.135.17
node-5__165.232.36.110
node-6__157.245.38.232
node-20__138.68.140.11
node-21__138.68.135.250
node-19__161.35.42.122
node-8__206.189.17.185
node-9__165.22.119.34
node-2__161.35.42.139
node-18__161.35.40.161
So for me, I’m trying to see if there’s a simple way to throttle this throughput (what would be a sensible limit?)
And see if there are any obvious bugs there…
But I’m also wondering: how did we get churn here? I’ve some logs from @Toivo I’ll be looking at to see how/why/when the node was accepted.
With those things in place we can look at another testnet. Since we appeared to fall down w/ churn it may make sense to look at more churn next… If any folk have thoughts on what they’d like to see tested next, suggestions are very welcome.
I think normal data usage. Share data, but real data as though we paid for it. No ability to run a node right now, so remove the frantic upload of tons of stuff to bypass any join limit. Just use it as a normal user would expect to see Safe.
Also test DBCs like mad.
Then we need to work hard on replication and churn. There will be limits, as we are trying to test real life but with no consequences. I expect churn every 5 minutes or so over many nodes (old nodes), but here we are compressing time as we bash this (which is good). I feel we need to be able to handle churn under this level of mad bashing as much as we can; I’d be happy for that to go past realistic levels.
Then node/section and data recovery. A node does not know it stopped, but a helper process would, and could restart it.
Then we add pay-for-data and farming rewards to slow things down, closer to real life, knowing that if it went mental we can recover nodes and data etc.
@dirvine this is a bit catch 22-y I think. If you didn’t earn $$ (via nodes), then we don’t have that ‘normal’ usage. So we need joins there.
If we implement some kind of genesis, folk could perhaps have test SNT sent to their addresses… though that’d also be a fair whack of $$, so not real… (or we have to manage storage algos and test those).
oorrrr, we have to implement a faucet? Or do manual distribution from the genesis.
I agree getting towards reality is a good aim, but the shortest path to it is not super clear.
One thing I was thinking of was lowering the “full” threshold, to get more churn but theoretically keep the total amount we could store high - e.g. we accept new nodes once existing ones are at 50% capacity.
I’m not sure how to understand this. Do they determine what chunks need to be replicated and then try to send them all at once? Or what are the steps in the process?
On churn, adults say “I have this data… what am I missing?” to all other nodes in the section. The others then check what they hold that our adult should be holding but is not, and send it out in batches.
Right now they batch to a limit of 50 MB per message. But they loop over this. So for 25 GB of data, this could be… a lot. And it’s happening for each adult, at each adult… So there’s a big mem spike if it’s not cleared fast enough.
It looks like it’s not being cleared fast enough, as we were attempting to connect to adults for each batch, so any broken connection currently results in batches.len() reconnect attempts at the same time.
Sooooo, we’re looking to limit the throughput of replication data to a given adult somewhat - either via a hard limit of MB/s per adult, and/or deduping the data to be replicated, etc.
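For what it’s worth, a rough sketch of those two ideas together might look like the below - dedupe the queue for one adult, then pace the batches with a crude per-adult byte-rate cap. The names (Chunk, send_batch) and the 50 MB / 5 MB/s figures are illustrative assumptions, not the actual sn_node code:

```rust
use std::collections::HashSet;
use std::time::Duration;

#[derive(Clone, Hash, PartialEq, Eq)]
struct ChunkAddress([u8; 32]);

struct Chunk {
    addr: ChunkAddress,
    bytes: Vec<u8>,
}

const MAX_BATCH_BYTES: usize = 50 * 1024 * 1024; // one replication message
const MAX_BYTES_PER_SEC: usize = 5 * 1024 * 1024; // per-adult cap (made up)

async fn send_batch(batch: &[Chunk]) {
    // Placeholder for the real "send replication message to the adult" call.
    println!("sending batch of {} chunks", batch.len());
}

async fn replicate_to_adult(mut pending: Vec<Chunk>) {
    // Dedupe: never queue the same chunk twice for this adult.
    let mut seen = HashSet::new();
    pending.retain(|c| seen.insert(c.addr.clone()));

    while !pending.is_empty() {
        // Fill one batch up to the per-message size limit.
        let mut batch = Vec::new();
        let mut batch_bytes = 0usize;
        while let Some(next_len) = pending.last().map(|c| c.bytes.len()) {
            if !batch.is_empty() && batch_bytes + next_len > MAX_BATCH_BYTES {
                break;
            }
            let chunk = pending.pop().unwrap();
            batch_bytes += chunk.bytes.len();
            batch.push(chunk);
        }

        send_batch(&batch).await;

        // Pace the next batch so this adult never sees much more than
        // MAX_BYTES_PER_SEC of replication traffic from us.
        let pause = batch_bytes as f64 / MAX_BYTES_PER_SEC as f64;
        tokio::time::sleep(Duration::from_secs_f64(pause)).await;
    }
}

#[tokio::main]
async fn main() {
    // Three dummy 1 KiB chunks, one duplicated, just to exercise the sketch.
    let chunk = |id: u8| Chunk { addr: ChunkAddress([id; 32]), bytes: vec![0u8; 1024] };
    replicate_to_adult(vec![chunk(1), chunk(1), chunk(2)]).await;
}
```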
I don’t think we can simulate normal use cases here, because normally I wouldn’t share any data with anyone here. I share some data with some friends and colleagues very irregularly. It’s actually quite difficult to even get an idea of how much and what kind of data you share, let alone to replicate that in a completely different social environment.
Maybe if we could get an idea of statistically average data use and someone could write a script for it that everyone could use?
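For the sake of argument, such a script could start as simple as the sketch below - the size and interval distributions are made-up placeholders rather than real usage statistics, and it assumes the rand crate plus the safe CLI on PATH:

```rust
use rand::Rng;
use std::fs;
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let mut rng = rand::thread_rng();
    for i in 0.. {
        // Mostly small files (docs, photos), occasionally something big.
        let size_kb: u64 = if rng.gen_bool(0.9) {
            rng.gen_range(10..2_000)
        } else {
            rng.gen_range(10_000..200_000)
        };
        let path = format!("/tmp/sim_upload_{i}.bin");
        let data: Vec<u8> = (0..size_kb * 1024).map(|_| rng.gen()).collect();
        fs::write(&path, &data)?;

        // Push it with the normal CLI, as a user would.
        let status = Command::new("safe").args(["files", "put", path.as_str()]).status()?;
        println!("uploaded {size_kb} KB, exit: {status}");
        fs::remove_file(&path)?;

        // Wait a "human-ish" random interval before the next upload.
        sleep(Duration::from_secs(rng.gen_range(60..1_800)));
    }
    Ok(())
}
```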
Just a quick thought: what about adding a random delay on reconnect? Something like CSMA/CD does on Ethernet.
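Something along these lines, maybe - a sketch assuming tokio and the rand crate, where try_connect stands in for the real reconnection call: an exponential window with a uniformly random delay picked inside it, like Ethernet picking a random slot after a collision.

```rust
use rand::Rng;
use tokio::time::{sleep, Duration};

// Stand-in for the real reconnection attempt; pretends to succeed on try 5.
async fn try_connect(attempt: u32) -> bool {
    attempt >= 4
}

#[tokio::main]
async fn main() {
    let base = Duration::from_millis(100);
    for attempt in 0..8u32 {
        if try_connect(attempt).await {
            println!("connected on attempt {attempt}");
            return;
        }
        // Exponential backoff window, capped at ~6.4 s.
        let window = base * 2u32.pow(attempt.min(6));
        // Pick a random delay inside the window so retries from many nodes
        // don't all land at the same instant.
        let jitter = rand::thread_rng().gen_range(0..window.as_millis() as u64);
        let delay = Duration::from_millis(jitter);
        println!("attempt {attempt} failed, backing off for {delay:?}");
        sleep(delay).await;
    }
    println!("giving up");
}
```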
