ReplicationNet [June 7 Testnet 2023] [Offline]

I’ve not tried this command but think it will do the trick:

grep -e "5a1c9e\|73b703\|6be651\|63594e\|ab8031" /tmp/safenode/safenode.log

Although really you will need to decompress all the log files and change the .log to *.
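
A minimal sketch of both approaches, assuming the logs live under /tmp/safenode/ and that rotated logs are gzip-compressed (the exact globs are my guess):

# search the live log plus any rotated plain-text logs
grep -e "5a1c9e\|73b703\|6be651\|63594e\|ab8031" /tmp/safenode/safenode.log*

# or search compressed rotations directly, no decompression needed
zgrep -e "5a1c9e\|73b703\|6be651\|63594e\|ab8031" /tmp/safenode/safenode.log*.gz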

2 Likes

Can anyone explain the shape of the memory graph for my node?
It looks suspicious; I doubt that it can be explained by “chunk names”:

memory_used_mb:
[chart: sn_mem]

total_mb_written / total_mb_read

total_mb_written:
[chart: sn_mbw]
total_mb_read:
[chart: sn_mbr]

raw data:
snmetr.zip (246.0 KB)
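
For anyone who wants to collect similar numbers for their own node, here is a rough sketch using the standard /proc interface (the pgrep lookup and single-node assumption are mine, and these figures may not match exactly how the charts above were gathered; /proc/<pid>/io must be readable by your user):

# resident memory of the safenode process, in MB
PID=$(pgrep -o safenode)
awk '/VmRSS/ {printf "memory_used_mb: %.1f\n", $2/1024}' /proc/$PID/status

# cumulative bytes read/written to disk by the process, in MB
awk '/^read_bytes|^write_bytes/ {printf "%s %.1f MB\n", $1, $2/1048576}' /proc/$PID/io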

5 Likes

I have an idea regarding the memory leak.
Here is a chart for OutgoingConnectionError:
[chart: oce]
Looks like we have a correlation with the amount of RAM used.

Probably the initial nodes started crashing, which led to the memory leak and further crashing. It is more of a guess, but worth checking.

raw data:
sn_oce.zip (9.2 KB)
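
A rough sketch of how such a chart can be derived from the logs, assuming the node logs sit under ~/.safe/node/ and that each matching line starts with the same bracketed timestamp format quoted elsewhere in this thread:

# count OutgoingConnectionError occurrences per hour across all node logs
grep -h "OutgoingConnectionError" ~/.safe/node/safenode*/safenode.log* | cut -c2-14 | sort | uniq -c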

8 Likes

zgrep "5a1c9e\|73b703\|6be651\|63594e\|ab8031" safenode.log*.gz

mine sorted itself out.

7 Likes

KISS
grep -re "5a1c9e\|73b703\|6be651\|63594e\|ab8031" .
WFM

safe@ubuntu-2gb-nbg1-1:~/.safe/node$ grep -re "5a1c9e\|73b703\|6be651\|63594e\|ab8031" .
./safenode_20/safenode.log.20230609T165646:[2023-06-09T16:50:45.720153Z TRACE sn_networking::msg] Received request with id: RequestId(40229), req: Cmd(StoreChunk { chunk: Chunk { address: ChunkAddress(63594e(01100011)..) }, payment: None })
./safenode_20/safenode.log.20230609T165646:[2023-06-09T16:50:45.724630Z TRACE sn_node::api] Handling request: Cmd(StoreChunk { chunk: Chunk { address: ChunkAddress(63594e(01100011)..) }, payment: None })
./safenode_20/safenode.log.20230609T165646:[2023-06-09T16:50:45.728729Z DEBUG sn_node::api] That's a store chunk in for :63594e(01100011)..
./safenode_20/safenode.log.20230609T165646:[2023-06-09T16:50:45.728906Z INFO safenode] Currently ignored node event ChunkStored(ChunkAddress(63594e(01100011)..))
./safenode_20/safenode.log.20230609T165646:[2023-06-09T16:50:45.742260Z TRACE sn_record_store] Wrote record to disk! filename: 63594e7024e857853eca6b68cc13ce04f16c96557af88e59495fbd78fc62e6ce
./safenode_20/safenode.log.20230609T165646:[2023-06-09T16:51:32.298346Z TRACE sn_record_store] Retrieved record from disk! filename: 63594e7024e857853eca6b68cc13ce04f16c96557af88e59495fbd78fc62e6ce
./safenode_7/safenode.log.20230609T161523:[2023-06-09T16:13:35.301860Z TRACE sn_record_store] Wrote record to disk! filename: 99b1676629f6ac7384b20afcc2226063594e2f6514febfc

So am I the baddie storing (or not) one of the missing chunks (63594e(01100011)…) ?

./safenode_20/safenode.log.20230609T165646:[2023-06-09T16:50:45.728729Z DEBUG sn_node::api] That’s a store chunk in for :63594e(01100011)…
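
One way to double-check whether that record is still on disk (the ~/.safe/node layout is taken from the grep output above):

# look for the chunk's record file under the node data directories
find ~/.safe/node -type f -name "63594e7024e857853eca6b68cc13ce04f16c96557af88e59495fbd78fc62e6ce"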

4 Likes

Think I’ll be the one to jinx it!!

Well done to @dirvine and all the team, this testnet has taken everything I had to throw at it!!

40GB of mp3’s and movies at 2GB have all succeeded, so it has surpassed anything we had on the old code.

But there is a “but” here, as I am testing on a few different connections :frowning:

Oracle Cloud ARM instance
40GB of mp3’s and the 2GB crow movie: 100% success

Retrieving speedtest.net configuration...
Testing from Oracle Cloud (xx.xx.xx.xx)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by M247 Ltd (Manchester) [240.72 km]: 7.228 ms
Testing download speed................................................................................
Download: 1524.03 Mbit/s
Testing upload speed......................................................................................................
Upload: 1239.16 Mbit/s

Virgin Media fibre broadband
mp3’s were a 100% success for what I tried (not the full 40GB), but the crow movie at 2GB was a fail :frowning:

Retrieving speedtest.net configuration...
Testing from Virgin Media (xx.xx.xxx.xx)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Structured Communications (London) [6.96 km]: 31.257 ms
Testing download speed................................................................................
Download: 364.39 Mbit/s
Testing upload speed......................................................................................................
Upload: 52.27 Mbit/s

Vodafone broadband
mp3’s were succeeding at around a 10% success rate and the crow movie was a fail

Retrieving speedtest.net configuration...
Testing from Vodafone UK (xx.xx.xx.xx)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Wildcard Networks (Newcastle upon Tyne) [399.19 km]: 32.864 ms
Testing download speed................................................................................
Download: 53.49 Mbit/s
Testing upload speed......................................................................................................
Upload: 13.07 Mbit/s

So it looks like things are very internet-connection dependent. From my side I’d like to see the ability to increase the timeout for the next iteration.

Well done to the team, you have surpassed the old code that took years to get to the point you are at now!!

17 Likes

Updated the charts to include the following:

  • chunk size on disk under the record_store folder for each file
  • total accumulated bytes for all files under record_store over time
  • larger time window since the safenode pid was spawned

Observations:

  • Seems more chunks/files were written under record_store in the past 24 hrs than in the first 24 hrs
  • Not all files under record_store are the same size, even a few hrs after they were last modified
  • Assuming the total bytes written by the process includes writing to the log file, hence the 904 MB being greater than the 294 MB (the accumulated total bytes on disk for the record_store directory; one way to compute this is sketched below)
  • Not sure what it will take for Total Bytes Read to go over 4KB :smiley:
  • Memory levels for the safenode pid have risen, but that is expected (explained earlier by the Maidsafe team)
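
For reference, a rough sketch of one way to gather per-file sizes and an accumulated total (the record_store path under ~/.safe/node is an assumption based on the default layout, and -printf requires GNU find):

# per-file chunk sizes under each node's record_store
find ~/.safe/node/safenode*/record_store -type f -printf '%s %p\n' | sort -n

# accumulated total bytes and file count across all record_store folders
find ~/.safe/node/safenode*/record_store -type f -printf '%s\n' | awk '{n++; s+=$1} END {printf "%d files, %.1f MB\n", n, s/1048576}'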

I haven’t had time to look for odd errors or other interesting messages in the logs yet.

18 Likes

Thanks chap, that’s very kind and I do believe it to be the case.

Yes, this timeout is interesting. In my book it should not exist in our code; it’s really a client app thing, if it exists at all. I will explain more:

  1. The timeout is historic and was used when the network was unstable, to allow clients to break the waiting cycle.
  2. Clients should not be downloading something that does not exist; if they do, they are search-type clients and should go a different way.
  3. Clients really should only time out on things that might not exist, e.g. checking a DBC parent transaction, but not on downloading data.
  4. Our CLI is an end-user app and has that timeout set, but we should tweak that and also pass the data download timeout on to the user. The DBC timeout should be very large.

Super helpful

Again, this is why being open is so important. We could not apply the resources needed for this kind of detail. It’s massively helpful.

We are all building Safe here, every one of us. That is how it always should have been :+1: :+1: :+1: :+1: :+1: :+1:

28 Likes

For my results I was meaning uploading. I’ll do more testing over the weekend to see how I get on with downloading on the relatively slower connections.

Now I have a good bag of uploads to play with, thanks to my Oracle instance that was able to upload it all.

Seriously impressed with how this is going since the latest reboot :slight_smile:

12 Likes

Yes, I believe we should not time out on uploading either. Basically, if you have paid (eventually) then the network must save the data; it may take time, but it will save it, so we should not time out but wait as long as it takes. We probably need to analyse the upload code path much more and take that into account in more detail, now that we seem to have found stability here, IMO.

15 Likes

It started to oscillate:

The correlation with OutgoingConnectionError is not so obvious now, but I still think it may be related to memory usage:

I wonder what the fate of nodes like 12D3KooWHs2FuFcuSHtkt1KdCAKDnXp35EDbkrcx559rp7TMrj9n is.
Will the developers share such information, or is it a secret?

@Shu it is interesting that your node missed the events mentioned above.

4 Likes

I added tracking of the record_store folder’s total file count & total bytes in real time now.

I also decided to go back to some LXC-level TCP stats, and it’s pretty fascinating just how many TCP connections are established (min / max / mean) even over the course of 90 minutes (the oscillations).

No idea if ~250 TCP connections on average in an established state is considered okay or not for the current size of testnet with its X # of total nodes… either way, what a dynamic environment! :smiley:
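
For anyone who wants to watch the same thing on their own box, a quick sketch using iproute2's ss (available on most distros; the 5-second refresh interval is arbitrary):

# count currently established TCP connections, refreshed every 5 seconds
watch -n 5 "ss -tn state established | tail -n +2 | wc -l"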

12 Likes

The amount of connections is OK, I think.
A Tor node has ~6k, an I2P node ~4k.

What does not look OK is how often connections are dropped and re-established.
Such noise also happens when downloading and uploading files with the client.
But it’s probably just not optimized yet.

7 Likes

Assuming the time scale is UTC in the images above, yes, it seems my node did not experience the rapid spiky oscillations in memory during the same time window.

I have never tried running one of those services at home, so good to know for future reference.

2 Likes

The last memory chart is from 2023-06-09T10:38:07.613266Z to 2023-06-10T05:57:14.654755Z. So the x axis is not exactly time, but the conclusion will be the same anyway, I think.

2 Likes

I’ve had a node running since nearly the start and I confirm it was a slow start for getting chunks. Then I got a few and they just dribbled in. Things really picked up about this time yesterday, and last night it went bananas! There were hundreds at 0340 UTC today, so maybe there was a big upload or a big disconnection?

I now have 734 chunks taking up 310MB.

I’m amazed at the recent progress!

9 Likes

I am still uploading mp3’s, going since early yesterday afternoon; total uploaded successfully: 329GB.


6 Likes

Today my 600MB file:

Not all chunks were retrieved, expected 1243, retrieved 1241

And my 1.2GB file:

 expected 2233, retrieved 2229

I’m gonna give them another try.

And a couple of smaller files that I uploaded on Wednesday downloaded just fine.

6 Likes

Does it mean that for some chunks, all 8 nodes holding copies crashed?
Also, such a crash would have to happen at the same time, giving the nodes no time to make fresh copies of the data.

5 Likes

Two of my nodes now have 1024 chunks. Coincidence, or is that a hard limit?

4 Likes