ReplicationNet [June 7 Testnet 2023] [Offline]

Spun up a quick container on ubuntu-22.10 locally:

   11  apt install python3-pip
   12  pip3 install flask
   13  history
root@safe-url:~# python3 safe_url.py 
 * Serving Flask app 'safe_url'
 * Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:80
 * Running on http://192.168.X.X:80
Press CTRL+C to quit
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 108-922-596
192.168.X.X - - [14/Jun/2023 17:20:31] "GET / HTTP/1.1" 200 -
192.168.X.X - - [14/Jun/2023 17:20:31] "GET /favicon.ico HTTP/1.1" 404 -

The script above is working, in theory. Hmm.

Try the following:

  • install pip and flask with sudo
  • run the task with sudo
  • change the port to something much higher than 80 and open the respective port in the security groups via AWS (see the sketch below)
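For the last suggestion, here is a minimal sketch of binding to a high, non-privileged port. It assumes the Flask app object in safe_url.py is named app and that the chosen port (8765, matching the port used later in this thread) is opened in the AWS security group:

    # safe_url.py: minimal sketch, not the actual script from this thread
    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        return "safe_url endpoint is up"

    if __name__ == "__main__":
        # Ports above 1024 don't require root, so no sudo is needed.
        app.run(host="0.0.0.0", port=8765, debug=True)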
3 Likes

I changed file permissions from 775 to 755 and then it works with sudo.

But I will try again with a high non-privileged port.
Thanks for the quick response - appreciated :+1:

EDIT: and now it's all working nicely at http://54.75.31.54:8765 with none of this sudo malarkey :slight_smile:

EDIT: Any of youse mob playing with a Hetzner node at 46.4.11.91?

x.x.x.x - - [14/Jun/2023 20:24:35] "POST / HTTP/1.1" 200 -
Traceback (most recent call last):
  File "/home/ubuntu/endpoint/venv/lib/python3.10/site-packages/flask/app.py", line 2213, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/ubuntu/endpoint/venv/lib/python3.10/site-packages/flask/app.py", line 2193, in wsgi_app
    response = self.handle_exception(e)
  File "/home/ubuntu/endpoint/venv/lib/python3.10/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/ubuntu/endpoint/venv/lib/python3.10/site-packages/flask/app.py", line 1487, in full_dispatch_request
    return self.finalize_request(rv)
  File "/home/ubuntu/endpoint/venv/lib/python3.10/site-packages/flask/app.py", line 1506, in finalize_request
    response = self.make_response(rv)
  File "/home/ubuntu/endpoint/venv/lib/python3.10/site-packages/flask/app.py", line 1801, in make_response
    raise TypeError(
TypeError: The view function for 'handle_requests' did not return a valid response. The function either returned None or ended without a return statement.
46.4.11.91 - - [14/Jun/2023 20:33:39] "HEAD / HTTP/1.1" 500 -
x.x.x.x - - [14/Jun/2023 20:34:10] "POST / HTTP/1.1" 200 -
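For reference, the TypeError above just means the handle_requests view returned None for that HEAD request. Below is a minimal sketch of a view that always returns a response; it assumes the route handler is the handle_requests function named in the traceback, and the JSON body is purely illustrative:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/", methods=["GET", "HEAD", "POST"])
    def handle_requests():
        if request.method == "POST":
            # ... process the POST body here ...
            return jsonify({"status": "ok"}), 200
        # Returning something for GET/HEAD as well avoids the
        # "did not return a valid response" TypeError seen above.
        return "", 200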
4 Likes

I think I asked this before, but I CBA searching.

Are the new URLs always exactly 65 chars?

2 Likes

If the opening post for the topic is a wiki, then people can add them to the end of the OP.

That might do as a backup.
I am deep in scripting right now.
For now, the PoC version will only handle single-file uploads if we want to (easily) capture the file upload size and md5Hash (see the sketch below).
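For what it's worth, here is a minimal sketch of capturing the size and md5Hash of a single uploaded file in Flask. The /upload route and the "file" form-field name are assumptions for illustration, not the actual PoC code:

    import hashlib

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/upload", methods=["POST"])
    def upload():
        # Single-file PoC: read the one uploaded file fully into memory.
        f = request.files["file"]          # "file" is a hypothetical field name
        data = f.read()
        return jsonify({
            "filename": f.filename,
            "size_bytes": len(data),
            "md5Hash": hashlib.md5(data).hexdigest(),
        })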

1 Like

@qi_ma - I will conclude here that this is most likely attributable to my personal setup of passing mountpoints through from the host to an LXC; certain stats are simply not available inside a container. :frowning:

I attempted to allow all cgroup controllers and to switch from unprivileged to privileged containers, but in the end it did not help. I tried to capture the same type of stats, in a similar way to how safenode does, but for programs like dd running against these mountpoints, and I was not successful. It seems external tools like iostat cannot recognize a non-block-device mountpoint inside the LXC, though tools like stat, df, du, etc. work just fine.

So, a note for others: if you end up running safenode inside a container with external mountpoints presented or passed through to the container that do not present themselves as a raw block device (such as /dev/sdX), you may experience the same issue.

I didn’t have any further time today to figure out whether the total_mb_written values were accurate for the mountpoint, or why that metric increased while the read metric stayed flat. Oh well.

In summary, the 4KB total_mb_read metric issue and the related concern are closed from my end. :smiley:

3 Likes

Really appreciate your digging through this and clearing up this suspicious issue.
Great work!

9 Likes

@qi_ma - Seems I missed these request-received and response-sent messages along with their RequestId, which might be useful to track and analyze in the dashboard for future triaging and for confirming that the response is indeed sent back for GETs, PUTs, etc.:

[2023-06-07T20:57:13.029193Z TRACE sn_networking::msg] Received request with id: RequestId(69653), req: Cmd(StoreChunk { chunk: Chunk { address: ChunkAddress(84ea47(10000100)..) }, payment: None })
[2023-06-07T20:57:13.033455Z TRACE sn_networking::msg] ResponseSent for request_id: RequestId(69653) and peer: PeerId("12D3KooWC6wddzjx5YVWVJqUzkkFCcDSdTJywsXXittUCL7Z9EwR")

I didn’t have time yet to do so, but I will try to back-fill the dashboard with some more stats on the above from this past ReplicationNet testnet. Maybe even get an idea of average response time (ResponseSent timestamp minus the Received-request timestamp), etc.

Also, just curious: without having fully parsed, charted, or analyzed the data below yet, why is success + failure not equal to requests for every Query Task - GetClosestPeerOk? Is that expected?

QueryStats { requests: 45, success: 33, failure: 0, start: Some(Instant { tv_sec: 4524476, tv_nsec: 255471926 }), end: Some(Instant { tv_sec: 4524478, tv_nsec: 625056966 }) } - ProgressStep { count: 1, last: true }

QueryStats { requests: 39, success: 24, failure: 1, start: Some(Instant { tv_sec: 4517272, tv_nsec: 30325389 }), end: Some(Instant { tv_sec: 4517275, tv_nsec: 625297993 }) } - ProgressStep { count: 1, last: true }

Ideally, I'd like to maintain a snapshot of the data from one or more nodes on my end, along with the charts per testnet, compare them between iterations, and see what trends the charts reveal, and whether they reflect the expected outcome or desired effect based on my limited understanding of the changelist of bug fixes and ERs and the summary provided in the weekly updates between each iteration. :smiley:

8 Likes

Updates to the dashboard:

  • Added REQUEST_RECEIVED - based on message from logs (random):
    • Received request with id: RequestId(1226680), req: Cmd(StoreChunk)
    • Received request with id: RequestId(1224274), req: Cmd(Replicate)
  • Added REQUEST_RESPONSE_SENT - based on message from logs (random):
    • ResponseSent for request_id: RequestId(1210154)
  • Captured the RequestId for the above sample messages
  • Calculated the duration of all Request vs Response log entries for Requests associated with the StoreChunk cmd, based on the log timestamps of a matching pair of RequestIds, if found (see the sketch after this list)
  • Added two panels to the dashboard showing Duration (y-axis) vs Request ID # (x-axis) for a successfully received request & sent response for Request IDs dealing with StoreChunk, where one panel shows the distribution for durations < 100 ms and the other for durations > 100 ms
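As a rough illustration of that duration calculation, the sketch below pairs "Received request" lines with "ResponseSent" lines by RequestId and buckets the StoreChunk durations at 100 ms. The regexes and the log path are assumptions based on the sample lines quoted above; this is not the dashboard's actual pipeline:

    import re
    from collections import defaultdict
    from datetime import datetime

    # Patterns assume the log formats of the sample lines quoted above.
    TS_RE   = re.compile(r"^\[(\S+)\s")
    RECV_RE = re.compile(r"Received request with id: RequestId\((\d+)\), req: Cmd\((\w+)")
    SENT_RE = re.compile(r"ResponseSent for request_id: RequestId\((\d+)\)")

    received, durations_ms = {}, defaultdict(list)

    with open("safe.log") as log:                  # log path is an assumption
        for line in log:
            ts_m = TS_RE.match(line)
            if not ts_m:
                continue
            ts = datetime.fromisoformat(ts_m.group(1).replace("Z", "+00:00"))
            if m := RECV_RE.search(line):
                received[m.group(1)] = (ts, m.group(2))      # RequestId -> (time, cmd)
            elif (m := SENT_RE.search(line)) and m.group(1) in received:
                t0, cmd = received.pop(m.group(1))
                durations_ms[cmd].append((ts - t0).total_seconds() * 1000)

    store = durations_ms.get("StoreChunk", [])
    print(f"StoreChunk pairs: {len(store)}, "
          f"under 100 ms: {sum(d < 100 for d in store)}, "
          f"at/over 100 ms: {sum(d >= 100 for d in store)}")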

A few more observations:

  • Seems all REQUEST_RECEIVED:STORE_CHUNK messages had an equivalent REQUEST_RESPONSE_SENT message :white_check_mark:
  • On my node, at least 90% of the StoreChunk workflow on a per-Request-ID basis took on average < 5 ms to complete. Seems like a good initial baseline or reference number to record here as MaidSafe continues to iterate on the feature set and performance gains with each public testnet, :white_check_mark:
  • On my node, about 10% of the StoreChunk workflow took anywhere from about 106 ms up to 12,000 ms to complete on a per-Request-ID basis. This could be due to multiple reasons, but I didn’t have time to dig into this further yet. The spread between the min and max duration was pretty wide here, hmm? :thinking:
  • Received Request IDs dealing with the Replicate cmd seem not to have a successful sent counterpart in the logs, on my node at least. A quick scan here seems to indicate mostly edge-case failures when looking for 2 or more log entries associated with the same Request ID, where the originating Request ID was associated with a ‘Replicate’ cmd. This might have already been taken care of by recent PRs from the MaidSafe team.

Additional Notes:

  • Currently, parsing through the logs, there seem to be various other failure conditions for other types of Request IDs, such as some being associated with Inbound/Outbound failure messages, though MaidSafe mentioned they have ongoing PRs with bug fixes for some of this area as part of their latest weekly update :white_check_mark:
  • I haven’t yet figured out how to easily capture the duration for edge-case failures given the current format of the logs, or at least not in one sweep or pass over the log file (one possible approach is sketched below).
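One possible single-pass approach (just a sketch under assumptions, not how the dashboard computes it) is to record the first and last timestamp seen for every RequestId, so even edge-case failures that never get a ResponseSent line still get a rough duration (last appearance minus first appearance), which also matches the MISC-panel computation described below:

    import re
    from datetime import datetime

    ID_RE = re.compile(r"RequestId\((\d+)\)")
    TS_RE = re.compile(r"^\[(\S+)\s")

    first_seen, last_seen = {}, {}

    with open("safe.log") as log:                  # log path is an assumption
        for line in log:
            ts_m, id_m = TS_RE.match(line), ID_RE.search(line)
            if not (ts_m and id_m):
                continue
            ts = datetime.fromisoformat(ts_m.group(1).replace("Z", "+00:00"))
            rid = id_m.group(1)
            first_seen.setdefault(rid, ts)         # first appearance only
            last_seen[rid] = ts                    # keeps updating to the last one

    for rid in first_seen:
        span_ms = (last_seen[rid] - first_seen[rid]).total_seconds() * 1000
        print(f"RequestId({rid}): lifetime in log ~{span_ms:.1f} ms")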

Post Updates after Initial Post:

Updates to Dashboard:

  • Re-updated the image above
  • Added overall count of Unique Request IDs statistic
  • Re-labelled title for durations for StoreChunk requests on two panels
  • Fixed up color scheme for a few panels
  • Added a MISC bucket (catch-all) of remaining Request IDs that are not associated with StoreChunk or Replicate cmds
    • Duration for this panel was computed based on the last timestamp of a given Request Id minus the first appearance of that Request Id in the logs
  • Confirmed that the unique MISC Request ID count (325) + Request IDs associated with StoreChunk (1232) + Request IDs associated with Replicate (83) do add up to the total of 1640 unique Request IDs

EDIT: Additional Updates:

  • Added REQUEST_RESPONSE_INBOUND_FAILURE - based on message from logs (random):
    • RequestResponse: InboundFailure for request_id: RequestId(260831) and peer: PeerId("12D3KooWMnkMD1Cgesv6Bz2DtpqfVjGTARYtTXuNLvQTtcKBC7wv")
  • Added REQUEST_RESPONSE_OUTBOUND_FAILURE - based on message from logs (random):
    • RequestResponse: OutboundFailure for request_id: RequestId(7) and peer: PeerId("12D3KooWJwvEWyMW87A4MK9M8E8JHe9h1bb2KoSzkm5qGMpfyTwM")
  • Added panel for PID - Duration of Failed Replicate Request IDs
  • Re-updated main image above

EDIT: Additional observations:

  • Seems all REQUEST_RECEIVED for REPLICATE cmds resulted in a response of INBOUND_FAILURE
    • Duration of failure ranged from ~0ms to 8420 milliseconds
    • The average duration where the result was an INBOUND_FAILURE message was 370 ms.
    • Wondering if there are any timeouts here, or are none expected? The spread between the min and max duration was pretty wide here, hmm? :thinking:
  • Seems all OUTBOUND_FAILURES resulted in ConnectionClosed with ResponseDropped (makes sense)
  • Not sure on the type of message classification here for certain Request_IDs (see below):
    safe-build-122:/tmp/safe_logs# cat safe.log | grep "RequestId(247)"
    [2023-06-08T20:55:31.093137Z TRACE sn_networking::cmd] Request_id is RequestId(247)
    [2023-06-08T20:55:31.322836Z WARN sn_networking::msg] RequestResponse: OutboundFailure for request_id: RequestId(247) and peer: PeerId("12D3KooWHrSRXSj7s1Ak8WN5Qyj2iXkMuGg3KkWfBmkspBRF9KLe"), with error: ConnectionClosed
    [2023-06-08T20:55:31.322854Z WARN sn_networking::event] MsgReceivedError: ReceivedResponseDropped(RequestId(247))
    
    • As opposed to the InboundFailure, where the cmd is logged earlier:
    safe-build-122:/tmp/safe_logs# cat safe.log | grep 1224274
    [2023-06-12T03:00:56.757268Z TRACE sn_networking::msg] Received request with id: RequestId(1224274), req: Cmd(Replicate { holder: NetworkAddress::PeerId([0, 36, 8, 1, 18, 32, 233, 237, 81, 223, 151, 169, 173, 143, 54, 204, 145, 169, 74, 155, 242, 60, 137, 223, 121, 127, 133, 207, 182, 239, 193, 42, 144, 93, 182, 216, 99, 125]) })
    [2023-06-12T03:00:56.757725Z WARN sn_networking::msg] RequestResponse: InboundFailure for request_id: RequestId(1224274) and peer: PeerId("12D3KooWRZX3otk1w3VfBBqgvTNJceni3muzidLr84q9FDKJ5aua"), with error: ResponseOmission
    

06/19/2023 Updates:

  • Added the ability to query the safenode RPC via the nodeinfo command for the # of connected peers on a 10-second interval (a rough sketch of such a polling loop is shown after the version info below)
  • This data is now being fed to the dashboard as well and tracked over the lifetime of the safenode pid, as I am curious whether this changes rapidly or remains constant, and if so, what the min, max, and mean # of connections are here

  • The above was safenode compiled from some of the later releases in the past 24 hrs:
    safenode cli 0.83.29
    sn_cli 0.77.31
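A hedged sketch of what such a 10-second polling loop might look like; the exact RPC client invocation and the wording of the connected-peers field are hypothetical placeholders, so substitute whatever the actual nodeinfo command prints:

    import re
    import subprocess
    import time

    NODEINFO_CMD = ["safenode_rpc_client", "127.0.0.1:12001", "info"]  # hypothetical command/port
    PEERS_RE = re.compile(r"connected peers:\s*(\d+)", re.IGNORECASE)   # hypothetical field name

    samples = []
    while True:
        out = subprocess.run(NODEINFO_CMD, capture_output=True, text=True).stdout
        if m := PEERS_RE.search(out):
            samples.append(int(m.group(1)))
            print(f"peers={samples[-1]} min={min(samples)} max={max(samples)} "
                  f"mean={sum(samples) / len(samples):.1f}")
        time.sleep(10)                             # 10-second interval, as in the post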

@qi_ma - For a testnet spun up with 20 nodes, it seems the # of connected peers on one of these nodes, for example, doesn’t remain static after an initial discovery period?

I am not sure what the expected behavior is when all 20 nodes are spun up in a single container and more than 5 minutes have elapsed without any extra load on the network (PUTs / GETs). The node seems to fluctuate between a minimum of 7 and a maximum of 19 peers, hmmm :thinking:

Further Thoughts: I am really excited for the next testnet to see a before and after comparison in many of these panels’ charts.

15 Likes

If the connection close is due to a timeout, i.e. inactivity on the connection, the connection is closed at the libp2p transport layer but remains within libp2p's kad knowledge.
There is also network-detection activity undertaken if a long period of inactivity is detected across all of a node's connections. This brings new nodes onto the detection radar, and may also cause some connections that were closed at the libp2p transport layer to be opened again.

2 Likes

I ran the bash shell script provided earlier in this topic by @loziniak to gather the quick stats, and to my surprise all but Replicating chunk were non-zero. FWIW, below is the summary of those keywords for my node, as a reference post (to supplement the existing chart image for now):

PeerAdded 5893
Sending a replication list 76
Replicating chunk 0
Chunk received for replication 233
Fetching replication 249 :white_check_mark: (same # as FETCHING_REPLICATION_ADDRESS)
Detected dead peer 908 :white_check_mark: (same # as DEAD_PEER_DETECTED)
Replicate list received from 83

  • Seems to match the count of Received request with id: RequestId(265818), req: Cmd(Replicate messages :white_check_mark:
  • However, for each of these Replicate list received from Some(PeerId("12D3KooWJwvEWyMW87A4MK9M8E8JHe9h1bb2KoSzkm5qGMpfyTwM")) messages, there is a follow-up InboundFailure message on the RequestResponse with the originating RequestId, as noted earlier in this topic.
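For anyone who prefers Python, a rough stand-in for that keyword count could look like the sketch below; this is not @loziniak's script, the keywords are simply the ones listed above, and the log path is an assumption:

    from collections import Counter

    KEYWORDS = [
        "PeerAdded",
        "Sending a replication list",
        "Replicating chunk",
        "Chunk received for replication",
        "Fetching replication",
        "Detected dead peer",
        "Replicate list received from",
    ]

    counts = Counter()
    with open("safe.log") as log:                  # log path is an assumption
        for line in log:
            for kw in KEYWORDS:
                if kw in line:
                    counts[kw] += 1

    for kw in KEYWORDS:
        print(f"{kw}: {counts[kw]}")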

I am trying to wrap up the summary / conclusion from this past testnet by adding the above keywords to the dashboard as well, if possible, so as to be as ready as possible for the next testnet. :smiley:

Additional Notes:

  • The keywords above marked with a check mark have already been incorporated into the dashboard.
  • The dashboard now supports the ability to switch and filter metrics between different host / LXC container combinations, in case the user ends up running multiple nodes in their environment.

EDIT: Updates after the initial post:

Updates to dashboard:

  • Added all the extra keywords that the MaidSafe team wanted us to look for in this past testnet :white_check_mark:
  • Certain log lines don’t contain a submodule/component in the format [sn_networking::event] and instead are just [sn_record_store], so I added ‘N/A’ for that component to standardize it
  • Added ability to filter on module & sub-module components for the Logs & Errors panel
  • Added ability to filter on any particular Peer Id and get all log / error interactions of that Peer Id from one’s respective node
  • Picked up more coverage on unique Chunk Addresses (overall aggregation)

Overall timeline of this node during the testnet period:

Sample image of a filter of the Peer Id of the ‘Dead Peer’ that was mentioned earlier by Vort:

All set for the next testnet … :joy:

7 Likes