ReplicationNet [June 7 Testnet 2023] [Offline]

Okay, bringing this down. It’s been very useful, and some leads on the increasing memory usage are being investigated! Thanks everyone for participating, more great stuff! :muscle:

18 Likes

Updates to the dashboard:

  • Breakdown of some of the storage workflow, based on an initial investigation of the logs

  • For instance, I split this up into 3 generic categories (see the sketch after this list):

    • PUT_REQUESTS
      • Example message from logs (random):
        • PUT request for Record key: Key(b"^\xa2X\xa9\x15n\x13\n\x12u\xffD\xfe\xb2\x94\xdc\x96\xea\x18\x96\xae\x97\xae\x8e!\xb64\x9b\x8f\x9c$\x01")
    • STORE_CHUNK
      • Example message from logs (random):
        • Handling request: Cmd(StoreChunk { chunk: Chunk { address: ChunkAddress(5391ec(01010011)..) }, payment: None })
    • CHUNK_WRITTEN
      • Example message from logs (random):
        • Wrote record to disk! filename: ffb4575e4b8d36b79182cd3f9745b2719c86af408409f233c8c763a37d1b2cf9
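
To make the log-to-category ETL concrete, here is a minimal Python sketch of how lines could be bucketed; the regex patterns are assumptions inferred from the sample messages above, not the authoritative safenode log format:

import re
from collections import Counter

# Patterns inferred from the sample log messages above; assumptions,
# not the authoritative safenode log format.
PATTERNS = {
    "PUT_REQUESTS": re.compile(r"PUT request for Record key"),
    "STORE_CHUNK": re.compile(r"Handling request: Cmd\(StoreChunk"),
    "CHUNK_WRITTEN": re.compile(r"Wrote record to disk!"),
}

def classify(log_path):
    """Count occurrences of each message category in a safenode log file."""
    counts = Counter()
    with open(log_path, errors="replace") as f:
        for line in f:
            for category, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[category] += 1
    return counts

# e.g. print(classify("safenode.log"))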

Based on the above changes, here are some stats from the charts:

  • The timeline was zoomed in to the period when the node had 809 files on disk, going up to the 1024-file limit
  • Number of files created during this time, based on an external disk/file watcher: 215
  • Total # of PUT_REQUESTS messages logged: 3585
  • Total # of STORE_CHUNK messages logged: 455
  • Total # of CHUNK_WRITTEN messages were: 215
  • Total # of Chunk Addresses detected: 455
  • Total # of Unique Chunk Addresses detected: 455

Observations:

  • Number of CHUNK_WRITTEN messages matches the increase in the # of files on disk: :white_check_mark:

  • PUT_REQUESTS vs STORE_CHUNK ratio is ≈7.9x (3585 / 455) :white_check_mark:

  • It was mentioned earlier that 8 PUT requests would be received but, after filtering duplicates, they would likely result in 1 chunk file

  • How do we go from 3585 PUT_REQUESTS → 455 STORE_CHUNK handling messages logged, comprising exactly 455 unique chunk addresses, to just +215 unique files created on disk? :thinking: Is this because the other (455 - 215) = 240 files were previously stored at some earlier time? (See the arithmetic check after this list.)

  • Also, the vertical blue line on the 2nd panel marks a 2-hour period when no files were written to disk, yet PUT_REQUESTS still generated STORE_CHUNK handling messages that resulted in 0 files written to disk, hmm? :thinking:

  • If there is de-duplication happening, is it at two different layers / stages?
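
As a plain-arithmetic check of the funnel numbers above (values taken straight from the stats list):

put_requests, store_chunk, chunk_written = 3585, 455, 215

print(put_requests / store_chunk)   # 7.879..., i.e. the roughly 8x fan-out mentioned above
print(store_chunk - chunk_written)  # 240 STORE_CHUNKs that produced no new file on disk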

I am obviously missing some concepts in my understanding here, :smiley: . My terminology above might be incorrect too, so I'm looking forward to getting corrected, and to further understanding the magic happening under the hood.

Post initial update:

  • Reloaded all the data again, and refreshed the images below:

  • Overall view of the node stats for the timeline above:


    PUT_REQUESTS (3713) → STORE_CHUNK (474) → CHUNK_WRITTEN (215)

  • Overall view for the lifetime of the testnet:


    PUT_REQUESTS (9229) → STORE_CHUNKS (1232) → CHUNK_WRITTEN (1024)

    • Note:
      • Ignore the `SAFE PID - Record Store - File Count` panel (the feature was added later), so it does not show complete data for the given time window.
      • Seems on both timelines, PUT_REQUESTS were about 8x compared to STORE_CHUNK
      • Looking at the larger overall timeline, the #s for PUT_REQUESTS, STORE_CHUNKS, and CHUNK_WRITTEN seem to make more sense now.
      • There are still slight discrepancies: the unique chunk addresses detected don't add up to 1024, and STORE_CHUNK > 1024, but that could very well be on my end, in my current ETL process.
17 Likes

What packages are you using to create these amazing graphs?
I want to play with this on home testnets so I can get some familiarity for the next testnet.

9 Likes

Graphs are generated using open source Grafana.

6 Likes

Great work @Shu and I’m finding your dissection of log messages interesting as well as helpful for vdash. I haven’t had time to do much of that myself.

Yesterday I decided to use these messages for my GET and PUT metrics:

"Retrieved record from disk"
"Wrote record to disk"

As always I’m never sure how accurate my choices are :man_shrugging:
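
If anyone wants to tally those same two messages themselves, a quick sketch (assuming plain-text safenode logs like the ones quoted in this thread):

# Tally of the two messages above across one log file.
gets = puts = 0
with open("safenode.log", errors="replace") as f:
    for line in f:
        if "Retrieved record from disk" in line:
            gets += 1
        elif "Wrote record to disk" in line:
            puts += 1
print(f"GETs: {gets}, PUTs: {puts}")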

I see PRs coming with messages relating to DBC transfers and getting from a faucet but haven’t any thoughts yet on what would be useful to display in vdash.

11 Likes

I think all the above questions relate to where the de-duplication gets carried out:
it is only carried out within the record_storage::put function, i.e. after the Put Request and Store Chunk logs, but before the final creation of the chunk file.
Hence, it’s expected that you will see more Store_Chunk logs than the actual number of chunk files created.
And if the same data gets uploaded twice, the second upload will trigger all those Put Request & Store_Chunk logs as well, but won’t lead to any real chunk files being created.
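
If it helps to visualize, here is a rough Python sketch of the ordering described above; the function and path names are illustrative, not the actual record_storage API (which is Rust):

from pathlib import Path

STORAGE_DIR = Path("record_store")  # illustrative path, not the real on-disk layout

def put(key_hex, value):
    """Illustrative ordering only; returns True when a new chunk file is written."""
    print(f"PUT request for Record key: {key_hex}")      # logged for every request
    print("Handling request: Cmd(StoreChunk ...)")       # logged for every request

    path = STORAGE_DIR / key_hex
    if path.exists():
        # De-duplication point: a repeat upload stops here, producing
        # no new file and no "Wrote record to disk!" log.
        return False

    STORAGE_DIR.mkdir(exist_ok=True)
    path.write_bytes(value)
    print(f"Wrote record to disk! filename: {key_hex}")  # only for genuinely new chunks
    return True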

11 Likes

Is the memory issue fixed? Looks like it.
nextnet is going to be :muscle:

11 Likes

There is a comment added there.

2 Likes

Well, I have to say that PR is mainly just to prove where the OOM issue comes from.
It’s a quick possible patch but not a fully proper fix.
We need some further discussion on it, as shown in the PR comment.

11 Likes

Thanks for confirming! Yesterday, looking at the larger timeline vs. the smaller timeframe, it led me to the same conclusion.

9 Likes

I think that makes sense for aggregated unique PUTs, but from what I can tell, for the PUT side of the workflow, those would indicate only new unique PUT requests that wrote the record files.

It depends what the dashboard or the statistic is trying to answer; even a STORE_CHUNK where the file was previously written is a valid request, originated by 1 or more PUT_REQUESTS and resolved by the network, even though it didn’t yield a newly created file on disk.

To some, it’s still a successful PUT request from the client’s perspective.


I haven’t had time to add in metrics for GETs and understand more of that part of the workflow yet, but on an initial scan, the node did log Retrieved record from disk more than 250 times, which still makes me wonder why total_mb_read in the JSON created by the metrics component was only at 4 KB, :smiley: .

More to investigate a bit later.


All of this is a work in progress, and we are still learning to pick the right labels, in terms of the right choice of words, for the classifications, :smiley: .


Post Updates after Initial Post:

Updates to Dashboard:

  • Exposed, in the panels above, the different modules / crates within the safe_network repository that were found within the log files for different message types
  • Re-organized the panel positions for hopefully better readability
  • Added CHUNK_READ - based on message from logs (random):
    • Retrieved record from disk
  • Added GET_REQUEST - based on message from logs (random):
    • GET request for Record key: Key(b"X\x1e\x1c\xa5\x94\x08\xbd\xdd\xf7o\x03C\xd0~\x03(/\xdf]\t\x9bh~\xc7\xc0\xcbp!\x01*\xdf+")

Further Observations:

  • There were 690 GET_REQUESTs, but only 254 actual records fetched from disk, hmm.

    • Is there de-duplication on GET requests too?
      • I haven’t had time to extract the record keys to see if they were all unique or not, etc.
  • With 254 records fetched from disk, I would think this would yield > 4KB of total_mb_read in the safenode logging metrics component

    • The 4 KB seems to have been incremented, or set to this final value, at a very early phase in the timeline, and remained the same for the duration of the entire timeline; it never really incremented, it seems, despite the additional CHUNK_READs spread throughout the overall timeline
    • Maybe the 4 KB is really monitoring raw underlying disk bytes read, and if these record files were still in the RAM of the node’s OS (not in the safenode pid), it may not have been a hard I/O request to the underlying disk subsystem, but more of a cache hit by the OS when reading it, and therefore not logged by the safenode metrics component?
9 Likes

No, there is not.
The reason there are so many unsuccessful GET requests is just that libp2p carries out extra work when looking for data.
i.e. it may ask around K-Value (20) nodes regarding a record, in addition to CLOSE_GROUP_SIZE (8).
On top of this, there are also requests for non-existent data.
Hence, in total, the success rate of GET requests will be lower than 8/20, which roughly explains the number of 254/690.
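
As a quick check of that bound against the observed numbers (the constants are the ones named above):

print(8 / 20)     # 0.4, the CLOSE_GROUP_SIZE / K-Value upper bound described above
print(254 / 690)  # 0.368..., the observed hit rate, below that bound as expected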

Yeah, the logs from other nodes show that total_mb_read increases and reflects the traffic involved.
Meanwhile, this 4 KB reading does seem quite strange.

Could be, but I think that should rarely happen?
It’s almost impossible that all those 254 chunks still sit in the cache for so long.

6 Likes

Interesting. So if the request for data eventually got read from disk, it’s strongly likely the data finally arrived at its destination, whoever was asking for that chunk?

I guess I am a little concerned about my node setup, and whether the GETs are truly relaying the data back to the end consumer or not, since total_mb_read is near 0.

I may need to post a safe files upload URL on the next testnet topic, and confirm whether others are able to download the file etc., :smiley: .

  • Seems the problem is with my node, and not in general.
  • Hmm, I will need to get to the bottom of this now, haha.
  • I do run an IDS (intrusion detection system) on my routers, and I don’t think it was blocking any of the safenode traffic, but going forward I ended up whitelisting the internal IP of the safenode LXC within the IDS configuration files, in case it was blocking network traffic.

Few other notes on this setup:

  • The LXC container was given 24GB ram.
  • The entire record_store folder was 422 MB in size.
  • The path provided to the safenode pid for the record_store was remote network storage mounted locally, though that should hopefully not matter here.

I will be curious what ends up happening here on the next community testnet, fingers crossed.

3 Likes

Just thinking out loud here, could we get a page/topic on here that would allow all testnet participants to post their upload URLs?
Could we (temporarily) add code to safe files upload where it automatically posts the upload URL to that topic?
Possibly restrict it to priv level 1 to avoid spam?

Suggested as a temporary measure only until we get a better understanding of what exactly is happening.

3 Likes

This sounds like a good idea, or give people access to a temporary live Excel-like sheet URL with modify access, with a few columns like:

forumUser | safeURL | timestampUTC | md5Hash (optional) | descriptionContents (optional) |

But have two sheets: one for when the original content was uploaded, and one where users record their errors if the md5 hash doesn’t match or a full download didn’t happen. It would be a live running summary of how the testnet went, and of which safe addresses created problems, and when.

It would help with organization when reading the running commentary and multiple folks’ attempts at downloads / uploads.

It might be abusive to this forum’s APIs, as it could easily be abused, especially if wrapped in scripts that are benchmarking or uploading 1000s of files independently. In my opinion, adding a non-core dependency to the safe files upload binary doesn’t seem like an efficient use of Maidsafe’s developers’ time.

However, if restricted to priv level 1, it would have to be a community-driven effort wrapped around safe files upload output, possibly directly updating some sort of table, CSV, or live Excel-like sheet, as opposed to new posts each time. Hmm.

4 Likes

Yes, that would need to be looked at very carefully.
I’m just flinging this out here for brainstorming.

This is likely much simpler - and could be hosted by any of us with a cloud instance. Access needs to be discussed, but I don’t see that being a showstopper - thoughts?

3 Likes

It’s simple if we can tap into existing services that provide such a thing as-is.


Also, if we want more programmatic access to updating such a table, it might be simple to set up a very quick endpoint that receives HTTP GET/PUT/POST in a CSV format, with minimal safety checks that the required CSV fields were provided. So after you do your safe files upload, you call HTTP GET/PUT/POST with the data supplied in the URL or the body (wrapper scripts from the community can assist here too), and it simply appends the data to a CSV file on the back-end server.

The link to that file is always available as read-only HTTP URL to all audiences.

In addition, if this HTTP URL contains the safe URL data, more automation can be built into community scripts so folks can programmatically attempt all downloads from this CSV file, even as it gains more rows. That way the entire community can re-validate all of the uploads in a controlled test case, and failures on downloading any specific rows are sent back to the HTTP server as feedback, as another line item (the server simply maintains two sets of CSV files: upload data, and feedback/errors on the downloads). We may need a column for the total size of the payload, and possibly a filter on the maximum size per row item, provided as an input by the user, if we want folks to attempt to download as many of the row items as possible from the CSV summary file provided by the server.

Granted, all of this could be abused in the early days, and wrong payloads or inputs could still filter into the server, but it could be a work in progress. One could also substitute the CSV with SQLite or some standalone, easy-to-use DB, etc.

It could be hosted on a small container anywhere. After the testnet is offline, both files can be attached back to the relevant topic on this forum for future reference.
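
As a sketch of what a community wrapper around safe files upload output could look like (the endpoint URL is hypothetical, and the columns reuse the ones suggested earlier in the thread):

import csv, io, datetime
import requests  # third-party; pip install requests

ENDPOINT = "http://example.org/safe-urls"  # hypothetical endpoint, purely for illustration

def post_upload(forum_user, safe_url, md5_hash="", description=""):
    """Append one row, using the columns suggested earlier in the thread."""
    buf = io.StringIO()
    csv.writer(buf).writerow([
        forum_user,
        safe_url,
        datetime.datetime.utcnow().isoformat() + "Z",
        md5_hash,
        description,
    ])
    return requests.post(ENDPOINT, data=buf.getvalue())

# e.g. post_upload("shu", "safe://...", md5_hash="d41d8cd9...")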

There is no end to improvements, but one has to start somewhere if we want the community to be even more involved, and to become even more efficient at systematic testing of the overall network on public testnets as a collective group, beyond the current individual ad hoc tests that individuals perform (which are great!, but we can do more here).

Note: some may argue that input/output against the network is an independent action by the safe client, i.e. not dependent on other individuals, so conducting uploads/downloads repeatedly as an individual may be enough.

I would argue that in the early days, due to the distribution of data, replication time, hidden timeouts, and other factors (churn too), if more folks attempt uploads, as well as downloads of the data posted by others (systematically/programmatically), more and more nodes will be engaged in GETs and PUTs with more concurrency and load, and edge cases should be discovered faster, whether on the safenode or safe client side.

Maybe I am being too optimistic, but yeah, open to the community for discussion / possibly as a separate topic?

2 Likes

Well this is frustrating…

on a fresh updated AWS t2.micro


(venv) ubuntu@ip-172-31-45-225:~/endpoint$ $(which python) ./safe_url_endpoint.py 
 * Serving Flask app 'safe_url_endpoint'
 * Debug mode: on
Permission denied
(venv) ubuntu@ip-172-31-45-225:~/endpoint$ ls -l safe_url_endpoint.py 
-rwxrwxr-x 1 ubuntu ubuntu 696 Jun 14 17:01 safe_url_endpoint.py

safe_url_endpoint.py listing

from flask import Flask, request

app = Flask(__name__)


app.debug = True


@app.route('/', methods=['GET', 'PUT', 'POST'])
def handle_requests():
    if request.method == 'GET':
        return 'Received GET request'
    elif request.method == 'PUT':
        return 'Received PUT request'
    elif request.method == 'POST':
        csv_data = request.get_data(as_text=True)
        # Perform minimal safety checks on CSV data
        if 'field1' not in csv_data or 'field2' not in csv_data:
            return 'Missing required fields', 400
        # Process the CSV data here
        return 'Received POST request with CSV data'

if __name__ == '__main__':
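    # NOTE: binding to a port below 1024 (like 80) normally requires elevated
    # privileges on Linux, which is a likely cause of the 'Permission denied'
    # above; a high port such as 8080 usually avoids it.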
    app.run(host='0.0.0.0', port=80)

Running as sudo makes no difference. I am specifically targeting the correct python3 executable, and ChatGPT talks shite.

Anybody got a clue?

1 Like

You trying to get ChatGPT to do your homework for you? :slight_smile: