Bug found in the client - Effects serious for certain workflows

Not sure which dev is best to tag here @joshuef

I see that the client's chunk_artifacts now names each chunk directory after the file, using a hash of the path/filename. I worked this out when I noticed the bug.

tl;dr: if the file contents change, the uploader does not generate new chunks but reuses the ones from a previous upload of that file, so the new version is never uploaded.

More research details
There is a bug in the client (safe files upload testfile)

  • Tested on a local network with SAFE_PEERS set to local network peer
    and SN_LOG=v
  • The filename was the same but the contents were changed
  • There were 3 versions of the testfile. Each consisted of one line of 32 chars
    terminated with a normal Unix newline character
  • wallet had no funds
  • The chunk directory under chunk_artifacts is not removed by the client uploader
    once the upload completes

Below are the 3 attempts made without touching the client or chunk_artifacts directories.
Below those are the reattempts of contents 2 & 3, with the chunk_artifacts directory removed prior to each attempt.

Conclusion
If the filename is the same then the client reuses the chunk_artifacts directory,
which results in the previous chunks being uploaded again without considering
that the contents could have changed
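
To make the conclusion concrete, here is a minimal Python sketch of the behaviour as I understand it from the outside. It is not the client's code: `artifacts_dir_for` and `chunk_file` are hypothetical names, the "chunking" is just a stand-in that records a digest of the contents, and the use of SHA-256 for the directory name is an assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def artifacts_dir_for(root: Path, file_path: Path) -> Path:
    """Hypothetical: name the chunk directory after a hash of the path only."""
    key = hashlib.sha256(str(file_path.resolve()).encode()).hexdigest()
    return root / key


def chunk_file(root: Path, file_path: Path) -> str:
    """Stand-in for chunking: returns a digest describing what would be uploaded."""
    cache = artifacts_dir_for(root, file_path)
    marker = cache / "metadata"
    if marker.exists():
        # The problematic shortcut: the directory exists, so it is reused
        # without checking whether the file contents have changed.
        return marker.read_text()
    cache.mkdir(parents=True)
    digest = hashlib.sha256(file_path.read_bytes()).hexdigest()
    marker.write_text(digest)
    return digest


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "chunk_artifacts"
    testfile = Path(tmp) / "testfile"

    testfile.write_text("c4ca4238a0b923820dcc509a6f75849b\n")
    first = chunk_file(root, testfile)

    # Same filename, different contents, artifacts directory left in place.
    testfile.write_text("c81e728d9d4c2f636f067f89cc14862c\n")
    second = chunk_file(root, testfile)

    stale = first == second
    print("stale chunks reused:", stale)
```

In this model the second call never looks at the file, which matches what attempts 2 and 3 show.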

Consequences
Those who save their documents multiple times while
working on them will lose all subsequent changes to their files/documents

This means it is useless for workers who are updating files/documents/projects
by simply saving the files.

Workings & Findings

Attempt 1 (xor/hash correct) client directory empty

File contents: c4ca4238a0b923820dcc509a6f75849b

The 3 xor/hash of the 3 chunks
[2024-08-30T05:38:44.093807Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(593be5 - f1b33f5b88515c636dce94623d1605e0b7b672d37f8ca4e97a009547519a0ffe)
[2024-08-30T05:38:44.094082Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(c71aa7 - 4d6b9b8ced36d35f21b77a4ad764aa05a49279e461d1872277da786207d73171)
[2024-08-30T05:38:44.094163Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(7c8d07 - c0c6887d09b966eefa4f9d6d425402bc3539536aca3787897676ebe778400a51)

ls listing of the chunk_artifacts directory
autonomi2@localhost:~/tstclient> ls -l ~/.local/share/safe/client/chunk*
total 4
drwxr-xr-x 2 autonomi2 autonomi2 4096 Aug 30 15:38 5eb77fd8260350333ead4408ab383cb4d0cf1a1d03a0bdd2e5e24ae17f12b667
autonomi2@localhost:~/tstclient> ls -l ~/.local/share/safe/client/chunk*/5eb*
total 16
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:38 593be5a41c64879897b1bd3fd08def3f6183bfad29b2b4b85e1a8beca417ac05
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:38 7c8d0744708ac91f92b6be14cb847ab1e9b3fa2f31a52b7fb55acd182e7c32f0
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:38 c71aa7018fecc8764a02ad3db249e0728a2c7c00ecfeb52c3f1f8acd5e166960
-rw-r--r-- 1 autonomi2 autonomi2 383 Aug 30 15:38 metadata

Attempt 2

  • Filename same as attempt 1
  • Contents different to attempt 1

File contents: c81e728d9d4c2f636f067f89cc14862c

The 3 xor/hash of the 3 chunks
[2024-08-30T05:41:14.743079Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(593be5 - f1b33f5b88515c636dce94623d1605e0b7b672d37f8ca4e97a009547519a0ffe)
[2024-08-30T05:41:14.743913Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(c71aa7 - 4d6b9b8ced36d35f21b77a4ad764aa05a49279e461d1872277da786207d73171)
[2024-08-30T05:41:14.743953Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(7c8d07 - c0c6887d09b966eefa4f9d6d425402bc3539536aca3787897676ebe778400a51)

ls listing of the chunk_artifacts directory
autonomi2@localhost:~/tstclient> ls -l ~/.local/share/safe/client/chunk*
total 4
drwxr-xr-x 2 autonomi2 autonomi2 4096 Aug 30 15:38 5eb77fd8260350333ead4408ab383cb4d0cf1a1d03a0bdd2e5e24ae17f12b667
autonomi2@localhost:~/tstclient> ls -l ~/.local/share/safe/client/chunk*/5eb*
total 16
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:38 593be5a41c64879897b1bd3fd08def3f6183bfad29b2b4b85e1a8beca417ac05
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:38 7c8d0744708ac91f92b6be14cb847ab1e9b3fa2f31a52b7fb55acd182e7c32f0
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:38 c71aa7018fecc8764a02ad3db249e0728a2c7c00ecfeb52c3f1f8acd5e166960
-rw-r--r-- 1 autonomi2 autonomi2 383 Aug 30 15:38 metadata

Attempt 3

  • Filename same as attempt 1
  • Contents different to attempts 1 & 2

File contents: eccbc87e4b5ce2fe28308fd9f2a7baf3

The 3 xor/hash of the 3 chunks
[2024-08-30T05:43:13.619612Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(c71aa7 - 4d6b9b8ced36d35f21b77a4ad764aa05a49279e461d1872277da786207d73171)
[2024-08-30T05:43:13.619926Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(593be5 - f1b33f5b88515c636dce94623d1605e0b7b672d37f8ca4e97a009547519a0ffe)
[2024-08-30T05:43:13.619974Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(7c8d07 - c0c6887d09b966eefa4f9d6d425402bc3539536aca3787897676ebe778400a51)

ls listing of the chunk_artifacts directory
autonomi2@localhost:~/tstclient> ls -l ~/.local/share/safe/client/chunk*
total 4
drwxr-xr-x 2 autonomi2 autonomi2 4096 Aug 30 15:38 5eb77fd8260350333ead4408ab383cb4d0cf1a1d03a0bdd2e5e24ae17f12b667
autonomi2@localhost:~/tstclient> ls -l ~/.local/share/safe/client/chunk*/5eb*
total 16
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:38 593be5a41c64879897b1bd3fd08def3f6183bfad29b2b4b85e1a8beca417ac05
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:38 7c8d0744708ac91f92b6be14cb847ab1e9b3fa2f31a52b7fb55acd182e7c32f0
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:38 c71aa7018fecc8764a02ad3db249e0728a2c7c00ecfeb52c3f1f8acd5e166960
-rw-r--r-- 1 autonomi2 autonomi2 383 Aug 30 15:38 metadata
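
The log lines from the attempts above can also be compared mechanically rather than by eye. This is a small Python sketch, assuming only the `ChunkAddress(... - ...)` shape visible in the SN_LOG output; `chunk_addresses` is a helper name made up for this example. It treats the addresses as a set because the log order differs between runs.

```python
import re

# Matches the two hex values printed inside ChunkAddress(...) in the log lines.
ADDRESS = re.compile(r"ChunkAddress\(([0-9a-f]+) - ([0-9a-f]+)\)")


def chunk_addresses(log_text: str) -> set:
    """Return the set of (first, second) hex pairs found in the log text."""
    return set(ADDRESS.findall(log_text))


attempt_1 = """
[2024-08-30T05:38:44.093807Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(593be5 - f1b33f5b88515c636dce94623d1605e0b7b672d37f8ca4e97a009547519a0ffe)
[2024-08-30T05:38:44.094082Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(c71aa7 - 4d6b9b8ced36d35f21b77a4ad764aa05a49279e461d1872277da786207d73171)
[2024-08-30T05:38:44.094163Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(7c8d07 - c0c6887d09b966eefa4f9d6d425402bc3539536aca3787897676ebe778400a51)
"""

attempt_3 = """
[2024-08-30T05:43:13.619612Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(c71aa7 - 4d6b9b8ced36d35f21b77a4ad764aa05a49279e461d1872277da786207d73171)
[2024-08-30T05:43:13.619926Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(593be5 - f1b33f5b88515c636dce94623d1605e0b7b672d37f8ca4e97a009547519a0ffe)
[2024-08-30T05:43:13.619974Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(7c8d07 - c0c6887d09b966eefa4f9d6d425402bc3539536aca3787897676ebe778400a51)
"""

same_chunks = chunk_addresses(attempt_1) == chunk_addresses(attempt_3)
print("attempt 1 and attempt 3 request the same chunks:", same_chunks)
```

Feeding in the log of a run made after removing the directory would give a different set.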

Attempt 4

  • Filename same as attempt 1
  • Contents same as attempt 2
  • Removed the contents of the client directory

File contents: c81e728d9d4c2f636f067f89cc14862c

Directory removal command
autonomi2@localhost:~/tstclient> rm -r ~/.local/share/safe/client/*

The 3 xor/hash of the 3 chunks
[2024-08-30T05:46:38.673352Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(5a84fa - a0b0d2bf8fcd49b2ad6cd2bdb51b2a08bed5fc71b08c1db141657f46d5c0a06a)
[2024-08-30T05:46:38.679458Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(21c670 - ac199813d74c8754ba69857bddadd9c252f41ec9a718f17cca67c3d266aa6fe4)
[2024-08-30T05:46:38.679675Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(3e1683 - 65a4f55b64de7e13be225ee6a492c10753a0991ab4876ceef8cbed5cc06fd7cb)

ls listing of the chunk_artifacts directory
autonomi2@localhost:~/tstclient> ls -l ~/.local/share/safe/client/chunk*
total 4
drwxr-xr-x 2 autonomi2 autonomi2 4096 Aug 30 15:46 5eb77fd8260350333ead4408ab383cb4d0cf1a1d03a0bdd2e5e24ae17f12b667
autonomi2@localhost:~/tstclient> ls -l ~/.local/share/safe/client/chunk*/5eb*
total 16
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:46 21c67020ba2db03f9d20c5086392e1880dc53c64eb39278b1839c5090b3e560b
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:46 3e1683066d0c53e66520ab71d10a5595cf351049a67d50c7d502203eb9bf9a09
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:46 5a84fa53525ee73685d53b116e51d6177d193a4c10097a8f86cfc865bad1e4b0
-rw-r--r-- 1 autonomi2 autonomi2 369 Aug 30 15:46 metadata

Attempt 5

  • Filename same as attempt 1
  • Contents same as attempt 3
  • Removed chunk_artifacts directory

File contents: eccbc87e4b5ce2fe28308fd9f2a7baf3

Directory removal command
autonomi2@localhost:~/tstclient> rm -r ~/.local/share/safe/client/chunk*

The 3 xor/hash of the 3 chunks
[2024-08-30T05:49:56.024969Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(ea1c4b - 71af910815ebf73b0db8dd615381b32a96981b71406cdbacca86f6f87ca31a62)
[2024-08-30T05:49:56.025014Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(69aa61 - 6bea9295741678d91fba236553817feeab01600271b8d453c70469a356cceb72)
[2024-08-30T05:49:56.025032Z DEBUG sn_networking] Getting the closest peers to NetworkAddress::ChunkAddress(abb106 - a0581e16e58d86d781e72260b6992619aea8e0a357365ceec15eff55b96e7fb5)

ls listing of the chunk_artifacts directory
autonomi2@localhost:~/tstclient> ls -l ~/.local/share/safe/client/chunk*
total 4
drwxr-xr-x 2 autonomi2 autonomi2 4096 Aug 30 15:49 5eb77fd8260350333ead4408ab383cb4d0cf1a1d03a0bdd2e5e24ae17f12b667
autonomi2@localhost:~/tstclient> ls -l ~/.local/share/safe/client/chunk*/5eb*
total 16
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:49 69aa61afb634939acb35b7188a5e70c829cbffc97ee2fe56ce1c9740e0b4a0f1
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:49 abb1062dfee8afcba37a40903b44601c2a08cc61c76572ad104c58d1f5b1d74c
-rw-r--r-- 1 autonomi2 autonomi2 16 Aug 30 15:49 ea1c4b6d31c3eaa6207b256fd7855296d85f35c4364fb6884cddadea14bd3be7
-rw-r--r-- 1 autonomi2 autonomi2 382 Aug 30 15:49 metadata

20 Likes

Excellent :male_detective: work here @neo

15 Likes

Yea I wondered why my tests were giving the same xor address/hash for every version of the test file. Had me scratching my head for a bit until I noticed the chunking directory didn’t change, so that was the clue.

5 Likes

Great work @neo.

This may explain why my app appears to be failing to display its test site correctly. Actually, I suspect not, as I am using a different directory to upload different versions of each test site. Can you confirm that if the same named file is in a different directory, the bug should not occur (due to the hash using the full path to the file)?

FYI @dirvine I found a regression in the registers API (issue #2077) yesterday. I’m not clear how quickly issues are looked at so flagging this here.

7 Likes

I originally had the file in ~/currentdir/tmp/testfile and then tried ~/currentdir/testfile, and 2 different chunk directories were created. From there I assumed they were independent, and I am confident that it's a good assumption.

Additionally I considered it a bug that the chunking directory was not deleted after uploading @joshuef
Although deleting it would in most cases make this bug harder to ever see, the directory still needs to be deleted

So two bugs to fix

  • one is using hash of only things to do with meta data for the file. Need a random factor here.
  • two the chunking directory is not deleted.

@happybeing it is very possible that two separate applications could have seemingly the same path/filename, resulting in the same chunking directory name, yet be different files altogether. Aliases and links could do this.
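
A rough Python sketch of both points, under the assumption that the directory name is a hash of the path string as given and that links are not resolved first (I have not checked what the client actually does here). `path_key` is a hypothetical helper, and the demo needs a platform that allows creating symlinks.

```python
import hashlib
import tempfile
from pathlib import Path


def path_key(path: Path) -> str:
    """Hypothetical key: hash of the path string as given, links not resolved."""
    return hashlib.sha256(str(path).encode()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for name in ("v1", "v2"):
        (base / name).mkdir()
        (base / name / "testfile").write_text(f"contents of {name}\n")

    # Same filename in different directories: the keys differ.
    different_dirs = path_key(base / "v1" / "testfile") != path_key(base / "v2" / "testfile")

    # A link that is later pointed elsewhere: same path string, different file.
    link = base / "current"
    link.symlink_to(base / "v1", target_is_directory=True)
    key_before = path_key(link / "testfile")
    text_before = (link / "testfile").read_text()

    link.unlink()
    link.symlink_to(base / "v2", target_is_directory=True)
    key_after = path_key(link / "testfile")
    text_after = (link / "testfile").read_text()

    same_key_different_file = key_before == key_after and text_before != text_after
    print("different directories, different keys:", different_dirs)
    print("same key, different file via link:", same_key_different_file)
```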

6 Likes

Apparently this is a breaking change in nodes that is preventing the network retrieval. It will be addressed in the next update, but for now the current register crate is ahead of the nodes, unfortunately that causes this error.

It’s not easy to fix that, but feels tightly coupled. We can likely do better, but for now if we get it synced back up we can keep moving on and when time permits separate network calls from the crate.

9 Likes

Thanks @neo ! (Sorry I missed this thread in amongst a few other notifications yesterday).

  • one is using hash of only things to do with meta data for the file. Need a random factor here.

We could/should hash the file itself perhaps? (Not sure how that scales for larger files :thinking: )

Something random wouldn’t be helpful I think in that we’d always be rechunking files. (Which can be slow)?
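
On the scaling question, a content hash does not need the whole file in memory; it can be fed in fixed-size blocks. A minimal Python sketch, where `content_key`, the 1 MiB block size, and SHA-256 are all choices made for this example rather than anything the client is known to use:

```python
import hashlib
import tempfile
from pathlib import Path


def content_key(path: Path, block_size: int = 1024 * 1024) -> str:
    """Hash the file contents block by block, so memory use stays flat."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        while block := handle.read(block_size):
            hasher.update(block)
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    testfile = Path(tmp) / "testfile"

    testfile.write_text("c4ca4238a0b923820dcc509a6f75849b\n")
    key_v1 = content_key(testfile)

    testfile.write_text("c81e728d9d4c2f636f067f89cc14862c\n")
    key_v2 = content_key(testfile)

    # Restoring the original contents gives the original key back,
    # so an unchanged file would not be rechunked.
    testfile.write_text("c4ca4238a0b923820dcc509a6f75849b\n")
    key_v1_again = content_key(testfile)

    print("changed contents, new key:", key_v1 != key_v2)
    print("same contents, same key:", key_v1 == key_v1_again)
```

The cost is one full read of the file before chunking starts, which is the part that may matter for very large files.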

1 Like

Using something like a hash of the file’s modification date/time along with the full path/filename would be perfect, because for the file contents to differ the modification time will have been updated as well.

If the person deliberately fudges the modification date/time to remain the same then they deserve their loss LOL.
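
Roughly what I mean, as a Python sketch. `artifacts_key` is a hypothetical name and SHA-256 is an arbitrary pick. The demo pushes the modification time forward by one second with `os.utime` only so the result does not depend on how fine the filesystem's timestamps are; a real save some time later would change it on its own.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def artifacts_key(path: Path) -> str:
    """Hypothetical key: full path plus modification time in nanoseconds."""
    stat = path.stat()
    material = f"{path.resolve()}|{stat.st_mtime_ns}"
    return hashlib.sha256(material.encode()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    testfile = Path(tmp) / "testfile"

    testfile.write_text("c4ca4238a0b923820dcc509a6f75849b\n")
    key_first = artifacts_key(testfile)
    key_unchanged = artifacts_key(testfile)  # untouched file, same key

    before = testfile.stat()
    testfile.write_text("c81e728d9d4c2f636f067f89cc14862c\n")
    # Simulate the save happening one second later (demo only).
    os.utime(testfile, ns=(before.st_atime_ns, before.st_mtime_ns + 1_000_000_000))
    key_second = artifacts_key(testfile)

    print("untouched file keeps its key:", key_first == key_unchanged)
    print("saved file gets a new key:", key_first != key_second)
```

This only needs a stat call, so it costs nothing extra for large files.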

That suggests you plan on keeping the chunk artifacts directory and not deleting it when finished. That will in itself be a big issue for someone (me and others for instance) planning on uploading 10’s and 100’s of thousands of files in future. I’ve lived a good amount of time and while the average size of files has increased over 60 years, the number of files per year has stayed more similar.

You need to delete the chunk artifact directory at some stage, otherwise the directory holding all those chunking directories will become overloaded with the sheer number of subdirectories. Not to mention the space all those chunks will consume on the drive. For instance if I upload the videos I am making right now (1/2 TB worth yesterday, prob 2 - 5 TB by the end of the week) of my construction of some vintage computer replicas, then my system drive will not handle it.

3 Likes

No, just until finished I think.

This “feature” came from large uploads failing, and wanting to avoid rechunking there.
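
As a sketch of that intent (keep the artifacts while an upload can still be resumed, remove them once it has finished), something like the following Python would do. `upload_then_cleanup` and the boolean-returning `upload` callback are made up for the example and say nothing about how the client is actually structured.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def upload_then_cleanup(artifacts: Path, upload: Callable[[Path], bool]) -> bool:
    """Run the upload; delete the chunk directory only if it succeeded."""
    succeeded = upload(artifacts)
    if succeeded:
        shutil.rmtree(artifacts)
    return succeeded


with tempfile.TemporaryDirectory() as tmp:
    artifacts = Path(tmp) / "chunk_artifacts" / "example"
    artifacts.mkdir(parents=True)
    (artifacts / "metadata").write_text("placeholder")

    # A failed upload leaves the chunks in place so a retry can skip rechunking.
    upload_then_cleanup(artifacts, lambda _dir: False)
    kept_after_failure = artifacts.exists()

    # A successful upload removes them.
    upload_then_cleanup(artifacts, lambda _dir: True)
    removed_after_success = not artifacts.exists()

    print("kept after failure:", kept_after_failure)
    print("removed after success:", removed_after_success)
```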

5 Likes

Perhaps adding file size to the hash would help?

I was thinking that, but the error appeared with same size file. Hashing the file is better than file size.

Of course modification date/time is the standard indicator of the last change time.
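
A quick Python check of why size alone would not have caught this, using the three test contents from the first post. The SHA-256 comparison is only there for contrast and is not a claim about which hash the client should use.

```python
import hashlib

# The three versions of testfile: 32 chars plus a newline each.
versions = [
    b"c4ca4238a0b923820dcc509a6f75849b\n",
    b"c81e728d9d4c2f636f067f89cc14862c\n",
    b"eccbc87e4b5ce2fe28308fd9f2a7baf3\n",
]

sizes = {len(contents) for contents in versions}
digests = {hashlib.sha256(contents).hexdigest() for contents in versions}

print("distinct sizes:", len(sizes))
print("distinct content hashes:", len(digests))
```

All three versions share one size, while the content hashes are all different.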

1 Like