Proposal for more efficiently handling small files

There is some value to having that unused space though, isn’t there?

It helps keep duplicated files on the network to a minimum. If we are cramming all these little files into the 4MB chunks, then we are going to have a ton of unique 4MB chunks, rather than just a few. It would bloat the size of the network either way, but now we have a million copies of blankpixel.png instead of just a few.

Correct me if I’m wrong though.

3 Likes

I think the answer is both yes and no.

For large, unassociated files, it may make sense to upload them in a regular public archive. Then you get the benefit of deduplication.

However, for small, associated files, it’s unlikely they will already exist, and even if they do, they are small enough not to worry about anyway.

The network will have an ideal minimum chunk size. The effort to establish the route, make the connection, start the data transmission, etc., will be a somewhat fixed cost. You want to ensure that the payload is worth that effort.

I’m not sure what the optimal size is. I suspect it would be less than 4MB. However, there is also the cost to upload each chunk, including communication with the blockchain.

So, given you could probably fit hundreds of small files in a chunk, and files are split into at least 3 chunks, it would seem good for network efficiency to group small files together as one. Let the larger files benefit from deduplication though.
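To make “group small files together as one” concrete, here is a minimal sketch of a packing format. The layout (concatenated file bytes followed by a JSON index and an 8-byte length trailer) is purely illustrative and not an Autonomi API:

```python
import json
import struct

def pack_small_files(files: dict[str, bytes]) -> bytes:
    """Concatenate small files into one blob with a JSON index appended at the end.

    Layout (illustrative only): [file bytes...][index JSON][8-byte index length].
    """
    blob = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = {"offset": len(blob), "length": len(data)}
        blob.extend(data)
    index_bytes = json.dumps(index).encode("utf-8")
    return bytes(blob) + index_bytes + struct.pack(">Q", len(index_bytes))

def unpack_small_file(blob: bytes, name: str) -> bytes:
    """Read one file back out of the packed blob using the trailing index."""
    index_len = struct.unpack(">Q", blob[-8:])[0]
    index = json.loads(blob[-8 - index_len:-8])
    entry = index[name]
    return blob[entry["offset"]:entry["offset"] + entry["length"]]

packed = pack_small_files({"a.txt": b"hello", "b.txt": b"world" * 100})
assert unpack_small_file(packed, "a.txt") == b"hello"
# The packed blob would then be uploaded as one file and self-encrypted into its
# 3 chunks, rather than paying for 3 chunks per tiny file.
```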

3 Likes

Also, storing the index in the datamap, as an add-on to the end of it so that other clients can still use that datamap, would be the way to go. Putting the index in the 3 chunks introduces a known plaintext portion and, even if only by a tiny bit, weakens the encryption. People will want to use this for their private files too.

It would also be great to introduce it into the standard libraries, so that all apps have access to it and not just novel apps specifically doing it, which would leave the data “useless” for other apps unaware of the multiple files in the one file.

And it’ll work for 4 and 5 and 6 and … chunk blocks too, except it would potentially take longer and longer to get a single file. So closer to 3 chunks is by far the best.
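For illustration of the “index appended to the datamap” idea, here is a minimal sketch where the chunks carry only file data and the index lives outside them. The field names are invented for this example and the real self_encryption datamap differs:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    """One entry of a simplified datamap: where an encrypted chunk lives on the network."""
    dst_hash: bytes  # XOR address of the encrypted chunk
    src_hash: bytes  # hash of the corresponding plaintext chunk
    size: int

@dataclass
class ExtendedDataMap:
    """A plain datamap plus a small-file index appended at the end.

    Clients that only understand the chunk list can ignore `file_index` and
    still decrypt the whole packed blob; index-aware clients can additionally
    pull out individual small files by (offset, length). Field names here are
    illustrative, not the real self_encryption structures.
    """
    chunks: list[ChunkInfo]
    file_index: dict[str, tuple[int, int]] = field(default_factory=dict)  # name -> (offset, length)
```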

Very good point and certainly one to be considered carefully.

So then this system should only be used when essential, and maybe phased out or kept to an absolute minimum once paying gas doesn’t exist anymore, e.g. when the native token is introduced.

@Traktion makes a good counterpoint that tiny files, like 10KB each, will represent only small savings with dedup, and that it’s the larger sizes that dedup will be most effective for.

And of course, for private files there is much less chance for dedup to come into play, since so many private files will be unique, with much less common content.

4 Likes

I would argue that having a chunk that is less unique is better for the overall performance of the network. A unique chunk is probably not going to see as much traffic through the network as something with that deduplication factor, where more users are requesting it. It might be beneficial from a denial-of-service standpoint to keep those files flowing, rather than using Autonomi as cold storage.

2 Likes

I am not in disagreement with your general thinking on this. But for that one idea, I cannot see any basis.

In fact, if more people want a chunk, then that will add to a DoS load on the nodes holding the chunk, especially currently, without caching implemented. Accessing a chunk does not make it move or change nodes. New nodes joining, or nodes leaving, close by that chunk will rearrange where the chunk lives by a little bit.

Having the same file living in different chunks will actually reduce the DoS effect.

For an extreme example to illustrate: say a file has 1000 uploads, and thus the chunks dedup 999 times, and it is the latest cat meme being used on hundreds of thousands of discussion/social media places/posts because it is a super viral meme.

Then in a day those 3 chunks are each hit 100,000 times, effectively having a DoS effect on 15 nodes (5 copies x 3 chunks).

BUT if the file is effectively changed (re-encoded, or stored compressed, or in a “collection”) 200 times, then there are 200 times more nodes being hit, but each is hit only 500 times a day instead of 100,000, i.e. 200 times less often.

Yes, an extreme example, but only to illustrate that the DoS effect would be reduced when dedup does not happen.
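A back-of-the-envelope version of that trade-off, assuming 5 replicas per chunk and requests spread evenly across the variants (both assumptions are for illustration only):

```python
# Back-of-the-envelope: DoS load with and without dedup for a viral 3-chunk file.
REPLICAS_PER_CHUNK = 5        # assumed replication factor
CHUNKS_PER_FILE = 3           # minimum chunk count for a self-encrypted file
REQUESTS_PER_DAY = 100_000    # hits on the meme in one day
VARIANTS = 200                # distinct re-encodings/"collections" of the same file

# Fully deduped: every request lands on the same 3 chunks.
deduped_nodes = CHUNKS_PER_FILE * REPLICAS_PER_CHUNK      # 15 nodes
hits_per_deduped_chunk = REQUESTS_PER_DAY                 # 100,000 per day

# Spread across 200 variants: 200x the nodes, each hit far less often.
variant_nodes = deduped_nodes * VARIANTS                  # 3,000 nodes
hits_per_variant = REQUESTS_PER_DAY // VARIANTS           # 500 per day

print(f"deduped:  {deduped_nodes} nodes, {hits_per_deduped_chunk} hits/day per chunk")
print(f"variants: {variant_nodes} nodes, {hits_per_variant} hits/day per variant")
```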

2 Likes

Doesn’t accessing a chunk make nodes that are down more noticeable to the network, causing more churn? For example, if I want to access my private cat meme once a month, would it just sit on the same nodes that whole time with minimal churn? Whereas a public cat meme with 100,000 requests maybe won’t sit on the same nodes, as any kind of inefficiency in any of those nodes could have the file move around more nodes in order to help with performance?

1 Like

Nope, the closest-node algorithm always finds the closest nodes, whether it’s the first time the chunk is being searched for or the 1000th time.

Churn only happens when a new node appears or one leaves.

And the churn only happens for chunks where the new node is one of the closest, and sometimes for those in the close neighbourhood. Also, when a node leaves (rarely), churn happens to the node that goes back to being one of the closest 5, but typically it already has the record.
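For illustration, a toy sketch of Kademlia-style closest-node selection by XOR distance, using 8-bit IDs instead of the network’s 256-bit addresses. It shows that the result depends only on the node set and the chunk address, not on how often the chunk is fetched:

```python
# Minimal sketch: deterministic "closest nodes" selection by XOR distance.
# Toy 8-bit IDs; the real network uses 256-bit addresses and iterative lookups.

def closest_nodes(chunk_address: int, node_ids: list[int], k: int = 5) -> list[int]:
    """Return the k node IDs closest to chunk_address by XOR distance."""
    return sorted(node_ids, key=lambda node_id: node_id ^ chunk_address)[:k]

nodes = [0x12, 0x9C, 0x47, 0xE1, 0x03, 0x58, 0xB6, 0x7F]
chunk = 0x50

# The answer is the same no matter how many times the chunk is requested.
print([hex(n) for n in closest_nodes(chunk, nodes)])
# Only adding or removing a node changes which nodes are closest (churn).
print([hex(n) for n in closest_nodes(chunk, nodes + [0x51])])
```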

1 Like

Sybil attacks

Sybil attacks involve malicious actors generating multiple fake identities, known as Sybil nodes, in an attempt to control operations. With sufficient Sybil nodes controlling part of the address space, a malicious actor could block access to certain content, for example. Autonomi detects Sybil attacks by estimating the network size using a statistical method, KL divergence. This compares the real distribution of node IDs (i.e. their network addresses) with the theoretical one. A significant difference between the two could indicate a Sybil attack. At this point, nodes looking for data at a certain address will broaden their search to include a wider range of peers than normal. Even if all nodes closest to a chunk of data are compromised, it may still be accessed via nodes that are further away.
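To sketch the detection idea (not Autonomi’s actual implementation): bucket the observed node IDs by address prefix and compare that histogram against the uniform distribution using KL divergence; a large value suggests IDs are being concentrated in one region of the address space.

```python
import math
import random
from collections import Counter

def kl_divergence(observed: list[float], expected: list[float]) -> float:
    """D_KL(observed || expected) in bits; both inputs must be probability distributions."""
    return sum(p * math.log2(p / q) for p, q in zip(observed, expected) if p > 0)

def prefix_distribution(node_ids: list[int], prefix_bits: int = 4, id_bits: int = 256) -> list[float]:
    """Histogram of node IDs bucketed by their top `prefix_bits` bits."""
    counts = Counter(node_id >> (id_bits - prefix_bits) for node_id in node_ids)
    total = len(node_ids)
    return [counts.get(bucket, 0) / total for bucket in range(2 ** prefix_bits)]

random.seed(0)
uniform = [1 / 16] * 16  # expected shape if node IDs are spread evenly

honest = [random.getrandbits(256) for _ in range(2000)]
# A Sybil attacker packs extra IDs into one small corner of the address space
# (here: top 4 bits all zero), skewing the observed distribution.
sybil = honest + [random.getrandbits(252) for _ in range(2000)]

print("honest KL:", round(kl_divergence(prefix_distribution(honest), uniform), 3))
print("sybil  KL:", round(kl_divergence(prefix_distribution(sybil), uniform), 3))
```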

So then this protection does not exist?

It does, since the records are now stored on a lot more than the 5 closest nodes.

Also, new nodes are joining and nodes are leaving, so churning is occurring regularly. But there is no churn due to accessing records.

BTW, I have tested the Sybil attack, and damned if I could “kill” a known test chunk a friend uploaded for the test. It’s also how I found out just how many nodes store a record, and how nodes that once stored a record still store it and restore it once the attack is over. Although I have a suggestion to increase the protection, which I will put forth soon enough if it isn’t done anyhow.

8 Likes

Surely the best solution to that would be the proposed caching nodes?

1 Like

Off topic, but why not just increase the copy number by +1 for each additional copy stored on the network? That would help with nodes being hit with requests on popular chunks, no? Or is such a mechanism too hard to implement? It seems like it would be relatively simple, but I have no clue if the copy number is stored somewhere and can be adjusted without it becoming another attack vector.

1 Like

Yes, but the topic is centred around trying to save $$$ by combining very small files into 3 chunks.

Really, dedup and caching are a different subject to saving the dollars, but they do affect the usefulness of it. I also personally feel that individual files are always the best way. But the $$$ cost means an alternative to spending thousands of dollars for uploading portions of the Wayback Machine or similar needs to be explored.

1 Like