Could we have multiple Chunk sizes?

I’m worried about the loss of functionality that will come with ‘large’ chunk sizes of 2MB+. The current plan of 4MB seems excessively large, and 2MB is not much better.

I realise it is still possible to upload much less, but doing so will become much less attractive and maybe completely uneconomic. The price will be driven by the maximum chunk size, because the bulk of data on the network will be in files or other data structures larger than whatever the chunk size is, so that is the unit that storage will be purchased in.

I also realise that because of the use of an ERC20 L2 token to pay for storing data (hopefully not forever) it is more attractive to have a larger chunk size.

You can have object storage / content-addressed systems that handle variable-sized objects, although it is much more common for systems to set a single size. That is because of the way erasure coding works when fragments are spread across disks or nodes. Autonomi is a completely different architecture, where there is no defined size of the network (or even knowledge of the size of the network), and no physical tolerance zones based on disks, trays of disks, racks or sites that have to be taken into account.

At the moment I believe the network is using straightforward replication of 5 copies rather than EC; presumably that is coming later?

How about:-

512KB objects
2MB objects
4MB objects
… Other sizes in the future? 64MB is common and is attractive for large uploads and downloads.

Objects up to 512KB = replicated 5 ways, because it’s hardly worth doing EC at that size; still split into 3 chunks so the encryption works.

Objects larger than that would use however many chunks are needed to make the EC economic, resilient and performant (a rough sketch of such a tiered policy follows).
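
To make the idea concrete, here is a minimal sketch of a size-tiered redundancy policy. The tier boundaries, the `Redundancy` type and the fragment counts are all my own illustrative inventions, not anything from the Autonomi codebase:

```rust
/// Hypothetical redundancy policy keyed on object size. None of these
/// names exist in the real node code; this only illustrates the idea.
enum Redundancy {
    /// Store `copies` full replicas of each chunk.
    Replicate { copies: usize },
    /// Erasure coding: any `data` of `data + parity` fragments rebuild
    /// the object.
    ErasureCode { data: usize, parity: usize },
}

fn policy_for(object_size: usize) -> Redundancy {
    const KB: usize = 1024;
    const MB: usize = 1024 * KB;
    if object_size <= 512 * KB {
        // Small objects: plain replication; EC overhead isn't worth it.
        Redundancy::Replicate { copies: 5 }
    } else if object_size <= 4 * MB {
        // Mid-size: any 8 of 12 fragments rebuild the object (1.5x storage).
        Redundancy::ErasureCode { data: 8, parity: 4 }
    } else {
        // A possible future large tier (e.g. 64MB objects).
        Redundancy::ErasureCode { data: 16, parity: 8 }
    }
}

fn main() {
    for size in [64 * 1024, 1024 * 1024, 16 * 1024 * 1024] {
        match policy_for(size) {
            Redundancy::Replicate { copies } => println!("{size} B: {copies} replicas"),
            Redundancy::ErasureCode { data, parity } => println!("{size} B: {data}+{parity} EC"),
        }
    }
}
```

The storage overhead of 8-of-12 EC is 1.5x, versus 5x for straight replication, which is the economic argument for EC on larger objects.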

If it’s not possible, or too late, to add variable-size objects just now, could it be added later? Or could the door be left open to it somehow? Or does the chunk size have to be baked in for all time at the beginning?

I realise this question overlaps with some other ongoing ones about encryption and the use of the ERC20 L2 token, but it seems worthy of a separate question and I don’t think it’s been asked before. Sorry if that’s not the case!

4 Likes

Self encryption is entirely client side. The network just has a max size it will accept records for.

Chunks are already variable in size. (They only hit the max if the file size is over MAX_CHUNK_SIZE.)
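
For what it’s worth, here is a simplified sketch of why chunks are variable. This is not the real `self_encryption` code, and the constants are assumptions; the real crate balances remainders more carefully:

```rust
// Simplified sketch: a file is cut into at least 3 pieces, and each piece
// is capped at MAX_CHUNK_SIZE, so only large files produce max-size chunks.
const MAX_CHUNK_SIZE: usize = 4 * 1024 * 1024; // the 4MB cap discussed here
const MIN_CHUNKS: usize = 3; // self-encryption needs 3 chunks to work

fn chunk_layout(file_size: usize) -> (usize, usize) {
    // At least MIN_CHUNKS chunks; more once pieces would exceed the cap.
    let chunks = MIN_CHUNKS.max(file_size.div_ceil(MAX_CHUNK_SIZE));
    (chunks, file_size.div_ceil(chunks)) // (count, approx size per chunk)
}

fn main() {
    for size in [100 * 1024, 3 * 1024 * 1024, 100 * 1024 * 1024] {
        let (n, per) = chunk_layout(size);
        println!("{size} byte file -> {n} chunks of ~{per} bytes");
    }
}
```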

10 Likes

Thank you. That is good to know. Maybe what I’m asking then is: could there be different MAX_CHUNK_SIZE values? And maybe they are treated differently.

Basically, what I’m driving at is that the 4MB size is too big for a lot of use cases. The price will be driven by the price of the smallest unit that can be paid for.

To illustrate my point - I went into a pub at lunch for a cheeky half. Here is how the conversation went:-

“A half pint of your finest ale please, barkeep”

“Coming right up, sir. That’ll be £20 when you’re ready.”

“£20 for a half pint?! I know this is London but that seems excessive!”

“I agree, that would be a little steep even for round here. It’s £20 for 4 pints”

“But I only want a half! £20 is about 8 times what it should be.”

“And I’ll gladly pour you half a pint but I will have to charge you £20. I won’t insist you drink 4 pints if you don’t want to. If you don’t want the other three and a half pints I won’t bother pouring them.”

“So I can buy anything up to 4 pints for £20?”

“Yes, sir. We sell in units of 4 pints at a time. That chap at the end of the bar is very happy with his 4 pints. That large group over there are overjoyed with the 16 pints they bought for £80. Those four gentlemen in the corner bought 4 pints for £20 and are doing well. I expect they will be back soon for another round just like it and be at least as happy, if not more.”

“What if one of them leaves or wants a spirit next time?”

“It will still be £20 for 4 pints but I’ll only pour 3 pints for them if they only want 3 by then.”

“I don’t really need 4 pints just now and I think there are pubs that will sell me just a half pint for a more reasonable amount.”

So I went elsewhere.

2 Likes

Not without removing the pure simplicity the network relies on, where all nodes are considered equal.

For instance, addressing is uniform throughout xor space: you need to know nothing beyond the fact that the 5 nodes closest to my chunk’s address are the 5 nodes holding the chunk. The network fills at approximately the same rate everywhere, as the double randomness (random chunk address and random node address) helps. And many other cool things help keep the logic and node software as slim as possible.

Going to different chunk sizes means that chunk 1 at 128MB is stored on nodes A, B, C, D, E, then a 1MB chunk is stored on C, D, E, F, G, since its xor address is closest to those nodes. Very uneven indeed.

Keep doing this sort of thing and there will not be basically equal filling of the network, even among nodes close in xor space. Some might have a lot of 1,024MB chunks (large files generating these), others 3/4 of that size, some only a few, and others 1MB to 1,024MB chunks at different rates.

So how do you define the max space needed to accommodate this? Do you break the 5-closest-nodes model and have indexing nodes directing routing to the nodes holding chunks, distributing chunks to nodes that say they can handle that size? Or do you have different xor spaces, one for each of a set of chunkMaxSize values, i.e. one xor space for 1MB chunks, a second for 2MB chunks, one for 4MB chunks, etc.? All so you can keep the 5-closest algorithm, with routing controlled by the size of the chunk you are storing or retrieving.
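
For readers following along, a minimal sketch of the “k closest in xor space” rule being defended here (addresses shortened to u64 for brevity; the real network uses 256-bit keys):

```rust
/// Return the `k` node addresses closest to `chunk_addr` by XOR distance.
fn closest_nodes(chunk_addr: u64, nodes: &[u64], k: usize) -> Vec<u64> {
    let mut sorted = nodes.to_vec();
    // XOR distance: a smaller result means closer in xor space.
    sorted.sort_by_key(|&node| node ^ chunk_addr);
    sorted.truncate(k);
    sorted
}

fn main() {
    let nodes = [0x1A2B, 0x9C00, 0x1A00, 0xFFFF, 0x0001, 0x1AFF];
    // Whatever the chunk's size, the same 5 nodes are responsible for it:
    // size plays no part in routing, which is the simplicity at stake.
    println!("{:x?}", closest_nodes(0x1A11, &nodes, 5));
}
```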

4 Likes

Ok, maybe it’s not possible to have different chunk sizes. And it sounds like whatever chunk size is chosen at launch will be the size forever.

I still think 4MB or even 2MB is a bit big. Lots of use cases will become uneconomic. 1MB is a bit more like it.

I still don’t know whether the system uses erasure coding, so that’s my other question. I remember a lot of talk about it, but all I’ve heard recently has been about ‘copies’ or ‘replicas’. EC is in use by other vendors, so without it Autonomi will be forever less economically attractive (barring the obvious advantage of store-forever for a fixed fee).

1 Like

Not really.

I suggested a method, which was echoed by David later on.

A 2-stage update: first the code is changed to the larger size (best for a node size increase, but it works for a max chunk size increase too), then a 2nd update later, once most nodes have upgraded, triggers the increase to actually be used. Nodes that did not upgrade will either run out of disk space as network fullness grows, or continue as normal but charge too much because of higher fullness. The node operator at that point will either ignore the loss of earnings or find out why and upgrade.

If this is used for a max chunk size increase only, then the nodes that did not upgrade would reject the larger chunks and be marked bad. Once again the owners will either ignore it or find out why and upgrade.

The obvious thing is to ensure the 2nd step is done only when the network has mostly done the 1st-step upgrade. Maybe wait a year? Six months?
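
A hypothetical sketch of that 2-stage rollout (none of these names come from the actual node code, and the sizes are placeholders):

```rust
const OLD_MAX_CHUNK_SIZE: usize = 1024 * 1024;     // placeholder stage-1 size
const NEW_MAX_CHUNK_SIZE: usize = 4 * 1024 * 1024; // placeholder target size

struct Node {
    /// Flipped by the later stage-2 release (or a network-wide trigger)
    /// once most of the network has done the stage-1 upgrade.
    stage_two_active: bool,
}

impl Node {
    /// Stage-1 nodes already *accept* the larger records...
    fn max_record_accepted(&self) -> usize {
        NEW_MAX_CHUNK_SIZE
    }
    /// ...but only *produce* them once stage 2 triggers, so un-upgraded
    /// nodes are not rejected/marked bad prematurely.
    fn max_chunk_produced(&self) -> usize {
        if self.stage_two_active { NEW_MAX_CHUNK_SIZE } else { OLD_MAX_CHUNK_SIZE }
    }
}

fn main() {
    let node = Node { stage_two_active: false };
    assert!(node.max_record_accepted() >= node.max_chunk_produced());
    println!(
        "accepts up to {} bytes, produces up to {} bytes",
        node.max_record_accepted(),
        node.max_chunk_produced()
    );
}
```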

If we follow the wisdom of TCP/IP we will remain at a small chunk size. I could say more, but methinks the devs are sick of me trying to educate on comms and control systems without the actual data to look at.

7 Likes

Chunk size is not the same thing as maximum record size.

Chunk size is wholly client side.

Maximum record size is the max size of a record a node will allow (so it limits chunk size).

Maximum record size can always be changed down the line.

I’m never sad to have more wisdom! :slight_smile:

2 Likes

I meant max chunk size, since the previous poster was referring to that. Yeah, not said correctly and not technically right, I fully agree, but in the context of his post that is what was being talked about.

As I said.

One only needs to ask how comms across the internet have issues that multiply as packet size increases. TCP/IP, with its error correction, masks this. With UDP, when a packet is lost it is lost: the receiver has to request it specifically again, or for a multi-packet transmission simply not acknowledge it as OK (by not responding, ignoring the errored packet like streaming services do, requesting a block of packets again, responding with not-OK, or …). So with 1/2MB data blocks, a packet error only results in a 1/2MB block being transmitted again, with inter-block delays, response times, etc. At 4MB it is a 4MB resend. The difference is that there is 8 times the chance of one packet error/loss putting the whole data block in error and needing resending, and the delays are multiplied 8 times. Thus it is a multiplication of issues. And that is just one of the issues with comms.

It’s why, after so many years/decades, the internet packet is still the basic ~1500-byte size: it provides better all-round performance. Now, I am not suggesting the chunk size go down to that, LOL. But the issues that pressure keeping the packet size small also influence the performance of larger data blocks. If it were all TCP/IP then the packet error correction would reduce these issues significantly. But UDP, without that error check/recovery, is great for apps where packet loss doesn’t stop the process (e.g. video streaming), yet it rears its ugly head for data blocks where error is not acceptable.

If there were error correction across the packets sent (e.g. an m-of-n scheme, where you only need m of the n packets sent to recreate the data block), then some of these issues would be reduced. But that introduces extra complexity and processing when an error occurs, and it means that for the majority of data blocks, which arrive without errors, significantly more data is sent.
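
Some back-of-envelope numbers for this, assuming independent loss of ~1500-byte UDP packets and an illustrative 0.01% per-packet loss rate (both are assumptions, not measurements):

```rust
fn main() {
    let packet_bytes = 1500.0_f64;
    let loss = 1e-4; // assumed per-packet loss probability
    for mb in [0.5_f64, 4.0] {
        let packets = (mb * 1024.0 * 1024.0 / packet_bytes).ceil();
        // Without FEC, one lost packet forces a resend of the whole block:
        // P(resend) = 1 - (1 - p)^n, roughly n*p for small p, hence ~8x
        // worse for a 4MB block than for a 0.5MB one.
        let block_fail = 1.0 - (1.0 - loss).powf(packets);
        println!("{mb} MB block = {packets} packets, P(resend) = {:.2}%",
                 block_fail * 100.0);
    }
    // An m-of-n scheme would tolerate a few losses per block, at the cost
    // of always sending n/m times the data even when nothing is lost.
}
```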

This is a reason why the 4MB max chunk size increases 2 parameters in this huge control system called the beta test network.

Now, I do not suggest this was the only issue, but in my opinion it was a significant contributing factor. There are many parameters in this “control system”/network, and maybe changing a few others might mask the underlying communications issue.

Without data though I can only be generic in my opinions.

I was writing communication protocols in the ’70s and ’80s, used by a telecommunications company (the only telco in AU at the time), and comms in other areas. I was even the unofficial comms expert that IBM used in our state, and got jobs at remote gold mines fixing comms issues in IBM systems. I wrote more comms/protocols in the ’90s for the telco on the new thing called the Internet.

NOTE: I use “data block” as a more generic term, which would include a record.

3 Likes

Since we’re using QUIC and not TCP (?) … do we send 1 chunk as 1 QUIC data package (that’s theoretically possible per the QUIC standard, if I’m not mistaken)? :open_mouth:

If that’s the case, then enlarging the package size would drastically increase the error rate in chunk communication (especially over longer paths, and not just within the DO datacenters).

…e.g. if the probability of 0.5MB arriving intact is 99%, then increasing the package to 4MB means 8x the data must arrive intact … 0.99^8 ≈ 92.3% probability for a chunk to arrive without errors.

…of course, all of this only holds if nothing super smart happens in the background …

2 Likes

Might the higher bit error rate of mobile internet be a reason for @happybeing’s issues with connecting to the network via his wireless router? :open_mouth:

It could also explain the strange bandwidth-usage behaviour I’ve seen there:

ChatGPT says:

Bit error rate of mobile internet: typically 10^-3 to 10^-6

Landline: 10^-9 to 10^-12

…I didn’t do proper research, but I think that might be a hint … @joshuef … is our beloved network not particularly tolerant of bit errors happening in the communication (especially in combination with lower available bandwidth)? …

3 Likes

It’s still going over UDP. I didn’t say we were using TCP; I just mentioned it to show how it uses error recovery in the protocol.

I am not exactly sure what benefits QUIC provides at the UDP packet level.

If mobile has a 10^-3 (to 10^-6) bit error rate, then that is 1 bit in 1,000 to 1 million bits, or 1 bit in roughly 128 bytes to 128KB. Very high. But if I am not mistaken, mobile data has error correction built into the connection beneath the protocol layer.

10^-9 is 1 bit in ~128MB, or 1 in 32 max-size chunks, which is still high; 10^-12 is much better, at 1 in ~32K max-size chunks.
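
Putting those rates against the proposed chunk size (a rough calculation, assuming independent bit errors and no link-layer correction):

```rust
fn main() {
    let chunk_bits = 4.0_f64 * 1024.0 * 1024.0 * 8.0; // one 4MB chunk in bits
    for ber in [1e-3_f64, 1e-6, 1e-9, 1e-12] {
        let expected_errors = chunk_bits * ber;
        // P(chunk arrives clean) = (1 - ber)^bits, approximately
        // e^(-bits * ber) for small ber.
        let clean = (-expected_errors).exp();
        println!("BER {ber:e}: ~{expected_errors:.4} errors per chunk, \
                  {:.1}% of chunks clean", clean * 100.0);
    }
}
```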

2 Likes

Which would explain why that stuff works at all for anything… But every error detection/correction has limits…

2 Likes