Important considerations for data integrity

The following is a discussion about silent data corruption on SAS vs SATA links and unrecoverable hard-error read rates on commercial and enterprise SAS/SATA drives. Every would-be farmer / vault operator should consider these things when building their systems. This discussion is also pertinent to safe routing and how the built-in Safe Network chunk redundancy is leveraged to flag chunks that have suffered rot/damage, and how that knowledge is subsequently communicated back to vaults through node age or rewards.

6 Likes

Does that suggest that vaults should, on occasion, be doing checksums on chunks?.. seems like a good idea. I guess it’s similar to fsck -f force-checking the drives… but the network cannot action that directly; it could, once in a blue moon, verify that chunks are still stored accurately.

2 Likes

A point that I’ve tried to make in past conversations is that, yes, the network will need to constantly poll vaults. This is necessary to a) ensure the vault still has the chunk it is supposed to have, b) ensure that chunk has not become corrupted, c) take corrective action if one of the N copies of a chunk is defective, d) challenge the problem vault to accept a corrected copy of the chunk, and e) penalize the problematic vault if it doesn’t shape up (a rough sketch of this loop is at the end of this post).

The above article details how often this is likely to occur based on the type of hardware used.

Once in a blue moon isn’t nearly enough. IMO, vaults should be polled for chunk integrity 24/7/52. My hope is that maidsafe or dirvine have some clever tricks for doing these checks in an efficient manner.
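Purely as a sketch of how those a) to e) steps could fit together (none of these type or function names come from the actual vault/section code, and the "failed before" flag is just an assumption about how repeat offenders might be tracked):

```rust
// Hypothetical sketch only: mapping audit outcomes to actions for the
// a) to e) steps above. None of these names are the real vault/section API.

#[derive(Debug, PartialEq)]
enum AuditResult {
    Valid,     // vault returned the chunk and its hash matched
    Corrupted, // vault returned data but the hash did not match (silent corruption)
    Missing,   // vault failed to produce the chunk at all
}

#[derive(Debug, PartialEq)]
enum AuditAction {
    None,              // chunk is healthy, nothing to do
    SendCorrectedCopy, // push a known-good copy from another replica and re-check
    Penalise,          // repeat offender: penalise / demote the vault
}

// Decide what to do with a vault based on the latest audit outcome and
// whether it already failed a previous round for the same chunk.
fn decide(result: &AuditResult, failed_before: bool) -> AuditAction {
    match (result, failed_before) {
        (AuditResult::Valid, _) => AuditAction::None,
        (_, false) => AuditAction::SendCorrectedCopy,
        (_, true) => AuditAction::Penalise,
    }
}

fn main() {
    assert_eq!(decide(&AuditResult::Valid, false), AuditAction::None);
    assert_eq!(decide(&AuditResult::Corrupted, false), AuditAction::SendCorrectedCopy);
    assert_eq!(decide(&AuditResult::Missing, true), AuditAction::Penalise);
    println!("audit decision table behaves as expected");
}
```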

3 Likes

Could this be a function of the Vault (at a minimum)?

At least then the vault would check its data and if it finds a bad chunk then it initiates a request for the chunk to reinstate it.

This way it will not lose farming rewards, and it puts less load on the network than taking a node down for bad data and moving its chunks to new node(s).

4 Likes

That might be fine if all vaults could be trusted and are diligent and conscientious… but …

I think it’s best if the network drives the process, demanding that the vaults respond in a certain way if they want to stay in good standing.

There is a lot of power in asking the N redundant vaults to perform a task (e.g. hash a randomly selected set of chunks) and then seeing which result is not like the others.

This way the communication overhead stays low while the chunk verification rate stays high.
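To make that concrete, here is a rough Rust illustration of the "which result is not like the others" comparison. The digests, the majority rule and the message plumbing are all assumptions for the sake of the example, not the real routing/vault logic:

```rust
// Rough illustration: given the digest each of the N replica vaults returns
// for the same challenge ("hash this randomly selected set of chunks"),
// find the vaults whose answer disagrees with the majority. Digests are
// plain byte vectors here; the real network would use a cryptographic hash.
use std::collections::HashMap;

// Return the indices of the vaults whose response differs from the most
// common response (the assumed-correct one).
fn odd_ones_out(responses: &[Vec<u8>]) -> Vec<usize> {
    let mut counts: HashMap<&[u8], usize> = HashMap::new();
    for r in responses {
        *counts.entry(r.as_slice()).or_insert(0) += 1;
    }
    // The digest reported by the largest number of vaults is taken as correct.
    let majority: Vec<u8> = counts
        .into_iter()
        .max_by_key(|&(_, count)| count)
        .map(|(digest, _)| digest.to_vec())
        .expect("at least one response");

    responses
        .iter()
        .enumerate()
        .filter(|(_, r)| r.as_slice() != majority.as_slice())
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let good: Vec<u8> = vec![0xAA, 0xBB, 0xCC];
    let rotted: Vec<u8> = vec![0xAA, 0xBB, 0x00]; // one vault holds a damaged chunk
    let responses = vec![good.clone(), good.clone(), rotted, good];
    assert_eq!(odd_ones_out(&responses), vec![2]);
    println!("vault 2 disagrees with the majority and gets challenged");
}
```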

4 Likes

The issue is that good vaults will want to do this because they want their data to be good; otherwise they get kicked, or at least lose rewards.

Bad vaults may not care. But they get kicked on bad retrievals.

If you have a majority of bad nodes doing this then the network will die anyhow

2 Likes

Yes, but not all chunks are retrieved at the same frequency, so a supplemental ‘audit’ retrieval can speed up the detection of bad nodes and ensure that every chunk in a vault is checked at some reasonable interval.

1 Like

This is definitely an area where the incentives of the network are not properly aligned. Clients are incentivized to continuously (or at least frequently) request the chunks that they have uploaded, because GETs are free and doing a GET is how they can assure themselves that the network is looking after their data.

2 Likes

I think you missed the point.

If you have even a majority of bad vaults/nodes then the network is sunk anyhow, even if the elders are doing the checking. And more likely a bad node is doing other bad stuff, causing bigger problems for the network than a bad chunk being retrieved or chunks not being stored at all.

Now, for bad nodes to cause the loss of a chunk, all the nodes holding it would need to be bad*, bad in the sense that they do not faithfully do their own checking. (*all nodes excluding genuinely faulty ones)

This is why I say it would be a good feature for the node to do a self-check on a slow but regular basis. It means that nodes self-correct, hopefully before they try to serve up a faulty chunk.

1 Like

Depends how many copies of each chunk the network keeps over and above what is essential for assurance, relative to the risk of rot. The compound of those might suggest 24/7 checking isn’t needed and would be an unnecessary burn on CPU. The risk of hard drive corruption is low… that’s why the odd fsck (or whatever M$ does now for what used to be defragging) might be good to encourage.

The other consideration is how hard the drive is being used, but if it’s an SSD then perhaps physical wear is not an issue.

1 Like

The idea is to do it incrementally. Like 1% of chunks every hour, so 100 hours for a full storage check.

Maybe cycling once a week would be more appropriate. Even once a month is likely to be OK.
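A minimal sketch of that kind of incremental sweep, assuming the vault keeps a cursor into its local store and that each chunk’s expected digest can be recomputed from its contents (std’s DefaultHasher is only a stand-in for a real cryptographic hash such as SHA3-256):

```rust
// Minimal sketch of the incremental self-check: each pass verifies a small
// slice of the locally stored chunks, so a full sweep of the store completes
// after store_len / slice_len passes (e.g. 1% per hour means roughly
// 100 hours per full cycle).
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct StoredChunk {
    expected: u64, // stand-in for the content-derived name/hash of the chunk
    data: Vec<u8>,
}

fn digest(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Re-hash chunks [cursor, cursor + slice_len) and return the indices whose
// contents no longer match their expected digest. The caller advances the
// cursor each pass, wrapping around so the sweep never stops.
fn check_slice(store: &[StoredChunk], cursor: usize, slice_len: usize) -> Vec<usize> {
    (cursor..cursor + slice_len)
        .map(|i| i % store.len())
        .filter(|&i| digest(&store[i].data) != store[i].expected)
        .collect()
}

fn main() {
    let mut store: Vec<StoredChunk> = (0..10)
        .map(|i| {
            let data = vec![i as u8; 32];
            StoredChunk { expected: digest(&data), data }
        })
        .collect();
    store[3].data[0] ^= 0xFF; // corrupt one byte to simulate rot

    // Sweep the whole store in slices of 2 chunks per pass.
    for pass in 0..5 {
        let bad = check_slice(&store, pass * 2, 2);
        if !bad.is_empty() {
            println!("pass {pass}: corrupted chunks at {bad:?}, request fresh copies");
        }
    }
}
```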

3 Likes

I think this can be done efficiently, at least the check itself; the network can also adjust to do more of this during periods of relative inactivity and less when stressed.

For example, checking might be asking three vaults to return the nth byte of the chunk at a given XOR address. Only if the answers differ does any further action need to be taken. You might even be able to piggy-back these operations on top of essential messages to avoid adding message traffic.
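For illustration only, the comparison side of that spot check could be as simple as this (the chunk address, offset and reply plumbing are imagined):

```rust
// Sketch only: the section picks a random chunk address and a random byte
// offset, asks each replica vault for the byte at that offset, and only
// escalates to a full comparison if the answers differ. `None` models a
// vault that failed to answer at all.

fn answers_agree(replies: &[Option<u8>]) -> bool {
    let mut reference: Option<u8> = None;
    for reply in replies {
        match (reply, reference) {
            (None, _) => return false, // a silent vault counts as disagreement
            (Some(b), None) => reference = Some(*b), // first answer becomes the reference
            (Some(b), Some(first)) if *b != first => return false,
            _ => {} // answer matches the reference byte
        }
    }
    true
}

fn main() {
    // Three replicas asked for byte 1021 of some chunk all agree: no action.
    assert!(answers_agree(&[Some(0x5A), Some(0x5A), Some(0x5A)]));
    // One replica returns a different byte: escalate to a full hash comparison.
    assert!(!answers_agree(&[Some(0x5A), Some(0x17), Some(0x5A)]));
    println!("spot-check comparison behaves as expected");
}
```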

3 Likes

Yes, I saw what you meant. So just throw enough redundancy at it to passively account for bad and faulty… but ignoring the bad actors, IMO there needs to be a more active process to deal with faulty hardware. Consider that most vaults will be on consumer-grade tech with no ECC. Trust but verify.

Yes, my main thought was something like this. Say 100 hours to do a full storage check, but when you finish, start over and check again so it effectively runs 24/7. The network might even offer farming rewards for these self-test or GET audits to incentivise the good behavior (not as much as a real GET reward, of course).

I think a good perspective is to ask the following question.

What is the longest duration/ period of time that a chunk should sit in a vault between GET requests?

Popular data will naturally be checked much more regularly than cold data. I’ve seen claims that the majority of data is written and then read again very rarely, if ever (once or twice per decade). That is clearly too long a maintenance interval.

2 Likes

Yep you missed the point.

Good actors check & fix themselves because they want maximum farming rewards and no penalties.

Bad actors will act badly no matter what you try. Put elder/section checks on a node and the node will just record the correct hash to fool the elders. Yes, you could try to salt it, but that is adding complexity for a problem that doesn’t really exist.

The network has redundancy built in; that is inherent in the design and roots out bad actors not storing chunks. For bad actors to actually cause loss of a chunk requires ONLY bad actors to be storing that chunk. This may occur when you have 7 bad actors and the one good node goes offline. But you see the issue here: if an attacker can get that many bad nodes that close together (in XOR space) then the network will not survive, and reliable storage is the least of the problems.

Summary:

Good actors want to check their storage to ensure max farming and no penalties.

Bad actors will fool the elders one way or another, even if they actually store the data until attack time, and no amount of elder checking can help.

The simplest approach, and the least load on the network, is for nodes to self-check and request a fresh copy of any chunk that went faulty. It is the best and most efficient way to fix storage compared to elder checking, and it will work a dream for consumer-grade gear or server gear.

Elder checking of each and every chunk is clumsy and prone to gaming anyhow. It’s in the best interest of nodes to self-check. Only when all nodes holding a chunk are bad actors is there any data loss. It’s the redundancy built into the network design.

Sorry, your poll did not make sense considering the topic. Checking is not related to PUT/GET.

2 Likes

No, I saw your point just fine. In general I just don’t think the loss of a potential GET reward far into the future is enough of a disincentive to ensure data integrity long term.

Yes, I know. That’s why, in your quote of me above, I said I agreed with your point about redundancy solving many ills, so "ignoring the bad actors", let’s focus on faulty hardware. You missed my point.

Chunks will accumulate over time, and cold data will be obvious to identify. If that data is never checked/audited or retrieved for many years, then the vault has no incentive to maintain it and spend power on hashing and error correction.

“Ahh, but good vaults will be good and diligent by definition!”, you say. Maybe. Active and passive checks could work together.

“Why should I keep/maintain data that won’t be requested even once in the next 10 to 25 years?” In this situation it’s not a concern for bad/malicious actors, but lazy or ignorant ones.

An active audit process solves this. The vault still does all the work; an elder would just need to periodically compare a couple of hashes offered by the vaults. The elder wouldn’t check one chunk at a time; batches of chunks could get a single hash to minimize communication overhead. Lazy vaults then face a penalty in the short term for not being diligent maintainers of the data on their own (perhaps in the manner you describe above, requesting a new copy if they detect a problem) to ensure they don’t fail an audit.
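A rough sketch of that batching idea (the names, the challenge nonce and DefaultHasher are illustrative assumptions; the nonce is just the "salt" mentioned earlier in the thread, so a lazy vault can’t answer from a precomputed value):

```rust
// Rough sketch of a batched audit: the elder names a set of chunk addresses
// plus a random nonce, and each vault replies with one digest over
// (nonce || chunk bytes...). Matching digests mean every chunk in the batch
// is intact; the nonce stops a lazy vault from replying with a precomputed
// value. DefaultHasher is only a stand-in for a real cryptographic hash.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Digest a whole batch of chunks under a challenge nonce.
fn batch_proof(nonce: u64, chunks: &[Vec<u8>]) -> u64 {
    let mut h = DefaultHasher::new();
    nonce.hash(&mut h);
    for chunk in chunks {
        chunk.hash(&mut h);
    }
    h.finish()
}

fn main() {
    let nonce = 0xDEAD_BEEF_u64;
    let held_by_good_vault = vec![vec![1u8; 16], vec![2u8; 16], vec![3u8; 16]];
    let mut held_by_lazy_vault = held_by_good_vault.clone();
    held_by_lazy_vault[1][7] ^= 0x01; // one byte of rot the lazy vault never noticed

    let expected = batch_proof(nonce, &held_by_good_vault);
    let offered = batch_proof(nonce, &held_by_lazy_vault);
    assert_ne!(expected, offered);
    println!("batch proof mismatch: re-audit that batch chunk by chunk");
}
```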

It was premature so I scrapped it.

It is because the data naturally gets checked when a GET is served. A maximum constant GET interval per chunk is a simple abstraction for thinking about the issue. It’s also the most naive/simple solution to the problem.

3 Likes

You are still missing the point. It would be built into the node software. So basically no choice. I said it that way to indicate the differences between good/bad nodes.

Your “Section does the checks” is also just part of the code base. So why not reduce the load on the section and leave it as part of the node functioning.

Umm so does just writing the node software to do active checking of its own storage. Same result but less load on the network.

  1. The problem is that you haven’t closed the control loop and have no guarantees that the result is the same until a GET is performed. The time between GETs is currently unbounded; that is the issue. The method you described is only a half solution.

  2. The section elders don’t do the checking, the vaults do. The only difference between the two scenarios is that in a) vaults do the checking and make a request (perhaps in a manner like you suggested) but no more, whereas in b) they perform the same thing as in a) but periodically get audited, so there is no chance for a chunk to sit in a vault for 3 days or 3 weeks, let alone 3 decades, before it is retrieved. The section is only enforcing that the checks are being done and shortening the time horizon to when nodes are rewarded or penalized in the natural way farming works. The extra load on elders is negligible.

Another perspective to view this from is that of flattening the popularity of data. By auditing chunks, there is no longer hot and cold data. Every chunk will be requested at roughly the same rate in perpetuity. This makes all data equally valuable. To make it efficient in terms of bandwidth, audited chunks don’t get transferred; only a 256-bit proof for a set of chunks does.

I do not know why you say this. A node can do this checking for itself. This is effectively the node doing a GET on itself.

If you are concerned that the node has the wrong hash for its chunk and somehow the disk error matches the incorrect hash, then you could have nodes request the hash from others for themselves. This is not a GET as such, since it’s just metadata.

But work out the odds of the stored hash being corrupted to exactly the hash of the corrupted chunk. Very unlikely.

Now if you are concerned that the node will not return a chunk due to some error in its hardware, then it’s unlikely to be performing its other node functions properly, so it gets booted anyhow.

But none of this is concerned with the age of the data as such. The self-check causes each chunk to be read from disk at least weekly. The normal functions of the node cause the node’s hardware/software to be checked.

1 Like

Don’t we already have that kind of check in place through churn? A leaving node causes the data stored on that node to be re-GET-ed by some other node(s) at relatively even intervals?

3 Likes

Maybe. Unsure. I am not clear on the details of chunk passing during a churn event. How often do elders churn? Churn is an expensive operation in terms of time and bandwidth. Section splits are another opportunity for error-correction routines, but the time between splits is likely unbounded.

There are many ways to “skin a cat” (please pardon the expression). The only real point I’ve been trying to make is that there needs to be some upper bound on the amount of time any chunk is “at rest” in order to allow for reasonable error/malice detection/correction intervals on all chunks. An overabundance of redundancy then fixes all ills.

The churn mechanism may be a simple way to effect this, as long as there is a guarantee that all chunks in a section eventually get touched by successive churn events within a given time interval. Serendipitously, the more redundancy demanded by a section, the greater the probability that a single churn event will touch a majority of the chunks. This starts to tie in with @mav’s discussion regarding degree of redundancy. The trade-off is that passing chunks around instead of proofs is expensive, so it’s better for the network if churn is minimized.
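As a toy back-of-envelope model (the per-event touch fraction p is a made-up parameter, not something taken from the actual churn design): if each churn event independently touches a fraction p of a section’s chunks, a given chunk is still untouched after k events with probability (1 - p)^k.

```rust
// Toy model only, not derived from the actual churn mechanics: if each churn
// event independently touches a fraction p of a section's chunks, then the
// probability that a particular chunk has never been touched after k events
// is (1 - p)^k.
fn untouched_probability(p: f64, k: u32) -> f64 {
    (1.0 - p).powi(k as i32)
}

fn main() {
    let p = 0.05; // assume each churn event relocates ~5% of the section's chunks
    for &k in &[10u32, 50, 100] {
        println!(
            "after {k} churn events a given chunk is still untouched with probability {:.3}",
            untouched_probability(p, k)
        );
    }
}
```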

The trade-off between redundancy and error correction is a delicate dance. The right balance of both yields a network that can survive in the real world.

Yes, and for good vault operators who want to be diligent, good maintainers of their chunks, your approach is a fine recommendation. I suppose my thoughts beyond what is written here have drifted to a variety of other ideas/concepts that may be off topic for this thread.

3 Likes