Calculating the probability of data persistence

peca · February 7, 2024, 2:43pm

I think we need erasure coding at the network level. At the app level it will help in certain cases, but it only works at download time or requires periodic checks from the client. If nodes are aware of the “reserve” chunks, they can use them any time there is a problem with replication and some chunks are missing.
This “self-healing” adds another layer of protection: if two chunk-loss events happen one after the other rather than at the same time, the file can be back to full health before the second event hits, even if all copies of some chunks were lost in the first event.

dirvine · February 7, 2024, 4:09pm

I am not sure. I see no difference between EC and just more replicas in our case (usually the argument is space efficiency, but that is not really the case here)? Also with self encrypt we know the hash names of chunks, if we then use EC on them we will need much larger data maps.

The other thing is that SE chunks are information-theoretically secure, i.e.:

There is not enough info to decode a chunk, even with quantum or better compute
No 2 chunks have any possible link between them (they are securely unlinked)

peca · February 7, 2024, 5:32pm

I am comparing two examples:

10000 chunk file, 6 replicas = 60k chunks stored
same file 4:2 EC coded, 15000 chunks, 4 replicas = 60k chunks stored

Number two feels like it would do better in the test @Toivo did in the first post, but I don’t have enough math skill to do the calculations.

I get there are other factors in play, I just wonder if the difference would be significant.
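A rough way to put numbers on the two layouts above, as a sketch rather than a definitive model. It assumes a fraction `p` of nodes vanishes at once, that every stored copy is lost independently with probability `p`, and that “4:2” means groups of 4 data chunks plus 2 parity chunks where any 4 of the 6 rebuild the group. The function names are made up for illustration.

```python
from math import comb

def group_loss(q, n, k):
    """Chance that fewer than k of n pieces survive when each is lost with probability q."""
    return sum(comb(n, j) * q**j * (1 - q)**(n - j) for j in range(n - k + 1, n + 1))

def file_survival(p, data_chunks, replicas, n=1, k=1):
    """Chance the whole file survives; n = k = 1 means plain replication."""
    q = p ** replicas                 # every replica of one piece is gone
    groups = -(-data_chunks // k)     # ceiling division: number of EC groups
    return (1 - group_loss(q, n, k)) ** groups

for p in (0.1, 0.2, 0.3):
    plain = file_survival(p, 10_000, replicas=6)
    coded = file_survival(p, 10_000, replicas=4, n=6, k=4)
    print(f"outage {p:.0%}: 6 replicas {plain:.6f} | 4:2 EC x 4 replicas {coded:.6f}")
```

Under these assumptions the coded layout comes out ahead for the same 60k stored chunks. The model ignores repair traffic, data map size and correlated failures, which are the other factors mentioned above.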

dirvine · February 7, 2024, 6:26pm

If we simplify it.

1 chunk (1Mb)

10 replicas → 10 storage points (10 Mb in total)
5:3 EC (meaning we need 3 of 5 to survive) and 2 replicas → 10 storage points (approx 13Mb in total)

For the same redundancy (Storage points) we have more data.
In 1. we can lose MAX 9 storage points and still be good
in 2. We can lose MAX 7 storage points and be good

This is what I mean, it’s the viewpoint we take, never mind the complexity.

to7m · February 7, 2024, 8:43pm

Consider how many copies there would be of each bit in a given file. Non-EC solution would always give the highest number, and wouldn’t involve the unnecessary complexity of implementing EC.

neo · February 8, 2024, 1:04am

I agree that it’s not for the network level. The benefits it would provide can be matched by extra replicas, just as RAID 6 provides better protection than RAID 5 because of the extra “parity” disk.

I would agree/accept that it can be an add-on to the client to provide automatic extra blocks, but that is not at the protocol layer (network layer). These extra blocks would have to go through their own form of self encryption, since if they were included in the original file’s chunks you would end up defeating dedup if the original file already exists on Safe.

I would expect that this will be maintained. The link between a file’s chunks and the extra block chunks is only known by the datamap.

While it may be theoretically less secure when presented with only the chunks for that file and its extra blocks, the bigger problem is finding 2 chunks that are related to that file/extra-blocks. The time to download the 1000s of billions of chunks in order to try and find related chunks will negate any theoretical advantage gained by having extra blocks to assist.

If it were to be implemented in the client then I’d suggest that it be done in a batched way, e.g. for every 10 chunks of the file there are “X” extra blocks. This is to reduce the computational load on the client machine when generating them. I am sure some testing will come up with an optimal number. Working with this in the past, the larger the file and the “chunks”, the more the processing time multiplies. It’s not linear but closer to exponential.

The real protection it would provide is against lost chunks, since errors in transmission or node storage/retrieval are already covered by the 5 replicas. So even an 11:10 scheme will provide very good protection.
EG a 10% outage (huge)

0.1 ^ 5 chance of all 5 replicas in the outage for any particular chunk.
100,000 chunk files this is on average 1 chunk per file.
11:10 protection
- 110,000 chunks stored
- effectively divided into sections of 11 if we use one extra block per 10 chunks
- #chunks lost in 11 chunk section is 0.00011 or 1 in 9090 sections will have one chunk lost.
  - all sections still protected if one chunk lost per section
  - to lose 2 chunks in a section is very small (think 1 in millions)
12:10 protection
- 120,000 chunks stored
- effectively divided into sections of 12 if we use two extra blocks per 10 chunks
- lose one chunk in a section after millions and 2 chunks in one section after billions of sections
- need to lose 3 chunks in the section to lose the section, and this is in terms of 1 in 1000’s of billions of sections

So even with one protection chunk per section of 10 chunks the average loss of one section numbers in the millions of sections. Basically 11:10 would protect files with a million chunks 9 out of 10 times with a 10% permanent loss of the Safe network.

Replica of 6 without EC has a 0.1^6 (0.000001) chance of any particular chunk being in the 10% lost section. This means a file with one million chunks would be expected to be lost.

Replica of 5 with 11:10 and 1 million chunk file

5.5 million stored chunks on the network (1.1 million chunks × 5 replicas)
less than 1 in 10 chance of being lost

Replica of 6 and 1 million chunk file

6.0 million chunks stored on the network
expected to lose most if not all files of 1 million chunks

Replica of 7 and 1 million chunk file

7.0 million chunks stored on the network
expect to lose 1 in 10 files of 1 million chunks

We see that it takes having 7 copies of each chunk (Replica of 7) to come close to matching an 11:10 protection with Replica of 5, but with about 1.27 times the chunks stored on disk.

And for 12:10 protection it requires Replica of 10 or more to come close.

The reason is that this kind of protection normally assumes each of its blocks is stored once, whereas Safe’s replication of each block (chunk) magnifies that protection.
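The comparison above can be checked with a short script. This is a sketch under stated assumptions: a 10% outage hits every stored copy independently, a section of `n` chunks survives as long as at least `k` of them still have a live replica, and the helper names are hypothetical.

```python
from math import comb, expm1, log1p

def section_loss(q, n, k):
    """Chance that more than n - k chunks of an n-chunk section have lost every replica."""
    return sum(comb(n, j) * q**j * (1 - q)**(n - j) for j in range(n - k + 1, n + 1))

def file_loss(outage, data_chunks, replicas, n=1, k=1):
    """Chance that at least one section of the file is unrecoverable."""
    q = outage ** replicas                    # one chunk loses all its replicas
    sections = -(-data_chunks // k)           # ceiling division
    # 1 - (1 - s)^sections, written with log1p/expm1 so a tiny s is not rounded away
    return -expm1(sections * log1p(-section_loss(q, n, k)))

CHUNKS, OUTAGE = 1_000_000, 0.1
schemes = {
    "replica 5 + 11:10": dict(replicas=5, n=11, k=10),
    "replica 5 + 12:10": dict(replicas=5, n=12, k=10),
    "replica 6":         dict(replicas=6),
    "replica 7":         dict(replicas=7),
}
for name, kwargs in schemes.items():
    print(f"{name:18s} P(file lost) = {file_loss(OUTAGE, CHUNKS, **kwargs):.3g}")
```

With independent losses the 11:10 and 12:10 figures land well inside the “less than 1 in 10” estimate, while replica 6 and replica 7 come out close to the numbers quoted in the post.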

[EDIT] And of course this is not needed for launch at all. It can be added later on to the client. But it does solve the large file problem when there is a significant outage on the network.

TylerAbeoJordan · February 8, 2024, 6:36am

I like EC, but agree 100% it should be app layer not core network. Core network has a viable method to ensure data persistence and adding another just increases complexity for no provably significant gain – yet there is provably a significant cost.

benpreiss · September 4, 2025, 12:24pm

Hey all! Happy to be here, yet late to the party!

I think I read about self-healing somewhere here in the thread… is this actually a thing in autonomi currently? I.e. when a node goes offline, does another node step in and store the missing chunk copy instead?

And just to recap, the “problem” of the current implementation (without erasure codes) is, if you lose the wrong 5 copies the whole file is lost?

E.g. if you have a big file with e.g. 10000 chunks, each 5x replicated, losing the wrong 5 chunks (out of 50000) results in total loss of file readability?

How does that work on implementation level? Are the 4MB storable within e.g. a scratchpad seen as a single chunk? Or what is the chunk size on autonomi?

benpreiss · September 4, 2025, 9:49pm

This leads to the following sample calculations for durability (link):

[Screenshot: sample durability calculations]

Note that fleet time between any losses describes the time until one chunk from a fleet of chunks (in this case 10k) might be lost.

Bottom line

It would be really great to get some metrics about the autonomi network that we can plug in here… Or someone who knows the network can give some educated estimates?

Especially interesting:

AFR
time until a rebuild is triggered (τ)
repair time per missing replica (t_x)
node reliability (temporary failures), described by online / offline ratio and mean time a node stays online

If someone could share these metrics, I’d love to plug them in and give a concrete number for availability and durability on autonomi!

I would be delighted if someone with more mathematical understanding than me could check the above calculations

neo · September 4, 2025, 10:04pm

Where is this from?

Definition of an outage?

riddim · September 4, 2025, 10:05pm

can we get rid of such LLM nonsense …

yes

so in a drastic case where we’d lose 20% of the network at once, the probability of each chunk surviving would be:

1 - 0.2**5 == 99.968%

probability not to lose such a large file (that’s a 40GB file there):

(1 - 0.2**5) ** 10_000 == 4% → so that file would probably need a re-upload (actually … the number 5 is a lower bound - might be more than that … + the 20% nodes really would need to be shut down fast enough for no replication being triggered in time to create additional copies again… not sure that’s a super realistic scenario)

… with just losing 10% of the network at once instead of 20%, survival probability would jump to:

(1 - 0.1**5) ** 10_000 == 90.48%
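The expressions above wrap up into one small function. It is a sketch that assumes every copy sits on an independently chosen node and that the outage is faster than replication; `file_survival` is just an illustrative name.

```python
def file_survival(outage_fraction, copies, chunks):
    """Chance that no chunk of the file loses all of its copies in one sudden outage."""
    per_chunk_survival = 1 - outage_fraction ** copies
    return per_chunk_survival ** chunks

# 10,000 chunks of 4 MB each is the 40GB file from the example
print(f"20% outage, 5 copies: {file_survival(0.2, 5, 10_000):.2%}")
print(f"10% outage, 5 copies: {file_survival(0.1, 5, 10_000):.2%}")
```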

neo · September 4, 2025, 10:13pm

Yes records are always stored on at least 5 nodes. Replication ensures that records are copied onto other nodes if a node goes down.

Also to lose all 5 requires all of those 5 nodes to go down at approx the same time and there isn’t another node still storing that record. My testing showed that all records I tested were stored on 12 to 50 nodes.

But yes losing all copies of a record will result in the loss of the data file stored and the larger the file the greater the chance.

The chance though goes as the %age of the network lost, to the power of 5 (usually more). So a 1% loss of nodes (e.g. 10,000 nodes in a network of 1 million nodes) will result in a 0.0000000001 chance of any one record in the network being lost. Put another way, that is roughly the chance at which you would expect a 10,000,000,000 chunk file to be lost. Obviously any record lost will result in a file somewhere being lost no matter the size.

But to lose 1% requires those nodes to go offline before replication kicks in.

But that figure assumed the worst-case scenario, where records are only stored on 5 nodes. Typically 7 nodes will actively grab each record, and as nodes join and leave, replication means that more than those 7 nodes will have a copy that they can reinstate if needed.

neo · September 4, 2025, 10:14pm

100%

AI only gives what already exists somewhere and is a mash up of that depending on the exact wording of the question

riddim · September 4, 2025, 10:15pm

with 12 copies survival rate of your 40GB file with 20% network drop:

( 1 - 0.2 ** 12) ** 10_000 == 99.9959%

ps: with 12 copies even a 40% network drop still results in 84% survival probability for that 40GB file
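The same formula can be turned around to ask how many copies a target survival rate needs. A minimal sketch, assuming independent copies and a sudden outage as above; `min_copies` and its `max_copies` guard are hypothetical helpers, not anything the network exposes.

```python
def file_survival(outage_fraction, copies, chunks):
    """Chance that no chunk of the file loses all of its copies."""
    return (1 - outage_fraction ** copies) ** chunks

def min_copies(outage_fraction, chunks, target, max_copies=100):
    """Smallest copy count whose file survival reaches the target, or None if not found."""
    for copies in range(1, max_copies + 1):
        if file_survival(outage_fraction, copies, chunks) >= target:
            return copies
    return None

for outage in (0.1, 0.2, 0.4):
    needed = min_copies(outage, 10_000, target=0.99)
    print(f"{outage:.0%} outage: {needed} copies for 99% survival of a 10,000 chunk file")
```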

neo · September 4, 2025, 10:19pm

12 copies means a 1 in 244,140,625 chance of losing any one record if a 1/5th of the network went down within a couple of minutes. That is another way to look at it for betting people

benpreiss · September 5, 2025, 1:29am

Durability and availability on autonomi

First of all, what are we talking about?

Data durability = probability of data remaining intact (neither corruption nor full loss) over time period x
Read availability = The chance that a read request for an object succeeds at a random moment

I think both are very important metrics. Read availability is important UX-wise, as users don’t like service outages. Data durability matters because users expect platforms to not lose their data (ever!).

Autonomi stores 5 (or more) duplicates of each chunk. Also, chunk copies that go offline (temporarily or permanently) are taken up by other nodes on the network, thus striving to maintain 5 copies at all times (this is an assumption - maybe someone else here knows this?). Copies that were offline are not declared “dead” whilst the repair is running, but only after another node has finished duplicating the rescue copy.

With the above characteristics, we can model both read availability and data durability. It is important to note, that nodes going offline, temporarily or permanently, drive availability, while permanent failure alone (nodes not coming online again) drives durability.

Some terminology:

chunk: a piece of data that is 5 times replicated in autonomi
replica / duplicate / copy: a single copy of a chunk
node: a node in the autonomi network

Assumptions that the following derivations rely on:

node failures (both permanent or temporary) are unrelated to a chunk (it is not more likely for nodes storing a copy of the same chunk to fail). This relies on the fact that chunk copies are distributed randomly around the globe?

Read availability

Read availability of a single data chunk (replicated 5 times) is governed by how long and how frequently nodes go offline. Assuming that node failures are truly random / not related to specific chunks and that there is a repair / rebalancing mechanism for chunks from nodes going offline, we can model the read availability as a function of the following parameters:

r — replica count (e.g., 5).
a — per-node availability (fraction of time a node is online).
T_on — mean time a node stays online.
τ — declare-failed threshold (only outages longer than τ trigger rebuild).
t_x — repair time per missing replica (include detect + queue + copy).

One can find that a is the biggest factor in determining availability, with flaky nodes (50% downtime) leading to significantly reduced availability in comparison to 10% downtime.

Here is the formula derivation for read availability, developed with ChatGPT (don’t hate me):

Inputs

r: replica count (e.g., 5)
a: per-node availability (fraction of time a node is online)
Per-node on/off:
- Mean online time T_on ⇒ down-rate β = 1/T_on
- From a = T_on / (T_on + T_off) get T_off = T_on * (1-a)/a and up-rate μ = 1/T_off
Repair policy:
- Declare-failed threshold τ (rebuild only if outage lasts > τ)
- Repair time t_x (rate μ_r = 1/t_x)

Baseline (no rebuild deficit)

With k completed copies, independent availability a:

A_k = 1-(1-a)^k

Quorum ≥ m:

A_{≥m,k} = Σ_{j=m}^k \binom{k}{j} a^j (1-a)^{k-j}
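The two baseline formulas translate directly into code. A minimal sketch assuming independent nodes with the same availability `a`; the function names are illustrative.

```python
from math import comb

def any_copy_availability(a, k):
    """A_k: at least one of k completed copies is online."""
    return 1 - (1 - a) ** k

def quorum_availability(a, k, m):
    """A_{>=m,k}: at least m of k completed copies are online (binomial tail)."""
    return sum(comb(k, j) * a**j * (1 - a)**(k - j) for j in range(m, k + 1))

print(any_copy_availability(0.5, 5))    # flaky nodes, any copy will do
print(quorum_availability(0.9, 5, 3))   # mostly-up nodes, majority quorum
```

With `m = 1` the quorum formula reduces to `A_k`, which is a quick way to check an implementation.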

How repairs enter

Only outages longer than τ trigger rebuilds. For exponential OFF times:

λ_{eff} = β e^{-μτ}

Declared-failure rate per replica.

Quick, first-order availability (repairs are rare)

Fraction of time with one missing replica:

f_{rep} ≈ r λ_{eff} t_x

Overall read availability:

A_{read} ≈ (1-f_{rep}) A_r + f_{rep} A_{r-1}

(For quorums, replace A_r, A_{r-1} by A_{≥m,r}, A_{≥m,r-1}.)

Exact, any regime (overlapping repairs allowed)

Birth–death on completed copies K ∈ {0,…,r}:

Loss (declare-failed): δ_k = k λ_eff takes k → k-1
Repair: b_k takes k → k+1
- Serial: b_k = μ_r for k < r
- Fully parallel (upper bound): b_k = (r-k) μ_r

Stationary weights:

π_k = π_0 ∏_{i=1}^k (b_{i-1}/δ_i)

π_0^{-1} = Σ_{k=0}^r ∏_{i=1}^k (b_{i-1}/δ_i)

Availability:

A_{read} = Σ_{k=0}^r π_k A_k

Quorum:

A_{≥m} = Σ_{k=0}^r π_k A_{≥m,k}
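The birth-death chain above is small enough to solve by building the unnormalised weights and dividing by their sum. This is a sketch of that calculation, not a model of how Autonomi actually schedules repairs; the numeric inputs are the flaky-node values used as an illustration (r = 5, a = 0.5, λ_eff = 0.1·e^{-3} per minute, t_x = 20 min), and the function names are hypothetical.

```python
from math import exp

def stationary_distribution(r, lam_eff, mu_r, parallel=False):
    """pi_k for k = 0..r completed copies, from the product-form weights."""
    weights = [1.0]                                    # weight of state k = 0
    for k in range(1, r + 1):
        repair = (r - (k - 1)) * mu_r if parallel else mu_r   # b_{k-1}
        loss = k * lam_eff                                     # delta_k
        weights.append(weights[-1] * repair / loss)
    total = sum(weights)
    return [w / total for w in weights]

def read_availability(a, pi):
    """A_read = sum_k pi_k * A_k with A_k = 1 - (1 - a)^k."""
    return sum(p_k * (1 - (1 - a) ** k) for k, p_k in enumerate(pi))

lam_eff, mu_r = 0.1 * exp(-3), 1 / 20
serial = read_availability(0.5, stationary_distribution(5, lam_eff, mu_r))
fully_parallel = read_availability(0.5, stationary_distribution(5, lam_eff, mu_r, parallel=True))
print(f"serial repairs: {serial:.4f}, fully parallel repairs: {fully_parallel:.4f}")
```

Serial and fully parallel repair bracket the real behaviour, so the two results can be read as a lower and an upper estimate under this model.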

Read availability sample calculations

Example A — flaky nodes

r=5, T_on = 10 min, a = 0.5, T_off = 10 min, β = μ = 0.1/min, τ = 30 min, t_x = 20 min

Gives us

\begin{gathered} \lambda_{\mathrm{eff}} = 0.1 e^{-3} = 0.004978/\mathrm{min} \\ f_{\mathrm{rep}} \approx 5 \times 0.004978 \times 20 \approx 0.50 \\ A_r = 1 - 0.5^5 = 0.96875 \\ A_{r-1} = 1 - 0.5^4 = 0.9375 \\ A_{\mathrm{read}} \approx (1-0.50)\times 0.96875 + 0.50\times 0.9375 \approx 0.953\ (95.3\%) \end{gathered}

Example B — mostly up

T_on = 10 min, a = 0.9, T_off = 1.11 min, β = 0.1/min, μ = 0.9/min, τ = 30 min, t_x = 20 min

Gives us

\begin{gathered} \lambda_{\mathrm{eff}} = 0.1 e^{-0.9\times 30} \approx 1.9\times 10^{-13}/\mathrm{min} \\ f_{\mathrm{rep}} \approx 5 \times 1.9\times 10^{-13} \times 20 \approx 2\times 10^{-11}\ \text{(negligible)} \\ A_r = 1 - 0.1^5 = 0.99999 \\ A_{r-1} = 1 - 0.1^4 = 0.9999 \\ A_{\mathrm{read}} \approx 0.99999\ \text{(repairs don’t move the needle)} \end{gathered}
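Both examples can be reproduced with a few lines. This sketch follows the first-order formulas as written (exponential ON/OFF times, one missing replica at most); the function name and return layout are illustrative choices.

```python
from math import exp

def first_order_read_availability(r, a, t_on, tau, t_x):
    """Return (lambda_eff, f_rep, A_read); all times in the same unit, e.g. minutes."""
    beta = 1 / t_on                        # rate of going offline
    t_off = t_on * (1 - a) / a             # mean offline time implied by a
    mu = 1 / t_off                         # rate of coming back online
    lam_eff = beta * exp(-mu * tau)        # only outages longer than tau count
    f_rep = r * lam_eff * t_x              # fraction of time one replica is missing
    a_full = 1 - (1 - a) ** r
    a_short = 1 - (1 - a) ** (r - 1)
    return lam_eff, f_rep, (1 - f_rep) * a_full + f_rep * a_short

print("Example A:", first_order_read_availability(5, 0.5, 10, 30, 20))
print("Example B:", first_order_read_availability(5, 0.9, 10, 30, 20))
```

In Example A `f_rep` is about 0.5, which is well outside the “repairs are rare” regime the approximation assumes, so that figure is better read as indicative than as exact.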

Outage Wait Time Analysis

A more intuitive metric is the average-experienced outage time, which can be derived from the read availability, the p-quantile of the outage distribution and the amount of monthly outage incidents (also ChatGPT):

We can also summarize the expected wait times if a read lands during an outage, under a Poisson/exponential outage model.

Model

Outages arrive randomly (Poisson) with rate N per month.
A service level objective (SLO) of availability X% per month implies a downtime budget ( B ) seconds per month.
Each outage duration is exponential with mean ( d = B/N ).
If a read lands in an outage, the residual wait is exponential with the same mean ( d ).
Hence:
- p95 wait \approx -\ln(0.05) d \approx 3d
- p99 wait \approx -\ln(0.01) d \approx 4.605d

99.9% (three nines) — per month

Downtime budget: 2628s ≈ 43m 48s

N outages/mo	Avg outage (d=B/N)	p95 wait (≈3d)	p99 wait (≈4.605d)
1	43m 48s	2h 11m 13s	3h 21m 42s
2	21m 54s	1h 5m 36s	1h 40m 51s
4	10m 57s	32m 48s	50m 26s
10	4m 23s	13m 7s	20m 10s
30	1m 28s	4m 22s	6m 43s
100	26s	1m 19s	2m 1s

99.999% (five nines) — per month

Downtime budget: 26.28s

N outages/mo	Avg outage (d=B/N)	p95 wait (≈3d)	p99 wait (≈4.605d)
1	26s	1m 19s	2m 1s
2	13s	39s	1m 1s
4	6.6s	19.7s	30s
10	2.6s	7.9s	12s
30	0.9s	2.6s	4s
100	0.3s	0.8s	1.2s

Tip: Swap “per month” for any window by scaling the budget (B).
E.g., for 99.9% per year, (B = 8h 45m 36s).
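The tables can be regenerated for any SLO and outage count. A minimal sketch assuming the exponential outage model above and a month of 365/12 days (2,628,000 s), which is what the 2628 s budget implies; the names are illustrative.

```python
from math import log

SECONDS_PER_MONTH = 365 * 24 * 3600 / 12   # assumption: average month, 2,628,000 s

def outage_waits(slo, outages, window_seconds=SECONDS_PER_MONTH):
    """Return (budget B, mean outage d, p95 wait, p99 wait), all in seconds."""
    budget = (1 - slo) * window_seconds
    mean_outage = budget / outages
    p95 = -log(0.05) * mean_outage         # exponential quantile, about 3d
    p99 = -log(0.01) * mean_outage         # about 4.605d
    return budget, mean_outage, p95, p99

for n in (1, 2, 4, 10, 30, 100):
    _, d, p95, p99 = outage_waits(0.999, n)
    print(f"N={n:3d}  d={d:8.1f}s  p95={p95:8.1f}s  p99={p99:8.1f}s")
```

Passing `window_seconds=365 * 24 * 3600` gives the per-year budget instead.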

Data durability

Data durability of a single chunk (!!) can be described (amongst other things) by the node (or replica) permanent failure rate, the rebuild delay (wait before rebuild after chunk failure) and the rebuild time (time to rebuild, starts after rebuild delay).
This was also derived by ChatGPT (link):

Setup & symbols

r: target replica count (e.g., 5)
\lambda: per-replica permanent-failure rate (constant hazard)
t_x: mean rebuild/copy time; \mu = 1/t_x is the repair rate
\tau: rebuild delay (wait this long after a drop before starting the copy)
i \in \{1,\dots,r-1\}: number of simultaneous deficits (truly lost replicas being rebuilt) in the cascade
Assumptions: independent replicas; parallel repairs (best case). (Serial noted later.)

Step A — “Rare cascade” picture of data loss

Data loss occurs only if you see r permanent failures before any repair completes. The loss rate factors into:

\text{LossRate}(\tau) \;\approx\; r\lambda \;\times\; \prod_{i=1}^{r-1} p_i(\tau),

where p_i(\tau)=\Pr[\text{the next failure lands before the first repair completes}\mid i\ \text{deficits}].

Step B — The stage i race: X (next failure) vs Y (first repair completion)

Derivation why lifetimes are exponential with rate \lambda

Assume constant hazard: if a replica is alive at time t,

\Pr(\text{fail in }[t,t+\Delta t)\mid \text{alive at }t) \approx \lambda\,\Delta t.

Let S(t)=\Pr(T>t) be survival. Conditioning on the next tiny step,

S(t+\Delta t)=S(t)\,[1-\lambda\,\Delta t] \;\Rightarrow\; S'(t)=-\lambda S(t),\ S(0)=1,

so S(t)=e^{-\lambda t}: T\sim\mathrm{Exp}(\lambda).

(i) Distribution of X (next failure among r-i good replicas)

Let the remaining healthy replicas have i.i.d. exponential lifetimes T_1,\dots,T_{r-i}\sim\mathrm{Exp}(\lambda). Then the next failure time is the minimum:

\Pr[X>t] = \Pr\!\big(\min\{T_1,\dots,T_{r-i}\}>t\big) = \underbrace{\prod_{j=1}^{r-i}\Pr(T_j>t)}_{\text{independence}} = \prod_{j=1}^{r-i} e^{-\lambda t} = e^{-(r-i)\lambda t}.

Hence

\boxed{X \sim \mathrm{Exp}\!\big((r-i)\lambda\big)}.

(ii) Distribution of Y (time to first repair completion with delay \tau)

You must wait a fixed delay \tau before any rebuild can finish. After that, there are i parallel rebuilds, each finishing at rate \mu. The earliest completion is the minimum of i i.i.d. exponentials of rate \mu, hence exponential with rate i\mu. Therefore

\boxed{Y = \tau + Z,\qquad Z \sim \mathrm{Exp}(i\mu)}.

(iii) Stage probability p_i(\tau)=\Pr[X<Y]

\begin{aligned} p_i(\tau) &= \Pr[X<\tau] \;+\; \Pr[X\ge \tau]\Pr[X-\tau < Z] \\ &= \big(1 - e^{-(r-i)\lambda \tau}\big) \;+\; e^{-(r-i)\lambda \tau}\cdot \frac{(r-i)\lambda}{(r-i)\lambda + i\mu} \\ &= \boxed{\,1 - e^{-(r-i)\lambda \tau}\,\frac{i\mu}{(r-i)\lambda + i\mu}\,}. \end{aligned}

(Sanity checks: \tau=0 \Rightarrow p_i(0)=\frac{(r-i)\lambda}{(r-i)\lambda+i\mu}; increasing \tau makes the stage riskier.)
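One way to gain confidence in the boxed stage probability is to simulate the race directly. The sketch below compares the closed form with a seeded Monte Carlo estimate; the parameter values are arbitrary test inputs chosen so that the probability is not tiny, and the function names are hypothetical.

```python
import random
from math import exp

def stage_probability(r, i, lam, mu, tau):
    """Closed form for p_i(tau) = Pr[X < Y] with i deficits out of r replicas."""
    fail_rate = (r - i) * lam              # rate of X, the next failure
    repair_rate = i * mu                   # rate of Z, the first repair after the delay
    return 1 - exp(-fail_rate * tau) * repair_rate / (fail_rate + repair_rate)

def simulate_stage(r, i, lam, mu, tau, trials=200_000, seed=1):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.expovariate((r - i) * lam)       # next failure among healthy replicas
        y = tau + rng.expovariate(i * mu)        # fixed delay plus earliest repair
        hits += x < y
    return hits / trials

exact = stage_probability(5, 2, lam=1.0, mu=2.0, tau=0.1)
estimate = simulate_stage(5, 2, lam=1.0, mu=2.0, tau=0.1)
print(f"closed form {exact:.4f}, simulated {estimate:.4f}")
```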

Step C — Put stages together → Loss rate and MTTDL

\mathrm{LossRate}(\tau) \approx r\lambda \prod_{i=1}^{r-1} \Big[ 1 - e^{-(r-i)\lambda \tau}\,\frac{i\mu}{(r-i)\lambda + i\mu} \Big]

\mathrm{MTTDL}_r(\tau) \approx \Big( r\lambda \prod_{i=1}^{r-1} \big[ 1 - e^{-(r-i)\lambda \tau}\,\tfrac{i\mu}{(r-i)\lambda + i\mu} \big] \Big)^{-1}

(If repairs are strictly serial, replace i\mu by \mu inside the brackets.)

Step D — First-order, fast-repair approximation (clean closed form)

When \mu\gg\lambda and (r-i)\lambda\tau \ll 1,

p_i(\tau) \;\approx\; (r-i)\lambda\!\left(\tau + \frac{1}{i\mu}\right).

Therefore

\text{LossRate}(\tau) \;\approx\; r!\,\lambda^{\,r}\;\prod_{i=1}^{r-1}\!\left(\tau + \frac{1}{i\mu}\right), \qquad \boxed{\text{MTTDL}_r(\tau) \;\approx\; \frac{1}{\,r!\,\lambda^{\,r}\, \displaystyle\prod_{i=1}^{r-1}\!\left(\tau + \frac{1}{i\mu}\right)} }.

Special cases:

Immediate rebuild (\tau=0, parallel repairs):

\displaystyle \boxed{\text{MTTDL}_r(0) \;\approx\; \frac{\mu^{\,r-1}}{\,r\,\lambda^{\,r}}}.

Degradation vs immediate start (independent of \lambda):

\frac{\mathrm{MTTDL}_r(\tau)}{\mathrm{MTTDL}_r(0)} \approx \prod_{i=1}^{r-1}\frac{1}{1 + i\,\mu\,\tau}
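The closed forms in this step are easy to cross-check against each other numerically. A sketch using the first-order formula only, with rates in any consistent time unit; the function names are illustrative.

```python
from math import factorial, prod

def mttdl_first_order(r, lam, mu, tau=0.0):
    """First-order MTTDL: 1 / (r! * lam^r * prod_i (tau + 1/(i*mu)))."""
    stages = prod(tau + 1 / (i * mu) for i in range(1, r))
    return 1 / (factorial(r) * lam**r * stages)

def delay_penalty(r, mu, tau):
    """MTTDL(tau) / MTTDL(0) = prod_i 1 / (1 + i*mu*tau); independent of lam."""
    return prod(1 / (1 + i * mu * tau) for i in range(1, r))

# Rates per day: one-day rebuilds (mu = 1) and a six-hour rebuild delay (tau = 0.25)
print(delay_penalty(5, mu=1.0, tau=0.25))
```

With these inputs the delay costs roughly a factor of 6.6 in MTTDL, which matches the drop between a zero and a six-hour delay in a first-order calculation.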

Step E — Durability over a horizon H

Treat data-loss events as rare (Poisson) with mean rate 1/\text{MTTDL}:

\boxed{D(H) \;=\; \Pr[\text{no loss in }H] \;\approx\; \exp\!\left(-\frac{H}{\text{MTTDL}_r(\tau)}\right)}.

For \text{MTTDL} \gg H, D(H) \approx 1 - H/\text{MTTDL} and nines \approx \log_{10}\!\big(\text{MTTDL}/H\big).

Side derivation — Why \lambda = -\ln(1-\text{AFR})

AFR is the probability a single replica fails within one year:

\text{AFR} = \Pr(T \le 1) = 1 - \Pr(T>1) = 1 - S(1).

From Step B, S(1)=e^{-\lambda\cdot 1}. Hence

\boxed{\text{AFR} = 1 - e^{-\lambda}} \quad\Longleftrightarrow\quad \boxed{\lambda = -\ln(1-\text{AFR})}.

(For small AFR, \lambda \approx \text{AFR} to first order.)
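The conversion between AFR and the hazard rate is a one-liner each way. The sketch uses `log1p` and `expm1` so that very small AFR values keep their precision; the function names are illustrative.

```python
from math import expm1, log1p

def afr_to_rate(afr):
    """lambda = -ln(1 - AFR), in failures per replica-year."""
    return -log1p(-afr)

def rate_to_afr(lam):
    """AFR = 1 - exp(-lambda)."""
    return -expm1(-lam)

for afr in (0.01, 0.05, 0.20):
    print(f"AFR {afr:.0%} -> lambda {afr_to_rate(afr):.5f} per year")
```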

That’s it: the constant-hazard assumption gives exponential per-replica lifetimes; independence makes the minimum exponential with summed rate, yielding the stage race p_i(\tau). Multiplying stages gives LossRate, its reciprocal is MTTDL, and then D(H)=e^{-H/\text{MTTDL}}.

Sample durability scenarios (all numbers are per object unless noted)

This leads to the following sample calculations for durability:

Scenario (inputs)	MTTDL (years)	Durability in 1 yr `D(1yr)`	“Nines” (≈ −log10(1−D))	Fleet time between losses (10k objs)
A0: r=5, AFR=5%, tx=1d, τ=0h	1.00×10^16	1−1.00×10^−16	16.00	1.00×10^12 yrs
A1: same, τ=6h	1.52×10^15	1−6.58×10^−16	15.18	1.52×10^11 yrs
A2: same, τ=12h	4.45×10^14	1−2.25×10^−15	14.65	4.45×10^10 yrs
A3: same, τ=24h	8.34×10^13	1−1.20×10^−14	13.92	8.34×10^9 yrs
B: r=5, AFR=1%, tx=1d, τ=0h	3.46×10^19	1−2.89×10^−20	19.54	3.46×10^15 yrs
C0: r=5, AFR=5%, tx=7d, τ=0h	4.19×10^12	1−2.39×10^−13	12.62	4.19×10^8 yrs
C1: same, τ=24h	1.27×10^12	1−7.87×10^−13	12.10	1.27×10^8 yrs
D: r=3, AFR=5%, tx=1d, τ=0h	3.29×10^8	0.999999996962	8.52	3.29×10^4 yrs
E: r=5, AFR=5%, tx=6h, τ=0h	2.56×10^18	1−3.91×10^−19	18.41	2.56×10^14 yrs

How to read this

MTTDL scales like μ^(r−1) / λ^r: faster repairs and more replicas give huge wins; worse AFR hurts steeply.
Rebuild delay τ only reduces durability (compare A0 → A3).
Fleet view: expected time between any loss for 10k objects is roughly MTTDL / 10,000.
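For anyone who wants to plug in their own inputs, here is a sketch of the whole pipeline using the Step C product (parallel repairs). It assumes a 365-day year, which is a guess that happens to reproduce the MTTDL column closely. The nines are taken as log10(MTTDL / H) rather than from 1 − D, because 1 − D computed in double precision is unreliable once it gets near 10^−16. Function names are hypothetical.

```python
from math import exp, log10, log1p

DAYS_PER_YEAR = 365.0   # assumption

def mttdl_years(r, afr, t_x_days, tau_hours=0.0):
    """MTTDL in years from the stage product r*lam * prod_i p_i(tau)."""
    lam = -log1p(-afr)                       # permanent failures per replica-year
    mu = DAYS_PER_YEAR / t_x_days            # repairs per year
    tau = tau_hours / 24 / DAYS_PER_YEAR     # rebuild delay in years
    loss_rate = r * lam
    for i in range(1, r):
        fail, repair = (r - i) * lam, i * mu
        loss_rate *= 1 - exp(-fail * tau) * repair / (fail + repair)
    return 1 / loss_rate

def nines(mttdl, horizon_years=1.0):
    """Approximate durability nines over the horizon, valid when MTTDL >> H."""
    return log10(mttdl / horizon_years)

scenarios = {
    "A0": (5, 0.05, 1, 0), "A3": (5, 0.05, 1, 24), "B": (5, 0.01, 1, 0),
    "C0": (5, 0.05, 7, 0), "D": (3, 0.05, 1, 0), "E": (5, 0.05, 0.25, 0),
}
for name, args in scenarios.items():
    m = mttdl_years(*args)
    print(f"{name}: MTTDL {m:.3g} yrs, nines {nines(m):.2f}, fleet of 10k {m / 10_000:.3g} yrs")
```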