Nice work with the spreadsheet @neo, though I noticed that some of your numbers are a bit off - we actually did calculate very similar things
Let’s denote the number of faulty nodes per section by f
, the number of all nodes in a section by N
, and the number of relaying nodes as m
(equal to N/3
rounded up). Also note that we expect f
to be strictly less than N/3
, so strictly less than m
, but we can also extend the calculations to it being equal to m
or greater.
First thing is that you assume the probability of failure in a single hop (i.e. the probability that all relaying nodes will be faulty) to be (f/N)^m
. That’s incorrect. Note that the probability should be 0 if m > f
(impossible that all relaying nodes will be faulty if there are more relaying nodes than faulty ones), but your equation gives a different result.
This is because you need to take into account that you aren’t randomly “drawing” nodes out of the section independently - every next draw is dependent on the previous ones. The correct expression turns out to be (f!(N-m)!)/(N!(f-m)!)
. If m > f
, the denominator breaks down (you get a factorial of a negative number), but the result should be 0, as I said above.
The second thing is the probability of message failure with 10 hops. You calculate it using the expected number of sections to fail, but there is a more accurate way of calculating that.
Note that message relay fails if at least one relay section fails. The probability of this is 1 - [prob. of all sections succeeding]
. Thus, it is: p = 1 - (1 - (f!(N-m)!)/(N!(f-m)!))^10
(prob. of success to the 10th power is the prob of all hops succeeding). Then you can calculate the expected number of retries until failure as 1/p
.
Try inputting that in your spreadsheet You will note that if m > f
(so if N/3 > f
), then the prob. of failure is 0, and all messages are certain to succeed. Only for f ≥ N/3
you get a nonzero probability of failure.
This is why we think this will allow us to drop acks and timeouts - as a failure due to a node being faulty is impossible if f < N/3
(other failures might be possible, like incorrect handling of relaying during a merge or split, but that is a separate topic).