Implementation details for the Dec 2020 testnet

Age is halved when a node leaves and later returns to the section. The node is also relocated when it rejoins (source).

let new_age = cmp::max(MIN_AGE, old_info.value.peer.age() / 2);

if new_age > MIN_AGE {
    // TODO: consider handling the relocation inside the bootstrap phase, to avoid
    // having to send this `NodeApproval`.
    commands.push(self.send_node_approval(old_info.clone(), their_knowledge)?);
    commands.extend(self.relocate_rejoining_peer(&old_info.value.peer, new_age)?);

    return Ok(commands);
}

I think there’s a reasonable debate to be had about whether age should be reduced by 2 (or even 1) rather than halved. Halving the age seems like a fairly big punishment, since it costs the node much more than half of the total work it has done.

For example, from the table below, a node of age 15 has done a total of 4095 ‘units’ of work. If it’s penalized it goes down to age 7, and it must then do 4080 units of work to get back to age 15, which is 99% of the work it had previously done.

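| Initial Age | Total Work | Halved Age | Work Lost | Portion Lost (%) |
| --- | --- | --- | --- | --- |
| 4 | 1 | 4 | 0 | 0 |
| 5 | 3 | 4 | 2 | 66 |
| 6 | 7 | 4 | 6 | 85 |
| 7 | 15 | 4 | 14 | 93 |
| 8 | 31 | 4 | 30 | 96 |
| 9 | 63 | 4 | 62 | 98 |
| 10 | 127 | 5 | 124 | 97 |
| 11 | 255 | 5 | 252 | 98 |
| 12 | 511 | 6 | 504 | 98 |
| 13 | 1023 | 6 | 1016 | 99 |
| 14 | 2047 | 7 | 2032 | 99 |
| 15 | 4095 | 7 | 4080 | 99 |
| 16 | 8191 | 8 | 8160 | 99 |
| 17 | 16383 | 8 | 16352 | 99 |
| 18 | 32767 | 9 | 32704 | 99 |
| 19 | 65535 | 9 | 65472 | 99 |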
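For anyone who wants to check or extend the table, here is a minimal sketch of how the numbers can be reproduced. It assumes that each age step costs twice as much work as the previous one and that a node at age 4 has done 1 unit, which is what the table implies but is not taken from the routing code. The function names are mine, not the library’s.

```rust
/// Minimum age, assumed to be 4 from the first row of the table.
const MIN_AGE: u8 = 4;

/// Total work needed to reach `age`, assuming work doubles with each age
/// step: 1 unit by age 4, 3 by age 5, 7 by age 6, and so on.
/// Only meaningful for `age >= MIN_AGE`.
fn total_work(age: u8) -> u64 {
    (1u64 << (age - MIN_AGE + 1)) - 1
}

/// Age after the rejoin penalty, mirroring `cmp::max(MIN_AGE, age / 2)`.
fn halved_age(age: u8) -> u8 {
    std::cmp::max(MIN_AGE, age / 2)
}

/// Units of work a node of `age` has to redo after being penalized.
fn work_lost(age: u8) -> u64 {
    total_work(age) - total_work(halved_age(age))
}

/// Portion of total work lost, as a whole percentage rounded down.
fn portion_lost_percent(age: u8) -> u64 {
    work_lost(age) * 100 / total_work(age)
}

/// One row per age: (age, total work, halved age, work lost, % lost).
fn penalty_table(ages: std::ops::RangeInclusive<u8>) -> Vec<(u8, u64, u8, u64, u64)> {
    ages.map(|a| (a, total_work(a), halved_age(a), work_lost(a), portion_lost_percent(a)))
        .collect()
}
```

Calling `penalty_table(4..=19)` gives the 16 rows above; for instance `work_lost(15)` is 4080 and `portion_lost_percent(15)` is 99.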

My feeling is the network would consist almost entirely of nodes aged 4-8, since it takes only a single penalty to massively set a node back. A node that gets to age 9 after 63 ‘days’ of work with no penalty is taken back to age 4 by a single mistake, removing 62 of those 63 ‘days’ of work. A node at age 19, with 65535 ‘days’ of work and no penalty, that then has two penalties in a row is suddenly back at age 4. This makes age a not very accurate measure for our purposes, and only allows extremely high uptime nodes to participate in a meaningful way.

There are a few options: increasing age could take less work than doubling each time, the penalty could be less extreme than halving age, or the conditions that trigger a penalty could be quite lenient. This is also discussed in this post: “virtually all age will be roughly within the range 7-20”.
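To get a feel for how much gentler a subtractive penalty would be, here is a rough sketch comparing halving against subtracting a fixed amount. It uses the same assumption as the table (work doubles per age step, 1 unit at age 4); the `Penalty` enum and function names are hypothetical and only exist for this comparison.

```rust
/// Minimum age, assumed to be 4 as in the table.
const MIN_AGE: u8 = 4;

/// Candidate penalties to compare (hypothetical, not from the routing code).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Penalty {
    /// Current behaviour: age is halved.
    Halve,
    /// Alternative: age is reduced by a fixed amount.
    Subtract(u8),
}

/// Total work to reach `age`, assuming work doubles with each age step.
/// Only meaningful for `age >= MIN_AGE`.
fn total_work(age: u8) -> u64 {
    (1u64 << (age - MIN_AGE + 1)) - 1
}

/// Age after applying `penalty`, never dropping below `MIN_AGE`.
fn apply(penalty: Penalty, age: u8) -> u8 {
    let reduced = match penalty {
        Penalty::Halve => age / 2,
        Penalty::Subtract(n) => age.saturating_sub(n),
    };
    reduced.max(MIN_AGE)
}

/// Percentage of total work lost under `penalty`, rounded down.
fn percent_lost(penalty: Penalty, age: u8) -> u64 {
    let before = total_work(age);
    let after = total_work(apply(penalty, age));
    (before - after) * 100 / before
}
```

Under these assumptions a node of age 15 loses 99% of its work when halved, 75% with `Subtract(2)` and 50% with `Subtract(1)`. So reducing age by 1 is the option that costs roughly half the total work, and two halvings in a row take age 19 down to 9 and then to 4.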

When is a member subject to the half-age penalty? After 30s of downtime. This is not as clear in the code as other features, but the path goes like this:

Nodes declare another node has gone offline using the Event::MemberLeft event, which is broadcast in handle_offline_event (source)

self.send_event(Event::MemberLeft {
    name: *peer.name(),
    age,
});

Agreement among elders is established using Vote::Offline in handle_peer_lost (source)

let info = info.clone().leave()?;
self.vote(Vote::Offline(info))

So when do those bits of code actually get run?

The comment for send_message_to_targets explains it (source)

/// Sends a message to multiple recipients. Attempts to send to `delivery_group_size`
/// recipients out of the `recipients` list. If a send fails, attempts to send to the next peer
/// until `delivery_group_size` successful sends complete or there are no more recipients to
/// try.
///
/// Returns `Ok` if all of `delivery_group_size` sends succeeded and `Err` if less than
/// `delivery_group_size` succeeded. Also returns all the failed recipients which can be used
/// by the caller to identify lost peers.
pub async fn send_message_to_targets(
...

If there’s any error in sending messages to another node, that node is declared as ‘lost’ and we start a vote for them to be declared as Vote::Offline. There’s not really a clean code snippet for this process to paste here, but this is where a failed node is recorded in the failed_recipients variable (source):

Err(_) => {
    failed_recipients.push(*addr);

    if next < recipients.len() {
        tasks.push(send(&recipients[next], msg.clone()));
        next += 1;      
    }
}
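Since the real function is hard to excerpt, here is a simplified, sequential sketch of the behaviour the doc comment describes. The real code pushes onto `tasks`, which suggests the sends run concurrently; this version tries recipients one at a time to keep the logic visible. The function name, signature and return type are my own assumptions, not the routing API.

```rust
/// Tries `recipients` in order until `delivery_group_size` sends have
/// succeeded or the list runs out.
///
/// Returns `(all_delivered, failed_recipients)`: whether the full delivery
/// group was reached, plus every recipient whose send failed. The caller
/// could treat the failed recipients as candidates for `Vote::Offline`.
fn send_to_targets<R: Clone, E>(
    recipients: &[R],
    delivery_group_size: usize,
    mut send: impl FnMut(&R) -> Result<(), E>,
) -> (bool, Vec<R>) {
    let mut successes = 0;
    let mut failed_recipients = Vec::new();

    for recipient in recipients {
        // Stop as soon as enough sends have succeeded.
        if successes == delivery_group_size {
            break;
        }
        match send(recipient) {
            Ok(()) => successes += 1,
            // Any error (for example a connection timeout) records the peer
            // as failed; the loop then moves on to the next recipient.
            Err(_) => failed_recipients.push(recipient.clone()),
        }
    }

    (successes == delivery_group_size, failed_recipients)
}
```

With recipients `[1, 2, 3, 4]`, a group size of 2 and a `send` that fails only for `2`, this returns `(true, vec![2])`: peers 1 and 3 receive the message, peer 4 is never tried, and peer 2 is reported as failed even though the send as a whole succeeded.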

What sort of errors can get us to execute this code block? I’m not sure of all of them; we’d need to dig into qp2p and quinn errors to find what can go wrong in the send(recipient, msg) function. But one thing that’s in there for sure is a connection timeout.

Timeout is set to 30s in qp2p/src/peer_config.rs

pub const DEFAULT_IDLE_TIMEOUT_MSEC: u64 = 30_000; // 30secs

But this is a configurable value, so nodes can set it to whatever they want. There may be some comms noise if nodes are tweaking this value. We can’t really do anything about people changing it, but we can a) set a sensible default and not make it too easy for people to change, and b) maybe introduce some anti-spam measures for Vote::Offline. Could spamming be punished by consensus, or is it something we can only react to locally? How about when a node disconnects repeatedly from just one other node to induce spam to all the others? It’s tricky…

It seems that as long as nodes don’t drop out for longer than 30s they’re safe from age demotion. Longer than that and they’ll be voted offline and subject to the rejoining penalty of a halved age. Maybe there’s some wiggle room if the node can return before the voting reaches consensus, but I can’t imagine that would give much extra time; I’d guess no more than 10s between starting and completing the vote.
