Implementation details for the Dec 2020 testnet

mav · December 29, 2020, 11:19pm

Age is halved when a node leaves and later returns to the section. The node is also relocated when it rejoins (source).

let new_age = cmp::max(MIN_AGE, old_info.value.peer.age() / 2);

if new_age > MIN_AGE {
    // TODO: consider handling the relocation inside the bootstrap phase, to avoid
    // having to send this `NodeApproval`.
    commands.push(self.send_node_approval(old_info.clone(), their_knowledge)?);
    commands.extend(self.relocate_rejoining_peer(&old_info.value.peer, new_age)?);

    return Ok(commands);
}

I think there’s a reasonable debate to be had about whether age should be reduced by 2 (or even 1) rather than halved. Because each age step takes twice the work of the one before, halving the age costs far more than half the total work done, which seems like a fairly big punishment.

For example, from the table below, a node of age 15 has done a total of 4095 ‘units’ of work. If it’s penalized it goes down to age 7. It must then do 4080 units of work to get back to age 15, which is 99% of the work it had previously done.

Initial Age	Total Work	Halved Age	Work Lost	Portion Lost (%)
4	1	4	0	0
5	3	4	2	66
6	7	4	6	85
7	15	4	14	93
8	31	4	30	96
9	63	4	62	98
10	127	5	124	97
11	255	5	252	98
12	511	6	504	98
13	1023	6	1016	99
14	2047	7	2032	99
15	4095	7	4080	99
16	8191	8	8160	99
17	16383	8	16352	99
18	32767	9	32704	99
19	65535	9	65472	99
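To make the table reproducible, here is a rough sketch of how its columns can be computed. It assumes MIN_AGE is 4 (inferred from the Halved Age column, not quoted from the code) and that total work doubles with each age step, starting at 1 unit for age 4. The function names are made up for illustration.

```rust
/// Assumed minimum age, inferred from the table (ages 5-9 all fall back to 4).
const MIN_AGE: u8 = 4;

/// Total units of work needed to reach `age`, assuming each age step costs
/// twice as much as the previous one. Expects `age >= MIN_AGE`.
fn total_work(age: u8) -> u64 {
    (1u64 << (age - MIN_AGE + 1)) - 1
}

/// Age after the rejoin penalty: halved, but never below MIN_AGE.
fn rejoin_age(age: u8) -> u8 {
    std::cmp::max(MIN_AGE, age / 2)
}

/// Units of work that must be redone to get back to `age` after one penalty.
fn work_lost(age: u8) -> u64 {
    total_work(age) - total_work(rejoin_age(age))
}

/// Work lost as a whole-number percentage of total work (truncated, not rounded).
fn portion_lost_percent(age: u8) -> u64 {
    work_lost(age) * 100 / total_work(age)
}
```

With these assumptions, work_lost(15) gives 4080 and portion_lost_percent(15) gives 99, matching the age 15 row. Percentages are truncated rather than rounded, which is why age 5 shows 66.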

My feeling is the network would consist almost entirely of nodes aged 4-8, since it takes only a single penalty to massively set a node back. A node gets to age 9 after 63 ‘days’ of work with no penalty, then a single mistake takes it back to age 4, removing 62 of those 63 ‘days’ of work. A node at age 19, with 65535 ‘days’ of work and no penalty, that takes two penalties in a row is suddenly back at age 4 (19 to 9, then 9 to 4). This makes age a poor measure for our purposes and only allows extremely high-uptime nodes to participate in a meaningful way.

There are a few options: increasing age could take less than double the work, the penalty could be less extreme than halving age, or the conditions that trigger a penalty could be quite lenient. This is also discussed in this post: “virtually all age will be roughly within the range 7-20”.
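To compare the penalty options, here is a small sketch that measures how much work each one throws away. It keeps the doubling-work model from the table and assumes MIN_AGE is 4; the Penalty enum and the function names are hypothetical and not part of the routing code.

```rust
/// Assumed minimum age, inferred from the table above.
const MIN_AGE: u8 = 4;

/// Hypothetical penalty rules to compare.
#[derive(Clone, Copy)]
enum Penalty {
    /// Current behaviour: age / 2.
    Halve,
    /// Alternative: age - n.
    Subtract(u8),
}

/// Total work to reach `age` under the doubling model. Expects `age >= MIN_AGE`.
fn total_work(age: u8) -> u64 {
    (1u64 << (age - MIN_AGE + 1)) - 1
}

/// Age after applying `penalty`, clamped so it never drops below MIN_AGE.
fn apply(penalty: Penalty, age: u8) -> u8 {
    let reduced = match penalty {
        Penalty::Halve => age / 2,
        Penalty::Subtract(n) => age.saturating_sub(n),
    };
    reduced.max(MIN_AGE)
}

/// Percentage of total work lost to one penalty (truncated, not rounded).
fn percent_lost(penalty: Penalty, age: u8) -> u64 {
    let before = total_work(age);
    let after = total_work(apply(penalty, age));
    (before - after) * 100 / before
}
```

Under this model, at age 15 halving loses 99% of the work, subtracting 2 loses 75%, and subtracting 1 loses 50%. So even the gentlest option here still costs about half the accumulated work, because the last age step alone is roughly half of the total.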

When is a member subject to the half age penalty? After 30s of downtime. This is not as clear in the code as other features, but the path is like this:

Nodes declare another node has gone offline using the Event::MemberLeft event, which is broadcast in handle_offline_event (source)

self.send_event(Event::MemberLeft {
    name: *peer.name(),
    age,
});

Agreement among elders is established using Vote::Offline in handle_peer_lost (source)

let info = info.clone().leave()?;
self.vote(Vote::Offline(info))

So when do those bits of code actually get run?

The comment for send_message_to_targets explains it (source)

/// Sends a message to multiple recipients. Attempts to send to `delivery_group_size`
/// recipients out of the `recipients` list. If a send fails, attempts to send to the next peer
/// until `delivery_group_size` successful sends complete or there are no more recipients to
/// try.
///
/// Returns `Ok` if all of `delivery_group_size` sends succeeded and `Err` if less than
/// `delivery_group_size` succeeded. Also returns all the failed recipients which can be used
/// by the caller to identify lost peers.
pub async fn send_message_to_targets(
...

If there’s any error in sending messages to another node, that node is declared as ‘lost’ and we start a vote for them to be declared as Vote::Offline. There’s not really a clean code snippet for this process to paste here, but this is where a failed node is recorded in the failed_recipients variable (source):

Err(_) => {
    failed_recipients.push(*addr);

    if next < recipients.len() {
        tasks.push(send(&recipients[next], msg.clone()));
        next += 1;      
    }
}
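Since the real function is async and spread over a few places, here is a simplified synchronous model of the same fallback logic. It is only a sketch: send_to_targets, SendError and the send closure are stand-ins I made up, and the real code runs the sends concurrently rather than one at a time. For sends that deterministically succeed or fail per recipient, the set of failed recipients comes out the same.

```rust
use std::net::SocketAddr;

/// Hypothetical error type standing in for whatever qp2p / quinn returns.
#[derive(Debug, PartialEq)]
struct SendError;

/// Tries recipients in order until `delivery_group_size` sends have succeeded
/// or the list runs out. Returns the overall result plus every recipient whose
/// send failed, which the caller could treat as lost peers.
fn send_to_targets<F>(
    recipients: &[SocketAddr],
    delivery_group_size: usize,
    mut send: F,
) -> (Result<(), SendError>, Vec<SocketAddr>)
where
    F: FnMut(&SocketAddr) -> Result<(), SendError>,
{
    let mut successes = 0;
    let mut failed_recipients = Vec::new();

    for addr in recipients {
        if successes == delivery_group_size {
            break;
        }
        match send(addr) {
            Ok(()) => successes += 1,
            // Any error at all (including a timeout) marks the peer as failed
            // and moves on to the next recipient.
            Err(SendError) => failed_recipients.push(*addr),
        }
    }

    let result = if successes == delivery_group_size {
        Ok(())
    } else {
        Err(SendError)
    };
    (result, failed_recipients)
}
```

The point to notice is that a recipient lands in failed_recipients on any error, and the overall result can still be Ok if enough later recipients succeed. So a node can be flagged as lost even when the message itself was delivered to a full group.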

What sort of errors can get us to execute this code block? I’m not sure of all of them; we’d need to dig into qp2p and quinn errors to find what can go wrong in the send(recipient, msg) function. But one thing that’s in there for sure is a connection timeout.

Timeout is set to 30s in qp2p/src/peer_config.rs

pub const DEFAULT_IDLE_TIMEOUT_MSEC: u64 = 30_000; // 30secs

But this is a configurable value, so nodes can set it to whatever they want. There may be some comms noise if nodes are tweaking this value. We can’t really do anything about people changing it, but we can a) set a sensible default and not make it too easy for people to change, and b) maybe introduce some anti-spam measures for Vote::Offline votes. Could spamming be punished by consensus, or is it something we can only react to locally? How about when a node disconnects repeatedly from just one other node to induce spam to all the others? It’s tricky…

It seems that as long as nodes don’t drop out for longer than 30s they’re safe from age demotion. Any longer than that and they’ll be voted offline and subject to the rejoining penalty of a halved age. Maybe there’s some wiggle room if the node can return before the voting reaches consensus, but I can’t imagine that would give much extra time; I wouldn’t expect more than 10s between starting and completing the vote.
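The timing argument can be written down as a tiny classifier. This is only a model of my reading, not of the actual code path: the Outcome enum and classify_outage are hypothetical, and the vote duration is a guess passed in by the caller.

```rust
use std::time::Duration;

/// Mirrors the 30s default idle timeout quoted from qp2p.
const DEFAULT_IDLE_TIMEOUT: Duration = Duration::from_millis(30_000);

/// Hypothetical outcomes for a node that drops out for a while.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Back before any peer's send could time out.
    Safe,
    /// A vote has probably started but may not have reached consensus yet.
    Uncertain,
    /// Voted offline; rejoining means a halved age and relocation.
    Penalized,
}

/// Classifies an outage, assuming the only trigger is the idle timeout and
/// that the Offline vote takes roughly `vote_duration` to complete.
fn classify_outage(
    downtime: Duration,
    idle_timeout: Duration,
    vote_duration: Duration,
) -> Outcome {
    if downtime <= idle_timeout {
        Outcome::Safe
    } else if downtime <= idle_timeout + vote_duration {
        Outcome::Uncertain
    } else {
        Outcome::Penalized
    }
}
```

With the default timeout and a guessed 10s vote, a 20s outage is Safe, a 35s outage is Uncertain, and a 60s outage is Penalized. Peers that lower their own timeout would shrink the Safe window from their point of view.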
