InstallNet [2023-06-19 Testnet] [Offline]

safenode v0.83.35 shows the following at the beginning of the log:

[2023-06-20T00:12:23.687797Z DEBUG safenode] Built with git version: 74528f5 / main / 74528f5.

Seems to align with this:
chore(release): sn_cli-v0.77.37/sn_node-v0.83.35/sn_testnet-v0.1.39

Confirmed that the md5 hash of the safenode binary downloaded by safeup indeed matches the release binary for v0.83.35 on the GitHub releases page:

safe-build-122:/tmp# cat /tmp/safenode | md5sum   <-- file downloaded from GitHub just now
84f97e95924da013f8afb569ee9d79ec  -
safe-build-122:/tmp# cat /usr/local/bin/safenode | md5sum   <-- original file downloaded by safeup
84f97e95924da013f8afb569ee9d79ec  -

The summary for safenode v0.83.35 itself on the GitHub Releases page shows:

Added

    attach payment proof when uploading Chunks

Fixed

    reduce churn weight to ~1/2mb

Other

    Revert "chore(release): sn_cli-v0.77.1/sn_client-v0.85.2/sn_networking-v0.1.2/sn_node-v0.83.1"
    (release) sn_cli-v0.77.1/sn_client-v0.85.2/sn_networking-v0.1.2/sn_node-v0.83.1
    Revert "chore(release): sn_cli-v0.77.1/sn_client-v0.85.2/sn_networking-v0.1.2/sn_protocol-v0.1.2/sn_node-v0.83.1/sn_record_store-v0.1.2/sn_registers-v0.1.2"
    (release) sn_cli-v0.77.1/sn_client-v0.85.2/sn_networking-v0.1.2/sn_protocol-v0.1.2/sn_node-v0.83.1/sn_record_store-v0.1.2/sn_registers-v0.1.2
    (logs) enable metrics feature by default
    log msg text updated
    making Chunk payment proof optional for now
    adding unit tests to payment proof utilities
    moving all payment proofs utilities into sn_transfers crate

safenode under ReplicationTestNet (last 5 seconds):

[2023-06-07T20:57:13.029193Z TRACE sn_networking::msg] Received request with id: RequestId(69653), req: Cmd(StoreChunk { chunk: Chunk { address: ChunkAddress(84ea47(10000100)..) }, payment: None })

[2023-06-07T20:57:13.029298Z TRACE sn_node::api] Handling request: Cmd(StoreChunk { chunk: Chunk { address: ChunkAddress(84ea47(10000100)..) }, payment: None })
[2023-06-07T20:57:13.029315Z DEBUG sn_node::api] That's a store chunk in for :84ea47(10000100)..
[2023-06-07T20:57:13.030114Z DEBUG sn_networking] Putting data as record, for Key(b"\x84\xeaG\x01\xec\xe5\x8bt\xee\x83\x16r\xf8\xfc\xfb\x8eG\x80\xe1[u\xb3\xc6\xb4\x9e;)\xdd\xb3\xd0\x01G") - length 512016
[2023-06-07T20:57:13.030297Z INFO safenode] Currently ignored node event ChunkStored(ChunkAddress(84ea47(10000100)..))

[2023-06-07T20:57:13.031114Z TRACE sn_record_store] PUT request for Record key: Key(b"\x84\xeaG\x01\xec\xe5\x8bt\xee\x83\x16r\xf8\xfc\xfb\x8eG\x80\xe1[u\xb3\xc6\xb4\x9e;)\xdd\xb3\xd0\x01G")
[2023-06-07T20:57:13.032597Z TRACE sn_record_store] Wrote record to disk! filename: 84ea4701ece58b74ee831672f8fcfb8e4780e15b75b3c6b49e3b29ddb3d00147

[2023-06-07T20:57:13.033018Z INFO sn_networking::event] Dialing 12D3KooWSxLmxbx3PmUDBkFmrov98Yt3xmYNwuqodzUaDZkSTe49
[2023-06-07T20:57:13.033153Z INFO sn_networking::event] Dialing 12D3KooWNxaZaV6Gj7BgPw5K1We3fL3aVaw3DGx1QY9PcpRhPbJT
[2023-06-07T20:57:13.033297Z INFO sn_networking::event] Dialing 12D3KooWRZX3otk1w3VfBBqgvTNJceni3muzidLr84q9FDKJ5aua

[2023-06-07T20:57:13.033455Z TRACE sn_networking::msg] ResponseSent for request_id: RequestId(69653) and peer: PeerId("12D3KooWC6wddzjx5YVWVJqUzkkFCcDSdTJywsXXittUCL7Z9EwR")

safenode under InstallNet (v0.83.35) (last 5 seconds):

[2023-06-20T14:14:01.590129Z INFO sn_networking::event] identify: Sent { peer_id: PeerId("12D3KooWGVYzDEYFDUh54uHmUxqWk5BV8asBq972HPxLyMPzN3LF") }
[2023-06-20T14:14:01.938048Z INFO sn_networking::event] identify: received info peer_id=12D3KooWMwN4GAeSwer4bpz1jyRPC6KaZTHCsANDZLgL1F1zP4iW info=Info { public_key: Ed25519(PublicKey(compressed): b4172eb27da3126bceec25dbe3e56d252ded9cb581b53e87d735ec3642c2553), protocol_version: "safe/0.1.10", agent_version: "safe/node/0.1.10", listen_addrs: ["/ip4/127.0.0.1/tcp/45087", "/ip4/10.16.0.109/tcp/45087", "/ip4/10.131.0.105/tcp/45087", "/ip4/178.62.88.143/tcp/45087", "/ip4/178.62.88.143/tcp/45087", "/ip4/10.131.0.105/tcp/45087"], protocols: ["/safe/1", "/ipfs/kad/1.0.0", "/ipfs/id/1.0.0", "/ipfs/id/push/1.0.0", "/libp2p/autonat/1.0.0"], observed_addr: "/ip4/X.X.X.X/tcp/X" }
[2023-06-20T14:14:01.946055Z INFO sn_networking::event] identify: received info peer_id=12D3KooWJSS9axpyKu5EK1XnL1vp5zVsLB1jpM7Xj5QhdMVuwzGw info=Info { public_key: Ed25519(PublicKey(compressed): 801ae79f5d3931132297265d8d478c3462cc35ca590a9be18ea86043bded0), protocol_version: "safe/0.1.10", agent_version: "safe/node/0.1.10", listen_addrs: ["/ip4/127.0.0.1/tcp/39853", "/ip4/178.62.17.148/tcp/39853", "/ip4/10.131.0.118/tcp/39853", "/ip4/10.16.0.122/tcp/39853", "/ip4/10.131.0.118/tcp/39853", "/ip4/178.62.17.148/tcp/39853"], protocols: ["/safe/1", "/ipfs/kad/1.0.0", "/ipfs/id/1.0.0", "/ipfs/id/push/1.0.0", "/libp2p/autonat/1.0.0"], observed_addr: "/ip4/X.X.X.X/tcp/X" }
[2023-06-20T14:14:01.949386Z INFO sn_networking::event] identify: received info peer_id=12D3KooWGVYzDEYFDUh54uHmUxqWk5BV8asBq972HPxLyMPzN3LF info=Info { public_key: Ed25519(PublicKey(compressed): 63301699a99f3b7192d1b47e135983013bfd935087942668947dcb47a23c), protocol_version: "safe/0.1.10", agent_version: "safe/node/0.1.10", listen_addrs: ["/ip4/127.0.0.1/tcp/42433", "/ip4/10.16.0.79/tcp/42433", "/ip4/10.131.0.75/tcp/42433", "/ip4/206.189.127.64/tcp/42433", "/ip4/206.189.127.64/tcp/42433", "/ip4/10.131.0.75/tcp/42433"], protocols: ["/safe/1", "/ipfs/kad/1.0.0", "/ipfs/id/1.0.0", "/ipfs/id/push/1.0.0", "/libp2p/autonat/1.0.0"], observed_addr: "/ip4/X.X.X.X/tcp/X" }
[2023-06-20T14:14:03.535628Z DEBUG sn_networking::event] KademliaEvent ignored: InboundRequest { request: FindNode { num_closer_peers: 8 } }

[2023-06-20T14:14:04.959688Z TRACE sn_record_store] PUT request for Record key: Key(b"2\x11\xe9\xcf\x9b-\xd5\xdc\xb7\xf1v\x84\xfa\xb9\x0fd\xb8\x18Q\xc1\xaaC[O(H\xe7/\x1b\x8a\x14\x01")
[2023-06-20T14:14:04.961429Z TRACE sn_record_store] Wrote record to disk! filename: 3211e9cf9b2dd5dcb7f17684fab90f64b81851c1aa435b4f2848e72f1b8a1401

[2023-06-20T14:14:04.961484Z DEBUG sn_networking::event] KademliaEvent ignored: InboundRequest { request: PutRecord { source: PeerId("12D3KooWJAeXzoG76Y8sSQsqhCpZto6HtcTNr23ygHDUrwso3ys7"), connection: ConnectionId(40744), record: None } }
[2023-06-20T14:14:06.077245Z DEBUG sn_logging::metrics] PID: 36112 {"physical_cpu_threads":4,"system_cpu_usage_percent":3.8849647,"system_total_memory_mb":6895.4355,"system_memory_used_mb":217.89491,"system_memory_usage_percent":3.1599874,"network":{"interface_name":"eth0","bytes_received":554785,"bytes_transmitted":548799,"total_mb_received":952.6531,"total_mb_transmitted":4599.871},"process":{"cpu_usage_percent":1.210898,"memory_used_mb":38.600704,"bytes_read":0,"bytes_written":532480,"total_mb_read":0.049152,"total_mb_written":122.200066}},

Note: I do see a ConnectionId, which might be new, but a scan for it reveals only one line in the entire log, and that is for a request of type PutRecord. Hmm.

2 Likes

I'm poking at this now, and it looks like we're no longer triggering NetworkEvent::PutRequest, which would have triggered some logging etc. (This is on main, at least with the aforementioned RecordHeader changes… But it's odd and may chime with what you've seen here :male_detective:).

Hopefully I can get this straightened out and we can get the logging story improved some.

(@chriso's JSON PR is in, btw, so that'll be available to you in the next release: feat: provide option for log output in json by jacderida · Pull Request #427 · maidsafe/safe_network · GitHub)

7 Likes

Thank you so much; fingers crossed that we find the root cause here! Even if the implementation has changed, I think it's important to have some concept of a request-received vs. request-sent count, so the duration can be calculated (and used for triaging) and tracked as a performance metric over the lifetime of the PID, as well as between testnets.

I am happy to change the parsing logic on my end if the request IDs (or their equivalent) are simply in a different format or under a different label going forward, or to compile safenode myself with a special --features flag, or a build without --release, if that is required to enable such functionality.

Also, I promise I didn't do anything too advanced other than parse the normal logs from the two testnets :stuck_out_tongue:. From what I can see, the sn_networking code base still has those trace lines for Received request and ResponseSent on main, so short of bisecting versions on my end or learning Rust quickly, I will wait for your feedback. Much appreciated! :smiley:
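
To make the duration idea concrete, here is a rough sketch of the kind of pairing I have in mind (not the actual parser I use). It assumes the log layout from the excerpts earlier in the thread, i.e. an ISO-8601 timestamp as the first token plus the "Received request with id" / "ResponseSent for request_id" lines; the file name is just a placeholder.

RequestDurations.cs
// Minimal sketch: pair "Received request" / "ResponseSent" log lines by RequestId
// and print the handling duration. Assumes the line layout shown in the excerpts
// above (ISO-8601 timestamp as the first token, RequestId(N) somewhere in the line);
// the log path is a placeholder.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text.RegularExpressions;

class RequestDurations
{
    static void Main()
    {
        var idPattern = new Regex(@"RequestId\((\d+)\)");
        var received = new Dictionary<string, DateTime>();

        foreach (var line in File.ReadLines("safenode.log")) // placeholder path
        {
            var m = idPattern.Match(line);
            if (!m.Success) continue;

            var id = m.Groups[1].Value;
            var timestamp = DateTime.Parse(line.Split(' ')[0].TrimStart('['),
                                           null, DateTimeStyles.RoundtripKind);

            if (line.Contains("Received request with id"))
                received[id] = timestamp;
            else if (line.Contains("ResponseSent for request_id") &&
                     received.TryGetValue(id, out var start))
                Console.WriteLine($"RequestId({id}) handled in {(timestamp - start).TotalMilliseconds} ms");
        }
    }
}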

4 Likes

It's great to see how productively the community and the MaidSafe devs are figuring out what's going on under the hood. Great work everyone, keep going, and we all deserve the better future we create together here.

15 Likes

Looking more into these lost nodes, and reading more about kad behaviour, what we're seeing may be expected for a given PeerId (especially if, as we're saying, they all connect to the same peer in the first instance).

As kad buckets fill up, new "close" nodes are only promoted when another peer becomes unresponsive. So I thiiink we might be concentrating too much in one corner of the network here with our startup scripts and what we suggest above.

It may also be that if we cannot get activity with a given PeerId, we just need to try with another… These are suppositions at the moment, but it seems to fit with what we're seeing and experiencing with connection closure.

(That Closed error maps to the other node closing due to InactivityTimeout, i.e. quiet connections, because there's nothing for a node to be doing as it's not in a bucket.)

As I say, that's a bit of guesswork at the moment; I'm trying to verify this some more just now.

10 Likes

Interesting. So, in terms of time on the current network of roughly 2000 nodes, at what point would we expect a node to rotate through different peers until it finds the right bucket, or the right set of nodes to call its home base, where there is some sort of heartbeat or activity, so it is not stuck in a continuous cycle of inactivity timeouts and connection connect/close churn?

It seems these connection closed/connected events went on for hours on end, at least on my node, after the initial connection to the first (and same) PeerId provided by the initial seed on safenode spin-up.

If connections are closing and reconnecting at 2500+ an hour, I am guessing it may have already walked all 2000+ unique peers, maybe even within the first hour or less.

FWIW, for the node I spun up yesterday, this is how fast it enumerated other peer IDs to connect and close with: the cumulative total of unique peer IDs it interacted with over the first 120 minutes is below (a rough counting sketch follows the list):

First 5 minutes: 141 unique peer ids
First 10 minutes: 392 unique peer ids
First 15 minutes: 530 unique peer ids
First 30 minutes: 1037 unique peer ids
First 60 minutes: 1572 unique peer ids
First 120 minutes: 1787 unique peer ids
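
For transparency, the counting above is nothing clever; a rough sketch of it is below. It assumes every interaction logs a PeerId("...") token and that the first token of each line is the timestamp (both simplifications of the real log), and the log path is a placeholder.

UniquePeers.cs
// Rough sketch of the counting done above: scan the log once, collect PeerId("...")
// occurrences, and report the cumulative unique count at a few elapsed-time cut-offs.
// The log path, and the assumption that every interaction logs a PeerId("...") token,
// are placeholders/simplifications.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text.RegularExpressions;

class UniquePeers
{
    static void Main()
    {
        var peerPattern = new Regex("PeerId\\(\"(12D3Koo[^\"]+)\"\\)");
        var cutoffs = new[] { 5, 10, 15, 30, 60, 120 }; // minutes since the first matching line
        var seen = new HashSet<string>();
        DateTime? start = null;
        int next = 0;

        foreach (var line in File.ReadLines("safenode.log")) // placeholder path
        {
            var m = peerPattern.Match(line);
            if (!m.Success) continue;

            var ts = DateTime.Parse(line.Split(' ')[0].TrimStart('['),
                                    null, DateTimeStyles.RoundtripKind);
            start ??= ts;

            while (next < cutoffs.Length && ts - start.Value > TimeSpan.FromMinutes(cutoffs[next]))
                Console.WriteLine($"First {cutoffs[next++]} minutes: {seen.Count} unique peer ids");

            seen.Add(m.Groups[1].Value);
        }
    }
}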

Keep up the good work on the deeper investigation :smiley:.

4 Likes

Does it mean that uniform data/load distribution is not possible with libp2p?
It's a pity there aren't many such libraries; otherwise it would be possible to "just try another".

1 Like

I think you are not quite getting how the even distribution works here. With 32-byte random numbers (basically what we have for data and node names), they will distribute evenly, and the larger the network, the more evenly this happens. With a small network the "evenness" is rough. It's just statistically how that works.

Think of it like this:
The 1st data item starts with a 0 bit.
The 2nd data item has a 50/50 chance of starting with 0 (so the split can be uneven).

So scale this up and you get a much more even distribution.
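
A rough way to quantify the "scale this up" step (my own back-of-envelope, not essential to the argument): among n random names, the fraction whose first bit is 0 is \hat{p} = \mathrm{Binomial}(n, 1/2) / n, so

    E[\hat{p}] = 1/2,    \sigma(\hat{p}) = \frac{1}{2\sqrt{n}},

i.e. the relative unevenness shrinks like 1/\sqrt{n}: roughly +/-16% at n = 10, but only about +/-1.6% at n = 1000.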

I think this is right. As a bucket fills, there is no space for another node, so we split; but if it is already split and there is no room left, the node won't be added to that bucket, though it may be added to other nodes' buckets and so on.

However, the closest bucket can fill to K nodes and won't overflow, as those are with certainty the actual k closest nodes. The more remote a bucket is from us, the more sparse its coverage (i.e. the furthest bucket can have K (20) nodes, and those nodes cover an entire 50% of the whole network).

That should not matter though, as kad ensures every hop gives us at least 2x the knowledge of that part of the network. At the end of the hops we are in the close K bucket of close nodes, and they will all agree on which are the closest nodes (barring churn etc.). A toy illustration of the bucket coverage follows.
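
To put a toy illustration on that coverage point (my own sketch, using 8-bit IDs instead of the real 256-bit ones): the bucket a peer falls into is given by the highest set bit of the XOR distance to us, and bucket i spans 2^i addresses, so the furthest bucket covers half of the whole space while the nearest buckets cover tiny slices.

BucketCoverage.cs
// Tiny illustration (8-bit IDs instead of the real 256-bit ones): the kad bucket
// for a peer is determined by the highest set bit of (ourId XOR peerId), and
// bucket i spans 2^i of the 2^8 addresses, so the furthest bucket covers half
// the whole address space while the nearest buckets cover tiny slices.
using System;

class BucketCoverage
{
    static int BucketIndex(byte ourId, byte peerId)
    {
        int distance = ourId ^ peerId;
        int index = -1;                  // -1 means "same id"
        while (distance > 0) { index++; distance >>= 1; }
        return index;
    }

    static void Main()
    {
        byte ourId = 0b_1010_0110;
        var counts = new int[8];         // how many of the 255 other IDs land in each bucket

        for (int id = 0; id < 256; id++)
            if (id != ourId)
                counts[BucketIndex(ourId, (byte)id)]++;

        for (int i = 0; i < 8; i++)
            Console.WriteLine($"bucket {i}: {counts[i]} addresses ({100.0 * counts[i] / 255:F1}% of the rest)");
    }
}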

11 Likes

I decided to make a simple simulation based on my understanding of how such randomness works.

Program.cs
using System;
using System.Collections.Generic;
using System.Linq;

namespace ConsoleApplication117
{
    class Program
    {
        static void Main(string[] args)
        {
            var r = new Random(12345);          // fixed seed so the run is reproducible

            int nodeCount = 40;                 // "small network"; multiply by 10 for the "large" one
            int recordCount = nodeCount * 1000; // 1000 records per node on average
            var nodeIds = new double[nodeCount];
            var nodeData = new List<double>[nodeCount];
            for (int i = 0; i < nodeCount; i++)
            {
                nodeIds[i] = r.NextDouble();    // node address: uniform in [0, 1)
                nodeData[i] = new List<double>();
            }

            var records = new double[recordCount];
            for (int i = 0; i < recordCount; i++)
                records[i] = r.NextDouble();    // record address: also uniform in [0, 1)

            // Assign each record to the node whose address is closest to it.
            for (int i = 0; i < recordCount; i++)
            {
                var record = records[i];
                var nodeId = nodeIds.OrderBy(id => Math.Abs(id - record)).First();
                nodeData[Array.IndexOf(nodeIds, nodeId)].Add(record);
            }

            // Print how many records each node ended up holding.
            Console.WriteLine("{" + string.Join(", ", nodeData.Select(nd => nd.Count.ToString()).ToArray()) + "}");
        }
    }
}

This program generates 40 "nodes" and spreads 40,000 "records" among them.
This corresponds to a small "network". To get a large "network", I multiplied these numbers by 10.
The resulting histogram for the data spread is:
[image: sn_random]

What I see from it is that the percentage of "nodes" that receive a low number of "records" does not depend on the "network" size.

If running a node farm with multiple nodes, then you'd get an average. Another reason small nodes are better. But probably stating the obvious here.

1 Like

Why not make a single node have several addresses?
It would allow having an "average" even for a single node.

Update: here is how the distribution of node record counts looks for a simulation with 10 addresses per node, compared to 1 address per node:
[image: sn_random2]

Code
using System;
using System.Collections.Generic;
using System.Linq;

namespace ConsoleApplication117
{
    class Node
    {
        const int addressesPerNode = 10;   // set to 1 to reproduce the single-address baseline

        public Node(Random r)
        {
            // Each node claims several independent, uniformly random addresses in [0, 1).
            Addresses = new double[addressesPerNode];
            for (int i = 0; i < Addresses.Length; i++)
                Addresses[i] = r.NextDouble();
            Data = new List<double>();
        }

        public double[] Addresses;
        public List<double> Data;
    }

    class Program
    {
        static void Main(string[] args)
        {
            var r = new Random(12345);

            int nodeCount = 400;
            int recordCountPerNode = 1000;
            int recordCount = nodeCount * recordCountPerNode;
            var nodes = new Node[nodeCount];
            var addresses = new List<double>();
            var addressToNode = new Dictionary<double, Node>();
            for (int i = 0; i < nodeCount; i++)
            {
                var n = new Node(r);
                nodes[i] = n;
                addresses.AddRange(n.Addresses);
                for (int j = 0; j < n.Addresses.Length; j++)
                    addressToNode.Add(n.Addresses[j], n);
            }

            var records = new double[recordCount];
            for (int i = 0; i < recordCount; i++)
                records[i] = r.NextDouble();

            // Assign each record to the node owning the closest address;
            // a node's total load is then summed over all of its addresses.
            for (int i = 0; i < recordCount; i++)
            {
                if (i % recordCountPerNode == recordCountPerNode - 1)
                    Console.Write(".");        // simple progress indicator
                var record = records[i];
                var minDelta = double.MaxValue;
                var closestAddress = double.MaxValue;
                for (int j = 0; j < addresses.Count; j++)
                {
                    double delta = Math.Abs(addresses[j] - record);
                    if (delta < minDelta)
                    {
                        minDelta = delta;
                        closestAddress = addresses[j];
                    }
                }
                addressToNode[closestAddress].Data.Add(record);
            }

            Console.WriteLine();
            Console.WriteLine("{" + string.Join(", ", nodes.Select(nd => nd.Data.Count.ToString()).ToArray()) + "}");
        }
    }
}

Hey, what is in that config.json file?

I couldn't find any reference to that in the current code. Is this something that was perhaps left behind from a previous testnet?

2 Likes

I kick-started a debug build of safenode v0.83.35 yesterday, and FWIW I see the following there this morning (more to analyze on my end):

Logs finally arrived, though with more follow-up questions below:

[2023-06-21T06:52:30.473280Z TRACE sn_networking::msg] Received request with id: RequestId(115602), req: Cmd(Replicate { holder: NetworkAddress::PeerId([0, 36, 8, 1, 18, 32, 40, 2, 147, 68, 94, 29, 194, 150, 152, 23, 97, 24, 94, 198, 219, 169, 4, 154, 200, 83, 40, 230, 24, 240, 166, 142, 78, 18, 24, 125, 250, 160]) })
[2023-06-21T06:52:30.480971Z TRACE sn_networking::msg] ResponseSent for request_id: RequestId(115602) and peer: PeerId("12D3KooWCWYkhhmQgTnyHEPDRKFxDJ9rC7w5cqzHGp9PZsfXbAqD")

[2023-06-21T10:18:13.128360Z TRACE sn_networking::msg] Received request with id: RequestId(153974), req: Cmd(StoreChunk { chunk: Chunk { address: ChunkAddress(159222(00010101)..) }, payment: None })
[2023-06-21T10:18:13.141095Z TRACE sn_networking::msg] ResponseSent for request_id: RequestId(153974) and peer: PeerId("12D3KooWMKk6BYvzSgfPFysrRFzcRRh3traoBUG6e2zcnkcMs8mT")

Interesting. Are we expecting a slew of different Cmd(Replicate|StoreChunk|DBCs|... commands to distinguish the source of the PUT going forward?
I am guessing this is all a work in progress, so I will keep observing the logs once the dust settles, I suppose.

Additional Note:
Unlike the previous testnet, where I saw a 1:1 correlation of InboundFailure messages associated with Cmd(Replicate Response Received, I don't see any InboundFailure messages in the logs alongside the above Cmd(Replicate lines! The PRs (377 and 369) mention OutboundFailures, but maybe the fix addressed the InboundFailure messages too. Nice!

EDIT:

  • Adjusted the timeline from the exact time of safenode spin-up to now.

  • Seems Cmd(StoreChunk requests did finally get recorded, but nowhere near the 678 Chunks Written, unlike the prior testnet. Hmm, :thinking:?

  • My MISC bucket (more of a catch-all) did catch 671 unique request_ids, which, added to the 8 received/responded request IDs for Cmd(StoreChunk, ends up coincidentally at 679 unique request IDs, super close to the 678 Chunks Written (Wrote record to disk) messages.

  • Also interesting to note: the CHUNK_RECEIVED_FOR_REPLICATION messages are at 670.

  • @joshuef - In summary, if I am interpreting this properly, the node likely received the replicated data (chunks) 670 times (there must have been 670+ requests here) and stored them on disk as records via PUTs. This was in addition to handling 8 Cmd(StoreChunk requests.

    What I am not sure about, then, is why neither Received request for Cmd(Replicate (19) nor Received request for Cmd(StoreChunk (8) is anywhere close to 670-671. Is there a third classification missing here, or should the two combined have added up to >= 670 + 8 = 678 (the total records written to disk), assuming all these requests go through the sn_networking msg request/received layer? (A rough tally sketch of how I count these is below.)
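
For reference, the tally above comes from nothing fancier than counting a handful of marker strings; a minimal sketch of that is below (the marker strings follow the log excerpts quoted in this thread, and the log path is a placeholder).

MessageTally.cs
// Rough tally sketch for the comparison above: count occurrences of a few
// marker strings across the log and print the totals side by side. A line can
// match more than one marker; the marker strings follow the log excerpts
// quoted in this thread, and the log path is a placeholder.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class MessageTally
{
    static void Main()
    {
        var markers = new[]
        {
            "Received request with id",   // all incoming requests at the msg layer
            "Cmd(StoreChunk",
            "Cmd(Replicate",
            "ResponseSent for request_id",
            "Wrote record to disk",
        };
        var counts = markers.ToDictionary(m => m, _ => 0);

        foreach (var line in File.ReadLines("safenode.log")) // placeholder path
            foreach (var marker in markers)
                if (line.Contains(marker))
                    counts[marker]++;

        foreach (var kv in counts)
            Console.WriteLine($"{kv.Key}: {kv.Value}");
    }
}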

5 Likes

If you look at SHA (secure hash algorithm), the output is secure, i.e. unlikely to collide, i.e. particularly random, much like the output of a good encryption cypher.

So simulating randomness here is probably best done by using sha3_256 on many different input values. The principle of that randomness is that it spreads evenly across the address range. This is regardless of nodes; it is just data.

Then take node addresses, the ed25519 public key; it will have much the same distribution.

If you then look deeper, the even spread is more evident with a larger population of addressable items (nodes or data).

Here is an explanation of the even distribution and why it is so:

SHA-3 256 is a specific variant of SHA-3 that produces a 256-bit (32 bytes) digest. A good cryptographic hash function has the following properties:

Determinism: Given the same input, it always produces the same output.
Preimage Resistance: It's computationally infeasible to regenerate the original input value from a hash output.
Small Changes in Input, Big Changes in Output: Even a small modification in the input will cause a drastic change in the output.
Collision Resistance: It's computationally hard to find two different inputs that hash to the same output.
In terms of your question, the reason why SHA-3 256 produces values that are evenly distributed across the address range has to do with the third and fourth properties above.

Hash functions like SHA-3 are designed to distribute the possible hash values as evenly as possible across the available space. This is related to the concept of the "avalanche effect," where a small change in input should cause drastic changes in the output. If the hash values weren't evenly distributed, then you could start to predict the hash value based on the input, which would reduce the security of the hash function.

This uniform distribution also helps to minimize collisions. Ideally, a hash function would have no collisions at all, but given the finite length of the hash output and the infinite number of possible inputs, collisions are inevitable. However, by distributing the hash values evenly, SHA-3 minimizes the likelihood of these collisions.

So, in summary, the even distribution of hash values across the address range for SHA-3 256 (or any good cryptographic hash function) helps to ensure the integrity and security of the function.
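
A quick way to see that even spread in practice is the sketch below. It uses SHA-256 from System.Security.Cryptography as a stand-in (SHA3-256 class availability depends on the .NET version and OS, and the distribution argument is the same): hash a large number of sequential inputs, bucket each digest by its first byte, and check how flat the 256 bucket counts come out.

HashSpread.cs
// Quick illustration of the even-spread claim (a sketch, using SHA-256 as a
// stand-in for SHA3-256): hash many sequential inputs, bucket each digest by
// its first byte, and look at how flat the 256 bucket counts are.
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

class HashSpread
{
    static void Main()
    {
        const int samples = 1_000_000;
        var buckets = new int[256];

        using var sha = SHA256.Create();
        for (int i = 0; i < samples; i++)
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes("input-" + i));
            buckets[digest[0]]++;   // first byte of the digest picks 1 of 256 buckets
        }

        // With ~3906 expected per bucket, min and max should differ from that by only a few percent.
        Console.WriteLine($"expected per bucket: {samples / 256}");
        Console.WriteLine($"min: {buckets.Min()}  max: {buckets.Max()}");
    }
}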

10 Likes

Minimizes but does not eliminate these collisions; what are the practical effects of such a collision?
Getting the wrong chunk returned?
Access denied to your data?

5 Likes

AFAIK there has never been a collision in the 256-bit address space. It's akin to you recreating Satoshi's wallet.
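
Some rough arithmetic for scale (my own, not from the post): by the birthday bound, for n uniformly random 256-bit values

    P(\text{collision}) \approx \frac{n^2}{2^{257}},

so even after generating n = 2^{80} addresses the chance of any collision anywhere is about 2^{160-257} = 2^{-97}.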

7 Likes

True, but at some point, at some time, every possible variation of the incredibly unlikely is inevitable.

Just hope I get my Lambo before a bowl of petunias materialises 10m above my head…

4 Likes

C# Random has the same distribution too. The details are not important here.

What I was showing is how two such random sources (one specifying node addresses, the other chunk addresses) interact with each other.
And the resulting spread is not uniform, despite the sources individually being uniform.

That's why I showed the code: anyone can reproduce my results and think about why two uniform distributions combined end up non-uniform (a short note on this follows below).
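
One way to see it (my own note): for n node addresses placed uniformly at random on [0, 1) treated as a ring, the gaps between neighbouring addresses are approximately i.i.d. exponential with mean 1/n. A node's share of records (the region closer to it than to any other node) is half its left gap plus half its right gap, i.e. roughly

    \text{share} \approx \tfrac{1}{2}(G_{\text{left}} + G_{\text{right}}),    G \sim \mathrm{Exp}(n),

so E[\text{share}] = 1/n, but the coefficient of variation stays near 1/\sqrt{2} \approx 0.71 no matter how large n gets: the histogram keeps its shape as the network grows, only its scale changes. Giving each node several addresses (as in the second simulation) averages more independent gaps, which is why that curve is tighter.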

Here is the video which describes why it is not important:

3 Likes

Are you saying you have two distributions that are each evenly spread across an address range, and when you put them together, you get an uneven distribution? That does seem counterintuitive indeed.

3 Likes