SAFE Network concerns from an old employee

Sorry in advance for the very long post …

Something I noticed too. It is a quirk of StackOverflow: I listed the date my contract ended, but it always shows the most recent position as current even after it has ended, and I cannot remove that line.

My employment status is irrelevant to my criticisms - attacking credentials is a flawed way to refute claims or ideas. FWIW, after much patience I was given an offer for a dream remote-working position. Unfortunately, it is contingent on a contract approval, and several months of waiting for any resolution have taught me that I severely misjudged the glacial pace of non-governmental bureaucracy. I expected governmental bureaucracy to be this slow, but I have learned that private businesses can also move very slowly.

Perhaps I should list boost::fusion as a current role; I have recently spent a significant amount of time “normalizing” the C++11 variadic containers to match C++03 behavior. Actually, being paid to work on Boost would be my true dream position, but that sounds like some unicorn nonsense.

I incorrectly assumed that all of my points, with the exception of the nonce reuse, were already known to the team. I only noticed the nonce reuse last week when looking at the code for the first time since I left.

The first major problem I perceive involves p2p quorum systems in general. Numerous weekly updates have mentioned quorum + churn or quorum + segmentation, so I thought the difficulty was known. There is also freely available literature on the internet, which I think describes the problems in this environment reasonably well.

I think the second major problem is related to the close group attack mentioned on Google Groups, but I failed to bring another thought to attention (I will take the blame for that). The “closest” public-key in terms of XOR space is determined by the longest common prefix with the target address. This indicates to me that the person with the most computing power can generate the highest number of closest keys, and this person will be capable of setting any desired value immediately. You indicate that there is more to the solution - can you point me to a file in the codebase or a greppable search term? The most recent update suggests that the security implementation is still in flux: “Now that Andreas is back, the Routing team is discussing the immediate options and measures we can take to make groups of nodes more secure (there are many approaches here and the implementation schedule is important to get right). - i.e. prevent an attacker from getting enough nodes to reach quorum and control group consensus -, what level of group security can be achieved, and if necessary, how the network can deal with malicious groups.”
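To make the grinding concern concrete, here is a toy sketch (my own illustration, not MaidSafe code): 64-bit IDs and a cheap mixer stand in for hashing freshly generated public keys, but the point carries over - whoever can afford the most iterations ends up with IDs sharing the longest prefix with the target address.

```rust
// Toy sketch: XOR "closeness" is just the longest shared prefix, so raw
// compute lets an attacker grind node IDs closer to a chosen target than
// honest nodes. IDs here are 64-bit toys; real addresses are 256/512-bit hashes.

fn xor_distance(a: u64, b: u64) -> u64 {
    a ^ b // smaller value == longer common prefix == "closer" in XOR space
}

fn main() {
    let target: u64 = 0xDEAD_BEEF_CAFE_F00D; // address the attacker wants to surround
    let honest_node: u64 = 0x1234_5678_9ABC_DEF0;

    // Brute-force "key generation": a stand-in for repeatedly creating
    // keypairs and hashing the public key until the ID lands close enough.
    let mut best = honest_node;
    let mut seed: u64 = 0x9E37_79B9_7F4A_7C15;
    for _ in 0..1_000_000 {
        // cheap deterministic mixer standing in for hash(new_public_key)
        seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        if xor_distance(seed, target) < xor_distance(best, target) {
            best = seed;
        }
    }

    println!("honest distance:   {:064b}", xor_distance(honest_node, target));
    println!("attacker distance: {:064b}", xor_distance(best, target));
    // The more iterations (compute) the attacker spends, the more leading
    // zero bits the distance has, i.e. the closer the generated ID sits.
}
```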

I notified the team directly of the subdirectory issue while I was under contract. This is likely a rare occurrence because Tahoe-LAFS appears to have a similar design in this situation. However, major cloud storage services (Google, Amazon, Microsoft) do not allow subdirectories, and the Tahoe-LAFS FAQ (Q31) recommends only one writer at a time in a directory. Their documentation goes further to state that no mutable file/directory should have multiple writers. Tahoe-LAFS users have yet to complain about forked-off readers, so maybe I made too much of it. But if Maidsafe does not recommend one writer per directory subtree, then forked-off writers are possible too, which adds to the complications.

The convergent encryption issue is mentioned in a whitepaper (no longer available on maidsafe.net) co-authored by a current employee. I also mentioned it in a post in an intriguing thread on this forum. The proposed remedy is for the user to symmetrically encrypt prior to self-encryption, but it is unlikely many users will be aware of when they should do this or how to do it properly. Luckily the problem should be rare. However, Tahoe-LAFS no longer does system-wide convergent encryption after discovering the potential weakness. If the intent was to reduce storage requirements in exchange for lesser privacy in a small number of cases, then the tradeoff seems peculiar given the project's high priority on privacy.
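For anyone unfamiliar with why convergent encryption leaks information, here is a toy sketch (my own illustration, not the self-encryption code; a toy FNV-1a hash and an XOR keystream stand in for the real primitives): because the key is derived from the content, identical plaintexts always produce identical ciphertexts, so anyone who can guess a file can confirm it exists on the network.

```rust
// Toy sketch of convergent encryption: key = H(content), so encryption is
// deterministic across users.

fn toy_hash(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100_0000_01b3); // FNV-1a prime
    }
    h
}

fn convergent_encrypt(plaintext: &[u8]) -> Vec<u8> {
    let key = toy_hash(plaintext); // the convergent step: key derived from content
    plaintext
        .iter()
        .enumerate()
        .map(|(i, &b)| b ^ key.to_le_bytes()[i % 8])
        .collect()
}

fn main() {
    let alice = convergent_encrypt(b"leaked-report.pdf contents");
    let bob = convergent_encrypt(b"leaked-report.pdf contents");
    assert_eq!(alice, bob); // identical plaintext => identical ciphertext

    // Confirmation-of-file attack: encrypt a guessed plaintext and compare
    // against chunks already stored on the network.
    let guess = convergent_encrypt(b"leaked-report.pdf contents");
    println!("guess matches stored chunk: {}", guess == alice);
}
```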

The key/nonce reuse I only noticed this past week, but I notified the team of similar problems with unauthenticated AES-CFB with key/IV reuse in the C++ codebase. It is a relatively simple misapplication of cryptographic primitives, but the constant code re-writing by the project creates a high probability of future crypto problems. Also, whoever wrote the hybrid_encryption function does not understand how libsodium public-key encryption works - it already generates a shared secret using ECC-DH and then uses that secret as the key to XSalsa with the provided nonce, so the separate call to XSalsa is superfluous in this situation. As stated on the mailing list, plaintext recovery should be difficult due to the content being encrypted. But a serialization change (padding or new fields) could reduce the security. Since a randomly generated nonce can be stored unencrypted, there is no reason to take the chance.
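For reference, here is a minimal sketch of how I would expect the libsodium box construction to be used, assuming the sodiumoxide bindings (this is not the project's hybrid_encryption code): crypto_box already performs the Curve25519 DH and the XSalsa20-Poly1305 step internally, and a fresh random nonce can be shipped in the clear next to the ciphertext.

```rust
// Sketch assuming the sodiumoxide crate: no extra XSalsa pass is needed on
// top of box_, and the nonce is random per message and stored unencrypted.
use sodiumoxide::crypto::box_;

fn main() {
    sodiumoxide::init().expect("libsodium init failed");

    let (sender_pk, sender_sk) = box_::gen_keypair();
    let (receiver_pk, receiver_sk) = box_::gen_keypair();

    let plaintext = b"serialized message";

    // Fresh random nonce for every message; storing it unencrypted next to
    // the ciphertext is fine, reusing it with the same key pair is not.
    let nonce = box_::gen_nonce();
    let ciphertext = box_::seal(plaintext, &nonce, &receiver_pk, &sender_sk);

    let recovered = box_::open(&ciphertext, &nonce, &sender_pk, &receiver_sk)
        .expect("authentication failed");
    assert_eq!(recovered, plaintext.to_vec());
}
```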

I guess this was not clear; @Josh understood the intended meaning of this paragraph. After re-reading the statement, I wish to have worded it differently. I think a trustless p2p database is vastly more difficult than it first appears. I do not think the success of this project should be assumed; there is a risk of failure and people should be aware of that. I doubt the Maidsafe team will be able to achieve their lofty goals, but I do not wish failure on the team. And obviously my thoughts on the probability of success are subjective.

But not impossible. Which means that you must: (1) accept the risk that a resource (folder, coin, etc.) could theoretically be “locked” forever or require a trusted party to “fix” it; (2) allow for inconsistent writes on the network (forks); or (3) accept both. This is outlined by the CAP theorem, and a 2012 follow-up by the author gives a good summary of the known variability between the consistency and availability extremes. [Note: according to the theorem you can also get consistency and availability if you rule out partitions, but that would require a computer network that never drops data, so in current real-world systems “P” is “chosen by default” in the 2-of-3 theorem.]

It is not clear from your statement which path Maidsafe has selected, or whether Maidsafe believes they have invalidated the theorem. I think you have decided that you can engineer a very small probability of the resource being locked (“infeasible to lose the whole group”). Unfortunately, maintaining consistency while keeping availability high is difficult, and trustless p2p systems increase that difficulty. The obvious techniques: (1) super-nodes with higher reliability; (2) larger groups; (3) more frequent faulty-node detection messaging. If you know of more techniques, or have created a new one, please direct me to some reading material.

Super nodes (1) have to be selected somehow, and Maidsafe previously had algorithms for selecting more reliable peers for group selection. The algorithms have disappeared and I cannot recall ever seeing an implementation. Larger groups (2) have the negative side-effect of requiring more messages (bandwidth) as the group size increases. Additionally, the number of messages required for quorum consensus algorithms to converge will generally increase as the number of faulty or malicious nodes in the group increases (non-responding nodes are faulty, so a large group + churn is bad). More frequent faulty-node detection messages (3) require more bandwidth. If the check frequency is taken to the extreme for rapid failure detection and replication, it increases the probability of false positives and thereby instigates even more messages that spuriously update group membership information.
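A back-of-the-envelope sketch of point (2), using my own numbers and assuming BFT-style all-to-all agreement inside a group (the project may use something lighter): messages per decision grow roughly quadratically with group size, while the number of tolerated faults only grows linearly.

```rust
// Rough illustration of why larger groups cost bandwidth: all-to-all
// exchanges scale with n*(n-1), while tolerated faults follow n >= 3f + 1.
fn main() {
    for &group_size in [4u32, 8, 16, 32, 64].iter() {
        let messages_per_round = group_size * (group_size - 1); // all-to-all
        let tolerated_faults = (group_size - 1) / 3;            // n >= 3f + 1
        println!(
            "group {:>2}: ~{:>4} messages/round, tolerates {} faulty nodes",
            group_size, messages_per_round, tolerated_faults
        );
    }
}
```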

I have found some interesting research papers on p2p quorum databases. One method varies read and write consistency to assess the availability and messaging tradeoffs. Another uses a hierarchy-based quorum to reduce the messaging necessary to reach quorum. The latter seems more useful for Maidsafe.
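For the first method, the usual knob is the read/write quorum size: with N replicas, requiring R + W > N (and 2W > N) keeps reads consistent with the latest write, at the cost of more messages and less tolerance for unreachable nodes. A small sketch, using my own numbers rather than anything from the papers:

```rust
// Read/write quorum tradeoff: larger quorums buy consistency, smaller ones
// buy availability and fewer messages.
fn overlapping(n: u32, r: u32, w: u32) -> bool {
    r + w > n && 2 * w > n // read/write and write/write quorums intersect
}

fn main() {
    let n = 8; // replicas in the close group
    for &(r, w) in [(1u32, 8u32), (4, 5), (5, 5), (8, 1)].iter() {
        println!(
            "R={} W={}: consistent reads = {}, write survives {} unreachable nodes",
            r,
            w,
            overlapping(n, r, w),
            n - w
        );
    }
}
```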

The 4 remaining nodes cannot tell the difference between nodes that have lost their connection and a network partition. So if the network allows writes to occur in that situation, it is possible for writes to occur in both partitions. This will result in a data conflict. Bitcoin has the same limitation, and the nodes agree that the longest chain is the version to take. Other designs prevent writing (write availability loss) while still allowing reading (read consistency loss due to possible stale data). Yet other designs prevent reading and writing (write/read availability loss), but have a single history timeline and never return stale data.

There are ways to play with this a bit, but the important takeaway is that no existing distributed computing literature provides a method that guarantees 100% availability and 100% consistency over a fault-capable network. This is not catastrophic, as Bitcoin does not guarantee 100% consistency either. However, I do not believe Maidsafe has found a way to make a trustless distributed database sufficient for its goals. But this is hard to accurately judge; Maidsafe keeps changing some of its core algorithms, which makes them harder to analyze.

“Probably all over the world” != “guaranteed to be all over the world”. It should be possible to use the IP address prefix to reduce the probability. This technique would be inaccurate if the node connected to the network through a VPN or similar.
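A hypothetical helper for the prefix idea (not code from the project) might look like this - require distinct /16 prefixes within a group, accepting that VPNs and NATs make it a probabilistic measure at best:

```rust
// Require that group members come from distinct /16 address prefixes to
// lower the odds that one operator or location controls the whole group.
use std::collections::HashSet;
use std::net::Ipv4Addr;

fn prefixes_are_distinct(members: &[Ipv4Addr]) -> bool {
    let mut seen = HashSet::new();
    members.iter().all(|ip| {
        let octets = ip.octets();
        seen.insert((octets[0], octets[1])) // /16 prefix
    })
}

fn main() {
    let suspicious = [
        Ipv4Addr::new(203, 0, 113, 10),
        Ipv4Addr::new(203, 0, 113, 11), // same /16 as above
        Ipv4Addr::new(198, 51, 100, 7),
    ];
    println!("prefix-diverse group: {}", prefixes_are_distinct(&suspicious));
}
```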

The detection is not instant; that would imply a network and OS with zero latency, which is non-existent. I stated this once above - as the heartbeat interval is lowered (checks become more frequent), the necessary bandwidth increases and so does the rate of false positives. And even if the detection were instantaneous, replication is not.
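A toy illustration of the tradeoff, with made-up latencies: a heartbeat that merely arrives late is indistinguishable from a dead node, so shrinking the timeout buys faster detection at the price of spurious failure verdicts.

```rust
// Toy failure detector: count how often a live-but-jittery node gets
// declared dead for a given suspicion timeout.
fn main() {
    // Observed heartbeat inter-arrival times in ms for a live but jittery node.
    let arrivals_ms = [105u32, 98, 310, 101, 250, 99];

    for &timeout_ms in [150u32, 400].iter() {
        let false_positives = arrivals_ms.iter().filter(|&&gap| gap > timeout_ms).count();
        println!(
            "timeout {:>3} ms: {} spurious 'node failed' verdicts out of {} heartbeats",
            timeout_ms,
            false_positives,
            arrivals_ms.len()
        );
    }
}
```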

It's also worth mentioning that heartbeats and ping-ack cannot be used to synchronize multiple writers to the same resource. Messages can be received in a different order from different sources, so a method to reach ordering consensus is needed if this is a desired feature.
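A tiny illustration of the ordering problem (not any particular Maidsafe data type): two replicas of a last-write-wins register receive the same two writes in different orders and silently diverge.

```rust
// Two replicas apply the same writes in different orders and end up with
// different values, without either one ever detecting a fault.
fn apply_writes(initial: &str, writes: &[&str]) -> String {
    // "last write wins" register: each write simply replaces the value
    writes.iter().fold(initial.to_string(), |_, w| w.to_string())
}

fn main() {
    let replica_a = apply_writes("empty", &["alice's edit", "bob's edit"]);
    let replica_b = apply_writes("empty", &["bob's edit", "alice's edit"]);
    println!("replica A: {}", replica_a);
    println!("replica B: {}", replica_b);
    assert_ne!(replica_a, replica_b); // same writes, different order, forked state
}
```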

The first sentence is a response to the statement, “network partition prevention”, in an update. A 100% solution would require a network that either never failed or only ever failed globally, neither of which is how computer networks currently work. Even the statement “lowering the probability of network segmentation” would be incorrect, because that is out of the scope of this project. I assume they meant lowering the risk in the event of a partition, but I do not know what action was taken.

I was referring to two groups that were previously on different sides of a network partition. In other words, a single group split into two after a partition, then merged. If both accepted writes, which is correct? If one or both of them stopped accepting writes, what other conditions can induce this “locked” resource?

I did not mention this on Metzdowd, but this is something that has bothered me for a while too.

This design has inherently decided to relax write consistency. If both sides of a partition follow this algorithm, then both sides will accept writes on any available resource. Admittedly I am a stickler, but I think it's worth preparing for these edge cases.

I do not stir turds, I flush them. And I do not understand how unemployment is viewed negatively, you get to do whatever you want! Also, I am not using any government assistance, for people concerned about that sort of thing.
