SAFE Network concerns from an old employee

dirvine · August 31, 2016, 9:08am

This, in particular has been discussed many times on this forum. A group size is chosen in any dht to be of a size that’s infeasible to lose the whole group (think whole, quorum etc. as same thing) in a refresh time. A refresh is generally 30 or 60 mins, but in SAFE the nodes are directly connected, so refresh is as close to network speed as possible. This makes this significantly less feasible.

Regardless of that, with disjoint groups and data_chains, merge/split of groups is handled as a natural thing. These make that even better handled and data chains itself does directly address a split/merge/data republish etc. Network partitioning is always something to consider, like earthquakes etc. and this is where data republish makes a huge difference. Securely doing so is hard, but we like hard in MaidSafe

The reason we don’t directly respond to all such points is

They are generally mis-informed
A single person takes out one of our Engineers for a while to answer it
It’s already answered several times
It’s a point about probability and all possible outcomes are possible, but not feasible
It’s not a great way to start looking properly at any subject. Engage, clarify, understand and then critique is very helpful though

I take time to answer things where I can, but the cost is very high, so I focus on the forum mostly. Several other email lists and forums have some of us (esp me) going into several days/weeks long explanations that would be very much quicker answered by reading the papers/wiki or search here etc. Initially I did spend that time, but right now if somebody wished to damage progress, this is exactly what they would do. Just state some “fact” that’s not all that relevant or perhaps particular to SAFE and make it sound feasible. It’s very clever, but a time sink. I am not saying this was the case here, I suspect just mistaken in areas that could have been cleared up very easily here initially.

My honest opinion these days, at this point is launch, launch, launch and have bug bounties, security bounties etc. and let folk point to a bit of code and show a real world exploit when we are up and running. Right now it’s not worth digging up old and incorrect assumptions and explaining them all to everyone who makes them. During Alpha we have said there are several security updates, the RFC’s clearly state what is being improved and hopefully that helps everyone.

Hope that makes sense.

anon40790172 · August 31, 2016, 10:09am

David already replied but here are my 2 cents. This is called “churn” (it brings up 50 results in search). It’s where nodes leave and other join the network. It adds to security because all data stored with these nodes now move somewhere else. On the other hand it’s a risky thing because if too many nodes fail at once a group might go from 10 nodes to only 5. This is handled in this way:

There’s no fixed quorum size to make a decision in a group. So no need to always have at least 8 nodes in a group to make a decision. If 4 of them drop in an instant, there’s still 4 nodes left and they can make a decision with 3/4 agree.
Nodes have addresses in XOR. So “close” doesn’t mean a thing in geographic terms. Your close nodes are probably all over the world, making the chance smaller that 4 or 5 of them churn at the same time as a region in a country loses internet.
Archive Nodes are implemented. So even if we have a major blackout the network could be booted from 0.
Connections in SAFE are fast and “heart beat” signals are send all the time to see if a node is alive or not. So if 2 nodes drop from a table with 9 nodes, the group knows in an instant. If the group got too small they could merge with another group (Disjoint Groups) and my guess is that this will happen in under a second.

This is what’s written by the Lee Clagett as well and I really don’t understand what he is asking for here. Now I’m not an engineer or coder but this is what’s said:

This is not a solution to the fundamental churn and partition problem. A
[newer document][3] mentions group merging, but does not describe how
groups with different states will be resolved.

This is a comment about Disjoint Groups. It makes the assumption that 2 groups should have the same “state” and therefore consensus on data/decisions and more. This is fundamentally wrong IMO. The idea behind Disjoint Groups is that each group is responsible for a certain range in the address space. So if group A1 becomes too small and the same happens with group A2 than each group still has it’s own “state” about everything they’re responsible for. So group A1 still signs stuff with quorum and group A2 does the same. They can’t conflict because the group sign (quorum) of each group is “law” so to speak. And when they decide to merge they accept each others previous signs and decisions. So IMO the writer of the article doesn’t have a good understanding of DG and therefore the current focus of the devs. But he’s free to explain here in the forum if I missed something in his statements.

Seneca · August 31, 2016, 10:12am

Are you sure about this? That would mean you can launch a DOS attack on other close group nodes to take over a group with a minority.

anon40790172 · August 31, 2016, 10:58am

This is what’s in the RFC.

The quorum cannot be a constant anymore, due to varying group sizes. It needs to be a percentage strictly greater than 50% instead, and in a group of size n, a number x of nodes will constitute a quorum if x / n >= QUORUM.

The reason for making the point is this quote:

It reads to me like: we have 8 nodes, 4 of them churn now we can’t have consensus. This isn’t the case as quorum is a percentage of the number of nodes. Not a fixed number like 8 or 5 or 7. That’s what I mean with no fixed quorum size. Each group finds consensus no matter if it’s 4 nodes or 12. Only at 2 we might have trouble. But that’s why a merge should happen way before we’re at this level. And when a group does get at this level due to extreme churn it can’t route new updates anymore as the other close group doesn’t sign a thing without enough signs from the sending group.

BIGbtc · August 31, 2016, 11:48am

Its unfortunate that you have to do this at this juncture because your focus should be, as you say. “launch, launch, launch” and I think you understate the cost - it is extraordinarily high - and you underestimate the impact of addressing each and every seemingly valid criticism of concept Safenet.

Not unlike the previous rut - which you got out of - making forecasts and promises that were not fulfilled, the expectations of a timely responses from you and Maidsafe are now present. This is bad news for you, the Team, Maidsafe and most in this community. It will become clear that no response means trouble. You are doing a great disservice to yourself, the Team and the Community and you are delaying the launch.

Control Your Time, Manage The Communications:

If there is a public claim about Maidsafe, valid or not, you or your designate should formulate a proper public response that is succinct and direct inviting the claimant to submit their concerns to Github or read this wiki or refer to these bug/security bounties, etc. The SAME MESSAGE goes out for EVERY PUBLIC CLAIM. COPY/PASTE. Use twitter and link to the original public claim. You will see in time the real concerns get real attention in the proper forums - Github, RFC’s etc… and the FUD dissipates as attention grabbing headlines are met with timely, simple and effective responses.

Stop the madness. Stop the instigators. Stop wasting valuable time.

Focus on the network, not the FUDsters.

Great job David.

Josh · August 31, 2016, 12:06pm

I think you are correct but the assertion that David underestimates the cost of reply is probably a bit OTT. Im sure he does but when prominent forum members endorse the questions being asked I’m certain refraining becomes even harder. The turdstirrer wins and we all lose this one.

BIGbtc · August 31, 2016, 12:16pm

I said he underSTATES the cost. I believe he “underestimates the impact” of addressing each and every concern. This is way too time consuming for him or anyone to sit and listen/read every single item being critiqued and disected, by respected members or not. This contributes to the project being behind schedule, pain and simple and these are “distractions” that need to be managed properly.

Hell, most people have no clue that this project even exists and of those that do, few have any clue how it works.

The priority needs to be launch, launch, launch… fuk the noise.

BIGbtc · August 31, 2016, 12:33pm

Turdstirrers are easily identified and they have distinct smell about them wherever they go. The way to deal with turdstirrers is to make them work for their win. I romise you this particular turdstirrer hasnt got the backbone or the balls to work for his win. How do I know that? His cowardly methods. Pay close attention to the conspicuously absent response from the unemplyed not-so-prominent forum member and previous emplyed misnomered “engineer” who did some dev work for David.

The reality is we are all losing right now. We have no Safenet, we have no Safecoin, we have no forecast for said and we are spending excessive amounts of time addressing turstirrers or prominent members who cite turdstirrers.

Josh · August 31, 2016, 12:47pm

Apologies I still had my pre-coffee eyes on.

Traktion · August 31, 2016, 2:56pm

It was good that the thread was brought to light. It is also good that people feel they can air their honest concerns on this forum. It is great that David’s response hopefully alleys those concerns.

It isn’t great that the linked post may have been more FUD than fact, but we can’t help that. The worst thing to do is attempt to silence dissent - people will smell fear when that happens.

BIGbtc · August 31, 2016, 3:16pm

This is not about silencing dissent. It’s about properly and efficiently managing communications.

It is impractical to think every public question or concern, FUD or otherwise, is worthy of a detailed response. Im suggesting people are redirected to the information that will serve as an answer. If it’s a real question the answer will be found if the claimant is really looking.

happybeing · August 31, 2016, 3:39pm

I think this needed answering directly. It is different to have an ex-MaidSafe developer making what appeared to be detailed criticism that few here could counter, and I think carries a lot of weight to have it addressed by MaidSafe themselves - only they know the full role, context and background when someone they know and worked with does this. We’ve seen it a couple of times before, and if it isn’t answered properly it leaves open the possibility for those who wish the project harm to harp on about it, without the community being able to point to a definitive rebuttal.

In the above case I was able to do that on the twitter stream so it is now there for anyone who comes across in the future to find. Others can do so as well if they find it repeated, and David need not get involved again.

When its someone who hasn’t worked directly on the code I’m happy to jump in and try tackling them, but unless you know the code being referred to in detail, its hard to do that with someone who has the credibility to suggest they do have that knowledge.

BIGbtc · August 31, 2016, 4:29pm

I see your point and would almost agree if it was posted on this forum or directly to David or a dev.

There is no strategy/criteria for dealing with these situations and there needs to be one. What you are establishing here is a dangerous precedent and this guy and dude dallylama can post their concerns everywhere and David and team must provide detail responses each and every time.

happybeing · August 31, 2016, 4:57pm

Hardly, David and team can handle these as they see fit and I think it is wrong to suggest it is logical to respond in the way you say (in the following full quote) based on my opinion or David responding in this case. The first time an ex-employee is rebutted and doesn’t stand up their case makes it much easier for the community to respond if they do the same again, or to ignore them as you suggest.

BIGbtc · August 31, 2016, 5:34pm

Sure they can. At the expense of many things. Did David not already express concern this was taxing. Under other circumstances I agree with you but not with the current state of affairs.

yvon.fortier · August 31, 2016, 11:54pm

Thanks for the reply, David and polpolrene.
Makes a lot of sense.

At the end of the day, on a very high level, it’s similar to any centralized multi-node server architecture, in which several nodes offer a service redundantly and have traffic routed to them through a load balancer device or protocol.
In that situation, you are also at risk of losing the data. If you have a service on 3 nodes and all 3 die before you can add more members, you lose the data/service.

You need to weight:

How long is the heartbeat packet timer
What’s the percentage that a node that is down is down because it’s off-line permanently or it’s switched off for a few hours
How critical is the data. Do you want 4 nodes with it? 8? 16? Perhaps users could chose, like you would with backing up your personal data nowhere (case of downloaded movies), one place (a pseudo important work document) or in 3 locations (personal photographs).

Feel much better now, thank you!

And by the way, if Maidsafe and the community in general believes it’s adequate, I could write, with help of anyone else who wants to help, a document replying to common concerns.

I heard people in the crypto space make the same criticisms time and time again and it’s very hard to find the proper document to route them to.

vtnerd · September 1, 2016, 4:52am

Sorry in advance for the very long post …

Something I noticed too. It is a quirk of StackOverflow. I listed the date my contract ended, but it always lists the most recent position listed as current even if ended. And I cannot remove that line.

My employment status is irrelevant to my criticisms - attacking credentials is a flawed method to refute claims or ideas. FWIW after much patience I was given an offer for a dream remote-working position. Unfortunately, it is contingent on a contract approval, and several months of waiting for any resolution has taught me that I severely mis-judged the glacial pace of non-governmental bureaucracy. I expected governmental bureaucracy to be this slow, but I have learned that private businesses can also move very slowly.

Perhaps I should list boost::fusion as a current role; I have recently spent a significant amount of time “normalizing” the C++11 variadic containers to match C++03 behavior. Actually, being paid to work on Boost would be my true dream position, but that sounds like some unicorn non-sense.

I incorrectly assumed all of my points, with the exception of the nonce reuse, should have already been known to the team. I only noticed the nonce reuse last week when looking at the code for the first time since I left.

The first major problem I perceive involves p2p quorum systems in general. Numerous weekly updates have mentioned quorum + churn or quorum + segmentation, so I thought the difficulty was known. There is also freely available literature on the internet, which I think describes the problems in this environment reasonably well.

I think the second major problem is related to the close group attack mentioned on Google Groups, but I failed to bring another thought to attention (I will take the blame for that). The “closest” public-key in terms of XOR space is determined by the longest prefix. This indicates to me that the person with the most computing power can generate the highest number of closest keys, and this person will be capable of setting any desired value immediately. You indicate that there is more to the solution - can you point me to a file in the codebase or greppable search term? The most recent update suggests that the security implementation is still in flux: “Now that Andreas is back, the Routing team is discussing the immediate options and measures we can take to make groups of nodes more secure (there are many approaches here and the implementation schedule is important to get right). - i.e. prevent an attacker from getting enough nodes to reach quorum and control group consensus -, what level of group security can be achieved, and if necessary, how the network can deal with malicious groups.”

I notified the team directly of the subdirectory issue while I was under contract. This is likely a rare occurrence because Tahoe-LAFS appears to have a similar design in this situation. However, major cloud storage services (Google, Amazon, Microsoft) do not allow subdirectories and Tahoe-LAFS FAQ (Q31) recommends only one writer at a time in a directory. Their documentation goes further to state that no mutable file/directory should have multiple writers. Tahoe-LAFS users have yet to complain about about forked-off readers, so maybe I made too much of it. But if Maidsafe does not recommend one writer per directory subtree, then forked-off writers are possible too which adds to the complications.

The convergent encryption issue is mentioned in a whitepaper (no longer available on maidsafe.net) co-authored by a current employee. Also mentioned by me on a post in an intriguing thread on this forum. The proposed remedy is for the user to symmetrically encrypt prior to self-encryption, but it is unlikely many users will be aware of when they should do this or how to do it properly. Luckily the problem should be rare. However, Tahoe-LAFS no longer does system-wide convergent encryption after discovering the potential weakness. If the intent was to reduce storage requirements in exchange for lesser privacy in a small number of cases, then the tradeoff seems peculiar given the projects high priority on privacy.

The key/nonce reuse I just noticed this past week, but I notified the team of similar problems with unauthenticated AES-CFB with key/iv reuse in the C++ codebase. It is a relatively simple mis-application of cryptographic primitives, but the constant code re-writing by the project creates a high probability of future crypto problems. Also, whoever wrote the hybrid_encryption function does not understand how libsodium public-key encryption works - it generates a shared secret using ECC-DH and then uses that secret as the key to XSalsa with the provided nonce. So the call to XSalsa is superfluous in this situation. As stated on the mailing list, plaintext recovery should be difficult due to the content being encrypted. But a serialization change (padding or new fields), could reduce the security. Since a random nonce generation can be stored unencrypted there is no reason to take a chance.

I guess this was not clear; @Josh understood the intended meaning of this paragraph. After re-reading the statement, I wish to have worded it differently. I think a trustless p2p database is vastly more difficult than it first appears. I do not think the success of this project should be assumed; there is a risk of failure and people should be aware of that. I doubt the Maidsafe team will be able to achieve their lofty goals, but I do not wish failure on the team. And obviously my thoughts on the probability of success are subjective.

But not impossible. Which means that you either: (1) accept the risk that a resource (folder, coin, etc.) could theoretically be “locked” forever or require a trusted party to “fix” it; (2) allow for inconsistent writes on the network (forks); (3) accept both. This is outlined by the CAP theorem, and a 2012 follow-up by the author gives a good summary about how there is some known variability between the consistency and availability extremes. [Note: According to the theorem you can also get consistency and availability if you drop partitioning, but that would require a computer network that never drops data, and therefore in current real-world systems “P” is “chosen by default” in the 2 of 3 theorem].

It is not clear from your statement the path Maidsafe has selected, or whether Maidsafe believes they have invalidated the theorem. I think you have decided that you can engineer a very small probability of the resource being locked (“infeasible to lose the whole group”). Unfortunately, maintaining consistency while keeping availability high is difficult, and trustless p2p systems increase that difficulty. The obvious techniques: (1) super-nodes with higher reliability; (2) larger groups; (3) more frequent faulty-node detection messaging. If you know of more techniques, or have created a new one, please direct to some reading material.

Super nodes (1) have to be selected somehow, and Maidsafe previously had algorithms for selecting more reliable peers for group selection. The algorithms have disappeared and I cannot recall ever seeing an implementation. Larger groups (2) have the negative side-effect of requiring more messages (bandwidth) as the group size increases. Additionally, the number of messages required for quorum consensus algorithms to converge will generally increase as the number of faulty or malicious nodes in the group increases (non-responding nodes are faulty so a large group + churn is bad). More frequent faulty-node detection messages (3) require more bandwidth. If the check frequency is taken to the extreme for rapid failure detection and replication, it increases the probability of false-positives and thus instigating even more messages to update or increase group information spuriously.

I have found some interesting research papers on p2p quorum databases. One method varies read and write consistency to assess the availability and messaging tradeoffs. Another uses a hierarchy based quorum to reduce messaging necessary for quorum. The latter seems more useful for Maidsafe.

The 4 remaining nodes cannot determine the difference between nodes without connection, and a network partition. So if the network allows writes to occur in that situation, its possible for writes to occur in both partitions. This will result in a data conflict. Bitcoin has the same limitation, and the nodes agree that the longest chain is the version to take. Other designs prevent writing (write availability loss), while still allowing reading (read consistency loss due to possible stale data). Yet other designs prevent reading and writing (write/read availability loss), but have a single history timeline and never return stale data.

There are ways to play with this a bit, but the important takeaway is that no existing distributed computing literature provides a method that guarantees 100% availability and 100% consistency over a fault-capable network. This is not catastrophic, as Bitcoin does not guarantee 100% consistency either. However, I do not believe Maidsafe has found a way to make a trustless distributed database sufficient for its goals. But this is hard to accurately judge; Maidsafe keeps changing some of its core algorithms making it harder to analyze.

Probably all over the world != are guaranteed to be all over the world. It should be possible to use the IP address prefix to reduce the probability. This technique would be inaccurate if the node connected to the network through a VPN or similar.

The detection is not instant, that would imply a network and OS with zero latency which is non-existent. I stated this once above - as the heartbeat frequency is lowered it increases the necessary bandwidth and increases false-positives. And even if the detection were instantaneous, replication is not.

Its also worth mentioning that heartbeats and ping-ack cannot be used to synchronize multiple writers to the same resource. Messages can be received in a different order from different sources, so a method to reach ordering consensus is needed if this is a desired feature.

The first sentence is a response to the statement, “network partition prevention”, in an update. A 100% solution would require a network that never failed or failed globally, neither of which is how computer networks currently work. Even the statement “lowering the probability of network segmentation” would be incorrect, because that is out of the scope of this project. I assume they meant lowering the risk in the event of a partition, but I do not know what action was taken.

anon40790172:

This is a comment about Disjoint Groups. It makes the assumption that 2 groups should have the same “state” and therefore consensus on data/decisions and more. This is fundamentally wrong IMO. The idea behind Disjoint Groups is that each group is responsible for a certain range in the address space. So if group A1 becomes too small and the same happens with group A2 than each group still has it’s own “state” about everything they’re responsible for. So group A1 still signs stuff with quorum and group A2 does the same. They can’t conflict because the group sign (quorum) of each group is “law” so to speak. And when they decide to merge they accept each others previous signs and decisions. So IMO the writer of the article doesn’t have a good understanding of DG and therefore the current focus of the devs. But he’s free to explain here in the forum if I missed something in his statements.

I was referring to two groups that were previously on different sides of a network partition. In other words, a single group split into two after a partition, then merged. If both accepted writes, which is correct? If one or both of them stopped accepting writes, what other conditions can induce this “locked” resource?

I did not mention this on Metzdowd, but this is something that has bothered me for a while too.

anon40790172:

This is what’s in the RFC.

The quorum cannot be a constant anymore, due to varying group sizes. It needs to be a percentage strictly greater than 50% instead, and in a group of size n, a number x of nodes will constitute a quorum if x / n >= QUORUM.

The reason for making the point is this quote:

A group can losequorum if enough nodes drop out simultaneously, which would mean that no more updates can be made to the resource.

It reads to me like: we have 8 nodes, 4 of them churn now we can’t have consensus. This isn’t the case as quorum is a percentage of the number of nodes. Not a fixed number like 8 or 5 or 7. That’s what I mean with no fixed quorum size. Each group finds consensus no matter if it’s 4 nodes or 12. Only at 2 we might have trouble. But that’s why a merge should happen way before we’re at this level. And when a group does get at this level due to extreme churn it can’t route new updates anymore as the other close group doesn’t sign a thing without enough signs from the sending group.

This design has inherently decided to relax write consistency. If both sides of a partition follow this algorithm, then both sides will accept writes on any available resource. Admittedly I am a stickler, but I think its worth preparing for these edge cases.

I do not stir turds, I flush them. And I do not understand how unemployment is viewed negatively, you get to do whatever you want! Also, I am not using any government assistance, for people concerned about that sort of thing.

Jabba · September 1, 2016, 7:27am

I have to wonder why any former employee would take their concerns to twitter before asking the team directly…?

I completely agree that IF you have valid points you should share/air them, but it would seem more professional to take your concerns to Maidsafe and if their answers were unsatisfactory to you then you post the questions and answers publicly. The fact that you tweeted your initial comments calls the rest of your posts, your professionalism and your motives into question.

Well obviously.

I don’t think there is anyone here who ‘assumes’ this will definitely work. If there wasn’t a reasonable chance of failure maidsafe would have a >$1bn market cap already! IF it does work it will solve a lot of technical and social issues and have very disruptive impact, there is no doubt it would be HUGE if it works. It’s a project worthy of supporting. If you see problems you should be trying to help fix them, not posting FUD to hurt the project (which IS what you are doing by not contacting maidsafe first and then publicising your concerns afterwards). The world doesn’t need you to let them know that nothing is certain and the challenge is a hard one.

I find everything about your post and this situation very suspicious. It seems almost impossible to me that your concern is the ‘public good’ or welfare of investors… that story does not make sense given how you have approached the concerns you apparently have.

anon86652309 · September 1, 2016, 8:38am

Hey Lee,

Thanks for taking the time to write this post up. I did hear you were working with Michael Caisse & co. I think? Boost.Fusion and variadics - sounds like fun

I think all your points probably were known to us except the nonce reuse. We’re discussing that now, but thanks for the heads-up.

anon96570664 · September 1, 2016, 8:52am

With all the respect to you Jabba, I do not think it matters how, where or why he said his criticism. The important part is what he is saying, the essence of his thoughts. Be it valid concerns or not, that is what matters and what should be investigated. No pride should be hurt because there should be no pride to hurt.

He appears to be a highly intelligent person who has a much wider understanding of what Maidsafe is actually trying to solve. It therefore can be argued that he knows the actual difficulties of the project much better than lay people and his insight can draw attention to the deeper aspects so investors can adjust their risk and position accordingly.

I will argue that even though we should theoretically all know the risks of this venture the reality is that we cannot know and for this reason many are definitely over-invested and may not even consciously admit it to themselves.

Many crypto projects are building fantastic cloud castles that may and will disappear. This will bring much harm to the crypto space in the months and years to come. Maidsafe and safenet should be as grounded as possible, with much cold water thrown to their faces. Rather be undervalued then overvalued. This will benefit in the long run.

Topic		Replies	Views
SAFE Network Dev Update - October 4, 2018 Updates	16	2203	October 6, 2018
Developers have they been threatened? Community	3	930	July 28, 2015
SAFE Network Dev Update - March 14, 2019 Updates	56	4544	March 21, 2019
MaidSafe Dev Update - August 17, 2017 Updates	42	7974	August 24, 2017
Update 24 March, 2022 Updates	25	3244	March 31, 2022

SAFE Network concerns from an old employee

Related topics