Today, we are publishing different proposals related to the first part of data chains:
- Data Chains - Option A
- Data Chains - General rules in an eventually consistent network
- Data Chains - Option B
Please note that we have not settled on any of these proposals yet. They are all still being considered.
Also, we’d like to share an idea that we are considering for the next testnet with MutableData. Currently, every time we start a new testnet, all user data is lost and people need to create a new account and start from scratch (e.g. you need to reupload your data and recreate your SAFE websites). The real solution will be to implement data republishing and network restarts via data chains, but since we are currently running client-only networks (where MaidSafe manages all the vaults), we are thinking that it could be a good idea to implement a temporary solution to make user data more persistent. So if a network is taken offline because we are adding new APIs and vaults need to be updated, we wouldn’t have to lose all the user data. This means that app developers wouldn’t have to reupload all their apps every time we start a new testnet. It also means that we’d have to start caring more about backward compatibility (e.g. deprecating APIs, migrating from one version of the API to another version, etc.).
There are a few simple approaches to achieve this. While none of them would be the ideal long-term solution, they should let us get the client tests going and allow network updates without much hassle in terms of data loss during restarts. Most approaches involve allowing nodes to retain their previous ID and restart in the same area (address space) of the network. In the long term, with node ageing, nodes will still try to do the same, but they will get relocated again at that point. For now, however, we’re considering letting nodes simply remain where they were, which keeps their chunk stores relevant and their data available to continue operations. This involves keeping the chunk store persistent and also flushing MaidManager accounts, which are currently not flushed to disk. These last two parts could in turn be achieved in multiple ways, such as using a custom network RPC to trigger and create a named backup which can later be restored, or going for the simpler option of a single backup by not dropping the chunk_store on termination/startup. Each approach comes with its own set of advantages and we’ve not finalised an approach just yet.
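As a very rough illustration of the simplest variant (a single backup that survives restarts), the sketch below shows what persisting a vault's state across a restart could look like. All names here (PersistentStore, the file layout, and so on) are hypothetical and chosen for the example only; this is not the actual vault code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical on-disk layout: one directory per vault, keyed by its
/// persistent node ID, holding the chunk store plus a flushed copy of the
/// MaidManager accounts.
struct PersistentStore {
    root: PathBuf,
}

impl PersistentStore {
    /// Reopen (or create) the store for a node that keeps its previous ID.
    fn open(root: &Path) -> io::Result<PersistentStore> {
        fs::create_dir_all(root)?;
        Ok(PersistentStore { root: root.to_path_buf() })
    }

    /// Flush the serialized MaidManager accounts to disk so they survive a restart.
    fn flush_accounts(&self, serialized_accounts: &[u8]) -> io::Result<()> {
        fs::write(self.root.join("maid_manager_accounts"), serialized_accounts)
    }

    /// Reload the accounts on startup, if a previous backup exists.
    fn restore_accounts(&self) -> io::Result<Option<Vec<u8>>> {
        match fs::read(self.root.join("maid_manager_accounts")) {
            Ok(bytes) => Ok(Some(bytes)),
            Err(ref e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }
}

fn main() -> io::Result<()> {
    // The chunk store itself would live alongside this and simply not be
    // dropped on termination/startup.
    let store = PersistentStore::open(Path::new("vault_backup/previous_node_id"))?;
    store.flush_accounts(b"serialized MaidManager accounts go here")?;
    assert!(store.restore_accounts()?.is_some());
    Ok(())
}
```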
SAFE Authenticator & API
The safe_browser’s DOM APIs were further simplified. Previously, the auth token had to be passed as an argument to every API call. Now, we internally map each handle to the corresponding client, so this unnecessary argument could be removed. APIs to free the handles are exposed as part of the DOM API. Also, the forEach function was not working consistently, but this is now fixed. A new version of the SAFE Browser with the updated APIs and fixes was built with the mock network and published for devs to try. @bochaco also documented the DOM APIs and we have hosted them on GitHub Pages for easier reference.
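For those curious what "internally mapping the handle to the corresponding client" means in practice, here is a simplified, std-only Rust sketch of such an object cache. The type and function names are made up for illustration and are not the actual safe_app internals.

```rust
use std::collections::HashMap;

/// Placeholder for an authorised client session (hypothetical type).
struct AppClient {
    app_name: String,
}

/// Hypothetical object cache: it hands out opaque handles and resolves them
/// back to the client internally, so API calls no longer need the auth token.
#[derive(Default)]
struct ObjectCache {
    next_handle: u64,
    clients: HashMap<u64, AppClient>,
}

impl ObjectCache {
    /// Register a client and return the handle given to the DOM API caller.
    fn insert(&mut self, client: AppClient) -> u64 {
        let handle = self.next_handle;
        self.next_handle += 1;
        self.clients.insert(handle, client);
        handle
    }

    /// Resolve a handle back to its client for an API call.
    fn get(&self, handle: u64) -> Option<&AppClient> {
        self.clients.get(handle)
    }

    /// The "free" APIs exposed in the DOM boil down to dropping the entry.
    fn free(&mut self, handle: u64) -> bool {
        self.clients.remove(&handle).is_some()
    }
}

fn main() {
    let mut cache = ObjectCache::default();
    let handle = cache.insert(AppClient { app_name: "demo-app".into() });
    if let Some(client) = cache.get(handle) {
        println!("resolved handle {} to app '{}'", handle, client.app_name);
    }
    assert!(cache.free(handle));
}
```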
The safe_app_java APIs are being wired up, and a few ImmutableData test cases are failing when run as part of the complete test suite. We have wired up most of the APIs, with only the MutableData API still pending. We are hoping to resolve these issues and have the Java API ready as soon as possible.
@shankar is also improving the UX/UI of the authenticator based on the designs from @shona.
SAFE Client Libs & Crust
SAFE Client Libs have been updated to use the latest library versions (including Serde 1.0), and we have now replaced SHA-2 everywhere with a more efficient hashing algorithm, SHA-3. Some changes are being made to the dev branch of safe_client_libs to further improve the consistency of the public API: we now have a function to generate cryptographic nonces (generate_nonce), which makes the crypto module available to apps more complete. Further, working closely with the Authenticator team, we have closed the gap by fixing inconsistencies in some of the modules: for example, mdata_info didn’t have a proper function to deallocate it, and now there is mdata_info_free.
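To illustrate the pattern behind functions like mdata_info_free (every object handed across the FFI boundary gets a matching function to deallocate it), here is a hedged sketch using made-up names; the real safe_client_libs signatures and types differ.

```rust
use std::os::raw::c_void;

/// Simplified stand-in for the data behind an MDataInfo-style object
/// (hypothetical layout, not the real safe_client_libs struct).
pub struct MDataInfoExample {
    pub name: [u8; 32],
    pub type_tag: u64,
}

/// Allocate on the Rust side and hand an opaque pointer across the FFI
/// boundary to the front end.
#[no_mangle]
pub extern "C" fn mdata_info_example_new(type_tag: u64) -> *mut c_void {
    let info = Box::new(MDataInfoExample { name: [0; 32], type_tag });
    Box::into_raw(info) as *mut c_void
}

/// The matching `_free` function: take ownership back and drop, giving
/// callers a well-defined way to deallocate every object they were handed.
#[no_mangle]
pub unsafe extern "C" fn mdata_info_example_free(ptr: *mut c_void) {
    if !ptr.is_null() {
        unsafe { drop(Box::from_raw(ptr as *mut MDataInfoExample)) };
    }
}

fn main() {
    let ptr = mdata_info_example_new(15_000);
    unsafe { mdata_info_example_free(ptr) };
}
```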
After updating safe_client_libs master to use the latest version of the serialization library Serde, we ran into some issues with the existing invitation-based account creation scheme: it was causing some weird out-of-memory errors. We have therefore changed the stored account information to be more strongly defined and moved it into a common module in routing, which will soon be used in safe_vault too.
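For illustration only, a strongly defined, serializable account-info structure using Serde 1.0 derive could look like the sketch below. The field names are assumptions for the example; the real struct in routing may differ.

```rust
// Cargo.toml for this sketch: serde = { version = "1", features = ["derive"] },
// serde_json = "1"
use serde::{Deserialize, Serialize};

/// Hypothetical shape of the stored account information.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct AccountInfo {
    /// Number of mutations performed so far.
    mutations_done: u64,
    /// Number of mutations still available to the account.
    mutations_available: u64,
}

fn main() -> Result<(), serde_json::Error> {
    let account = AccountInfo { mutations_done: 3, mutations_available: 997 };

    // A well-defined struct (de)serializes predictably with Serde 1.0,
    // avoiding loosely defined blobs that are easy to misinterpret.
    let encoded = serde_json::to_string(&account)?;
    let decoded: AccountInfo = serde_json::from_str(&encoded)?;
    assert_eq!(account, decoded);
    Ok(())
}
```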
The stress test from the master branch of safe_client_libs has been ported to the dev branch to include support for MutableData. This will help us test the robustness and performance of the MutableData implementation in safe_vault; based on the results, we can then see where improvements are needed.
Routing & Vault
While the design and simulation efforts are still ongoing, we decided to make public some of the documents and proposals today and invite you to participate in the discussion:
Why data chains?
There are a lot of decisions the network needs to make as a whole: Which nodes are allowed to join, and in which section? Which nodes are responsible for any given part of the namespace, and in particular for which data chunks? What chunks are currently stored on the network and what is their content?
Eventually, after exchanging a few messages, the nodes always need to reach an agreement on all of those questions, so that they don’t send contradictory responses to clients, and so that they agree on which sets of nodes are responsible for making the next decisions. In other words, both the data itself and the information about the network’s own structure are supposed to be eventually consistent: the nodes must always, although possibly after some delay, reach a consensus on them.
The goal of data chains is to be a historical log and a cryptographically verifiable representation of that consensus, that can be persisted and exchanged as a proof among nodes.
As a first step, we are only aiming to include the information about the network’s own structure (which nodes are members of the network, what is their name, and which sections and groups are they in). Even without data, that will already have considerable benefits:
- With a well-defined notion of a proof that a section consists of a particular list of members, we can at last enable the final piece of the puzzle for securely relaying messages: each message will contain the blocks required for the recipient to validate an uninterrupted chain of signatures back to the sender (a rough sketch of this check follows the list).
- Clients and nodes joining the network will be able to verify that they are connected to the real network without having to trust their proxy node.
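To make that first point a bit more concrete, here is a deliberately simplified, std-only sketch of what validating such a chain could look like. All of the types, the majority quorum rule and the verify stub are assumptions made for this illustration; they are not the actual data chain design or routing's types.

```rust
use std::collections::HashSet;

/// Placeholder types: in the real design these would be crypto keys and
/// detached signatures, and `verify` an actual signature check.
type NodeName = u64;
type Signature = (NodeName, u64); // (claimed signer, dummy signature bytes)

/// One link in a section's membership history (heavily simplified).
struct Block {
    /// Members of the section after this block is applied.
    members: Vec<NodeName>,
    /// Votes from members of the *previous* block endorsing this one.
    signatures: Vec<Signature>,
}

/// Stand-in for a real cryptographic signature verification.
fn verify(sig: &Signature, signer: NodeName) -> bool {
    sig.0 == signer
}

/// Walk the chain and check that each block is endorsed by a quorum (here:
/// a simple majority) of the members agreed in the previous block, starting
/// from a block the recipient already trusts.
fn validate_chain(trusted_start: &Block, rest: &[Block]) -> bool {
    let mut current_members: HashSet<NodeName> =
        trusted_start.members.iter().copied().collect();

    for block in rest {
        let valid_votes: HashSet<NodeName> = block
            .signatures
            .iter()
            .filter(|sig| current_members.contains(&sig.0) && verify(sig, sig.0))
            .map(|sig| sig.0)
            .collect();

        if valid_votes.len() <= current_members.len() / 2 {
            return false;
        }
        current_members = block.members.iter().copied().collect();
    }
    true
}

fn main() {
    let genesis = Block { members: vec![1, 2, 3], signatures: vec![] };
    let next = Block {
        members: vec![1, 2, 3, 4],
        signatures: vec![(1, 0), (2, 0)], // two of the three previous members
    };
    assert!(validate_chain(&genesis, &[next]));
}
```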
Events in the network
Since the Disjoint Sections RFC, the network is subdivided into sections defined by a prefix: a finite sequence of ones and zeros. The section with a given prefix consists of exactly those nodes whose names, in binary, begin with that prefix. At first, the network has only one section. Once enough nodes have joined, it splits into sections 0 and 1. When section 1 has enough nodes, it splits into 10 and 11, and later into 100, 101, 110 and 111, and so on. If a section loses too many nodes, it merges, effectively undoing a split.
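A small sketch may help make the prefix mechanics concrete. This uses a simplified stand-in (32-bit names and a made-up Prefix type) rather than routing's actual implementation:

```rust
/// A section prefix: the first `len` bits of `bits`, most significant first.
/// (Simplified stand-in for routing's Prefix type.)
#[derive(Clone, Copy, Debug, PartialEq)]
struct Prefix {
    bits: u32,
    len: u32,
}

impl Prefix {
    /// Does this prefix cover the given (here 32-bit) name?
    fn matches(&self, name: u32) -> bool {
        self.len == 0 || (name >> (32 - self.len)) == self.bits
    }

    /// Splitting appends one more bit, turning e.g. section 1 into 10 and 11.
    fn split(&self) -> (Prefix, Prefix) {
        let len = self.len + 1;
        (
            Prefix { bits: self.bits << 1, len },
            Prefix { bits: (self.bits << 1) | 1, len },
        )
    }
}

fn main() {
    // The initial network-wide section covers every name.
    let root = Prefix { bits: 0, len: 0 };
    let (zero, one) = root.split();          // sections 0 and 1
    let (one_zero, one_one) = one.split();   // sections 10 and 11

    let name: u32 = 0b1011 << 28;            // a name starting with 1011...
    assert!(one.matches(name));
    assert!(one_zero.matches(name));
    assert!(!one_one.matches(name) && !zero.matches(name));
}
```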
Nodes that join the network are assigned a small interval and must pick a name within it. They are then challenged by the section they are about to join to prove that they satisfy some minimum bandwidth (and later CPU) requirements.
So the main events that are represented by the blocks of the chain are:
- A node has passed the challenge and joined the network.
- A node has disconnected and left the network.
- A section has split into two.
- Two sections have merged into one.
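In code, the payload of such a block could be modelled as a simple enum over these four cases. The sketch below uses made-up type names and is only meant to show the shape of the events, not the actual data chain design:

```rust
/// Simplified stand-ins for routing names and section prefixes.
type NodeName = [u8; 32];
type Prefix = (u32, u32); // (bits, bit length), as in the earlier sketch

/// The network events a membership block could record (illustrative only).
#[derive(Debug)]
enum NetworkEvent {
    /// A node passed the resource challenge and joined the section.
    NodeJoined(NodeName),
    /// A node disconnected and left the network.
    NodeLeft(NodeName),
    /// The section split into the two child prefixes.
    SectionSplit(Prefix, Prefix),
    /// Two sibling sections merged back into their parent prefix.
    SectionsMerged(Prefix),
}

fn main() {
    let events = vec![
        NetworkEvent::NodeJoined([0u8; 32]),
        NetworkEvent::SectionSplit((0b10, 2), (0b11, 2)),
        NetworkEvent::NodeLeft([0u8; 32]),
        NetworkEvent::SectionsMerged((0b1, 1)),
    ];
    for event in &events {
        println!("{:?}", event);
    }
}
```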
The problem is that the existing nodes in the network don’t always agree on which of these events have happened yet or should happen:
- A joining node can pass the challenges it got from different section members at different points in time, and it could even pass only some of them but not all.
- A node could leave because of connectivity problems, and disconnect from its peers one by one, with several minutes in between.
- A node could believe that its section should not split because it will soon have to merge with another section that lost too many nodes. Another node could have received an update on that other section earlier or later, and thus come to a different conclusion.
- And similarly for merges: I might think that we, section 01, need to merge with section 00, because 00 has lost too many nodes, but you might already know that 00 has in the meantime regained a node and we don’t need to merge at all.
So our data chain implementation will need to start out from those different subjective views of the network, before arriving at an agreed “truth”, usually by making the nodes vote: “I think this node should join our section.”, “I think that node has disconnected.”, “We should split!”, etc.
Different approaches to a solution
Initially, most of the team’s time went into the Data Chains - Option A plan: we are trying to adapt established consensus algorithms like PBFT to our use case. Each section of the network would run an instance of that algorithm, and special care needs to be taken whenever the section changes because a node joined or left, or the section split into two or merged with another one.
Whenever an event happens, e.g. a node goes offline, the other nodes exchange several rounds of messages to reach an agreement that this event should define the next block in the section’s chain, and then commit that block. Apart from merges and splits, where blocks have two successors or predecessors, this produces a serially ordered history of each section: for any two blocks in the same section, it is well-defined which of them was agreed first. That makes this approach relatively easy to reason about, although the consensus algorithms themselves are rather complex. Other drawbacks are:
- If an attacker controls a third of a section’s nodes, they can break the section. (This is a general result that affects a large class of consensus algorithms.)
- PBFT is leader-based: One node in the section plays a special role, and attacking that node (e.g. DoS) can delay progress in the section until it has decided on a new leader — which could immediately be attacked again!
We looked into Honey Badger as an alternative leaderless consensus algorithm, but it looks like it cannot be adapted to our use case. We are continuing to evaluate other consensus algorithms and search for better fits.
However, these drawbacks led us to take the leaderless patterns further and consider approaches that avoid running a complex consensus algorithm before committing. That means they don’t necessarily always reach consensus on which event happened “first”: e.g. there might remain an ambiguity (a decision branch) about whether node A or node B left first. Nodes can process and action events concurrently and still eventually reach a single agreed state (both nodes A and B have left), which might be all we need from them to achieve our requirements here.
Data Chains - General rules in an eventually consistent network is a discussion of implicit ordering, and explores a set of rules to reorder a vector of blocks to account for the fact that different nodes could have received the blocks (or rather the votes for them) in a different order.
Data Chains - Option B proposes a mechanism that computes the validity of blocks from a set of votes that is independent of the votes’ order.
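To give a flavour of that last idea (without claiming this is the actual Option B design), here is a toy sketch in which a block's validity depends only on the set of votes received, never on the order in which they arrived:

```rust
use std::collections::{HashMap, HashSet};

type NodeName = u64;
type BlockId = u64;

/// Toy vote accumulator: validity is a function of *which* votes were
/// received, so two nodes seeing the same votes in a different order
/// still reach the same conclusion.
struct VoteAccumulator {
    section_members: HashSet<NodeName>,
    votes: HashMap<BlockId, HashSet<NodeName>>,
}

impl VoteAccumulator {
    fn new(section_members: HashSet<NodeName>) -> Self {
        VoteAccumulator { section_members, votes: HashMap::new() }
    }

    /// Record a vote; returns true once the block has reached quorum.
    fn add_vote(&mut self, block: BlockId, voter: NodeName) -> bool {
        if self.section_members.contains(&voter) {
            self.votes.entry(block).or_insert_with(HashSet::new).insert(voter);
        }
        self.is_valid(block)
    }

    /// A block is valid once more than half the current members voted for it
    /// (the quorum rule here is an assumption for the sketch).
    fn is_valid(&self, block: BlockId) -> bool {
        self.votes
            .get(&block)
            .map_or(false, |voters| voters.len() > self.section_members.len() / 2)
    }
}

fn main() {
    let members: HashSet<NodeName> = [1, 2, 3, 4].iter().copied().collect();
    let mut acc = VoteAccumulator::new(members);

    // Votes can arrive in any order, and duplicates are harmless:
    assert!(!acc.add_vote(42, 3));
    assert!(!acc.add_vote(42, 3));
    assert!(!acc.add_vote(42, 1));
    assert!(acc.add_vote(42, 4)); // third distinct vote of four members: quorum
}
```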