Orthogonal Persistence and Algebraic Data Types for documents on Safe Network

TL;DR With a layer on top of the SAFE Network, we can have a planet-wide object store with orthogonal persistence. It’s a great opportunity to reinterpret what “data” is and how we should think about it.


I’m reading an interesting Gulliver story: https://ngnghm.github.io/. It reminds me of what the SAFE Network is, and of what it could be.

I found it through the Robigalia blog by Corey Richardson, who drew inspiration from these stories. Coincidence: he is building a Rust ecosystem around a secure microkernel. Maybe Robigalia could be the future for the SAFE Network? (He’s not a wannabe OS developer: he was an intern on the seL4 verification team. Good credentials. But the project moves very slowly and could use some help. Hint hint.)


One. Store everything the user does. No need to think about persistence, and no more losing work. (No more losing a post to the safenet forum like I did yesterday. I didn’t retype it. Sorry, you missed my great wisdom forever.)

No SAVE button: you create a new text file, type a word, and it is saved automatically. Always. (If Google Docs can do it, the SAFE Network can do it too.)

On the SAFE Network: an append-only mutable data block. Pay once, append until the block is full. Does it work like that? If not, it could be useful.

From Chapter 2 of the interesting Gulliver story. Sorry for the long quotes; they are still shorter than the full text.

Orthogonal Persistence

Ann explained to me that Houyhnhnm computing systems make data persistence the default, at every level of abstraction. Whether you open the canvas of a graphical interface and start drawing freely, or you open an interactive evaluation prompt and bind a value to a variable, or you make any kind of modification to any document or program, the change you made will remain in the system forever — that is, until Civilization itself crumbles, or you decide to delete it (a tricky operation, more below). Everything you type remains in your automatic planet-wide backups, providing several layers of availability and of latency — kept private using several layers of cryptography. [DESCRIPTION OF THE SAFE NETWORK!]

Notes on the user interface, security levels, and so on. The user is always in control.

Of course, you can control what does or doesn’t get backed up where, by defining domains each with its own privacy policy that may differ from the reasonable defaults. The user interface is aware of these domains, and makes it clear at all times which domain you’re currently working with. It also prevents you from inadvertently copying data from a more private domain then pasting it into a more public one; in particular, you only type your primary passwords but in a very recognizable special secure domain that never stores them; and your secondary access keys are stored in a special private domain using stronger cryptography than usual, and also subject to various safety rules to avoid leakage.

Ideally, this would be part of the operating system. EROS and KeyKOS did that: RAM is just a write-through cache for a persistent object store. A good OS could use the SAFE Network to provide orthogonal persistence. Until that happens, SAFE Network libraries could implement it as the default behavior for documents.

The adjective “orthogonal” means that the persistence of data is a property of the domain you’re working in, as managed by the system; it is not an aspect of data that programmers have to deal with in most ordinary programs; unless of course they are programmers specifically working on a new abstraction for persistence, which is after all an ordinary program, just in a narrow niche.

Anything an App does on the SAFE Network would use this without thinking about it. The writer thought about safecoins too! Joke. But no naive idealism: we know things have a cost.

Regular programmers just manipulate the data with full confidence that the inputs they consume, the code that manipulates them, and the outputs they produce will each remain available as long as the user wants them, with the consistency guarantees specified by the user, as long as the user affords the associated costs.

NOTE: Persistence is not done by the App. It is done for the App:

Actually, ordinary programs don’t know and can’t even possibly know which domain they will be running in, and letting them query those details would be a breach of abstraction, with serious security implications and performance impediments, even assuming for a moment that it wouldn’t otherwise affect program correctness.

Maybe it needs a different model with a new layer between the App and the SAFE Network:

  • Now: the App gets a token and can do anything the token allows.
  • Then: the Domain gets the token. The App can’t connect to the SAFE Network directly; instead, it is started “inside” a Domain and uses syscall-like requests to the Domain to request and manipulate objects.

The user can pick the Domain for the App. The App doesn’t need to know which one. It never saves anything! The Domain handles persistence for the objects the App works with. Copies of an App can run in multiple Domains at the same time.

The Domain owns the objects, not the App! It acts as a write-through cache in front of the object store, which is the SAFE Network, and it applies persistence and other policies. (Described better in the quoted section above.)

Note: this is not incompatible with the current SAFE Network authentication. A “Domain” is just a special App, and a DomainApp is just an App that connects through a Domain instead of directly. Ideally, all “normal” Apps would be turned into DomainApps (without an access token, running inside Domains) and only Domains would be “real” SAFE Network applications (with an access token).
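Something like this, as a sketch. None of these names exist in the real SAFE client libraries; they only illustrate the shape of the Domain “syscall” surface:

```rust
// All names here are hypothetical; this is not an existing SAFE client API.
use std::collections::HashMap;

/// Handle an App receives for an object it asked its Domain for.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ObjectId(u64);

/// The "syscall" surface an App sees. The App never holds a network token
/// and never talks to the SAFE Network directly.
trait Domain {
    fn create(&mut self, initial: Vec<u8>) -> ObjectId;
    fn read(&self, id: ObjectId) -> Option<Vec<u8>>;
    fn append(&mut self, id: ObjectId, data: &[u8]);
}

/// A toy in-memory Domain. A real one would hold the access token and act as
/// a write-through cache in front of the network, applying its own
/// persistence policy behind these calls.
struct InMemoryDomain {
    objects: HashMap<ObjectId, Vec<u8>>,
    next: u64,
}

impl Domain for InMemoryDomain {
    fn create(&mut self, initial: Vec<u8>) -> ObjectId {
        let id = ObjectId(self.next);
        self.next += 1;
        self.objects.insert(id, initial);
        id
    }
    fn read(&self, id: ObjectId) -> Option<Vec<u8>> {
        self.objects.get(&id).cloned()
    }
    fn append(&mut self, id: ObjectId, data: &[u8]) {
        if let Some(buf) = self.objects.get_mut(&id) {
            buf.extend_from_slice(data);
        }
    }
}

/// A DomainApp only ever sees `&mut dyn Domain`; the user decides which
/// Domain (work, personal, "stuff", ...) it actually runs inside.
fn note_taking_app(dom: &mut dyn Domain) {
    let note = dom.create(b"shopping list:\n".to_vec());
    dom.append(note, b"- milk\n");
}

fn main() {
    let mut home = InMemoryDomain { objects: HashMap::new(), next: 0 };
    note_taking_app(&mut home);
}
```

The point is that the App holds no token at all; the Domain does, and persistence happens behind those three calls.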


Will orthogonal persistence waste much space? Maybe not. From Chapter 3:

Dealing with Bad Memories

But, I inquired, if they log everything and almost never forget anything, don’t Houyhnhnm computing system quickly get filled with garbage? No, replied Ann. The amount of information that users enter through a keyboard and mouse (or their Houyhnhnm counterparts) is minute compared to the memory of modern computers, yet, starting from a well-determined state of the system, it fully determines the subsequent state of the system. Hence, the persistence log doesn’t need to record anything else but these events with their proper timestamp. This however, requires that all sources of non-determinism are either eliminated or recorded — which Houyhnhnm computing systems do by construction.

The writer describes some methods for reducing storage use, but the user has the final say.

Of course, to save resources, you can also configure some computations so they are not recorded, or so their records aren’t kept beyond some number of days. For instance, you might adopt a model-view-controller approach, and consider the view as transient while only logging changes to the model and the controller, or even only to the model; or you might eschew long-term storage of your game sessions; or you might forget the awkward silences and the street noise from your always-on microphone; or you might drop data acquired by your surveillance camera when it didn’t catch any robber; or you might delete uninteresting videos; or you might expunge old software installation backups from long gone computers; or you might preserve a complete log only for a day, then an hourly snapshot for a few days, and a daily snapshot for a few weeks, a weekly snapshot for a few months, etc.; or you might obliterate logs and snapshots as fast as you can while still ensuring that the system will be able to withstand partial or total hardware failure of your personal device; or then again, given enough storage, you might decide to keep everything. It’s your choice — as long as you pay for the storage. The decision doesn’t have to be made by the programmer, though he may provide hints: the end-user has the last say.


Two. SAFE Network libraries could work with real objects, as Chapter 3 discusses: Algebraic Data Types.

Maybe it’s time to think about “files” differently? They are not blobs but real documents. How did we think about things before computers?

  • A picture is the original activations on a Bayer filter, the exposure settings, the date taken, the GPS location, the title given, the list of faces with pixel coordinates, the keywords, etc. And the editing steps! Like Lightroom: it doesn’t touch the original, it records transformation steps.

  • A book is not a string of numbers. It’s the author, the title, the cover, the chapters, the paragraphs. A snapshot at the end of the editing process.

  • If I am the writer, it’s all the outlines, the character descriptions, the revision graph. Every keypress, every word typed, every section moved from one place to another. The whole editing history, organized into bigger and bigger editing steps. My computer could think about it like that, but it doesn’t. Want to look at an old version? I can’t. Want to “remove” a change from the history (create a new version without that change)? I can’t. Save often? Use git? Great. But it should be automatic, because it could be automatic.

If there were SAFE Network libraries that worked with good document abstractions, it could be a new beginning: revolutionize not only how we store data, but also what data is and how people think about data. (Not the low-level libraries, but higher-level “SAFE Network Standard Libraries”.)
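A rough sketch of what such a document abstraction could look like (the types are made up for illustration, not an existing SAFE library):

```rust
// Hypothetical types, for illustration only.

/// One semantic editing step. Storing these, instead of byte snapshots,
/// keeps the whole history addressable.
enum Edit {
    Insert { offset: usize, text: String },
    Delete { offset: usize, len: usize },
}

/// A document is its metadata plus the full editing history; the history
/// alone determines every version, as in the quoted Chapter 3.
struct Document {
    title: String,
    history: Vec<Edit>,
}

impl Document {
    /// "Look at an old version": replay the first `n` edits.
    fn version_at(&self, n: usize) -> String {
        let mut text = String::new();
        for edit in self.history.iter().take(n) {
            match edit {
                Edit::Insert { offset, text: t } => text.insert_str(*offset, t),
                Edit::Delete { offset, len } => {
                    text.replace_range(*offset..*offset + *len, "");
                }
            }
        }
        text
    }

    fn current(&self) -> String {
        self.version_at(self.history.len())
    }
}
```

Every version is recoverable by replaying a prefix of the history; nothing ever has to be “saved” explicitly.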

Data at the Proper Level of Abstraction

Because persistence in Human Computer Systems consists in communicating sequences of bytes to external processes and systems (whether disks or clouds of servers), all data they hold is ultimately defined in terms of sequences of bytes, or files; when persisting these files, they are identified by file paths that themselves are short sequences of bytes interpreted as a series of names separated by slashes / (or on some systems, backslashes \, or yet something else). Because persistence in Houyhnhnm Computing Systems applies to any data in whichever high-level language it was defined, all Houyhnhnm computing data is defined in terms of Algebraic Data Types, independently from underlying encoding (which might automatically and atomically change later system-wide). For the sake of importing data in and out of independently evolving systems, as well as for the sake of keeping the data compressed to make the most of limited resources, some low-level encoding in terms of bytes may be defined for some data types. But on the one hand, this is the exception; on the other hand, the data is still part of the regular Algebraic Data Type system, can still be used with type constructors (e.g. as part of sum, product or function types), etc. Whereas Human computer systems would require explicit serialization and deserialization of data, and would require ad hoc containers or protocol generators to allow larger objects to contain smaller ones, Houyhnhnm computing systems abstract those details away and generate any required code from type definitions.

When an App sees a document and has a better encoding for it, it can transform it. Transparently. Example: an App learns FLAC encoding. It knows FLAC is equivalent to WAV, so it searches for my documents encoded as WAV and re-encodes them. Without my instruction or knowledge. “It just works.”

Low-level encodings can even be replaced by newer and improved ones, and all objects will be transparently upgraded in due time — while preserving all user-visible identities and relationships across such changes in representation.
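In today’s terms, Rust’s serde already gives a small taste of “generate any required code from type definitions”: the programmer declares the type, and the byte-level encodings are derived and swappable. A sketch, assuming the serde (with derive), serde_json and serde_cbor crates; the Photo type is invented:

```rust
use serde::{Deserialize, Serialize};

// The programmer only declares the shape of the data; the byte-level
// encodings are derived from the type definition.
#[derive(Serialize, Deserialize)]
struct Photo {
    title: String,
    gps: Option<(f64, f64)>,
    keywords: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let p = Photo {
        title: "Sunrise".into(),
        gps: Some((47.5, 19.0)),
        keywords: vec!["holiday".into()],
    };
    // Two different low-level encodings of the same value; switching between
    // them needs no change to the Photo type itself.
    let as_json = serde_json::to_vec(&p)?;
    let as_cbor = serde_cbor::to_vec(&p)?;
    let back: Photo = serde_cbor::from_slice(&as_cbor)?;
    println!(
        "json: {} bytes, cbor: {} bytes, title round-tripped: {}",
        as_json.len(),
        as_cbor.len(),
        back.title
    );
    Ok(())
}
```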

You are lucky, because this is as far as I have read so far. Many great ideas that are a good fit for the SAFE Network. Some are impossible without the SAFE Network! (Or something similar. But there was nothing similar before.) Is it time to make history?


On a long enough time scale, I could certainly see local storage becoming just another layer of cache. It would just hold recently used data which has been persisted or accessed from the SAFENetwork.

I believe it has to be SAFENetwork, or something very like it, to provide the requisite feature set and openness (private, authentic, available, performant etc). I certainly wouldn’t want to entrust all my data with a private company, but an impartial, autonomous data network would fit the bill.

I think we are already heading towards this. We rely on cloud services to get many things on demand now. However, vendor lock in, privacy, availability and security are not at a level to replace local storage yet.


Yes, but that’s the wrong mindset. Local storage should be just another layer of cache already. My computer can switch off at any moment, and whatever is in its memory is lost forever. (Personal experience, not just a metaphor: my laptop is dying, and it turns itself off sometimes.) If it were just a cache, I would reboot and continue. Maybe a second is lost. No big deal.

Sure, but we need the network infrastructure to support this. SAFENetwork is the closest thing I have seen to a solution to deliver this in the future.


Yes! It was impossible before, on a “planet-wide” scale, with multiple layers of encryption. Before: a future dream. Now: a close possibility.

Let’s not miss it!


Each update requires work to be done by the network. This is small but certainly not insignificant. If everyone typing causes MD updates for every character or word or sentence, then the traffic across the world’s networks would become immense indeed.


I know this can be a problem if not mitigated. Ideas:

  • Natural solution: everybody who is typing is also processing. Not 100% true, because only vaults process data; still, more people typing means more vaults processing the typing.
  • Batch or pay: “Pay once. Append until the block is full, but no more than 20 times. Then pay again.”
  • Streaming: “Open a block. Start sending data. Finish when it is full or the connection breaks.” This takes fewer resources than rebuilding the connection each time.

“Domains” handle persistence. The user can decide which strategy to use for which content (Domain: work, personal, important, “stuff”, etc.) to balance cost and benefit. Save every minute? Save every keystroke? The user chooses, and the network sends the bill (:
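As a sketch of what a Domain-side policy could look like (all names are invented, and commit_append stands in for whatever the real paid network update would be):

```rust
use std::time::{Duration, Instant};

/// Per-Domain persistence policy, chosen (and paid for) by the user.
struct FlushPolicy {
    max_buffered: usize, // flush when this many bytes are pending...
    max_age: Duration,   // ...or when the oldest pending byte is this old
}

struct DomainBuffer {
    policy: FlushPolicy,
    pending: Vec<u8>,
    oldest: Option<Instant>,
}

impl DomainBuffer {
    fn write(&mut self, data: &[u8]) {
        if self.pending.is_empty() {
            self.oldest = Some(Instant::now());
        }
        self.pending.extend_from_slice(data);
        if self.should_flush() {
            self.flush();
        }
    }

    fn should_flush(&self) -> bool {
        self.pending.len() >= self.policy.max_buffered
            || self.oldest.map_or(false, |t| t.elapsed() >= self.policy.max_age)
    }

    fn flush(&mut self) {
        commit_append(&self.pending); // one paid update per flush, not per keystroke
        self.pending.clear();
        self.oldest = None;
    }
}

/// Stand-in for the actual (paid) append to a mutable data block.
fn commit_append(data: &[u8]) {
    println!("flushing {} bytes to the network", data.len());
}
```

A “save every keystroke” Domain would set max_buffered to 1; a cheap Domain would set it to a few kilobytes or minutes (and a real Domain would also flush on a timer, not only when new data arrives).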


If you read the docs, any update to an MD requires consensus and updating of the MD on disk. Just like actual disk writes, where you have to write the whole block (multiple sectors), an MD has to have the data you want to update it with in order to process the update. And just like a hard drive, where streaming still means buffering up a block and then writing it physically to disk, so too SAFE has to have the client buffer up the data and issue a write/update request to the network.

So if you were to ensure every character typed is stored then an update has to be requested for every character.

To do “streaming” you are caching the data on the PC till you have enough to write/update an MD.

Remember that streaming to physical storage on block devices is done block by block and not byte by byte. Byte-by-byte streaming is possible on tape and on any hardware simulating streaming storage (cache till a block can be written). For block devices, though, streaming is at a level above the physical storage so that caching can be implemented. Actually, true tape streaming finally went out in the 80’s and was replaced by tape devices that wrote in blocks (introduced long before that with mag tapes).


I had a feeling this would be a problem.

Maybe: some data is sent, marked as incomplete. The group buffers it: it is not fully persisted yet, but it is already on multiple computers, so it has high durability. When the stream is finished, the connection breaks, or it times out, the block gets really stored. Only once.

It’s a useful thing. It could improve computing so much. I’m sure it can be done. The SAFE Network is already impossible, right? :slight_smile:

Massive attack vector.

The attacker simply sends these “incomplete” requests across millions/billions of MDs and every section has multiple requests incomplete. Each one taking up each node’s resources and eventually the network clogs up with all these incomplete requests waiting to be complete. Even if you use events to “time out” these incomplete requests it is still taking up valuable node resources and loading down the sections.


But you pay first. For the whole block. Sounds expensive.

It would be, but we are talking about tying up a lot of resources.

It is expected that 1 safecoin will buy a massive amount of PUTs

So when an attacker tries to fill up storage space, they have to pay for each chunk or MD, and as space becomes scarcer the cost per PUT goes up and their SAFEcoins are used up more quickly. And as spare space becomes critical it will be one coin per PUT.

But when an attacker makes “incomplete” requests, no actual chunk or MD storage is used and the incomplete request is kept in the section nodes’ queues waiting for the rest. So the cost of each such request does not go up, since spare space does not change. So while the farming rate is at its minimum (PUT cost cheapest), an attacker might get billions of such requests for each SAFEcoin. Thus it becomes an effective attack, since it ties up the active resources (queues, memory, local storage) of each section’s nodes.

See the difference?

IMO, you would just treat local storage as a buffer, which gets flushed to the SAFENetwork when some goals are reached (size of data, time since last write, etc). If the SAFENetwork lags a little, I don’t think that would be a concern in most cases. Ofc, multi-user or important transactional data may be an issue, but perhaps those data types would be committed asap and block until completion.


It worked like that before, when mutable data was implemented with SDs. But not anymore, since they have been replaced with MDs, which have a slightly different set of features, notably a payment needed for each update.

I don’t understand the problem; I don’t think there is one. Forgive me, but it feels like you are trying to find the one way it can’t work, when it’s easy to find ways it can.

Why? When somebody starts an MD block, it can be reserved. Why do you think that is hard to do? Reserving it is the right thing anyway, so nobody else can use the address you paid for.

  1. Open and reserve MD. (Needs payment first.)
  2. Buffer data as it is arriving. (Can be done efficiently.)
  3. Save and close MD when done / timeout / disconnect.

Defending against excessive storage use is simple:

  • Mmap the MD storage.
    • The maximum size is known, so that much can be “reserved” without any memory really being used yet.
    • The OS handles paging, so memory is not bogged down while no data arrives.
  • Sparse files: no disk space is used until content arrives. (See the sketch after this list.)
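A tiny Unix-specific sketch of the sparse-file part (the file name and the 1 GiB cap are invented): reserving the maximum size up front costs no disk blocks until data actually arrives, and memory-mapping such a file gives the same lazy behavior for RAM:

```rust
// Unix-specific sketch; file name and size cap are invented.
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};
use std::os::unix::fs::MetadataExt;

fn main() -> std::io::Result<()> {
    let mut f = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("reserved_md.bin")?;

    // "Reserve" the maximum MD size up front: the file becomes sparse,
    // so no disk blocks are allocated yet.
    const MAX_MD_SIZE: u64 = 1 << 30; // hypothetical 1 GiB cap
    f.set_len(MAX_MD_SIZE)?;
    println!(
        "apparent size: {} bytes, actually allocated: {} bytes",
        f.metadata()?.len(),
        f.metadata()?.blocks() * 512
    );

    // Only when buffered data arrives does the filesystem allocate real blocks.
    f.seek(SeekFrom::Start(0))?;
    f.write_all(b"first buffered piece")?;
    f.sync_all()?;
    println!("allocated after write: {} bytes", f.metadata()?.blocks() * 512);
    Ok(())
}
```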

Linux epoll has O(1) complexity per event, and other OSes have copied or can copy this, so the number of open connections doesn’t matter much. And it is impossible to target a specific node, right? So it is not a problem.

That defeats the purpose: it is no longer durable if it is not replicated immediately.

It depends how durable you need your remote store to be. As long as your local store can restore and apply changes when connectivity is restored, it may be robust enough.

I’d say it depends very much on the requirements. I suspect a moderately less robust system may be substantially cheaper to run. For an average home or business user, having a remote store which is 5 minutes stale (unless flushed) may be fine, especially if it is far cheaper.

Then it is no step forward. False advertising: “persistence”, in quotes. A sense of security, with no security behind it.

If you have only one copy, you don’t really have any copy. Local storage is often volatile memory: mobile devices use solid-state storage, and temporary data can’t go there because wear is an issue. Store it in 2-3 memories (or drives) and durability becomes almost infinite.
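A back-of-the-envelope illustration (the failure rate is a made-up number, only to show the shape of the argument): if a single store loses your data in a given year with probability 10^-3, then three independent stores all losing it is about (10^-3)^3 = 10^-9, i.e. one expected loss per billion years, assuming the failures really are independent.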

The requirements are a moving target. Cheap and expensive, durable and fragile are relative: a) store my shopping list; b) store my life savings. For b) I can pay more, and I want higher durability.

There is no “average home or business user”. It is disrespectful to design for the average and ignore the specific, because who decides you’re not average enough? My mom’s life savings are more important than the president’s grocery shopping list. Which one is more average?

So it has to be flexible. For most, 5 minutes stale is fine; they pay less, because they use the network less. But some want fully durable storage within a second; they pay more, because they use more network resources. I (and the quoted document) promote choice: “the user decides” what is good for him for a given purpose (and whether he can afford to pay the price).


The SAFE Network is a revolution. We need to dream big and get it really right; it may be a once-in-a-lifetime opportunity. It’s harder to fix something that already has momentum. Better to get it right while it’s still slow.

This is easy

This is not hard at all and could be done BUT

It cannot be done in a way that does not tie up node resources and buffer space. Each MD that is in this mode needs extra resources in each node, and each byte coming in has to be authenticated with consensus reached, otherwise someone could hijack the process; so each byte has to be proved (if you do true byte-by-byte streaming).

So I am sure you can see the massive attack vector here. It would not take all that many of these MDs to swamp the resources (processing and memory).

Also, what about buffer overruns? Now each client has to keep track of the number of bytes they have sent over the internet to those MDs.

Now if 100 million people are online at once and each one has every byte they type or PROCESS sent over the internet to be kept this way, then imagine the TCP/IP or uTP packets being sent. At 5 characters per second (60 words per minute) that is 500 million packets generated per second.

Then think of all the APPs processing 10K or 100K or 1 MByte per second, with each byte streamed to these MDs to keep each state change recorded.

Sorry the attack vector is huge, but then proper usage is huge too.

It’s the state that has to be kept, the consensus on each byte being done (to ensure a valid stream), the packet rate since each byte is its own internet packet. The increase in load would probably make it impossible for many home users.

How I hate “cannot be done” statements. Why not say “I couldn’t figure out how to do it”? More honest. The whole SAFE Network was “cannot be done”.

But your argument is a simple DoS attack. You don’t need my idea for that. If the SAFE Network can’t handle paid traffic of any size, that’s a problem with or without this thing. I’m sure it will handle it. It is designed to.

I don’t. Maybe “streaming” was the wrong word; “slow sending” is more appropriate.

Only a single change would be necessary on the server side: a “rolling signature” multiplexed into the data stream: one piece of data, then a signature over everything before it. The user can send each byte authenticated. But if every 1-byte piece of data is followed by a big signature, it’s very expensive, because the signatures are 99% of the block. An attack on himself, maybe?
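A minimal sketch of the idea; the sign and verify callbacks stand in for whatever signature scheme would really be used, so this only shows the framing, not a protocol:

```rust
/// One element of the multiplexed stream: either a piece of payload data or
/// a signature covering every byte sent so far.
enum StreamItem {
    Data(Vec<u8>),
    Signature(Vec<u8>), // signs the concatenation of all Data sent before it
}

/// Sender side: append a piece of data, then a rolling signature over the
/// whole prefix. `sign` is a placeholder for the client's real signing key.
fn send_piece(
    stream: &mut Vec<StreamItem>,
    sent_so_far: &mut Vec<u8>,
    piece: &[u8],
    sign: &dyn Fn(&[u8]) -> Vec<u8>,
) {
    sent_so_far.extend_from_slice(piece);
    stream.push(StreamItem::Data(piece.to_vec()));
    stream.push(StreamItem::Signature(sign(sent_so_far)));
}

/// Receiver side: accept data only while every rolling signature checks out
/// against the accumulated prefix.
fn accept(
    stream: &[StreamItem],
    verify: &dyn Fn(&[u8], &[u8]) -> bool,
) -> Option<Vec<u8>> {
    let mut received = Vec::new();
    for item in stream {
        match item {
            StreamItem::Data(d) => received.extend_from_slice(d),
            StreamItem::Signature(sig) => {
                if !verify(&received, sig) {
                    return None; // hijack attempt: drop the stream
                }
            }
        }
    }
    Some(received)
}
```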

But it can be done already, with no change: an MD already has multiple segments. Just add another segment when new data arrives. That is how it works now; the only difference is what purpose it is used for.

Please understand: everything is paid for, so everybody has to decide: a) how much granularity do I need? b) how much can I afford to pay? Most don’t want 1-byte persistence. Most want “Google Docs persistence”: “save my changes after every few words, or a few seconds after I stop typing.” And if they are poor, they will want less.

Not each byte: each piece of data. If a piece is 1 byte long, you can’t fit many bytes in one block, because the signatures are 99% of the block. Very expensive. But you can do it if you want.
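To put a rough number on it (assuming 64-byte signatures, e.g. Ed25519; my example, not a statement about what the network uses): 1 byte of data followed by a 64-byte rolling signature means 64/65 ≈ 98.5% of the stored bytes are signatures, while 100-byte pieces drop the overhead to 64/164 ≈ 39%.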

In my still-fresh “rolling signature” version, the check is done after each piece.

With the current MD implementation it is just an update request. If you do it byte by byte, that’s okay. But you pay a lot for that byte.

Other things:

No. Memory is not used until it is touched. malloc reserves address space, but (with overcommit, as on Linux) physical pages are only allocated when written to. Mmap a 10 TB file and nothing is used: sparse files.

Processing: you misunderstood. There is no byte-by-byte, except as an (expensive) option. Most would use cheaper settings. So an attacker has to pay more and more for a bigger and bigger attack. But wait! Then it is not a “massive attack vector”, just normal use of paid resources.

Absolutely not. Don’t send snapshots; they are wasteful and opaque. Send transformations. You edit a picture: why send each pixel? You just send “brighten picture by 10%”. Much better: a) less data; b) more semantics. The final result is easily recoverable, but so is each step! Sometimes you save snapshots because it’s convenient. Lightroom does this.
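As a toy sketch of “send transformations, not pixels” (the operation names are made up for illustration):

```rust
/// A semantic edit step: a few bytes on the wire instead of megapixels.
/// Operation names are invented for illustration.
enum ImageEdit {
    Brighten { percent: i8 },
    Crop { x: u32, y: u32, w: u32, h: u32 },
    Rotate { degrees: i16 },
}

/// The stored document: the untouched original plus the log of edits.
/// Any version is recoverable by replaying a prefix of the log; a rendered
/// snapshot can be cached when convenient, as Lightroom does.
struct Photo {
    original: Vec<u8>,               // raw sensor data, never modified
    edits: Vec<ImageEdit>,           // what actually gets appended on each change
    rendered_cache: Option<Vec<u8>>, // optional convenience snapshot
}
```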

Human input is already low bandwidth. No problem there.

I don’t know what to say. This is normal use: when there is a buffer, programs keep count. Always. That’s what programs do. What is your argument?

No, to say that would be dishonest. I know it will tie up resources; that is basic resource management. If you have ever written buffering, or disk management, or operating systems, or low-level file handling, then you would know why I can say this.

You have to keep track of what MDs are waiting for more data. If you don’t then you cannot claim to be keeping an operation open.

If you wish to completely redesign how SAFE works, then maybe you could have a system where it doesn’t need extra resources, since you treat every file as open. But in that case you need resources for every MD, which kind of defeats the purpose by needing active resources for every MD instead of only the ones being “streamed” to. And of course then the section is very restricted.

We are working with Windows/Linux/Apple operating systems. I did work once with an IBM system that did everything to disk, and you could even kick the cord out and simply plug it back in and it would resume, even mid-update. But to do that required a special operating system with specialised hardware designs.

In the case of SAFE we don’t have that so I can safely say that it “cannot be done” without tying up resources.

If you do as others have said and buffer up till you get a reasonable amount of data OR data is not forthcoming, then your idea is not too bad. MDs allow for fields where you can add (or update) a field (up to 1000 atm) with data. Of course this is still an update and has to be paid for. But then of course you have gone from streaming to block storing.

I won’t comment on the rest of your post, since it is an attempt to justify why resources are not needed if you have a system where an MD is opened for “streaming” and filled up over time. You simply do not show that you understand the needs of such a system.

This is an example of the gap between what you think and what reality is:

ptr=malloc(size);
reserves size bytes of storage and sets the pointer ptr to point to the start of it

The very definition of malloc shows you are in error.

Keeping a file (MD) actively open waiting for future writes very definitely needs resources to handle it.

Ah, maybe you misread what I said. I stated the issue and what is needed to handle it.

In simple terms, if your code in the client does not keep track, then the MD will overrun and you will have data loss. So your client-side code needs to handle it, and effectively you are blocking, not streaming.

But I see that you have gone away from streaming to blocking data anyhow.

So I guess you are now going to be caching data till there is a convenient or necessary time to update the MD with the block.
