Mutable Data Caching

intrz · December 11, 2017, 11:18pm

As I understand it, Mutable Data is not currently cached (right?).

I don’t see anything about caching in the RFC. Are there plans to add caching to Mutable Data? If not, wouldn’t they be prone to DDoS attacks? Especially the DNS system, which is based on Mutable Data, would be something an attacker might want to attack to make an app/site temporarily inaccessible.

edit: looks like there’s something called mutable data cache, so maybe caching of Mutable Data is implemented already after all.

edit2: I wonder though, does the mutable data cache automatically scale with incoming traffic?

neo · December 12, 2017, 12:08am

I’ve been wondering myself how mutable data is to be cached.

It cannot be cached like immutable data, since a cache could hold an old copy of mutable data, which can’t happen with immutable data. So I guess a check on the MD version number is needed, with the cached copy of that MD flushed if it turns out to be an old version.

But this would still leave it open to a DDoS, though a much harder one, since only the version number is retrieved to check cache validity.

Traktion · December 12, 2017, 8:26am

I have wondered about this too. It would be good to hear more from the dev team.

urrtag · December 12, 2017, 10:24am

You don’t need to check the version on reads; checking it on writes is enough. Say you want to increment a value. You would:

load the old value (loads the version implicitly)
increment the value (implicitly increment the version)
store the value (abort the store if old_version != new_version - 1)
on abort: repeat
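
The steps above can be sketched as an optimistic retry loop. This is only an illustration: `VersionedStore`, `VersionConflict` and `increment` are hypothetical names for an in-memory stand-in, not the SAFE Network API.

```python
class VersionConflict(Exception):
    """Raised when a store is attempted against a stale version."""


class VersionedStore:
    """Hypothetical in-memory stand-in for one Mutable Data entry."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0

    def load(self):
        # Reading returns the value together with its version.
        return self.value, self.version

    def store(self, new_value, new_version):
        # Abort the store if old_version != new_version - 1.
        if self.version != new_version - 1:
            raise VersionConflict(
                f"expected version {self.version + 1}, got {new_version}"
            )
        self.value = new_value
        self.version = new_version


def increment(store, max_retries=10):
    """Load, increment, store; repeat on abort."""
    for _ in range(max_retries):
        value, version = store.load()
        try:
            store.store(value + 1, version + 1)
            return value + 1
        except VersionConflict:
            continue  # someone else wrote first: reload and retry
    raise RuntimeError("too many conflicting writes")
```

A reader working from a slightly stale cached copy simply has its write rejected and retries against the current version, which is why the read path itself does not need a version check in this scheme.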

That way you can cache mutable data for a short time period (60s). There will be network latency so you will be dealing with potentially outdated data anyway.

neo · December 12, 2017, 10:43am

You could, and I thought about that. But we have yet to see how long cached data can last before it is considered too old. It’s possible you could be receiving cached data from a node that has had very little transferred through it for 10 minutes, so the cached MD is 10 minutes old.

Even 60 seconds is way too long for many applications, including the classic example of airline reservations.

Traktion · December 12, 2017, 1:04pm

I suppose the caching node could always check whether the hash of the source data is unchanged. This would be quicker than requesting the whole file (if unchanged) from the source.

neo · December 12, 2017, 1:12pm

For MD data you only need to check the version number of the MD. That is less overhead than computing hashes.

Traktion · December 12, 2017, 2:39pm

I should have read this before replying!

If the client has some skin in the game, it would put them off spamming these requests. I suppose it is a balance of performance for honest requests vs malicious requests. You could get the client to sign the request or some such, to rebalance the effort perhaps.

Anyone from @maidsafe available to comment?

urrtag · December 12, 2017, 3:51pm

But if you need “current” data, wouldn’t pubsub be the perfect solution for that? I assume somebody who needs the very latest version of a data set is interested in having it streamed to them on changes. Otherwise periodic polling would be needed to get the “current” version of the data => uncool

oetyng · December 12, 2017, 5:09pm

It’s hard to get around this: having both consistency and protection from DDoS via GET.

Pubsub might be something for it.
Not from an MD (its managing group) to all subscribers, though, but rather out to various distances along the cache tree (formed by the request paths).
End consumers would still be able to do GET requests, which would fetch relatively fresh data from a cache, or optionally also use subscription for pure push all the way.

Once a node has served some data, in addition to caching it, it would subscribe to updates to it from the sender, and so on recursively for all nodes back to the source. On any conjunction of paths, there would naturally be only one subscription upwards, and the update would split out to the subscribers at that point.
The subscription would have the same TTL as the cache.
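
A rough sketch of this idea, under stated assumptions: `Node`, `get`, `publish` and the explicit `now` argument are all hypothetical, time is passed in by hand to keep the example deterministic, and nothing here reflects how SAFE Network routing is actually implemented.

```python
class Node:
    """Hypothetical relay node; the data source is a Node with upstream=None."""

    def __init__(self, name, upstream=None, ttl=60):
        self.name = name
        self.upstream = upstream
        self.ttl = ttl
        self.data = {}         # key -> (value, expires_at)
        self.subscribers = {}  # key -> {downstream node: expires_at}
        self.upstream_requests = 0

    def get(self, key, now, requester=None):
        if requester is not None:
            # Serving data also registers a subscription with the cache TTL.
            self.subscribers.setdefault(key, {})[requester] = now + self.ttl
        entry = self.data.get(key)
        if self.upstream is not None and (entry is None or entry[1] <= now):
            # Cache miss or expired copy: one request (and one subscription)
            # upwards, however many downstream paths join at this node.
            self.upstream_requests += 1
            value = self.upstream.get(key, now, requester=self)
            self.data[key] = (value, now + self.ttl)
        return self.data[key][0]

    def publish(self, key, value, now):
        if self.upstream is None:
            expires_at = float("inf")       # the source never expires its own data
        elif key in self.data:
            expires_at = self.data[key][1]  # a push does not extend the TTL
        else:
            return                          # nothing cached here, nothing to update
        self.data[key] = (value, expires_at)
        # Drop expired subscriptions, then split the update out to the rest.
        live = {n: t for n, t in self.subscribers.get(key, {}).items() if t > now}
        self.subscribers[key] = live
        for node in live:
            node.publish(key, value, now)
```

With a chain A -> B -> C (C being the source), a `get` at A populates both caches and both subscriptions; a later `publish` at C within the TTL reaches A without any new request, and after the TTL the subscription has lapsed and A falls back to an ordinary fetch.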

urrtag · December 12, 2017, 5:24pm

neo · December 12, 2017, 10:29pm

Not for the earlier example of airline reservations, where they don’t know what they want until they need it. There could be millions of records involved worldwide, and it’s rather random which record they need. And getting a cached copy even 10 seconds old can see a seat double booked.

So there needs to be a way for the cached copy of an MD to be invalidated if it is out of date.

Yes there is lag time for consensus but the section will give up the latest version even if some nodes are still updating their copy.

Why polling? This is an on-demand situation, not streaming. Granted, streaming is more forgiving of “out of date” copies.

Having the cache check the version number is a much quicker way to check the validity of the MD. The cache code could have a function that supplies the version it holds to the section, and the section returns either “uptodate” or a new copy of the MD.

Remember, the cache only needs to do this when a request for its cached copy of the MD is made. It does not have to do it regularly.
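
As a minimal sketch of that exchange, assuming a hypothetical `Section.get_if_newer` call (no such function is confirmed to exist in the network code), the cache sends the version it holds and gets back either nothing or a fresh copy:

```python
class Section:
    """Hypothetical stand-in for the section holding the authoritative MD."""

    def __init__(self):
        self.entries = {}      # name -> (version, data)
        self.full_fetches = 0  # how often the whole MD had to be sent

    def put(self, name, data):
        version = self.entries.get(name, (-1, None))[0] + 1
        self.entries[name] = (version, data)

    def get_if_newer(self, name, known_version):
        version, data = self.entries[name]
        if version == known_version:
            return None  # "uptodate": only a tiny reply travels back
        self.full_fetches += 1
        return version, data


class CacheNode:
    def __init__(self, section):
        self.section = section
        self.cache = {}  # name -> (version, data)

    def get(self, name):
        # Validate only when asked for the data, never on a timer.
        known_version = self.cache.get(name, (None, None))[0]
        fresh = self.section.get_if_newer(name, known_version)
        if fresh is not None:
            self.cache[name] = fresh
        return self.cache[name][1]
```

The version check still travels all the way to the section on every request, so this trades bandwidth for freshness rather than removing the round trip.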

neo · December 12, 2017, 10:41pm

Caching is only meant to speed things up and reduce load on the sections/groups

It is filled as people request data and the data is returned. There is no need to keep the cache up to date since it can do that if and only if the data is requested via that node hop.

So keeping the cache up to date is really a waste since in a lot of cases it will not be used. Why increase the bandwidth to keep a cache up to date when it can check only if asked.

And I made a slight mistake in my post above. The request for the MD can have the version number retrieved first, so any cache holding an older version will not give up its copy of the MD; it clears it out and simply stores the new one as it hops through that node.

oetyng · December 12, 2017, 11:08pm

Yes, naturally caching is about speeding up and reducing load.
Caching has several implications though. For one, by reducing load it reduces the DoS attack vector, as the attacking sources would end up draining their closest nodes instead of the entire route to the data source.
It is even a quite often mentioned benefit of caching. So there is a security side to it also.

I’m not sure why you now say there is no need to keep cache up to date, considering this:

But, anyhow, reservation systems are often designed to not need 100% consistency, they even do overbooking, and solve it in other ways. There is always eventual consistency, all data is always stale. When you see it on the screen, it might change in the db, even if you sit right on top of the db.

Anyway, what I was writing about was that having both consistency and protection from DoS is tricky. The moment you send a request all the way to the source, even if it is just to check the version (which of course needs to go all the way, since the next node is also stale), you open up a vector, because the source can then be overloaded (an attacker needs more resources than if the full data, rather than just the version, were fetched, but that only changes the cost by a constant factor).
If you wanted to remove that altogether (not considering how difficult pubsub would be to implement in SAFENetwork), then a subscription for data updates would almost totally cut off the possibility of DoS. (The tiny shred of attack vector left is requesting the data again once the TTL has run out, i.e. the attacker could send a request at TTL frequency. But that would be a very expensive and weak attack…)

Really, this depends.
What serves most is to look at it from a probabilistic viewpoint.
If data in the network is requested 10 times on average within a TTL, and the data is updated 3 times on average during the same period, then using pubsub consumes fewer resources. These are of course just random figures. We don’t know any real averages (well, maybe someone can dig up some research on it, would be interesting to see), but at least we can say that if NumberOfRequests converges with NumberOfUpdates, then pubsub is not going to give lower load (and not higher either). Whenever NumberOfRequests > NumberOfUpdates, there is a possible gain with pubsub.
Again, all other complexities of its implementation aside.

neo · December 12, 2017, 11:21pm

Because when the new copy is requested then at that time the cache is updated. No need to constantly update the cache “just in case” since that is wasting bandwidth.

The idea is that in 99+% of cases the MD is not going to be updated within the lifetime of its cached copy. So regularly checking (every second or every few seconds) whether the cached MD is up to date is extra load that is not needed, since if the cache is not up to date it will be updated automatically as the new copy of the MD hops through the node.

So the procedure to keep the cache up to date is already there (well hopefully done right) without the need for the cache itself to regularly check.

oetyng · December 12, 2017, 11:37pm

Oh, but I was not talking about the cache regularly checking, i.e. polling. That for sure would be the worst of all solutions.
I meant that the source pushes updates to the set of nodes that most recently (within the general cache TTL) requested this data, and they do the same, all the way out to the source of the requests (as long as it’s within the TTL, after that the subscription expires).

Whenever NumberOfRequests > NumberOfUpdates, there is then a possible gain with pubsub. If it is equal or less, as you say it will probably be, there is no gain. No dispute over that.

Just to illustrate:
Data is requested by A, it is at C and passes B. B has a copy.

Now if A requests the data x times within the cache time to live (TTL), and a version request goes from B to C every time, then that consumes more than push whenever x > y, where y is the number of updates. With push, the data would go y times C => B => A, so dataSize would be transferred over the distance y times. Without push, using version checks, we have round trips: versionRequestSize would go x times A => B => C, and responseSize x-1 times C => B => A, then once more in the same direction with the data (if the version request returns the data when it has changed).

If cached data is, more often than not, requested only once and not again within the cache TTL, then pubsub is more demanding.
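
To make the comparison concrete, here is a small back-of-the-envelope model. The function names, the `changed` parameter (how many version checks come back with full data instead of a short reply) and every size below are assumptions made up for illustration; they are not measurements.

```python
def push_cost(updates, data_size, hops=2):
    """Bytes moved when every update is pushed C => B => A."""
    return updates * data_size * hops


def version_check_cost(requests, version_request_size, response_size,
                       data_size, changed=1, hops=2):
    """Bytes moved when every request triggers a version check up to C."""
    changed = min(changed, requests)  # cannot return data more often than asked
    upstream = requests * version_request_size * hops
    downstream = ((requests - changed) * response_size
                  + changed * data_size) * hops
    return upstream + downstream


# Made-up sizes: 50-byte version request, 50-byte reply, 1000-byte MD.
many_reads = version_check_cost(10, 50, 50, 1000, changed=3)  # 10 requests, 3 updates
one_read = version_check_cost(1, 50, 50, 1000, changed=3)     # 1 request, 3 updates
pushed = push_cost(3, 1000)                                   # 3 updates pushed
```

With these particular numbers, `many_reads` is 7700 against 6000 for `pushed`, while `one_read` is 2100, so push only pays off once the data is requested several times per TTL. The exact crossover depends on the message sizes as well as on the request and update counts.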

neo · December 13, 2017, 12:05am

You would not even have to do that. Just let the normal cache filling occur. If an MD comes through a node and the MD is a later version then update it.

The section in charge of an MD does not know which nodes are caching MDs out in the network. The section does not need to know, and tracking that would only add complexity. Imagine the traffic if nodes reported back every time an MD had been flushed from their caches. That doesn’t happen and is not needed.

Your example only illustrates why at this time caching might not occur for MD data.

oetyng · December 13, 2017, 12:19am

Well, have to, no. I was just interested in the difference between pubsub and no-pubsub with regards to consistency vs closing DoS attack vector when using cache. And, the point I was making is that with cache, that always queries all the way to data source for version check, the vector is not closed to the same extent as is often thought (when caching and DoS is mentioned), and with pubsub in combination with caching, you actually do come a lot farther in closing that attack vector.
I am not even saying that MDs should be cached; that is out of scope.

A subscription would be held at a caching node (the publisher), so the node expires it, just as it expires the cached data itself. There is no reporting from the subscriber.

Anyway, it’s still this:
Whenever NumberOfRequests > NumberOfUpdates, there is then a possible gain with pubsub. If it is equal or less, as you say it will probably be, there is no gain. No dispute over that.

Also, still all complexities with pubsub implementation aside (also out of the scope).

neo · December 13, 2017, 12:25am

So true and I mentioned that earlier on in my posts.

Sorry, I am not as familiar with pubsub or the APIs as I should be. I have not really delved as deep as I should have.

I really would like to know what the plans are for MD caching if any

Traktion · December 13, 2017, 2:02pm

You could just invalidate the caches of the subscribers, rather than updating them. If the caching node is asked for the data again, it can then either provide it or retrieve it and provide (if expired).
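
A minimal sketch of invalidate-instead-of-update, with hypothetical `Source` and `CachingNode` classes (TTL expiry is left out to keep it short):

```python
class Source:
    """Hypothetical holder of the authoritative data."""

    def __init__(self):
        self.values = {}
        self.subscribers = {}  # key -> set of caching nodes
        self.fetches = 0       # how often the full data was sent

    def fetch(self, key, node):
        # Fetching the data also subscribes the node to invalidations.
        self.subscribers.setdefault(key, set()).add(node)
        self.fetches += 1
        return self.values[key]

    def update(self, key, value):
        self.values[key] = value
        # Send only a small invalidation notice, not the new data.
        for node in self.subscribers.pop(key, set()):
            node.invalidate(key)


class CachingNode:
    def __init__(self, source):
        self.source = source
        self.cache = {}

    def invalidate(self, key):
        self.cache.pop(key, None)

    def get(self, key):
        if key not in self.cache:
            # Retrieve (and re-subscribe) only when somebody asks again.
            self.cache[key] = self.source.fetch(key, self)
        return self.cache[key]
```

Several updates in a row then cost one invalidation notice and at most one later fetch, and nothing at all if nobody asks for the data again.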
