Data Density Attack

There's an attack that is expensive to mount but may cause serious and permanent problems for the network, so it's worth discussing and seeing what it's about.

Goal

The goal of the attacker is to cause significant disruption to the network. There is no direct gain for the attacker, so hopefully the threat is not too high.

Description

The crux of the problem is that data on the network is expected to be evenly distributed across vault names and data names. Network operations like joining and relocating have this balanced-vaults and balanced-data assumption built in (well, kind of!).

If an attacker uploads single chunks at a time (i.e. many 1 MB files), they can choose the content of those files / chunks so that the resulting names (the hashes of the content) target a specific part of the network.
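To make that brute-force step concrete, here is a minimal sketch of forging a chunk whose name lands under a chosen prefix, assuming names are simply the SHA-256 hash of the content (the live network's hash function and name length may differ):

```python
import hashlib

def forge_chunk(target_prefix_bits: str, chunk_size: int = 1024 * 1024) -> bytes:
    """Brute-force chunk content whose name (the hash of the content) starts
    with the given bit prefix. SHA-256 is an assumption for the sketch."""
    filler = b"\x00" * (chunk_size - 8)        # the attacker doesn't care what the bulk is
    nonce = 0
    while True:
        chunk = filler + nonce.to_bytes(8, "big")   # vary only the last 8 bytes
        name = hashlib.sha256(chunk).digest()
        bits = bin(int.from_bytes(name, "big"))[2:].zfill(256)
        if bits.startswith(target_prefix_bits):
            return chunk
        nonce += 1

# Target the part of the network whose prefix is 1011:
chunk = forge_chunk("1011")   # ~2**4 = 16 hash attempts on average
```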

This means that one part of the network will hold much more data than the rest, and the vaults in that part will have much more work to do than other vaults.

This data imbalance is probably ok (not ideal but still tolerable). But if there is a merge between the data-heavy section and a ‘normal’ section, the ‘normal’ vaults need a lot of spare space to be able to facilitate the merge.

Any vaults without enough resources to merge will be booted off the network, meaning a high chance of cascading merges.

Furthermore, this presents a problem for the current understanding of ‘balance’ on the network. Difference in section prefix length is assumed to be unhealthy, but I think difference in data density is even less healthy. It’s probably normal for prefixes to become uneven because there will be times where data is unevenly distributed, so the structure of the network needs to compensate for that.

If one part of the network is very heavy on data because it was subject to a data density attack, then it makes sense for there to be more vaults in that part of the network to manage the load. This naturally leads to a difference in section prefix length.

My proposal is to remove ‘similarity of prefix length’ as a metric for network health and replace it with ‘similarity of data density’ per vault. The reason for desiring similarity in data density is to avoid merges of uneven data loads that may cause vaults to be kicked off the network and possibly cascade the merge.
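To illustrate the difference between the two metrics, here is a rough sketch (all numbers and field names are invented for the example, not taken from datachain_sim): a part of the network that absorbed a density attack and split an extra time looks unhealthy by MaxPrefixLenDiff but healthy by per-vault data density.

```python
def max_prefix_len_diff(sections):
    """Current-style health metric: spread of section prefix lengths."""
    lengths = [len(s["prefix"]) for s in sections]
    return max(lengths) - min(lengths)

def max_density_ratio(sections):
    """Proposed-style health metric: spread of stored data per vault."""
    densities = [s["stored_bytes"] / s["num_vaults"] for s in sections]
    return max(densities) / min(densities)

sections = [
    {"prefix": "00",  "stored_bytes": 4_000_000_000,  "num_vaults": 10},
    {"prefix": "01",  "stored_bytes": 3_600_000_000,  "num_vaults": 9},
    {"prefix": "100", "stored_bytes": 10_000_000_000, "num_vaults": 24},  # attacked area,
    {"prefix": "101", "stored_bytes": 9_500_000_000,  "num_vaults": 23},  # extra vaults moved in
    {"prefix": "11",  "stored_bytes": 4_200_000_000,  "num_vaults": 11},
]

print(max_prefix_len_diff(sections))  # 1    -> "unbalanced" by prefix length
print(max_density_ratio(sections))    # ~1.1 -> balanced by data per vault
```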

This may have an impact on the current design work for relocating / joining / disallowing / killing vaults.

Expense

How expensive is this attack? It’s pretty expensive, since the imbalance of data needs to be significant compared to all other data entering the network. The more ‘normal’ data is stored on the network, the more data the attacker must upload to build up a significant imbalance.

It may take a lot of time and computational energy to generate chunks with a specific prefix. This effort becomes exponentially more difficult as the network grows (or as the attack is made more targeted), so smaller networks are more at risk than larger ones.
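As a rough back-of-the-envelope figure (assuming chunk names are uniformly distributed hashes): targeting a part of the network identified by a prefix of $p$ bits takes on average

$$E[\text{hash attempts per forged chunk}] = 2^{p},$$

so a 10-bit prefix costs roughly 1,024 hashes of 1 MB each per chunk, and every extra bit of targeting (for example after the targeted section splits) doubles that cost.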

It costs safecoin to store those chunks on the network, but this should ideally be a decreasing cost as the network grows.

Ideally the attacker would also have vaults in the right part of the network to trigger a merge, which is very difficult to achieve (akin to the google attack, but easier since sibling sections are still meaningful to be part of).

The effect of the attack lasts a long time after the attack stops since chunks remain permanently. Eventually the effect is diluted by the additional upload of broadly named data, but this takes time.

The effect, if well enough targeted, would not be diluted by splits, and may actually be made worse in some cases.

Ideas

The concept of a Balanced Network should aim for Resource Balance rather than the more abstract Naming Balance (I say this in the context of maidsafe/datachain_sim, which measures MaxPrefixLenDiff). This attack specifically concerns Storage Balance, but I think the point probably also applies to Bandwidth Balance, Latency Balance, etc.

The assumption that data will be evenly distributed and thus the assumption that vault names should be evenly distributed could be quite damaging.

This attack has impacts on the network understanding of spare space and requirements for how much spare space is needed before enacting incentives to bring more capacity online. Ideally the spare capacity is even across the network, but there may be times where capacity shouldn’t be evenly distributed.

This attack would indicate that random relocation is not good, and targeted / assisted relocation is necessary. I still think random relocation is overall preferable, but this attack is another point to add to the argument against it.

Concluding Thoughts

I’m not especially concerned about this attack in the real world, but cascading merges seem to be a big risk to overall network health, and it does seem to have implications for some of the current design philosophy being developed.

40 Likes

Would it be easier to do with MD data? How is the location for the MD data object determined?

Immutable data is based on its hash, but is MD data based on its address?

I ask because I don’t know, and if it is based on the address then this attack just became easier, in that one only has to store MDs with specific addresses.

7 Likes

Yes, it is.
However, such an address can easily be adjusted so that it is not defined by the user but given by the network.
More like: if a user wants to upload an MD, they request a name range from their account manager and then generate an MD name (address) within that range. This is similar to the current join/relocation procedure - you need to generate a name within the range defined by the network.
Such a range can be used to prevent this data density attack via MDs.
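A small sketch of what generating a name inside a network-given range could look like; the bit-prefix encoding of the range is an assumption for illustration:

```python
import secrets

def generate_md_name_in_range(range_prefix_bits: str, name_bits: int = 256) -> int:
    """Pick a random MD name whose leading bits equal the prefix the network
    assigned to us (the range encoding here is hypothetical)."""
    free_bits = name_bits - len(range_prefix_bits)
    prefix_value = int(range_prefix_bits, 2) << free_bits
    return prefix_value | secrets.randbits(free_bits)

# The account manager (hypothetically) hands out the range 0110...; the user
# still picks the concrete name, but only inside that range.
name = generate_md_name_in_range("0110")
```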

I remember we had explored an idea where the multiple copies of an ImmutableData chunk are stored at multiple levels of hash, i.e. hash(data), hash(hash(data)) …
That way, instead of having the multiple copies of an ImD in the same section, the copies are scattered across the network, which further reduces the chance of the data density attack and this cascading merge concern (I hope to a level that can be ignored :slight_smile: ).
This approach still needs to be investigated further, especially on how to retain copies when storage nodes leave the network.
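A small sketch of how the copy locations could be derived under that idea (SHA-256 stands in for whatever hash the network actually uses):

```python
import hashlib

def copy_addresses(data: bytes, num_copies: int = 4) -> list:
    """Copy 0 lives at hash(data), copy 1 at hash(hash(data)), and so on,
    so the replicas scatter across the address space instead of all sitting
    in the section that owns hash(data)."""
    addresses = []
    addr = hashlib.sha256(data).digest()
    for _ in range(num_copies):
        addresses.append(addr)
        addr = hashlib.sha256(addr).digest()
    return addresses

addrs = copy_addresses(b"...some 1MB chunk...")
```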

18 Likes

Currently client programs generate a hash from a user-defined name and this hash is the address of the MD. This ensures that MDs are randomly dispersed in the XOR space. But nothing prevents someone from creating a program that directly defines the address of an MD, so he could kill a specific node by overloading it with targeted addresses. So, yes, a solution must be implemented for MDs.

But not this solution, because it removes the possibility of deriving the address of an MD from a user-defined name. This is a distinctive feature of MDs compared to Immutable Data that shouldn’t be lost.

One possibility would be that the client passes the source name to the client manager, so that the latter can check that the address is really the hash of a name. The check would be done only at creation time and the source name wouldn’t be stored in the network. The update and get operations don’t need the source name.
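A minimal sketch of that creation-time check, with SHA-256 standing in for the actual hash and the message layout invented for the example:

```python
import hashlib

def md_address_from_name(source_name: str) -> bytes:
    """Client side: derive the MD address from the user-defined name."""
    return hashlib.sha256(source_name.encode("utf-8")).digest()

def creation_request_is_valid(claimed_address: bytes, source_name: str) -> bool:
    """Client-manager side, checked only at creation time: the claimed address
    must really be the hash of the supplied source name. The source name is
    not stored afterwards; updates and GETs never need it."""
    return claimed_address == md_address_from_name(source_name)

addr = md_address_from_name("my.domain.name")
assert creation_request_is_valid(addr, "my.domain.name")
```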

4 Likes

Sorry, I should have written that sentence in more detail.
The name is not given by the network; the network only defines a name range.
The user still has the possibility to define the name (address) of an MD, it is just that the name needs to be within that range.

I don’t think the network needs to care about how the address got derived?

1 Like

In theory, the hash of some small piece of data is indistinguishable from the hash of a large text.
So basically, you don’t need to guess X original texts for X hashes; you only need to hash recursively, and once the result hits the wanted address range, put the previous hash as the payload of an ImmutableData chunk and the current result becomes its name.
It’s still expensive, but much cheaper (and the payload much smaller) than guessing original content.
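A sketch of that recursive-hashing trick, again with SHA-256 as a stand-in: each attempt hashes only 32 bytes rather than a full 1 MB chunk, and the payload stored is just the previous digest.

```python
import hashlib

def forge_small_chunk(target_prefix_bits: str):
    """Hash repeatedly until a digest lands in the wanted range; the previous
    digest is then a tiny ImmutableData payload whose name (the hash of its
    content) is the matching digest."""
    prev = hashlib.sha256(b"seed").digest()
    while True:
        cur = hashlib.sha256(prev).digest()
        bits = bin(int.from_bytes(cur, "big"))[2:].zfill(256)
        if bits.startswith(target_prefix_bits):
            return prev, cur          # (payload, name)
        prev = cur

payload, name = forge_small_chunk("1011")   # 32-byte payload, targeted name
```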

2 Likes

I haven’t measured it in detail, but as an example, with this approach you may be able to generate one thousand 64-byte chunks in the same time it takes to generate one 1 MB chunk. This may cause more trouble for the section, as there are additional messages and overhead involved in replicating the extra entries.

However, I agree it is still expensive.

Regarding the coin cost involved, yes, that’s true. And I think that’s why mav mentioned that the motivation for such an attack is not high.

2 Likes

Be aware though that, in that case, no system will survive.

8 Likes

But that seems counter to what people want. For instance, I want tag type 1000001 (mytag) and address 1, then 2, then 3, then 4, because I want to be able to calculate the address to request in my application. I don’t want to have to update an index simply because the network no longer lets me specify the full address.

An example is a series of tokens, where it would be best if the address is the actual token number.

Why not simply take the hash of the tag + address to work out where to store the MD? At least targeting becomes more difficult, even if it is still easier than doing it with immutable data.

That sounds good.

Maybe something can be done to make it more difficult to control where the MD is stored AND allow APPs to be specific as to the tag+address for an MD

1 Like

How would this work for domain names? Say a client wants to resolve a domain; he would pass the domain name into a function domain_name_to_net_addr(domain). That domain_name_to_net_addr fn has to be deterministic, otherwise the net addr would change each time a request is made, and it needs to match up with the net addr generated at the PUT.

1 Like

An address can then still be manipulated to make the resulting hash target a particular space.

Well, I think you can always have an address book MD chunk to map the addresses, so addresses 1, 2, 3 will be translated into XXX, YYY, ZZZ by looking it up.
Also, once you have uploaded an MD successfully, its name is taken and won’t be revoked. So you don’t need to update an index, you only need to append to that address book each time you have a new MD to upload.
In a word, it will be up to the app to decide how to map addresses 1, 2, 3 (or a URL link) into the real name (address) of an MD.

Also, even if the network supports using the hash of address+tag directly, a particular app will still face the issue that the hash of a wanted address has already been occupied by someone else. So, my understanding is that an app-level address mapping is inevitable.
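For what it's worth, a toy sketch of such an app-level address book (a plain dict stands in for the MD that would actually hold the mapping):

```python
# Logical key (e.g. "token/1" or a URL) -> real MD name the network accepted.
address_book = {}

def register(logical_key, real_md_name):
    """Append-only: once an MD is uploaded its name is taken and never revoked,
    so entries are only ever added, never updated."""
    if logical_key in address_book:
        raise KeyError(f"{logical_key} is already mapped")
    address_book[logical_key] = real_md_name

def resolve(logical_key):
    return address_book[logical_key]

register("token/1", bytes(32))   # placeholder for whatever name was actually used
print(resolve("token/1"))
```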

1 Like

It was clear to me. Constraining the range of the address implies that we cannot derive it from a user-defined name anymore.

Not being able to freely generate names and using translation tables instead will break many apps.

In my solution, yes it does, to ensure a uniform address distribution. This solution doesn’t break applications; the client layer just has to pass the source name in addition to its hash.

3 Likes

@tfa: your approach of using a source name doesn’t eliminate the possibility that someone can manipulate the source name to make the resulting hash target a particular address space?

1 Like

Yes, but then this is also true for immutable data. Generating an address with a hash is one of the network’s founding principles. It was supposed to be secure enough thanks to the random data distribution it creates.

I don’t know what to say, except that a solution must be found that doesn’t break existing apps.

4 Likes

Just a wild idea: maybe we can discourage targeted PUTs by gradually increasing the cost of a PUT based on the amount of data already stored in the same section by the same user?
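For illustration only, one possible shape for such a pricing curve (the base price and the doubling interval are made-up numbers, not proposed values):

```python
def put_price(base_price, bytes_already_stored_in_section,
              doubling_every=100 * 1024 * 1024):
    """Price of the next PUT into a section: doubles for every additional
    `doubling_every` bytes the payer has already placed in that section."""
    return base_price * 2 ** (bytes_already_stored_in_section / doubling_every)

print(put_price(1.0, 0))                   # 1.0  -> normal price
print(put_price(1.0, 500 * 1024 * 1024))   # 32.0 -> attacker piling into one section
```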

8 Likes

I tend to agree. If there’s to be indirection, I’d think it should be handled network-side rather than client-side. For example, if a given section is struggling to cope with the volume of stored data (or client accounts for example) these could be offloaded to a quieter section with pointers held by the original section. The pointers would map individual chunks/accounts, or address ranges.

If the original section is struggling for bandwidth, the client could be redirected to the new section and deal directly there. If the issue is purely disk space and not bandwidth, the original section could just act as a proxy and not bother the client.

There are a few other ways to address this too (e.g. archiving data, storing data with an expiry time and paying more for long-living data, detecting and punishing malicious behaviour like storing many ImmutableData chunks with very similar names), but as far as I know, none of these are agreed or fleshed out.

1 Like

Good call, although it’s maybe not necessary to even specify the same user. An attacker would probably just try and use multiple clients. Any significant imbalance in data distribution across the address space probably indicates an attack, so charging everyone more seems fair, since honest users won’t be hitting the affected range very often.

12 Likes

I thought about this, wrt data imbalance, but I was thinking about relocation. Not necessarily a good solution.
Increasing PUT cost when a section is nearing an upper threshold is one possible thing.
There could be other things too, more efficient alone or together with other measures.

Unless you meant only those storing to this section, I wonder if this isn’t risky too. Increasing the PUT price (or causing any effect) globally can itself be an attacker’s goal, as it causes disturbances. Yes, it increases the cost for the attacker, and so maybe in this case makes the attack unfeasible anyway, but in general, allowing global effects as a response to malicious behaviour can backfire. If possible, effects should be contained.

2 Likes

Yes - I did mean just for that affected section. So an attacker targeting a single section faces ever-increasing PUT costs since he’s storing all his useless data there. But an honest user’s data would be spread evenly across sections, so he would be charged much less than the attacker since most of his data goes to sections which don’t have elevated charges.

7 Likes

I think this indirection should always exist, not just when a section is overloaded.

The hash of an ID chunk or an MD name wouldn’t be the address of the data anymore but would be the address of an indirection element that gives the real address of the data. This final address would be computed from the hash of the initial address concatenated with an invariant of the section, agreed between data managers at creation time (possibly rehashed several times until the destination section is far enough).
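A sketch of how that derivation might look, with SHA-256 standing in for the real hash and “far enough” reduced to “outside our own prefix” for simplicity:

```python
import hashlib

def real_storage_address(initial_address: bytes, section_invariant: bytes,
                         own_prefix_bits: str) -> bytes:
    """The indirection element at `initial_address` points here: the hash of
    the initial address concatenated with a section invariant, rehashed until
    the result falls outside this section's own prefix."""
    addr = hashlib.sha256(initial_address + section_invariant).digest()
    while bin(int.from_bytes(addr, "big"))[2:].zfill(256).startswith(own_prefix_bits):
        addr = hashlib.sha256(addr).digest()
    return addr
```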

This indirection would also solve the problem of data managers storing data whose id is the same one that is needed to get the data. An attack to get free safecoins is currently easy: launch a vault and a client program that issues GET requests over the ids stored in the vault.

Three years ago there was a topic about it. Gaming the farming reward. AFAIK this hasn’t been solved yet.

3 Likes