Grouping vaults by owner, to strengthen redundancy?

Let’s imagine I’m a large-scale farmer. I have dozens of terabytes of storage spread over dozens of vaults, in several different networks, in several different countries.

When another user PUTs data on SAFE, chances are some of it will hit one of my vaults.
In fact, the more storage I provide, the higher the likelihood that I will hold several replicas of a single chunk on my systems.

This is quite obviously bad. If I decide to quit farming, I might take chunks out of the network for good, because their replicas are all stored on vaults owned by me. Or even if I just take out a single server (with a lot of vaults and a lot of storage on its own), I might cause data loss or at least significant churn because several replicas per chunk need to be recreated.

A different issue is the following: I don’t want a full copy of a file - that’s at least one replica of each chunk, and the decryption key - stored on my machines. Depending on the country I’m in, that might make me responsible for the contents of it. And let’s face it, there will be data illegal in one or more countries stored on SAFE.

Other networks that involve nodes handling random data for other unauthenticated, anonymous nodes have measures against such issues in place. The Tor network has a family tag and it will not form circuits including more than one member of a family. Freenet may do something similar, I’m not sure.

Does MaidSafe have ways to group vaults by owner so the network tries to treat them as one when it comes to whether to store chunks on them? Alternatively, am I completely misunderstanding something? :stuck_out_tongue:
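To make the question concrete, here is a minimal sketch of what owner-aware placement could look like. It is purely hypothetical: the function name `pick_replica_vaults`, the idea of a self-declared owner tag, and the dictionary layout are assumptions made for illustration, loosely modelled on Tor's family tag, and do not describe how SAFE actually places chunks.

```python
import random

def pick_replica_vaults(vaults, replicas=4, rng=random):
    """Pick `replicas` vaults so that no two chosen vaults share an owner tag.

    `vaults` maps a vault id to an owner tag (hypothetical). Untagged vaults
    use None and are each treated as belonging to a separate owner.
    """
    candidates = list(vaults)
    rng.shuffle(candidates)  # stand-in for whatever randomness the network uses
    chosen, seen_owners = [], set()
    for vault_id in candidates:
        owner = vaults[vault_id]
        if owner is not None and owner in seen_owners:
            continue  # this owner already holds a replica of the chunk
        chosen.append(vault_id)
        if owner is not None:
            seen_owners.add(owner)
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough distinct owners to place all replicas")

# Three vaults tagged as one farm, one tagged as another, three untagged.
vaults = {"v1": "farmA", "v2": "farmA", "v3": "farmA",
          "v4": None, "v5": None, "v6": "farmB", "v7": None}
print(pick_replica_vaults(vaults))
```

With this rule, at most one of the three `farmA` vaults can end up holding a replica of any given chunk.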

No, we have no idea who owns vaults, but the choice of storage is initially extremely random, and then the data moves to its best spot on the network, with multiple copies going to different places. The network decides this in real time based on access, cache, online time, group ranks and more. We cannot get any algorithm that comes close to modelling this, as much of it is human controlled (access, going on- and offline, etc.). The chance of you storing a complete file is incredibly small, and there would be no way of knowing really unless you owned the file, as each chunk is encrypted again on store with a mechanism we cannot guess ahead of time. So even if you knew the chunk you were looking for, it would be stored under a different name and encrypted again (so different content). It’s all a bit confusing, but the chunks at rest are not the chunks needed to recreate a file; they require a valid GET request to allow them to be decrypted and put in flight back to the requester. The requester does not know and cannot tell who was storing the chunk.

  • You need to be able to take down all four copies for anyone to potentially notice (most people wouldn’t notice anything as their data is mostly garbage anyway)
  • Not only do you need all four copies, but you need all four copies of the same chunk of the same file.

You might as well buy a lottery ticket. Nice try, though.

That is utterly absurd.

  • You have no clue what you have stored in the vault, so you can’t “object”
  • Even if you did have something illegal, you couldn’t tell. Who was ever arrested for “allowing” his Skype to forward conversations and chats from known terrorists?

(FYI, there’s a discussion on the topic of tagging on the forum so if you searched you could have found it.)

@dirvine “A bit confusing”, whoa, that’s an understatement :wink: Probably another safety level. Is this extra level of encryption done by the vault itself? Or are other vaults or managers responsible for that? Can I do a read-out of my vault and see which chunks are inside?

What do you mean by “which”?

Say the vault on my computer is storing 20,000 chunks. Can I see them, scan them? See their names? I know they will look like random data, but when there’s a very popular music clip on the SAFE network, and someone does self-encryption on that video and shares the chunks, can I spot one of these popular chunks in my vault if it’s in there?

How is the metadata (what chunks there are, decryption keys…) for a publicly shared file stored?
This metadata is also distributed around the network somehow, correct?
Then it follows that if you had access to that metadata, identifying how to fetch and decrypt a file, and you had copies of all this data on your own drives, you wouldn’t need external help from the network to read a file stored in your vaults.

Of course this is all highly unlikely if you have 500GB of capacity in a multi-PB network.
But I’ve seen at least one person on here who plans to add about 100TB of space over 30 vaults.
Especially in the early days, this would certainly be a significant percentage of the entire SAFE network, and those are the cases I’m concerned about.

Chunks are up to 1MB in size, correct?
Then for small files, it doesn’t seem too unlikely that a person with 100TB of space may store all the chunks of an entire file, and for large files, several - or even all - copies of the same chunk.

I did search, and didn’t find it. Would you mind linking to it?

Yes :smile:

No, you cannot tell what you hold, neither name nor content :smile:

The vault holds data that is obfuscated based on the original name. When the network does a GET on your vault with the original name, it goes through a decryption process to establish whether you have the original content.

If you are talking about public data only, then you could go to some lengths and try to work out whether you have a chunk. Not easy, but possible; not for private data, though. If it is public data then there is much less to worry about, I would imagine. It’s private data I am much more concerned about staying private. Public data by default would be identifiable at some stage, so that should not be a surprise. The issue is whether you know you are storing it, and no, you don’t, unless you go to lengths beyond what a normal user would do.

privacy of public data is a whole other subject :wink:

It is very unlikely unless the network is very, very small or very imbalanced. Our tests so far show you need circa 3x the network size to achieve a group (one chunk replica). Of course, if you have infinite resources it will always work eventually, but it gets a bit mad at that stage.

Yeah, I am only talking about public data. Private data (i.e. data that requires information not stored on the network to access) is of course inaccessible to people who don’t have that information, I’m aware of that :wink:

I also didn’t mean to imply that it would be easy/obvious, but, while I am not a lawyer, I could imagine governments attempting to force large farmers to go to any extent necessary to identify what they are storing, and that’s what worries me.

Thank you, that is relieving to hear. Running 75% of the network does indeed seem a little unlikely :slight_smile:
That said, I am planning to contribute many terabytes of space on several hundred Mbit/s myself once the network is advanced enough for realistic usage (depending on whether I have the funding for it and predict farming covering my costs), and I’d be happy to do real-world tests then.

https://forum.autonomi.community/t/potential-way-to-weed-out-illegal-content/568/53

If you plan to have 100TB what makes you think you’ll be the only one?
Also, it only takes a handful of losers like myself donating a small fraction of their new HDDs (100 folks with 30% of their new 4TB HDD) to offer more capacity than you’ll have.
I predict you will never have more than 5% of the network capacity, even if you launch on Day 1.
As I mentioned above you’d need 70+ % in various locations around the world.

Yes, if it’s public data (not encrypted), you could compare the chunks you hold against chunks of the same file you generated yourself and realize they both belong to the same file, but doing that would take a huge amount of compute power and yield relatively useless knowledge, since the files are apparently public. Furthermore, one could append random garbage to files to make this kind of search and comparison much less likely to succeed in a timely manner.
All this, by the way, was discussed on the forum several months ago.

And when one of the systems loses all its data due to a failed HDD, your entire ranking gets damaged.
Surely that would be a hugely popular feature! :slight_smile:

I was never talking about tagging data. Just vaults.

I’d say if you’re offering 1.2TB of your hard drive, you’re more dedicated than the average user already :wink:
Anyway, let’s do the maths.
Under the assumption that distribution is initially random (in practice, it would probably shift towards large-scale commercial farmers pretty quickly, as they have 24/7 uptime and much larger bandwidth reserves than the average user), the total storage space on SAFE is 1 PB on day 1 (I mean, I’d love that too, but I think that’s quite an optimistic estimate), and you provide 100TB of space yourself:

When a chunk replica is uploaded, the probability of you not getting it is 100% minus your share of total storage space (as long as distribution is indeed random), so 90%. As 4 replicas are uploaded, the probability of you not getting a single one of them is 90%^4, so 65.61%.

So for every chunk uploaded, you have a 35% chance of getting at least one copy.

For a small file of 3 chunks, your probability of getting a full copy of it is then 35%^3, so roughly 4.2%, and the inverse, the probability of you not getting a full copy, is 95.8%.

Let’s say someone uploads 100 photos which all fit into 3 chunks each. The probability of you not getting at least one full file then is 95.8%^100, which is… about 1.37%.
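The estimate above can be reproduced with a short script. This is only a sketch under the same assumptions as the post: replicas are placed uniformly at random in proportion to storage share, there are 4 replicas per chunk, files have 3 chunks, and the farmer holds 100TB out of 1PB. The function names are made up for illustration. Without the intermediate rounding, the figures come out slightly different from the ones quoted, but they lead to the same conclusion.

```python
def p_at_least_one_replica(share, replicas=4):
    """Chance of holding at least one replica of a given chunk."""
    return 1 - (1 - share) ** replicas

def p_full_file(share, chunks=3, replicas=4):
    """Chance of holding at least one replica of every chunk of a file."""
    return p_at_least_one_replica(share, replicas) ** chunks

def p_at_least_one_full_file(share, files, chunks=3, replicas=4):
    """Chance that at least one of `files` uploads ends up fully held."""
    return 1 - (1 - p_full_file(share, chunks, replicas)) ** files

share = 100 / 1000  # 100TB out of an assumed 1PB network
print(f"per chunk:          {p_at_least_one_replica(share):.2%}")
print(f"per 3-chunk file:   {p_full_file(share):.2%}")
print(f"any of 100 files:   {p_at_least_one_full_file(share, 100):.2%}")
```

Changing `share` shows how quickly the numbers fall as the network grows relative to a single farmer.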

@dirvine has stated that in practical tests this isn’t a problem, so I’m probably wrong somewhere, and I’d appreciate it if someone were to point out where. It’s not like I want this problem to be real :wink:

Getting slightly off topic here, but if you run a server farm, you don’t let a hard drive failure destroy your data :slight_smile:

Vault tagging: okay, I misread that.
Couldn’t a colluding group tag a bunch of vaults with the same tag thereby rendering the network unable to store enough copies? Also as a big farmer I would never tag my vaults because that would mean less business for me.
So I doubt the idea is feasible.

You’d have a full copy with only one chunk. You mean all 4 replicas. Okay.

For private data you can’t tell which vaults (if any) hold those chunks. You’d have to destroy 100% of your investment (assuming you’re renting the h/w) in vaults and you would still

  • Not be 100% sure that you could cause data loss
  • Impacted power users would lose just a fraction of their data
  • it would require enormous concentration (10%) across multiple geographically distributed locations

That doesn’t sound too concerning to me, but then again I am a guy who’s been consistently stating on this forum that I would only store copies of my data on the network. If the disappearance of a file bothered me, I’d upload it again.

Sure but you’re already disadvantaged by the fact that you’re running a professional operation (you pay for hosting, maybe tax, etc.) vs. people who run at homes to offset the cost of their HDD which they had to buy anyway.
Now on top of that you want to sacrifice another 35% of your capacity on RAID6. I don’t think you’ll get very far with that approach… Did I mention your competitors also don’t pay anything for the bandwidth?
But if you’re so sure that’s great - rest assured that the rat race will commence as soon as it becomes clearer (say, in beta) how the financials would work out.
Anyone can spawn 100TB worth of “fat” EC2 instances at a short notice :wink:

hs1.8xlarge $3968 $2.24 per Hour $5997 $1.81 per Hour

But my biggest doubt is reserved for the idea that running a hosted farm can compete with the storage from the same provider. How? If someone like Amazon charges you X and you and MaidSafe add a 10% markup on top of that, how can that be competitive vs. the same user saving his data to S3?

I took a quick look at Google’s pricing and each GB costs $0.04 to store (per month) and $0.01 to serve (once). If you store 1TB and serve it once a month, that’s $600/year, or $60K for your little 100TB farm. Why would anyone pay you $70K for service they can get from The Unevil Co. for $60K/year? Maybe my calculation is wrong but I also tried this wizard (http://calculator.s3.amazonaws.com/index.html) and 1TB of data storage with 10TB of monthly transfers to the Internet costs $1,300/month, 90% of which is bandwidth charges. Sure, you could say you won’t need that much, but how much exactly will you need? If you just store data and keep it there, you won’t make any money. And in any case, to download my 1TB of garbage backups I have on S3 it’d cost me $10, but to do the same on your farm hosted there would have to cost at least $12. And you’d be getting just 1/4 of my requests.
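For anyone who wants to check the first calculation, here is a minimal sketch. It assumes the two per-GB prices quoted above, 1TB = 1000GB, and a 100TB farm whose full capacity is served once a month; the constant and function names are invented for the example.

```python
STORE_PER_GB_MONTH = 0.04  # quoted storage price per GB per month
SERVE_PER_GB = 0.01        # quoted price per GB served once

def yearly_cost(capacity_gb, serves_per_month=1):
    """Yearly cost of storing `capacity_gb` and serving all of it
    `serves_per_month` times each month."""
    monthly = capacity_gb * (STORE_PER_GB_MONTH + SERVE_PER_GB * serves_per_month)
    return monthly * 12

print(f"1TB:   ${yearly_cost(1_000):,.0f} per year")
print(f"100TB: ${yearly_cost(100_000):,.0f} per year")
```

Raising `serves_per_month` shows how quickly transfer charges come to dominate the bill.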

Okay so this was a random rant but what ties it together is my belief that many posters think that not dealing with $ or other fiat gives them magical economic powers. It doesn’t.

One last OT observation: I noticed how Amazon asks for an up-front payment for those instances (see the $3968 figure in column 2 above). Not knowing whether you’ll ever get any requests for data that will be stored on your box will turn into a big gamble on multiple factors (who does what, how SAFE fares, etc).
I do not believe that farmer concentration and professional farming will be as pronounced as is the case with Bitcoin miners and their pools.

Only if those “tags” are open to anyone. If you need to, say, present a certificate to associate your vault with others, the worst you could do is prevent the network from storing as much data as you’d like on your own vaults - you couldn’t decrease the capacity everyone else has.

Not necessarily, the business you get would just come from more different places. Subject to good implementation, of course.
Besides, you might have to do that to avoid legal action or getting your servers shut down for not attempting to identify “bad” data - who knows what the legislators and courts of the world will come up with.

For a copy of a file that is split in three chunks (which is, I believe, the minimum number of chunks a file may be split into), each of which would be replicated four times, you’d need one replica of each of the three chunks that make up the file.

I’m concerned about both situations - one person getting a full public file, enabling them to know what data is stored on their drives, and one person getting all replicas of one or more chunks, compromising redundancy.
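The second situation can be estimated the same way as the first. The sketch below assumes uniformly random placement, 4 replicas per chunk and a 10% storage share; the function names and the figure of one million chunks are assumptions chosen for illustration only.

```python
def p_all_replicas(share, replicas=4):
    """Chance that every replica of a given chunk lands on one owner's vaults."""
    return share ** replicas

def expected_chunks_fully_held(share, total_chunks, replicas=4):
    """Expected number of chunks whose replicas are all held by that owner."""
    return total_chunks * p_all_replicas(share, replicas)

share = 0.10  # assumed: 100TB out of a 1PB network
print(f"per chunk: {p_all_replicas(share):.4%}")
print(f"expected among 1,000,000 chunks: {expected_chunks_fully_held(share, 1_000_000):.0f}")
```

The per-chunk chance is tiny, but it is multiplied by the number of chunks on the network, which is why the question is about large farmers on a small network in particular.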

Fair enough.

Also, not gonna argue whether large-scale commercial farming is viable/competitive in this thread, we can take that to another thread or PMs if you’d like. I will say that I’m excited for how things turn out in the beta and beyond though :wink:

Do you have any plans for splitting the HDDs and assigning them to multiple vaults?

I’m not sure yet, maybe I will. I’ve yet to read up on the advantages of such an approach.

From: Price discovery for purchase of network resources - #56 by fergish

I asked the question, as vault images will be made available as ‘docker files’ which will be an easy way to split a hdd into many vaults.

I see. I’ll do that then.
The technical side should be no problem, anyway. I assume we’ll be able to specify resource limits in a config file?