Introducing Verifi: regularly verify that uploaded files aren’t lost or corrupt

[No code has been written. I’m revealing my plan early in the hope of maximizing Verifi’s usefulness through collective input, and also because my dev machine is currently occupied with the Node Rewards Program. :smiley: When the Node Rewards Program ends in December, this plan will have been refined; I’ll begin on the code and should have an early build in a git repo ready for cloning after a week.]

Background

For a data+ network to succeed, it must be trustworthy. There are several aspects to trustworthiness, including security, integrity, and performance. Verifi is focused on integrity.

Verifi defines integrity as: uploaded files aren’t lost or corrupt. Although that can currently be accomplished using the autonomi CLI app, it’s a manual process that must be repeated for each upload. Verifi automates the process, enabling ongoing verification of any number of uploads.

Goals

Automatic verification that uploaded files aren’t lost or corrupt benefits the network in several ways:

  • Users, whether individuals or corporations, have peace of mind that the network has integrity, and that should an error occur at any time, they will know and can re-upload (and hopefully report the error to the developers).
  • Developers have additional tooling for validating design and code.
  • Testers have additional tooling for verifying and load-testing the network.

Organization

Verifi is written in open-source Rust and structured as both a CLI app and library. Although most usage of Verifi is through its CLI app, separating the core functionality into a library makes it more useful to other projects in the network’s ecosystem.
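
As a sketch of that split (the layout and names here are assumptions, not a final design):

```rust
// Hypothetical crate layout: one package exposing both a library and a binary.
//
//   src/lib.rs  - core types and verification logic, importable by other projects
//   src/main.rs - thin CLI wrapper around the library

// src/main.rs (sketch)
fn main() {
    // 1. Parse command-line arguments.
    // 2. Load the upload list(s).
    // 3. Call into the verifi library to verify each upload item.
    // 4. Print a notification for each verification error.
}
```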

Concepts

  • Notification - An alert informing a user of a verification error.
  • Upload item - An original file, its md5 hash, its upload address, and optionally a comment.
  • Upload list - One or more upload items. (A rough Rust sketch of these types follows.)
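
Roughly, these concepts might map to Rust types like the following (field names are assumptions, not final):

```rust
/// An alert informing a user of a verification error.
pub struct Notification {
    pub address: String,
    pub message: String,
}

/// An original file, its hash, its upload address, and optionally a comment.
pub struct UploadItem {
    /// Path to the original file on local disk.
    pub path: std::path::PathBuf,
    /// Hash recorded at upload time (md5 in this initial plan).
    pub hash: String,
    /// Network address the file was uploaded to.
    pub address: String,
    /// Optional free-form comment.
    pub comment: Option<String>,
}

/// One or more upload items.
pub type UploadList = Vec<UploadItem>;
```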

Features

  • Command-line arguments:
    • Either an upload list or path(s) to files containing an upload list.
    • Optional repetition delay (continuous, a fixed amount of time, or once), defaulting to running once and exiting.
  • Output notifications to the command-line. (A sketch of this interface follows.)
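
For illustration, the arguments could look something like this using the clap crate (flag names are placeholders, and clap’s derive feature is assumed):

```rust
use clap::Parser;

/// Verify that uploaded files aren't lost or corrupt.
/// (Placeholder interface, not a commitment.)
#[derive(Parser)]
struct Args {
    /// Upload items given inline, or paths to files containing an upload list.
    inputs: Vec<String>,

    /// Re-run verification every N seconds; omit to run once and exit.
    #[arg(long)]
    every: Option<u64>,
}

fn main() {
    let args = Args::parse();
    // Load the upload list(s) from `args.inputs`, verify each item,
    // print notifications for errors, and repeat per `args.every`.
}
```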

Stages

Stage 1

  • Integrate with the network by directly invoking the autonomi CLI app. (Sketched below.)
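
An early build might shell out with std::process::Command; note that the subcommand and argument order below are assumptions to be checked against the real autonomi CLI:

```rust
use std::process::Command;

/// Download an upload to `dest` by invoking the `autonomi` CLI.
/// The `file download` subcommand shown here is an assumption,
/// not a confirmed interface.
fn download_via_cli(address: &str, dest: &str) -> std::io::Result<bool> {
    let status = Command::new("autonomi")
        .args(["file", "download", address, dest])
        .status()?;
    Ok(status.success())
}
```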

Stage 2

  • Add an API network integration method.
  • Add optional concurrent verification. (Sketched below.)
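
For the concurrency, one possible shape using tokio (verify_item is a placeholder for the real download-and-compare step):

```rust
use tokio::task::JoinSet;

/// Placeholder for the real per-item check (download, hash, compare).
async fn verify_item(address: String) -> (String, bool) {
    (address, true)
}

/// Verify many upload items concurrently, returning the failed addresses.
async fn verify_concurrently(addresses: Vec<String>) -> Vec<String> {
    let mut set = JoinSet::new();
    for addr in addresses {
        set.spawn(verify_item(addr));
    }
    let mut failures = Vec::new();
    while let Some(joined) = set.join_next().await {
        if let Ok((addr, ok)) = joined {
            if !ok {
                failures.push(addr);
            }
        }
    }
    failures
}
```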

Stage 3

  • Add additional notification methods, such as playing a sound or sending an email. (A notifier sketch follows this list.)
  • Add OpenTelemetry support for diagnostics integration.
  • Engage FYEO (@TammyFYEO, @Brian_FYEO) in a security audit of the codebase.
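
The extra notification methods could hide behind a small trait; this is a sketch, not a settled design:

```rust
/// A pluggable delivery method for verification-error notifications.
/// (Hypothetical trait; sound and email implementations would come later.)
trait Notifier {
    fn notify(&self, message: &str);
}

/// The stage-1 default: print to the command line.
struct StdoutNotifier;

impl Notifier for StdoutNotifier {
    fn notify(&self, message: &str) {
        println!("verification error: {message}");
    }
}
```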

Notes

Verifi is hopefully a contribution to:

Moreover, they’re likely validating it too with regular downloads. - @Traktion

But I suspect that the up/downloading, especially the downloading is nowhere near the amounts we would see in the actual network. - @Toivo

What we are looking for is the chance to help where we can and taking part in structured testing directed by Autonomi is exactly what we should be doing…another team download and verify them. Tens of thousands of times. - @Southside

imo the Emerging Autonomi Community needs its own Test work to do development on… - @rreive

large uploads fail a lot of the time with being unable to get quotes and also the failed verification check. but I am guessing… - @aatonnomicc

Also tagging @happybeing, @neo, I hope you don’t mind.

Feedback

How does Verifi, and this plan, suck? How could they be better?

17 Likes

Sounds like a useful tool to help build confidence in the network.

I’m just doing things by hand: every few days I download all uploads and check the md5sums manually. Since we started testing I’ve never seen a bad md5sum of an upload.

Once a file is uploaded it has been rock stable and reliable.

Issues that I have had are with large files, 1GB and above, but I believe that’s being looked into by the team.

I’ll look forward to helping test this out when you get your dev machine back from the node olympics :slight_smile:

8 Likes

Sounds quite interesting, and a nice app for backups as well.

Thinking on this, are you planning on including metadata for the file, like creation date, date of upload, etc.?

Also, I’d go for sha512 over md5, or both if md5 is needed by some other program that wants to remain in the past.

8 Likes

Thoughts on backups: if the backup method added redundancy across chunks (parity) to be able to recover a missing chunk, it could make sense to run verification weekly, or perhaps as often as internet bandwidth would let me, and re-upload chunks as needed to maintain the redundancy level.

I wonder how the unlimited free GETs could tax the network due to these kinds of use-cases becoming too popular, by the way. It would be nice if we could just GET a checksum instead of a whole chunk.

3 Likes

If the chunk exists then it should be correct, because the hash is the address. Maybe the API will have a function where you can verify that the chunk exists.

4 Likes

Thanks for tagging me. This seems a well-thought-out plan and a decent idea. I’m not sure who or how many would use it - folk here certainly, until they’ve assured themselves the network retains data, and maybe Autonomi too for the same reason.

FYI, having reviewed the project and its usefulness to me, I decided that backup was what I wanted, so am working on that. So it might well overlap with Verifi :man_shrugging:

I’ll be saying more about my plan in due course but was very happy to find that the implementation ideas went full circle, because the route I chose to deliver backup happens to build on the work I did with awe and will extend beyond backup if all goes to plan. It’s also going to build on the Autonomi API and provide easier ways to build other apps which want to manage collections of data, not just files. I’m very excited about that part too and it finally got me into learning how to use Rust traits (with a little help from the Rust forum - which I mention for the sleuths among us).

So if you want to know more about what I’m up to let me know. I’ll be publishing an outline plan before long. It will all be AGPL licensed.

Meanwhile I’ll be happy to see how you get on with Verifi. Good luck!

11 Likes

Interesting idea and question. Do you happen to know if file metadata is preserved by the network? For example, if a file created last year is uploaded and then downloaded, will the creation date of the downloaded file be last year?

For the simplest user experience and deepest network integration, I intended to use whichever hashing algorithm was native to the network. From Verifying uploaded files - #2 by neo and, more so, Verifying uploaded files - #3 by Southside, I assumed that was md5, although perhaps that’s incorrect? Nevertheless, if there would be value in it, the hashing algorithm Verifi uses could certainly be configurable, defaulting to whichever is native to the network.

2 Likes

When I used the language “and can re-upload” in the topic, I had envisioned the user manually re-uploading or otherwise using their existing upload system. That said, automatic re-uploading, at least as an option, could indeed be useful. More thoughts on the idea in my reply to @happybeing below.

Hopefully tremendously. IMHO those friendly to the network should push it harder than those adversarial to it can, in order to help optimize it and strengthen it against attacks.

That reminds me of HTTP’s HEAD. It could very well be a good API feature for the network to have. I see it as outside the purpose of Verifi, however, as it relies on the network itself rather than being external tooling verifying the network’s and the uploads’ integrity.

3 Likes

Ideally, Verifi would remain lightweight and focused, avoiding duplication of effort with dedicated backup tools such as what you’re working on. @drirmbda suggests above a useful feature of automatically re-uploading files that fail verification.

What if your backup tool imported the Verifi library so it could offer both backups and automatic healing of files that fail verification at any point after the backup? I would very much welcome such a collaboration.

4 Likes

Hashing used by the network is sha256. But if you used the same, there would be no double check against the situation where sha256 says it’s OK but it isn’t, because a hash can very rarely double up. So using sha512 means that you are actually adding a check across the file that isn’t just duplicating what the network is doing already.

The network at this time is not storing metadata. I did suggest they allow user-defined metadata to be added to the datamap so the user can add metadata (not enforced values) if they want.

I was thinking you are adding an extra file with metadata because of this: you indicated you were already adding some metadata, which would have to be a separate file.

2 Likes

Using md5 is double work when chunks are self-verifying by design.

Besides, libp2p has a validation function, so, in my opinion, it would be enough to check the existence of the chunk.

2 Likes

I’m convinced. So if the default hashing algorithm of Verifi is sha512, would there be value in allowing the user to override that and specify an alternative algorithm to use?
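
For what it’s worth, a minimal sketch of what that override could look like, assuming the sha2 crate:

```rust
use sha2::{Digest, Sha256, Sha512};

/// User-selectable hash algorithm, defaulting to sha512.
enum HashAlgo {
    Sha256,
    Sha512,
}

/// Hex digest of `bytes` under the chosen algorithm.
fn hash_hex(algo: &HashAlgo, bytes: &[u8]) -> String {
    match algo {
        HashAlgo::Sha256 => format!("{:x}", Sha256::digest(bytes)),
        HashAlgo::Sha512 => format!("{:x}", Sha512::digest(bytes)),
    }
}
```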

An upload list file could be extended to support more dimensions of verification, such as file-creation-date metadata, but if the network doesn’t preserve that metadata, how could Verifi verify it? Or are you thinking that the metadata in the upload list file wouldn’t necessarily be used for verification, but could still be useful to the user, such as for restoring the file as it was when uploaded? (file permissions come to mind)

Correct. And that’s a keen observation: some of the metadata is a dimension of verification (hash algorithm and value) while some is not (the comment).

From how I’ve described Verifi, is it clear that upload list files would be local to the user and not something stored in the network?

That gives me an idea. Upload list files could themselves be uploaded to the network, and their network addresses could be passed to Verifi as a third option for the command-line argument specifying uploads to be verified. The user flow would be to run Verifi with the network address of the upload list file; Verifi downloads that file from the network, then uses its contents to download and verify its upload items. It’s like an integrity seed from which all of a user’s uploads could be verified. What do you think?
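
In code terms, the flow might be as simple as this (entirely hypothetical names):

```rust
/// The "integrity seed" flow: fetch the upload list itself from the
/// network, then verify everything it references.
fn verify_from_seed(seed_address: &str) -> std::io::Result<()> {
    // 1. Download the upload list file stored at `seed_address`.
    // 2. Parse its contents into an UploadList.
    // 3. Download each upload item and compare hashes.
    // 4. Emit a notification for each missing or corrupt file.
    Ok(())
}
```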

Not to bikeshed, but do you think upload list is good terminology? Perhaps piggybacking on network-native terminology is better: instead of calling it an upload list, call it an upload map?

1 Like

I don’t think redundancy is necessarily pointless. Redundancy is built into the network itself, with multiple copies of data being stored, as is a common tactic in fault-tolerance. Do you have another idea of how Verifi could ensure uploaded files aren’t lost or corrupt? I’m certainly open.

The issue that comes to mind is that users don’t work in chunks; they work in complete files. My intuition is that, when it gets down to brass tacks, integrity verification should work at the level users work at. What do you think?

Not as great as providing even the one.

But if you have the time, then I see no reason why some people would not prefer it. But then you also have to have the hash algos in place. Of course, at much expense, you could have some sort of add-on system where people could supply their own add-on or use one that someone else wrote (like browsers).

Yes, I was thinking along the lines of a file with all the metadata, hashes, etc. Either one meta file for each file or, in the case of backup, one file for the whole backup.

The verification done would be to check that the recorded hash agrees with the hash of the file once retrieved.

Not sure. But when you do your use-case research before starting to write/design, the correct name will come to you, or to the person who reviews your design & use cases.

2 Likes

Manifest could be the word you are looking for.

And I think all this is a great idea. If you don’t do it someone else will think of it and do it! And it’s great to see people thinking about these things and starting to think about building them.

3 Likes