Yes, that timer was introduced to fix a bug, but it's likely wrong (as usual with timeouts) and the next few days will hopefully prove that. Also, a significant upgrade to user-facing logs is incoming, inspired and helped along by @happybeing's vdash.
It might help (to confirm QUIC is doing the right thing), but without full AE, endless errors are my only real concern. Without sync of the section actor (a bunch of nodes acting in unison) there will be endless errors, and this may fall into that category. So the much, much faster route is to get AE in place and then see what we have.
The weird thing is that after constantly telling me
user@SafeNetwork:~$ safe auth create --test-coins
Passphrase:
Password:
Sending request to authd to create a Safe...
Error: AuthdError: AuthenticatorError: Failed to store Safe on a Map: Insufficient balance to complete this operation
All of a sudden I got this response
user@SafeNetwork:~$ safe auth create --test-coins
Passphrase:
Password:
Sending request to authd to create a Safe...
Error: AuthdError: AuthenticatorError: Client data already exists
How come it already exists? Wasn't it timing out before?
And then, this behaviour:
user@SafeNetwork:~$ safe auth status
Sending request to authd to obtain a status report...
+------------------------------------------+-------+
| Safe Authenticator status | |
+------------------------------------------+-------+
| Authenticator daemon version | 0.5.0 |
+------------------------------------------+-------+
| Is there a Safe currently unlocked? | No |
+------------------------------------------+-------+
| Number of pending authorisation requests | 0 |
+------------------------------------------+-------+
| Number of notifications subscribers | 0 |
+------------------------------------------+-------+
user@SafeNetwork:~$ safe auth unlock
Passphrase:
Password:
Sending action request to authd to unlock the Safe...
Error: AuthdClientError: [Error] ClientError - Response not received: read error: connection closed: timed out
user@SafeNetwork:~$ safe auth status
Sending request to authd to obtain a status report...
+------------------------------------------+-------+
| Safe Authenticator status | |
+------------------------------------------+-------+
| Authenticator daemon version | 0.5.0 |
+------------------------------------------+-------+
| Is there a Safe currently unlocked? | Yes |
+------------------------------------------+-------+
| Number of pending authorisation requests | 0 |
+------------------------------------------+-------+
| Number of notifications subscribers | 0 |
+------------------------------------------+-------+
This is the issue with timeouts: they lie! Sometimes nodes will be waiting on some action, the timeout fires, and then just afterwards the OK response comes back and is swallowed.
It's a recurring mistake to use timeouts like this. For example, we used to have tests that checked a Store happened; they timed out and failed, yet the next test tried to Get that very data and passed.
It always, always happens with timeouts in an eventually consistent system. Engineers demand responses by a deadline, and the network is continually saying: cool, you will get the response.
It's like folk asking "when launch, when launch?" and us giving a time (a timeout); we miss the launch date, then launch anyway, and folk say it failed. It's bonkers.
Basically I would ignore almost all of these timeouts, as they are misleading and perhaps worse than that.
There is a very small set of places where timeouts can work, but almost always they are wrong.
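To make that failure mode concrete, here is a minimal sketch (assuming the tokio crate; the names are hypothetical and this is not the actual sn/qp2p code) of a client that gives up after a fixed deadline, so a success response arriving just afterwards is swallowed:

```rust
use std::time::Duration;
use tokio::{sync::oneshot, time::timeout};

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();

    // Simulated network: the Store actually succeeds, but the OK
    // arrives 150 ms after the client's 100 ms deadline.
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_millis(150)).await;
        let _ = tx.send("OK: stored");
    });

    // Client side: a fixed deadline turns a slow success into an "error",
    // and the reply that arrives later is dropped on the floor.
    match timeout(Duration::from_millis(100), rx).await {
        Ok(Ok(reply)) => println!("got reply: {}", reply),
        Ok(Err(_)) => println!("sender dropped"),
        Err(_) => println!("Error: timed out (but the Store still happened)"),
    }
}
```

The Store genuinely happened, which is why the subsequent Get in those old tests passed; the only thing that failed was the deadline.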
Insufficient balance is not a timeout.
I think accounts become corrupted in such a case: partially created.
You can't delete them, and you can't continue the creation process.
Nobody is saying we are not chasing bugs, though. Think of it like this: we released the Ford Model T with no suspension. Then we see things like bad cornering, a lumpy ride and more. Do we stop putting the suspension in place in order to fix the loose steering on corners?
Fixing bugs does not mean trying to fix every issue; it means (to me) finding the common problem that is causing all these other problems. Otherwise you will never finish the project.
So we need to gather all the data and then have an attack plan. So far this is all working out for us, and we have such a plan. This may be in addition to that, but if so we will see it and then attack it, as by then we will have fitted the suspension and the ride should be more comfortable.
Could be, or something like NAT/filter limits or timeouts.
When I was working for a local ISP I saw some "smart" routers and company firewalls doing really weird things to UDP traffic. Sometimes 1000 people have a broken connection, but only 1 or 2 of them do the specific thing that exposes the problem. When you are the only one, it is hard to make somebody do something about it.
If you want to do some testing, I would try a different ISP (for example a mobile hotspot) or a VPN from your PC to somewhere outside your ISP's network. If something changes, it may indicate a problem with UDP handling in your home router or in the ISP's network. If nothing changes, it is probably something else, or just the current state of the testnet itself.
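To quantify that comparison, a crude echo test against a box you control outside the ISP's network (a cheap VPS, say), run once over the home connection and once over the hotspot or VPN, shows whether UDP datagrams survive the round trip. A minimal sketch using only the Rust standard library; the port and VPS address are placeholders:

```rust
use std::net::UdpSocket;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // Run with the single argument "server" on the outside host;
    // run with no arguments from home (or over the hotspot/VPN).
    if std::env::args().nth(1).as_deref() == Some("server") {
        let sock = UdpSocket::bind("0.0.0.0:4000")?; // placeholder port
        let mut buf = [0u8; 64];
        loop {
            let (n, peer) = sock.recv_from(&mut buf)?;
            sock.send_to(&buf[..n], peer)?; // echo the datagram back
        }
    }

    // Client: send numbered datagrams, count how many echoes return.
    let sock = UdpSocket::bind("0.0.0.0:0")?;
    sock.set_read_timeout(Some(Duration::from_secs(1)))?;
    sock.connect("203.0.113.10:4000")?; // placeholder: your VPS address
    let mut received = 0u32;
    let mut buf = [0u8; 64];
    for i in 0..50u32 {
        sock.send(&i.to_be_bytes())?;
        if sock.recv(&mut buf).is_ok() {
            received += 1;
        }
    }
    println!("sent 50 datagrams, got {} echoes back", received);
    Ok(())
}
```

Heavy loss (or nothing at all) on the home path but a clean run over the hotspot/VPN would point at the router or ISP rather than the testnet.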
You may look at the Wireshark dump I posted before.
I think the messages (?) are actually lost logically, not physically.
In other words, some elders are lagging very badly for some reason.
(RAM overflow and swapping, or something similar - just a guess!)
I get the timed out error using a VPN too. Strange one, because on previous testnets my setup (BT Home Hub router, decent bandwidth, WiFi, Linux) has had no connectivity problems (Windows is another story however…)
Just tried again with a different VPN server and got a new error:
Error: AuthdClientError: Failed to establish connection with authd: [Error] ClientError - Failed to establish connection with remote QUIC endpoint: timed out
Tail of the log looks like this:
WARN 2021-04-11T13:20:58.718746784+01:00 [/home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/qp2p-0.11.7/src/connections.rs:260] Failed to read incoming message on bi-stream for peer 138.68.154.164:44520 with error: timed out
WARN 2021-04-11T13:20:58.734982824+01:00 [/home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/qp2p-0.11.7/src/connections.rs:220] Failed to read incoming message on uni-stream for peer 178.62.58.241:42766 with error: timed out
WARN 2021-04-11T13:20:58.735075145+01:00 [/home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/qp2p-0.11.7/src/connections.rs:260] Failed to read incoming message on bi-stream for peer 178.62.58.241:42766 with error: timed out
WARN 2021-04-11T13:22:27.076651868+01:00 [qjsonrpc/src/server_endpoint.rs:160] Failed to read incoming request: timed out
Hmm, not sure how to do that. The spreadsheet is mentioned in the Fleming Testnet Release OP, but perhaps it gets a bit lost among all the other stuff in there. If it's helpful to the Maidsafe devs we could make it a new topic?
Yes - it may also be helpful to separate the actual instructions for getting up and running out of the OP. It's pretty long, and we should have the essential getting-started info in a quickly accessible location we can link to.
I created a dev lab at home and used Terraform to spin up/down Alpine LXC containers and execute shell scripts for the various safe CLI commands, to test the consistency of those commands. I was attempting to PUT data on the network, as well as trying to get a node joined, and have yet to be successful.
I am using pfSense as the router of this dev lab, and enabling static ports on the outbound NAT mapping did help alleviate some of the receive-timeout errors I saw on the first day of the testnet release. I also enabled UPnP Port Mapping + NAT-PMP Port Mapping in pfSense (temporarily, as they are disabled by default), but didn't see any mapping being registered/triggered against that service within pfSense from the IP of the container that ran the safe commands. I was glad to see the IDS (Snort) didn't block any traffic associated with the container executing the safe CLI commands!
As mentioned above by Vort and Piluso, I too am seeing all of their error messages at some point when replaying the safe container scripts (from spin-up until destroy). 'safe auth create --test-coins' has yet to work for me, with 'Client data exists' being the most common failure. Commands such as 'safe keys create --test-coins --preload xxxx' have worked for me (on and off).
Even though I have yet to get a home node to join the testnet, my congrats to the Maidsafe team for this epic milestone. Looking forward to future iterations of the testnet with AE, to see if it helps solve some of these issues!
I'll add that I am also seeing these issues, though my network setup is much less involved: I am running in a VM on my local machine, behind a consumer Netgear router (Netgear R6700v3, latest firmware updates, UPnP enabled), and the behaviour is just as described by Shu/Vort/Piluso. I will be trying this on my local machine without a VM later today, to see if the double NAT is causing slowness issues for me.
Personally I have been able to get safe auth create --test-coins to work occasionally, and have not had the 'Client data exists' problem. I find that running auth commands before auth create --test-coins is very snappy, but even something simple like safe auth status is much slower after running auth create --test-coins.
I have made it as far as attaching a spendable balance to my wallet, but any attempts to retrieve/upload files from/to the network have failed for me. It is also worth confirming something else from the original post: sometimes the error thrown on the command line is Insufficient Balance, but, looking in the logs, it is actually a timeout error under the hood.
Timeouts aside it has been great to finally start getting familiar with these commands! Thanks to everyone for their hard work! Really hoping to get a node on the testnet soon!
After testing an install directly on local hardware and not a VM, I am still seeing the same mostly-consistent timeouts from the safe auth create --test-coins command. It doesn’t exactly rule out issues such as double NAT, but now I know the same issues are present outside a VM at least.
Can somebody tell me whether what I am seeing is normal? I haven't tried this when the testnet was running better, so I have nothing to compare it to.
I am running safe cat and watching the network traffic in Wireshark. I expected to see packets going out and never receiving a response, but what I see instead is a huge number of active rejections: ICMP type 3 (destination unreachable), code 3 (port unreachable). https://i.imgur.com/GfAzmSG.png
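For what it's worth, ICMP port unreachable is an active rejection: the datagram reached the remote host, but nothing was listening on that port. That is different from packets being silently dropped in transit, and it would fit the "lost logically, not physically" guess above. A minimal sketch of how to distinguish the two cases from a Linux box (the target address is a placeholder; this relies on the Linux behaviour where an ICMP rejection against a connected UDP socket surfaces as ConnectionRefused on the next recv):

```rust
use std::net::UdpSocket;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let sock = UdpSocket::bind("0.0.0.0:0")?;
    sock.set_read_timeout(Some(Duration::from_secs(2)))?;

    // Placeholder: point this at a node address seen in Wireshark.
    sock.connect("203.0.113.7:12000")?;
    sock.send(b"probe")?;

    let mut buf = [0u8; 64];
    match sock.recv(&mut buf) {
        Ok(n) => println!("got {} bytes back: something is listening", n),
        // ICMP port unreachable comes back as ConnectionRefused.
        Err(e) if e.kind() == std::io::ErrorKind::ConnectionRefused => {
            println!("port unreachable: host is up, but no process on that port")
        }
        // A read timeout surfaces as WouldBlock on Unix, TimedOut on Windows.
        Err(e) if e.kind() == std::io::ErrorKind::WouldBlock
            || e.kind() == std::io::ErrorKind::TimedOut => {
            println!("timed out: packets silently dropped somewhere")
        }
        Err(e) => println!("other error: {}", e),
    }
    Ok(())
}
```

If those rejections are coming from the remote node addresses, it suggests the machines are reachable but the node processes are gone, or are no longer bound to those ports.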