Yes, that timer was introduced to fix a bug, but it's likely wrong (as usual with timeouts) and the next few days will hopefully prove that. Also, a significant upgrade to user-facing logs is incoming, inspired and helped along by @happybeing's vdash.
It might help (to confirm QUIC is doing the right thing), but without full AE, endless errors are my only real concern. Without sync of the section actor (a bunch of nodes acting in unison) there will be endless errors, and this may fall into that category. So the much, much faster route is to get AE in place and then see what we have.
The weird thing is that after constantly telling me
user@SafeNetwork:~$ safe auth create --test-coins
Passphrase:
Password:
Sending request to authd to create a Safe...
Error: AuthdError: AuthenticatorError: Failed to store Safe on a Map: Insufficient balance to complete this operation
All of a sudden I got this response
user@SafeNetwork:~$ safe auth create --test-coins
Passphrase:
Password:
Sending request to authd to create a Safe...
Error: AuthdError: AuthenticatorError: Client data already exists
How come it already exists? Wasn't it timing out before?
And then, this behaviour:
user@SafeNetwork:~$ safe auth status
Sending request to authd to obtain a status report...
+------------------------------------------+-------+
| Safe Authenticator status | |
+------------------------------------------+-------+
| Authenticator daemon version | 0.5.0 |
+------------------------------------------+-------+
| Is there a Safe currently unlocked? | No |
+------------------------------------------+-------+
| Number of pending authorisation requests | 0 |
+------------------------------------------+-------+
| Number of notifications subscribers | 0 |
+------------------------------------------+-------+
user@SafeNetwork:~$ safe auth unlock
Passphrase:
Password:
Sending action request to authd to unlock the Safe...
Error: AuthdClientError: [Error] ClientError - Response not received: read error: connection closed: timed out
user@SafeNetwork:~$ safe auth status
Sending request to authd to obtain a status report...
+------------------------------------------+-------+
| Safe Authenticator status | |
+------------------------------------------+-------+
| Authenticator daemon version | 0.5.0 |
+------------------------------------------+-------+
| Is there a Safe currently unlocked? | Yes |
+------------------------------------------+-------+
| Number of pending authorisation requests | 0 |
+------------------------------------------+-------+
| Number of notifications subscribers | 0 |
+------------------------------------------+-------+
This is the issue with timeouts: they lie! Sometimes nodes will be waiting on some action, the timeout fires, and then just afterwards the OK response comes back and is swallowed.
It's a recurring mistake to use timeouts like this. For example, we used to have tests that checked a Store happened; they timed out and failed, yet the next test tried to Get that very data and passed.
It always, always happens with timeouts in an eventually consistent system. Engineers demand responses by a deadline, and the network is continually saying: cool, you will get the response.
It's like folk asking "when launch, when launch?" and us giving a time (a timeout); we miss the launch date, then launch anyway, and folk say it failed. It's bonkers.
Basically I would ignore almost all of these timeouts, as they are misleading and perhaps worse than that.
There is a very small set of places where timeouts can work, but almost always they are wrong.
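To make that failure mode concrete, here is a minimal sketch (assuming the tokio crate; the names are hypothetical and this is not the actual sn/qp2p code) of a client that gives up after a fixed deadline, so a success response arriving just afterwards is swallowed:

```rust
use std::time::Duration;
use tokio::{sync::oneshot, time::timeout};

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();

    // Simulated network: the Store actually succeeds, but the OK
    // arrives 150 ms after the client's 100 ms deadline.
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_millis(150)).await;
        let _ = tx.send("OK: stored");
    });

    // Client side: a fixed deadline turns a slow success into an "error",
    // and the reply that arrives later is dropped on the floor.
    match timeout(Duration::from_millis(100), rx).await {
        Ok(Ok(reply)) => println!("got reply: {}", reply),
        Ok(Err(_)) => println!("sender dropped"),
        Err(_) => println!("Error: timed out (but the Store still happened)"),
    }
}
```

The Store genuinely happened, which is why the subsequent Get in those old tests passed; the only thing that failed was the deadline.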
Insufficient balance is not a timeout.
I think accounts become corrupted in such a case: partially created.
You can't delete them, and you can't continue the creation process.
Nobody is saying we are not chasing bugs, though. Think of it like this: we released the Ford Model T with no suspension. Then we see things like bad cornering, a lumpy ride and more. Do we stop putting the suspension in place in order to fix the loose steering on corners?
Fixing bugs does not mean trying to fix every issue; it means (to me) finding the common problem that is causing all these other problems. Otherwise you will never finish the project.
So we need to gather all the data and then have an attack plan. So far this is all working out for us, and we have such a plan. This may be in addition to that, but if so we will see it and then attack it, as by then we will have fitted the suspension and the ride should be more comfortable.
Could be, or something like NAT/filter limits or timeouts.
When I was working for a local ISP I saw some "smart" routers and company firewalls doing really weird things to UDP traffic. Sometimes 1000 people have a broken connection, but only 1 or 2 of them do the specific thing that exposes the problem. When you are the only one, it is hard to make somebody do something about it.
If you want to do some testing, I would try a different ISP (for example a mobile hotspot) or a VPN from your PC to somewhere outside your ISP's network. If something changes, it may indicate a problem with UDP handling in your home router or in the ISP's network. If nothing changes, it is probably something else, or just the current state of the testnet itself.
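To quantify that comparison, a crude echo test against a box you control outside the ISP's network (a cheap VPS, say), run once over the home connection and once over the hotspot or VPN, shows whether UDP datagrams survive the round trip. A minimal sketch using only the Rust standard library; the port and VPS address are placeholders:

```rust
use std::net::UdpSocket;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // Run with the single argument "server" on the outside host;
    // run with no arguments from home (or over the hotspot/VPN).
    if std::env::args().nth(1).as_deref() == Some("server") {
        let sock = UdpSocket::bind("0.0.0.0:4000")?; // placeholder port
        let mut buf = [0u8; 64];
        loop {
            let (n, peer) = sock.recv_from(&mut buf)?;
            sock.send_to(&buf[..n], peer)?; // echo the datagram back
        }
    }

    // Client: send numbered datagrams, count how many echoes return.
    let sock = UdpSocket::bind("0.0.0.0:0")?;
    sock.set_read_timeout(Some(Duration::from_secs(1)))?;
    sock.connect("203.0.113.10:4000")?; // placeholder: your VPS address
    let mut received = 0u32;
    let mut buf = [0u8; 64];
    for i in 0..50u32 {
        sock.send(&i.to_be_bytes())?;
        if sock.recv(&mut buf).is_ok() {
            received += 1;
        }
    }
    println!("sent 50 datagrams, got {} echoes back", received);
    Ok(())
}
```

Heavy loss (or nothing at all) on the home path but a clean run over the hotspot/VPN would point at the router or ISP rather than the testnet.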
You may look at the Wireshark dump I posted before.
I think the messages (?) are actually lost logically, not physically.
In other words, some elders are lagging very badly for some reason.
(RAM overflow and swapping, or something similar - just a guess!)
I get the timed out error using a VPN too. Strange one, because on previous testnets my setup (BT Home Hub router, decent bandwidth, WiFi, Linux) has had no connectivity problems (Windows is another story however…)
Just tried again with a different VPN server and got a new error:
Error: AuthdClientError: Failed to establish connection with authd: [Error] ClientError - Failed to establish connection with remote QUIC endpoint: timed out
Tail of the log looks like this:
WARN 2021-04-11T13:20:58.718746784+01:00 [/home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/qp2p-0.11.7/src/connections.rs:260] Failed to read incoming message on bi-stream for peer 138.68.154.164:44520 with error: timed out
WARN 2021-04-11T13:20:58.734982824+01:00 [/home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/qp2p-0.11.7/src/connections.rs:220] Failed to read incoming message on uni-stream for peer 178.62.58.241:42766 with error: timed out
WARN 2021-04-11T13:20:58.735075145+01:00 [/home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/qp2p-0.11.7/src/connections.rs:260] Failed to read incoming message on bi-stream for peer 178.62.58.241:42766 with error: timed out
WARN 2021-04-11T13:22:27.076651868+01:00 [qjsonrpc/src/server_endpoint.rs:160] Failed to read incoming request: timed out
Hmm, not sure how to do that. The spreadsheet is mentioned in the Fleming Testnet Release OP, but perhaps it gets a bit lost among all the other stuff in there. If it's helpful to the Maidsafe devs we could make it a new topic?
Yes - it may also be helpful to separate the actual instructions for getting up and running out of the OP. It's pretty long, and we should have the essential getting-started info in a quickly accessible location we can link to.
I created a dev lab at home and used Terraform to spin up/down Alpine LXC containers and execute shell scripts for the various safe CLI commands, to test the consistency of those commands. I was attempting to PUT data on the network, as well as trying to get a node joined, and have yet to be successful.
I am using pfSense as the router of this dev lab, and enabling static ports on the outbound NAT mapping did help alleviate some of the receive-timeout errors I saw on the first day of the testnet release. I also enabled UPnP Port Mapping + NAT-PMP Port Mapping in pfSense (temporarily, as they are disabled by default), but didn't see any mapping being registered/triggered against that service within pfSense from the IP of the container that ran the safe commands. I was glad to see the IDS (Snort) didn't block any traffic associated with the container executing the safe CLI commands!
As mentioned above by Vort and Piluso, I too am seeing all of their error messages at some point when replaying the safe container scripts (from spin-up until destroy). 'safe auth create --test-coins' has yet to work for me, with 'Client data exists' being the most common failure. Commands such as 'safe keys create --test-coins --preload xxxx' have worked for me (on and off).
Even though I have yet to get a home node to join the testnet, my congrats to the Maidsafe team for this epic milestone. Looking forward to future iterations of the testnet with AE, to see if it helps solve some of these issues!
I'll add that I am also seeing these issues, though my network setup is much less involved: I am running in a VM on my local machine, behind a consumer Netgear router (Netgear R6700v3, latest firmware updates, UPnP enabled), and the behaviour is just as described by Shu/Vort/Piluso. I will be trying this on my local machine without a VM later today, to see if the double NAT is causing slowness issues for me.
Personally I have been able to get safe auth create --test-coins to work occasionally, and have not had the 'Client data exists' problem. I find that running auth commands before auth create --test-coins is very snappy, but even something simple like safe auth status is much slower after running auth create --test-coins.
I have made it as far as attaching a spendable balance to my wallet, but any attempts to retrieve/upload files from/to the network have failed for me. It is also worth confirming something else from the original post: sometimes the error thrown on the command line is Insufficient Balance, but, looking in the logs, it is actually a timeout error under the hood.
Timeouts aside it has been great to finally start getting familiar with these commands! Thanks to everyone for their hard work! Really hoping to get a node on the testnet soon!
After testing an install directly on local hardware and not a VM, I am still seeing the same mostly-consistent timeouts from the safe auth create --test-coins command. It doesn’t exactly rule out issues such as double NAT, but now I know the same issues are present outside a VM at least.
Can somebody tell me whether what I am seeing is normal? I haven't tried this when the testnet was running better, so I have nothing to compare it to.
I am running safe cat and watching the network traffic in Wireshark. I expected to see packets going out and never receiving a response, but what I see instead is a huge number of active rejections: ICMP type 3 (destination unreachable), code 3 (port unreachable). https://i.imgur.com/GfAzmSG.png
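For what it's worth, ICMP port unreachable is an active rejection: the datagram reached the remote host, but nothing was listening on that port. That is different from packets being silently dropped in transit, and it would fit the "lost logically, not physically" guess above. A minimal sketch of how to distinguish the two cases from a Linux box (the target address is a placeholder; this relies on the Linux behaviour where an ICMP rejection against a connected UDP socket surfaces as ConnectionRefused on the next recv):

```rust
use std::net::UdpSocket;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let sock = UdpSocket::bind("0.0.0.0:0")?;
    sock.set_read_timeout(Some(Duration::from_secs(2)))?;

    // Placeholder: point this at a node address seen in Wireshark.
    sock.connect("203.0.113.7:12000")?;
    sock.send(b"probe")?;

    let mut buf = [0u8; 64];
    match sock.recv(&mut buf) {
        Ok(n) => println!("got {} bytes back: something is listening", n),
        // ICMP port unreachable comes back as ConnectionRefused.
        Err(e) if e.kind() == std::io::ErrorKind::ConnectionRefused => {
            println!("port unreachable: host is up, but no process on that port")
        }
        // A read timeout surfaces as WouldBlock on Unix, TimedOut on Windows.
        Err(e) if e.kind() == std::io::ErrorKind::WouldBlock
            || e.kind() == std::io::ErrorKind::TimedOut => {
            println!("timed out: packets silently dropped somewhere")
        }
        Err(e) => println!("other error: {}", e),
    }
    Ok(())
}
```

If those rejections are coming from the remote node addresses, it suggests the machines are reachable but the node processes are gone, or are no longer bound to those ports.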