Below I was restarting the safenode pids via RPC in the background every 15 seconds, iterating through one node at a time out of a total of 35 nodes, while `safe files upload` was continuing (a rough sketch of the restart loop is further below):
Did not store file "8cb1a32a-c8c7-4b1a-b333-c78e9c64eaae.txt" to all nodes in the close group! Network Error Outbound Error.
Storing file "0aef2409-7ded-40c6-bb05-d4e041a7f146.txt" of 10485760 bytes..
Did not store file "0aef2409-7ded-40c6-bb05-d4e041a7f146.txt" to all nodes in the close group! ResponseTimeout.
Storing file "7d68421e-d68a-4aac-bda5-e3fdd3e5b693.txt" of 10485760 bytes..
Did not store file "7d68421e-d68a-4aac-bda5-e3fdd3e5b693.txt" to all nodes in the close group! Network Error Outbound Error.
Storing file "0f2108cd-fe72-4ddb-b387-216194b31e33.txt" of 10485760 bytes..
Successfully stored file to ChunkAddress(41c1f6(01000001)..)
Storing file "49a8699f-f13f-488a-8cb0-44418f5b95e6.txt" of 10485760 bytes..
Successfully stored file to ChunkAddress(35068c(00110101)..)
Storing file "e97c7d2c-8ba8-49f4-b3bf-a420e2c409b5.txt" of 10485760 bytes..
Seems like the panics are gone (nice!), and if `safe files upload` errors out on certain files, it continues on to the next set of files in the target upload directory (recovering properly and moving on to store the next batch, etc.).
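For reference, the churn driver was essentially a loop of this shape. This is only a sketch: `send_restart` is a hypothetical placeholder for whatever RPC client call actually issues the restart request, and the port numbers are made up.

```rust
use std::time::Duration;

// Hypothetical placeholder for the real restart request sent to a node's
// RPC endpoint -- not an actual API, just here to show the loop's shape.
async fn send_restart(endpoint: &str) {
    println!("requesting restart of node at {endpoint}");
}

#[tokio::main]
async fn main() {
    // 35 local nodes, one RPC endpoint each (illustrative port numbers).
    let endpoints: Vec<String> =
        (0..35).map(|i| format!("127.0.0.1:{}", 12000 + i)).collect();

    // 15s between restarts for the run above; 100ms for the churn-heavy run.
    let interval = Duration::from_secs(15);

    // Cycle through the nodes indefinitely, restarting one at a time.
    for ep in endpoints.iter().cycle() {
        send_restart(ep).await;
        tokio::time::sleep(interval).await;
    }
}
```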
I did find it interesting that when the churn rate for restart requests via RPC was increased from one every 15 seconds to one every 100 ms, it eventually led to the following messages from `safe files upload`:
Did not store file "ee080536-5f37-4014-95ce-012a231854dd.txt" to all nodes in the close group! Network Error Could not get CLOSE_GROUP_SIZE number of peers..
Storing file "3ede80bd-a106-4a76-86b2-c687bf326d71.txt" of 10485760 bytes..
Did not store file "3ede80bd-a106-4a76-86b2-c687bf326d71.txt" to all nodes in the close group! Network Error Could not get CLOSE_GROUP_SIZE number of peers..
Storing file "872654bb-b3b8-48c9-9bb9-42b3bd504541.txt" of 10485760 bytes..
Did not store file "872654bb-b3b8-48c9-9bb9-42b3bd504541.txt" to all nodes in the close group! Network Error Could not get CLOSE_GROUP_SIZE number of peers..
Storing file "3d6ad666-ee8a-4e33-ac81-780c43648350.txt" of 10485760 bytes..
Did not store file "3d6ad666-ee8a-4e33-ac81-780c43648350.txt" to all nodes in the close group! Network Error Could not get CLOSE_GROUP_SIZE number of peers..
Storing file "13288e7f-3050-48f6-ac4d-c5f5e082990c.txt" of 10485760 bytes..
Did not store file "13288e7f-3050-48f6-ac4d-c5f5e082990c.txt" to all nodes in the close group! Network Error Could not get CLOSE_GROUP_SIZE number of peers..
Storing file "f0477552-5d7c-4133-a678-81d8cbe10d9f.txt" of 10485760 bytes..
The above may not be a realistic test case (at that churn rate there are presumably never CLOSE_GROUP_SIZE peers up at once); however, what's interesting here is that if I then kill all safenode pids via `killall safenode`, the `safe files upload` doesn't seem to exit (I waited more than 20 minutes with no activity on the console output, and CPU at 0%).
A backtrace after attaching to the process (debug build) shows the following for `safe` (thread 1 of 8):
```
(gdb) backtrace
#0 0x00007f68cd5978aa in __syscall6 () at ../src_musl/arch/x86_64/syscall_arch.h:59
#1 syscall () at ../src_musl/src/misc/syscall.c:20
#2 0x00007f68cd32419d in parking_lot_core::thread_parker::imp::ThreadParker::futex_wait (self=0x7f68cbfe0220, ts=...) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/parking_lot_core-0.9.7/src/thread_parker/linux.rs:112
#3 0x00007f68cd323fbc in parking_lot_core::thread_parker::imp::{impl#0}::park (self=0x7f68cbfe0220) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/parking_lot_core-0.9.7/src/thread_parker/linux.rs:66
#4 0x00007f68cd32e7ac in parking_lot_core::parking_lot::park::{closure#0}<parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#0}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#1}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#2}> (thread_data=0x7f68cbfe0200) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/parking_lot_core-0.9.7/src/parking_lot.rs:635
#5 0x00007f68cd32e028 in parking_lot_core::parking_lot::with_thread_data<parking_lot_core::parking_lot::ParkResult, parking_lot_core::parking_lot::park::{closure_env#0}<parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#0}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#1}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#2}>> (f=...)
at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/parking_lot_core-0.9.7/src/parking_lot.rs:207
#6 parking_lot_core::parking_lot::park<parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#0}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#1}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#2}> (key=93824994079384, validate=..., before_sleep=..., timed_out=..., park_token=..., timeout=...) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/parking_lot_core-0.9.7/src/parking_lot.rs:600
#7 0x00007f68cd328fd9 in parking_lot::condvar::Condvar::wait_until_internal (self=0x555555717298, mutex=0x5555557172a0, timeout=...) at src/condvar.rs:333
#8 0x00007f68cd288a8e in parking_lot::condvar::Condvar::wait<()> (self=0x555555717298, mutex_guard=0x7ffc2d2d1388) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/parking_lot-0.12.1/src/condvar.rs:256
#9 0x00007f68cd2c696c in tokio::loom::std::parking_lot::Condvar::wait<()> (self=0x555555717298, guard=...) at src/loom/std/parking_lot.rs:150
#10 0x00007f68cd28f4f6 in tokio::runtime::park::Inner::park (self=0x555555717290) at src/runtime/park.rs:117
#11 0x00007f68cd28fe57 in tokio::runtime::park::{impl#4}::park::{closure#0} (park_thread=0x7f68cbfe01c8) at src/runtime/park.rs:255
#12 0x00007f68cd28ff96 in tokio::runtime::park::{impl#4}::with_current::{closure#0}<tokio::runtime::park::{impl#4}::park::{closure_env#0}, ()> (inner=0x7f68cbfe01c8) at src/runtime/park.rs:269
#13 0x00007f68cd2a6352 in std::thread::local::LocalKey<tokio::runtime::park::ParkThread>::try_with<tokio::runtime::park::ParkThread, tokio::runtime::park::{impl#4}::with_current::{closure_env#0}<tokio::runtime::park::{impl#4}::park::{closure_env#0}, ()>, ()> (self=0x7f68cdb6b8c8, f=...) at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/thread/local.rs:446
#14 0x00007f68cd28ff26 in tokio::runtime::park::CachedParkThread::with_current<tokio::runtime::park::{impl#4}::park::{closure_env#0}, ()> (self=0x7ffc2d2d3f78, f=...) at src/runtime/park.rs:269
#15 0x00007f68cd28fe1a in tokio::runtime::park::CachedParkThread::park (self=0x7ffc2d2d3f78) at src/runtime/park.rs:255
#16 0x00007f68cc2089fb in tokio::runtime::park::CachedParkThread::block_on<safe::main::{async_block_env#0}> (self=0x7ffc2d2d3f78, f=...) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.27.0/src/runtime/park.rs:292
#17 0x00007f68cc1ed1cb in tokio::runtime::context::BlockingRegionGuard::block_on<safe::main::{async_block_env#0}> (self=0x7ffc2d2d66a8, f=...)
at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.27.0/src/runtime/context.rs:315
#18 0x00007f68cc17ff7b in tokio::runtime::scheduler::multi_thread::MultiThread::block_on<safe::main::{async_block_env#0}> (self=0x7ffc2d2e2a20, handle=0x7ffc2d2e2a48, future=...)
at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.27.0/src/runtime/scheduler/multi_thread/mod.rs:66
#19 0x00007f68cc23a54a in tokio::runtime::runtime::Runtime::block_on<safe::main::{async_block_env#0}> (self=0x7ffc2d2e2a08, future=...) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.27.0/src/runtime/runtime.rs:304
#20 0x00007f68cc207f39 in safe::main () at safenode/src/bin/kadclient.rs:48
```
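Reading the backtrace, the main thread is simply parked in tokio's `block_on` (frames #16-#20), waiting on a condvar for the top-level future in `safe::main` to complete; with every node killed, that future apparently never resolves, which matches the 0% CPU. A minimal illustration of that end state (just the mechanism, not the actual client code):

```rust
use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    // Keep the sender alive but never send anything: the receiver's future
    // can never complete, so block_on parks the main thread on a condvar
    // forever -- the same idle state the backtrace shows (0% CPU, no exit).
    let (_tx, rx) = oneshot::channel::<()>();
    let _ = rx.await;
}
```

If the client wrapped awaits like that in `tokio::time::timeout`, the hang would presumably surface as an error instead of blocking forever.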
Please let me know if this specific topic isn't the right place for the above, or if all is good.