Ticket #159 (new bug)

Opened 5 months ago

Last modified 3 months ago

mr500 repeaters offline after OTA upgrade to r299p; gateways refuse ssh password

Reported by: jason
Owned by: dfl-owner
Priority: minor
Milestone: ng-beta
Component: firmware
Keywords:
Cc:
Network name: ecrv

Description

I have an 18-node mr500 network that was running r299. I decided to OTA upgrade to r330 in an attempt to address some user complaints about slowness and dropped connectivity. The whole network upgraded to the interim r299p three days ago, and then most (currently all) of the repeaters went offline; only a few come back online sporadically. The gateways show as online in the dashboard, but users are not able to connect to the internet through them. All of the nodes have been cold booted several times, with no improvement. In addition, I found that I could not ssh into four of the gateways: I could connect, but the password was denied. I tried changing the ssh password several times, again with no improvement. As an attempted fix, I then separated the nodes into multiple small networks to get them to OTA upgrade, but only one gateway and two repeaters, placed in their own small network, upgraded to r347. The rest remain basically unusable. Is my only option to manually re-flash all of the nodes, or are there other options? The network is across the country from me, the nodes are mounted 20’ above the ground, and I would have to walk a non-technical person through the flash process. Network name is ecrv.

Change History

Changed 5 months ago by marek

Did you do anything on your network since you filed this ticket? Many nodes seem to be back up and only one is still on ng299p.

If the password is not updated it could indicate that the dashboard config is not read/processed. This also would explain why the nodes don't upgrade (the upgrade instructions come from the dashboard as well).

Changed 5 months ago by jason

I separated the physical nodes into a number of small logical networks in an attempt to get the nodes to OTA upgrade, and continued rebooting. As you have seen, this has worked to a degree, but three of the gateways (the ones I could not ssh into) and one repeater refuse to update after a day or more in a network that meets the conditions for an OTA upgrade.

The repeater is in the "ecrv" network, one gateway is in "ecrvfree" and two gateways are in "ecrvwest"

When I separated the nodes into their own networks I also changed the channels, and the change was reflected in the dashboard. Doesn't that indicate that the dashboard config is being read?

Regardless, is manual reflashing the only option at this point?

Changed 5 months ago by jason

Could you move ecrv, ecrvfree and ecrvwest up in the upgrade schedule? Perhaps they have not hit their time slot yet?

Changed 5 months ago by marek

The upgrade queue should be empty. The queue is only full right after a release, when all the networks need to be upgraded.

Ok, let's focus on ecrvfree for now. I can ssh into 5.150.150.127 but not into 5.150.150.192 (password seems incorrect). Your "channel objection" is correct. If the node changes the channel it has to be able to retrieve the dashboard config. Can we change the channel from 7 / 157 to something else for the sake of testing ?

If that works we probably have to try to get into the node using the custom.sh script (installing a new password from there). Any idea how this node ended up in this state ? I have never seen that before.

Changed 5 months ago by jason

Since there were only four nodes not working, I had them pulled down and shipped to me so I could reflash them (people onsite were not able to flash the units for some reason.) If there is any testing you would like me to do for academic reasons, I am happy to do so, otherwise, I am assuming that after I update to r347 everything should be back to normal.

My guess for how the nodes ended up in this state is as follows: The nodes were running r299 but having intermittent connectivity issues. I set them to update to the test firmware (r330 at the time). All nodes successfully updated to r299p in preparation for updating to r330, as per design. Then, at some point before the update to r330, a connectivity issue or some other problem aborted the upgrade and left some of the nodes in a "broken" state.

Changed 5 months ago by marek

Ok, then just reflash it. We have enough strange bugs on our plate - no need for an additional exercise.

Changed 3 months ago by jason

I am having a repeat of my prior issues, this time with the OTA upgrade from r354 to r376. Some of the nodes went offline during the upgrade process and have yet to come back up. Also, the "west end hard line" node in the ecrv network will not upgrade, and I cannot ssh into it: it refuses the password as if it were incorrect. I have changed the password in the dashboard, to no effect.

It was suggested before that a custom.sh script might be used to update the password. Is that still an option? I would like to avoid pulling units off the poles again if possible.

Changed 3 months ago by jason

Update: Now the "west end hard line" gateway is showing offline on the dashboard but it responds to pings and responds to an ssh login attempt (that it still rejects as if the password was wrong.)
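For anyone trying to reproduce this state (node answers pings and accepts an ssh connection but rejects the password), a small probe can confirm that the ssh daemon itself is alive without needing valid credentials. This is only a sketch: the function name probe_ssh is made up for illustration, and it relies on nothing more than the SSH server sending its identification string as soon as a TCP connection is opened.

```python
import socket


def probe_ssh(host, port=22, timeout=5.0):
    """Return the SSH identification banner sent by the server.

    Returns None if the port cannot be reached or nothing arrives
    within the timeout. No login is attempted, so this works even
    when the node rejects every password.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = conn.recv(256)  # server speaks first with "SSH-2.0-..."
    except OSError:  # refused, unreachable, or timed out
        return None
    return data.decode("ascii", errors="replace").strip()
```

Calling probe_ssh("192.0.2.10") (a placeholder address, substitute the node's IP) returns a string starting with "SSH-" when the daemon is answering, which would point at a credentials problem rather than a network one; None suggests the port is not reachable at all.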

Changed 3 months ago by marek

Did you try the default SSH password "0p3nm35h" ? It seems the node never checked in after the upgrade. It would be logical to assume that it never got the password you configured on the dashboard.

If you do manage to ssh in could you get the output of "logread" and attach it to this ticket ?

Thanks

Changed 3 months ago by jason

I did try the default password, to no avail. Since I have been unsuccessful in getting the node to OTA upgrade to r376, I have pulled it down and replaced it with a spare. Hopefully all goes well with the next release.

Changed 3 months ago by marek

It will be difficult to fix a problem if we don't know what causes it.

What did you do to the "broken" unit ?

Changed 3 months ago by jason

The "broken" unit is just pulled out of production and sitting on a shelf. If there is something you would like me to test with it, just let me know.

Changed 3 months ago by marek

If you can't SSH into the device you won't be able to see what it is doing and why (unless you have a serial console).

Could you try to reflash the unit and see if it comes back to life ?

Changed 3 months ago by jason

I will reflash it. From my last experience, I am confident that reflashing will "fix" the unit. Unfortunately, that does not help in finding the cause of the issue.
