Operation failed. Reason: LCM operation kLcmUpdateOperation failed on phoenix,ip: [cc.cc.cc.128] due to Upgrade encountered an error: Error occurred: Failed to start the foundation service


Nutanix Life Cycle Manager
Nutanix Life Cycle Management (LCM)

Always make sure that your cluster can tolerate a node/host failure: the data resiliency status in the Prism Element dashboard should be "OK".

Issue :


————————————–
LCM failed with the error: kLcmUpdateOperation failed on phoenix, ip: [cc.cc.cc.128] due to Upgrade encountered an error: Error occurred: Failed to start the foundation service : on [u'10.x.x.x'],ret: False, err:Foundation service could not be started after 3 retries.. Logs have been collected and are available to download on 10.x.x.x at /home/nutanix/data/log_collector/lcm_logs__10.x.x.x__2020-05-16_13-10-.tar.gz

Current Status :

- The upgrade on node x.x.x.128 is in progress.

Findings / Summary :
————————————–

  • Checked via IPMI and confirmed the host was up.
  • Confirmed the host was online, but the CVM was in maintenance mode.
  • Removed the CVM from maintenance mode.
  • All nodes were running the same Foundation version, and the Foundation service was stopped.
  • Checked the logs and found that the failure was caused by the Foundation service not starting, possibly due to permission errors:
/home/nutanix/foundation/bin/../lib/py/nutanix_foundation.egg/foundation/monkey.py:160: UserWarning: Patching paramiko to use SHA256 for fingerprint
Traceback (most recent call last):
  File "/home/nutanix/foundation/bin/foundation", line 368, in <module>
    service(options, args)
  File "/home/nutanix/foundation/bin/foundation", line 252, in service
    main(options, args)
  File "/home/nutanix/foundation/bin/foundation", line 171, in main
    service_log = folder_central.get_service_log_path()
  File "foundation/folder_central.py", line 309, in get_service_log_path
  File "foundation/folder_central.py", line 190, in _get_ntnx_log_folder
  File "foundation/folder_central.py", line 90, in _get_folder
  File "/usr/lib64/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/home/nutanix/data/logs/foundation/.'
foundation.out (END)

Unable to open the foundation directory on CVMs 87 and 85:

================== x.x.x.85 =================
ls: cannot open directory /home/nutanix/data/logs/foundation: Permission denied
================== x.x.x.86 =================
total 160
drwxr-x---. 2 nutanix nutanix   4096 Jun 10  2017 archive
-rw-r-----. 1 nutanix nutanix      0 Jul 19 02:19 foundation_central.log
-rw-r-----. 1 nutanix nutanix      0 Jul 19 02:19 debug.log
-rw-r-----. 1 nutanix nutanix      0 Jul 19 02:19 api.log
-rw-r-----. 1 nutanix nutanix      0 Jul 19 02:19 http.error
-rw-r-----. 1 nutanix nutanix      0 Jul 19 02:19 http.access
-rw-r-----. 1 nutanix nutanix      0 Jul 19 02:19 component_manager.log
-rw-r-----. 1 nutanix nutanix  10735 Jul 19 11:54 phoenix.log
drwxr-x---. 3 nutanix nutanix   4096 Jul 19 11:54 .
drwxr-x---. 6 nutanix nutanix 139264 Jul 19 18:02 ..
================== 10.x.x.x =================
ls: cannot open directory /home/nutanix/data/logs/foundation: Permission denied
nutanix@NTNX-x.x.x-D-CVM:10.x.x.x:~/data/logs$

+ Found that the owner and group of the foundation directory were set to root on nodes 87 and 85:

nutanix@NTNX-x.x.x-D-CVM:10.x.x.x:~/data/logs$ allssh sudo  ls -lad /home/nutanix/data/logs/foundation
================== x.x.x.126 =================
drwxr-x---. 4 nutanix nutanix 4096 Jul 19 06:54 /home/nutanix/data/logs/foundation
================== x.x.x.127 =================
drwxr-x---. 3 nutanix nutanix 4096 Jul 19 17:31 /home/nutanix/data/logs/foundation
================== x.x.x.128 =================
drwxr-x---. 4 nutanix nutanix 4096 Jul 19 12:48 /home/nutanix/data/logs/foundation
================== x.x.x.129 =================
drwxr-x---. 3 nutanix nutanix 4096 Jul 19 10:35 /home/nutanix/data/logs/foundation
================== x.x.x.84 =================
drwxr-x---. 4 nutanix nutanix 4096 Jul 19 12:47 /home/nutanix/data/logs/foundation
================== x.x.x.85 =================
drwxr-x---. 2 root root 4096 Jul 19 09:24 /home/nutanix/data/logs/foundation
================== x.x.x.86 =================
drwxr-x---. 3 nutanix nutanix 4096 Jul 19 11:54 /home/nutanix/data/logs/foundation
================== 10.x.x.x =================
drwxr-x---. 2 root root 4096 Jul 19 05:43 /home/nutanix/data/logs/foundation
nutanix@NTNX-x.x.x-D-CVM:10.x.x.x:~/data/logs$
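
On a larger cluster the root-owned directories are easy to miss in a listing like the one above. The following is a minimal sketch for filtering that output, assuming the `allssh` format shown above (a banner line with the node IP, followed by one `ls -lad` line). The function name `find_bad_owner` is hypothetical and is not part of any Nutanix tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Nutanix tool): read allssh-style output on stdin
# and print each node whose directory is not owned by nutanix:nutanix.
find_bad_owner() {
  awk '
    /^=+ .* =+$/ { node = $2; next }          # banner line: remember the node IP
    /^d/ && ($3 != "nutanix" || $4 != "nutanix") {
      print node, $3 ":" $4                   # ls -l fields: 3 = owner, 4 = group
    }
  '
}

# Usage on a CVM (the listing command is the one used in the session above):
#   allssh sudo ls -lad /home/nutanix/data/logs/foundation | find_bad_owner
```

Fed the listing above, this would report only the two nodes where the directory belongs to root.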

Changed the owner and group of the directory to nutanix to resolve the permission error.
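
The exact command used for the ownership change is not recorded in this post. As a minimal sketch, assuming a plain `chown` run with `sudo` on each affected CVM, the helper below only prints the command so that it can be reviewed before it is run. The name `chown_cmd_if_needed` is hypothetical, and `stat -c` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the chown command needed to hand a directory
# back to the expected owner, or print nothing if ownership is already correct.
chown_cmd_if_needed() {
  local dir=$1 want=${2:-nutanix:nutanix}
  local have
  have=$(stat -c '%U:%G' "$dir") || return 1   # GNU stat: owner:group names
  if [ "$have" != "$want" ]; then
    echo "sudo chown $want $dir"
  fi
}

# Usage on an affected CVM:
#   chown_cmd_if_needed /home/nutanix/data/logs/foundation
```

If files inside the directory are also owned by root, `chown -R` may be needed instead; check the contents first with `sudo ls -la`.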

nutanix@X.X.x-CVM$ upgrade_status
2020-07-19 17:28:41 INFO zookeeper_session.py:131 upgrade_status is attempting to connect to Zookeeper
2020-07-19 17:28:41 INFO upgrade_status:38 Target release version: el7.3-release-euphrates-5.10.10-stable-125f671ba8982a0199e18b756e8ef33232
2020-07-19 17:28:41 INFO upgrade_status:43 Cluster upgrade method is set to: automatic rolling upgrade
2020-07-19 17:28:41 INFO upgrade_status:96 SVM x.x.x.x is up to date
2020-07-19 17:28:41 INFO upgrade_status:96 SVM x.x.x.x is up to date
2020-07-19 17:28:41 INFO upgrade_status:96 SVM x.x.x.x is up to date
2020-07-19 17:28:41 INFO upgrade_status:96 SVM x.x.x.x is up to date

Noticed that the pre-check/inventory was failing because node x.x.x.128 did not release the shutdown token:

2020-07-19 15:25:16 INFO cluster_manager.py:4651 Not releasing token - HA status not UP for x.x.x.128
2020-07-19 15:26:03 INFO cluster_manager.py:4651 Not releasing token - HA status not UP for x.x.x.128
2020-07-19 15:26:48 INFO cluster_manager.py:4651 Not releasing token - HA status not UP for x.x.x.128
2020-07-19 15:33:08 INFO cluster_manager.py:4651 Not releasing token - HA status not UP for x.x.x.128
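
To see which node is holding up the shutdown token, the repeated messages above can be summarised per node. This is a sketch only: it assumes the message format shown above, with the node IP as the last field, and that the lines are found in `~/data/logs/genesis.out`. The function name `token_blockers` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "Not releasing token" messages per node
# from log text on stdin. Assumes the node IP is the last field of the line.
token_blockers() {
  grep 'Not releasing token' |
    awk '{ count[$NF]++ } END { for (n in count) print n, count[n] }' |
    sort
}

# Usage (assumes the messages are logged to genesis.out):
#   token_blockers < ~/data/logs/genesis.out
```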

nutanix@X.X.x-CVM$ host_upgrade_status
2020-07-19 17:28:48 INFO zookeeper_session.py:131 host_upgrade_status is attempting to connect to Zookeeper
Automatic Hypervisor upgrade: Disabled
Target host version: el6.nutanix.20170830.402
2020-07-19 Completed hypervisor upgrade on this node
2020-07-19 Completed hypervisor upgrade on this node
2020-07-19 Completed hypervisor upgrade on this node
2020-07-19 Completed hypervisor upgrade on this node

Solution:-

Restarted genesis on the affected node to resolve this issue.

nutanix@X.X.x-CVM$ genesis restart
2020-07-19 18:02:03.491308: Stopping genesis (pids [5657, 7420, 7743, 7744, 9238, 9230])
2020-07-19 18:02:04.866536: Genesis started on pids [4537]

After the LCM inventory completed successfully, started the firmware upgrades on the CVM, and all hosts were upgraded.

LCM Issue :-

Note:- Always involve Nutanix Support for any activity.

==============================================================================

~/data/logs/foundation/last_session.log ---- Workflows Involving Phoenix
~/data/logs/lcm_wget.out ---- LCM Manifest Download from Nutanix
~/data/logs/genesis.out ---- Inventory & Upload Operations
~/data/logs/lcm_ops.out ---- Inventory & Upload Operations
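
A quick way to scan the logs listed above for recent failures is sketched below. The function name `lcm_log_errors` and the search pattern (`ERROR`, `CRITICAL`, `Traceback`) are assumptions made for illustration and may need adjusting to what the logs actually contain.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the most recent error-looking lines from the
# LCM-related logs listed above. The log directory is passed as $1.
lcm_log_errors() {
  local logdir=${1:-$HOME/data/logs} f
  for f in foundation/last_session.log lcm_wget.out genesis.out lcm_ops.out; do
    [ -r "$logdir/$f" ] || continue           # skip logs that are missing or unreadable
    echo "== $f"
    grep -E 'ERROR|CRITICAL|Traceback' "$logdir/$f" | tail -n 5
  done
}

# Usage on a CVM:
#   lcm_log_errors ~/data/logs
```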
