DAOS Command Fails with "Transport layer mercury error" on CentOS 7.9
by NorthernLights2003 from LinuxQuestions.org on (#6NE3F)
Hello,
I'm encountering an issue when running the daos cont create command on my DAOS setup. The command fails with a "Transport layer mercury error." Below are the details of the error and my setup:
Command and Error Message:
[root@client2 ~]# daos cont create tank --label mycont
external ERR # [5323.920594] mercury->msg: [error] /builddir/build/BUILD/mercury-2.1.0rc4/src/na/na_ofi.c:3047
# na_ofi_msg_send(): fi_tsend() failed, rc: -2 (No such file or directory)
external ERR # [5323.921055] mercury->hg: [error] /builddir/build/BUILD/mercury-2.1.0rc4/src/mercury_core.c:2727
# hg_core_forward_na(): Could not post send for input buffer (NA_NOENTRY)
hg ERR src/cart/crt_hg.c:1104 crt_hg_req_send_cb(0x2a5c8c0) [opc=0x1020004 (DAOS) rpcid=0x18ea69b600000000 rank:tag=0:0] RPC failed; rc: DER_HG(-1020): 'Transport layer mercury error'
mgmt ERR src/mgmt/cli_mgmt.c:882 dc_mgmt_pool_find() tank: failed to get PS replicas from 1 servers, DER_HG(-1020): 'Transport layer mercury error'
pool ERR src/pool/cli.c:198 dc_pool_choose_svc_rank() 00000000:tank: dc_mgmt_pool_find() failed, DER_HG(-1020): 'Transport layer mercury error'
pool ERR src/pool/cli.c:503 dc_pool_connect_internal() 00000000:tank: cannot find pool service: DER_HG(-1020): 'Transport layer mercury error'
ERROR: daos: DER_HG(-1020): Transport layer mercury error
Environment Details:
DAOS Version: daos-2.0.3-5.el7.x86_64
DAOS Client Version: daos-client-2.0.3-5.el7.x86_64
Libfabric Version: libfabric-1.15.1-1.el7.x86_64
Mercury Version: mercury-2.1.0~rc4-9.el7.x86_64
CentOS Version: CentOS 7.9
Fabric Interface: enp0s3
Additional Information:
[root@server ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 08:00:27:bd:95:c2 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic enp0s3
valid_lft 564sec preferred_lft 564sec
inet6 fe80::e25:a2fd:9904:a8ac/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 08:00:27:bb:cb:4d brd ff:ff:ff:ff:ff:ff
inet 192.168.56.104/24 brd 192.168.56.255 scope global noprefixroute dynamic enp0s8
valid_lft 370sec preferred_lft 370sec
inet6 fe80::8d85:5b39:5f73:6e0b/64 scope link noprefixroute
valid_lft forever preferred_lft forever
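As a quick sanity check on the addresses above, the following minimal Python sketch tests whether a client address falls inside the subnet of each server interface. The client address used here is hypothetical (the client's `ip addr` output is not shown in this post), and the function name `same_subnet` is just an illustration. Sharing a subnet does not by itself prove that the fabric provider can reach the server, so treat this only as a first filter.

```python
import ipaddress

# Server addresses taken from the `ip addr` output above.
SERVER_INTERFACES = {
    "enp0s3": "10.0.2.15/24",       # interface used as fabric_iface
    "enp0s8": "192.168.56.104/24",
}


def same_subnet(server_cidr: str, client_ip: str) -> bool:
    """Return True if client_ip lies in the network that server_cidr belongs to."""
    network = ipaddress.ip_interface(server_cidr).network
    return ipaddress.ip_address(client_ip) in network


if __name__ == "__main__":
    # Hypothetical client address; replace with the real one from the client node.
    client_ip = "192.168.56.105"
    for iface, cidr in SERVER_INTERFACES.items():
        print(iface, cidr, "shares subnet with client:", same_subnet(cidr, client_ip))
```

If the client only shares a subnet with an interface other than the one named in fabric_iface, that mismatch may be worth ruling out before looking deeper at Mercury or libfabric.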
I have also included the DAOS server, control, and agent configuration files below for reference.
DAOS Server file
## default: daos_server
name: daos_server
#
#
## Access points
## Immutable after running "dmg storage format".
#
## To operate, DAOS will need a quorum of access point nodes to be available.
## Must have the same value for all agents and servers in a system.
## Hosts can be specified with or without port. The default port that is set
## up in port: will be used if a port is not specified here.
#
## default: hostname of this node
access_points: ['10.0.2.15']
#
#
## Default control plane port
#
## Port number to bind daos_server to. This will also be used when connecting
## to access points, unless a port is specified in access_points:
#
## default: 10001
port: 10001
#
#
## Transport credentials specifying certificates to secure communications
#
transport_config:
# # In order to disable transport security, uncomment and set allow_insecure
# # to true. Not recommended for production configurations.
  allow_insecure: true
#
# # Location where daos_server will look for Client certificates
  client_cert_dir: /etc/daos/certs/clients
# # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/certs/daosCA.crt
# # Server certificate for use in TLS handshakes
  cert: /etc/daos/certs/server.crt
# # Key portion of Server Certificate
  key: /etc/daos/certs/server.key
provider: ofi+sockets
socket_dir: /var/run/daos_server
nr_hugepages: 4096
control_log_mask: DEBUG
control_log_file: /tmp/daos_server.log
helper_log_file: /tmp/daos_admin.log
engines:
-
  targets: 8
  nr_xs_helpers: 0
  fabric_iface: enp0s3
  fabric_iface_port: 31316
  log_mask: INFO
  log_file: /tmp/daos_engine_0.log
  env_vars:
  - CRT_TIMEOUT=30
  scm_mount: /mnt/daos0
  scm_class: ram
  scm_size: 8
DAOS Control file
# default: daos_server
name: daos_server
# Default destination port to use when connecting to hosts in the hostlist.
# default: 10001
port: 10001
# Hostlist, a comma separated list of addresses (hostnames or IPv4 addresses).
# default: ['localhost']
hostlist: ['10.0.2.15']
## Transport Credentials Specifying certificates to secure communications
transport_config:
# # In order to disable transport security, uncomment and set allow_insecure
# # to true. Not recommended for production configurations.
  allow_insecure: true
#
# # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/certs/daosCA.crt
# # Admin certificate for use in TLS handshakes
  cert: /etc/daos/certs/admin.crt
# # Key portion of Admin Certificate
  key: /etc/daos/certs/admin.key
DAOS Agent file
# default: daos_server
name: daos_server
# Management server access points
# Must have the same value for all agents and servers in a system.
# default: hostname of this node
access_points: ['10.0.2.15']
# Force different port number to connect to access points.
# default: 10001
port: 10001
## Transport Credentials Specifying certificates to secure communications
#
transport_config:
# # In order to disable transport security, uncomment and set allow_insecure
# # to true. Not recommended for production configurations.
  allow_insecure: true
#
# # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/certs/daosCA.crt
# # Agent certificate for use in TLS handshakes
  cert: /etc/daos/certs/agent.crt
# # Key portion of Agent Certificate
  key: /etc/daos/certs/agent.key
# Use the given directory for creating unix domain sockets
#
# NOTE: Do not change this when running under systemd control. If it needs to
# be changed, then make sure that it matches the RuntimeDirectory setting
# in /usr/lib/systemd/system/daos_agent.service
#
# default: /var/run/daos_agent
#runtime_dir: /var/run/daos_agent
# Full path and name of the DAOS agent logfile.
# default: /tmp/daos_agent.log
log_file: /tmp/daos_agent.log
# Manually define the fabric interfaces and domains to be used by the agent,
# organized by NUMA node.
# If not defined, the agent will automatically detect all fabric interfaces and
# select appropriate ones based on the server preferences.
#
#fabric_ifaces:
#-
#  numa_node: 0
#  devices:
#  -
#    iface: ib0
#    domain: mlx5_0
#  -
#    iface: ib1
#    domain: mlx5_1
#-
#  numa_node: 1
#  devices:
#  -
#    iface: ib2
#    domain: mlx5_2
#  -
#    iface: ib3
#    domain: mlx5_3
Any assistance or insights into resolving this issue would be greatly appreciated. Thank you!