Fixed
Created: Sep 11, 2017
Updated: Dec 3, 2018
Resolved Date: Apr 22, 2018
Previous ID: LINCCM-1608
Found In Version: 6.0
Fix Version: 6.0.0.37
Severity: Severe
Applicable for: Wind River Linux 6
Component/s: Kernel
We found a DPAA driver issue on Wind River Linux 6. The description follows:
--------------------------
1. root cause
The rekeying procedure consists of two sub-tasks, inbound and outbound, which run in a single thread and are both processed partly by the DPAA driver. Inbound comes first, then outbound. While processing inbound, the DPAA driver defers part of the procedure to a Linux kernel work queue. That deferred part then runs concurrently with our application process, which starts the outbound procedure shortly afterwards. Both require access to the PCD, which produces a race condition. The details follow.
From our traces, we found that the DPAA driver interface ‘dpa_ipsec_sa_rekeying’ returned -87 (-EUSERS). The function call chain is shown below.
execution path | function name | file name | line number (call in) | line number (call out)
inbound handle | dpa_ipsec_sa_rekeying | drivers/staging/fsl_dpa_offload/dpa_ipsec.c | 4674 | 4887
 | queue_delayed_work | drivers/staging/fsl_dpa_offload/dpa_ipsec.c | 4887 | 5167
 | sa_rekeying_work_func (delayed 100us) | drivers/staging/fsl_dpa_offload/dpa_ipsec.c | 5167 | 5196
 | sa_rekeying_inbound | drivers/staging/fsl_dpa_offload/dpa_ipsec.c | 5052 | 5071
 | remove_inbound_hash_entry | drivers/staging/fsl_dpa_offload/dpa_ipsec.c | 2248 | 2254
 | update_pre_sec_inbound_table | drivers/staging/fsl_dpa_offload/dpa_ipsec.c | 2105 | 2215
 | dpa_classif_table_delete_entry_by_ref | drivers/staging/fsl_dpa_offload/dpa_classifier.c | 1564 | 1574
Note: this call reaches a DPAA PCD operation, which requires mutually exclusive access.
outbound handle | dpa_ipsec_sa_rekeying | drivers/staging/fsl_dpa_offload/dpa_ipsec.c | 4674 | 4831
 | update_outbound_policy | drivers/staging/fsl_dpa_offload/dpa_ipsec.c | 1879 | 2092
 | dpa_classif_table_modify_entry_by_ref | drivers/staging/fsl_dpa_offload/dpa_classifier.c | 894 | 914
Note: this call also reaches a DPAA PCD operation that requires mutually exclusive access, but it returns -EBUSY, which is reported to the caller as -EUSERS.
The kernel execution path (highlighted in yellow in the original table) runs concurrently with, and interleaved with, the process execution path (not highlighted).
When the eNB device begins the rekey procedure, we call the inbound and outbound handles in succession, as above, and the outbound call returns -EUSERS. After analyzing the function call chain, we suspect that the DPAA PCD code is being re-entered: part of the inbound procedure is deferred and executed on the kernel execution path, which runs concurrently with the process path (the outbound procedure). This causes the PCD -EBUSY error, which is returned as -EUSERS.
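The suspected interleaving can be modeled in user space. The sketch below is not the DPAA driver code: the busy flag, `pcd_table_op`, `outbound_rekey`, and the `inbound_work_*` helpers are all hypothetical stand-ins, and it assumes the PCD layer refuses to wait and returns -EBUSY when it is already in use, which is then mapped to -EUSERS as observed in the traces.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for "PCD is in use"; the real driver's locking may differ. */
static atomic_flag pcd_busy = ATOMIC_FLAG_INIT;

/* Models a PCD table operation that fails instead of waiting when the PCD is busy. */
static int pcd_table_op(void)
{
    if (atomic_flag_test_and_set(&pcd_busy))
        return -EBUSY;              /* someone else is inside the PCD */
    /* ... the table update would happen here ... */
    atomic_flag_clear(&pcd_busy);
    return 0;
}

/* Models the outbound branch: a busy PCD is reported upward as -EUSERS. */
static int outbound_rekey(void)
{
    int ret = pcd_table_op();
    return (ret == -EBUSY) ? -EUSERS : ret;
}

/* Models the deferred inbound work entering and leaving its PCD operation. */
static void inbound_work_enter(void)
{
    while (atomic_flag_test_and_set(&pcd_busy))
        ;                           /* spin until the PCD is free */
}

static void inbound_work_leave(void)
{
    atomic_flag_clear(&pcd_busy);
}
```

Calling `outbound_rekey()` between `inbound_work_enter()` and `inbound_work_leave()` yields -EUSERS, while the same call before or after succeeds, which matches the timing-dependent failure described above.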
2. SA (security association) resource leakage
We had devised a solution based on a retry strategy: whenever we get the EBUSY error, we assume the race condition occurred and simply try again. But in outbound processing, we have no way to first release the SA resource allocated by the previous attempt. Each retry allocates a new SA that is never released, so eventually all SAs could be exhausted. The details follow.
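To make the exhaustion concrete, here is a small user-space model, not the driver itself. The pool size, the `sa_pool` structure, `model_get_new_sa`, and `rekey_outbound_leaky` are illustrative assumptions, as is the -ENOMEM returned when the pool is empty; only the behavior "SA taken, not returned when the PCD is busy" comes from the description above.

```c
#include <errno.h>

#define POOL_SIZE 4   /* illustrative pool size, not the driver's real limit */

struct sa_pool {
    int num_used_sas;   /* number of SAs currently taken from the pool */
};

/* Hypothetical model of SA allocation: fails once the pool is exhausted. */
static int model_get_new_sa(struct sa_pool *p)
{
    if (p->num_used_sas >= POOL_SIZE)
        return -ENOMEM;     /* assumed error code for an empty pool */
    p->num_used_sas++;
    return 0;
}

/*
 * Models an outbound rekey attempt without rollback: when the PCD is
 * busy, the function returns -EUSERS but keeps the SA it just took.
 */
static int rekey_outbound_leaky(struct sa_pool *p, int pcd_is_busy)
{
    int ret = model_get_new_sa(p);

    if (ret < 0)
        return ret;
    if (pcd_is_busy)
        return -EUSERS;     /* the new SA is still counted as used */
    return 0;
}
```

In this model, `POOL_SIZE` failed attempts in a row leave the pool empty, so a later attempt fails at allocation even if the PCD is free by then.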
Today, while reviewing the DPAA driver code to evaluate the retry strategy, we found a potential problem in the function ‘dpa_ipsec_sa_rekeying’ when it processes an outbound SA. The main line of this function first calls ‘get_new_sa’ to allocate a new SA, and then enters one of two branches, outbound or inbound. At most error points (ret < 0), a goto statement jumps to the label ‘rekey_sa_err’, which calls ‘rollback_rekeying_sa’, which in turn calls ‘put_sa’ to release the newly allocated SA. That is why you told us that inbound has a rollback facility that makes it possible for upper-layer code to retry.
Nevertheless, we think outbound should have the same rollback facility when ‘update_outbound_policy’ fails, especially with -EUSERS, which means the PCD is busy. The upper-layer application has no way to release the new SA before retrying unless the DPAA driver handles it properly.
‘put_instance’ appears only to decrement the reference counter of ‘dpa_ipsec’; it does not release the SA. ‘put_instance’ pairs with ‘get_instance’, whereas ‘get_new_sa’ pairs with ‘put_sa’: ‘get_new_sa’ allocates an SA from ‘dpa_ipsec->sa_mng.sa’ and increments ‘dpa_ipsec->num_used_sas’ by 1, and ‘put_sa’ returns an SA to ‘dpa_ipsec->sa_mng.sa’ and decrements ‘dpa_ipsec->num_used_sas’ by 1.
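The rollback we are asking for could look roughly like the following user-space sketch. It is not a patch against the driver: `sa_mgr`, `sa_get`, `sa_put`, `update_policy_stub`, and `rekey_with_retry` are hypothetical names, and the stub simply fails with -EUSERS a given number of times to imitate a busy PCD. Only the ‘rekey_sa_err’ label and the idea of releasing the new SA on failure are taken from the description above.

```c
#include <assert.h>
#include <errno.h>

#define SA_CAPACITY 4   /* illustrative capacity, not the driver's real limit */

struct sa_mgr {
    int num_used_sas;
};

static int sa_get(struct sa_mgr *m)
{
    if (m->num_used_sas >= SA_CAPACITY)
        return -ENOMEM;             /* assumed error code for an empty pool */
    m->num_used_sas++;
    return 0;
}

static void sa_put(struct sa_mgr *m)
{
    assert(m->num_used_sas > 0);    /* every put must match an earlier get */
    m->num_used_sas--;
}

/* Stand-in for the policy update: fails with -EUSERS while *busy_left > 0. */
static int update_policy_stub(int *busy_left)
{
    if (*busy_left > 0) {
        (*busy_left)--;
        return -EUSERS;
    }
    return 0;
}

/* Outbound rekey attempt that releases the new SA on any failure. */
static int rekey_outbound_with_rollback(struct sa_mgr *m, int *busy_left)
{
    int ret = sa_get(m);

    if (ret < 0)
        return ret;
    ret = update_policy_stub(busy_left);
    if (ret < 0)
        goto rekey_sa_err;
    return 0;

rekey_sa_err:
    sa_put(m);                      /* roll back so the caller can retry */
    return ret;
}

/* Caller-side retry: only -EUSERS is retried, at most max_tries times. */
static int rekey_with_retry(struct sa_mgr *m, int *busy_left, int max_tries)
{
    int ret = -EUSERS;

    for (int i = 0; i < max_tries && ret == -EUSERS; i++)
        ret = rekey_outbound_with_rollback(m, busy_left);
    return ret;
}
```

With the rollback in place, the used-SA count in this model grows only on a successful rekey, so the upper layer can retry on -EUSERS without draining the pool.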