Fixed
Created: Jul 9, 2025
Updated: Sep 1, 2025
Resolved Date: Jul 15, 2025
Found In Version: 10.25.33.1
Severity: Standard
Applicable for: Wind River Linux LTS 25
Component/s: Kernel
In the Linux kernel, the following vulnerability has been resolved:

mm: userfaultfd: fix race of userfaultfd_move and swap cache

This commit fixes two kinds of races, they may have different results:

Barry reported a BUG_ON in commit c50f8e6053b0, we may see the same
BUG_ON if the filemap lookup returned NULL and folio is added to swap
cache after that.

If another kind of race is triggered (folio changed after lookup) we
may see RSS counter is corrupted:

[ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_ANONPAGES val:-1
[ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_SHMEMPAGES val:1

Because the folio is being accounted to the wrong VMA.

I'm not sure if there will be any data corruption though, seems no.
The issues above are critical already.

On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
lookup, and tries to move the found folio to the faulting vma. Currently,
it relies on checking the PTE value to ensure that the moved folio still
belongs to the src swap entry and that no new folio has been added to the
swap cache, which turns out to be unreliable.

While working and reviewing the swap table series with Barry, following
existing races are observed and reproduced [1]:

In the example below, move_pages_pte is moving src_pte to dst_pte, where
src_pte is a swap entry PTE holding swap entry S1, and S1 is not in the
swap cache:

CPU1                                   CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... < interrupted> ...
                                       <swapin src_pte, alloc and use folio A>
                                       // folio A is a new allocated folio
                                       // and get installed into src_pte
                                       <frees swap entry S1>
                                       // src_pte now points to folio A, S1
                                       // has swap count == 0, it can be freed
                                       // by folio_swap_swap or swap
                                       // allocator's reclaim.
                                       <try to swap out another folio B>
                                       // folio B is a folio in another VMA.
                                       <put folio B to swap cache using S1 >
                                       // S1 is freed, folio B can use it
                                       // for swap out with no problem.
                                       ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
    ... < interrupted again> ...
                                       <swapin folio B and free S1>
                                       // Now S1 is free to be used again.
                                       <swapout src_pte & folio A using S1>
                                       // Now src_pte is a swap entry PTE
                                       // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!

The race window is very short and requires multiple collisions of multiple
rare events, so it's very unlikely to happen, but with a deliberately
constructed reproducer and increased time window, it can be reproduced
easily.

This can be fixed by checking if the folio returned by filemap is the
valid swap cache folio after acquiring the folio lock.

Another similar race is possible: filemap_get_folio may return NULL, but
folio (A) could be swapped in and then swapped out again using the same
swap entry after the lookup. In such a case, folio (A) may remain in the
swap cache, so it must be moved too:

CPU1                                   CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1, and S1 is not in swap cache
    folio = filemap_get
---truncated---
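The first race and the recheck described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every `toy_*` type and function is hypothetical and invented for illustration, none of them is a kernel API, and the actual kernel change is not reproduced here. The model replays the CPU1/CPU2 interleaving sequentially and shows that comparing the PTE value alone accepts the stale folio B, while rechecking the folio against the swap entry after locking it rejects B.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical types, not kernel structures. */
struct toy_folio {
    bool in_swapcache;    /* stands in for "folio is in the swap cache" */
    unsigned long entry;  /* swap entry the folio was last cached under */
    bool locked;          /* stands in for the folio lock */
};

/* One swap cache slot, keyed by a single swap entry value. */
struct toy_swap_slot {
    unsigned long entry;
    struct toy_folio *folio;  /* NULL when nothing is cached */
};

static struct toy_folio *toy_lookup(const struct toy_swap_slot *slot,
                                    unsigned long entry)
{
    return slot->entry == entry ? slot->folio : NULL;
}

static void toy_cache_add(struct toy_swap_slot *slot, struct toy_folio *folio)
{
    folio->in_swapcache = true;
    folio->entry = slot->entry;
    slot->folio = folio;
}

static void toy_cache_del(struct toy_swap_slot *slot)
{
    if (slot->folio) {
        slot->folio->in_swapcache = false;
        slot->folio = NULL;
    }
}

/* Recheck, with the folio locked, that it still backs the expected entry. */
static bool toy_folio_still_valid(const struct toy_folio *folio,
                                  unsigned long entry)
{
    assert(folio->locked);
    return folio->in_swapcache && folio->entry == entry;
}

/* Replays the interleaving; returns true if the move would go ahead. */
static bool toy_replay_race(bool recheck_under_lock)
{
    const unsigned long S1 = 1;
    struct toy_folio a = {0}, b = {0};
    struct toy_swap_slot slot = { .entry = S1, .folio = NULL };
    unsigned long src_pte_entry;
    struct toy_folio *found;
    bool ok;

    /* CPU2: src_pte swapped in as A, S1 freed, B swapped out using S1. */
    toy_cache_add(&slot, &b);

    /* CPU1: lockless lookup returns folio B. */
    found = toy_lookup(&slot, S1);
    if (!found)
        return false;

    /* CPU2: B swapped in and S1 freed, then A swapped out using S1. */
    toy_cache_del(&slot);
    toy_cache_add(&slot, &a);
    src_pte_entry = S1;  /* src_pte holds S1 again */

    /* CPU1: lock the folio, then run the PTE-value check. */
    found->locked = true;
    ok = (src_pte_entry == S1);
    if (recheck_under_lock)
        ok = ok && toy_folio_still_valid(found, S1);
    found->locked = false;

    return ok;
}
```

Calling `toy_replay_race(false)` models the PTE-only check and reports that the move would proceed with folio B; `toy_replay_race(true)` adds the recheck under the folio lock and reports that the move is refused.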
CREATE(Triage):(User=admin) [CVE-2025-38242 (https://nvd.nist.gov/vuln/detail/CVE-2025-38242)]