Script 'mail_helper' called by obssrc Hello community, here is the log from the commit of package xen for openSUSE:Factory checked in at 2022-08-01 21:28:06 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Comparing /work/SRC/openSUSE:Factory/xen (Old) and /work/SRC/openSUSE:Factory/.xen.new.1533 (New) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Package is "xen" Mon Aug 1 21:28:06 2022 rev:319 rq:991297 version:4.16.1_06 Changes: -------- --- /work/SRC/openSUSE:Factory/xen/xen.changes 2022-07-01 13:43:51.674799559 +0200 +++ /work/SRC/openSUSE:Factory/.xen.new.1533/xen.changes 2022-08-01 21:28:11.237275758 +0200 @@ -1,0 +2,61 @@ +Wed Jul 13 11:10:03 MDT 2022 - carnold@suse.com + +- Added --disable-pvshim when running configure in xen.spec. + We have never shipped the shim and don't need to build it. + +------------------------------------------------------------------- +Tue Jul 13 10:30:00 CEST 2022 - jbeulich@suse.com + +- bsc#1199965 - VUL-0: CVE-2022-26362: xen: Race condition + in typeref acquisition + 62a1e594-x86-clean-up-_get_page_type.patch + 62a1e5b0-x86-ABAC-race-in-_get_page_type.patch +- bsc#1199966 - VUL-0: CVE-2022-26363,CVE-2022-26364: xen: + Insufficient care with non-coherent mappings + 62a1e5d2-x86-introduce-_PAGE_-for-mem-types.patch + 62a1e5f0-x86-dont-change-cacheability-of-directmap.patch + 62a1e60e-x86-split-cache_flush-out-of-cache_writeback.patch + 62a1e62b-x86-AMD-work-around-CLFLUSH-ordering.patch + 62a1e649-x86-track-and-flush-non-coherent.patch +- bsc#1200549 VUL-0: CVE-2022-21123,CVE-2022-21125,CVE-2022-21166: + xen: x86: MMIO Stale Data vulnerabilities (XSA-404) + 62ab0fab-x86-spec-ctrl-VERW-flushing-runtime-cond.patch + 62ab0fac-x86-spec-ctrl-enum-for-MMIO-Stale-Data.patch + 62ab0fad-x86-spec-ctrl-add-unpriv-mmio.patch +- bsc#1201469 - VUL-0: CVE-2022-23816,CVE-2022-23825,CVE-2022-29900: + xen: retbleed - arbitrary speculative code execution with return + instructions (XSA-407) + 62cc31ed-x86-honour-spec-ctrl-0-for-unpriv-mmio.patch + 62cc31ee-cmdline-extend-parse_boolean.patch + 62cc31ef-x86-spec-ctrl-fine-grained-cmdline-subopts.patch + 62cd91d0-x86-spec-ctrl-rework-context-switching.patch + 62cd91d1-x86-spec-ctrl-rename-SCF_ist_wrmsr.patch + 62cd91d2-x86-spec-ctrl-rename-opt_ibpb.patch + 62cd91d3-x86-spec-ctrl-rework-SPEC_CTRL_ENTRY_FROM_INTR_IST.patch + 62cd91d4-x86-spec-ctrl-IBPB-on-entry.patch + 62cd91d5-x86-cpuid-BTC_NO-enum.patch + 62cd91d6-x86-spec-ctrl-enable-Zen2-chickenbit.patch + 62cd91d7-x86-spec-ctrl-mitigate-Branch-Type-Confusion.patch +- Upstream bug fixes (bsc#1027519) + 62a99614-IOMMU-x86-gcc12.patch + 62bdd840-x86-spec-ctrl-only-adjust-idle-with-legacy-IBRS.patch + 62bdd841-x86-spec-ctrl-knobs-for-STIBP-and-PSFD.patch +- Drop patches replaced by upstream versions + xsa401-1.patch + xsa401-2.patch + xsa402-1.patch + xsa402-2.patch + xsa402-3.patch + xsa402-4.patch + xsa402-5.patch + +------------------------------------------------------------------- +Tue Jul 12 08:32:19 MDT 2022 - carnold@suse.com + +- bsc#1201394 - VUL-0: CVE-2022-33745: xen: insufficient TLB flush + for x86 PV guests in shadow mode (XSA-408) + xsa408.patch +- Fix gcc13 compilation error + 62c56cc0-libxc-fix-compilation-error-with-gcc13.patch + +------------------------------------------------------------------- @@ -5,0 +67,26 @@ + +------------------------------------------------------------------- +Tue Jun 08 17:50:00 CEST 2022 - jbeulich@suse.com + +- bsc#1199966 - VUL-0: EMBARGOED: CVE-2022-26363,CVE-2022-26364: 
xen: + Insufficient care with non-coherent mappings + fix xsa402-5.patch + +------------------------------------------------------------------- +Tue May 31 17:25:00 CEST 2022 - jbeulich@suse.com + +- Upstream bug fixes (bsc#1027519) + 625fca42-VT-d-reserved-CAP-ND.patch + 626f7ee8-x86-MSR-handle-P5-MC-reads.patch + 627549d6-IO-shutdown-race.patch +- bsc#1199965 - VUL-0: EMBARGOED: CVE-2022-26362: xen: Race condition + in typeref acquisition + xsa401-1.patch + xsa401-2.patch +- bsc#1199966 - VUL-0: EMBARGOED: CVE-2022-26363,CVE-2022-26364: xen: + Insufficient care with non-coherent mappings + xsa402-1.patch + xsa402-2.patch + xsa402-3.patch + xsa402-4.patch + xsa402-5.patch New: ---- 625fca42-VT-d-reserved-CAP-ND.patch 626f7ee8-x86-MSR-handle-P5-MC-reads.patch 627549d6-IO-shutdown-race.patch 62a1e594-x86-clean-up-_get_page_type.patch 62a1e5b0-x86-ABAC-race-in-_get_page_type.patch 62a1e5d2-x86-introduce-_PAGE_-for-mem-types.patch 62a1e5f0-x86-dont-change-cacheability-of-directmap.patch 62a1e60e-x86-split-cache_flush-out-of-cache_writeback.patch 62a1e62b-x86-AMD-work-around-CLFLUSH-ordering.patch 62a1e649-x86-track-and-flush-non-coherent.patch 62a99614-IOMMU-x86-gcc12.patch 62ab0fab-x86-spec-ctrl-VERW-flushing-runtime-cond.patch 62ab0fac-x86-spec-ctrl-enum-for-MMIO-Stale-Data.patch 62ab0fad-x86-spec-ctrl-add-unpriv-mmio.patch 62bdd840-x86-spec-ctrl-only-adjust-idle-with-legacy-IBRS.patch 62bdd841-x86-spec-ctrl-knobs-for-STIBP-and-PSFD.patch 62c56cc0-libxc-fix-compilation-error-with-gcc13.patch 62cc31ed-x86-honour-spec-ctrl-0-for-unpriv-mmio.patch 62cc31ee-cmdline-extend-parse_boolean.patch 62cc31ef-x86-spec-ctrl-fine-grained-cmdline-subopts.patch 62cd91d0-x86-spec-ctrl-rework-context-switching.patch 62cd91d1-x86-spec-ctrl-rename-SCF_ist_wrmsr.patch 62cd91d2-x86-spec-ctrl-rename-opt_ibpb.patch 62cd91d3-x86-spec-ctrl-rework-SPEC_CTRL_ENTRY_FROM_INTR_IST.patch 62cd91d4-x86-spec-ctrl-IBPB-on-entry.patch 62cd91d5-x86-cpuid-BTC_NO-enum.patch 62cd91d6-x86-spec-ctrl-enable-Zen2-chickenbit.patch 62cd91d7-x86-spec-ctrl-mitigate-Branch-Type-Confusion.patch xsa408.patch ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other differences: ------------------ ++++++ xen.spec ++++++ --- /var/tmp/diff_new_pack.TxjnPB/_old 2022-08-01 21:28:12.825280313 +0200 +++ /var/tmp/diff_new_pack.TxjnPB/_new 2022-08-01 21:28:12.829280325 +0200 @@ -119,7 +119,7 @@ %endif Provides: installhint(reboot-needed) -Version: 4.16.1_02 +Version: 4.16.1_06 Release: 0 Summary: Xen Virtualization: Hypervisor (aka VMM aka Microkernel) License: GPL-2.0-only @@ -155,7 +155,36 @@ # For xen-libs Source99: baselibs.conf # Upstream patches +Patch1: 625fca42-VT-d-reserved-CAP-ND.patch +Patch2: 626f7ee8-x86-MSR-handle-P5-MC-reads.patch +Patch3: 627549d6-IO-shutdown-race.patch +Patch4: 62a1e594-x86-clean-up-_get_page_type.patch +Patch5: 62a1e5b0-x86-ABAC-race-in-_get_page_type.patch +Patch6: 62a1e5d2-x86-introduce-_PAGE_-for-mem-types.patch +Patch7: 62a1e5f0-x86-dont-change-cacheability-of-directmap.patch +Patch8: 62a1e60e-x86-split-cache_flush-out-of-cache_writeback.patch +Patch9: 62a1e62b-x86-AMD-work-around-CLFLUSH-ordering.patch +Patch10: 62a1e649-x86-track-and-flush-non-coherent.patch +Patch11: 62a99614-IOMMU-x86-gcc12.patch +Patch12: 62ab0fab-x86-spec-ctrl-VERW-flushing-runtime-cond.patch +Patch13: 62ab0fac-x86-spec-ctrl-enum-for-MMIO-Stale-Data.patch +Patch14: 62ab0fad-x86-spec-ctrl-add-unpriv-mmio.patch +Patch15: 62bdd840-x86-spec-ctrl-only-adjust-idle-with-legacy-IBRS.patch +Patch16: 
62bdd841-x86-spec-ctrl-knobs-for-STIBP-and-PSFD.patch +Patch17: 62c56cc0-libxc-fix-compilation-error-with-gcc13.patch +Patch18: 62cc31ed-x86-honour-spec-ctrl-0-for-unpriv-mmio.patch +Patch19: 62cc31ee-cmdline-extend-parse_boolean.patch +Patch20: 62cc31ef-x86-spec-ctrl-fine-grained-cmdline-subopts.patch +Patch21: 62cd91d0-x86-spec-ctrl-rework-context-switching.patch +Patch22: 62cd91d1-x86-spec-ctrl-rename-SCF_ist_wrmsr.patch +Patch23: 62cd91d2-x86-spec-ctrl-rename-opt_ibpb.patch +Patch24: 62cd91d3-x86-spec-ctrl-rework-SPEC_CTRL_ENTRY_FROM_INTR_IST.patch +Patch25: 62cd91d4-x86-spec-ctrl-IBPB-on-entry.patch +Patch26: 62cd91d5-x86-cpuid-BTC_NO-enum.patch +Patch27: 62cd91d6-x86-spec-ctrl-enable-Zen2-chickenbit.patch +Patch28: 62cd91d7-x86-spec-ctrl-mitigate-Branch-Type-Confusion.patch # EMBARGOED security fixes +Patch108: xsa408.patch # libxc Patch301: libxc-bitmap-long.patch Patch302: libxc-sr-xl-migration-debug.patch @@ -480,6 +509,7 @@ configure_flags="${configure_flags} --disable-qemu-traditional" ./configure \ --disable-xen \ + --disable-pvshim \ --enable-tools \ --enable-docs \ --prefix=/usr \ ++++++ 625fca42-VT-d-reserved-CAP-ND.patch ++++++ # Commit a1545fbf45c689aff39ce76a6eaa609d32ef72a7 # Date 2022-04-20 10:54:26 +0200 # Author Jan Beulich <jbeulich@suse.com> # Committer Jan Beulich <jbeulich@suse.com> VT-d: refuse to use IOMMU with reserved CAP.ND value The field taking the value 7 (resulting in 18-bit DIDs when using the calculation in cap_ndoms(), when the DID fields are only 16 bits wide) is reserved. Instead of misbehaving in case we would encounter such an IOMMU, refuse to use it. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monn�� <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -1279,8 +1279,11 @@ int __init iommu_alloc(struct acpi_drhd_ quirk_iommu_caps(iommu); + nr_dom = cap_ndoms(iommu->cap); + if ( cap_fault_reg_offset(iommu->cap) + cap_num_fault_regs(iommu->cap) * PRIMARY_FAULT_REG_LEN >= PAGE_SIZE || + ((nr_dom - 1) >> 16) /* I.e. cap.nd > 6 */ || ecap_iotlb_offset(iommu->ecap) >= PAGE_SIZE ) { printk(XENLOG_ERR VTDPREFIX "IOMMU: unsupported\n"); @@ -1305,7 +1308,6 @@ int __init iommu_alloc(struct acpi_drhd_ vtd_ops.sync_cache = sync_cache; /* allocate domain id bitmap */ - nr_dom = cap_ndoms(iommu->cap); iommu->domid_bitmap = xzalloc_array(unsigned long, BITS_TO_LONGS(nr_dom)); if ( !iommu->domid_bitmap ) return -ENOMEM; ++++++ 626f7ee8-x86-MSR-handle-P5-MC-reads.patch ++++++ # Commit ce59e472b581e4923f6892172dde62b88c8aa8b7 # Date 2022-05-02 08:49:12 +0200 # Author Roger Pau Monn�� <roger.pau@citrix.com> # Committer Jan Beulich <jbeulich@suse.com> x86/msr: handle reads to MSR_P5_MC_{ADDR,TYPE} Windows Server 2019 Essentials will unconditionally attempt to read P5_MC_ADDR MSR at boot and throw a BSOD if injected a #GP. Fix this by mapping MSR_P5_MC_{ADDR,TYPE} to MSR_IA32_MCi_{ADDR,STATUS}, as reported also done by hardware in Intel SDM "Mapping of the Pentium Processor Machine-Check Errors to the Machine-Check Architecture" section. 
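The aliasing described above can be pictured with a small standalone C sketch. This is illustrative only, not the patch's code: the structure and helper names are invented, though the MSR indices and the use of bank 1 (because bank 0 is reserved for the "bank 0 quirk") follow the patch below.

#include <stdbool.h>
#include <stdint.h>

#define MSR_P5_MC_ADDR   0x00000000
#define MSR_P5_MC_TYPE   0x00000001

struct vmce_bank { uint64_t mci_status, mci_addr; };

/* Returns true if the read of a Pentium-era MC MSR was satisfied. */
static bool read_p5_mc_msr(const struct vmce_bank *banks, uint32_t msr,
                           uint64_t *val)
{
    switch ( msr )
    {
    case MSR_P5_MC_ADDR:
        *val = banks[1].mci_addr;     /* bank 0 is the "bank 0 quirk" bank */
        return true;
    case MSR_P5_MC_TYPE:
        *val = banks[1].mci_status;
        return true;
    }
    return false;                     /* not a P5 MC MSR; handle elsewhere */
}
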
Reported-by: Steffen Einsle <einsle@phptrix.de> Signed-off-by: Roger Pau Monn�� <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/cpu/mcheck/mce.h +++ b/xen/arch/x86/cpu/mcheck/mce.h @@ -169,6 +169,12 @@ static inline int mce_vendor_bank_msr(co if (msr >= MSR_IA32_MC0_CTL2 && msr < MSR_IA32_MCx_CTL2(v->arch.vmce.mcg_cap & MCG_CAP_COUNT) ) return 1; + fallthrough; + + case X86_VENDOR_CENTAUR: + case X86_VENDOR_SHANGHAI: + if (msr == MSR_P5_MC_ADDR || msr == MSR_P5_MC_TYPE) + return 1; break; case X86_VENDOR_AMD: --- a/xen/arch/x86/cpu/mcheck/mce_intel.c +++ b/xen/arch/x86/cpu/mcheck/mce_intel.c @@ -1001,8 +1001,27 @@ int vmce_intel_wrmsr(struct vcpu *v, uin int vmce_intel_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val) { + const struct cpuid_policy *cp = v->domain->arch.cpuid; unsigned int bank = msr - MSR_IA32_MC0_CTL2; + switch ( msr ) + { + case MSR_P5_MC_ADDR: + /* + * Bank 0 is used for the 'bank 0 quirk' on older processors. + * See vcpu_fill_mc_msrs() for reference. + */ + *val = v->arch.vmce.bank[1].mci_addr; + return 1; + + case MSR_P5_MC_TYPE: + *val = v->arch.vmce.bank[1].mci_status; + return 1; + } + + if ( !(cp->x86_vendor & X86_VENDOR_INTEL) ) + return 0; + if ( bank < GUEST_MC_BANK_NUM ) { *val = v->arch.vmce.bank[bank].mci_ctl2; --- a/xen/arch/x86/cpu/mcheck/vmce.c +++ b/xen/arch/x86/cpu/mcheck/vmce.c @@ -150,6 +150,8 @@ static int bank_mce_rdmsr(const struct v default: switch ( boot_cpu_data.x86_vendor ) { + case X86_VENDOR_CENTAUR: + case X86_VENDOR_SHANGHAI: case X86_VENDOR_INTEL: ret = vmce_intel_rdmsr(v, msr, val); break; --- a/xen/include/asm-x86/msr-index.h +++ b/xen/include/asm-x86/msr-index.h @@ -15,6 +15,9 @@ * abbreviated name. Exceptions will be considered on a case-by-case basis. */ +#define MSR_P5_MC_ADDR 0 +#define MSR_P5_MC_TYPE 0x00000001 + #define MSR_APIC_BASE 0x0000001b #define APIC_BASE_BSP (_AC(1, ULL) << 8) #define APIC_BASE_EXTD (_AC(1, ULL) << 10) --- a/xen/arch/x86/msr.c +++ b/xen/arch/x86/msr.c @@ -282,6 +282,8 @@ int guest_rdmsr(struct vcpu *v, uint32_t *val = msrs->misc_features_enables.raw; break; + case MSR_P5_MC_ADDR: + case MSR_P5_MC_TYPE: case MSR_IA32_MCG_CAP ... MSR_IA32_MCG_CTL: /* 0x179 -> 0x17b */ case MSR_IA32_MCx_CTL2(0) ... MSR_IA32_MCx_CTL2(31): /* 0x280 -> 0x29f */ case MSR_IA32_MCx_CTL(0) ... MSR_IA32_MCx_MISC(31): /* 0x400 -> 0x47f */ ++++++ 627549d6-IO-shutdown-race.patch ++++++ # Commit b7e0d8978810b534725e94a321736496928f00a5 # Date 2022-05-06 17:16:22 +0100 # Author Julien Grall <jgrall@amazon.com> # Committer Julien Grall <jgrall@amazon.com> xen: io: Fix race between sending an I/O and domain shutdown Xen provides hypercalls to shutdown (SCHEDOP_shutdown{,_code}) and resume a domain (XEN_DOMCTL_resumedomain). They can be used for checkpoint where the expectation is the domain should continue as nothing happened afterwards. hvmemul_do_io() and handle_pio() will act differently if the return code of hvm_send_ioreq() (resp. hvmemul_do_pio_buffer()) is X86EMUL_RETRY. In this case, the I/O state will be reset to STATE_IOREQ_NONE (i.e no I/O is pending) and/or the PC will not be advanced. If the shutdown request happens right after the I/O was sent to the IOREQ, then emulation code will end up to re-execute the instruction and therefore forward again the same I/O (at least when reading IO port). This would be problem if the access has a side-effect. A dumb example, is a device implementing a counter which is incremented by one for every access. 
When running shutdown/resume in a loop, the value read by the OS may not be the old value + 1. Add an extra boolean in the structure hvm_vcpu_io to indicate whether the I/O was suspended. This is then used in place of checking the domain is shutting down in hvmemul_do_io() and handle_pio() as they should act on suspend (i.e. vcpu_start_shutdown_deferral() returns false) rather than shutdown. Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Paul Durrant <paul@xen.org> --- a/xen/arch/arm/ioreq.c +++ b/xen/arch/arm/ioreq.c @@ -80,9 +80,10 @@ enum io_state try_fwd_ioserv(struct cpu_ return IO_ABORT; vio->req = p; + vio->suspended = false; rc = ioreq_send(s, &p, 0); - if ( rc != IO_RETRY || v->domain->is_shutting_down ) + if ( rc != IO_RETRY || vio->suspended ) vio->req.state = STATE_IOREQ_NONE; else if ( !ioreq_needs_completion(&vio->req) ) rc = IO_HANDLED; --- a/xen/arch/x86/hvm/emulate.c +++ b/xen/arch/x86/hvm/emulate.c @@ -239,6 +239,7 @@ static int hvmemul_do_io( ASSERT(p.count); vio->req = p; + vio->suspended = false; rc = hvm_io_intercept(&p); @@ -334,7 +335,7 @@ static int hvmemul_do_io( else { rc = ioreq_send(s, &p, 0); - if ( rc != X86EMUL_RETRY || currd->is_shutting_down ) + if ( rc != X86EMUL_RETRY || vio->suspended ) vio->req.state = STATE_IOREQ_NONE; else if ( !ioreq_needs_completion(&vio->req) ) rc = X86EMUL_OKAY; --- a/xen/arch/x86/hvm/io.c +++ b/xen/arch/x86/hvm/io.c @@ -138,10 +138,11 @@ bool handle_pio(uint16_t port, unsigned case X86EMUL_RETRY: /* - * We should not advance RIP/EIP if the domain is shutting down or - * if X86EMUL_RETRY has been returned by an internal handler. + * We should not advance RIP/EIP if the vio was suspended (e.g. + * because the domain is shutting down) or if X86EMUL_RETRY has + * been returned by an internal handler. */ - if ( curr->domain->is_shutting_down || !vcpu_ioreq_pending(curr) ) + if ( vio->suspended || !vcpu_ioreq_pending(curr) ) return false; break; --- a/xen/common/ioreq.c +++ b/xen/common/ioreq.c @@ -1256,6 +1256,7 @@ int ioreq_send(struct ioreq_server *s, i struct vcpu *curr = current; struct domain *d = curr->domain; struct ioreq_vcpu *sv; + struct vcpu_io *vio = &curr->io; ASSERT(s); @@ -1263,7 +1264,10 @@ int ioreq_send(struct ioreq_server *s, i return ioreq_send_buffered(s, proto_p); if ( unlikely(!vcpu_start_shutdown_deferral(curr)) ) + { + vio->suspended = true; return IOREQ_STATUS_RETRY; + } list_for_each_entry ( sv, &s->ioreq_vcpu_list, --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -159,6 +159,11 @@ enum vio_completion { struct vcpu_io { /* I/O request in flight to device model. */ enum vio_completion completion; + /* + * Indicate whether the I/O was not handled because the domain + * is about to be paused. + */ + bool suspended; ioreq_t req; }; ++++++ 62a1e594-x86-clean-up-_get_page_type.patch ++++++ # Commit 9186e96b199e4f7e52e033b238f9fe869afb69c7 # Date 2022-06-09 14:20:36 +0200 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Jan Beulich <jbeulich@suse.com> x86/pv: Clean up _get_page_type() Various fixes for clarity, ahead of making complicated changes. * Split the overflow check out of the if/else chain for type handling, as it's somewhat unrelated. * Comment the main if/else chain to explain what is going on. Adjust one ASSERT() and state the bit layout for validate-locked and partial states. * Correct the comment about TLB flushing, as it's backwards. 
The problem case is when writeable mappings are retained to a page becoming read-only, as it allows the guest to bypass Xen's safety checks for updates. * Reduce the scope of 'y'. It is an artefact of the cmpxchg loop and not valid for use by subsequent logic. Switch to using ACCESS_ONCE() to treat all reads as explicitly volatile. The only thing preventing the validated wait-loop being infinite is the compiler barrier hidden in cpu_relax(). * Replace one page_get_owner(page) with the already-calculated 'd' already in scope. No functional change. This is part of XSA-401 / CVE-2022-26362. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> # Commit c2095ac76be0f4a1940346c9ffb49fb967345060 # Date 2022-06-10 10:21:06 +0200 # Author Jan Beulich <jbeulich@suse.com> # Committer Jan Beulich <jbeulich@suse.com> x86/mm: account for PGT_pae_xen_l2 in recently added assertion While PGT_pae_xen_l2 will be zapped once the type refcount of an L2 page reaches zero, it'll be retained as long as the type refcount is non- zero. Hence any checking against the requested type needs to either zap the bit from the type or include it in the used mask. Fixes: 9186e96b199e ("x86/pv: Clean up _get_page_type()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -2906,16 +2906,17 @@ static int _put_page_type(struct page_in static int _get_page_type(struct page_info *page, unsigned long type, bool preemptible) { - unsigned long nx, x, y = page->u.inuse.type_info; + unsigned long nx, x; int rc = 0; ASSERT(!(type & ~(PGT_type_mask | PGT_pae_xen_l2))); ASSERT(!in_irq()); - for ( ; ; ) + for ( unsigned long y = ACCESS_ONCE(page->u.inuse.type_info); ; ) { x = y; nx = x + 1; + if ( unlikely((nx & PGT_count_mask) == 0) ) { gdprintk(XENLOG_WARNING, @@ -2923,8 +2924,15 @@ static int _get_page_type(struct page_in mfn_x(page_to_mfn(page))); return -EINVAL; } - else if ( unlikely((x & PGT_count_mask) == 0) ) + + if ( unlikely((x & PGT_count_mask) == 0) ) { + /* + * Typeref 0 -> 1. + * + * Type changes are permitted when the typeref is 0. If the type + * actually changes, the page needs re-validating. + */ struct domain *d = page_get_owner(page); if ( d && shadow_mode_enabled(d) ) @@ -2935,8 +2943,8 @@ static int _get_page_type(struct page_in { /* * On type change we check to flush stale TLB entries. It is - * vital that no other CPUs are left with mappings of a frame - * which is about to become writeable to the guest. + * vital that no other CPUs are left with writeable mappings + * to a frame which is intending to become pgtable/segdesc. */ cpumask_t *mask = this_cpu(scratch_cpumask); @@ -2948,7 +2956,7 @@ static int _get_page_type(struct page_in if ( unlikely(!cpumask_empty(mask)) && /* Shadow mode: track only writable pages. */ - (!shadow_mode_enabled(page_get_owner(page)) || + (!shadow_mode_enabled(d) || ((nx & PGT_type_mask) == PGT_writable_page)) ) { perfc_incr(need_flush_tlb_flush); @@ -2979,7 +2987,14 @@ static int _get_page_type(struct page_in } else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) ) { - /* Don't log failure if it could be a recursive-mapping attempt. */ + /* + * else, we're trying to take a new reference, of the wrong type. 
+ * + * This (being able to prohibit use of the wrong type) is what the + * typeref system exists for, but skip printing the failure if it + * looks like a recursive mapping, as subsequent logic might + * ultimately permit the attempt. + */ if ( ((x & PGT_type_mask) == PGT_l2_page_table) && (type == PGT_l1_page_table) ) return -EINVAL; @@ -2998,18 +3013,47 @@ static int _get_page_type(struct page_in } else if ( unlikely(!(x & PGT_validated)) ) { + /* + * else, the count is non-zero, and we're grabbing the right type; + * but the page hasn't been validated yet. + * + * The page is in one of two states (depending on PGT_partial), + * and should have exactly one reference. + */ + ASSERT((x & (PGT_type_mask | PGT_pae_xen_l2 | PGT_count_mask)) == + (type | 1)); + if ( !(x & PGT_partial) ) { - /* Someone else is updating validation of this page. Wait... */ + /* + * The page has been left in the "validate locked" state + * (i.e. PGT_[type] | 1) which means that a concurrent caller + * of _get_page_type() is in the middle of validation. + * + * Spin waiting for the concurrent user to complete (partial + * or fully validated), then restart our attempt to acquire a + * type reference. + */ do { if ( preemptible && hypercall_preempt_check() ) return -EINTR; cpu_relax(); - } while ( (y = page->u.inuse.type_info) == x ); + } while ( (y = ACCESS_ONCE(page->u.inuse.type_info)) == x ); continue; } - /* Type ref count was left at 1 when PGT_partial got set. */ - ASSERT((x & PGT_count_mask) == 1); + + /* + * The page has been left in the "partial" state + * (i.e., PGT_[type] | PGT_partial | 1). + * + * Rather than bumping the type count, we need to try to grab the + * validation lock; if we succeed, we need to validate the page, + * then drop the general ref associated with the PGT_partial bit. + * + * We grab the validation lock by setting nx to (PGT_[type] | 1) + * (i.e., non-zero type count, neither PGT_validated nor + * PGT_partial set). + */ nx = x & ~PGT_partial; } @@ -3058,6 +3102,13 @@ static int _get_page_type(struct page_in } out: + /* + * Did we drop the PGT_partial bit when acquiring the typeref? If so, + * drop the general reference that went along with it. + * + * N.B. validate_page() may have have re-set PGT_partial, not reflected in + * nx, but will have taken an extra ref when doing so. + */ if ( (x & PGT_partial) && !(nx & PGT_partial) ) put_page(page); ++++++ 62a1e5b0-x86-ABAC-race-in-_get_page_type.patch ++++++ # Commit 8cc5036bc385112a82f1faff27a0970e6440dfed # Date 2022-06-09 14:21:04 +0200 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Jan Beulich <jbeulich@suse.com> x86/pv: Fix ABAC cmpxchg() race in _get_page_type() _get_page_type() suffers from a race condition where it incorrectly assumes that because 'x' was read and a subsequent a cmpxchg() succeeds, the type cannot have changed in-between. Consider: CPU A: 1. Creates an L2e referencing pg `-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page 2. Issues flush_tlb_mask() CPU B: 3. Creates a writeable mapping of pg `-> _get_page_type(pg, PGT_writable_page), count increases to 1 4. Writes into new mapping, creating a TLB entry for pg 5. Removes the writeable mapping of pg `-> _put_page_type(pg), count goes back down to 0 CPU A: 7. Issues cmpxchg(), setting count 1, type PGT_l1_page_table CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and suitably protected (i.e. read-only). 
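The race can be seen in isolation with a minimal standalone C sketch (not Xen code; the type-word layout and helper are invented for illustration). The point is that the safety action is taken against a stale read of the type word, while the cmpxchg() still succeeds because the word has returned to its old value by the time it runs:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TYPE_MASK  0x0000ffffu       /* low bits: type; bit 16 up: refcount */

static void flush_stale_tlb_entries(void) { /* stand-in for the TLB shootdown */ }

static bool get_type_racy(_Atomic uint32_t *typeinfo, uint32_t new_type)
{
    uint32_t x = atomic_load(typeinfo);        /* 1. observe old type, count 0   */
    uint32_t nx = new_type | (1u << 16);       /* intended: new type, count 1    */

    if ( (x & TYPE_MASK) != new_type )
        flush_stale_tlb_entries();             /* 2. safety action, too early    */

    /*
     * Steps 3-5 can happen here on another CPU: it takes and then drops a
     * writable typeref, leaving a stale writable TLB entry behind, and the
     * type word returns to exactly the value read in step 1.
     */

    return atomic_compare_exchange_strong(typeinfo, &x, nx);  /* 7. succeeds anyway */
}
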
The TLB flush in step 2 must be deferred until after the guest is prohibited from creating new writeable mappings, which is after step 7. Defer all safety actions until after the cmpxchg() has successfully taken the intended typeref, because that is what prevents concurrent users from using the old type. Also remove the early validation for writeable and shared pages. This removes race conditions where one half of a parallel mapping attempt can return successfully before: * The IOMMU pagetables are in sync with the new page type * Writeable mappings to shared pages have been torn down This is part of XSA-401 / CVE-2022-26362. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -2933,56 +2933,12 @@ static int _get_page_type(struct page_in * Type changes are permitted when the typeref is 0. If the type * actually changes, the page needs re-validating. */ - struct domain *d = page_get_owner(page); - - if ( d && shadow_mode_enabled(d) ) - shadow_prepare_page_type_change(d, page, type); ASSERT(!(x & PGT_pae_xen_l2)); if ( (x & PGT_type_mask) != type ) { - /* - * On type change we check to flush stale TLB entries. It is - * vital that no other CPUs are left with writeable mappings - * to a frame which is intending to become pgtable/segdesc. - */ - cpumask_t *mask = this_cpu(scratch_cpumask); - - BUG_ON(in_irq()); - cpumask_copy(mask, d->dirty_cpumask); - - /* Don't flush if the timestamp is old enough */ - tlbflush_filter(mask, page->tlbflush_timestamp); - - if ( unlikely(!cpumask_empty(mask)) && - /* Shadow mode: track only writable pages. */ - (!shadow_mode_enabled(d) || - ((nx & PGT_type_mask) == PGT_writable_page)) ) - { - perfc_incr(need_flush_tlb_flush); - /* - * If page was a page table make sure the flush is - * performed using an IPI in order to avoid changing the - * type of a page table page under the feet of - * spurious_page_fault(). - */ - flush_mask(mask, - (x & PGT_type_mask) && - (x & PGT_type_mask) <= PGT_root_page_table - ? FLUSH_TLB | FLUSH_FORCE_IPI - : FLUSH_TLB); - } - - /* We lose existing type and validity. */ nx &= ~(PGT_type_mask | PGT_validated); nx |= type; - - /* - * No special validation needed for writable pages. - * Page tables and GDT/LDT need to be scanned for validity. - */ - if ( type == PGT_writable_page || type == PGT_shared_page ) - nx |= PGT_validated; } } else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) ) @@ -3064,6 +3020,56 @@ static int _get_page_type(struct page_in return -EINTR; } + /* + * One typeref has been taken and is now globally visible. + * + * The page is either in the "validate locked" state (PGT_[type] | 1) or + * fully validated (PGT_[type] | PGT_validated | >0). + */ + + if ( unlikely((x & PGT_count_mask) == 0) ) + { + struct domain *d = page_get_owner(page); + + if ( d && shadow_mode_enabled(d) ) + shadow_prepare_page_type_change(d, page, type); + + if ( (x & PGT_type_mask) != type ) + { + /* + * On type change we check to flush stale TLB entries. It is + * vital that no other CPUs are left with writeable mappings + * to a frame which is intending to become pgtable/segdesc. 
+ */ + cpumask_t *mask = this_cpu(scratch_cpumask); + + BUG_ON(in_irq()); + cpumask_copy(mask, d->dirty_cpumask); + + /* Don't flush if the timestamp is old enough */ + tlbflush_filter(mask, page->tlbflush_timestamp); + + if ( unlikely(!cpumask_empty(mask)) && + /* Shadow mode: track only writable pages. */ + (!shadow_mode_enabled(d) || + ((nx & PGT_type_mask) == PGT_writable_page)) ) + { + perfc_incr(need_flush_tlb_flush); + /* + * If page was a page table make sure the flush is + * performed using an IPI in order to avoid changing the + * type of a page table page under the feet of + * spurious_page_fault(). + */ + flush_mask(mask, + (x & PGT_type_mask) && + (x & PGT_type_mask) <= PGT_root_page_table + ? FLUSH_TLB | FLUSH_FORCE_IPI + : FLUSH_TLB); + } + } + } + if ( unlikely(((x & PGT_type_mask) == PGT_writable_page) != (type == PGT_writable_page)) ) { @@ -3092,13 +3098,25 @@ static int _get_page_type(struct page_in if ( unlikely(!(nx & PGT_validated)) ) { - if ( !(x & PGT_partial) ) + /* + * No special validation needed for writable or shared pages. Page + * tables and GDT/LDT need to have their contents audited. + * + * per validate_page(), non-atomic updates are fine here. + */ + if ( type == PGT_writable_page || type == PGT_shared_page ) + page->u.inuse.type_info |= PGT_validated; + else { - page->nr_validated_ptes = 0; - page->partial_flags = 0; - page->linear_pt_count = 0; + if ( !(x & PGT_partial) ) + { + page->nr_validated_ptes = 0; + page->partial_flags = 0; + page->linear_pt_count = 0; + } + + rc = validate_page(page, type, preemptible); } - rc = validate_page(page, type, preemptible); } out: ++++++ 62a1e5d2-x86-introduce-_PAGE_-for-mem-types.patch ++++++ # Commit 1be8707c75bf4ba68447c74e1618b521dd432499 # Date 2022-06-09 14:21:38 +0200 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Jan Beulich <jbeulich@suse.com> x86/page: Introduce _PAGE_* constants for memory types ... rather than opencoding the PAT/PCD/PWT attributes in __PAGE_HYPERVISOR_* constants. These are going to be needed by forthcoming logic. No functional change. This is part of XSA-402. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/include/asm-x86/page.h +++ b/xen/include/asm-x86/page.h @@ -331,6 +331,14 @@ void efi_update_l4_pgtable(unsigned int #define PAGE_CACHE_ATTRS (_PAGE_PAT | _PAGE_PCD | _PAGE_PWT) +/* Memory types, encoded under Xen's choice of MSR_PAT. */ +#define _PAGE_WB ( 0) +#define _PAGE_WT ( _PAGE_PWT) +#define _PAGE_UCM ( _PAGE_PCD ) +#define _PAGE_UC ( _PAGE_PCD | _PAGE_PWT) +#define _PAGE_WC (_PAGE_PAT ) +#define _PAGE_WP (_PAGE_PAT | _PAGE_PWT) + /* * Debug option: Ensure that granted mappings are not implicitly unmapped. 
* WARNING: This will need to be disabled to run OSes that use the spare PTE @@ -349,8 +357,8 @@ void efi_update_l4_pgtable(unsigned int #define __PAGE_HYPERVISOR_RX (_PAGE_PRESENT | _PAGE_ACCESSED) #define __PAGE_HYPERVISOR (__PAGE_HYPERVISOR_RX | \ _PAGE_DIRTY | _PAGE_RW) -#define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_PCD) -#define __PAGE_HYPERVISOR_UC (__PAGE_HYPERVISOR | _PAGE_PCD | _PAGE_PWT) +#define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_UCM) +#define __PAGE_HYPERVISOR_UC (__PAGE_HYPERVISOR | _PAGE_UC) #define __PAGE_HYPERVISOR_SHSTK (__PAGE_HYPERVISOR_RO | _PAGE_DIRTY) #define MAP_SMALL_PAGES _PAGE_AVAIL0 /* don't use superpages mappings */ ++++++ 62a1e5f0-x86-dont-change-cacheability-of-directmap.patch ++++++ # Commit ae09597da34aee6bc5b76475c5eea6994457e854 # Date 2022-06-09 14:22:08 +0200 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Jan Beulich <jbeulich@suse.com> x86: Don't change the cacheability of the directmap Changeset 55f97f49b7ce ("x86: Change cache attributes of Xen 1:1 page mappings in response to guest mapping requests") attempted to keep the cacheability consistent between different mappings of the same page. The reason wasn't described in the changelog, but it is understood to be in regards to a concern over machine check exceptions, owing to errata when using mixed cacheabilities. It did this primarily by updating Xen's mapping of the page in the direct map when the guest mapped a page with reduced cacheability. Unfortunately, the logic didn't actually prevent mixed cacheability from occurring: * A guest could map a page normally, and then map the same page with different cacheability; nothing prevented this. * The cacheability of the directmap was always latest-takes-precedence in terms of guest requests. * Grant-mapped frames with lesser cacheability didn't adjust the page's cacheattr settings. * The map_domain_page() function still unconditionally created WB mappings, irrespective of the page's cacheattr settings. Additionally, update_xen_mappings() had a bug where the alias calculation was wrong for mfn's which were .init content, which should have been treated as fully guest pages, not Xen pages. Worse yet, the logic introduced a vulnerability whereby necessary pagetable/segdesc adjustments made by Xen in the validation logic could become non-coherent between the cache and main memory. The CPU could subsequently operate on the stale value in the cache, rather than the safe value in main memory. The directmap contains primarily mappings of RAM. PAT/MTRR conflict resolution is asymmetric, and generally for MTRR=WB ranges, PAT of lesser cacheability resolves to being coherent. The special case is WC mappings, which are non-coherent against MTRR=WB regions (except for fully-coherent CPUs). Xen must not have any WC cacheability in the directmap, to prevent Xen's actions from creating non-coherency. (Guest actions creating non-coherency is dealt with in subsequent patches.) As all memory types for MTRR=WB ranges inter-operate coherently, so leave Xen's directmap mappings as WB. Only PV guests with access to devices can use reduced-cacheability mappings to begin with, and they're trusted not to mount DoSs against the system anyway. Drop PGC_cacheattr_{base,mask} entirely, and the logic to manipulate them. Shift the later PGC_* constants up, to gain 3 extra bits in the main reference count. Retain the check in get_page_from_l1e() for special_pages() because a guest has no business using reduced cacheability on these. 
This reverts changeset 55f97f49b7ce6c3520c555d19caac6cf3f9a5df0 This is CVE-2022-26363, part of XSA-402. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -783,28 +783,6 @@ bool is_iomem_page(mfn_t mfn) return (page_get_owner(page) == dom_io); } -static int update_xen_mappings(unsigned long mfn, unsigned int cacheattr) -{ - int err = 0; - bool alias = mfn >= PFN_DOWN(xen_phys_start) && - mfn < PFN_UP(xen_phys_start + xen_virt_end - XEN_VIRT_START); - unsigned long xen_va = - XEN_VIRT_START + ((mfn - PFN_DOWN(xen_phys_start)) << PAGE_SHIFT); - - if ( boot_cpu_has(X86_FEATURE_XEN_SELFSNOOP) ) - return 0; - - if ( unlikely(alias) && cacheattr ) - err = map_pages_to_xen(xen_va, _mfn(mfn), 1, 0); - if ( !err ) - err = map_pages_to_xen((unsigned long)mfn_to_virt(mfn), _mfn(mfn), 1, - PAGE_HYPERVISOR | cacheattr_to_pte_flags(cacheattr)); - if ( unlikely(alias) && !cacheattr && !err ) - err = map_pages_to_xen(xen_va, _mfn(mfn), 1, PAGE_HYPERVISOR); - - return err; -} - #ifndef NDEBUG struct mmio_emul_range_ctxt { const struct domain *d; @@ -1009,47 +987,14 @@ get_page_from_l1e( goto could_not_pin; } - if ( pte_flags_to_cacheattr(l1f) != - ((page->count_info & PGC_cacheattr_mask) >> PGC_cacheattr_base) ) + if ( (l1f & PAGE_CACHE_ATTRS) != _PAGE_WB && is_special_page(page) ) { - unsigned long x, nx, y = page->count_info; - unsigned long cacheattr = pte_flags_to_cacheattr(l1f); - int err; - - if ( is_special_page(page) ) - { - if ( write ) - put_page_type(page); - put_page(page); - gdprintk(XENLOG_WARNING, - "Attempt to change cache attributes of Xen heap page\n"); - return -EACCES; - } - - do { - x = y; - nx = (x & ~PGC_cacheattr_mask) | (cacheattr << PGC_cacheattr_base); - } while ( (y = cmpxchg(&page->count_info, x, nx)) != x ); - - err = update_xen_mappings(mfn, cacheattr); - if ( unlikely(err) ) - { - cacheattr = y & PGC_cacheattr_mask; - do { - x = y; - nx = (x & ~PGC_cacheattr_mask) | cacheattr; - } while ( (y = cmpxchg(&page->count_info, x, nx)) != x ); - - if ( write ) - put_page_type(page); - put_page(page); - - gdprintk(XENLOG_WARNING, "Error updating mappings for mfn %" PRI_mfn - " (pfn %" PRI_pfn ", from L1 entry %" PRIpte ") for d%d\n", - mfn, get_gpfn_from_mfn(mfn), - l1e_get_intpte(l1e), l1e_owner->domain_id); - return err; - } + if ( write ) + put_page_type(page); + put_page(page); + gdprintk(XENLOG_WARNING, + "Attempt to change cache attributes of Xen heap page\n"); + return -EACCES; } return 0; @@ -2467,25 +2412,10 @@ static int mod_l4_entry(l4_pgentry_t *pl */ static int cleanup_page_mappings(struct page_info *page) { - unsigned int cacheattr = - (page->count_info & PGC_cacheattr_mask) >> PGC_cacheattr_base; int rc = 0; unsigned long mfn = mfn_x(page_to_mfn(page)); /* - * If we've modified xen mappings as a result of guest cache - * attributes, restore them to the "normal" state. - */ - if ( unlikely(cacheattr) ) - { - page->count_info &= ~PGC_cacheattr_mask; - - BUG_ON(is_special_page(page)); - - rc = update_xen_mappings(mfn, 0); - } - - /* * If this may be in a PV domain's IOMMU, remove it. * * NB that writable xenheap pages have their type set and cleared by --- a/xen/include/asm-x86/mm.h +++ b/xen/include/asm-x86/mm.h @@ -69,25 +69,22 @@ /* Set when is using a page as a page table */ #define _PGC_page_table PG_shift(3) #define PGC_page_table PG_mask(1, 3) - /* 3-bit PAT/PCD/PWT cache-attribute hint. 
*/ -#define PGC_cacheattr_base PG_shift(6) -#define PGC_cacheattr_mask PG_mask(7, 6) /* Page is broken? */ -#define _PGC_broken PG_shift(7) -#define PGC_broken PG_mask(1, 7) +#define _PGC_broken PG_shift(4) +#define PGC_broken PG_mask(1, 4) /* Mutually-exclusive page states: { inuse, offlining, offlined, free }. */ -#define PGC_state PG_mask(3, 9) -#define PGC_state_inuse PG_mask(0, 9) -#define PGC_state_offlining PG_mask(1, 9) -#define PGC_state_offlined PG_mask(2, 9) -#define PGC_state_free PG_mask(3, 9) +#define PGC_state PG_mask(3, 6) +#define PGC_state_inuse PG_mask(0, 6) +#define PGC_state_offlining PG_mask(1, 6) +#define PGC_state_offlined PG_mask(2, 6) +#define PGC_state_free PG_mask(3, 6) #define page_state_is(pg, st) (((pg)->count_info&PGC_state) == PGC_state_##st) /* Page is not reference counted (see below for caveats) */ -#define _PGC_extra PG_shift(10) -#define PGC_extra PG_mask(1, 10) +#define _PGC_extra PG_shift(7) +#define PGC_extra PG_mask(1, 7) /* Count of references to this frame. */ -#define PGC_count_width PG_shift(10) +#define PGC_count_width PG_shift(7) #define PGC_count_mask ((1UL<<PGC_count_width)-1) /* ++++++ 62a1e60e-x86-split-cache_flush-out-of-cache_writeback.patch ++++++ # Commit 9a67ffee3371506e1cbfdfff5b90658d4828f6a2 # Date 2022-06-09 14:22:38 +0200 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Jan Beulich <jbeulich@suse.com> x86: Split cache_flush() out of cache_writeback() Subsequent changes will want a fully flushing version. Use the new helper rather than opencoding it in flush_area_local(). This resolves an outstanding issue where the conditional sfence is on the wrong side of the clflushopt loop. clflushopt is ordered with respect to older stores, not to younger stores. Rename gnttab_cache_flush()'s helper to avoid colliding in name. grant_table.c can see the prototype from cache.h so the build fails otherwise. This is part of XSA-402. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Xen 4.16 and earlier: * Also backport half of c/s 3330013e67396 "VT-d / x86: re-arrange cache syncing" to split cache_writeback() out of the IOMMU logic, but without the associated hooks changes. --- a/xen/arch/x86/flushtlb.c +++ b/xen/arch/x86/flushtlb.c @@ -234,7 +234,7 @@ unsigned int flush_area_local(const void if ( flags & FLUSH_CACHE ) { const struct cpuinfo_x86 *c = ¤t_cpu_data; - unsigned long i, sz = 0; + unsigned long sz = 0; if ( order < (BITS_PER_LONG - PAGE_SHIFT) ) sz = 1UL << (order + PAGE_SHIFT); @@ -244,13 +244,7 @@ unsigned int flush_area_local(const void c->x86_clflush_size && c->x86_cache_size && sz && ((sz >> 10) < c->x86_cache_size) ) { - alternative("", "sfence", X86_FEATURE_CLFLUSHOPT); - for ( i = 0; i < sz; i += c->x86_clflush_size ) - alternative_input(".byte " __stringify(NOP_DS_PREFIX) ";" - " clflush %0", - "data16 clflush %0", /* clflushopt */ - X86_FEATURE_CLFLUSHOPT, - "m" (((const char *)va)[i])); + cache_flush(va, sz); flags &= ~FLUSH_CACHE; } else @@ -265,6 +259,80 @@ unsigned int flush_area_local(const void return flags; } +void cache_flush(const void *addr, unsigned int size) +{ + /* + * This function may be called before current_cpu_data is established. + * Hence a fallback is needed to prevent the loop below becoming infinite. 
+ */ + unsigned int clflush_size = current_cpu_data.x86_clflush_size ?: 16; + const void *end = addr + size; + + addr -= (unsigned long)addr & (clflush_size - 1); + for ( ; addr < end; addr += clflush_size ) + { + /* + * Note regarding the "ds" prefix use: it's faster to do a clflush + * + prefix than a clflush + nop, and hence the prefix is added instead + * of letting the alternative framework fill the gap by appending nops. + */ + alternative_io("ds; clflush %[p]", + "data16 clflush %[p]", /* clflushopt */ + X86_FEATURE_CLFLUSHOPT, + /* no outputs */, + [p] "m" (*(const char *)(addr))); + } + + alternative("", "sfence", X86_FEATURE_CLFLUSHOPT); +} + +void cache_writeback(const void *addr, unsigned int size) +{ + unsigned int clflush_size; + const void *end = addr + size; + + /* Fall back to CLFLUSH{,OPT} when CLWB isn't available. */ + if ( !boot_cpu_has(X86_FEATURE_CLWB) ) + return cache_flush(addr, size); + + /* + * This function may be called before current_cpu_data is established. + * Hence a fallback is needed to prevent the loop below becoming infinite. + */ + clflush_size = current_cpu_data.x86_clflush_size ?: 16; + addr -= (unsigned long)addr & (clflush_size - 1); + for ( ; addr < end; addr += clflush_size ) + { +/* + * The arguments to a macro must not include preprocessor directives. Doing so + * results in undefined behavior, so we have to create some defines here in + * order to avoid it. + */ +#if defined(HAVE_AS_CLWB) +# define CLWB_ENCODING "clwb %[p]" +#elif defined(HAVE_AS_XSAVEOPT) +# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */ +#else +# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */ +#endif + +#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr)) +#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT) +# define INPUT BASE_INPUT +#else +# define INPUT(addr) "a" (addr), BASE_INPUT(addr) +#endif + + asm volatile (CLWB_ENCODING :: INPUT(addr)); + +#undef INPUT +#undef BASE_INPUT +#undef CLWB_ENCODING + } + + asm volatile ("sfence" ::: "memory"); +} + unsigned int guest_flush_tlb_flags(const struct domain *d) { bool shadow = paging_mode_shadow(d); --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -3431,7 +3431,7 @@ gnttab_swap_grant_ref(XEN_GUEST_HANDLE_P return 0; } -static int cache_flush(const gnttab_cache_flush_t *cflush, grant_ref_t *cur_ref) +static int _cache_flush(const gnttab_cache_flush_t *cflush, grant_ref_t *cur_ref) { struct domain *d, *owner; struct page_info *page; @@ -3525,7 +3525,7 @@ gnttab_cache_flush(XEN_GUEST_HANDLE_PARA return -EFAULT; for ( ; ; ) { - int ret = cache_flush(&op, cur_ref); + int ret = _cache_flush(&op, cur_ref); if ( ret < 0 ) return ret; --- a/xen/drivers/passthrough/vtd/extern.h +++ b/xen/drivers/passthrough/vtd/extern.h @@ -76,7 +76,6 @@ int __must_check qinval_device_iotlb_syn struct pci_dev *pdev, u16 did, u16 size, u64 addr); -unsigned int get_cache_line_size(void); void flush_all_cache(void); uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node); --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -31,6 +31,7 @@ #include <xen/pci.h> #include <xen/pci_regs.h> #include <xen/keyhandler.h> +#include <asm/cache.h> #include <asm/msi.h> #include <asm/nops.h> #include <asm/irq.h> @@ -206,54 +207,6 @@ static void check_cleanup_domid_map(cons } } -static void sync_cache(const void *addr, unsigned int size) -{ - static unsigned long clflush_size = 0; - const void *end = addr + size; - - if ( clflush_size == 0 ) - clflush_size = 
get_cache_line_size(); - - addr -= (unsigned long)addr & (clflush_size - 1); - for ( ; addr < end; addr += clflush_size ) -/* - * The arguments to a macro must not include preprocessor directives. Doing so - * results in undefined behavior, so we have to create some defines here in - * order to avoid it. - */ -#if defined(HAVE_AS_CLWB) -# define CLWB_ENCODING "clwb %[p]" -#elif defined(HAVE_AS_XSAVEOPT) -# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */ -#else -# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */ -#endif - -#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr)) -#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT) -# define INPUT BASE_INPUT -#else -# define INPUT(addr) "a" (addr), BASE_INPUT(addr) -#endif - /* - * Note regarding the use of NOP_DS_PREFIX: it's faster to do a clflush - * + prefix than a clflush + nop, and hence the prefix is added instead - * of letting the alternative framework fill the gap by appending nops. - */ - alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %[p]", - "data16 clflush %[p]", /* clflushopt */ - X86_FEATURE_CLFLUSHOPT, - CLWB_ENCODING, - X86_FEATURE_CLWB, /* no outputs */, - INPUT(addr)); -#undef INPUT -#undef BASE_INPUT -#undef CLWB_ENCODING - - alternative_2("", "sfence", X86_FEATURE_CLFLUSHOPT, - "sfence", X86_FEATURE_CLWB); -} - /* Allocate page table, return its machine address */ uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node) { @@ -273,7 +226,7 @@ uint64_t alloc_pgtable_maddr(unsigned lo clear_page(vaddr); if ( (iommu_ops.init ? &iommu_ops : &vtd_ops)->sync_cache ) - sync_cache(vaddr, PAGE_SIZE); + cache_writeback(vaddr, PAGE_SIZE); unmap_domain_page(vaddr); cur_pg++; } @@ -1305,7 +1258,7 @@ int __init iommu_alloc(struct acpi_drhd_ iommu->nr_pt_levels = agaw_to_level(agaw); if ( !ecap_coherent(iommu->ecap) ) - vtd_ops.sync_cache = sync_cache; + vtd_ops.sync_cache = cache_writeback; /* allocate domain id bitmap */ iommu->domid_bitmap = xzalloc_array(unsigned long, BITS_TO_LONGS(nr_dom)); --- a/xen/drivers/passthrough/vtd/x86/vtd.c +++ b/xen/drivers/passthrough/vtd/x86/vtd.c @@ -47,11 +47,6 @@ void unmap_vtd_domain_page(const void *v unmap_domain_page(va); } -unsigned int get_cache_line_size(void) -{ - return ((cpuid_ebx(1) >> 8) & 0xff) * 8; -} - void flush_all_cache() { wbinvd(); --- a/xen/include/asm-x86/cache.h +++ b/xen/include/asm-x86/cache.h @@ -11,4 +11,11 @@ #define __read_mostly __section(".data.read_mostly") +#ifndef __ASSEMBLY__ + +void cache_flush(const void *addr, unsigned int size); +void cache_writeback(const void *addr, unsigned int size); + +#endif + #endif ++++++ 62a1e62b-x86-AMD-work-around-CLFLUSH-ordering.patch ++++++ # Commit 062868a5a8b428b85db589fa9a6d6e43969ffeb9 # Date 2022-06-09 14:23:07 +0200 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Jan Beulich <jbeulich@suse.com> x86/amd: Work around CLFLUSH ordering on older parts On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakely ordered with everything, including reads and writes to the address, and LFENCE/SFENCE instructions. This creates a multitude of problematic corner cases, laid out in the manual. Arrange to use MFENCE on both sides of the CLFLUSH to force proper ordering. This is part of XSA-402. 
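Why the bracketing matters can be shown with a short, self-contained sketch (an illustration assuming x86-64 with GCC/Clang inline asm, not the patch's code). Without the fences, on the affected parts the CLFLUSH may be reordered before the payload store or after the doorbell write, so a non-coherent observer can see stale memory:

#include <stdint.h>

static inline void clflush(volatile void *p)
{
    asm volatile ( "clflush %0" : "+m" (*(volatile char *)p) );
}

static void publish_to_noncoherent_device(volatile uint32_t *buf,
                                          volatile uint32_t *doorbell,
                                          uint32_t val)
{
    *buf = val;                              /* 1. write the payload                 */
    asm volatile ( "mfence" ::: "memory" );  /* order the store ahead of the CLFLUSH */
    clflush(buf);                            /* 2. evict the line to memory          */
    asm volatile ( "mfence" ::: "memory" );  /* make the CLFLUSH complete first      */
    *doorbell = 1;                           /* 3. tell the device to look           */
}
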
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/cpu/amd.c +++ b/xen/arch/x86/cpu/amd.c @@ -812,6 +812,14 @@ static void init_amd(struct cpuinfo_x86 if (!cpu_has_lfence_dispatch) __set_bit(X86_FEATURE_MFENCE_RDTSC, c->x86_capability); + /* + * On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with + * everything, including reads and writes to address, and + * LFENCE/SFENCE instructions. + */ + if (!cpu_has_clflushopt) + setup_force_cpu_cap(X86_BUG_CLFLUSH_MFENCE); + switch(c->x86) { case 0xf ... 0x11: --- a/xen/arch/x86/flushtlb.c +++ b/xen/arch/x86/flushtlb.c @@ -259,6 +259,13 @@ unsigned int flush_area_local(const void return flags; } +/* + * On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with everything, + * including reads and writes to address, and LFENCE/SFENCE instructions. + * + * This function only works safely after alternatives have run. Luckily, at + * the time of writing, we don't flush the caches that early. + */ void cache_flush(const void *addr, unsigned int size) { /* @@ -268,6 +275,8 @@ void cache_flush(const void *addr, unsig unsigned int clflush_size = current_cpu_data.x86_clflush_size ?: 16; const void *end = addr + size; + alternative("", "mfence", X86_BUG_CLFLUSH_MFENCE); + addr -= (unsigned long)addr & (clflush_size - 1); for ( ; addr < end; addr += clflush_size ) { @@ -283,7 +292,9 @@ void cache_flush(const void *addr, unsig [p] "m" (*(const char *)(addr))); } - alternative("", "sfence", X86_FEATURE_CLFLUSHOPT); + alternative_2("", + "sfence", X86_FEATURE_CLFLUSHOPT, + "mfence", X86_BUG_CLFLUSH_MFENCE); } void cache_writeback(const void *addr, unsigned int size) --- a/xen/include/asm-x86/cpufeatures.h +++ b/xen/include/asm-x86/cpufeatures.h @@ -47,6 +47,7 @@ XEN_CPUFEATURE(XEN_IBT, X86_SY #define X86_BUG_FPU_PTRS X86_BUG( 0) /* (F)X{SAVE,RSTOR} doesn't save/restore FOP/FIP/FDP. */ #define X86_BUG_NULL_SEG X86_BUG( 1) /* NULL-ing a selector preserves the base and limit. */ +#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */ /* Total number of capability words, inc synth and bug words. */ #define NCAPINTS (FSCAPINTS + X86_NR_SYNTH + X86_NR_BUG) /* N 32-bit words worth of info */ ++++++ 62a1e649-x86-track-and-flush-non-coherent.patch ++++++ # Commit c1c9cae3a9633054b177c5de21ad7268162b2f2c # Date 2022-06-09 14:23:37 +0200 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Jan Beulich <jbeulich@suse.com> x86/pv: Track and flush non-coherent mappings of RAM There are legitimate uses of WC mappings of RAM, e.g. for DMA buffers with devices that make non-coherent writes. The Linux sound subsystem makes extensive use of this technique. For such usecases, the guest's DMA buffer is mapped and consistently used as WC, and Xen doesn't interact with the buffer. However, a mischevious guest can use WC mappings to deliberately create non-coherency between the cache and RAM, and use this to trick Xen into validating a pagetable which isn't actually safe. Allocate a new PGT_non_coherent to track the non-coherency of mappings. Set it whenever a non-coherent writeable mapping is created. If the page is used as anything other than PGT_writable_page, force a cache flush before validation. Also force a cache flush before the page is returned to the heap. This is CVE-2022-26364, part of XSA-402. 
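The idea reduces to a small model, sketched below. The field and function names are invented for illustration; only cache_flush() corresponds to the helper split out by 62a1e60e earlier in this series. A page that has ever been mapped writable with a non-coherent type is flushed back to coherency before its contents are trusted:

#include <stdbool.h>

/* Matches the helper introduced by 62a1e60e above. */
void cache_flush(const void *addr, unsigned int size);

struct page {
    bool non_coherent;   /* stands in for the new PGT_non_coherent bit */
    void *va;            /* the page's mapping for Xen's own accesses  */
};

/* Called when a writable, non-coherent (e.g. WC) mapping of pg is created. */
static void note_noncoherent_writable_mapping(struct page *pg)
{
    pg->non_coherent = true;
}

/* Called before the page's contents are audited as a pagetable/segdesc,
 * and likewise before the page is handed back to the heap. */
static void make_coherent_before_validation(struct page *pg)
{
    if ( pg->non_coherent )
    {
        cache_flush(pg->va, 4096);   /* bring RAM back in sync with the cache */
        pg->non_coherent = false;
    }
}
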
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -997,6 +997,15 @@ get_page_from_l1e( return -EACCES; } + /* + * Track writeable non-coherent mappings to RAM pages, to trigger a cache + * flush later if the target is used as anything but a PGT_writeable page. + * We care about all writeable mappings, including foreign mappings. + */ + if ( !boot_cpu_has(X86_FEATURE_XEN_SELFSNOOP) && + (l1f & (PAGE_CACHE_ATTRS | _PAGE_RW)) == (_PAGE_WC | _PAGE_RW) ) + set_bit(_PGT_non_coherent, &page->u.inuse.type_info); + return 0; could_not_pin: @@ -2454,6 +2463,19 @@ static int cleanup_page_mappings(struct } } + /* + * Flush the cache if there were previously non-coherent writeable + * mappings of this page. This forces the page to be coherent before it + * is freed back to the heap. + */ + if ( __test_and_clear_bit(_PGT_non_coherent, &page->u.inuse.type_info) ) + { + void *addr = __map_domain_page(page); + + cache_flush(addr, PAGE_SIZE); + unmap_domain_page(addr); + } + return rc; } @@ -3029,6 +3051,22 @@ static int _get_page_type(struct page_in if ( unlikely(!(nx & PGT_validated)) ) { /* + * Flush the cache if there were previously non-coherent mappings of + * this page, and we're trying to use it as anything other than a + * writeable page. This forces the page to be coherent before we + * validate its contents for safety. + */ + if ( (nx & PGT_non_coherent) && type != PGT_writable_page ) + { + void *addr = __map_domain_page(page); + + cache_flush(addr, PAGE_SIZE); + unmap_domain_page(addr); + + page->u.inuse.type_info &= ~PGT_non_coherent; + } + + /* * No special validation needed for writable or shared pages. Page * tables and GDT/LDT need to have their contents audited. * --- a/xen/arch/x86/pv/grant_table.c +++ b/xen/arch/x86/pv/grant_table.c @@ -109,7 +109,17 @@ int create_grant_pv_mapping(uint64_t add ol1e = *pl1e; if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, curr, 0) ) + { + /* + * We always create mappings in this path. However, our caller, + * map_grant_ref(), only passes potentially non-zero cache_flags for + * MMIO frames, so this path doesn't create non-coherent mappings of + * RAM frames and there's no need to calculate PGT_non_coherent. + */ + ASSERT(!cache_flags || is_iomem_page(frame)); + rc = GNTST_okay; + } out_unlock: page_unlock(page); @@ -294,7 +304,18 @@ int replace_grant_pv_mapping(uint64_t ad l1e_get_flags(ol1e), addr, grant_pte_flags); if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, curr, 0) ) + { + /* + * Generally, replace_grant_pv_mapping() is used to destroy mappings + * (n1le = l1e_empty()), but it can be a present mapping on the + * GNTABOP_unmap_and_replace path. + * + * In such cases, the PTE is fully transplanted from its old location + * via steal_linear_addr(), so we need not perform PGT_non_coherent + * checking here. + */ rc = GNTST_okay; + } out_unlock: page_unlock(page); --- a/xen/include/asm-x86/mm.h +++ b/xen/include/asm-x86/mm.h @@ -53,8 +53,12 @@ #define _PGT_partial PG_shift(8) #define PGT_partial PG_mask(1, 8) +/* Has this page been mapped writeable with a non-coherent memory type? */ +#define _PGT_non_coherent PG_shift(9) +#define PGT_non_coherent PG_mask(1, 9) + /* Count of uses of this frame as its current type. 
*/ -#define PGT_count_width PG_shift(8) +#define PGT_count_width PG_shift(9) #define PGT_count_mask ((1UL<<PGT_count_width)-1) /* Are the 'type mask' bits identical? */ ++++++ 62a99614-IOMMU-x86-gcc12.patch ++++++ # Commit 80ad8db8a4d9bb24952f0aea788ce6f47566fa76 # Date 2022-06-15 10:19:32 +0200 # Author Jan Beulich <jbeulich@suse.com> # Committer Jan Beulich <jbeulich@suse.com> IOMMU/x86: work around bogus gcc12 warning in hvm_gsi_eoi() As per [1] the expansion of the pirq_dpci() macro causes a -Waddress controlled warning (enabled implicitly in our builds, if not by default) tying the middle part of the involved conditional expression to the surrounding boolean context. Work around this by introducing a local inline function in the affected source file. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monn�� <roger.pau@citrix.com> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102967 --- a/xen/drivers/passthrough/x86/hvm.c +++ b/xen/drivers/passthrough/x86/hvm.c @@ -25,6 +25,18 @@ #include <asm/hvm/support.h> #include <asm/io_apic.h> +/* + * Gcc12 takes issue with pirq_dpci() being used in boolean context (see gcc + * bug 102967). While we can't replace the macro definition in the header by an + * inline function, we can do so here. + */ +static inline struct hvm_pirq_dpci *_pirq_dpci(struct pirq *pirq) +{ + return pirq_dpci(pirq); +} +#undef pirq_dpci +#define pirq_dpci(pirq) _pirq_dpci(pirq) + static DEFINE_PER_CPU(struct list_head, dpci_list); /* ++++++ 62ab0fab-x86-spec-ctrl-VERW-flushing-runtime-cond.patch ++++++ # Commit e06b95c1d44ab80da255219fc9f1e2fc423edcb6 # Date 2022-06-16 12:10:37 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Make VERW flushing runtime conditional Currently, VERW flushing to mitigate MDS is boot time conditional per domain type. However, to provide mitigations for DRPW (CVE-2022-21166), we need to conditionally use VERW based on the trustworthiness of the guest, and the devices passed through. Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags. Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW disposition at domain creation time, and context switch the SCF_verw bit. For now, VERW flushing is used and controlled exactly as before, but later patches will add per-domain cases too. No change in behaviour. This is part of XSA-404. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monn�� <roger.pau@citrix.com> --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -2258,9 +2258,8 @@ in place for guests to use. Use of a positive boolean value for either of these options is invalid. The booleans `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` offer fine -grained control over the alternative blocks used by Xen. These impact Xen's -ability to protect itself, and Xen's ability to virtualise support for guests -to use. +grained control over the primitives by Xen. These impact Xen's ability to +protect itself, and Xen's ability to virtualise support for guests to use. * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests respectively. 
--- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -863,6 +863,8 @@ int arch_domain_create(struct domain *d, d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED; + spec_ctrl_init_domain(d); + return 0; fail: @@ -2017,14 +2019,15 @@ static void __context_switch(void) void context_switch(struct vcpu *prev, struct vcpu *next) { unsigned int cpu = smp_processor_id(); + struct cpu_info *info = get_cpu_info(); const struct domain *prevd = prev->domain, *nextd = next->domain; unsigned int dirty_cpu = read_atomic(&next->dirty_cpu); ASSERT(prev != next); ASSERT(local_irq_is_enabled()); - get_cpu_info()->use_pv_cr3 = false; - get_cpu_info()->xen_cr3 = 0; + info->use_pv_cr3 = false; + info->xen_cr3 = 0; if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN ) { @@ -2088,6 +2091,11 @@ void context_switch(struct vcpu *prev, s *last_id = next_id; } } + + /* Update the top-of-stack block with the VERW disposition. */ + info->spec_ctrl_flags &= ~SCF_verw; + if ( nextd->arch.verw ) + info->spec_ctrl_flags |= SCF_verw; } sched_context_switched(prev, next); --- a/xen/arch/x86/hvm/vmx/entry.S +++ b/xen/arch/x86/hvm/vmx/entry.S @@ -87,7 +87,7 @@ UNLIKELY_END(realmode) /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */ /* SPEC_CTRL_EXIT_TO_VMX Req: %rsp=regs/cpuinfo Clob: */ - ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM + DO_SPEC_CTRL_COND_VERW mov VCPU_hvm_guest_cr2(%rbx),%rax --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -36,8 +36,8 @@ static bool __initdata opt_msr_sc_pv = t static bool __initdata opt_msr_sc_hvm = true; static int8_t __initdata opt_rsb_pv = -1; static bool __initdata opt_rsb_hvm = true; -static int8_t __initdata opt_md_clear_pv = -1; -static int8_t __initdata opt_md_clear_hvm = -1; +static int8_t __read_mostly opt_md_clear_pv = -1; +static int8_t __read_mostly opt_md_clear_hvm = -1; /* Cmdline controls for Xen's speculative settings. */ static enum ind_thunk { @@ -932,6 +932,13 @@ static __init void mds_calculations(uint } } +void spec_ctrl_init_domain(struct domain *d) +{ + bool pv = is_pv_domain(d); + + d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm; +} + void __init init_speculation_mitigations(void) { enum ind_thunk thunk = THUNK_DEFAULT; @@ -1196,21 +1203,20 @@ void __init init_speculation_mitigations boot_cpu_has(X86_FEATURE_MD_CLEAR)); /* - * Enable MDS defences as applicable. The PV blocks need using all the - * time, and the Idle blocks need using if either PV or HVM defences are - * used. + * Enable MDS defences as applicable. The Idle blocks need using if + * either PV or HVM defences are used. * * HVM is more complicated. The MD_CLEAR microcode extends L1D_FLUSH with - * equivelent semantics to avoid needing to perform both flushes on the - * HVM path. The HVM blocks don't need activating if our hypervisor told - * us it was handling L1D_FLUSH, or we are using L1D_FLUSH ourselves. + * equivalent semantics to avoid needing to perform both flushes on the + * HVM path. Therefore, we don't need VERW in addition to L1D_FLUSH. + * + * After calculating the appropriate idle setting, simplify + * opt_md_clear_hvm to mean just "should we VERW on the way into HVM + * guests", so spec_ctrl_init_domain() can calculate suitable settings. 
*/ - if ( opt_md_clear_pv ) - setup_force_cpu_cap(X86_FEATURE_SC_VERW_PV); if ( opt_md_clear_pv || opt_md_clear_hvm ) setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE); - if ( opt_md_clear_hvm && !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush ) - setup_force_cpu_cap(X86_FEATURE_SC_VERW_HVM); + opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush; /* * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT --- a/xen/include/asm-x86/cpufeatures.h +++ b/xen/include/asm-x86/cpufeatures.h @@ -35,8 +35,7 @@ XEN_CPUFEATURE(SC_RSB_HVM, X86_SY XEN_CPUFEATURE(XEN_SELFSNOOP, X86_SYNTH(20)) /* SELFSNOOP gets used by Xen itself */ XEN_CPUFEATURE(SC_MSR_IDLE, X86_SYNTH(21)) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */ XEN_CPUFEATURE(XEN_LBR, X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */ -XEN_CPUFEATURE(SC_VERW_PV, X86_SYNTH(23)) /* VERW used by Xen for PV */ -XEN_CPUFEATURE(SC_VERW_HVM, X86_SYNTH(24)) /* VERW used by Xen for HVM */ +/* Bits 23,24 unused. */ XEN_CPUFEATURE(SC_VERW_IDLE, X86_SYNTH(25)) /* VERW used by Xen for idle */ XEN_CPUFEATURE(XEN_SHSTK, X86_SYNTH(26)) /* Xen uses CET Shadow Stacks */ XEN_CPUFEATURE(XEN_IBT, X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */ --- a/xen/include/asm-x86/domain.h +++ b/xen/include/asm-x86/domain.h @@ -319,6 +319,9 @@ struct arch_domain uint32_t pci_cf8; uint8_t cmos_idx; + /* Use VERW on return-to-guest for its flushing side effect. */ + bool verw; + union { struct pv_domain pv; struct hvm_domain hvm; --- a/xen/include/asm-x86/spec_ctrl.h +++ b/xen/include/asm-x86/spec_ctrl.h @@ -24,6 +24,7 @@ #define SCF_use_shadow (1 << 0) #define SCF_ist_wrmsr (1 << 1) #define SCF_ist_rsb (1 << 2) +#define SCF_verw (1 << 3) #ifndef __ASSEMBLY__ @@ -32,6 +33,7 @@ #include <asm/msr-index.h> void init_speculation_mitigations(void); +void spec_ctrl_init_domain(struct domain *d); extern bool opt_ibpb; extern bool opt_ssbd; --- a/xen/include/asm-x86/spec_ctrl_asm.h +++ b/xen/include/asm-x86/spec_ctrl_asm.h @@ -136,6 +136,19 @@ #endif .endm +.macro DO_SPEC_CTRL_COND_VERW +/* + * Requires %rsp=cpuinfo + * + * Issue a VERW for its flushing side effect, if indicated. This is a Spectre + * v1 gadget, but the IRET/VMEntry is serialising. + */ + testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp) + jz .L\@_verw_skip + verw CPUINFO_verw_sel(%rsp) +.L\@_verw_skip: +.endm + .macro DO_SPEC_CTRL_ENTRY maybexen:req /* * Requires %rsp=regs (also cpuinfo if !maybexen) @@ -231,8 +244,7 @@ #define SPEC_CTRL_EXIT_TO_PV \ ALTERNATIVE "", \ DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV; \ - ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), \ - X86_FEATURE_SC_VERW_PV + DO_SPEC_CTRL_COND_VERW /* * Use in IST interrupt/exception context. May interrupt Xen or PV context. ++++++ 62ab0fac-x86-spec-ctrl-enum-for-MMIO-Stale-Data.patch ++++++ # Commit 2ebe8fe9b7e0d36e9ec3cfe4552b2b197ef0dcec # Date 2022-06-16 12:10:37 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Enumeration for MMIO Stale Data controls The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP data movement primitives. FB_CLEAR indicates that the VERW instruction has re-gained it's Fill Buffer flushing side effect. This is only enumerated on parts where VERW had previously lost it's flushing side effect due to the MDS/TAA vulnerabilities being fixed in hardware. 
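In C terms the enumeration amounts to a few extra MSR_ARCH_CAPABILITIES bits; the sketch below uses the bit positions added to msr-index.h further down in this patch, with helper names invented purely for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ARCH_CAPS_SBDR_SSDP_NO  (UINT64_C(1) << 13)
    #define ARCH_CAPS_FBSDP_NO      (UINT64_C(1) << 14)
    #define ARCH_CAPS_PSDP_NO       (UINT64_C(1) << 15)
    #define ARCH_CAPS_FB_CLEAR      (UINT64_C(1) << 17)
    #define ARCH_CAPS_FB_CLEAR_CTRL (UINT64_C(1) << 18)

    /* Not susceptible to any of the SSDP/FBSDP/PSDP data movement primitives? */
    static bool mmio_stale_data_immune(uint64_t caps)
    {
        uint64_t all = ARCH_CAPS_SBDR_SSDP_NO | ARCH_CAPS_FBSDP_NO |
                       ARCH_CAPS_PSDP_NO;

        return (caps & all) == all;
    }

    /* Has VERW regained its fill-buffer flushing side effect on this part? */
    static bool verw_flushes_fill_buffers(uint64_t caps)
    {
        return caps & ARCH_CAPS_FB_CLEAR;
    }

    int main(void)
    {
        uint64_t caps = ARCH_CAPS_FBSDP_NO | ARCH_CAPS_FB_CLEAR;

        printf("immune: %d, verw flushes: %d\n",
               mmio_stale_data_immune(caps), verw_flushes_fill_buffers(caps));
        return 0;
    }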
FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer clearing side effect of VERW can be turned off for performance reasons. This is part of XSA-404. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monn�� <roger.pau@citrix.com> --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -323,7 +323,7 @@ static void __init print_details(enum in * Hardware read-only information, stating immunity to certain issues, or * suggestions of which mitigation to use. */ - printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s\n", + printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n", (caps & ARCH_CAPS_RDCL_NO) ? " RDCL_NO" : "", (caps & ARCH_CAPS_IBRS_ALL) ? " IBRS_ALL" : "", (caps & ARCH_CAPS_RSBA) ? " RSBA" : "", @@ -332,13 +332,16 @@ static void __init print_details(enum in (caps & ARCH_CAPS_SSB_NO) ? " SSB_NO" : "", (caps & ARCH_CAPS_MDS_NO) ? " MDS_NO" : "", (caps & ARCH_CAPS_TAA_NO) ? " TAA_NO" : "", + (caps & ARCH_CAPS_SBDR_SSDP_NO) ? " SBDR_SSDP_NO" : "", + (caps & ARCH_CAPS_FBSDP_NO) ? " FBSDP_NO" : "", + (caps & ARCH_CAPS_PSDP_NO) ? " PSDP_NO" : "", (e8b & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS)) ? " IBRS_ALWAYS" : "", (e8b & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS)) ? " STIBP_ALWAYS" : "", (e8b & cpufeat_mask(X86_FEATURE_IBRS_FAST)) ? " IBRS_FAST" : "", (e8b & cpufeat_mask(X86_FEATURE_IBRS_SAME_MODE)) ? " IBRS_SAME_MODE" : ""); /* Hardware features which need driving to mitigate issues. */ - printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s\n", + printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s\n", (e8b & cpufeat_mask(X86_FEATURE_IBPB)) || (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB)) ? " IBPB" : "", (e8b & cpufeat_mask(X86_FEATURE_IBRS)) || @@ -353,7 +356,9 @@ static void __init print_details(enum in (_7d0 & cpufeat_mask(X86_FEATURE_MD_CLEAR)) ? " MD_CLEAR" : "", (_7d0 & cpufeat_mask(X86_FEATURE_SRBDS_CTRL)) ? " SRBDS_CTRL" : "", (e8b & cpufeat_mask(X86_FEATURE_VIRT_SSBD)) ? " VIRT_SSBD" : "", - (caps & ARCH_CAPS_TSX_CTRL) ? " TSX_CTRL" : ""); + (caps & ARCH_CAPS_TSX_CTRL) ? " TSX_CTRL" : "", + (caps & ARCH_CAPS_FB_CLEAR) ? " FB_CLEAR" : "", + (caps & ARCH_CAPS_FB_CLEAR_CTRL) ? " FB_CLEAR_CTRL" : ""); /* Compiled-in support which pertains to mitigations. 
*/ if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) ) --- a/xen/include/asm-x86/msr-index.h +++ b/xen/include/asm-x86/msr-index.h @@ -66,6 +66,11 @@ #define ARCH_CAPS_IF_PSCHANGE_MC_NO (_AC(1, ULL) << 6) #define ARCH_CAPS_TSX_CTRL (_AC(1, ULL) << 7) #define ARCH_CAPS_TAA_NO (_AC(1, ULL) << 8) +#define ARCH_CAPS_SBDR_SSDP_NO (_AC(1, ULL) << 13) +#define ARCH_CAPS_FBSDP_NO (_AC(1, ULL) << 14) +#define ARCH_CAPS_PSDP_NO (_AC(1, ULL) << 15) +#define ARCH_CAPS_FB_CLEAR (_AC(1, ULL) << 17) +#define ARCH_CAPS_FB_CLEAR_CTRL (_AC(1, ULL) << 18) #define MSR_FLUSH_CMD 0x0000010b #define FLUSH_CMD_L1D (_AC(1, ULL) << 0) @@ -83,6 +88,7 @@ #define MCU_OPT_CTRL_RNGDS_MITG_DIS (_AC(1, ULL) << 0) #define MCU_OPT_CTRL_RTM_ALLOW (_AC(1, ULL) << 1) #define MCU_OPT_CTRL_RTM_LOCKED (_AC(1, ULL) << 2) +#define MCU_OPT_CTRL_FB_CLEAR_DIS (_AC(1, ULL) << 3) #define MSR_RTIT_OUTPUT_BASE 0x00000560 #define MSR_RTIT_OUTPUT_MASK 0x00000561 ++++++ 62ab0fad-x86-spec-ctrl-add-unpriv-mmio.patch ++++++ # Commit 8c24b70fedcb52633b2370f834d8a2be3f7fa38e # Date 2022-06-16 12:10:37 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Add spec-ctrl=unpriv-mmio Per Xen's support statement, PCI passthrough should be to trusted domains because the overall system security depends on factors outside of Xen's control. As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR. However, users who have risk assessed their configuration may be happy with the risk of DoS, but unhappy with the risk of cross-domain data leakage. Such users should enable this option. On CPUs vulnerable to MDS, the existing mitigations are the best we can do to mitigate MMIO cross-domain data leakage. On CPUs fixed to MDS but vulnerable MMIO stale data leakage, this option: * On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage using FB_CLEAR. * On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the srb-lock, previously used to mitigate SRBDS. Both mitigations require microcode from IPU 2022.1, May 2022. This is part of XSA-404. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monn�� <roger.pau@citrix.com> --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -2235,7 +2235,7 @@ By default SSBD will be mitigated at run ### spec-ctrl (x86)
`= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>, bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu, -> l1d-flush,branch-harden,srb-lock}=<bool> ]` +> l1d-flush,branch-harden,srb-lock,unpriv-mmio}=<bool> ]`
Controls for speculative execution sidechannel mitigations. By default, Xen will pick the most appropriate mitigations based on compiled in support, @@ -2314,8 +2314,16 @@ Xen will enable this mitigation. On hardware supporting SRBDS_CTRL, the `srb-lock=` option can be used to force or prevent Xen from protect the Special Register Buffer from leaking stale data. By default, Xen will enable this mitigation, except on parts where MDS -is fixed and TAA is fixed/mitigated (in which case, there is believed to be no -way for an attacker to obtain the stale data). +is fixed and TAA is fixed/mitigated and there are no unprivileged MMIO +mappings (in which case, there is believed to be no way for an attacker to +obtain stale data). + +The `unpriv-mmio=` boolean indicates whether the system has (or will have) +less than fully privileged domains granted access to MMIO devices. By +default, this option is disabled. If enabled, Xen will use the `FB_CLEAR` +and/or `SRBDS_CTRL` functionality available in the Intel May 2022 microcode +release to mitigate cross-domain leakage of data via the MMIO Stale Data +vulnerabilities. ### sync_console
`= <boolean>` --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -67,6 +67,8 @@ static bool __initdata cpu_has_bug_msbds static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination. */
static int8_t __initdata opt_srb_lock = -1; +static bool __initdata opt_unpriv_mmio; +static bool __read_mostly opt_fb_clear_mmio; static int __init parse_spec_ctrl(const char *s) { @@ -184,6 +186,8 @@ static int __init parse_spec_ctrl(const opt_branch_harden = val; else if ( (val = parse_boolean("srb-lock", s, ss)) >= 0 ) opt_srb_lock = val; + else if ( (val = parse_boolean("unpriv-mmio", s, ss)) >= 0 ) + opt_unpriv_mmio = val; else rc = -EINVAL; @@ -392,7 +396,8 @@ static void __init print_details(enum in opt_srb_lock ? " SRB_LOCK+" : " SRB_LOCK-", opt_ibpb ? " IBPB" : "", opt_l1d_flush ? " L1D_FLUSH" : "", - opt_md_clear_pv || opt_md_clear_hvm ? " VERW" : "", + opt_md_clear_pv || opt_md_clear_hvm || + opt_fb_clear_mmio ? " VERW" : "", opt_branch_harden ? " BRANCH_HARDEN" : ""); /* L1TF diagnostics, printed if vulnerable or PV shadowing is in use. */ @@ -941,7 +946,9 @@ void spec_ctrl_init_domain(struct domain { bool pv = is_pv_domain(d); - d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm; + d->arch.verw = + (pv ? opt_md_clear_pv : opt_md_clear_hvm) || + (opt_fb_clear_mmio && is_iommu_enabled(d)); } void __init init_speculation_mitigations(void) @@ -1196,6 +1203,18 @@ void __init init_speculation_mitigations mds_calculations(caps); /* + * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have + * reintroduced the VERW fill buffer flushing side effect because of a + * susceptibility to FBSDP. + * + * If unprivileged guests have (or will have) MMIO mappings, we can + * mitigate cross-domain leakage of fill buffer data by issuing VERW on + * the return-to-guest path. + */ + if ( opt_unpriv_mmio ) + opt_fb_clear_mmio = caps & ARCH_CAPS_FB_CLEAR; + + /* * By default, enable PV and HVM mitigations on MDS-vulnerable hardware. * This will only be a token effort for MLPDS/MFBDS when HT is enabled, * but it is somewhat better than nothing. @@ -1208,18 +1227,20 @@ void __init init_speculation_mitigations boot_cpu_has(X86_FEATURE_MD_CLEAR)); /* - * Enable MDS defences as applicable. The Idle blocks need using if - * either PV or HVM defences are used. + * Enable MDS/MMIO defences as applicable. The Idle blocks need using if + * either the PV or HVM MDS defences are used, or if we may give MMIO + * access to untrusted guests. * * HVM is more complicated. The MD_CLEAR microcode extends L1D_FLUSH with * equivalent semantics to avoid needing to perform both flushes on the - * HVM path. Therefore, we don't need VERW in addition to L1D_FLUSH. + * HVM path. Therefore, we don't need VERW in addition to L1D_FLUSH (for + * MDS mitigations. L1D_FLUSH is not safe for MMIO mitigations.) * * After calculating the appropriate idle setting, simplify * opt_md_clear_hvm to mean just "should we VERW on the way into HVM * guests", so spec_ctrl_init_domain() can calculate suitable settings. */ - if ( opt_md_clear_pv || opt_md_clear_hvm ) + if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio ) setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE); opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush; @@ -1284,14 +1305,19 @@ void __init init_speculation_mitigations * On some SRBDS-affected hardware, it may be safe to relax srb-lock by * default. * - * On parts which enumerate MDS_NO and not TAA_NO, TSX is the only known - * way to access the Fill Buffer. If TSX isn't available (inc. SKU - * reasons on some models), or TSX is explicitly disabled, then there is - * no need for the extra overhead to protect RDRAND/RDSEED. 
+ * All parts with SRBDS_CTRL suffer SSDP, the mechanism by which stale RNG + * data becomes available to other contexts. To recover the data, an + * attacker needs to use: + * - SBDS (MDS or TAA to sample the cores fill buffer) + * - SBDR (Architecturally retrieve stale transaction buffer contents) + * - DRPW (Architecturally latch stale fill buffer data) + * + * On MDS_NO parts, and with TAA_NO or TSX unavailable/disabled, and there + * is no unprivileged MMIO access, the RNG data doesn't need protecting. */ if ( cpu_has_srbds_ctrl ) { - if ( opt_srb_lock == -1 && + if ( opt_srb_lock == -1 && !opt_unpriv_mmio && (caps & (ARCH_CAPS_MDS_NO|ARCH_CAPS_TAA_NO)) == ARCH_CAPS_MDS_NO && (!cpu_has_hle || ((caps & ARCH_CAPS_TSX_CTRL) && rtm_disabled)) ) opt_srb_lock = 0; ++++++ 62bdd840-x86-spec-ctrl-only-adjust-idle-with-legacy-IBRS.patch ++++++ # Commit ffc7694e0c99eea158c32aa164b7d1e1bb1dc46b # Date 2022-06-30 18:07:13 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Only adjust MSR_SPEC_CTRL for idle with legacy IBRS Back at the time of the original Spectre-v2 fixes, it was recommended to clear MSR_SPEC_CTRL when going idle. This is because of the side effects on the sibling thread caused by the microcode IBRS and STIBP implementations which were retrofitted to existing CPUs. However, there are no relevant cross-thread impacts for the hardware IBRS/STIBP implementations, so this logic should not be used on Intel CPUs supporting eIBRS, or any AMD CPUs; doing so only adds unnecessary latency to the idle path. Furthermore, there's no point playing with MSR_SPEC_CTRL in the idle paths if SMT is disabled for other reasons. Fixes: 8d03080d2a33 ("x86/spec-ctrl: Cease using thunk=lfence on AMD") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monn�� <roger.pau@citrix.com> --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -1150,8 +1150,14 @@ void __init init_speculation_mitigations /* (Re)init BSP state now that default_spec_ctrl_flags has been calculated. */ init_shadow_spec_ctrl_state(); - /* If Xen is using any MSR_SPEC_CTRL settings, adjust the idle path. */ - if ( default_xen_spec_ctrl ) + /* + * For microcoded IBRS only (i.e. Intel, pre eIBRS), it is recommended to + * clear MSR_SPEC_CTRL before going idle, to avoid impacting sibling + * threads. Activate this if SMT is enabled, and Xen is using a non-zero + * MSR_SPEC_CTRL setting. + */ + if ( boot_cpu_has(X86_FEATURE_IBRSB) && !(caps & ARCH_CAPS_IBRS_ALL) && + hw_smt_enabled && default_xen_spec_ctrl ) setup_force_cpu_cap(X86_FEATURE_SC_MSR_IDLE); xpti_init_default(caps); --- a/xen/include/asm-x86/cpufeatures.h +++ b/xen/include/asm-x86/cpufeatures.h @@ -33,7 +33,7 @@ XEN_CPUFEATURE(SC_MSR_HVM, X86_SY XEN_CPUFEATURE(SC_RSB_PV, X86_SYNTH(18)) /* RSB overwrite needed for PV */ XEN_CPUFEATURE(SC_RSB_HVM, X86_SYNTH(19)) /* RSB overwrite needed for HVM */ XEN_CPUFEATURE(XEN_SELFSNOOP, X86_SYNTH(20)) /* SELFSNOOP gets used by Xen itself */ -XEN_CPUFEATURE(SC_MSR_IDLE, X86_SYNTH(21)) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */ +XEN_CPUFEATURE(SC_MSR_IDLE, X86_SYNTH(21)) /* Clear MSR_SPEC_CTRL on idle */ XEN_CPUFEATURE(XEN_LBR, X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */ /* Bits 23,24 unused. 
*/ XEN_CPUFEATURE(SC_VERW_IDLE, X86_SYNTH(25)) /* VERW used by Xen for idle */ --- a/xen/include/asm-x86/spec_ctrl.h +++ b/xen/include/asm-x86/spec_ctrl.h @@ -78,7 +78,8 @@ static always_inline void spec_ctrl_ente uint32_t val = 0; /* - * Branch Target Injection: + * It is recommended in some cases to clear MSR_SPEC_CTRL when going idle, + * to avoid impacting sibling threads. * * Latch the new shadow value, then enable shadowing, then update the MSR. * There are no SMP issues here; only local processor ordering concerns. @@ -114,7 +115,7 @@ static always_inline void spec_ctrl_exit uint32_t val = info->xen_spec_ctrl; /* - * Branch Target Injection: + * Restore MSR_SPEC_CTRL on exit from idle. * * Disable shadowing before updating the MSR. There are no SMP issues * here; only local processor ordering concerns. ++++++ 62bdd841-x86-spec-ctrl-knobs-for-STIBP-and-PSFD.patch ++++++ # Commit fef244b179c06fcdfa581f7d57fa6e578c49ff50 # Date 2022-06-30 18:07:13 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Knobs for STIBP and PSFD, and follow hardware STIBP hint STIBP and PSFD are slightly weird bits, because they're both implied by other bits in MSR_SPEC_CTRL. Add fine grain controls for them, and take the implications into account when setting IBRS/SSBD. Rearrange the IBPB text/variables/logic to keep all the MSR_SPEC_CTRL bits together, for consistency. However, AMD have a hardware hint CPUID bit recommending that STIBP be set unilaterally. This is advertised on Zen3, so follow the recommendation. Furthermore, in such cases, set STIBP behind the guest's back for now. This has negligible overhead for the guest, but saves a WRMSR on vmentry. This is the only default change. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monn�� <roger.pau@citrix.com> --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -2234,8 +2234,9 @@ By default SSBD will be mitigated at run ### spec-ctrl (x86)
`= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>, -> bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu, -> l1d-flush,branch-harden,srb-lock,unpriv-mmio}=<bool> ]` +> bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd, +> eager-fpu,l1d-flush,branch-harden,srb-lock, +> unpriv-mmio}=<bool> ]`
Controls for speculative execution sidechannel mitigations. By default, Xen will pick the most appropriate mitigations based on compiled in support, @@ -2285,9 +2286,10 @@ On hardware supporting IBRS (Indirect Br If Xen is not using IBRS itself, functionality is still set up so IBRS can be virtualised for guests. -On hardware supporting IBPB (Indirect Branch Prediction Barrier), the `ibpb=` -option can be used to force (the default) or prevent Xen from issuing branch -prediction barriers on vcpu context switches. +On hardware supporting STIBP (Single Thread Indirect Branch Predictors), the +`stibp=` option can be used to force or prevent Xen using the feature itself. +By default, Xen will use STIBP when IBRS is in use (IBRS implies STIBP), and +when hardware hints recommend using it as a blanket setting. On hardware supporting SSBD (Speculative Store Bypass Disable), the `ssbd=` option can be used to force or prevent Xen using the feature itself. On AMD @@ -2295,6 +2297,15 @@ hardware, this is a global option applie guest use. On Intel hardware, the feature is virtualised for guests, independently of Xen's choice of setting. +On hardware supporting PSFD (Predictive Store Forwarding Disable), the `psfd=` +option can be used to force or prevent Xen using the feature itself. By +default, Xen will not use PSFD. PSFD is implied by SSBD, and SSBD is off by +default. + +On hardware supporting IBPB (Indirect Branch Prediction Barrier), the `ibpb=` +option can be used to force (the default) or prevent Xen from issuing branch +prediction barriers on vcpu context switches. + On all hardware, the `eager-fpu=` option can be used to force or prevent Xen from using fully eager FPU context switches. This is currently implemented as a global control. By default, Xen will choose to use fully eager context --- a/xen/arch/x86/hvm/svm/vmcb.c +++ b/xen/arch/x86/hvm/svm/vmcb.c @@ -29,6 +29,7 @@ #include <asm/hvm/support.h> #include <asm/hvm/svm/svm.h> #include <asm/hvm/svm/svmdebug.h> +#include <asm/spec_ctrl.h> struct vmcb_struct *alloc_vmcb(void) { @@ -176,6 +177,14 @@ static int construct_vmcb(struct vcpu *v vmcb->_pause_filter_thresh = SVM_PAUSETHRESH_INIT; } + /* + * When default_xen_spec_ctrl simply SPEC_CTRL_STIBP, default this behind + * the back of the VM too. Our SMT topology isn't accurate, the overhead + * is neglegable, and doing this saves a WRMSR on the vmentry path. + */ + if ( default_xen_spec_ctrl == SPEC_CTRL_STIBP ) + v->arch.msrs->spec_ctrl.raw = SPEC_CTRL_STIBP; + return 0; } --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -48,9 +48,13 @@ static enum ind_thunk { THUNK_LFENCE, THUNK_JMP, } opt_thunk __initdata = THUNK_DEFAULT; + static int8_t __initdata opt_ibrs = -1; +int8_t __initdata opt_stibp = -1; +bool __read_mostly opt_ssbd; +int8_t __initdata opt_psfd = -1; + bool __read_mostly opt_ibpb = true; -bool __read_mostly opt_ssbd = false; int8_t __read_mostly opt_eager_fpu = -1; int8_t __read_mostly opt_l1d_flush = -1; static bool __initdata opt_branch_harden = true; @@ -172,12 +176,20 @@ static int __init parse_spec_ctrl(const else rc = -EINVAL; } + + /* Bits in MSR_SPEC_CTRL. */ else if ( (val = parse_boolean("ibrs", s, ss)) >= 0 ) opt_ibrs = val; - else if ( (val = parse_boolean("ibpb", s, ss)) >= 0 ) - opt_ibpb = val; + else if ( (val = parse_boolean("stibp", s, ss)) >= 0 ) + opt_stibp = val; else if ( (val = parse_boolean("ssbd", s, ss)) >= 0 ) opt_ssbd = val; + else if ( (val = parse_boolean("psfd", s, ss)) >= 0 ) + opt_psfd = val; + + /* Misc settings. 
*/ + else if ( (val = parse_boolean("ibpb", s, ss)) >= 0 ) + opt_ibpb = val; else if ( (val = parse_boolean("eager-fpu", s, ss)) >= 0 ) opt_eager_fpu = val; else if ( (val = parse_boolean("l1d-flush", s, ss)) >= 0 ) @@ -376,7 +388,7 @@ static void __init print_details(enum in "\n"); /* Settings for Xen's protection, irrespective of guests. */ - printk(" Xen settings: BTI-Thunk %s, SPEC_CTRL: %s%s%s%s, Other:%s%s%s%s%s\n", + printk(" Xen settings: BTI-Thunk %s, SPEC_CTRL: %s%s%s%s%s, Other:%s%s%s%s%s\n", thunk == THUNK_NONE ? "N/A" : thunk == THUNK_RETPOLINE ? "RETPOLINE" : thunk == THUNK_LFENCE ? "LFENCE" : @@ -390,6 +402,9 @@ static void __init print_details(enum in (!boot_cpu_has(X86_FEATURE_SSBD) && !boot_cpu_has(X86_FEATURE_AMD_SSBD)) ? "" : (default_xen_spec_ctrl & SPEC_CTRL_SSBD) ? " SSBD+" : " SSBD-", + (!boot_cpu_has(X86_FEATURE_PSFD) && + !boot_cpu_has(X86_FEATURE_INTEL_PSFD)) ? "" : + (default_xen_spec_ctrl & SPEC_CTRL_PSFD) ? " PSFD+" : " PSFD-", !(caps & ARCH_CAPS_TSX_CTRL) ? "" : (opt_tsx & 1) ? " TSX+" : " TSX-", !cpu_has_srbds_ctrl ? "" : @@ -979,10 +994,7 @@ void __init init_speculation_mitigations if ( !has_spec_ctrl ) printk(XENLOG_WARNING "?!? CET active, but no MSR_SPEC_CTRL?\n"); else if ( opt_ibrs == -1 ) - { opt_ibrs = ibrs = true; - default_xen_spec_ctrl |= SPEC_CTRL_IBRS | SPEC_CTRL_STIBP; - } if ( opt_thunk == THUNK_DEFAULT || opt_thunk == THUNK_RETPOLINE ) thunk = THUNK_JMP; @@ -1086,14 +1098,49 @@ void __init init_speculation_mitigations setup_force_cpu_cap(X86_FEATURE_SC_MSR_HVM); } - /* If we have IBRS available, see whether we should use it. */ + /* Figure out default_xen_spec_ctrl. */ if ( has_spec_ctrl && ibrs ) + { + /* IBRS implies STIBP. */ + if ( opt_stibp == -1 ) + opt_stibp = 1; + default_xen_spec_ctrl |= SPEC_CTRL_IBRS; + } + + /* + * Use STIBP by default if the hardware hint is set. Otherwise, leave it + * off as it a severe performance pentalty on pre-eIBRS Intel hardware + * where it was retrofitted in microcode. + */ + if ( opt_stibp == -1 ) + opt_stibp = !!boot_cpu_has(X86_FEATURE_STIBP_ALWAYS); + + if ( opt_stibp && (boot_cpu_has(X86_FEATURE_STIBP) || + boot_cpu_has(X86_FEATURE_AMD_STIBP)) ) + default_xen_spec_ctrl |= SPEC_CTRL_STIBP; - /* If we have SSBD available, see whether we should use it. */ if ( opt_ssbd && (boot_cpu_has(X86_FEATURE_SSBD) || boot_cpu_has(X86_FEATURE_AMD_SSBD)) ) + { + /* SSBD implies PSFD */ + if ( opt_psfd == -1 ) + opt_psfd = 1; + default_xen_spec_ctrl |= SPEC_CTRL_SSBD; + } + + /* + * Don't use PSFD by default. AMD designed the predictor to + * auto-clear on privilege change. PSFD is implied by SSBD, which is + * off by default. 
+ */ + if ( opt_psfd == -1 ) + opt_psfd = 0; + + if ( opt_psfd && (boot_cpu_has(X86_FEATURE_PSFD) || + boot_cpu_has(X86_FEATURE_INTEL_PSFD)) ) + default_xen_spec_ctrl |= SPEC_CTRL_PSFD; /* * PV guests can create RSB entries for any linear address they control, ++++++ 62c56cc0-libxc-fix-compilation-error-with-gcc13.patch ++++++ Subject: libxc: fix compilation error with gcc13 From: Charles Arnold carnold@suse.com Wed Jul 6 13:06:40 2022 +0200 Date: Wed Jul 6 13:06:40 2022 +0200: Git: 8eeae8c2b4efefda8e946461e86cf2ae9c18e5a9 xc_psr.c:161:5: error: conflicting types for 'xc_psr_cmt_get_data' due to enum/integer mismatch; Signed-off-by: Charles Arnold <carnold@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> --- a/tools/include/xenctrl.h +++ b/tools/include/xenctrl.h @@ -2516,7 +2516,7 @@ int xc_psr_cmt_get_l3_event_mask(xc_inte int xc_psr_cmt_get_l3_cache_size(xc_interface *xch, uint32_t cpu, uint32_t *l3_cache_size); int xc_psr_cmt_get_data(xc_interface *xch, uint32_t rmid, uint32_t cpu, - uint32_t psr_cmt_type, uint64_t *monitor_data, + xc_psr_cmt_type type, uint64_t *monitor_data, uint64_t *tsc); int xc_psr_cmt_enabled(xc_interface *xch); ++++++ 62cc31ed-x86-honour-spec-ctrl-0-for-unpriv-mmio.patch ++++++ # Commit 4cdb519d797c19ebb8fadc5938cdb47479d5a21b # Date 2022-07-11 15:21:35 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Honour spec-ctrl=0 for unpriv-mmio sub-option This was an oversight from when unpriv-mmio was introduced. Fixes: 8c24b70fedcb ("x86/spec-ctrl: Add spec-ctrl=unpriv-mmio") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -122,6 +122,7 @@ static int __init parse_spec_ctrl(const opt_l1d_flush = 0; opt_branch_harden = false; opt_srb_lock = 0; + opt_unpriv_mmio = false; } else if ( val > 0 ) rc = -EINVAL; ++++++ 62cc31ee-cmdline-extend-parse_boolean.patch ++++++ # Commit 382326cac528dd1eb0d04efd5c05363c453e29f4 # Date 2022-07-11 15:21:35 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> xen/cmdline: Extend parse_boolean() to signal a name match This will help parsing a sub-option which has boolean and non-boolean options available. First, rework 'int val' into 'bool has_neg_prefix'. This inverts it's value, but the resulting logic is far easier to follow. Second, reject anything of the form 'no-$FOO=' which excludes ambiguous constructs such as 'no-$foo=yes' which have never been valid. This just leaves the case where everything is otherwise fine, but parse_bool() can't interpret the provided string. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/common/kernel.c +++ b/xen/common/kernel.c @@ -272,9 +272,9 @@ int parse_bool(const char *s, const char int parse_boolean(const char *name, const char *s, const char *e) { size_t slen, nlen; - int val = !!strncmp(s, "no-", 3); + bool has_neg_prefix = !strncmp(s, "no-", 3); - if ( !val ) + if ( has_neg_prefix ) s += 3; slen = e ? ({ ASSERT(e >= s); e - s; }) : strlen(s); @@ -286,11 +286,23 @@ int parse_boolean(const char *name, cons /* Exact, unadorned name? Result depends on the 'no-' prefix. */ if ( slen == nlen ) - return val; + return !has_neg_prefix; + + /* Inexact match with a 'no-' prefix? Not valid. 
*/ + if ( has_neg_prefix ) + return -1; /* =$SOMETHING? Defer to the regular boolean parsing. */ if ( s[nlen] == '=' ) - return parse_bool(&s[nlen + 1], e); + { + int b = parse_bool(&s[nlen + 1], e); + + if ( b >= 0 ) + return b; + + /* Not a boolean, but the name matched. Signal specially. */ + return -2; + } /* Unrecognised. Give up. */ return -1; --- a/xen/include/xen/lib.h +++ b/xen/include/xen/lib.h @@ -80,7 +80,8 @@ int parse_bool(const char *s, const char /** * Given a specific name, parses a string of the form: * [no-]$NAME[=...] - * returning 0 or 1 for a recognised boolean, or -1 for an error. + * returning 0 or 1 for a recognised boolean. Returns -1 for general errors, + * and -2 for "not a boolean, but $NAME= matches". */ int parse_boolean(const char *name, const char *s, const char *e); ++++++ 62cc31ef-x86-spec-ctrl-fine-grained-cmdline-subopts.patch ++++++ # Commit 27357c394ba6e1571a89105b840ce1c6f026485c # Date 2022-07-11 15:21:35 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Add fine-grained cmdline suboptions for primitives Support controling the PV/HVM suboption of msr-sc/rsb/md-clear, which previously wasn't possible. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -2233,7 +2233,8 @@ not be able to control the state of the By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`). ### spec-ctrl (x86) -> `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>, +> `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>, +> {msr-sc,rsb,md-clear}=<bool>|{pv,hvm}=<bool>,
bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd, eager-fpu,l1d-flush,branch-harden,srb-lock, unpriv-mmio}=<bool> ]`
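The sub-suboption forms above rely on the parse_boolean() extension from the previous patch: 1/0 for a recognised boolean, -1 for no match, and -2 for "name matched, but the value is not a boolean". The following standalone model of that calling pattern uses a deliberately simplified strings-only parser rather than the real Xen helpers:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Toy model of parse_boolean(): 1/0 for "$NAME"/"no-$NAME", -2 if
     * "$NAME=..." matched but the value is not a plain boolean, -1 otherwise.
     * Simplified to NUL-terminated strings and literal "1"/"0" values.
     */
    static int parse_boolean(const char *name, const char *s)
    {
        size_t nlen = strlen(name);
        bool neg = !strncmp(s, "no-", 3);

        if ( neg )
            s += 3;

        if ( strncmp(s, name, nlen) )
            return -1;

        if ( s[nlen] == '\0' )            /* "name" or "no-name" */
            return !neg;

        if ( neg || s[nlen] != '=' )      /* "no-name=..." is never valid */
            return -1;

        if ( !strcmp(s + nlen + 1, "1") )
            return 1;
        if ( !strcmp(s + nlen + 1, "0") )
            return 0;

        return -2;                        /* name matched, value not boolean */
    }

    int main(void)
    {
        /* The msr-sc=/rsb=/md-clear= caller pattern from the patch, simplified. */
        const char *arg = "md-clear=no-hvm";
        int val = parse_boolean("md-clear", arg);
        bool pv = true, hvm = true;

        switch ( val )
        {
        case 0: case 1:
            pv = hvm = val;
            break;
        case -2:
            arg += strlen("md-clear=");
            if ( (val = parse_boolean("pv", arg)) >= 0 )
                pv = val;
            else if ( (val = parse_boolean("hvm", arg)) >= 0 )
                hvm = val;
            break;
        }

        printf("pv=%d hvm=%d\n", pv, hvm);
        return 0;
    }

Run on "md-clear=no-hvm" this leaves the PV side enabled and disables only the HVM side, matching the behaviour of the real parser in the hunk below.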
@@ -2258,12 +2259,17 @@ in place for guests to use. Use of a positive boolean value for either of these options is invalid. -The booleans `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` offer fine +The `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` options offer fine grained control over the primitives by Xen. These impact Xen's ability to -protect itself, and Xen's ability to virtualise support for guests to use. +protect itself, and/or Xen's ability to virtualise support for guests to use. * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests respectively. +* Each other option can be used either as a plain boolean + (e.g. `spec-ctrl=rsb` to control both the PV and HVM sub-options), or with + `pv=` or `hvm=` subsuboptions (e.g. `spec-ctrl=rsb=no-hvm` to disable HVM + RSB only). + * `msr-sc=` offers control over Xen's support for manipulating `MSR_SPEC_CTRL` on entry and exit. These blocks are necessary to virtualise support for guests and if disabled, guests will be unable to use IBRS/STIBP/SSBD/etc. --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -147,20 +147,68 @@ static int __init parse_spec_ctrl(const opt_rsb_hvm = val; opt_md_clear_hvm = val; } - else if ( (val = parse_boolean("msr-sc", s, ss)) >= 0 ) + else if ( (val = parse_boolean("msr-sc", s, ss)) != -1 ) { - opt_msr_sc_pv = val; - opt_msr_sc_hvm = val; + switch ( val ) + { + case 0: + case 1: + opt_msr_sc_pv = opt_msr_sc_hvm = val; + break; + + case -2: + s += strlen("msr-sc="); + if ( (val = parse_boolean("pv", s, ss)) >= 0 ) + opt_msr_sc_pv = val; + else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) + opt_msr_sc_hvm = val; + else + default: + rc = -EINVAL; + break; + } } - else if ( (val = parse_boolean("rsb", s, ss)) >= 0 ) + else if ( (val = parse_boolean("rsb", s, ss)) != -1 ) { - opt_rsb_pv = val; - opt_rsb_hvm = val; + switch ( val ) + { + case 0: + case 1: + opt_rsb_pv = opt_rsb_hvm = val; + break; + + case -2: + s += strlen("rsb="); + if ( (val = parse_boolean("pv", s, ss)) >= 0 ) + opt_rsb_pv = val; + else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) + opt_rsb_hvm = val; + else + default: + rc = -EINVAL; + break; + } } - else if ( (val = parse_boolean("md-clear", s, ss)) >= 0 ) + else if ( (val = parse_boolean("md-clear", s, ss)) != -1 ) { - opt_md_clear_pv = val; - opt_md_clear_hvm = val; + switch ( val ) + { + case 0: + case 1: + opt_md_clear_pv = opt_md_clear_hvm = val; + break; + + case -2: + s += strlen("md-clear="); + if ( (val = parse_boolean("pv", s, ss)) >= 0 ) + opt_md_clear_pv = val; + else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) + opt_md_clear_hvm = val; + else + default: + rc = -EINVAL; + break; + } } /* Xen's speculative sidechannel mitigation settings. */ ++++++ 62cd91d0-x86-spec-ctrl-rework-context-switching.patch ++++++ # Commit 5796912f7279d9348a3166655588d30eae9f72cc # Date 2022-07-12 16:23:00 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Rework spec_ctrl_flags context switching We are shortly going to need to context switch new bits in both the vcpu and S3 paths. Introduce SCF_IST_MASK and SCF_DOM_MASK, and rework d->arch.verw into d->arch.spec_ctrl_flags to accommodate. No functional change. This is part of XSA-407. 
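The rework boils down to replacing the single bool with a flags byte that is merged under a mask at context switch, so further per-domain bits can be added later without disturbing the IST-related bits. A tiny standalone illustration, using the mask values this patch introduces:

    #include <stdint.h>
    #include <stdio.h>

    #define SCF_use_shadow (1u << 0)
    #define SCF_ist_wrmsr  (1u << 1)
    #define SCF_ist_rsb    (1u << 2)
    #define SCF_verw       (1u << 3)

    #define SCF_IST_MASK   (SCF_ist_wrmsr)   /* inhibited on the S3 resume path */
    #define SCF_DOM_MASK   (SCF_verw)        /* merged on context switch */

    int main(void)
    {
        uint8_t cpu_flags = SCF_ist_wrmsr | SCF_verw;  /* outgoing domain wanted VERW */
        uint8_t next_dom  = 0;                         /* incoming domain does not */

        /* The context-switch merge: only SCF_DOM_MASK bits change hands. */
        cpu_flags = (cpu_flags & ~SCF_DOM_MASK) | (next_dom & SCF_DOM_MASK);

        printf("%#x\n", (unsigned)cpu_flags);  /* 0x2: IST bit kept, VERW bit dropped */
        return 0;
    }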
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/acpi/power.c +++ b/xen/arch/x86/acpi/power.c @@ -248,8 +248,8 @@ static int enter_state(u32 state) error = 0; ci = get_cpu_info(); - /* Avoid NMI/#MC using MSR_SPEC_CTRL until we've reloaded microcode. */ - ci->spec_ctrl_flags &= ~SCF_ist_wrmsr; + /* Avoid NMI/#MC using unsafe MSRs until we've reloaded microcode. */ + ci->spec_ctrl_flags &= ~SCF_IST_MASK; ACPI_FLUSH_CPU_CACHE(); @@ -292,8 +292,8 @@ static int enter_state(u32 state) if ( !recheck_cpu_features(0) ) panic("Missing previously available feature(s)\n"); - /* Re-enabled default NMI/#MC use of MSR_SPEC_CTRL. */ - ci->spec_ctrl_flags |= (default_spec_ctrl_flags & SCF_ist_wrmsr); + /* Re-enabled default NMI/#MC use of MSRs now microcode is loaded. */ + ci->spec_ctrl_flags |= (default_spec_ctrl_flags & SCF_IST_MASK); if ( boot_cpu_has(X86_FEATURE_IBRSB) || boot_cpu_has(X86_FEATURE_IBRS) ) { --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -2092,10 +2092,10 @@ void context_switch(struct vcpu *prev, s } } - /* Update the top-of-stack block with the VERW disposition. */ - info->spec_ctrl_flags &= ~SCF_verw; - if ( nextd->arch.verw ) - info->spec_ctrl_flags |= SCF_verw; + /* Update the top-of-stack block with the new spec_ctrl settings. */ + info->spec_ctrl_flags = + (info->spec_ctrl_flags & ~SCF_DOM_MASK) | + (nextd->arch.spec_ctrl_flags & SCF_DOM_MASK); } sched_context_switched(prev, next); --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -1010,9 +1010,12 @@ void spec_ctrl_init_domain(struct domain { bool pv = is_pv_domain(d); - d->arch.verw = - (pv ? opt_md_clear_pv : opt_md_clear_hvm) || - (opt_fb_clear_mmio && is_iommu_enabled(d)); + bool verw = ((pv ? opt_md_clear_pv : opt_md_clear_hvm) || + (opt_fb_clear_mmio && is_iommu_enabled(d))); + + d->arch.spec_ctrl_flags = + (verw ? SCF_verw : 0) | + 0; } void __init init_speculation_mitigations(void) --- a/xen/include/asm-x86/domain.h +++ b/xen/include/asm-x86/domain.h @@ -319,8 +319,7 @@ struct arch_domain uint32_t pci_cf8; uint8_t cmos_idx; - /* Use VERW on return-to-guest for its flushing side effect. */ - bool verw; + uint8_t spec_ctrl_flags; /* See SCF_DOM_MASK */ union { struct pv_domain pv; --- a/xen/include/asm-x86/spec_ctrl.h +++ b/xen/include/asm-x86/spec_ctrl.h @@ -20,12 +20,40 @@ #ifndef __X86_SPEC_CTRL_H__ #define __X86_SPEC_CTRL_H__ -/* Encoding of cpuinfo.spec_ctrl_flags */ +/* + * Encoding of: + * cpuinfo.spec_ctrl_flags + * default_spec_ctrl_flags + * domain.spec_ctrl_flags + * + * Live settings are in the top-of-stack block, because they need to be + * accessable when XPTI is active. Some settings are fixed from boot, some + * context switched per domain, and some inhibited in the S3 path. + */ #define SCF_use_shadow (1 << 0) #define SCF_ist_wrmsr (1 << 1) #define SCF_ist_rsb (1 << 2) #define SCF_verw (1 << 3) +/* + * The IST paths (NMI/#MC) can interrupt any arbitrary context. Some + * functionality requires updated microcode to work. + * + * On boot, this is easy; we load microcode before figuring out which + * speculative protections to apply. However, on the S3 resume path, we must + * be able to disable the configured mitigations until microcode is reloaded. + * + * These are the controls to inhibit on the S3 resume path until microcode has + * been reloaded. + */ +#define SCF_IST_MASK (SCF_ist_wrmsr) + +/* + * Some speculative protections are per-domain. 
These settings are merged + * into the top-of-stack block in the context switch path. + */ +#define SCF_DOM_MASK (SCF_verw) + #ifndef __ASSEMBLY__ #include <asm/alternative.h> --- a/xen/include/asm-x86/spec_ctrl_asm.h +++ b/xen/include/asm-x86/spec_ctrl_asm.h @@ -248,9 +248,6 @@ /* * Use in IST interrupt/exception context. May interrupt Xen or PV context. - * Fine grain control of SCF_ist_wrmsr is needed for safety in the S3 resume - * path to avoid using MSR_SPEC_CTRL before the microcode introducing it has - * been reloaded. */ .macro SPEC_CTRL_ENTRY_FROM_INTR_IST /* ++++++ 62cd91d1-x86-spec-ctrl-rename-SCF_ist_wrmsr.patch ++++++ # Commit 76d6a36f645dfdbad8830559d4d52caf36efc75e # Date 2022-07-12 16:23:00 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Rename SCF_ist_wrmsr to SCF_ist_sc_msr We are about to introduce SCF_ist_ibpb, at which point SCF_ist_wrmsr becomes ambiguous. No functional change. This is part of XSA-407. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -1115,7 +1115,7 @@ void __init init_speculation_mitigations { if ( opt_msr_sc_pv ) { - default_spec_ctrl_flags |= SCF_ist_wrmsr; + default_spec_ctrl_flags |= SCF_ist_sc_msr; setup_force_cpu_cap(X86_FEATURE_SC_MSR_PV); } @@ -1126,7 +1126,7 @@ void __init init_speculation_mitigations * Xen's value is not restored atomically. An early NMI hitting * the VMExit path needs to restore Xen's value for safety. */ - default_spec_ctrl_flags |= SCF_ist_wrmsr; + default_spec_ctrl_flags |= SCF_ist_sc_msr; setup_force_cpu_cap(X86_FEATURE_SC_MSR_HVM); } } @@ -1139,7 +1139,7 @@ void __init init_speculation_mitigations * on real hardware matches the availability of MSR_SPEC_CTRL in the * first place. * - * No need for SCF_ist_wrmsr because Xen's value is restored + * No need for SCF_ist_sc_msr because Xen's value is restored * atomically WRT NMIs in the VMExit path. * * TODO: Adjust cpu_has_svm_spec_ctrl to be usable earlier on boot. --- a/xen/include/asm-x86/spec_ctrl.h +++ b/xen/include/asm-x86/spec_ctrl.h @@ -31,7 +31,7 @@ * context switched per domain, and some inhibited in the S3 path. */ #define SCF_use_shadow (1 << 0) -#define SCF_ist_wrmsr (1 << 1) +#define SCF_ist_sc_msr (1 << 1) #define SCF_ist_rsb (1 << 2) #define SCF_verw (1 << 3) @@ -46,7 +46,7 @@ * These are the controls to inhibit on the S3 resume path until microcode has * been reloaded. */ -#define SCF_IST_MASK (SCF_ist_wrmsr) +#define SCF_IST_MASK (SCF_ist_sc_msr) /* * Some speculative protections are per-domain. These settings are merged --- a/xen/include/asm-x86/spec_ctrl_asm.h +++ b/xen/include/asm-x86/spec_ctrl_asm.h @@ -266,8 +266,8 @@ .L\@_skip_rsb: - test $SCF_ist_wrmsr, %al - jz .L\@_skip_wrmsr + test $SCF_ist_sc_msr, %al + jz .L\@_skip_msr_spec_ctrl xor %edx, %edx testb $3, UREGS_cs(%rsp) @@ -290,7 +290,7 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise): * to speculate around the WRMSR. As a result, we need a dispatch * serialising instruction in the else clause. 
*/ -.L\@_skip_wrmsr: +.L\@_skip_msr_spec_ctrl: lfence UNLIKELY_END(\@_serialise) .endm @@ -301,7 +301,7 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise): * Requires %rbx=stack_end * Clobbers %rax, %rcx, %rdx */ - testb $SCF_ist_wrmsr, STACK_CPUINFO_FIELD(spec_ctrl_flags)(%rbx) + testb $SCF_ist_sc_msr, STACK_CPUINFO_FIELD(spec_ctrl_flags)(%rbx) jz .L\@_skip DO_SPEC_CTRL_EXIT_TO_XEN ++++++ 62cd91d2-x86-spec-ctrl-rename-opt_ibpb.patch ++++++ # Commit a8e5ef079d6f5c88c472e3e620db5a8d1402a50d # Date 2022-07-12 16:23:00 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Rename opt_ibpb to opt_ibpb_ctxt_switch We are about to introduce the use of IBPB at different points in Xen, making opt_ibpb ambiguous. Rename it to opt_ibpb_ctxt_switch. No functional change. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -2064,7 +2064,7 @@ void context_switch(struct vcpu *prev, s ctxt_switch_levelling(next); - if ( opt_ibpb && !is_idle_domain(nextd) ) + if ( opt_ibpb_ctxt_switch && !is_idle_domain(nextd) ) { static DEFINE_PER_CPU(unsigned int, last); unsigned int *last_id = &this_cpu(last); --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -54,7 +54,7 @@ int8_t __initdata opt_stibp = -1; bool __read_mostly opt_ssbd; int8_t __initdata opt_psfd = -1; -bool __read_mostly opt_ibpb = true; +bool __read_mostly opt_ibpb_ctxt_switch = true; int8_t __read_mostly opt_eager_fpu = -1; int8_t __read_mostly opt_l1d_flush = -1; static bool __initdata opt_branch_harden = true; @@ -117,7 +117,7 @@ static int __init parse_spec_ctrl(const opt_thunk = THUNK_JMP; opt_ibrs = 0; - opt_ibpb = false; + opt_ibpb_ctxt_switch = false; opt_ssbd = false; opt_l1d_flush = 0; opt_branch_harden = false; @@ -238,7 +238,7 @@ static int __init parse_spec_ctrl(const /* Misc settings. */ else if ( (val = parse_boolean("ibpb", s, ss)) >= 0 ) - opt_ibpb = val; + opt_ibpb_ctxt_switch = val; else if ( (val = parse_boolean("eager-fpu", s, ss)) >= 0 ) opt_eager_fpu = val; else if ( (val = parse_boolean("l1d-flush", s, ss)) >= 0 ) @@ -458,7 +458,7 @@ static void __init print_details(enum in (opt_tsx & 1) ? " TSX+" : " TSX-", !cpu_has_srbds_ctrl ? "" : opt_srb_lock ? " SRB_LOCK+" : " SRB_LOCK-", - opt_ibpb ? " IBPB" : "", + opt_ibpb_ctxt_switch ? " IBPB-ctxt" : "", opt_l1d_flush ? " L1D_FLUSH" : "", opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio ? " VERW" : "", @@ -1240,7 +1240,7 @@ void __init init_speculation_mitigations /* Check we have hardware IBPB support before using it... */ if ( !boot_cpu_has(X86_FEATURE_IBRSB) && !boot_cpu_has(X86_FEATURE_IBPB) ) - opt_ibpb = false; + opt_ibpb_ctxt_switch = false; /* Check whether Eager FPU should be enabled by default. 
*/ if ( opt_eager_fpu == -1 ) --- a/xen/include/asm-x86/spec_ctrl.h +++ b/xen/include/asm-x86/spec_ctrl.h @@ -63,7 +63,7 @@ void init_speculation_mitigations(void); void spec_ctrl_init_domain(struct domain *d); -extern bool opt_ibpb; +extern bool opt_ibpb_ctxt_switch; extern bool opt_ssbd; extern int8_t opt_eager_fpu; extern int8_t opt_l1d_flush; ++++++ 62cd91d3-x86-spec-ctrl-rework-SPEC_CTRL_ENTRY_FROM_INTR_IST.patch ++++++ # Commit e9b8d31981f184c6539f91ec54bd9cae29cdae36 # Date 2022-07-12 16:23:00 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Rework SPEC_CTRL_ENTRY_FROM_INTR_IST We are shortly going to add a conditional IBPB in this path. Therefore, we cannot hold spec_ctrl_flags in %eax, and rely on only clobbering it after we're done with its contents. %rbx is available for use, and the more normal register to hold preserved information in. With %rax freed up, use it instead of %rdx for the RSB tmp register, and for the adjustment to spec_ctrl_flags. This leaves no use of %rdx, except as 0 for the upper half of WRMSR. In practice, %rdx is 0 from SAVE_ALL on all paths and isn't likely to change in the foreseeable future, so update the macro entry requirements to state this dependency. This marginal optimisation can be revisited if circumstances change. No practical change. This is part of XSA-407. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/x86_64/entry.S +++ b/xen/arch/x86/x86_64/entry.S @@ -932,7 +932,7 @@ ENTRY(double_fault) GET_STACK_END(14) - SPEC_CTRL_ENTRY_FROM_INTR_IST /* Req: %rsp=regs, %r14=end, Clob: acd */ + SPEC_CTRL_ENTRY_FROM_INTR_IST /* Req: %rsp=regs, %r14=end, %rdx=0, Clob: abcd */ /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ mov STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rbx @@ -968,7 +968,7 @@ handle_ist_exception: GET_STACK_END(14) - SPEC_CTRL_ENTRY_FROM_INTR_IST /* Req: %rsp=regs, %r14=end, Clob: acd */ + SPEC_CTRL_ENTRY_FROM_INTR_IST /* Req: %rsp=regs, %r14=end, %rdx=0, Clob: abcd */ /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ mov STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx --- a/xen/include/asm-x86/spec_ctrl_asm.h +++ b/xen/include/asm-x86/spec_ctrl_asm.h @@ -251,34 +251,33 @@ */ .macro SPEC_CTRL_ENTRY_FROM_INTR_IST /* - * Requires %rsp=regs, %r14=stack_end - * Clobbers %rax, %rcx, %rdx + * Requires %rsp=regs, %r14=stack_end, %rdx=0 + * Clobbers %rax, %rbx, %rcx, %rdx * * This is logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY * maybexen=1, but with conditionals rather than alternatives. */ - movzbl STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14), %eax + movzbl STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14), %ebx - test $SCF_ist_rsb, %al + test $SCF_ist_rsb, %bl jz .L\@_skip_rsb - DO_OVERWRITE_RSB tmp=rdx /* Clobbers %rcx/%rdx */ + DO_OVERWRITE_RSB /* Clobbers %rax/%rcx */ .L\@_skip_rsb: - test $SCF_ist_sc_msr, %al + test $SCF_ist_sc_msr, %bl jz .L\@_skip_msr_spec_ctrl - xor %edx, %edx + xor %eax, %eax testb $3, UREGS_cs(%rsp) - setnz %dl - not %edx - and %dl, STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14) + setnz %al + not %eax + and %al, STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14) /* Load Xen's intended value. */ mov $MSR_SPEC_CTRL, %ecx movzbl STACK_CPUINFO_FIELD(xen_spec_ctrl)(%r14), %eax - xor %edx, %edx wrmsr /* Opencoded UNLIKELY_START() with no condition. 
*/ ++++++ 62cd91d4-x86-spec-ctrl-IBPB-on-entry.patch ++++++ # Commit 53a570b285694947776d5190f591a0d5b9b18de7 # Date 2022-07-12 16:23:00 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Support IBPB-on-entry We are going to need this to mitigate Branch Type Confusion on AMD/Hygon CPUs, but as we've talked about using it in other cases too, arrange to support it generally. However, this is also very expensive in some cases, so we're going to want per-domain controls. Introduce SCF_ist_ibpb and SCF_entry_ibpb controls, adding them to the IST and DOM masks as appropriate. Also introduce X86_FEATURE_IBPB_ENTRY_{PV,HVM} to to patch the code blocks. For SVM, the STGI is serialising enough to protect against Spectre-v1 attacks, so no "else lfence" is necessary. VT-x will use use the MSR host load list, so doesn't need any code in the VMExit path. For the IST path, we can't safely check CPL==0 to skip a flush, as we might have hit an entry path before it's IBPB. As IST hitting Xen is rare, flush irrespective of CPL. A later path, SCF_ist_sc_msr, provides Spectre-v1 safety. For the PV paths, we know we're interrupting CPL>0, while for the INTR paths, we can safely check CPL==0. Only flush when interrupting guest context. An "else lfence" is needed for safety, but we want to be able to skip it on unaffected CPUs, so the block wants to be an alternative, which means the lfence has to be inline rather than UNLIKELY() (the replacement block doesn't have displacements fixed up for anything other than the first instruction). As with SPEC_CTRL_ENTRY_FROM_INTR_IST, %rdx is 0 on entry so rely on this to shrink the logic marginally. Update the comments to specify this new dependency. This is part of XSA-407. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/hvm/svm/entry.S +++ b/xen/arch/x86/hvm/svm/entry.S @@ -97,7 +97,19 @@ __UNLIKELY_END(nsvm_hap) GET_CURRENT(bx) - /* SPEC_CTRL_ENTRY_FROM_SVM Req: %rsp=regs/cpuinfo Clob: acd */ + /* SPEC_CTRL_ENTRY_FROM_SVM Req: %rsp=regs/cpuinfo, %rdx=0 Clob: acd */ + + .macro svm_vmexit_cond_ibpb + testb $SCF_entry_ibpb, CPUINFO_xen_spec_ctrl(%rsp) + jz .L_skip_ibpb + + mov $MSR_PRED_CMD, %ecx + mov $PRED_CMD_IBPB, %eax + wrmsr +.L_skip_ibpb: + .endm + ALTERNATIVE "", svm_vmexit_cond_ibpb, X86_FEATURE_IBPB_ENTRY_HVM + ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_HVM .macro svm_vmexit_spec_ctrl @@ -114,6 +126,10 @@ __UNLIKELY_END(nsvm_hap) ALTERNATIVE "", svm_vmexit_spec_ctrl, X86_FEATURE_SC_MSR_HVM /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ + /* + * STGI is executed unconditionally, and is sufficiently serialising + * to safely resolve any Spectre-v1 concerns in the above logic. + */ stgi GLOBAL(svm_stgi_label) mov %rsp,%rdi --- a/xen/arch/x86/hvm/vmx/vmcs.c +++ b/xen/arch/x86/hvm/vmx/vmcs.c @@ -1345,6 +1345,10 @@ static int construct_vmcs(struct vcpu *v rc = vmx_add_msr(v, MSR_FLUSH_CMD, FLUSH_CMD_L1D, VMX_MSR_GUEST_LOADONLY); + if ( !rc && (d->arch.spec_ctrl_flags & SCF_entry_ibpb) ) + rc = vmx_add_msr(v, MSR_PRED_CMD, PRED_CMD_IBPB, + VMX_MSR_HOST); + out: vmx_vmcs_exit(v); --- a/xen/arch/x86/x86_64/compat/entry.S +++ b/xen/arch/x86/x86_64/compat/entry.S @@ -18,7 +18,7 @@ ENTRY(entry_int82) movl $HYPERCALL_VECTOR, 4(%rsp) SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. 
*/ - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */ + SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */ /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ CR4_PV32_RESTORE --- a/xen/arch/x86/x86_64/entry.S +++ b/xen/arch/x86/x86_64/entry.S @@ -260,7 +260,7 @@ ENTRY(lstar_enter) movl $TRAP_syscall, 4(%rsp) SAVE_ALL - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */ + SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */ /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ GET_STACK_END(bx) @@ -298,7 +298,7 @@ ENTRY(cstar_enter) movl $TRAP_syscall, 4(%rsp) SAVE_ALL - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */ + SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */ /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ GET_STACK_END(bx) @@ -338,7 +338,7 @@ GLOBAL(sysenter_eflags_saved) movl $TRAP_syscall, 4(%rsp) SAVE_ALL - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */ + SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */ /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ GET_STACK_END(bx) @@ -392,7 +392,7 @@ ENTRY(int80_direct_trap) movl $0x80, 4(%rsp) SAVE_ALL - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */ + SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */ /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ GET_STACK_END(bx) @@ -674,7 +674,7 @@ ENTRY(common_interrupt) GET_STACK_END(14) - SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */ + SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, %rdx=0, Clob: acd */ /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ mov STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx @@ -708,7 +708,7 @@ GLOBAL(handle_exception) GET_STACK_END(14) - SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */ + SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, %rdx=0, Clob: acd */ /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ mov STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx --- a/xen/include/asm-x86/cpufeatures.h +++ b/xen/include/asm-x86/cpufeatures.h @@ -39,6 +39,8 @@ XEN_CPUFEATURE(XEN_LBR, X86_SY XEN_CPUFEATURE(SC_VERW_IDLE, X86_SYNTH(25)) /* VERW used by Xen for idle */ XEN_CPUFEATURE(XEN_SHSTK, X86_SYNTH(26)) /* Xen uses CET Shadow Stacks */ XEN_CPUFEATURE(XEN_IBT, X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */ +XEN_CPUFEATURE(IBPB_ENTRY_PV, X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for PV */ +XEN_CPUFEATURE(IBPB_ENTRY_HVM, X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */ /* Bug words follow the synthetic words. */ #define X86_NR_BUG 1 --- a/xen/include/asm-x86/spec_ctrl.h +++ b/xen/include/asm-x86/spec_ctrl.h @@ -34,6 +34,8 @@ #define SCF_ist_sc_msr (1 << 1) #define SCF_ist_rsb (1 << 2) #define SCF_verw (1 << 3) +#define SCF_ist_ibpb (1 << 4) +#define SCF_entry_ibpb (1 << 5) /* * The IST paths (NMI/#MC) can interrupt any arbitrary context. Some @@ -46,13 +48,13 @@ * These are the controls to inhibit on the S3 resume path until microcode has * been reloaded. */ -#define SCF_IST_MASK (SCF_ist_sc_msr) +#define SCF_IST_MASK (SCF_ist_sc_msr | SCF_ist_ibpb) /* * Some speculative protections are per-domain. These settings are merged * into the top-of-stack block in the context switch path. 
*/ -#define SCF_DOM_MASK (SCF_verw) +#define SCF_DOM_MASK (SCF_verw | SCF_entry_ibpb) #ifndef __ASSEMBLY__ --- a/xen/include/asm-x86/spec_ctrl_asm.h +++ b/xen/include/asm-x86/spec_ctrl_asm.h @@ -88,6 +88,35 @@ * - SPEC_CTRL_EXIT_TO_{SVM,VMX} */ +.macro DO_SPEC_CTRL_COND_IBPB maybexen:req +/* + * Requires %rsp=regs (also cpuinfo if !maybexen) + * Requires %r14=stack_end (if maybexen), %rdx=0 + * Clobbers %rax, %rcx, %rdx + * + * Conditionally issue IBPB if SCF_entry_ibpb is active. In the maybexen + * case, we can safely look at UREGS_cs to skip taking the hit when + * interrupting Xen. + */ + .if \maybexen + testb $SCF_entry_ibpb, STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14) + jz .L\@_skip + testb $3, UREGS_cs(%rsp) + .else + testb $SCF_entry_ibpb, CPUINFO_xen_spec_ctrl(%rsp) + .endif + jz .L\@_skip + + mov $MSR_PRED_CMD, %ecx + mov $PRED_CMD_IBPB, %eax + wrmsr + jmp .L\@_done + +.L\@_skip: + lfence +.L\@_done: +.endm + .macro DO_OVERWRITE_RSB tmp=rax /* * Requires nothing @@ -225,12 +254,16 @@ /* Use after an entry from PV context (syscall/sysenter/int80/int82/etc). */ #define SPEC_CTRL_ENTRY_FROM_PV \ + ALTERNATIVE "", __stringify(DO_SPEC_CTRL_COND_IBPB maybexen=0), \ + X86_FEATURE_IBPB_ENTRY_PV; \ ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_PV; \ ALTERNATIVE "", __stringify(DO_SPEC_CTRL_ENTRY maybexen=0), \ X86_FEATURE_SC_MSR_PV /* Use in interrupt/exception context. May interrupt Xen or PV context. */ #define SPEC_CTRL_ENTRY_FROM_INTR \ + ALTERNATIVE "", __stringify(DO_SPEC_CTRL_COND_IBPB maybexen=1), \ + X86_FEATURE_IBPB_ENTRY_PV; \ ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_PV; \ ALTERNATIVE "", __stringify(DO_SPEC_CTRL_ENTRY maybexen=1), \ X86_FEATURE_SC_MSR_PV @@ -254,11 +287,23 @@ * Requires %rsp=regs, %r14=stack_end, %rdx=0 * Clobbers %rax, %rbx, %rcx, %rdx * - * This is logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY - * maybexen=1, but with conditionals rather than alternatives. + * This is logical merge of: + * DO_SPEC_CTRL_COND_IBPB maybexen=0 + * DO_OVERWRITE_RSB + * DO_SPEC_CTRL_ENTRY maybexen=1 + * but with conditionals rather than alternatives. */ movzbl STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14), %ebx + test $SCF_ist_ibpb, %bl + jz .L\@_skip_ibpb + + mov $MSR_PRED_CMD, %ecx + mov $PRED_CMD_IBPB, %eax + wrmsr + +.L\@_skip_ibpb: + test $SCF_ist_rsb, %bl jz .L\@_skip_rsb ++++++ 62cd91d5-x86-cpuid-BTC_NO-enum.patch ++++++ # Commit 76cb04ad64f3ab9ae785988c40655a71dde9c319 # Date 2022-07-12 16:23:00 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/cpuid: Enumeration for BTC_NO BTC_NO indicates that hardware is not succeptable to Branch Type Confusion. Zen3 CPUs don't suffer BTC. This is part of XSA-407. 
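(Illustration only, not part of the patch: the BTC_NO flag enumerated here is CPUID leaf 0x80000008, EBX bit 29, matching the libxl_cpuid table entry added below. A minimal user-space probe, assuming GCC's <cpuid.h> is available, could look like:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

        /* CPUID leaf 0x80000008, EBX bit 29 == BTC_NO ("hardware not
         * vulnerable to Branch Type Confusion"); __get_cpuid() returns 0
         * if the extended leaf is not supported at all. */
        if ( __get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx) )
            printf("BTC_NO: %s\n", (ebx & (1u << 29)) ? "yes" : "no");

        return 0;
    }

Per the amd.c hunk below, Xen fills the bit in itself on Zen3 hardware that predates its allocation, so raw CPUID and Xen's synthesised view may differ.)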
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/tools/libs/light/libxl_cpuid.c +++ b/tools/libs/light/libxl_cpuid.c @@ -288,6 +288,7 @@ int libxl_cpuid_parse_config(libxl_cpuid {"virt-ssbd", 0x80000008, NA, CPUID_REG_EBX, 25, 1}, {"ssb-no", 0x80000008, NA, CPUID_REG_EBX, 26, 1}, {"psfd", 0x80000008, NA, CPUID_REG_EBX, 28, 1}, + {"btc-no", 0x80000008, NA, CPUID_REG_EBX, 29, 1}, {"nc", 0x80000008, NA, CPUID_REG_ECX, 0, 8}, {"apicidsize", 0x80000008, NA, CPUID_REG_ECX, 12, 4}, --- a/tools/misc/xen-cpuid.c +++ b/tools/misc/xen-cpuid.c @@ -158,7 +158,7 @@ static const char *const str_e8b[32] = /* [22] */ [23] = "ppin", [24] = "amd-ssbd", [25] = "virt-ssbd", [26] = "ssb-no", - [28] = "psfd", + [28] = "psfd", [29] = "btc-no", }; static const char *const str_7d0[32] = --- a/xen/arch/x86/cpu/amd.c +++ b/xen/arch/x86/cpu/amd.c @@ -847,6 +847,16 @@ static void init_amd(struct cpuinfo_x86 warning_add(text); } break; + + case 0x19: + /* + * Zen3 (Fam19h model < 0x10) parts are not susceptible to + * Branch Type Confusion, but predate the allocation of the + * BTC_NO bit. Fill it back in if we're not virtualised. + */ + if (!cpu_has_hypervisor && !cpu_has(c, X86_FEATURE_BTC_NO)) + __set_bit(X86_FEATURE_BTC_NO, c->x86_capability); + break; } display_cacheinfo(c); --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -388,7 +388,7 @@ static void __init print_details(enum in * Hardware read-only information, stating immunity to certain issues, or * suggestions of which mitigation to use. */ - printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n", + printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n", (caps & ARCH_CAPS_RDCL_NO) ? " RDCL_NO" : "", (caps & ARCH_CAPS_IBRS_ALL) ? " IBRS_ALL" : "", (caps & ARCH_CAPS_RSBA) ? " RSBA" : "", @@ -403,7 +403,8 @@ static void __init print_details(enum in (e8b & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS)) ? " IBRS_ALWAYS" : "", (e8b & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS)) ? " STIBP_ALWAYS" : "", (e8b & cpufeat_mask(X86_FEATURE_IBRS_FAST)) ? " IBRS_FAST" : "", - (e8b & cpufeat_mask(X86_FEATURE_IBRS_SAME_MODE)) ? " IBRS_SAME_MODE" : ""); + (e8b & cpufeat_mask(X86_FEATURE_IBRS_SAME_MODE)) ? " IBRS_SAME_MODE" : "", + (e8b & cpufeat_mask(X86_FEATURE_BTC_NO)) ? " BTC_NO" : ""); /* Hardware features which need driving to mitigate issues. */ printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s\n", --- a/xen/include/public/arch-x86/cpufeatureset.h +++ b/xen/include/public/arch-x86/cpufeatureset.h @@ -266,6 +266,7 @@ XEN_CPUFEATURE(AMD_SSBD, 8*32+24) / XEN_CPUFEATURE(VIRT_SSBD, 8*32+25) /* MSR_VIRT_SPEC_CTRL.SSBD */ XEN_CPUFEATURE(SSB_NO, 8*32+26) /*A Hardware not vulnerable to SSB */ XEN_CPUFEATURE(PSFD, 8*32+28) /*S MSR_SPEC_CTRL.PSFD */ +XEN_CPUFEATURE(BTC_NO, 8*32+29) /*A Hardware not vulnerable to Branch Type Confusion */ /* Intel-defined CPU features, CPUID level 0x00000007:0.edx, word 9 */ XEN_CPUFEATURE(AVX512_4VNNIW, 9*32+ 2) /*A AVX512 Neural Network Instructions */ ++++++ 62cd91d6-x86-spec-ctrl-enable-Zen2-chickenbit.patch ++++++ # Commit 9deaf2d932f08c16c6b96a1c426e4b1142c0cdbe # Date 2022-07-12 16:23:00 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Enable Zen2 chickenbit ... as instructed in the Branch Type Confusion whitepaper. This is part of XSA-407. 
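(Reading aid only, not part of the patch: the new amd_init_spectral_chicken() added below amounts to a guarded read-modify-write of bit 1 in MSR 0xc00110e3, MSR_AMD64_DE_CFG2 as defined in the msr-index.h hunk. In plain C, assuming Xen's rdmsr_safe()/wrmsr_safe() helpers and feature tests, the logic is:

    static void zen2_set_chickenbit_sketch(void)
    {
        uint64_t val;
        const uint64_t chickenbit = 1ULL << 1;

        /* Zen2 only (STIBP used as the Zen1/Zen2 heuristic), and never
         * when running virtualised. */
        if ( cpu_has_hypervisor || !boot_cpu_has(X86_FEATURE_AMD_STIBP) )
            return;

        /* Set the unnamed bit if the MSR is readable and it isn't set yet. */
        if ( rdmsr_safe(MSR_AMD64_DE_CFG2, val) == 0 && !(val & chickenbit) )
            wrmsr_safe(MSR_AMD64_DE_CFG2, val | chickenbit);
    }
)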
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> --- a/xen/arch/x86/cpu/amd.c +++ b/xen/arch/x86/cpu/amd.c @@ -731,6 +731,31 @@ void amd_init_ssbd(const struct cpuinfo_ printk_once(XENLOG_ERR "No SSBD controls available\n"); } +/* + * On Zen2 we offer this chicken (bit) on the altar of Speculation. + * + * Refer to the AMD Branch Type Confusion whitepaper: + * https://XXX + * + * Setting this unnamed bit supposedly causes prediction information on + * non-branch instructions to be ignored. It is to be set unilaterally in + * newer microcode. + * + * This chickenbit is something unrelated on Zen1, and Zen1 vs Zen2 isn't a + * simple model number comparison, so use STIBP as a heuristic to separate the + * two uarches in Fam17h(AMD)/18h(Hygon). + */ +void amd_init_spectral_chicken(void) +{ + uint64_t val, chickenbit = 1 << 1; + + if (cpu_has_hypervisor || !boot_cpu_has(X86_FEATURE_AMD_STIBP)) + return; + + if (rdmsr_safe(MSR_AMD64_DE_CFG2, val) == 0 && !(val & chickenbit)) + wrmsr_safe(MSR_AMD64_DE_CFG2, val | chickenbit); +} + void __init detect_zen2_null_seg_behaviour(void) { uint64_t base; @@ -796,6 +821,9 @@ static void init_amd(struct cpuinfo_x86 amd_init_ssbd(c); + if (c->x86 == 0x17) + amd_init_spectral_chicken(); + /* Probe for NSCB on Zen2 CPUs when not virtualised */ if (!cpu_has_hypervisor && !cpu_has_nscb && c == &boot_cpu_data && c->x86 == 0x17) --- a/xen/arch/x86/cpu/cpu.h +++ b/xen/arch/x86/cpu/cpu.h @@ -22,4 +22,5 @@ void early_init_amd(struct cpuinfo_x86 * void amd_log_freq(const struct cpuinfo_x86 *c); void amd_init_lfence(struct cpuinfo_x86 *c); void amd_init_ssbd(const struct cpuinfo_x86 *c); +void amd_init_spectral_chicken(void); void detect_zen2_null_seg_behaviour(void); --- a/xen/arch/x86/cpu/hygon.c +++ b/xen/arch/x86/cpu/hygon.c @@ -41,6 +41,12 @@ static void init_hygon(struct cpuinfo_x8 detect_zen2_null_seg_behaviour(); /* + * TODO: Check heuristic safety with Hygon first + if (c->x86 == 0x18) + amd_init_spectral_chicken(); + */ + + /* * Hygon CPUs before Zen2 don't clear segment bases/limits when * loading a NULL selector. */ --- a/xen/include/asm-x86/msr-index.h +++ b/xen/include/asm-x86/msr-index.h @@ -361,6 +361,7 @@ #define MSR_AMD64_DE_CFG 0xc0011029 #define AMD64_DE_CFG_LFENCE_SERIALISE (_AC(1, ULL) << 1) #define MSR_AMD64_EX_CFG 0xc001102c +#define MSR_AMD64_DE_CFG2 0xc00110e3 #define MSR_AMD64_DR0_ADDRESS_MASK 0xc0011027 #define MSR_AMD64_DR1_ADDRESS_MASK 0xc0011019 ++++++ 62cd91d7-x86-spec-ctrl-mitigate-Branch-Type-Confusion.patch ++++++ # Commit d8cb7e0f069e0f106d24941355b59b45a731eabe # Date 2022-07-12 16:23:00 +0100 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/spec-ctrl: Mitigate Branch Type Confusion when possible Branch Type Confusion affects AMD/Hygon CPUs on Zen2 and earlier. To mitigate, we require SMT safety (STIBP on Zen2, no-SMT on Zen1), and to issue an IBPB on each entry to Xen, to flush the BTB. Due to performance concerns, dom0 (which is trusted in most configurations) is excluded from protections by default. Therefore: * Use STIBP by default on Zen2 too, which now means we want it on by default on all hardware supporting STIBP. * Break the current IBPB logic out into a new function, extending it with IBPB-at-entry logic. * Change the existing IBPB-at-ctxt-switch boolean to be tristate, and disable it by default when IBPB-at-entry is providing sufficient safety. 
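(Usage sketch, not part of the patch: since dom0 is left unprotected by default as explained above, an administrator who wants IBPB-on-entry for dom0 as well would add `spec-ctrl=ibpb-entry` to the Xen command line. Assuming a GRUB2-based host, one way to do that is:

    # /etc/default/grub  (path assumed; adjust to the local bootloader setup)
    GRUB_CMDLINE_XEN_DEFAULT="spec-ctrl=ibpb-entry"

then regenerate the configuration, e.g. with `grub2-mkconfig -o /boot/grub2/grub.cfg`, and reboot.)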
If all PV guests on the system are trusted, then it is recommended to boot with `spec-ctrl=ibpb-entry=no-pv`, as this will provide an additional marginal perf improvement. This is part of XSA-407 / CVE-2022-23825. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -2234,7 +2234,7 @@ By default SSBD will be mitigated at run ### spec-ctrl (x86)
`= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>, -> {msr-sc,rsb,md-clear}=<bool>|{pv,hvm}=<bool>, +> {msr-sc,rsb,md-clear,ibpb-entry}=<bool>|{pv,hvm}=<bool>, bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd, eager-fpu,l1d-flush,branch-harden,srb-lock, unpriv-mmio}=<bool> ]` @@ -2259,9 +2259,10 @@ in place for guests to use.
Use of a positive boolean value for either of these options is invalid. -The `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` options offer fine -grained control over the primitives by Xen. These impact Xen's ability to -protect itself, and/or Xen's ability to virtualise support for guests to use. +The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `md-clear=` and `ibpb-entry=` options +offer fine grained control over the primitives by Xen. These impact Xen's +ability to protect itself, and/or Xen's ability to virtualise support for +guests to use. * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests respectively. @@ -2280,6 +2281,11 @@ protect itself, and/or Xen's ability to compatibility with development versions of this fix, `mds=` is also accepted on Xen 4.12 and earlier as an alias. Consult vendor documentation in preference to here.* +* `ibpb-entry=` offers control over whether IBPB (Indirect Branch Prediction + Barrier) is used on entry to Xen. This is used by default on hardware + vulnerable to Branch Type Confusion, but for performance reasons, dom0 is + unprotected by default. If it necessary to protect dom0 too, boot with + `spec-ctrl=ibpb-entry`. If Xen was compiled with INDIRECT_THUNK support, `bti-thunk=` can be used to select which of the thunks gets patched into the `__x86_indirect_thunk_%reg` --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -39,6 +39,10 @@ static bool __initdata opt_rsb_hvm = tru static int8_t __read_mostly opt_md_clear_pv = -1; static int8_t __read_mostly opt_md_clear_hvm = -1; +static int8_t __read_mostly opt_ibpb_entry_pv = -1; +static int8_t __read_mostly opt_ibpb_entry_hvm = -1; +static bool __read_mostly opt_ibpb_entry_dom0; + /* Cmdline controls for Xen's speculative settings. */ static enum ind_thunk { THUNK_DEFAULT, /* Decide which thunk to use at boot time. */ @@ -54,7 +58,7 @@ int8_t __initdata opt_stibp = -1; bool __read_mostly opt_ssbd; int8_t __initdata opt_psfd = -1; -bool __read_mostly opt_ibpb_ctxt_switch = true; +int8_t __read_mostly opt_ibpb_ctxt_switch = -1; int8_t __read_mostly opt_eager_fpu = -1; int8_t __read_mostly opt_l1d_flush = -1; static bool __initdata opt_branch_harden = true; @@ -114,6 +118,9 @@ static int __init parse_spec_ctrl(const opt_rsb_hvm = false; opt_md_clear_pv = 0; opt_md_clear_hvm = 0; + opt_ibpb_entry_pv = 0; + opt_ibpb_entry_hvm = 0; + opt_ibpb_entry_dom0 = false; opt_thunk = THUNK_JMP; opt_ibrs = 0; @@ -140,12 +147,14 @@ static int __init parse_spec_ctrl(const opt_msr_sc_pv = val; opt_rsb_pv = val; opt_md_clear_pv = val; + opt_ibpb_entry_pv = val; } else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) { opt_msr_sc_hvm = val; opt_rsb_hvm = val; opt_md_clear_hvm = val; + opt_ibpb_entry_hvm = val; } else if ( (val = parse_boolean("msr-sc", s, ss)) != -1 ) { @@ -210,6 +219,28 @@ static int __init parse_spec_ctrl(const break; } } + else if ( (val = parse_boolean("ibpb-entry", s, ss)) != -1 ) + { + switch ( val ) + { + case 0: + case 1: + opt_ibpb_entry_pv = opt_ibpb_entry_hvm = + opt_ibpb_entry_dom0 = val; + break; + + case -2: + s += strlen("ibpb-entry="); + if ( (val = parse_boolean("pv", s, ss)) >= 0 ) + opt_ibpb_entry_pv = val; + else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) + opt_ibpb_entry_hvm = val; + else + default: + rc = -EINVAL; + break; + } + } /* Xen's speculative sidechannel mitigation settings. */ else if ( !strncmp(s, "bti-thunk=", 10) ) @@ -477,27 +508,31 @@ static void __init print_details(enum in * mitigation support for guests. 
*/ #ifdef CONFIG_HVM - printk(" Support for HVM VMs:%s%s%s%s%s\n", + printk(" Support for HVM VMs:%s%s%s%s%s%s\n", (boot_cpu_has(X86_FEATURE_SC_MSR_HVM) || boot_cpu_has(X86_FEATURE_SC_RSB_HVM) || boot_cpu_has(X86_FEATURE_MD_CLEAR) || + boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM) || opt_eager_fpu) ? "" : " None", boot_cpu_has(X86_FEATURE_SC_MSR_HVM) ? " MSR_SPEC_CTRL" : "", boot_cpu_has(X86_FEATURE_SC_RSB_HVM) ? " RSB" : "", opt_eager_fpu ? " EAGER_FPU" : "", - boot_cpu_has(X86_FEATURE_MD_CLEAR) ? " MD_CLEAR" : ""); + boot_cpu_has(X86_FEATURE_MD_CLEAR) ? " MD_CLEAR" : "", + boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM) ? " IBPB-entry" : ""); #endif #ifdef CONFIG_PV - printk(" Support for PV VMs:%s%s%s%s%s\n", + printk(" Support for PV VMs:%s%s%s%s%s%s\n", (boot_cpu_has(X86_FEATURE_SC_MSR_PV) || boot_cpu_has(X86_FEATURE_SC_RSB_PV) || boot_cpu_has(X86_FEATURE_MD_CLEAR) || + boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) || opt_eager_fpu) ? "" : " None", boot_cpu_has(X86_FEATURE_SC_MSR_PV) ? " MSR_SPEC_CTRL" : "", boot_cpu_has(X86_FEATURE_SC_RSB_PV) ? " RSB" : "", opt_eager_fpu ? " EAGER_FPU" : "", - boot_cpu_has(X86_FEATURE_MD_CLEAR) ? " MD_CLEAR" : ""); + boot_cpu_has(X86_FEATURE_MD_CLEAR) ? " MD_CLEAR" : "", + boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ? " IBPB-entry" : ""); printk(" XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n", opt_xpti_hwdom ? "enabled" : "disabled", @@ -759,6 +794,55 @@ static bool __init should_use_eager_fpu( } } +static void __init ibpb_calculations(void) +{ + /* Check we have hardware IBPB support before using it... */ + if ( !boot_cpu_has(X86_FEATURE_IBRSB) && !boot_cpu_has(X86_FEATURE_IBPB) ) + { + opt_ibpb_entry_hvm = opt_ibpb_entry_pv = opt_ibpb_ctxt_switch = 0; + opt_ibpb_entry_dom0 = false; + return; + } + + /* + * IBPB-on-entry mitigations for Branch Type Confusion. + * + * IBPB && !BTC_NO selects all AMD/Hygon hardware, not known to be safe, + * that we can provide some form of mitigation on. + */ + if ( opt_ibpb_entry_pv == -1 ) + opt_ibpb_entry_pv = (IS_ENABLED(CONFIG_PV) && + boot_cpu_has(X86_FEATURE_IBPB) && + !boot_cpu_has(X86_FEATURE_BTC_NO)); + if ( opt_ibpb_entry_hvm == -1 ) + opt_ibpb_entry_hvm = (IS_ENABLED(CONFIG_HVM) && + boot_cpu_has(X86_FEATURE_IBPB) && + !boot_cpu_has(X86_FEATURE_BTC_NO)); + + if ( opt_ibpb_entry_pv ) + { + setup_force_cpu_cap(X86_FEATURE_IBPB_ENTRY_PV); + + /* + * We only need to flush in IST context if we're protecting against PV + * guests. HVM IBPB-on-entry protections are both atomic with + * NMI/#MC, so can't interrupt Xen ahead of having already flushed the + * BTB. + */ + default_spec_ctrl_flags |= SCF_ist_ibpb; + } + if ( opt_ibpb_entry_hvm ) + setup_force_cpu_cap(X86_FEATURE_IBPB_ENTRY_HVM); + + /* + * If we're using IBPB-on-entry to protect against PV and HVM guests + * (ignoring dom0 if trusted), then there's no need to also issue IBPB on + * context switch too. + */ + if ( opt_ibpb_ctxt_switch == -1 ) + opt_ibpb_ctxt_switch = !(opt_ibpb_entry_hvm && opt_ibpb_entry_pv); +} + /* Calculate whether this CPU is vulnerable to L1TF. */ static __init void l1tf_calculations(uint64_t caps) { @@ -1014,8 +1098,12 @@ void spec_ctrl_init_domain(struct domain bool verw = ((pv ? opt_md_clear_pv : opt_md_clear_hvm) || (opt_fb_clear_mmio && is_iommu_enabled(d))); + bool ibpb = ((pv ? opt_ibpb_entry_pv : opt_ibpb_entry_hvm) && + (d->domain_id != 0 || opt_ibpb_entry_dom0)); + d->arch.spec_ctrl_flags = (verw ? SCF_verw : 0) | + (ibpb ? 
SCF_entry_ibpb : 0) | 0; } @@ -1162,12 +1250,15 @@ void __init init_speculation_mitigations } /* - * Use STIBP by default if the hardware hint is set. Otherwise, leave it - * off as it a severe performance pentalty on pre-eIBRS Intel hardware - * where it was retrofitted in microcode. + * Use STIBP by default on all AMD systems. Zen3 and later enumerate + * STIBP_ALWAYS, but STIBP is needed on Zen2 as part of the mitigations + * for Branch Type Confusion. + * + * Leave STIBP off by default on Intel. Pre-eIBRS systems suffer a + * substantial perf hit when it was implemented in microcode. */ if ( opt_stibp == -1 ) - opt_stibp = !!boot_cpu_has(X86_FEATURE_STIBP_ALWAYS); + opt_stibp = !!boot_cpu_has(X86_FEATURE_AMD_STIBP); if ( opt_stibp && (boot_cpu_has(X86_FEATURE_STIBP) || boot_cpu_has(X86_FEATURE_AMD_STIBP)) ) @@ -1239,9 +1330,7 @@ void __init init_speculation_mitigations if ( opt_rsb_hvm ) setup_force_cpu_cap(X86_FEATURE_SC_RSB_HVM); - /* Check we have hardware IBPB support before using it... */ - if ( !boot_cpu_has(X86_FEATURE_IBRSB) && !boot_cpu_has(X86_FEATURE_IBPB) ) - opt_ibpb_ctxt_switch = false; + ibpb_calculations(); /* Check whether Eager FPU should be enabled by default. */ if ( opt_eager_fpu == -1 ) --- a/xen/include/asm-x86/spec_ctrl.h +++ b/xen/include/asm-x86/spec_ctrl.h @@ -65,7 +65,7 @@ void init_speculation_mitigations(void); void spec_ctrl_init_domain(struct domain *d); -extern bool opt_ibpb_ctxt_switch; +extern int8_t opt_ibpb_ctxt_switch; extern bool opt_ssbd; extern int8_t opt_eager_fpu; extern int8_t opt_l1d_flush; ++++++ gcc12-fixes.patch ++++++ --- /var/tmp/diff_new_pack.TxjnPB/_old 2022-08-01 21:28:13.229281472 +0200 +++ /var/tmp/diff_new_pack.TxjnPB/_new 2022-08-01 21:28:13.233281484 +0200 @@ -2,61 +2,21 @@ Compiling against gcc12. -Many of the failures are -Werror=array-bounds where macros -from mm.h are being used. Common Examples are, -include/asm/mm.h:528:61: error: array subscript 0 is outside array bounds of 'long unsigned int[0]' [-Werror=array-bounds] -include/xen/mm.h:287:21: error: array subscript [0, 288230376151711743] is outside array bounds of 'struct page_info[0]' [-Werror=array-bounds] - -There are also several other headers that generate array-bounds macro failures. 
-The pragmas to override are mostly in '.c' files with the exception of, -xen/arch/x86/mm/shadow/private.h -xen/include/asm-x86/paging.h - - -Index: xen-4.16.1-testing/xen/drivers/passthrough/amd/iommu_intr.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/drivers/passthrough/amd/iommu_intr.c -+++ xen-4.16.1-testing/xen/drivers/passthrough/amd/iommu_intr.c -@@ -23,6 +23,10 @@ - - #include "iommu.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - union irte32 { - uint32_t raw; - struct { -Index: xen-4.16.1-testing/xen/drivers/passthrough/x86/hvm.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/drivers/passthrough/x86/hvm.c -+++ xen-4.16.1-testing/xen/drivers/passthrough/x86/hvm.c -@@ -901,6 +901,9 @@ static void __hvm_dpci_eoi(struct domain - hvm_pirq_eoi(pirq); - } - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Waddress" -+#endif - static void hvm_gsi_eoi(struct domain *d, unsigned int gsi) - { - struct pirq *pirq = pirq_info(d, gsi); -Index: xen-4.16.1-testing/xen/common/domctl.c +Index: xen-4.16.1-testing/xen/arch/x86/tboot.c =================================================================== ---- xen-4.16.1-testing.orig/xen/common/domctl.c -+++ xen-4.16.1-testing/xen/common/domctl.c -@@ -32,6 +32,10 @@ - #include <public/domctl.h> - #include <xsm/xsm.h> +--- xen-4.16.1-testing.orig/xen/arch/x86/tboot.c ++++ xen-4.16.1-testing/xen/arch/x86/tboot.c +@@ -16,6 +16,10 @@ + #include <asm/setup.h> + #include <crypto/vmac.h> +#if __GNUC__ >= 12 +#pragma GCC diagnostic ignored "-Warray-bounds" +#endif + - static DEFINE_SPINLOCK(domctl_lock); - - static int nodemask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_nodemap, + /* tboot=<physical address of shared page> */ + static unsigned long __initdata opt_tboot_pa; + integer_param("tboot", opt_tboot_pa); Index: xen-4.16.1-testing/xen/common/efi/boot.c =================================================================== --- xen-4.16.1-testing.orig/xen/common/efi/boot.c @@ -72,36 +32,6 @@ #define EFI_REVISION(major, minor) (((major) << 16) | (minor)) #define SMBIOS3_TABLE_GUID \ -Index: xen-4.16.1-testing/xen/common/xmalloc_tlsf.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/common/xmalloc_tlsf.c -+++ xen-4.16.1-testing/xen/common/xmalloc_tlsf.c -@@ -28,6 +28,10 @@ - #include <xen/pfn.h> - #include <asm/time.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - #define MAX_POOL_NAME_LEN 16 - - /* Some IMPORTANT TLSF parameters */ -Index: xen-4.16.1-testing/xen/common/memory.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/common/memory.c -+++ xen-4.16.1-testing/xen/common/memory.c -@@ -35,6 +35,10 @@ - #include <asm/guest.h> - #endif - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - struct memop_args { - /* INPUT */ - struct domain *domain; /* Domain to be affected. */ Index: xen-4.16.1-testing/xen/common/page_alloc.c =================================================================== --- xen-4.16.1-testing.orig/xen/common/page_alloc.c @@ -117,313 +47,4 @@ /* * Comma-separated list of hexadecimal page numbers containing bad bytes. * e.g. 'badpage=0x3f45,0x8a321'. 
-@@ -1529,6 +1533,7 @@ static void free_heap_pages( - } - - -+ - /* - * Following rules applied for page offline: - * Once a page is broken, it can't be assigned anymore -Index: xen-4.16.1-testing/xen/common/vmap.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/common/vmap.c -+++ xen-4.16.1-testing/xen/common/vmap.c -@@ -9,6 +9,10 @@ - #include <xen/vmap.h> - #include <asm/page.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - static DEFINE_SPINLOCK(vm_lock); - static void *__read_mostly vm_base[VMAP_REGION_NR]; - #define vm_bitmap(x) ((unsigned long *)vm_base[x]) -Index: xen-4.16.1-testing/xen/include/asm-x86/paging.h -=================================================================== ---- xen-4.16.1-testing.orig/xen/include/asm-x86/paging.h -+++ xen-4.16.1-testing/xen/include/asm-x86/paging.h -@@ -32,6 +32,10 @@ - #include <asm/flushtlb.h> - #include <asm/domain.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - /***************************************************************************** - * Macros to tell which paging mode a domain is in */ - -Index: xen-4.16.1-testing/xen/arch/x86/x86_64/traps.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/x86_64/traps.c -+++ xen-4.16.1-testing/xen/arch/x86/x86_64/traps.c -@@ -25,6 +25,9 @@ - #include <asm/hvm/hvm.h> - #include <asm/hvm/support.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif - - static void print_xen_info(void) - { -Index: xen-4.16.1-testing/xen/arch/x86/cpu/mcheck/mcaction.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/cpu/mcheck/mcaction.c -+++ xen-4.16.1-testing/xen/arch/x86/cpu/mcheck/mcaction.c -@@ -4,6 +4,10 @@ - #include "vmce.h" - #include "mce.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - static struct mcinfo_recovery * - mci_action_add_pageoffline(int bank, struct mc_info *mi, - mfn_t mfn, uint32_t status) -Index: xen-4.16.1-testing/xen/arch/x86/cpu/mcheck/mce.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/cpu/mcheck/mce.c -+++ xen-4.16.1-testing/xen/arch/x86/cpu/mcheck/mce.c -@@ -30,6 +30,10 @@ - #include "util.h" - #include "vmce.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - bool __read_mostly opt_mce = true; - boolean_param("mce", opt_mce); - bool __read_mostly mce_broadcast; -Index: xen-4.16.1-testing/xen/arch/x86/hvm/hvm.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/hvm/hvm.c -+++ xen-4.16.1-testing/xen/arch/x86/hvm/hvm.c -@@ -81,6 +81,10 @@ - - #include <compat/hvm/hvm_op.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - bool_t __read_mostly hvm_enabled; - - #ifdef DBG_LEVEL_0 -Index: xen-4.16.1-testing/xen/arch/x86/pv/dom0_build.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/pv/dom0_build.c -+++ xen-4.16.1-testing/xen/arch/x86/pv/dom0_build.c -@@ -22,6 +22,10 @@ - #include <asm/pv/mm.h> - #include <asm/setup.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - /* Allow ring-3 access in long mode as guest cannot use ring 1 ... 
*/ - #define BASE_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_USER) - #define L1_PROT (BASE_PROT|_PAGE_GUEST_KERNEL) -Index: xen-4.16.1-testing/xen/arch/x86/pv/ro-page-fault.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/pv/ro-page-fault.c -+++ xen-4.16.1-testing/xen/arch/x86/pv/ro-page-fault.c -@@ -26,6 +26,10 @@ - #include "emulate.h" - #include "mm.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - /********************* - * Writable Pagetables - */ -Index: xen-4.16.1-testing/xen/arch/x86/pv/emul-priv-op.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/pv/emul-priv-op.c -+++ xen-4.16.1-testing/xen/arch/x86/pv/emul-priv-op.c -@@ -40,6 +40,10 @@ - #include "emulate.h" - #include "mm.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - struct priv_op_ctxt { - struct x86_emulate_ctxt ctxt; - struct { -Index: xen-4.16.1-testing/xen/arch/x86/pv/mm.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/pv/mm.c -+++ xen-4.16.1-testing/xen/arch/x86/pv/mm.c -@@ -26,6 +26,10 @@ - - #include "mm.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - /* - * Get a mapping of a PV guest's l1e for this linear address. The return - * pointer should be unmapped using unmap_domain_page(). -Index: xen-4.16.1-testing/xen/arch/x86/domain_page.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/domain_page.c -+++ xen-4.16.1-testing/xen/arch/x86/domain_page.c -@@ -18,6 +18,10 @@ - #include <asm/hardirq.h> - #include <asm/setup.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - static DEFINE_PER_CPU(struct vcpu *, override); - - static inline struct vcpu *mapcache_current_vcpu(void) -Index: xen-4.16.1-testing/xen/arch/x86/mm/shadow/private.h -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/mm/shadow/private.h -+++ xen-4.16.1-testing/xen/arch/x86/mm/shadow/private.h -@@ -33,6 +33,10 @@ - - #include "../mm-locks.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - /****************************************************************************** - * Levels of self-test and paranoia - */ -Index: xen-4.16.1-testing/xen/arch/x86/mm/hap/hap.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/mm/hap/hap.c -+++ xen-4.16.1-testing/xen/arch/x86/mm/hap/hap.c -@@ -42,6 +42,10 @@ - - #include "private.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - /************************************************/ - /* HAP VRAM TRACKING SUPPORT */ - /************************************************/ -Index: xen-4.16.1-testing/xen/arch/x86/mm/p2m-pod.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/mm/p2m-pod.c -+++ xen-4.16.1-testing/xen/arch/x86/mm/p2m-pod.c -@@ -31,6 +31,10 @@ - - #include "mm-locks.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - #define superpage_aligned(_x) (((_x)&(SUPERPAGE_PAGES-1))==0) - - /* Enforce lock ordering when grabbing the "external" page_alloc lock */ -Index: xen-4.16.1-testing/xen/arch/x86/mm/p2m-ept.c 
-=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/mm/p2m-ept.c -+++ xen-4.16.1-testing/xen/arch/x86/mm/p2m-ept.c -@@ -36,6 +36,10 @@ - - #include "mm-locks.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - #define atomic_read_ept_entry(__pepte) \ - ( (ept_entry_t) { .epte = read_atomic(&(__pepte)->epte) } ) - -Index: xen-4.16.1-testing/xen/arch/x86/mm/p2m.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/mm/p2m.c -+++ xen-4.16.1-testing/xen/arch/x86/mm/p2m.c -@@ -44,6 +44,10 @@ - - #include "mm-locks.h" - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - /* Override macro from asm/page.h to make work with mfn_t */ - #undef virt_to_mfn - #define virt_to_mfn(v) _mfn(__virt_to_mfn(v)) -Index: xen-4.16.1-testing/xen/arch/x86/tboot.c -=================================================================== ---- xen-4.16.1-testing.orig/xen/arch/x86/tboot.c -+++ xen-4.16.1-testing/xen/arch/x86/tboot.c -@@ -16,6 +16,10 @@ - #include <asm/setup.h> - #include <crypto/vmac.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - /* tboot=<physical address of shared page> */ - static unsigned long __initdata opt_tboot_pa; - integer_param("tboot", opt_tboot_pa); -Index: xen-4.16.1-testing/tools/firmware/hvmloader/ovmf.c -=================================================================== ---- xen-4.16.1-testing.orig/tools/firmware/hvmloader/ovmf.c -+++ xen-4.16.1-testing/tools/firmware/hvmloader/ovmf.c -@@ -34,6 +34,11 @@ - #include <xen/hvm/ioreq.h> - #include <xen/memory.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#pragma GCC diagnostic ignored "-Wstringop-overflow" -+#endif -+ - #define OVMF_MAXOFFSET 0x000FFFFFULL - #define OVMF_END 0x100000000ULL - #define LOWCHUNK_BEGIN 0x000F0000 -Index: xen-4.16.1-testing/tools/firmware/hvmloader/seabios.c -=================================================================== ---- xen-4.16.1-testing.orig/tools/firmware/hvmloader/seabios.c -+++ xen-4.16.1-testing/tools/firmware/hvmloader/seabios.c -@@ -29,6 +29,11 @@ - #include <acpi2_0.h> - #include <libacpi.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#pragma GCC diagnostic ignored "-Wstringop-overflow" -+#endif -+ - struct seabios_info { - char signature[14]; /* XenHVMSeaBIOS\0 */ - uint8_t length; /* Length of this struct */ -Index: xen-4.16.1-testing/tools/firmware/hvmloader/util.c -=================================================================== ---- xen-4.16.1-testing.orig/tools/firmware/hvmloader/util.c -+++ xen-4.16.1-testing/tools/firmware/hvmloader/util.c -@@ -31,6 +31,10 @@ - #include <xen/hvm/hvm_xs_strings.h> - #include <xen/hvm/params.h> - -+#if __GNUC__ >= 12 -+#pragma GCC diagnostic ignored "-Warray-bounds" -+#endif -+ - /* - * Check whether there exists overlap in the specified memory range. - * Returns true if exists, else returns false. ++++++ xsa408.patch ++++++ x86/mm: correct TLB flush condition in _get_page_type() When this logic was moved, it was moved across the point where nx is updated to hold the new type for the page. IOW originally it was equivalent to using x (and perhaps x would better have been used), but now it isn't anymore. Switch to using x, which then brings things in line again with the slightly earlier comment there (now) talking about transitions _from_ writable. 
I have to confess though that I cannot make a direct connection between the reported observed behavior of guests leaving several pages around with pending general references and the change here. Repeated testing, nevertheless, confirms the reported issue is no longer there. This is XSA-???. Fixes: 8cc5036bc385 ("x86/pv: Fix ABAC cmpxchg() race in _get_page_type()") Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> --- Furthermore aren't we using ->tlbflush_timestamp there even when the shadow_flags union member is active, i.e. for PGC_page_table pages? I for one can't convince myself that this isn't possible with OOS active (and {page,mfn}_oos_may_write() producing "true" for a page). I'd be happy to update the description to actually connect things, as long as someone can give some plausible explanation. --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -3004,7 +3004,7 @@ static int _get_page_type(struct page_in if ( unlikely(!cpumask_empty(mask)) && /* Shadow mode: track only writable pages. */ (!shadow_mode_enabled(d) || - ((nx & PGT_type_mask) == PGT_writable_page)) ) + ((x & PGT_type_mask) == PGT_writable_page)) ) { perfc_incr(need_flush_tlb_flush); /*