Hello community, here is the log from the commit of package xen for openSUSE:Factory checked in at 2018-03-30 12:00:34 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Comparing /work/SRC/openSUSE:Factory/xen (Old) and /work/SRC/openSUSE:Factory/.xen.new (New) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Package is "xen" Fri Mar 30 12:00:34 2018 rev:245 rq:591751 version:4.10.0_16 Changes: -------- --- /work/SRC/openSUSE:Factory/xen/xen.changes 2018-03-20 21:50:48.542316318 +0100 +++ /work/SRC/openSUSE:Factory/.xen.new/xen.changes 2018-03-30 12:00:43.480265750 +0200 @@ -1,0 +2,12 @@ +Mon Mar 26 08:20:45 MDT 2018 - carnold@suse.com + +- Upstream patches from Jan (bsc#1027519) and fixes related to + Page Table Isolation (XPTI). See also bsc#1074562 XSA-254 + 5a856a2b-x86-xpti-hide-almost-all-of-Xen-image-mappings.patch + 5a9eb7f1-x86-xpti-dont-map-stack-guard-pages.patch + 5a9eb85c-x86-slightly-reduce-XPTI-overhead.patch + 5a9eb890-x86-remove-CR-reads-from-exit-to-guest-path.patch + 5aa2b6b9-cpufreq-ondemand-CPU-offlining-race.patch + 5aaa9878-x86-vlapic-clear-TMR-bit-for-edge-triggered-intr.patch + +------------------------------------------------------------------- New: ---- 5a856a2b-x86-xpti-hide-almost-all-of-Xen-image-mappings.patch 5a9eb7f1-x86-xpti-dont-map-stack-guard-pages.patch 5a9eb85c-x86-slightly-reduce-XPTI-overhead.patch 5a9eb890-x86-remove-CR-reads-from-exit-to-guest-path.patch 5aa2b6b9-cpufreq-ondemand-CPU-offlining-race.patch 5aaa9878-x86-vlapic-clear-TMR-bit-for-edge-triggered-intr.patch ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Other differences: ------------------ ++++++ xen.spec ++++++ --- /var/tmp/diff_new_pack.XjVnAv/_old 2018-03-30 12:00:46.148169274 +0200 +++ /var/tmp/diff_new_pack.XjVnAv/_new 2018-03-30 12:00:46.152169129 +0200 @@ -126,7 +126,7 @@ BuildRequires: pesign-obs-integration %endif -Version: 4.10.0_14 +Version: 4.10.0_16 Release: 0 Summary: Xen 
Virtualization: Hypervisor (aka VMM aka Microkernel) License: GPL-2.0 @@ -211,13 +211,19 @@ Patch48: 5a843807-x86-spec_ctrl-fix-bugs-in-SPEC_CTRL_ENTRY_FROM_INTR_IST.patch Patch49: 5a856a2b-x86-emul-fix-64bit-decoding-of-segment-overrides.patch Patch50: 5a856a2b-x86-use-32bit-xors-for-clearing-GPRs.patch -Patch51: 5a8be788-x86-nmi-start-NMI-watchdog-on-CPU0-after-SMP.patch -Patch52: 5a95373b-x86-PV-avoid-leaking-other-guests-MSR_TSC_AUX.patch -Patch53: 5a95571f-memory-dont-implicitly-unpin-in-decrease-res.patch -Patch54: 5a95576c-gnttab-ARM-dont-corrupt-shared-GFN-array.patch -Patch55: 5a955800-gnttab-dont-free-status-pages-on-ver-change.patch -Patch56: 5a955854-x86-disallow-HVM-creation-without-LAPIC-emul.patch -Patch57: 5a956747-x86-HVM-dont-give-wrong-impression-of-WRMSR-success.patch +Patch51: 5a856a2b-x86-xpti-hide-almost-all-of-Xen-image-mappings.patch +Patch52: 5a8be788-x86-nmi-start-NMI-watchdog-on-CPU0-after-SMP.patch +Patch53: 5a95373b-x86-PV-avoid-leaking-other-guests-MSR_TSC_AUX.patch +Patch54: 5a95571f-memory-dont-implicitly-unpin-in-decrease-res.patch +Patch55: 5a95576c-gnttab-ARM-dont-corrupt-shared-GFN-array.patch +Patch56: 5a955800-gnttab-dont-free-status-pages-on-ver-change.patch +Patch57: 5a955854-x86-disallow-HVM-creation-without-LAPIC-emul.patch +Patch58: 5a956747-x86-HVM-dont-give-wrong-impression-of-WRMSR-success.patch +Patch59: 5a9eb7f1-x86-xpti-dont-map-stack-guard-pages.patch +Patch60: 5a9eb85c-x86-slightly-reduce-XPTI-overhead.patch +Patch61: 5a9eb890-x86-remove-CR-reads-from-exit-to-guest-path.patch +Patch62: 5aa2b6b9-cpufreq-ondemand-CPU-offlining-race.patch +Patch63: 5aaa9878-x86-vlapic-clear-TMR-bit-for-edge-triggered-intr.patch # Our platform specific patches Patch400: xen-destdir.patch Patch401: vif-bridge-no-iptables.patch @@ -465,6 +471,12 @@ %patch55 -p1 %patch56 -p1 %patch57 -p1 +%patch58 -p1 +%patch59 -p1 +%patch60 -p1 +%patch61 -p1 +%patch62 -p1 +%patch63 -p1 # Our platform specific patches %patch400 -p1 %patch401 -p1 ++++++ 
5a856a2b-x86-xpti-hide-almost-all-of-Xen-image-mappings.patch ++++++ # Commit 422588e88511d17984544c0f017a927de3315290 # Date 2018-02-15 11:08:27 +0000 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings The current XPTI implementation isolates the directmap (and therefore a lot of guest data), but a large quantity of CPU0's state (including its stack) remains visible. Furthermore, an attacker able to read .text is in a vastly superior position to normal when it comes to fingerprinting Xen for known vulnerabilities, or scanning for ROP/Spectre gadgets. Collect together the entrypoints in .text.entry (currently 3x4k frames, but can almost certainly be slimmed down), and create a common mapping which is inserted into each per-cpu shadow. The stubs are also inserted into this mapping by pointing at the in-use L2. This allows stubs allocated later (SMP boot, or CPU hotplug) to work without further changes to the common mappings. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> # Commit d1d6fc97d66cf56847fc0bcc2ddc370707c22378 # Date 2018-03-06 16:46:27 +0100 # Author Jan Beulich <jbeulich@suse.com> # Committer Jan Beulich <jbeulich@suse.com> x86/xpti: really hide almost all of Xen image Commit 422588e885 ("x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings") carefully limited the Xen image cloning to just entry code, but then overwrote the just allocated and populated L3 entry with the normal one again covering both Xen image and stubs. Drop the respective code in favor of an explicit clone_mapping() invocation. This in turn now requires setup_cpu_root_pgt() to run after stub setup in all cases. Additionally, with (almost) no unintended mappings left, the BSP's IDT now also needs to be page aligned. 
The moving ahead of cleanup_cpu_root_pgt() is not strictly necessary for functionality, but things are more logical this way, and we retain cleanup being done in the inverse order of setup. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> # Commit 044fedfaa29b5d5774196e3fc7d955a48bfceac4 # Date 2018-03-09 15:42:24 +0000 # Author Andrew Cooper <andrew.cooper3@citrix.com> # Committer Andrew Cooper <andrew.cooper3@citrix.com> x86/traps: Put idt_table[] back into .bss c/s d1d6fc97d "x86/xpti: really hide almost all of Xen image" accidentally moved idt_table[] from .bss to .data by virtue of using the page_aligned section. We also have .bss.page_aligned, so use that. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> --- a/docs/misc/xen-command-line.markdown +++ b/docs/misc/xen-command-line.markdown @@ -1897,9 +1897,6 @@ mode. Override default selection of whether to isolate 64-bit PV guest page tables. -** WARNING: Not yet a complete isolation implementation, but better than -nothing. ** - ### xsave
> `= <boolean>`
--- a/xen/arch/x86/smpboot.c +++ b/xen/arch/x86/smpboot.c @@ -644,13 +644,24 @@ static int clone_mapping(const void *ptr { unsigned long linear = (unsigned long)ptr, pfn; unsigned int flags; - l3_pgentry_t *pl3e = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]) + - l3_table_offset(linear); + l3_pgentry_t *pl3e; l2_pgentry_t *pl2e; l1_pgentry_t *pl1e; - if ( linear < DIRECTMAP_VIRT_START ) - return 0; + /* + * Sanity check 'linear'. We only allow cloning from the Xen virtual + * range, and in particular, only from the directmap and .text ranges. + */ + if ( root_table_offset(linear) > ROOT_PAGETABLE_LAST_XEN_SLOT || + root_table_offset(linear) < ROOT_PAGETABLE_FIRST_XEN_SLOT ) + return -EINVAL; + + if ( linear < XEN_VIRT_START || + (linear >= XEN_VIRT_END && linear < DIRECTMAP_VIRT_START) ) + return -EINVAL; + + pl3e = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]) + + l3_table_offset(linear); flags = l3e_get_flags(*pl3e); ASSERT(flags & _PAGE_PRESENT); @@ -742,6 +753,10 @@ static __read_mostly int8_t opt_xpti = - boolean_param("xpti", opt_xpti); DEFINE_PER_CPU(root_pgentry_t *, root_pgt); +static root_pgentry_t common_pgt; + +extern const char _stextentry[], _etextentry[]; + static int setup_cpu_root_pgt(unsigned int cpu) { root_pgentry_t *rpt; @@ -762,8 +777,23 @@ static int setup_cpu_root_pgt(unsigned i idle_pg_table[root_table_offset(RO_MPT_VIRT_START)]; /* SH_LINEAR_PT inserted together with guest mappings. */ /* PERDOMAIN inserted during context switch. */ - rpt[root_table_offset(XEN_VIRT_START)] = - idle_pg_table[root_table_offset(XEN_VIRT_START)]; + + /* One-time setup of common_pgt, which maps .text.entry and the stubs. 
*/ + if ( unlikely(!root_get_intpte(common_pgt)) ) + { + const char *ptr; + + for ( rc = 0, ptr = _stextentry; + !rc && ptr < _etextentry; ptr += PAGE_SIZE ) + rc = clone_mapping(ptr, rpt); + + if ( rc ) + return rc; + + common_pgt = rpt[root_table_offset(XEN_VIRT_START)]; + } + + rpt[root_table_offset(XEN_VIRT_START)] = common_pgt; /* Install direct map page table entries for stack, IDT, and TSS. */ for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE ) @@ -773,6 +803,8 @@ static int setup_cpu_root_pgt(unsigned i rc = clone_mapping(idt_tables[cpu], rpt); if ( !rc ) rc = clone_mapping(&per_cpu(init_tss, cpu), rpt); + if ( !rc ) + rc = clone_mapping((void *)per_cpu(stubs.addr, cpu), rpt); return rc; } @@ -781,6 +813,7 @@ static void cleanup_cpu_root_pgt(unsigne { root_pgentry_t *rpt = per_cpu(root_pgt, cpu); unsigned int r; + unsigned long stub_linear = per_cpu(stubs.addr, cpu); if ( !rpt ) return; @@ -825,6 +858,16 @@ static void cleanup_cpu_root_pgt(unsigne } free_xen_pagetable(rpt); + + /* Also zap the stub mapping for this CPU. 
*/ + if ( stub_linear ) + { + l3_pgentry_t *l3t = l4e_to_l3e(common_pgt); + l2_pgentry_t *l2t = l3e_to_l2e(l3t[l3_table_offset(stub_linear)]); + l1_pgentry_t *l1t = l2e_to_l1e(l2t[l2_table_offset(stub_linear)]); + + l1t[l2_table_offset(stub_linear)] = l1e_empty(); + } } static void cpu_smpboot_free(unsigned int cpu) @@ -848,6 +891,8 @@ static void cpu_smpboot_free(unsigned in if ( per_cpu(scratch_cpumask, cpu) != &scratch_cpu0mask ) free_cpumask_var(per_cpu(scratch_cpumask, cpu)); + cleanup_cpu_root_pgt(cpu); + if ( per_cpu(stubs.addr, cpu) ) { mfn_t mfn = _mfn(per_cpu(stubs.mfn, cpu)); @@ -865,8 +910,6 @@ static void cpu_smpboot_free(unsigned in free_domheap_page(mfn_to_page(mfn)); } - cleanup_cpu_root_pgt(cpu); - order = get_order_from_pages(NR_RESERVED_GDT_PAGES); free_xenheap_pages(per_cpu(gdt_table, cpu), order); @@ -922,9 +965,6 @@ static int cpu_smpboot_alloc(unsigned in set_ist(&idt_tables[cpu][TRAP_nmi], IST_NONE); set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE); - if ( setup_cpu_root_pgt(cpu) ) - goto oom; - for ( stub_page = 0, i = cpu & ~(STUBS_PER_PAGE - 1); i < nr_cpu_ids && i <= (cpu | (STUBS_PER_PAGE - 1)); ++i ) if ( cpu_online(i) && cpu_to_node(i) == node ) @@ -938,6 +978,9 @@ static int cpu_smpboot_alloc(unsigned in goto oom; per_cpu(stubs.addr, cpu) = stub_page + STUB_BUF_CPU_OFFS(cpu); + if ( setup_cpu_root_pgt(cpu) ) + goto oom; + if ( secondary_socket_cpumask == NULL && (secondary_socket_cpumask = xzalloc(cpumask_t)) == NULL ) goto oom; --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -102,7 +102,8 @@ DEFINE_PER_CPU_READ_MOSTLY(struct desc_s DEFINE_PER_CPU_READ_MOSTLY(struct desc_struct *, compat_gdt_table); /* Master table, used by CPU0. */ -idt_entry_t idt_table[IDT_ENTRIES]; +idt_entry_t __section(".bss.page_aligned") __aligned(PAGE_SIZE) + idt_table[IDT_ENTRIES]; /* Pointer to the IDT of every CPU. 
*/ idt_entry_t *idt_tables[NR_CPUS] __read_mostly; --- a/xen/arch/x86/x86_64/compat/entry.S +++ b/xen/arch/x86/x86_64/compat/entry.S @@ -13,6 +13,8 @@ #include <public/xen.h> #include <irq_vectors.h> + .section .text.entry, "ax", @progbits + ENTRY(entry_int82) ASM_CLAC pushq $0 @@ -270,6 +272,9 @@ ENTRY(compat_int80_direct_trap) call compat_create_bounce_frame jmp compat_test_all_events + /* compat_create_bounce_frame & helpers don't need to be in .text.entry */ + .text + /* CREATE A BASIC EXCEPTION FRAME ON GUEST OS (RING-1) STACK: */ /* {[ERRCODE,] EIP, CS, EFLAGS, [ESP, SS]} */ /* %rdx: trap_bounce, %rbx: struct vcpu */ --- a/xen/arch/x86/x86_64/entry.S +++ b/xen/arch/x86/x86_64/entry.S @@ -14,6 +14,8 @@ #include <public/xen.h> #include <irq_vectors.h> + .section .text.entry, "ax", @progbits + /* %rbx: struct vcpu */ ENTRY(switch_to_kernel) leaq VCPU_trap_bounce(%rbx),%rdx @@ -357,6 +359,9 @@ int80_slow_path: subq $2,UREGS_rip(%rsp) jmp handle_exception_saved + /* create_bounce_frame & helpers don't need to be in .text.entry */ + .text + /* CREATE A BASIC EXCEPTION FRAME ON GUEST OS STACK: */ /* { RCX, R11, [ERRCODE,] RIP, CS, RFLAGS, RSP, SS } */ /* %rdx: trap_bounce, %rbx: struct vcpu */ @@ -487,6 +492,8 @@ ENTRY(dom_crash_sync_extable) jmp asm_domain_crash_synchronous /* Does not return */ .popsection + .section .text.entry, "ax", @progbits + ENTRY(common_interrupt) SAVE_ALL CLAC @@ -846,8 +853,7 @@ GLOBAL(trap_nop) -.section .rodata, "a", @progbits - + .pushsection .rodata, "a", @progbits ENTRY(exception_table) .quad do_trap .quad do_debug @@ -873,9 +879,10 @@ ENTRY(exception_table) .quad do_reserved_trap /* Architecturally reserved exceptions. */ .endr .size exception_table, . - exception_table + .popsection /* Table of automatically generated entry points. One per vector. */ - .section .init.rodata, "a", @progbits + .pushsection .init.rodata, "a", @progbits GLOBAL(autogen_entrypoints) /* pop into the .init.rodata section and record an entry point. 
*/ .macro entrypoint ent @@ -884,7 +891,7 @@ GLOBAL(autogen_entrypoints) .popsection .endm - .text + .popsection autogen_stubs: /* Automatically generated stubs. */ vec = 0 --- a/xen/arch/x86/xen.lds.S +++ b/xen/arch/x86/xen.lds.S @@ -60,6 +60,13 @@ SECTIONS _stext = .; /* Text and read-only data */ *(.text) *(.text.__x86_indirect_thunk_*) + + . = ALIGN(PAGE_SIZE); + _stextentry = .; + *(.text.entry) + . = ALIGN(PAGE_SIZE); + _etextentry = .; + *(.text.cold) *(.text.unlikely) *(.fixup) ++++++ 5a8be788-x86-nmi-start-NMI-watchdog-on-CPU0-after-SMP.patch ++++++ --- /var/tmp/diff_new_pack.XjVnAv/_old 2018-03-30 12:00:46.384160741 +0200 +++ /var/tmp/diff_new_pack.XjVnAv/_new 2018-03-30 12:00:46.384160741 +0200 @@ -28,10 +28,8 @@ Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> -Index: xen-4.10.0-testing/xen/arch/x86/apic.c -=================================================================== ---- xen-4.10.0-testing.orig/xen/arch/x86/apic.c -+++ xen-4.10.0-testing/xen/arch/x86/apic.c +--- a/xen/arch/x86/apic.c ++++ b/xen/arch/x86/apic.c @@ -682,7 +682,7 @@ void setup_local_APIC(void) printk("Leaving ESR disabled.\n"); } @@ -41,11 +39,9 @@ setup_apic_nmi_watchdog(); apic_pm_activate(); } -Index: xen-4.10.0-testing/xen/arch/x86/smpboot.c -=================================================================== ---- xen-4.10.0-testing.orig/xen/arch/x86/smpboot.c -+++ xen-4.10.0-testing/xen/arch/x86/smpboot.c -@@ -1241,7 +1241,10 @@ int __cpu_up(unsigned int cpu) +--- a/xen/arch/x86/smpboot.c ++++ b/xen/arch/x86/smpboot.c +@@ -1284,7 +1284,10 @@ int __cpu_up(unsigned int cpu) void __init smp_cpus_done(void) { if ( nmi_watchdog == NMI_LOCAL_APIC ) @@ -56,11 +52,9 @@ setup_ioapic_dest(); -Index: xen-4.10.0-testing/xen/arch/x86/traps.c -=================================================================== ---- xen-4.10.0-testing.orig/xen/arch/x86/traps.c -+++ xen-4.10.0-testing/xen/arch/x86/traps.c -@@ -1669,7 +1669,7 @@ static 
nmi_callback_t *nmi_callback = du +--- a/xen/arch/x86/traps.c ++++ b/xen/arch/x86/traps.c +@@ -1670,7 +1670,7 @@ static nmi_callback_t *nmi_callback = du void do_nmi(const struct cpu_user_regs *regs) { unsigned int cpu = smp_processor_id(); @@ -69,7 +63,7 @@ bool handle_unknown = false; ++nmi_count(cpu); -@@ -1677,6 +1677,16 @@ void do_nmi(const struct cpu_user_regs * +@@ -1678,6 +1678,16 @@ void do_nmi(const struct cpu_user_regs * if ( nmi_callback(regs, cpu) ) return; @@ -86,7 +80,7 @@ if ( (nmi_watchdog == NMI_NONE) || (!nmi_watchdog_tick(regs) && watchdog_force) ) handle_unknown = true; -@@ -1684,7 +1694,6 @@ void do_nmi(const struct cpu_user_regs * +@@ -1685,7 +1695,6 @@ void do_nmi(const struct cpu_user_regs * /* Only the BSP gets external NMIs from the system. */ if ( cpu == 0 ) { ++++++ 5a956747-x86-HVM-dont-give-wrong-impression-of-WRMSR-success.patch ++++++ --- /var/tmp/diff_new_pack.XjVnAv/_old 2018-03-30 12:00:46.420159439 +0200 +++ /var/tmp/diff_new_pack.XjVnAv/_new 2018-03-30 12:00:46.424159294 +0200 @@ -19,6 +19,20 @@ Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> +# Commit 59c0983e10d70ea2368085271b75fb007811fe52 +# Date 2018-03-15 12:44:24 +0100 +# Author Jan Beulich <jbeulich@suse.com> +# Committer Jan Beulich <jbeulich@suse.com> +x86: ignore guest microcode loading attempts + +The respective MSRs are write-only, and hence attempts by guests to +write to these are - as of 1f1d183d49 ("x86/HVM: don't give the wrong +impression of WRMSR succeeding") no longer ignored. Restore original +behavior for the two affected MSRs. 
+ +Signed-off-by: Jan Beulich <jbeulich@suse.com> +Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> + --- a/xen/arch/x86/hvm/svm/svm.c +++ b/xen/arch/x86/hvm/svm/svm.c @@ -2106,6 +2106,13 @@ static int svm_msr_write_intercept(unsig @@ -51,3 +65,43 @@ case 1: break; default: +--- a/xen/arch/x86/msr.c ++++ b/xen/arch/x86/msr.c +@@ -128,6 +128,8 @@ int guest_rdmsr(const struct vcpu *v, ui + + switch ( msr ) + { ++ case MSR_AMD_PATCHLOADER: ++ case MSR_IA32_UCODE_WRITE: + case MSR_PRED_CMD: + /* Write-only */ + goto gp_fault; +@@ -181,6 +183,28 @@ int guest_wrmsr(struct vcpu *v, uint32_t + /* Read-only */ + goto gp_fault; + ++ case MSR_AMD_PATCHLOADER: ++ /* ++ * See note on MSR_IA32_UCODE_WRITE below, which may or may not apply ++ * to AMD CPUs as well (at least the architectural/CPUID part does). ++ */ ++ if ( is_pv_domain(d) || ++ d->arch.cpuid->x86_vendor != X86_VENDOR_AMD ) ++ goto gp_fault; ++ break; ++ ++ case MSR_IA32_UCODE_WRITE: ++ /* ++ * Some versions of Windows at least on certain hardware try to load ++ * microcode before setting up an IDT. Therefore we must not inject #GP ++ * for such attempts. Also the MSR is architectural and not qualified ++ * by any CPUID bit. ++ */ ++ if ( is_pv_domain(d) || ++ d->arch.cpuid->x86_vendor != X86_VENDOR_INTEL ) ++ goto gp_fault; ++ break; ++ + case MSR_SPEC_CTRL: + if ( !cp->feat.ibrsb ) + goto gp_fault; /* MSR available? */ ++++++ 5a9eb7f1-x86-xpti-dont-map-stack-guard-pages.patch ++++++ # Commit d303784b68237ff3050daa184f560179dda21b8c # Date 2018-03-06 16:46:57 +0100 # Author Jan Beulich <jbeulich@suse.com> # Committer Jan Beulich <jbeulich@suse.com> x86/xpti: don't map stack guard pages Other than for the main mappings, don't even do this in release builds, as there are no huge page shattering concerns here. 
Note that since we don't run on the restricted page tables while HVM guests execute, the non-present mappings won't trigger the triple fault issue AMD SVM is susceptible to with our current placement of STGI vs TR loading. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -5538,6 +5538,14 @@ void memguard_unguard_stack(void *p) memguard_unguard_range(p, PAGE_SIZE); } +bool memguard_is_stack_guard_page(unsigned long addr) +{ + addr &= STACK_SIZE - 1; + + return addr >= STACK_SIZE - PRIMARY_STACK_SIZE - PAGE_SIZE && + addr < STACK_SIZE - PRIMARY_STACK_SIZE; +} + void arch_dump_shared_mem_info(void) { printk("Shared frames %u -- Saved frames %u\n", --- a/xen/arch/x86/smpboot.c +++ b/xen/arch/x86/smpboot.c @@ -797,7 +797,8 @@ static int setup_cpu_root_pgt(unsigned i /* Install direct map page table entries for stack, IDT, and TSS. */ for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE ) - rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt); + if ( !memguard_is_stack_guard_page(off) ) + rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt); if ( !rc ) rc = clone_mapping(idt_tables[cpu], rpt); --- a/xen/include/asm-x86/mm.h +++ b/xen/include/asm-x86/mm.h @@ -519,6 +519,7 @@ void memguard_unguard_range(void *p, uns void memguard_guard_stack(void *p); void memguard_unguard_stack(void *p); +bool __attribute_const__ memguard_is_stack_guard_page(unsigned long addr); struct mmio_ro_emulate_ctxt { unsigned long cr2; ++++++ 5a9eb85c-x86-slightly-reduce-XPTI-overhead.patch ++++++ # Commit 9d1d31ad9498e6ceb285d5774e34fed5f648c273 # Date 2018-03-06 16:48:44 +0100 # Author Jan Beulich <jbeulich@suse.com> # Committer Jan Beulich <jbeulich@suse.com> x86: slightly reduce Meltdown band-aid overhead I'm not sure why I didn't do this right away: By avoiding the use of global PTEs in the cloned directmap, there's no need to fiddle with CR4.PGE on any of the entry paths. 
Only the exit paths need to flush global mappings. The reduced flushing, however, requires that we now have interrupts off on all entry paths until after the page table switch, so that flush IPIs can't be serviced while on the restricted pagetables, leaving a window where a potentially stale guest global mapping can be brought into the TLB. Along those lines the "sync" IPI after L4 entry updates now needs to become a real (and global) flush IPI, so that inside Xen we'll also pick up such changes. Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> # Commit c4dd58f0cf23cdf119bbccedfb8c24435fc6f3ab # Date 2018-03-16 17:27:36 +0100 # Author Jan Beulich <jbeulich@suse.com> # Committer Jan Beulich <jbeulich@suse.com> x86: correct EFLAGS.IF in SYSENTER frame Commit 9d1d31ad94 ("x86: slightly reduce Meltdown band-aid overhead") moved the STI past the PUSHF. While this isn't an active problem (as we force EFLAGS.IF to 1 before exiting to guest context), let's not risk internal confusion by finding a PV guest frame with interrupts apparently off. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -3782,18 +3782,14 @@ long do_mmu_update( { /* * Force other vCPU-s of the affected guest to pick up L4 entry - * changes (if any). Issue a flush IPI with empty operation mask to - * facilitate this (including ourselves waiting for the IPI to - * actually have arrived). Utilize the fact that FLUSH_VA_VALID is - * meaningless without FLUSH_CACHE, but will allow to pass the no-op - * check in flush_area_mask(). + * changes (if any). 
*/ unsigned int cpu = smp_processor_id(); cpumask_t *mask = per_cpu(scratch_cpumask, cpu); cpumask_andnot(mask, pt_owner->domain_dirty_cpumask, cpumask_of(cpu)); if ( !cpumask_empty(mask) ) - flush_area_mask(mask, ZERO_BLOCK_PTR, FLUSH_VA_VALID); + flush_mask(mask, FLUSH_TLB_GLOBAL); } perfc_add(num_page_updates, i); --- a/xen/arch/x86/smpboot.c +++ b/xen/arch/x86/smpboot.c @@ -737,6 +737,7 @@ static int clone_mapping(const void *ptr } pl1e += l1_table_offset(linear); + flags &= ~_PAGE_GLOBAL; if ( l1e_get_flags(*pl1e) & _PAGE_PRESENT ) { @@ -1046,8 +1047,17 @@ void __init smp_prepare_cpus(unsigned in if ( rc ) panic("Error %d setting up PV root page table\n", rc); if ( per_cpu(root_pgt, 0) ) + { get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0)); + /* + * All entry points which may need to switch page tables have to start + * with interrupts off. Re-write what pv_trap_init() has put there. + */ + _set_gate(idt_table + LEGACY_SYSCALL_VECTOR, SYS_DESC_irq_gate, 3, + &int80_direct_trap); + } + set_nr_sockets(); socket_cpumask = xzalloc_array(cpumask_t *, nr_sockets); --- a/xen/arch/x86/x86_64/compat/entry.S +++ b/xen/arch/x86/x86_64/compat/entry.S @@ -202,7 +202,7 @@ ENTRY(compat_post_handle_exception) /* See lstar_enter for entry register state. */ ENTRY(cstar_enter) - sti + /* sti could live here when we don't switch page tables below. */ CR4_PV32_RESTORE movq 8(%rsp),%rax /* Restore %rax. */ movq $FLAT_KERNEL_SS,8(%rsp) @@ -222,11 +222,12 @@ ENTRY(cstar_enter) jz .Lcstar_cr3_okay mov %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx) neg %rcx - write_cr3 rcx, rdi, rsi + mov %rcx, %cr3 movq $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx) .Lcstar_cr3_okay: + sti - GET_CURRENT(bx) + __GET_CURRENT(bx) movq VCPU_domain(%rbx),%rcx cmpb $0,DOMAIN_is_32bit_pv(%rcx) je switch_to_kernel --- a/xen/arch/x86/x86_64/entry.S +++ b/xen/arch/x86/x86_64/entry.S @@ -150,7 +150,7 @@ UNLIKELY_END(exit_cr3) * %ss must be saved into the space left by the trampoline. 
*/ ENTRY(lstar_enter) - sti + /* sti could live here when we don't switch page tables below. */ movq 8(%rsp),%rax /* Restore %rax. */ movq $FLAT_KERNEL_SS,8(%rsp) pushq %r11 @@ -169,9 +169,10 @@ ENTRY(lstar_enter) jz .Llstar_cr3_okay mov %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx) neg %rcx - write_cr3 rcx, rdi, rsi + mov %rcx, %cr3 movq $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx) .Llstar_cr3_okay: + sti __GET_CURRENT(bx) testb $TF_kernel_mode,VCPU_thread_flags(%rbx) @@ -254,7 +255,7 @@ process_trap: jmp test_all_events ENTRY(sysenter_entry) - sti + /* sti could live here when we don't switch page tables below. */ pushq $FLAT_USER_SS pushq $0 pushfq @@ -270,14 +271,17 @@ GLOBAL(sysenter_eflags_saved) /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ GET_STACK_END(bx) + /* PUSHF above has saved EFLAGS.IF clear (the caller had it set). */ + orl $X86_EFLAGS_IF, UREGS_eflags(%rsp) mov STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx neg %rcx jz .Lsyse_cr3_okay mov %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx) neg %rcx - write_cr3 rcx, rdi, rsi + mov %rcx, %cr3 movq $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx) .Lsyse_cr3_okay: + sti __GET_CURRENT(bx) cmpb $0,VCPU_sysenter_disables_events(%rbx) @@ -324,9 +328,10 @@ ENTRY(int80_direct_trap) jz .Lint80_cr3_okay mov %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx) neg %rcx - write_cr3 rcx, rdi, rsi + mov %rcx, %cr3 movq $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx) .Lint80_cr3_okay: + sti cmpb $0,untrusted_msi(%rip) UNLIKELY_START(ne, msi_check) @@ -510,7 +515,7 @@ ENTRY(common_interrupt) mov %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14) neg %rcx .Lintr_cr3_load: - write_cr3 rcx, rdi, rsi + mov %rcx, %cr3 xor %ecx, %ecx mov %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14) testb $3, UREGS_cs(%rsp) @@ -552,7 +557,7 @@ GLOBAL(handle_exception) mov %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14) neg %rcx .Lxcpt_cr3_load: - write_cr3 rcx, rdi, rsi + mov %rcx, %cr3 xor %ecx, %ecx mov %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14) testb $3, UREGS_cs(%rsp) @@ -748,7 +753,7 @@ 
ENTRY(double_fault) jns .Ldblf_cr3_load neg %rbx .Ldblf_cr3_load: - write_cr3 rbx, rdi, rsi + mov %rbx, %cr3 .Ldblf_cr3_okay: movq %rsp,%rdi @@ -783,7 +788,7 @@ handle_ist_exception: mov %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14) neg %rcx .List_cr3_load: - write_cr3 rcx, rdi, rsi + mov %rcx, %cr3 movq $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14) .List_cr3_okay: ++++++ 5a9eb890-x86-remove-CR-reads-from-exit-to-guest-path.patch ++++++ # Commit 31bf55cb5fe3796cf6a4efbcfc0a9418bb1c783f # Date 2018-03-06 16:49:36 +0100 # Author Jan Beulich <jbeulich@suse.com> # Committer Jan Beulich <jbeulich@suse.com> x86: remove CR reads from exit-to-guest path CR3 is - during normal operation - only ever loaded from v->arch.cr3, so there's no need to read the actual control register. For CR4 we can generally use the cached value on all synchronous entry and exit paths. Drop the write_cr3 macro, as the two use sites are probably easier to follow without its use. Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> --- a/xen/arch/x86/x86_64/asm-offsets.c +++ b/xen/arch/x86/x86_64/asm-offsets.c @@ -88,6 +88,7 @@ void __dummy__(void) OFFSET(VCPU_kernel_ss, struct vcpu, arch.pv_vcpu.kernel_ss); OFFSET(VCPU_iopl, struct vcpu, arch.pv_vcpu.iopl); OFFSET(VCPU_guest_context_flags, struct vcpu, arch.vgc_flags); + OFFSET(VCPU_cr3, struct vcpu, arch.cr3); OFFSET(VCPU_arch_msr, struct vcpu, arch.msr); OFFSET(VCPU_nmi_pending, struct vcpu, nmi_pending); OFFSET(VCPU_mce_pending, struct vcpu, mce_pending); --- a/xen/arch/x86/x86_64/entry.S +++ b/xen/arch/x86/x86_64/entry.S @@ -45,7 +45,7 @@ restore_all_guest: mov VCPUMSR_spec_ctrl_raw(%rdx), %r15d /* Copy guest mappings and switch to per-CPU root page table. 
*/ - mov %cr3, %r9 + mov VCPU_cr3(%rbx), %r9 GET_STACK_END(dx) mov STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi movabs $PADDR_MASK & PAGE_MASK, %rsi @@ -67,8 +67,13 @@ restore_all_guest: sub $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \ ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi rep movsq + mov STACK_CPUINFO_FIELD(cr4)(%rdx), %rdi mov %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx) - write_cr3 rax, rdi, rsi + mov %rdi, %rsi + and $~X86_CR4_PGE, %rdi + mov %rdi, %cr4 + mov %rax, %cr3 + mov %rsi, %cr4 .Lrag_keep_cr3: /* Restore stashed SPEC_CTRL value. */ @@ -124,7 +129,12 @@ restore_all_xen: * so "g" will have to do. */ UNLIKELY_START(g, exit_cr3) - write_cr3 rax, rdi, rsi + mov %cr4, %rdi + mov %rdi, %rsi + and $~X86_CR4_PGE, %rdi + mov %rdi, %cr4 + mov %rax, %cr3 + mov %rsi, %cr4 UNLIKELY_END(exit_cr3) /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */ --- a/xen/include/asm-x86/asm_defns.h +++ b/xen/include/asm-x86/asm_defns.h @@ -207,15 +207,6 @@ void ret_from_intr(void); #define ASM_STAC ASM_AC(STAC) #define ASM_CLAC ASM_AC(CLAC) -.macro write_cr3 val:req, tmp1:req, tmp2:req - mov %cr4, %\tmp1 - mov %\tmp1, %\tmp2 - and $~X86_CR4_PGE, %\tmp1 - mov %\tmp1, %cr4 - mov %\val, %cr3 - mov %\tmp2, %cr4 -.endm - #define CR4_PV32_RESTORE \ 667: ASM_NOP5; \ .pushsection .altinstr_replacement, "ax"; \ ++++++ 5aa2b6b9-cpufreq-ondemand-CPU-offlining-race.patch ++++++ # Commit 185413355fe331cbc926d48568838227234c9a20 # Date 2018-03-09 17:30:49 +0100 # Author Jan Beulich <jbeulich@suse.com> # Committer Jan Beulich <jbeulich@suse.com> cpufreq/ondemand: fix race while offlining CPU Offlining a CPU involves stopping the cpufreq governor. The on-demand governor will kill the timer before letting generic code proceed, but since that generally isn't happening on the subject CPU, cpufreq_dbs_timer_resume() may run in parallel. If that managed to invoke the timer handler, that handler needs to run to completion before dbs_timer_exit() may safely exit. 
Make the "stoppable" field a tristate, changing it from +1 to -1 around the timer function invocation, and make dbs_timer_exit() wait for it to become non-negative (still writing zero if it's +1). Also adjust coding style in cpufreq_dbs_timer_resume(). Reported-by: Martin Cerveny <martin@c-home.cz> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Martin Cerveny <martin@c-home.cz> Reviewed-by: Wei Liu <wei.liu2@citrix.com> --- a/xen/drivers/cpufreq/cpufreq_ondemand.c +++ b/xen/drivers/cpufreq/cpufreq_ondemand.c @@ -204,7 +204,14 @@ static void dbs_timer_init(struct cpu_db static void dbs_timer_exit(struct cpu_dbs_info_s *dbs_info) { dbs_info->enable = 0; - dbs_info->stoppable = 0; + + /* + * The timer function may be running (from cpufreq_dbs_timer_resume) - + * wait for it to complete. + */ + while ( cmpxchg(&dbs_info->stoppable, 1, 0) < 0 ) + cpu_relax(); + kill_timer(&per_cpu(dbs_timer, dbs_info->cpu)); } @@ -369,23 +376,22 @@ void cpufreq_dbs_timer_suspend(void) void cpufreq_dbs_timer_resume(void) { - int cpu; - struct timer* t; - s_time_t now; - - cpu = smp_processor_id(); + unsigned int cpu = smp_processor_id(); + int8_t *stoppable = &per_cpu(cpu_dbs_info, cpu).stoppable; - if ( per_cpu(cpu_dbs_info,cpu).stoppable ) + if ( *stoppable ) { - now = NOW(); - t = &per_cpu(dbs_timer, cpu); - if (t->expires <= now) + s_time_t now = NOW(); + struct timer *t = &per_cpu(dbs_timer, cpu); + + if ( t->expires <= now ) { + if ( !cmpxchg(stoppable, 1, -1) ) + return; t->function(t->data); + (void)cmpxchg(stoppable, -1, 1); } else - { - set_timer(t, align_timer(now , dbs_tuners_ins.sampling_rate)); - } + set_timer(t, align_timer(now, dbs_tuners_ins.sampling_rate)); } } --- a/xen/include/acpi/cpufreq/cpufreq.h +++ b/xen/include/acpi/cpufreq/cpufreq.h @@ -225,8 +225,8 @@ struct cpu_dbs_info_s { struct cpufreq_frequency_table *freq_table; int cpu; unsigned int enable:1; - unsigned int stoppable:1; unsigned int turbo_enabled:1; + int8_t stoppable; }; int 
cpufreq_governor_dbs(struct cpufreq_policy *policy, unsigned int event); ++++++ 5aaa9878-x86-vlapic-clear-TMR-bit-for-edge-triggered-intr.patch ++++++ # Commit 12a50030a81a14a3c7be672ddfde707b961479ec # Date 2018-03-15 16:59:52 +0100 # Author Liran Alon <liran.alon@oracle.com> # Committer Jan Beulich <jbeulich@suse.com> x86/vlapic: clear TMR bit upon acceptance of edge-triggered interrupt to IRR According to Intel SDM section "Interrupt Acceptance for Fixed Interrupts": "The trigger mode register (TMR) indicates the trigger mode of the interrupt (see Figure 10-20). Upon acceptance of an interrupt into the IRR, the corresponding TMR bit is cleared for edge-triggered interrupts and set for level-triggered interrupts. If a TMR bit is set when an EOI cycle for its corresponding interrupt vector is generated, an EOI message is sent to all I/O APICs." Before this patch TMR-bit was cleared on LAPIC EOI which is not what real hardware does. This was also confirmed in KVM upstream commit a0c9a822bf37 ("KVM: dont clear TMR on EOI"). Behavior after this patch is aligned with both Intel SDM and KVM implementation. Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/hvm/vlapic.c +++ b/xen/arch/x86/hvm/vlapic.c @@ -161,6 +161,8 @@ void vlapic_set_irq(struct vlapic *vlapi if ( trig ) vlapic_set_vector(vec, &vlapic->regs->data[APIC_TMR]); + else + vlapic_clear_vector(vec, &vlapic->regs->data[APIC_TMR]); if ( hvm_funcs.update_eoi_exit_bitmap ) hvm_funcs.update_eoi_exit_bitmap(target, vec, trig); @@ -434,7 +436,7 @@ void vlapic_handle_EOI(struct vlapic *vl { struct domain *d = vlapic_domain(vlapic); - if ( vlapic_test_and_clear_vector(vector, &vlapic->regs->data[APIC_TMR]) ) + if ( vlapic_test_vector(vector, &vlapic->regs->data[APIC_TMR]) ) vioapic_update_EOI(d, vector); hvm_dpci_msi_eoi(d, vector);
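
Illustration only, not part of the checked-in patches: the TMR handling that the vlapic change above adopts can be modelled with a small stand-alone sketch. Everything named toy_* (and the bit helpers) is hypothetical and exists only for this example; it is not Xen's vLAPIC code, and it deliberately ignores the ISR, priorities and the EOI-exit bitmap. It assumes only what the commit message quotes from the SDM: the TMR bit is written when an interrupt is accepted into the IRR (set for level, cleared for edge), and an EOI merely tests it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_VECTORS 256

/* Hypothetical, heavily simplified stand-in for a virtual LAPIC. */
struct toy_lapic {
    uint32_t irr[TOY_NR_VECTORS / 32];  /* interrupt request register */
    uint32_t tmr[TOY_NR_VECTORS / 32];  /* trigger mode register */
    unsigned int ioapic_eois;           /* EOIs forwarded to the I/O APIC */
};

static void set_bit32(uint32_t *map, uint8_t vec)
{
    map[vec / 32] |= UINT32_C(1) << (vec % 32);
}

static void clear_bit32(uint32_t *map, uint8_t vec)
{
    map[vec / 32] &= ~(UINT32_C(1) << (vec % 32));
}

static bool test_bit32(const uint32_t *map, uint8_t vec)
{
    return (map[vec / 32] >> (vec % 32)) & 1;
}

/*
 * Acceptance into the IRR: the TMR bit follows the trigger mode of the
 * interrupt being accepted (set for level, cleared for edge).
 */
static void toy_accept_irq(struct toy_lapic *l, uint8_t vec, bool level)
{
    assert(l != NULL);

    if ( level )
        set_bit32(l->tmr, vec);
    else
        clear_bit32(l->tmr, vec);

    set_bit32(l->irr, vec);
}

/*
 * EOI: only test the TMR bit, never clear it here.  Returns true when the
 * EOI is forwarded to the (modelled) I/O APIC.
 */
static bool toy_handle_eoi(struct toy_lapic *l, uint8_t vec)
{
    assert(l != NULL);

    if ( !test_bit32(l->tmr, vec) )
        return false;

    l->ioapic_eois++;
    return true;
}
```

In this model a level-triggered vector keeps its TMR bit across any number of EOIs, and the bit only drops once the same vector is later accepted as edge-triggered, which is the behaviour the commit message describes for real hardware.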