openSUSE Commits
October 2020: 1 participant, 2708 discussions
Hello community,
here is the log from the commit of package mozilla-nspr for openSUSE:Leap:15.1:Update checked in at 2020-10-31 10:35:23
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.1:Update/mozilla-nspr (Old)
and /work/SRC/openSUSE:Leap:15.1:Update/.mozilla-nspr.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "mozilla-nspr"
Sat Oct 31 10:35:23 2020 rev:4 rq:844991 version:unknown
Changes:
--------
New Changes file:
NO CHANGES FILE!!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ _link ++++++
--- /var/tmp/diff_new_pack.xWvxXg/_old 2020-10-31 10:35:24.490989951 +0100
+++ /var/tmp/diff_new_pack.xWvxXg/_new 2020-10-31 10:35:24.494989956 +0100
@@ -1 +1 @@
-<link package='mozilla-nspr.12911' cicount='copy' />
+<link package='mozilla-nspr.14803' cicount='copy' />
Hello community,
here is the log from the commit of package MozillaThunderbird for openSUSE:Leap:15.1:Update checked in at 2020-10-31 10:35:20
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.1:Update/MozillaThunderbird (Old)
and /work/SRC/openSUSE:Leap:15.1:Update/.MozillaThunderbird.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "MozillaThunderbird"
Sat Oct 31 10:35:20 2020 rev:17 rq:844991 version:unknown
Changes:
--------
New Changes file:
NO CHANGES FILE!!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ _link ++++++
--- /var/tmp/diff_new_pack.SRBKI7/_old 2020-10-31 10:35:22.614987530 +0100
+++ /var/tmp/diff_new_pack.SRBKI7/_new 2020-10-31 10:35:22.614987530 +0100
@@ -1 +1 @@
-<link package='MozillaThunderbird.13925' cicount='copy' />
+<link package='MozillaThunderbird.14803' cicount='copy' />
Hello community,
here is the log from the commit of package 00Meta for openSUSE:Leap:15.2:Images checked in at 2020-10-31 01:32:25
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.2:Images/00Meta (Old)
and /work/SRC/openSUSE:Leap:15.2:Images/.00Meta.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "00Meta"
Sat Oct 31 01:32:25 2020 rev:576 rq: version:unknown
Changes:
--------
New Changes file:
NO CHANGES FILE!!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ version_totest ++++++
--- /var/tmp/diff_new_pack.gDVhdk/_old 2020-10-31 01:32:26.918730242 +0100
+++ /var/tmp/diff_new_pack.gDVhdk/_new 2020-10-31 01:32:26.922730245 +0100
@@ -1 +1 @@
-31.216
\ No newline at end of file
+31.217
\ No newline at end of file
Hello community,
here is the log from the commit of package xen.14764 for openSUSE:Leap:15.2:Update checked in at 2020-10-31 00:23:22
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.2:Update/xen.14764 (Old)
and /work/SRC/openSUSE:Leap:15.2:Update/.xen.14764.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "xen.14764"
Sat Oct 31 00:23:22 2020 rev:1 rq:844482 version:4.13.1_10
Changes:
--------
New Changes file:
--- /dev/null 2020-10-22 01:51:33.322291705 +0200
+++ /work/SRC/openSUSE:Leap:15.2:Update/.xen.14764.new.3463/xen.changes 2020-10-31 00:23:31.587627360 +0100
@@ -0,0 +1,12678 @@
+-------------------------------------------------------------------
+Tue Oct 13 10:48:04 MDT 2020 - carnold(a)suse.com
+
+- bsc#1177409 - VUL-0: xen: x86 PV guest INVLPG-like flushes may
+ leave stale TLB entries (XSA-286)
+ xsa286-1.patch
+ xsa286-2.patch
+ xsa286-3.patch
+ xsa286-4.patch
+ xsa286-5.patch
+ xsa286-6.patch
+- bsc#1177412 - VUL-0: xen: Race condition in Xen mapping code
+ (XSA-345)
+ xsa345-1.patch
+ xsa345-2.patch
+ xsa345-3.patch
+- bsc#1177413 - VUL-0: xen: undue deferral of IOMMU TLB flushes
+ (XSA-346)
+ xsa346-1.patch
+ xsa346-2.patch
+- bsc#1177414 - VUL-0: xen: unsafe AMD IOMMU page table updates
+ (XSA-347)
+ xsa347-1.patch
+ xsa347-2.patch
+ xsa347-3.patch
+
+-------------------------------------------------------------------
+Fri Sep 11 11:11:11 UTC 2020 - ohering(a)suse.de
+
+- Escape some % chars in xen.spec, they have to appear verbatim
+
+-------------------------------------------------------------------
+Wed Sep 9 10:11:12 UTC 2020 - ohering(a)suse.de
+
+- Enhance libxc.migrate_tracking.patch
+ Print number of allocated pages on sending side, this is more
+ accurate than p2m_size.
+
+-------------------------------------------------------------------
+Tue Sep 8 11:20:40 MDT 2020 - carnold(a)suse.com
+
+- bsc#1176339 - VUL-0: CVE-2020-25602: xen: x86 pv: Crash when
+ handling guest access to MSR_MISC_ENABLE (XSA-333)
+ xsa333.patch
+- bsc#1176341 - VUL-0: CVE-2020-25598: xen: Missing unlock in
+ XENMEM_acquire_resource error path (XSA-334)
+ xsa334.patch
+- bsc#1176343 - VUL-0: CVE-2020-25604: xen: race when migrating
+ timers between x86 HVM vCPU-s (XSA-336)
+ xsa336.patch
+- bsc#1176344 - VUL-0: CVE-2020-25595: xen: PCI passthrough code
+ reading back hardware registers (XSA-337)
+ xsa337-1.patch
+ xsa337-2.patch
+- bsc#1176346 - VUL-0: CVE-2020-25597: xen: once valid event
+ channels may not turn invalid (XSA-338)
+ xsa338.patch
+- bsc#1176345 - VUL-0: CVE-2020-25596: xen: x86 pv guest kernel
+ DoS via SYSENTER (XSA-339)
+ xsa339.patch
+- bsc#1176347 - VUL-0: CVE-2020-25603: xen: Missing memory
+ barriers when accessing/allocating an event channel (XSA-340)
+ xsa340.patch
+- bsc#1176348 - VUL-0: CVE-2020-25600: xen: out of bounds event
+ channels available to 32-bit x86 domains (XSA-342)
+ xsa342.patch
+- bsc#1176349 - VUL-0: CVE-2020-25599: xen: races with
+ evtchn_reset() (XSA-343)
+ xsa343-1.patch
+ xsa343-2.patch
+ xsa343-3.patch
+- bsc#1176350 - VUL-0: CVE-2020-25601: xen: lack of preemption in
+ evtchn_reset() / evtchn_destroy() (XSA-344)
+ xsa344-1.patch
+ xsa344-2.patch
+- Upstream bug fixes (bsc#1027519)
+ 5f479d9e-x86-begin-to-support-MSR_ARCH_CAPS.patch
+ 5f4cf06e-x86-Dom0-expose-MSR_ARCH_CAPS.patch
+ 5f4cf96a-x86-PV-fix-SEGBASE_GS_USER_SEL.patch
+ 5f560c42-x86-PV-64bit-segbase-consistency.patch
+
+-------------------------------------------------------------------
+Mon Aug 3 10:21:59 MDT 2020 - carnold(a)suse.com
+
+- Upstream bug fixes (bsc#1027519)
+ 5ef44e0d-x86-PMTMR-use-FADT-flags.patch
+ 5ef6156a-x86-disallow-access-to-PT-MSRs.patch
+ 5efcb354-x86-protect-CALL-JMP-straight-line-speculation.patch
+ 5f046c18-evtchn-dont-ignore-error-in-get_free_port.patch (Replaces xsa317.patch)
+ 5f046c48-x86-shadow-dirty-VRAM-inverted-conditional.patch (Replaces xsa319.patch)
+ 5f046c64-EPT-set_middle_entry-adjustments.patch (Replaces xsa328-1.patch)
+ 5f046c78-EPT-atomically-modify-ents-in-ept_next_level.patch (Replaces xsa328-2.patch)
+ 5f046c9a-VT-d-improve-IOMMU-TLB-flush.patch (Replaces xsa321-1.patch)
+ 5f046cb5-VT-d-prune-rename-cache-flush-funcs.patch (Replaces xsa321-2.patch)
+ 5f046cca-x86-IOMMU-introduce-cache-sync-hook.patch (Replaces xsa321-3.patch)
+ 5f046ce9-VT-d-sync_cache-misaligned-addresses.patch (Replaces xsa321-4.patch)
+ 5f046cfd-x86-introduce-alternative_2.patch (Replaces xsa321-5.patch)
+ 5f046d1a-VT-d-optimize-CPU-cache-sync.patch (Replaces xsa321-6.patch)
+ 5f046d2b-EPT-flush-cache-when-modifying-PTEs.patch (Replaces xsa321-7.patch)
+ 5f046d5c-check-VCPUOP_register_vcpu_info-alignment.patch (Replaces xsa327.patch)
+ 5f1a9916-x86-S3-put-data-sregs-into-known-state.patch
+ 5f21b9fd-x86-cpuid-APIC-bit-clearing.patch
+
+-------------------------------------------------------------------
+Thu Jul 23 11:12:58 MDT 2020 - carnold(a)suse.com
+
+- bsc#1172356 - Not able to hot-plug NIC via virt-manager, asks to
+ attach on next reboot while it should be live attached
+ ignore-ip-command-script-errors.patch
+
+-------------------------------------------------------------------
+Fri Jul 17 14:14:14 UTC 2020 - ohering(a)suse.de
+
+- Enhance libxc.migrate_tracking.patch
+ After transfer of domU memory, the target host has to assemble
+ the backend devices. Track the time prior xc_domain_unpause.
+
+-------------------------------------------------------------------
+Tue Jun 30 18:03:40 UTC 2020 - ohering(a)suse.de
+
+- Add libxc.migrate_tracking.patch to track live migrations
+ unconditionally in logfiles, especially in libvirt.
+ This will track how long a domU was suspended during transit.
+
+-------------------------------------------------------------------
+Mon Jun 29 11:28:27 MDT 2020 - carnold(a)suse.com
+
+- bsc#1173376 - VUL-0: CVE-2020-15566: xen: XSA-317 - Incorrect
+ error handling in event channel port allocation
+ xsa317.patch
+- bsc#1173377 - VUL-0: CVE-2020-15563: xen: XSA-319 - inverted code
+ paths in x86 dirty VRAM tracking
+ xsa319.patch
+- bsc#1173378 - VUL-0: CVE-2020-15565: xen: XSA-321 - insufficient
+ cache write-back under VT-d
+ xsa321-1.patch
+ xsa321-2.patch
+ xsa321-3.patch
+ xsa321-4.patch
+ xsa321-5.patch
+ xsa321-6.patch
+ xsa321-7.patch
+- bsc#1173380 - VUL-0: CVE-2020-15567: xen: XSA-328 - non-atomic
+ modification of live EPT PTE
+ xsa328-1.patch
+ xsa328-2.patch
+
+-------------------------------------------------------------------
+Mon Jun 22 11:24:48 MDT 2020 - carnold(a)suse.com
+
+- bsc#1172205 - VUL-0: CVE-2020-0543: xen: Special Register Buffer
+ Data Sampling (SRBDS) aka "CrossTalk" (XSA-320)
+ 5ee24d0e-x86-spec-ctrl-document-SRBDS-workaround.patch
+ 5edfbbea-x86-spec-ctrl-CPUID-MSR-defs-for-SRBDS.patch (Replaces xsa320-1.patch)
+ 5edfbbea-x86-spec-ctrl-mitigate-SRBDS.patch (Replaces xsa320-2.patch)
+- Upstream bug fixes (bsc#1027519)
+ 5ec50b05-x86-idle-rework-C6-EOI-workaround.patch
+ 5ec7dcaa-x86-dont-enter-C6-with-in-service-intr.patch
+ 5ec7dcf6-x86-dont-enter-C3-C6-with-errata.patch
+ 5ec82237-x86-extend-ISR-C6-workaround-to-Haswell.patch
+ 5ece1b91-x86-clear-RDRAND-CPUID-bit-on-AMD-fam-15-16.patch
+ 5ece8ac4-x86-load_system_tables-NMI-MC-safe.patch
+ 5ed69804-x86-ucode-fix-start-end-update.patch
+ 5eda60cb-SVM-split-recalc-NPT-fault-handling.patch
+ 5edf6ad8-ioreq-pending-emulation-server-destruction-race.patch
+
+-------------------------------------------------------------------
+Fri Jun 5 16:42:16 UTC 2020 - Callum Farmer <callumjfarmer13(a)gmail.com>
+
+- Fixes for %_libexecdir changing to /usr/libexec
+
+-------------------------------------------------------------------
+Thu May 28 08:35:20 MDT 2020 - carnold(a)suse.com
+
+- bsc#1172205 - VUL-0: CVE-2020-0543: xen: Special Register Buffer
+ Data Sampling (SRBDS) aka "CrossTalk" (XSA-320)
+ xsa320-1.patch
+ xsa320-2.patch
+
+-------------------------------------------------------------------
+Mon May 18 10:55:26 MDT 2020 - carnold(a)suse.com
+
+- Update to Xen 4.13.1 bug fix release (bsc#1027519)
+ xen-4.13.1-testing-src.tar.bz2
+ 5eb51be6-cpupool-fix-removing-cpu-from-pool.patch
+ 5eb51caa-sched-vcpu-pause-flags-atomic.patch
+ 5ec2a760-x86-determine-MXCSR-mask-always.patch
+- Drop patches contained in new tarball
+ 5de65f84-gnttab-map-always-do-IOMMU-part.patch
+ 5de65fc4-x86-avoid-HPET-use-on-certain-Intel.patch
+ 5e15e03d-sched-fix-S3-resume-with-smt=0.patch
+ 5e16fb6a-x86-clear-per-cpu-stub-page-info.patch
+ 5e1da013-IRQ-u16-is-too-narrow-for-evtchn.patch
+ 5e1dcedd-Arm-place-speculation-barrier-after-ERET.patch
+ 5e21ce98-x86-time-update-TSC-stamp-after-deep-C-state.patch
+ 5e286cce-VT-d-dont-pass-bridges-to-domain_context_mapping_one.patch
+ 5e318cd4-x86-apic-fix-disabling-LVT0.patch
++++ 12481 more lines (skipped)
++++ between /dev/null
++++ and /work/SRC/openSUSE:Leap:15.2:Update/.xen.14764.new.3463/xen.changes
New:
----
5eb51be6-cpupool-fix-removing-cpu-from-pool.patch
5eb51caa-sched-vcpu-pause-flags-atomic.patch
5ec2a760-x86-determine-MXCSR-mask-always.patch
5ec50b05-x86-idle-rework-C6-EOI-workaround.patch
5ec7dcaa-x86-dont-enter-C6-with-in-service-intr.patch
5ec7dcf6-x86-dont-enter-C3-C6-with-errata.patch
5ec82237-x86-extend-ISR-C6-workaround-to-Haswell.patch
5ece1b91-x86-clear-RDRAND-CPUID-bit-on-AMD-fam-15-16.patch
5ece8ac4-x86-load_system_tables-NMI-MC-safe.patch
5ed69804-x86-ucode-fix-start-end-update.patch
5eda60cb-SVM-split-recalc-NPT-fault-handling.patch
5edf6ad8-ioreq-pending-emulation-server-destruction-race.patch
5edfbbea-x86-spec-ctrl-CPUID-MSR-defs-for-SRBDS.patch
5edfbbea-x86-spec-ctrl-mitigate-SRBDS.patch
5ee24d0e-x86-spec-ctrl-document-SRBDS-workaround.patch
5ef44e0d-x86-PMTMR-use-FADT-flags.patch
5ef6156a-x86-disallow-access-to-PT-MSRs.patch
5efcb354-x86-protect-CALL-JMP-straight-line-speculation.patch
5f046c18-evtchn-dont-ignore-error-in-get_free_port.patch
5f046c48-x86-shadow-dirty-VRAM-inverted-conditional.patch
5f046c64-EPT-set_middle_entry-adjustments.patch
5f046c78-EPT-atomically-modify-ents-in-ept_next_level.patch
5f046c9a-VT-d-improve-IOMMU-TLB-flush.patch
5f046cb5-VT-d-prune-rename-cache-flush-funcs.patch
5f046cca-x86-IOMMU-introduce-cache-sync-hook.patch
5f046ce9-VT-d-sync_cache-misaligned-addresses.patch
5f046cfd-x86-introduce-alternative_2.patch
5f046d1a-VT-d-optimize-CPU-cache-sync.patch
5f046d2b-EPT-flush-cache-when-modifying-PTEs.patch
5f046d5c-check-VCPUOP_register_vcpu_info-alignment.patch
5f1a9916-x86-S3-put-data-sregs-into-known-state.patch
5f21b9fd-x86-cpuid-APIC-bit-clearing.patch
5f479d9e-x86-begin-to-support-MSR_ARCH_CAPS.patch
5f4cf06e-x86-Dom0-expose-MSR_ARCH_CAPS.patch
5f4cf96a-x86-PV-fix-SEGBASE_GS_USER_SEL.patch
5f560c42-x86-PV-64bit-segbase-consistency.patch
README.SUSE
aarch64-maybe-uninitialized.patch
aarch64-rename-PSR_MODE_ELxx-to-match-linux-headers.patch
baselibs.conf
bin-python3-conversion.patch
block-dmmd
block-npiv
block-npiv-common.sh
block-npiv-vport
boot.local.xenU
boot.xen
build-python3-conversion.patch
disable-building-pv-shim.patch
etc_pam.d_xen-api
gcc10-fixes.patch
hibernate.patch
ignore-ip-command-script-errors.patch
init.pciback
init.xen_loop
ipxe-enable-nics.patch
ipxe-no-error-logical-not-parentheses.patch
ipxe-use-rpm-opt-flags.patch
ipxe.tar.bz2
libxc.migrate_tracking.patch
libxc.sr.superpage.patch
libxl.LIBXL_HOTPLUG_TIMEOUT.patch
libxl.add-option-to-disable-disk-cache-flushes-in-qdisk.patch
libxl.helper_done-crash.patch
libxl.libxl__domain_pvcontrol.patch
libxl.max_event_channels.patch
libxl.pvscsi.patch
libxl.set-migration-constraints-from-cmdline.patch
logrotate.conf
migration-python3-conversion.patch
mini-os.tar.bz2
pygrub-boot-legacy-sles.patch
pygrub-handle-one-line-menu-entries.patch
pygrub-netware-xnloader.patch
replace-obsolete-network-configuration-commands-in-s.patch
reproducible.patch
stdvga-cache.patch
stubdom-have-iovec.patch
stubdom.tar.bz2
suse-xendomains-service.patch
suspend_evtchn_lock.patch
sysconfig.pciback
tmp_build.patch
vif-bridge-no-iptables.patch
vif-bridge-tap-fix.patch
vif-route.patch
x86-cpufreq-report.patch
x86-ioapic-ack-default.patch
xen-4.13.1-testing-src.tar.bz2
xen-arch-kconfig-nr_cpus.patch
xen-destdir.patch
xen-dom0-modules.service
xen-supportconfig
xen-utils-0.1.tar.bz2
xen.bug1026236.suse_vtsc_tolerance.patch
xen.build-compare.doc_html.patch
xen.changes
xen.libxl.dmmd.patch
xen.spec
xen.stubdom.newlib.patch
xen2libvirt.py
xen_maskcalc.py
xenapiusers
xencommons.service
xenconsole-no-multiple-connections.patch
xendomains-wait-disks.LICENSE
xendomains-wait-disks.README.md
xendomains-wait-disks.sh
xenstore-launch.patch
xenstore-run-in-studomain.patch
xl-conf-default-bridge.patch
xl-conf-disable-autoballoon.patch
xnloader.py
xsa286-1.patch
xsa286-2.patch
xsa286-3.patch
xsa286-4.patch
xsa286-5.patch
xsa286-6.patch
xsa333.patch
xsa334.patch
xsa336.patch
xsa337-1.patch
xsa337-2.patch
xsa338.patch
xsa339.patch
xsa340.patch
xsa342.patch
xsa343-1.patch
xsa343-2.patch
xsa343-3.patch
xsa344-1.patch
xsa344-2.patch
xsa345-1.patch
xsa345-2.patch
xsa345-3.patch
xsa346-1.patch
xsa346-2.patch
xsa347-1.patch
xsa347-2.patch
xsa347-3.patch
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ xen.spec ++++++
++++ 1429 lines (skipped)
++++++ 5eb51be6-cpupool-fix-removing-cpu-from-pool.patch ++++++
# Commit 498d73647fa17d9eb7a67d2e9bdccac6b438e559
# Date 2020-05-08 10:44:22 +0200
# Author Juergen Gross <jgross(a)suse.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
cpupool: fix removing cpu from a cpupool
Commit cb563d7665f2 ("xen/sched: support core scheduling for moving
cpus to/from cpupools") introduced a regression when trying to remove
an offline cpu from a cpupool, as the system would crash in this
situation.
Fix that by testing the cpu to be online.
Fixes: cb563d7665f2 ("xen/sched: support core scheduling for moving cpus to/from cpupools")
Signed-off-by: Juergen Gross <jgross(a)suse.com>
Acked-by: Dario Faggioli <dfaggioli(a)suse.com>
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -519,6 +519,9 @@ static int cpupool_unassign_cpu(struct c
debugtrace_printk("cpupool_unassign_cpu(pool=%d,cpu=%d)\n",
c->cpupool_id, cpu);
+ if ( !cpu_online(cpu) )
+ return -EINVAL;
+
master_cpu = sched_get_resource_cpu(cpu);
ret = cpupool_unassign_cpu_start(c, master_cpu);
if ( ret )
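The fix above is a guard clause: reject an offline CPU before any scheduler state is touched. A minimal standalone sketch of that pattern (not Xen code; `cpu_online_map`, `NR_CPUS`, and the function body are hypothetical stand-ins):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for Xen's cpu_online() and online-CPU state. */
#define NR_CPUS 8
static bool cpu_online_map[NR_CPUS];

static bool cpu_online(unsigned int cpu)
{
    return cpu < NR_CPUS && cpu_online_map[cpu];
}

/* Sketch of the fixed entry point: fail early with -EINVAL for an
 * offline CPU, mirroring the "if ( !cpu_online(cpu) )" hunk above. */
static int cpupool_unassign_cpu(unsigned int cpu)
{
    if (!cpu_online(cpu))
        return -EINVAL;   /* the guard the patch adds */
    /* ... real code continues with sched_get_resource_cpu() etc. ... */
    return 0;
}
```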
++++++ 5eb51caa-sched-vcpu-pause-flags-atomic.patch ++++++
# Commit e0d92d9bd7997c6bcda17a19aba4f3957dd1a2e9
# Date 2020-05-08 10:47:38 +0200
# Author Juergen Gross <jgross(a)suse.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
sched: always modify vcpu pause flags atomically
credit2 is currently modifying the pause flags of vcpus non-atomically
via sched_set_pause_flags() and sched_clear_pause_flags(). This is
dangerous as there are cases where the pause flags are modified without
any lock held.
So drop the non-atomic pause flag modification functions and rename the
atomic ones dropping the _atomic suffix.
Fixes: a76255b4266516 ("xen/sched: make credit2 scheduler vcpu agnostic.")
Signed-off-by: Juergen Gross <jgross(a)suse.com>
Reviewed-by: Dario Faggioli <dfaggioli(a)suse.com>
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -452,7 +452,7 @@ static inline void __runq_tickle(struct
SCHED_UNIT_STAT_CRANK(cur, kicked_away);
SCHED_UNIT_STAT_CRANK(cur, migrate_r);
SCHED_STAT_CRANK(migrate_kicked_away);
- sched_set_pause_flags_atomic(cur->unit, _VPF_migrating);
+ sched_set_pause_flags(cur->unit, _VPF_migrating);
}
/* Tickle cpu anyway, to let new preempt cur. */
SCHED_STAT_CRANK(tickled_busy_cpu);
@@ -983,7 +983,7 @@ csched_unit_acct(struct csched_private *
{
SCHED_UNIT_STAT_CRANK(svc, migrate_r);
SCHED_STAT_CRANK(migrate_running);
- sched_set_pause_flags_atomic(currunit, _VPF_migrating);
+ sched_set_pause_flags(currunit, _VPF_migrating);
/*
* As we are about to tickle cpu, we should clear its bit in
* idlers. But, if we are here, it means there is someone running
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -175,7 +175,7 @@ static inline void sched_set_pause_flags
struct vcpu *v;
for_each_sched_unit_vcpu ( unit, v )
- __set_bit(bit, &v->pause_flags);
+ set_bit(bit, &v->pause_flags);
}
/* Clear a bit in pause_flags of all vcpus of a unit. */
@@ -184,26 +184,6 @@ static inline void sched_clear_pause_fla
{
struct vcpu *v;
- for_each_sched_unit_vcpu ( unit, v )
- __clear_bit(bit, &v->pause_flags);
-}
-
-/* Set a bit in pause_flags of all vcpus of a unit via atomic updates. */
-static inline void sched_set_pause_flags_atomic(struct sched_unit *unit,
- unsigned int bit)
-{
- struct vcpu *v;
-
- for_each_sched_unit_vcpu ( unit, v )
- set_bit(bit, &v->pause_flags);
-}
-
-/* Clear a bit in pause_flags of all vcpus of a unit via atomic updates. */
-static inline void sched_clear_pause_flags_atomic(struct sched_unit *unit,
- unsigned int bit)
-{
- struct vcpu *v;
-
for_each_sched_unit_vcpu ( unit, v )
clear_bit(bit, &v->pause_flags);
}
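The point of the patch above is that `__set_bit()` does a plain load/or/store, which can lose a concurrent update, while `set_bit()` is a single atomic read-modify-write. A standalone C11 sketch of the atomic variants (hypothetical names, not Xen's implementation):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical analogue of a vcpu's pause_flags word. */
static atomic_ulong pause_flags;

/* Atomic equivalents of Xen's set_bit()/clear_bit(): one indivisible
 * read-modify-write, so concurrent modifiers cannot clobber each other
 * the way two racing __set_bit() calls can. */
static void set_bit_atomic(unsigned int bit, atomic_ulong *flags)
{
    atomic_fetch_or(flags, 1UL << bit);
}

static void clear_bit_atomic(unsigned int bit, atomic_ulong *flags)
{
    atomic_fetch_and(flags, ~(1UL << bit));
}
```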
++++++ 5ec2a760-x86-determine-MXCSR-mask-always.patch ++++++
# Commit 2b532519d64e653a6bbfd9eefed6040a09c8876d
# Date 2020-05-18 17:18:56 +0200
# Author Jan Beulich <jbeulich(a)suse.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86: determine MXCSR mask in all cases
For its use(s) by the emulator to be correct in all cases, the filling
of the variable needs to be independent of XSAVE availability. As
there's no suitable function in i387.c to put the logic in, keep it in
xstate_init(), arrange for the function to be called unconditionally,
and pull the logic ahead of all return paths there.
Fixes: 9a4496a35b20 ("x86emul: support {,V}{LD,ST}MXCSR")
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -487,8 +487,7 @@ void identify_cpu(struct cpuinfo_x86 *c)
/* Now the feature flags better reflect actual CPU features! */
- if ( cpu_has_xsave )
- xstate_init(c);
+ xstate_init(c);
#ifdef NOISY_CAPS
printk(KERN_DEBUG "CPU: After all inits, caps:");
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -587,6 +587,18 @@ void xstate_init(struct cpuinfo_x86 *c)
u32 eax, ebx, ecx, edx;
u64 feature_mask;
+ if ( bsp )
+ {
+ static typeof(current->arch.xsave_area->fpu_sse) __initdata ctxt;
+
+ asm ( "fxsave %0" : "=m" (ctxt) );
+ if ( ctxt.mxcsr_mask )
+ mxcsr_mask = ctxt.mxcsr_mask;
+ }
+
+ if ( !cpu_has_xsave )
+ return;
+
if ( (bsp && !use_xsave) ||
boot_cpu_data.cpuid_level < XSTATE_CPUID )
{
@@ -610,8 +622,6 @@ void xstate_init(struct cpuinfo_x86 *c)
if ( bsp )
{
- static typeof(current->arch.xsave_area->fpu_sse) __initdata ctxt;
-
xfeature_mask = feature_mask;
/*
* xsave_cntxt_size is the max size required by enabled features.
@@ -620,10 +630,6 @@ void xstate_init(struct cpuinfo_x86 *c)
xsave_cntxt_size = _xstate_ctxt_size(feature_mask);
printk("xstate: size: %#x and states: %#"PRIx64"\n",
xsave_cntxt_size, xfeature_mask);
-
- asm ( "fxsave %0" : "=m" (ctxt) );
- if ( ctxt.mxcsr_mask )
- mxcsr_mask = ctxt.mxcsr_mask;
}
else
{
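The moved hunk above keeps `mxcsr_mask` at its default unless the FXSAVE image reports a nonzero mask. Per the Intel SDM, a zero MXCSR_MASK field in the FXSAVE area means the default mask 0xFFBF applies. A hedged sketch of that resolution step (hypothetical helper, not the Xen function):

```c
#include <assert.h>
#include <stdint.h>

/* If the MXCSR_MASK field saved by FXSAVE reads as zero, the SDM says
 * the default mask 0xFFBF is in effect; otherwise the saved value is
 * authoritative. This mirrors the "if ( ctxt.mxcsr_mask )" test the
 * patch pulls ahead of the XSAVE availability check. */
static uint32_t resolve_mxcsr_mask(uint32_t saved_mask)
{
    return saved_mask ? saved_mask : 0xffbfu;
}
```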
++++++ 5ec50b05-x86-idle-rework-C6-EOI-workaround.patch ++++++
# Commit 5fef1fd713660406a6187ef352fbf79986abfe43
# Date 2020-05-20 12:48:37 +0200
# Author Roger Pau Monné <roger.pau(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/idle: rework C6 EOI workaround
Change the C6 EOI workaround (errata AAJ72) to use x86_match_cpu. Also
call the workaround from mwait_idle, previously it was only used by
the ACPI idle driver. Finally make sure the routine is called for all
states equal or greater than ACPI_STATE_C3, note that the ACPI driver
doesn't currently handle them, but the errata condition shouldn't be
limited by that.
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -537,26 +537,35 @@ void trace_exit_reason(u32 *irq_traced)
}
}
-/*
- * "AAJ72. EOI Transaction May Not be Sent if Software Enters Core C6 During
- * an Interrupt Service Routine"
- *
- * There was an errata with some Core i7 processors that an EOI transaction
- * may not be sent if software enters core C6 during an interrupt service
- * routine. So we don't enter deep Cx state if there is an EOI pending.
- */
-static bool errata_c6_eoi_workaround(void)
+bool errata_c6_eoi_workaround(void)
{
- static int8_t fix_needed = -1;
+ static int8_t __read_mostly fix_needed = -1;
if ( unlikely(fix_needed == -1) )
{
- int model = boot_cpu_data.x86_model;
- fix_needed = (cpu_has_apic && !directed_eoi_enabled &&
- (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
- (boot_cpu_data.x86 == 6) &&
- ((model == 0x1a) || (model == 0x1e) || (model == 0x1f) ||
- (model == 0x25) || (model == 0x2c) || (model == 0x2f)));
+#define INTEL_FAM6_MODEL(m) { X86_VENDOR_INTEL, 6, m, X86_FEATURE_ALWAYS }
+ /*
+ * Errata AAJ72: EOI Transaction May Not be Sent if Software Enters
+ * Core C6 During an Interrupt Service Routine"
+ *
+ * There was an errata with some Core i7 processors that an EOI
+ * transaction may not be sent if software enters core C6 during an
+ * interrupt service routine. So we don't enter deep Cx state if
+ * there is an EOI pending.
+ */
+ static const struct x86_cpu_id eoi_errata[] = {
+ INTEL_FAM6_MODEL(0x1a),
+ INTEL_FAM6_MODEL(0x1e),
+ INTEL_FAM6_MODEL(0x1f),
+ INTEL_FAM6_MODEL(0x25),
+ INTEL_FAM6_MODEL(0x2c),
+ INTEL_FAM6_MODEL(0x2f),
+ { }
+ };
+#undef INTEL_FAM6_MODEL
+
+ fix_needed = cpu_has_apic && !directed_eoi_enabled &&
+ x86_match_cpu(eoi_errata);
}
return (fix_needed && cpu_has_pending_apic_eoi());
@@ -664,7 +673,7 @@ static void acpi_processor_idle(void)
return;
}
- if ( (cx->type == ACPI_STATE_C3) && errata_c6_eoi_workaround() )
+ if ( (cx->type >= ACPI_STATE_C3) && errata_c6_eoi_workaround() )
cx = power->safe_state;
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -769,6 +769,9 @@ static void mwait_idle(void)
return;
}
+ if ((cx->type >= 3) && errata_c6_eoi_workaround())
+ cx = power->safe_state;
+
eax = cx->address;
cstate = ((eax >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
--- a/xen/include/asm-x86/cpuidle.h
+++ b/xen/include/asm-x86/cpuidle.h
@@ -26,4 +26,6 @@ void update_idle_stats(struct acpi_proce
void update_last_cx_stat(struct acpi_processor_power *,
struct acpi_processor_cx *, uint64_t);
+bool errata_c6_eoi_workaround(void);
+
#endif /* __X86_ASM_CPUIDLE_H__ */
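`errata_c6_eoi_workaround()` above uses a `-1` sentinel in a static `int8_t` to compute the answer once on first call and cache it. A minimal standalone sketch of that lazy-memoization idiom (the predicate and instrumentation counter are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static int compute_calls;  /* instrumentation for this sketch only */

/* Hypothetical stand-in for the CPU-model match in the real code. */
static bool model_is_affected(void)
{
    compute_calls++;
    return true;
}

/* Same sentinel idiom as the patch: -1 means "not yet computed";
 * the result (0 or 1) is cached in the static after the first call. */
static bool errata_workaround_needed(void)
{
    static int8_t fix_needed = -1;

    if (fix_needed == -1)
        fix_needed = model_is_affected();

    return fix_needed;
}
```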
++++++ 5ec7dcaa-x86-dont-enter-C6-with-in-service-intr.patch ++++++
# Commit fc44a7014cafe28b8c53eeaf6ac2a71f5bc8b815
# Date 2020-05-22 16:07:38 +0200
# Author Roger Pau Monné <roger.pau(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/idle: prevent entering C6 with in service interrupts on Intel
Apply a workaround for Intel errata BDX99, CLX30, SKX100, CFW125,
BDF104, BDH85, BDM135, KWB131: "A Pending Fixed Interrupt May Be
Dispatched Before an Interrupt of The Same Priority Completes".
Apply the errata to all server and client models (big cores) from
Broadwell to Cascade Lake. The workaround is grouped together with the
existing fix for errata AAJ72, and the eoi from the function name is
removed.
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -537,7 +537,7 @@ void trace_exit_reason(u32 *irq_traced)
}
}
-bool errata_c6_eoi_workaround(void)
+bool errata_c6_workaround(void)
{
static int8_t __read_mostly fix_needed = -1;
@@ -562,10 +562,40 @@ bool errata_c6_eoi_workaround(void)
INTEL_FAM6_MODEL(0x2f),
{ }
};
+ /*
+ * Errata BDX99, CLX30, SKX100, CFW125, BDF104, BDH85, BDM135, KWB131:
+ * A Pending Fixed Interrupt May Be Dispatched Before an Interrupt of
+ * The Same Priority Completes.
+ *
+ * Resuming from C6 Sleep-State, with Fixed Interrupts of the same
+ * priority queued (in the corresponding bits of the IRR and ISR APIC
+ * registers), the processor may dispatch the second interrupt (from
+ * the IRR bit) before the first interrupt has completed and written to
+ * the EOI register, causing the first interrupt to never complete.
+ */
+ static const struct x86_cpu_id isr_errata[] = {
+ /* Broadwell */
+ INTEL_FAM6_MODEL(0x47),
+ INTEL_FAM6_MODEL(0x3d),
+ INTEL_FAM6_MODEL(0x4f),
+ INTEL_FAM6_MODEL(0x56),
+ /* Skylake (client) */
+ INTEL_FAM6_MODEL(0x5e),
+ INTEL_FAM6_MODEL(0x4e),
+ /* {Sky/Cascade}lake (server) */
+ INTEL_FAM6_MODEL(0x55),
+ /* {Kaby/Coffee/Whiskey/Amber} Lake */
+ INTEL_FAM6_MODEL(0x9e),
+ INTEL_FAM6_MODEL(0x8e),
+ /* Cannon Lake */
+ INTEL_FAM6_MODEL(0x66),
+ { }
+ };
#undef INTEL_FAM6_MODEL
- fix_needed = cpu_has_apic && !directed_eoi_enabled &&
- x86_match_cpu(eoi_errata);
+ fix_needed = cpu_has_apic &&
+ ((!directed_eoi_enabled && x86_match_cpu(eoi_errata)) ||
+ x86_match_cpu(isr_errata));
}
return (fix_needed && cpu_has_pending_apic_eoi());
@@ -673,7 +703,7 @@ static void acpi_processor_idle(void)
return;
}
- if ( (cx->type >= ACPI_STATE_C3) && errata_c6_eoi_workaround() )
+ if ( (cx->type >= ACPI_STATE_C3) && errata_c6_workaround() )
cx = power->safe_state;
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -769,7 +769,7 @@ static void mwait_idle(void)
return;
}
- if ((cx->type >= 3) && errata_c6_eoi_workaround())
+ if ((cx->type >= 3) && errata_c6_workaround())
cx = power->safe_state;
eax = cx->address;
--- a/xen/include/asm-x86/cpuidle.h
+++ b/xen/include/asm-x86/cpuidle.h
@@ -26,6 +26,6 @@ void update_idle_stats(struct acpi_proce
void update_last_cx_stat(struct acpi_processor_power *,
struct acpi_processor_cx *, uint64_t);
-bool errata_c6_eoi_workaround(void);
+bool errata_c6_workaround(void);
#endif /* __X86_ASM_CPUIDLE_H__ */
++++++ 5ec7dcf6-x86-dont-enter-C3-C6-with-errata.patch ++++++
# Commit b2d502466547e6782ccadd501b8ef1482c391f2c
# Date 2020-05-22 16:08:54 +0200
# Author Roger Pau Monné <roger.pau(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/idle: prevent entering C3/C6 on some Intel CPUs due to errata
Apply a workaround for errata BA80, AAK120, AAM108, AAO67, BD59,
AAY54: Rapid Core C3/C6 Transition May Cause Unpredictable System
Behavior.
Limit maximum C state to C1 when SMT is enabled on the affected CPUs.
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -297,6 +297,41 @@ static void early_init_intel(struct cpui
}
/*
+ * Errata BA80, AAK120, AAM108, AAO67, BD59, AAY54: Rapid Core C3/C6 Transition
+ * May Cause Unpredictable System Behavior
+ *
+ * Under a complex set of internal conditions, cores rapidly performing C3/C6
+ * transitions in a system with Intel Hyper-Threading Technology enabled may
+ * cause a machine check error (IA32_MCi_STATUS.MCACOD = 0x0106), system hang
+ * or unpredictable system behavior.
+ */
+static void probe_c3_errata(const struct cpuinfo_x86 *c)
+{
+#define INTEL_FAM6_MODEL(m) { X86_VENDOR_INTEL, 6, m, X86_FEATURE_ALWAYS }
+ static const struct x86_cpu_id models[] = {
+ /* Nehalem */
+ INTEL_FAM6_MODEL(0x1a),
+ INTEL_FAM6_MODEL(0x1e),
+ INTEL_FAM6_MODEL(0x1f),
+ INTEL_FAM6_MODEL(0x2e),
+ /* Westmere (note Westmere-EX is not affected) */
+ INTEL_FAM6_MODEL(0x2c),
+ INTEL_FAM6_MODEL(0x25),
+ { }
+ };
+#undef INTEL_FAM6_MODEL
+
+ /* Serialized by the AP bringup code. */
+ if ( max_cstate > 1 && (c->apicid & (c->x86_num_siblings - 1)) &&
+ x86_match_cpu(models) )
+ {
+ printk(XENLOG_WARNING
+ "Disabling C-states C3 and C6 due to CPU errata\n");
+ max_cstate = 1;
+ }
+}
+
+/*
* P4 Xeon errata 037 workaround.
* Hardware prefetcher may cause stale data to be loaded into the cache.
*
@@ -323,6 +358,8 @@ static void Intel_errata_workarounds(str
if (cpu_has_tsx_force_abort && opt_rtm_abort)
wrmsrl(MSR_TSX_FORCE_ABORT, TSX_FORCE_ABORT_RTM);
+
+ probe_c3_errata(c);
}
++++++ 5ec82237-x86-extend-ISR-C6-workaround-to-Haswell.patch ++++++
# Commit b72d8870b5f68f06b083e6bfdb28f081bcb6ab3b
# Date 2020-05-22 20:04:23 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/idle: Extend ISR/C6 erratum workaround to Haswell
This bug was first discovered against Haswell. It is definitely affected.
(The XenServer ticket for this bug was opened on 2013-05-30 which is coming up
on 7 years old, and predates Broadwell).
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Acked-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -572,8 +572,16 @@ bool errata_c6_workaround(void)
* registers), the processor may dispatch the second interrupt (from
* the IRR bit) before the first interrupt has completed and written to
* the EOI register, causing the first interrupt to never complete.
+ *
+ * Note: Haswell hasn't had errata issued, but this issue was first
+ * discovered on Haswell hardware, and is affected.
*/
static const struct x86_cpu_id isr_errata[] = {
+ /* Haswell */
+ INTEL_FAM6_MODEL(0x3c),
+ INTEL_FAM6_MODEL(0x3f),
+ INTEL_FAM6_MODEL(0x45),
+ INTEL_FAM6_MODEL(0x46),
/* Broadwell */
INTEL_FAM6_MODEL(0x47),
INTEL_FAM6_MODEL(0x3d),
++++++ 5ece1b91-x86-clear-RDRAND-CPUID-bit-on-AMD-fam-15-16.patch ++++++
# Commit 93401e28a84b9dc5945f5d0bf5bce68e9d5ee121
# Date 2020-05-27 09:49:37 +0200
# Author Jan Beulich <jbeulich(a)suse.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86: clear RDRAND CPUID bit on AMD family 15h/16h
Inspired by Linux commit c49a0a80137c7ca7d6ced4c812c9e07a949f6f24:
There have been reports of RDRAND issues after resuming from suspend on
some AMD family 15h and family 16h systems. This issue stems from a BIOS
not performing the proper steps during resume to ensure RDRAND continues
to function properly.
Update the CPU initialization to clear the RDRAND CPUID bit for any family
15h and 16h processor that supports RDRAND. If it is known that the family
15h or family 16h system does not have an RDRAND resume issue or that the
system will not be placed in suspend, the "cpuid=rdrand" kernel parameter
can be used to stop the clearing of the RDRAND CPUID bit.
Note that clearing the RDRAND CPUID bit does not prevent a processor
that normally supports the RDRAND instruction from executing it. So any
code that determined the support based on family and model won't #UD.
Warn if no explicit choice was given on affected hardware.
Check RDRAND functions at boot as well as after S3 resume (the retry
limit chosen is entirely arbitrary).
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Roger Pau Monné <roger.pau(a)citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -488,6 +488,10 @@ The Speculation Control hardware feature
be ignored, e.g. `no-ibrsb`, at which point Xen won't use them itself, and
won't offer them to guests.
+`rdrand` can be used to override the default disabling of the feature on certain
+AMD systems. Its negative form can of course also be used to suppress use and
+exposure of the feature.
+
### cpuid_mask_cpu
> `= fam_0f_rev_[cdefg] | fam_10_rev_[bc] | fam_11_rev_b`
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -3,6 +3,7 @@
#include <xen/mm.h>
#include <xen/smp.h>
#include <xen/pci.h>
+#include <xen/warning.h>
#include <asm/io.h>
#include <asm/msr.h>
#include <asm/processor.h>
@@ -645,6 +646,26 @@ static void init_amd(struct cpuinfo_x86
if (acpi_smi_cmd && (acpi_enable_value | acpi_disable_value))
amd_acpi_c1e_quirk = true;
break;
+
+ case 0x15: case 0x16:
+ /*
+ * There are some Fam15/Fam16 systems where upon resume from S3
+ * firmware fails to re-setup properly functioning RDRAND.
+ * By the time we can spot the problem, it is too late to take
+ * action, and there is nothing Xen can do to repair the problem.
+ * Clear the feature unless force-enabled on the command line.
+ */
+ if (c == &boot_cpu_data &&
+ cpu_has(c, X86_FEATURE_RDRAND) &&
+ !is_forced_cpu_cap(X86_FEATURE_RDRAND)) {
+ static const char __initconst text[] =
+ "RDRAND may cease to work on this hardware upon resume from S3.\n"
+ "Please choose an explicit cpuid={no-}rdrand setting.\n";
+
+ setup_clear_cpu_cap(X86_FEATURE_RDRAND);
+ warning_add(text);
+ }
+ break;
}
display_cacheinfo(c);
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -10,6 +10,7 @@
#include <asm/io.h>
#include <asm/mpspec.h>
#include <asm/apic.h>
+#include <asm/random.h>
#include <asm/setup.h>
#include <mach_apic.h>
#include <public/sysctl.h> /* for XEN_INVALID_{SOCKET,CORE}_ID */
@@ -97,6 +98,11 @@ void __init setup_force_cpu_cap(unsigned
__set_bit(cap, boot_cpu_data.x86_capability);
}
+bool __init is_forced_cpu_cap(unsigned int cap)
+{
+ return test_bit(cap, forced_caps);
+}
+
static void default_init(struct cpuinfo_x86 * c)
{
/* Not much we can do here... */
@@ -496,6 +502,27 @@ void identify_cpu(struct cpuinfo_x86 *c)
printk("\n");
#endif
+ /*
+ * If RDRAND is available, make an attempt to check that it actually
+ * (still) works.
+ */
+ if (cpu_has(c, X86_FEATURE_RDRAND)) {
+ unsigned int prev = 0;
+
+ for (i = 0; i < 5; ++i)
+ {
+ unsigned int cur = arch_get_random();
+
+ if (prev && cur != prev)
+ break;
+ prev = cur;
+ }
+
+ if (i >= 5)
+ printk(XENLOG_WARNING "CPU%u: RDRAND appears to not work\n",
+ smp_processor_id());
+ }
+
if (system_state == SYS_STATE_resume)
return;
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -67,6 +67,9 @@ static int __init parse_xen_cpuid(const
{
if ( !val )
setup_clear_cpu_cap(mid->bit);
+ else if ( mid->bit == X86_FEATURE_RDRAND &&
+ (cpuid_ecx(1) & cpufeat_mask(X86_FEATURE_RDRAND)) )
+ setup_force_cpu_cap(X86_FEATURE_RDRAND);
mid = NULL;
}
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -166,6 +166,7 @@ extern const struct x86_cpu_id *x86_matc
extern void identify_cpu(struct cpuinfo_x86 *);
extern void setup_clear_cpu_cap(unsigned int);
extern void setup_force_cpu_cap(unsigned int);
+extern bool is_forced_cpu_cap(unsigned int);
extern void print_cpu_info(unsigned int cpu);
extern unsigned int init_intel_cacheinfo(struct cpuinfo_x86 *c);
++++++ 5ece8ac4-x86-load_system_tables-NMI-MC-safe.patch ++++++
# Commit 9f3e9139fa6c3d620eb08dff927518fc88200b8d
# Date 2020-05-27 16:44:04 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/boot: Fix load_system_tables() to be NMI/#MC-safe
During boot, load_system_tables() is used in reinit_bsp_stack() to switch the
virtual addresses used from their .data/.bss alias, to their directmap alias.
The structure assignment is implemented as a memset() to zero first, then a
copy-in of the new data. This causes the NMI/#MC stack pointers to
transiently become 0, at a point where we may have an NMI watchdog running.
Rewrite the logic using a volatile tss pointer (equivalent to, but more
readable than, using ACCESS_ONCE() for all writes).
This does drop the zeroing side effect for holes in the structure, but the
backing memory for the TSS is fully zeroed anyway, and architecturally, they
are all reserved.
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -729,11 +729,12 @@ static cpumask_t cpu_initialized;
*/
void load_system_tables(void)
{
- unsigned int cpu = smp_processor_id();
+ unsigned int i, cpu = smp_processor_id();
unsigned long stack_bottom = get_stack_bottom(),
stack_top = stack_bottom & ~(STACK_SIZE - 1);
- struct tss64 *tss = &this_cpu(tss_page).tss;
+ /* The TSS may be live. Disuade any clever optimisations. */
+ volatile struct tss64 *tss = &this_cpu(tss_page).tss;
seg_desc_t *gdt =
this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
seg_desc_t *compat_gdt =
@@ -748,30 +749,26 @@ void load_system_tables(void)
.limit = (IDT_ENTRIES * sizeof(idt_entry_t)) - 1,
};
- *tss = (struct tss64){
- /* Main stack for interrupts/exceptions. */
- .rsp0 = stack_bottom,
-
- /* Ring 1 and 2 stacks poisoned. */
- .rsp1 = 0x8600111111111111ul,
- .rsp2 = 0x8600111111111111ul,
-
- /*
- * MCE, NMI and Double Fault handlers get their own stacks.
- * All others poisoned.
- */
- .ist = {
- [IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE,
- [IST_DF - 1] = stack_top + IST_DF * PAGE_SIZE,
- [IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE,
- [IST_DB - 1] = stack_top + IST_DB * PAGE_SIZE,
-
- [IST_MAX ... ARRAY_SIZE(tss->ist) - 1] =
- 0x8600111111111111ul,
- },
-
- .bitmap = IOBMP_INVALID_OFFSET,
- };
+ /*
+ * Set up the TSS. Warning - may be live, and the NMI/#MC must remain
+ * valid on every instruction boundary. (Note: these are all
+ * semantically ACCESS_ONCE() due to tss's volatile qualifier.)
+ *
+ * rsp0 refers to the primary stack. #MC, #DF, NMI and #DB handlers
+ * each get their own stacks. No IO Bitmap.
+ */
+ tss->rsp0 = stack_bottom;
+ tss->ist[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE;
+ tss->ist[IST_DF - 1] = stack_top + IST_DF * PAGE_SIZE;
+ tss->ist[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE;
+ tss->ist[IST_DB - 1] = stack_top + IST_DB * PAGE_SIZE;
+ tss->bitmap = IOBMP_INVALID_OFFSET;
+
+ /* All other stack pointers poisioned. */
+ for ( i = IST_MAX; i < ARRAY_SIZE(tss->ist); ++i )
+ tss->ist[i] = 0x8600111111111111ul;
+ tss->rsp1 = 0x8600111111111111ul;
+ tss->rsp2 = 0x8600111111111111ul;
BUILD_BUG_ON(sizeof(*tss) <= 0x67); /* Mandated by the architecture. */
++++++ 5ed69804-x86-ucode-fix-start-end-update.patch ++++++
# Commit 3659f54e9bd31f0f59268402fd67fb4b4118e184
# Date 2020-06-02 19:18:44 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/ucode: Fix errors with start/end_update()
c/s 9267a439c "x86/ucode: Document the behaviour of the microcode_ops hooks"
identified several poor behaviours of the start_update()/end_update_percpu()
hooks.
AMD have subsequently confirmed that OSVW don't, and are not expected to,
change across a microcode load, rendering all of this complexity unnecessary.
Instead of fixing up the logic to not leave the OSVW state reset in a number
of corner cases, delete the logic entirely.
This in turn allows for the removal of the poorly-named 'start_update'
parameter to microcode_update_one(), and for svm_host_osvw_{init,reset}() to
become static.
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/acpi/power.c
+++ b/xen/arch/x86/acpi/power.c
@@ -286,7 +286,7 @@ static int enter_state(u32 state)
console_end_sync();
watchdog_enable();
- microcode_update_one(true);
+ microcode_update_one();
if ( !recheck_cpu_features(0) )
panic("Missing previously available feature(s)\n");
--- a/xen/arch/x86/microcode_amd.c
+++ b/xen/arch/x86/microcode_amd.c
@@ -24,7 +24,6 @@
#include <asm/msr.h>
#include <asm/processor.h>
#include <asm/microcode.h>
-#include <asm/hvm/svm/svm.h>
#define pr_debug(x...) ((void)0)
@@ -590,27 +589,10 @@ static struct microcode_patch *cpu_reque
return patch;
}
-#ifdef CONFIG_HVM
-static int start_update(void)
-{
- /*
- * svm_host_osvw_init() will be called on each cpu by calling '.end_update'
- * in common code.
- */
- svm_host_osvw_reset();
-
- return 0;
-}
-#endif
-
static const struct microcode_ops microcode_amd_ops = {
.cpu_request_microcode = cpu_request_microcode,
.collect_cpu_info = collect_cpu_info,
.apply_microcode = apply_microcode,
-#ifdef CONFIG_HVM
- .start_update = start_update,
- .end_update_percpu = svm_host_osvw_init,
-#endif
.free_patch = free_patch,
.compare_patch = compare_patch,
.match_cpu = match_cpu,
--- a/xen/arch/x86/microcode.c
+++ b/xen/arch/x86/microcode.c
@@ -578,9 +578,6 @@ static int do_microcode_update(void *pat
else
ret = secondary_thread_fn();
- if ( microcode_ops->end_update_percpu )
- microcode_ops->end_update_percpu();
-
return ret;
}
@@ -652,16 +649,6 @@ static long microcode_update_helper(void
}
spin_unlock(&microcode_mutex);
- if ( microcode_ops->start_update )
- {
- ret = microcode_ops->start_update();
- if ( ret )
- {
- microcode_free_patch(patch);
- goto put;
- }
- }
-
cpumask_clear(&cpu_callin_map);
atomic_set(&cpu_out, 0);
atomic_set(&cpu_updated, 0);
@@ -760,28 +747,14 @@ static int __init microcode_init(void)
__initcall(microcode_init);
/* Load a cached update to current cpu */
-int microcode_update_one(bool start_update)
+int microcode_update_one(void)
{
- int err;
-
if ( !microcode_ops )
return -EOPNOTSUPP;
microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
- if ( start_update && microcode_ops->start_update )
- {
- err = microcode_ops->start_update();
- if ( err )
- return err;
- }
-
- err = microcode_update_cpu(NULL);
-
- if ( microcode_ops->end_update_percpu )
- microcode_ops->end_update_percpu();
-
- return err;
+ return microcode_update_cpu(NULL);
}
/* BSP calls this function to parse ucode blob and then apply an update. */
@@ -825,7 +798,7 @@ int __init early_microcode_update_cpu(vo
spin_unlock(&microcode_mutex);
ASSERT(rc);
- return microcode_update_one(true);
+ return microcode_update_one();
}
int __init early_microcode_init(void)
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1082,7 +1082,7 @@ static void svm_guest_osvw_init(struct d
spin_unlock(&osvw_lock);
}
-void svm_host_osvw_reset()
+static void svm_host_osvw_reset(void)
{
spin_lock(&osvw_lock);
@@ -1092,7 +1092,7 @@ void svm_host_osvw_reset()
spin_unlock(&osvw_lock);
}
-void svm_host_osvw_init()
+static void svm_host_osvw_init(void)
{
spin_lock(&osvw_lock);
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -358,7 +358,7 @@ void start_secondary(void *unused)
initialize_cpu_data(cpu);
- microcode_update_one(false);
+ microcode_update_one();
/*
* If MSR_SPEC_CTRL is available, apply Xen's default setting and discard
--- a/xen/include/asm-x86/hvm/svm/svm.h
+++ b/xen/include/asm-x86/hvm/svm/svm.h
@@ -93,9 +93,6 @@ extern u32 svm_feature_flags;
#define DEFAULT_TSC_RATIO 0x0000000100000000ULL
#define TSC_RATIO_RSVD_BITS 0xffffff0000000000ULL
-extern void svm_host_osvw_reset(void);
-extern void svm_host_osvw_init(void);
-
/* EXITINFO1 fields on NPT faults */
#define _NPT_PFEC_with_gla 32
#define NPT_PFEC_with_gla (1UL<<_NPT_PFEC_with_gla)
--- a/xen/include/asm-x86/microcode.h
+++ b/xen/include/asm-x86/microcode.h
@@ -24,8 +24,6 @@ struct microcode_ops {
size_t size);
int (*collect_cpu_info)(struct cpu_signature *csig);
int (*apply_microcode)(const struct microcode_patch *patch);
- int (*start_update)(void);
- void (*end_update_percpu)(void);
void (*free_patch)(void *mc);
bool (*match_cpu)(const struct microcode_patch *patch);
enum microcode_match_result (*compare_patch)(
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -586,7 +586,7 @@ void microcode_set_module(unsigned int);
int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void), unsigned long len);
int early_microcode_update_cpu(void);
int early_microcode_init(void);
-int microcode_update_one(bool start_update);
+int microcode_update_one(void);
int microcode_init_intel(void);
int microcode_init_amd(void);
++++++ 5eda60cb-SVM-split-recalc-NPT-fault-handling.patch ++++++
# Commit 51ca66c37371b10b378513af126646de22eddb17
# Date 2020-06-05 17:12:11 +0200
# Author Igor Druzhinin <igor.druzhinin(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/svm: do not try to handle recalc NPT faults immediately
A recalculation NPT fault doesn't always require additional handling
in hvm_hap_nested_page_fault(); moreover, in the general case, if there is
no explicit handling done there the fault is wrongly considered fatal.
This covers a specific case of migration with vGPU assigned which
uses direct MMIO mappings made by XEN_DOMCTL_memory_mapping hypercall:
at a moment log-dirty is enabled globally, recalculation is requested
for the whole guest memory including those mapped MMIO regions
which causes a page fault being raised at the first access to them;
but due to MMIO P2M type not having any explicit handling in
hvm_hap_nested_page_fault() a domain is erroneously crashed with unhandled
SVM violation.
Instead of trying to be opportunistic, use a safer approach and handle
P2M recalculation in a separate NPT fault by attempting to retry after
making the necessary adjustments. This is aligned with Intel behavior
where there are separate VMEXITs for recalculation and EPT violations
(faults) and only faults are handled in hvm_hap_nested_page_fault().
Do it by also unifying do_recalc return code with Intel implementation
where returning 1 means P2M was actually changed.
Since there was no case previously where p2m_pt_handle_deferred_changes()
could return a positive value - it's safe to replace ">= 0" with just "== 0"
in VMEXIT_NPF handler. finish_type_change() is also not affected by the
change as being able to deal with >0 return value of p2m->recalc from
EPT implementation.
Signed-off-by: Igor Druzhinin <igor.druzhinin(a)citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2947,9 +2947,10 @@ void svm_vmexit_handler(struct cpu_user_
v->arch.hvm.svm.cached_insn_len = vmcb->guest_ins_len & 0xf;
rc = vmcb->exitinfo1 & PFEC_page_present
? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
- if ( rc >= 0 )
+ if ( rc == 0 )
+ /* If no recal adjustments were being made - handle this fault */
svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
- else
+ else if ( rc < 0 )
{
printk(XENLOG_G_ERR
"%pv: Error %d handling NPF (gpa=%08lx ec=%04lx)\n",
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -341,6 +341,7 @@ static int do_recalc(struct p2m_domain *
unsigned int level = 4;
l1_pgentry_t *pent;
int err = 0;
+ bool recalc_done = false;
table = map_domain_page(pagetable_get_mfn(p2m_get_pagetable(p2m)));
while ( --level )
@@ -402,6 +403,8 @@ static int do_recalc(struct p2m_domain *
clear_recalc(l1, e);
err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
ASSERT(!err);
+
+ recalc_done = true;
}
}
unmap_domain_page((void *)((unsigned long)pent & PAGE_MASK));
@@ -448,12 +451,14 @@ static int do_recalc(struct p2m_domain *
clear_recalc(l1, e);
err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
ASSERT(!err);
+
+ recalc_done = true;
}
out:
unmap_domain_page(table);
- return err;
+ return err ?: recalc_done;
}
int p2m_pt_handle_deferred_changes(uint64_t gpa)
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1194,7 +1194,7 @@ static int finish_type_change(struct p2m
rc = p2m->recalc(p2m, gfn);
/*
* ept->recalc could return 0/1/-ENOMEM. pt->recalc could return
- * 0/-ENOMEM/-ENOENT, -ENOENT isn't an error as we are looping
+ * 0/1/-ENOMEM/-ENOENT, -ENOENT isn't an error as we are looping
* gfn here. If rc is 1 we need to have it 0 for success.
*/
if ( rc == -ENOENT || rc > 0 )
++++++ 5edf6ad8-ioreq-pending-emulation-server-destruction-race.patch ++++++
# Commit f7039ee41b3d3448775a1623f230037fd0455104
# Date 2020-06-09 12:56:24 +0200
# Author Paul Durrant <pdurrant(a)amazon.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
ioreq: handle pending emulation racing with ioreq server destruction
When an emulation request is initiated in hvm_send_ioreq() the guest vcpu is
blocked on an event channel until that request is completed. If, however,
the emulator is killed whilst that emulation is pending then the ioreq
server may be destroyed. Thus when the vcpu is awoken the code in
handle_hvm_io_completion() will find no pending request to wait for, but will
leave the internal vcpu io_req.state set to IOREQ_READY and the vcpu shutdown
deferral flag in place (because hvm_io_assist() will never be called). The
emulation request is then completed anyway. This means that any subsequent call
to hvmemul_do_io() will find an unexpected value in io_req.state and will
return X86EMUL_UNHANDLEABLE, which in some cases will result in continuous
re-tries.
This patch fixes the issue by moving the setting of io_req.state and clearing
of shutdown deferral (as well as MSI-X write completion) out of hvm_io_assist()
and directly into handle_hvm_io_completion().
Reported-by: Marek Marczykowski-Górecki <marmarek(a)invisiblethingslab.com>
Signed-off-by: Paul Durrant <pdurrant(a)amazon.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -107,15 +107,7 @@ static void hvm_io_assist(struct hvm_ior
ioreq_t *ioreq = &v->arch.hvm.hvm_io.io_req;
if ( hvm_ioreq_needs_completion(ioreq) )
- {
- ioreq->state = STATE_IORESP_READY;
ioreq->data = data;
- }
- else
- ioreq->state = STATE_IOREQ_NONE;
-
- msix_write_completion(v);
- vcpu_end_shutdown_deferral(v);
sv->pending = false;
}
@@ -207,6 +199,12 @@ bool handle_hvm_io_completion(struct vcp
}
}
+ vio->io_req.state = hvm_ioreq_needs_completion(&vio->io_req) ?
+ STATE_IORESP_READY : STATE_IOREQ_NONE;
+
+ msix_write_completion(v);
+ vcpu_end_shutdown_deferral(v);
+
io_completion = vio->io_completion;
vio->io_completion = HVMIO_no_completion;
++++++ 5edfbbea-x86-spec-ctrl-CPUID-MSR-defs-for-SRBDS.patch ++++++
# Commit caab85ab58c0cdf74ab070a5de5c4df89f509ff3
# Date 2020-06-09 17:42:18 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/spec-ctrl: CPUID/MSR definitions for Special Register Buffer Data Sampling
This is part of XSA-320 / CVE-2020-0543
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
Acked-by: Wei Liu <wl(a)xen.org>
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -483,10 +483,10 @@ accounting for hardware capabilities as
Currently accepted:
-The Speculation Control hardware features `md-clear`, `ibrsb`, `stibp`, `ibpb`,
-`l1d-flush` and `ssbd` are used by default if available and applicable. They can
-be ignored, e.g. `no-ibrsb`, at which point Xen won't use them itself, and
-won't offer them to guests.
+The Speculation Control hardware features `srbds-ctrl`, `md-clear`, `ibrsb`,
+`stibp`, `ibpb`, `l1d-flush` and `ssbd` are used by default if available and
+applicable. They can be ignored, e.g. `no-ibrsb`, at which point Xen won't
+use them itself, and won't offer them to guests.
`rdrand` can be used to override the default disabling of the feature on certain
AMD systems. Its negative form can of course also be used to suppress use and
--- a/tools/libxl/libxl_cpuid.c
+++ b/tools/libxl/libxl_cpuid.c
@@ -213,6 +213,7 @@ int libxl_cpuid_parse_config(libxl_cpuid
{"avx512-4vnniw",0x00000007, 0, CPUID_REG_EDX, 2, 1},
{"avx512-4fmaps",0x00000007, 0, CPUID_REG_EDX, 3, 1},
+ {"srbds-ctrl", 0x00000007, 0, CPUID_REG_EDX, 9, 1},
{"md-clear", 0x00000007, 0, CPUID_REG_EDX, 10, 1},
{"cet-ibt", 0x00000007, 0, CPUID_REG_EDX, 20, 1},
{"ibrsb", 0x00000007, 0, CPUID_REG_EDX, 26, 1},
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -157,6 +157,7 @@ static const char *const str_7d0[32] =
[ 2] = "avx512_4vnniw", [ 3] = "avx512_4fmaps",
[ 4] = "fsrm",
+ /* 8 */ [ 9] = "srbds-ctrl",
[10] = "md-clear",
/* 12 */ [13] = "tsx-force-abort",
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -134,6 +134,7 @@ int guest_rdmsr(struct vcpu *v, uint32_t
/* Write-only */
case MSR_TSX_FORCE_ABORT:
case MSR_TSX_CTRL:
+ case MSR_MCU_OPT_CTRL:
case MSR_U_CET:
case MSR_S_CET:
case MSR_PL0_SSP ... MSR_INTERRUPT_SSP_TABLE:
@@ -288,6 +289,7 @@ int guest_wrmsr(struct vcpu *v, uint32_t
/* Read-only */
case MSR_TSX_FORCE_ABORT:
case MSR_TSX_CTRL:
+ case MSR_MCU_OPT_CTRL:
case MSR_U_CET:
case MSR_S_CET:
case MSR_PL0_SSP ... MSR_INTERRUPT_SSP_TABLE:
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -312,12 +312,13 @@ static void __init print_details(enum in
printk("Speculative mitigation facilities:\n");
/* Hardware features which pertain to speculative mitigations. */
- printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(_7d0 & cpufeat_mask(X86_FEATURE_IBRSB)) ? " IBRS/IBPB" : "",
(_7d0 & cpufeat_mask(X86_FEATURE_STIBP)) ? " STIBP" : "",
(_7d0 & cpufeat_mask(X86_FEATURE_L1D_FLUSH)) ? " L1D_FLUSH" : "",
(_7d0 & cpufeat_mask(X86_FEATURE_SSBD)) ? " SSBD" : "",
(_7d0 & cpufeat_mask(X86_FEATURE_MD_CLEAR)) ? " MD_CLEAR" : "",
+ (_7d0 & cpufeat_mask(X86_FEATURE_SRBDS_CTRL)) ? " SRBDS_CTRL" : "",
(e8b & cpufeat_mask(X86_FEATURE_IBPB)) ? " IBPB" : "",
(caps & ARCH_CAPS_IBRS_ALL) ? " IBRS_ALL" : "",
(caps & ARCH_CAPS_RDCL_NO) ? " RDCL_NO" : "",
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -179,6 +179,9 @@
#define MSR_IA32_VMX_TRUE_ENTRY_CTLS 0x490
#define MSR_IA32_VMX_VMFUNC 0x491
+#define MSR_MCU_OPT_CTRL 0x00000123
+#define MCU_OPT_CTRL_RNGDS_MITG_DIS (_AC(1, ULL) << 0)
+
#define MSR_U_CET 0x000006a0
#define MSR_S_CET 0x000006a2
#define MSR_PL0_SSP 0x000006a4
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -252,6 +252,7 @@ XEN_CPUFEATURE(IBPB, 8*32+12) /
/* Intel-defined CPU features, CPUID level 0x00000007:0.edx, word 9 */
XEN_CPUFEATURE(AVX512_4VNNIW, 9*32+ 2) /*A AVX512 Neural Network Instructions */
XEN_CPUFEATURE(AVX512_4FMAPS, 9*32+ 3) /*A AVX512 Multiply Accumulation Single Precision */
+XEN_CPUFEATURE(SRBDS_CTRL, 9*32+ 9) /* MSR_MCU_OPT_CTRL and RNGDS_MITG_DIS. */
XEN_CPUFEATURE(MD_CLEAR, 9*32+10) /*A VERW clears microarchitectural buffers */
XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
XEN_CPUFEATURE(CET_IBT, 9*32+20) /* CET - Indirect Branch Tracking */
++++++ 5edfbbea-x86-spec-ctrl-mitigate-SRBDS.patch ++++++
# Commit 6a49b9a7920c82015381740905582b666160d955
# Date 2020-06-09 17:42:18 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/spec-ctrl: Mitigate the Special Register Buffer Data Sampling sidechannel
See patch documentation and comments.
This is part of XSA-320 / CVE-2020-0543
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1995,7 +1995,7 @@ By default SSBD will be mitigated at run
### spec-ctrl (x86)
> `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
> bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
-> l1d-flush,branch-harden}=<bool> ]`
+> l1d-flush,branch-harden,srb-lock}=<bool> ]`
Controls for speculative execution sidechannel mitigations. By default, Xen
will pick the most appropriate mitigations based on compiled in support,
@@ -2072,6 +2072,12 @@ If Xen is compiled with `CONFIG_SPECULAT
speculation barriers to protect selected conditional branches. By default,
Xen will enable this mitigation.
+On hardware supporting SRBDS_CTRL, the `srb-lock=` option can be used to force
+or prevent Xen from protecting the Special Register Buffer from leaking stale
+data. By default, Xen will enable this mitigation, except on parts where MDS
+is fixed and TAA is fixed/mitigated (in which case, there is believed to be no
+way for an attacker to obtain the stale data).
+
### sync_console
> `= <boolean>`
--- a/xen/arch/x86/acpi/power.c
+++ b/xen/arch/x86/acpi/power.c
@@ -295,6 +295,9 @@ static int enter_state(u32 state)
ci->spec_ctrl_flags |= (default_spec_ctrl_flags & SCF_ist_wrmsr);
spec_ctrl_exit_idle(ci);
+ if ( boot_cpu_has(X86_FEATURE_SRBDS_CTRL) )
+ wrmsrl(MSR_MCU_OPT_CTRL, default_xen_mcu_opt_ctrl);
+
done:
spin_debug_enable();
local_irq_restore(flags);
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -361,12 +361,14 @@ void start_secondary(void *unused)
microcode_update_one();
/*
- * If MSR_SPEC_CTRL is available, apply Xen's default setting and discard
- * any firmware settings. Note: MSR_SPEC_CTRL may only become available
- * after loading microcode.
+ * If any speculative control MSRs are available, apply Xen's default
+ * settings. Note: These MSRs may only become available after loading
+ * microcode.
*/
if ( boot_cpu_has(X86_FEATURE_IBRSB) )
wrmsrl(MSR_SPEC_CTRL, default_xen_spec_ctrl);
+ if ( boot_cpu_has(X86_FEATURE_SRBDS_CTRL) )
+ wrmsrl(MSR_MCU_OPT_CTRL, default_xen_mcu_opt_ctrl);
tsx_init(); /* Needs microcode. May change HLE/RTM feature bits. */
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -65,6 +65,9 @@ static unsigned int __initdata l1d_maxph
static bool __initdata cpu_has_bug_msbds_only; /* => minimal HT impact. */
static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination. */
+static int8_t __initdata opt_srb_lock = -1;
+uint64_t __read_mostly default_xen_mcu_opt_ctrl;
+
static int __init parse_spec_ctrl(const char *s)
{
const char *ss;
@@ -112,6 +115,7 @@ static int __init parse_spec_ctrl(const
opt_ssbd = false;
opt_l1d_flush = 0;
opt_branch_harden = false;
+ opt_srb_lock = 0;
}
else if ( val > 0 )
rc = -EINVAL;
@@ -178,6 +182,8 @@ static int __init parse_spec_ctrl(const
opt_l1d_flush = val;
else if ( (val = parse_boolean("branch-harden", s, ss)) >= 0 )
opt_branch_harden = val;
+ else if ( (val = parse_boolean("srb-lock", s, ss)) >= 0 )
+ opt_srb_lock = val;
else
rc = -EINVAL;
@@ -341,7 +347,7 @@ static void __init print_details(enum in
"\n");
/* Settings for Xen's protection, irrespective of guests. */
- printk(" Xen settings: BTI-Thunk %s, SPEC_CTRL: %s%s%s, Other:%s%s%s%s\n",
+ printk(" Xen settings: BTI-Thunk %s, SPEC_CTRL: %s%s%s, Other:%s%s%s%s%s\n",
thunk == THUNK_NONE ? "N/A" :
thunk == THUNK_RETPOLINE ? "RETPOLINE" :
thunk == THUNK_LFENCE ? "LFENCE" :
@@ -352,6 +358,8 @@ static void __init print_details(enum in
(default_xen_spec_ctrl & SPEC_CTRL_SSBD) ? " SSBD+" : " SSBD-",
!(caps & ARCH_CAPS_TSX_CTRL) ? "" :
(opt_tsx & 1) ? " TSX+" : " TSX-",
+ !boot_cpu_has(X86_FEATURE_SRBDS_CTRL) ? "" :
+ opt_srb_lock ? " SRB_LOCK+" : " SRB_LOCK-",
opt_ibpb ? " IBPB" : "",
opt_l1d_flush ? " L1D_FLUSH" : "",
opt_md_clear_pv || opt_md_clear_hvm ? " VERW" : "",
@@ -1149,6 +1157,34 @@ void __init init_speculation_mitigations
tsx_init();
}
+ /* Calculate suitable defaults for MSR_MCU_OPT_CTRL */
+ if ( boot_cpu_has(X86_FEATURE_SRBDS_CTRL) )
+ {
+ uint64_t val;
+
+ rdmsrl(MSR_MCU_OPT_CTRL, val);
+
+ /*
+ * On some SRBDS-affected hardware, it may be safe to relax srb-lock
+ * by default.
+ *
+ * On parts which enumerate MDS_NO and not TAA_NO, TSX is the only way
+ * to access the Fill Buffer. If TSX isn't available (inc. SKU
+ * reasons on some models), or TSX is explicitly disabled, then there
+ * is no need for the extra overhead to protect RDRAND/RDSEED.
+ */
+ if ( opt_srb_lock == -1 &&
+ (caps & (ARCH_CAPS_MDS_NO|ARCH_CAPS_TAA_NO)) == ARCH_CAPS_MDS_NO &&
+ (!cpu_has_hle || ((caps & ARCH_CAPS_TSX_CTRL) && opt_tsx == 0)) )
+ opt_srb_lock = 0;
+
+ val &= ~MCU_OPT_CTRL_RNGDS_MITG_DIS;
+ if ( !opt_srb_lock )
+ val |= MCU_OPT_CTRL_RNGDS_MITG_DIS;
+
+ default_xen_mcu_opt_ctrl = val;
+ }
+
print_details(thunk, caps);
/*
@@ -1180,6 +1216,9 @@ void __init init_speculation_mitigations
wrmsrl(MSR_SPEC_CTRL, bsp_delay_spec_ctrl ? 0 : default_xen_spec_ctrl);
}
+
+ if ( boot_cpu_has(X86_FEATURE_SRBDS_CTRL) )
+ wrmsrl(MSR_MCU_OPT_CTRL, default_xen_mcu_opt_ctrl);
}
static void __init __maybe_unused build_assertions(void)
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -54,6 +54,8 @@ extern int8_t opt_pv_l1tf_hwdom, opt_pv_
*/
extern paddr_t l1tf_addr_mask, l1tf_safe_maddr;
+extern uint64_t default_xen_mcu_opt_ctrl;
+
static inline void init_shadow_spec_ctrl_state(void)
{
struct cpu_info *info = get_cpu_info();
++++++ 5ee24d0e-x86-spec-ctrl-document-SRBDS-workaround.patch ++++++
# Commit 7028534d8482d25860c4d1aa8e45f0b911abfc5a
# Date 2020-06-11 16:26:06 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/spec-ctrl: Update docs with SRBDS workaround
RDRAND/RDSEED can be hidden using cpuid= to mitigate SRBDS if microcode
isn't available.
This is part of XSA-320 / CVE-2020-0543.
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Acked-by: Julien Grall <jgrall(a)amazon.com>
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -481,16 +481,21 @@ choice of `dom0-kernel` is deprecated an
This option allows for fine tuning of the facilities Xen will use, after
accounting for hardware capabilities as enumerated via CPUID.
+Unless otherwise noted, options only have any effect in their negative form,
+to hide the named feature(s). Ignoring a feature using this mechanism will
+cause Xen not to use the feature, nor offer them as usable to guests.
+
Currently accepted:
The Speculation Control hardware features `srbds-ctrl`, `md-clear`, `ibrsb`,
`stibp`, `ibpb`, `l1d-flush` and `ssbd` are used by default if available and
-applicable. They can be ignored, e.g. `no-ibrsb`, at which point Xen won't
-use them itself, and won't offer them to guests.
+applicable. They can all be ignored.
-`rdrand` can be used to override the default disabling of the feature on certain
-AMD systems. Its negative form can of course also be used to suppress use and
-exposure of the feature.
+`rdrand` and `rdseed` can be ignored, as a mitigation to XSA-320 /
+CVE-2020-0543. The RDRAND feature is disabled by default on certain AMD
+systems, due to possible malfunctions after ACPI S3 suspend/resume. `rdrand`
+may be used in its positive form to override Xen's default behaviour on these
+systems, and make the feature fully usable.
### cpuid_mask_cpu
> `= fam_0f_rev_[cdefg] | fam_10_rev_[bc] | fam_11_rev_b`
++++++ 5ef44e0d-x86-PMTMR-use-FADT-flags.patch ++++++
# Commit f325d2477eef8229c47d97031d314629521c70ab
# Date 2020-06-25 09:11:09 +0200
# Author Grzegorz Uriasz <gorbak25(a)gmail.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/acpi: use FADT flags to determine the PMTMR width
On some computers the bit width of the PM Timer as reported
by ACPI is 32 bits when in fact the FADT flags report correctly
that the timer is 24 bits wide. On affected machines such as the
ASUS FX504GM and newer gaming laptops this results in the inability
to resume the machine from suspend. Without this patch suspend is
broken on affected machines and even if a machine manages to resume
correctly then the kernel time and xen timers are trashed.
Signed-off-by: Grzegorz Uriasz <gorbak25(a)gmail.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/acpi/boot.c
+++ b/xen/arch/x86/acpi/boot.c
@@ -473,10 +473,17 @@ static int __init acpi_parse_fadt(struct
#ifdef CONFIG_X86_PM_TIMER
/* detect the location of the ACPI PM Timer */
- if (fadt->header.revision >= FADT2_REVISION_ID) {
+ if (fadt->header.revision >= FADT2_REVISION_ID &&
+ fadt->xpm_timer_block.space_id == ACPI_ADR_SPACE_SYSTEM_IO) {
/* FADT rev. 2 */
- if (fadt->xpm_timer_block.space_id ==
- ACPI_ADR_SPACE_SYSTEM_IO) {
+ if (fadt->xpm_timer_block.access_width != 0 &&
+ ACPI_ACCESS_BIT_WIDTH(fadt->xpm_timer_block.access_width) != 32)
+ printk(KERN_WARNING PREFIX "PM-Timer has invalid access width(%u)\n",
+ fadt->xpm_timer_block.access_width);
+ else if (fadt->xpm_timer_block.bit_offset != 0)
+ printk(KERN_WARNING PREFIX "PM-Timer has invalid bit offset(%u)\n",
+ fadt->xpm_timer_block.bit_offset);
+ else {
pmtmr_ioport = fadt->xpm_timer_block.address;
pmtmr_width = fadt->xpm_timer_block.bit_width;
}
@@ -488,8 +495,12 @@ static int __init acpi_parse_fadt(struct
*/
if (!pmtmr_ioport) {
pmtmr_ioport = fadt->pm_timer_block;
- pmtmr_width = fadt->pm_timer_length == 4 ? 24 : 0;
+ pmtmr_width = fadt->pm_timer_length == 4 ? 32 : 0;
}
+ if (pmtmr_width < 32 && (fadt->flags & ACPI_FADT_32BIT_TIMER))
+ printk(KERN_WARNING PREFIX "PM-Timer is too short\n");
+ if (pmtmr_width > 24 && !(fadt->flags & ACPI_FADT_32BIT_TIMER))
+ pmtmr_width = 24;
if (pmtmr_ioport)
printk(KERN_INFO PREFIX "PM-Timer IO Port: %#x (%u bits)\n",
pmtmr_ioport, pmtmr_width);
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -452,16 +452,13 @@ static u64 read_pmtimer_count(void)
static s64 __init init_pmtimer(struct platform_timesource *pts)
{
u64 start;
- u32 count, target, mask = 0xffffff;
+ u32 count, target, mask;
- if ( !pmtmr_ioport || !pmtmr_width )
+ if ( !pmtmr_ioport || (pmtmr_width != 24 && pmtmr_width != 32) )
return 0;
- if ( pmtmr_width == 32 )
- {
- pts->counter_bits = 32;
- mask = 0xffffffff;
- }
+ pts->counter_bits = pmtmr_width;
+ mask = 0xffffffff >> (32 - pmtmr_width);
count = inl(pmtmr_ioport) & mask;
start = rdtsc_ordered();
@@ -481,7 +478,6 @@ static struct platform_timesource __init
.name = "ACPI PM Timer",
.frequency = ACPI_PM_FREQUENCY,
.read_counter = read_pmtimer_count,
- .counter_bits = 24,
.init = init_pmtimer
};
--- a/xen/include/acpi/acmacros.h
+++ b/xen/include/acpi/acmacros.h
@@ -122,6 +122,14 @@
#endif
/*
+ * Algorithm to obtain access bit or byte width.
+ * Can be used with access_width of struct acpi_generic_address and access_size of
+ * struct acpi_resource_generic_register.
+ */
+#define ACPI_ACCESS_BIT_WIDTH(size) (1 << ((size) + 2))
+#define ACPI_ACCESS_BYTE_WIDTH(size) (1 << ((size) - 1))
+
+/*
* Macros for moving data around to/from buffers that are possibly unaligned.
* If the hardware supports the transfer of unaligned data, just do the store.
* Otherwise, we have to move one byte at a time.
++++++ 5ef6156a-x86-disallow-access-to-PT-MSRs.patch ++++++
# Commit bcdfbb70fca579baa04f212c0936b77919bdae11
# Date 2020-06-26 16:34:02 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/msr: Disallow access to Processor Trace MSRs
We do not expose the feature to guests, so should disallow access to the
respective MSRs. For simplicity, drop the entire block of MSRs, not just the
subset which have been specified thus far.
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Wei Liu <wl(a)xen.org>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -135,6 +135,7 @@ int guest_rdmsr(struct vcpu *v, uint32_t
case MSR_TSX_FORCE_ABORT:
case MSR_TSX_CTRL:
case MSR_MCU_OPT_CTRL:
+ case MSR_RTIT_OUTPUT_BASE ... MSR_RTIT_ADDR_B(7):
case MSR_U_CET:
case MSR_S_CET:
case MSR_PL0_SSP ... MSR_INTERRUPT_SSP_TABLE:
@@ -290,6 +291,7 @@ int guest_wrmsr(struct vcpu *v, uint32_t
case MSR_TSX_FORCE_ABORT:
case MSR_TSX_CTRL:
case MSR_MCU_OPT_CTRL:
+ case MSR_RTIT_OUTPUT_BASE ... MSR_RTIT_ADDR_B(7):
case MSR_U_CET:
case MSR_S_CET:
case MSR_PL0_SSP ... MSR_INTERRUPT_SSP_TABLE:
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -182,6 +182,14 @@
#define MSR_MCU_OPT_CTRL 0x00000123
#define MCU_OPT_CTRL_RNGDS_MITG_DIS (_AC(1, ULL) << 0)
+#define MSR_RTIT_OUTPUT_BASE 0x00000560
+#define MSR_RTIT_OUTPUT_MASK 0x00000561
+#define MSR_RTIT_CTL 0x00000570
+#define MSR_RTIT_STATUS 0x00000571
+#define MSR_RTIT_CR3_MATCH 0x00000572
+#define MSR_RTIT_ADDR_A(n) (0x00000580 + (n) * 2)
+#define MSR_RTIT_ADDR_B(n) (0x00000581 + (n) * 2)
+
#define MSR_U_CET 0x000006a0
#define MSR_S_CET 0x000006a2
#define MSR_PL0_SSP 0x000006a4
++++++ 5efcb354-x86-protect-CALL-JMP-straight-line-speculation.patch ++++++
# Commit 3b7dab93f2401b08c673244c9ae0f92e08bd03ba
# Date 2020-07-01 17:01:24 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/spec-ctrl: Protect against CALL/JMP straight-line speculation
Some x86 CPUs speculatively execute beyond indirect CALL/JMP instructions.
With CONFIG_INDIRECT_THUNK / Retpolines, indirect CALL/JMP instructions are
converted to direct CALL/JMP's to __x86_indirect_thunk_REG(), leaving just a
handful of indirect JMPs implementing those stubs.
There is no architectural execution beyond an indirect JMP, so use INT3 as
recommended by vendors to halt speculative execution. This is shorter than
LFENCE (which would also work fine), and also shows up in logs if we do
unexpectedly execute them.
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -24,10 +24,12 @@
.macro IND_THUNK_LFENCE reg:req
lfence
jmp *%\reg
+ int3 /* Halt straight-line speculation */
.endm
.macro IND_THUNK_JMP reg:req
jmp *%\reg
+ int3 /* Halt straight-line speculation */
.endm
/*
++++++ 5f046c18-evtchn-dont-ignore-error-in-get_free_port.patch ++++++
# Commit 2e9c2bc292231823a3a021d2e0a9f1956bf00b3c
# Date 2020-07-07 14:35:36 +0200
# Author Julien Grall <jgrall(a)amazon.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
xen/common: event_channel: Don't ignore error in get_free_port()
Currently, get_free_port() assumes that the port has been allocated
when evtchn_allocate_port() does not return -EBUSY.
However, the function may return an error when:
- We exhausted all the event channels. This can happen if the limit
configured by the administrator for the guest ('max_event_channels'
in xl cfg) is higher than the ABI used by the guest. For instance,
if the guest is using 2L, the limit should not be higher than 4095.
- We cannot allocate memory (e.g. Xen has no more memory).
Users of get_free_port() (such as EVTCHNOP_alloc_unbound) will validly
assume the port is valid and will next call evtchn_from_port(). This
will result in a crash, as the memory backing the event channel structure
is not present.
Fixes: 368ae9a05fe ("xen/pvshim: forward evtchn ops between L0 Xen and L2 DomU")
Signed-off-by: Julien Grall <jgrall(a)amazon.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
---
xen/common/event_channel.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -195,10 +195,10 @@ static int get_free_port(struct domain *
{
int rc = evtchn_allocate_port(d, port);
- if ( rc == -EBUSY )
- continue;
-
- return port;
+ if ( rc == 0 )
+ return port;
+ else if ( rc != -EBUSY )
+ return rc;
}
return -ENOSPC;
++++++ 5f046c48-x86-shadow-dirty-VRAM-inverted-conditional.patch ++++++
# Commit 23a216f99d40fbfbc2318ade89d8213eea6ba1f8
# Date 2020-07-07 14:36:24 +0200
# Author Jan Beulich <jbeulich(a)suse.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/shadow: correct an inverted conditional in dirty VRAM tracking
This originally was "mfn_x(mfn) == INVALID_MFN". Make it like this
again, taking the opportunity to also drop the unnecessary nearby
braces.
This is XSA-319.
Fixes: 246a5a3377c2 ("xen: Use a typesafe to define INVALID_MFN")
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -3249,10 +3249,8 @@ int shadow_track_dirty_vram(struct domai
int dirty = 0;
paddr_t sl1ma = dirty_vram->sl1ma[i];
- if ( !mfn_eq(mfn, INVALID_MFN) )
- {
+ if ( mfn_eq(mfn, INVALID_MFN) )
dirty = 1;
- }
else
{
page = mfn_to_page(mfn);
++++++ 5f046c64-EPT-set_middle_entry-adjustments.patch ++++++
# Commit 1104288186ee73a7f9bfa41cbaa5bb7611521028
# Date 2020-07-07 14:36:52 +0200
# Author Jan Beulich <jbeulich(a)suse.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/EPT: ept_set_middle_entry() related adjustments
ept_split_super_page() wants to further modify the newly allocated
table, so have ept_set_middle_entry() return the mapped pointer rather
than tearing the mapping down only for it to be re-established right away.
Similarly ept_next_level() wants to hand back a mapped pointer of
the next level page, so re-use the one established by
ept_set_middle_entry() in case that path was taken.
Pull the setting of suppress_ve ahead of insertion into the higher level
table, and don't have ept_split_super_page() set the field a 2nd time.
This is part of XSA-328.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Roger Pau Monné <roger.pau(a)citrix.com>
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -187,8 +187,9 @@ static void ept_p2m_type_to_flags(struct
#define GUEST_TABLE_SUPER_PAGE 2
#define GUEST_TABLE_POD_PAGE 3
-/* Fill in middle levels of ept table */
-static int ept_set_middle_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry)
+/* Fill in middle level of ept table; return pointer to mapped new table. */
+static ept_entry_t *ept_set_middle_entry(struct p2m_domain *p2m,
+ ept_entry_t *ept_entry)
{
mfn_t mfn;
ept_entry_t *table;
@@ -196,7 +197,12 @@ static int ept_set_middle_entry(struct p
mfn = p2m_alloc_ptp(p2m, 0);
if ( mfn_eq(mfn, INVALID_MFN) )
- return 0;
+ return NULL;
+
+ table = map_domain_page(mfn);
+
+ for ( i = 0; i < EPT_PAGETABLE_ENTRIES; i++ )
+ table[i].suppress_ve = 1;
ept_entry->epte = 0;
ept_entry->mfn = mfn_x(mfn);
@@ -208,14 +214,7 @@ static int ept_set_middle_entry(struct p
ept_entry->suppress_ve = 1;
- table = map_domain_page(mfn);
-
- for ( i = 0; i < EPT_PAGETABLE_ENTRIES; i++ )
- table[i].suppress_ve = 1;
-
- unmap_domain_page(table);
-
- return 1;
+ return table;
}
/* free ept sub tree behind an entry */
@@ -253,10 +252,10 @@ static bool_t ept_split_super_page(struc
ASSERT(is_epte_superpage(ept_entry));
- if ( !ept_set_middle_entry(p2m, &new_ept) )
+ table = ept_set_middle_entry(p2m, &new_ept);
+ if ( !table )
return 0;
- table = map_domain_page(_mfn(new_ept.mfn));
trunk = 1UL << ((level - 1) * EPT_TABLE_ORDER);
for ( i = 0; i < EPT_PAGETABLE_ENTRIES; i++ )
@@ -267,7 +266,6 @@ static bool_t ept_split_super_page(struc
epte->sp = (level > 1);
epte->mfn += i * trunk;
epte->snp = is_iommu_enabled(p2m->domain) && iommu_snoop;
- epte->suppress_ve = 1;
ept_p2m_type_to_flags(p2m, epte, epte->sa_p2mt, epte->access);
@@ -306,8 +304,7 @@ static int ept_next_level(struct p2m_dom
ept_entry_t **table, unsigned long *gfn_remainder,
int next_level)
{
- unsigned long mfn;
- ept_entry_t *ept_entry, e;
+ ept_entry_t *ept_entry, *next = NULL, e;
u32 shift, index;
shift = next_level * EPT_TABLE_ORDER;
@@ -332,19 +329,17 @@ static int ept_next_level(struct p2m_dom
if ( read_only )
return GUEST_TABLE_MAP_FAILED;
- if ( !ept_set_middle_entry(p2m, ept_entry) )
+ next = ept_set_middle_entry(p2m, ept_entry);
+ if ( !next )
return GUEST_TABLE_MAP_FAILED;
- else
- e = atomic_read_ept_entry(ept_entry); /* Refresh */
+ /* e is now stale and hence may not be used anymore below. */
}
-
/* The only time sp would be set here is if we had hit a superpage */
- if ( is_epte_superpage(&e) )
+ else if ( is_epte_superpage(&e) )
return GUEST_TABLE_SUPER_PAGE;
- mfn = e.mfn;
unmap_domain_page(*table);
- *table = map_domain_page(_mfn(mfn));
+ *table = next ?: map_domain_page(_mfn(e.mfn));
*gfn_remainder &= (1UL << shift) - 1;
return GUEST_TABLE_NORMAL_PAGE;
}
++++++ 5f046c78-EPT-atomically-modify-ents-in-ept_next_level.patch ++++++
# Commit bc3d9f95d661372b059a5539ae6cb1e79435bb95
# Date 2020-07-07 14:37:12 +0200
# Author Roger Pau Monné <roger.pau(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/ept: atomically modify entries in ept_next_level
ept_next_level was passing a live PTE pointer to ept_set_middle_entry,
which was then modified without taking into account that the PTE could
be part of a live EPT table. This wasn't a security issue because the
pages returned by p2m_alloc_ptp are zeroed, so adding such an entry
before actually initializing it didn't allow a guest to access
physical memory addresses it wasn't supposed to access.
This is part of XSA-328.
Reported-by: Jan Beulich <jbeulich(a)suse.com>
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -307,6 +307,8 @@ static int ept_next_level(struct p2m_dom
ept_entry_t *ept_entry, *next = NULL, e;
u32 shift, index;
+ ASSERT(next_level);
+
shift = next_level * EPT_TABLE_ORDER;
index = *gfn_remainder >> shift;
@@ -323,16 +325,20 @@ static int ept_next_level(struct p2m_dom
if ( !is_epte_present(&e) )
{
+ int rc;
+
if ( e.sa_p2mt == p2m_populate_on_demand )
return GUEST_TABLE_POD_PAGE;
if ( read_only )
return GUEST_TABLE_MAP_FAILED;
- next = ept_set_middle_entry(p2m, ept_entry);
+ next = ept_set_middle_entry(p2m, &e);
if ( !next )
return GUEST_TABLE_MAP_FAILED;
- /* e is now stale and hence may not be used anymore below. */
+
+ rc = atomic_write_ept_entry(p2m, ept_entry, e, next_level);
+ ASSERT(rc == 0);
}
/* The only time sp would be set here is if we had hit a superpage */
else if ( is_epte_superpage(&e) )
++++++ 5f046c9a-VT-d-improve-IOMMU-TLB-flush.patch ++++++
# Commit 5fe515a0fede07543f2a3b049167b1fd8b873caf
# Date 2020-07-07 14:37:46 +0200
# Author Jan Beulich <jbeulich(a)suse.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
vtd: improve IOMMU TLB flush
Do not limit PSI flushes to order 0 pages, in order to avoid doing a
full TLB flush if the passed in page has an order greater than 0 and
is aligned. Should increase the performance of IOMMU TLB flushes when
dealing with page orders greater than 0.
This is part of XSA-321.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Roger Pau Monné <roger.pau(a)citrix.com>
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -570,13 +570,14 @@ static int __must_check iommu_flush_iotl
if ( iommu_domid == -1 )
continue;
- if ( page_count != 1 || dfn_eq(dfn, INVALID_DFN) )
+ if ( !page_count || (page_count & (page_count - 1)) ||
+ dfn_eq(dfn, INVALID_DFN) || !IS_ALIGNED(dfn_x(dfn), page_count) )
rc = iommu_flush_iotlb_dsi(iommu, iommu_domid,
0, flush_dev_iotlb);
else
rc = iommu_flush_iotlb_psi(iommu, iommu_domid,
dfn_to_daddr(dfn),
- PAGE_ORDER_4K,
+ get_order_from_pages(page_count),
!dma_old_pte_present,
flush_dev_iotlb);
++++++ 5f046cb5-VT-d-prune-rename-cache-flush-funcs.patch ++++++
# Commit 62298825b9a44f45761acbd758138b5ba059ebd1
# Date 2020-07-07 14:38:13 +0200
# Author Roger Pau Monné <roger.pau(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
vtd: prune (and rename) cache flush functions
Rename __iommu_flush_cache to iommu_sync_cache and remove
iommu_flush_cache_page. Also remove the iommu_flush_cache_entry
wrapper and just use iommu_sync_cache instead. Note the _entry suffix
was meaningless as the wrapper was already taking a size parameter in
bytes. While there also constify the addr parameter.
No functional change intended.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -43,8 +43,7 @@ void disable_qinval(struct vtd_iommu *io
int enable_intremap(struct vtd_iommu *iommu, int eim);
void disable_intremap(struct vtd_iommu *iommu);
-void iommu_flush_cache_entry(void *addr, unsigned int size);
-void iommu_flush_cache_page(void *addr, unsigned long npages);
+void iommu_sync_cache(const void *addr, unsigned int size);
int iommu_alloc(struct acpi_drhd_unit *drhd);
void iommu_free(struct acpi_drhd_unit *drhd);
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -230,7 +230,7 @@ static void free_remap_entry(struct vtd_
iremap_entries, iremap_entry);
update_irte(iommu, iremap_entry, &new_ire, false);
- iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
+ iommu_sync_cache(iremap_entry, sizeof(*iremap_entry));
iommu_flush_iec_index(iommu, 0, index);
unmap_vtd_domain_page(iremap_entries);
@@ -406,7 +406,7 @@ static int ioapic_rte_to_remap_entry(str
}
update_irte(iommu, iremap_entry, &new_ire, !init);
- iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
+ iommu_sync_cache(iremap_entry, sizeof(*iremap_entry));
iommu_flush_iec_index(iommu, 0, index);
unmap_vtd_domain_page(iremap_entries);
@@ -695,7 +695,7 @@ static int msi_msg_to_remap_entry(
update_irte(iommu, iremap_entry, &new_ire, msi_desc->irte_initialized);
msi_desc->irte_initialized = true;
- iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
+ iommu_sync_cache(iremap_entry, sizeof(*iremap_entry));
iommu_flush_iec_index(iommu, 0, index);
unmap_vtd_domain_page(iremap_entries);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -140,7 +140,8 @@ static int context_get_domain_id(struct
}
static int iommus_incoherent;
-static void __iommu_flush_cache(void *addr, unsigned int size)
+
+void iommu_sync_cache(const void *addr, unsigned int size)
{
int i;
static unsigned int clflush_size = 0;
@@ -155,16 +156,6 @@ static void __iommu_flush_cache(void *ad
cacheline_flush((char *)addr + i);
}
-void iommu_flush_cache_entry(void *addr, unsigned int size)
-{
- __iommu_flush_cache(addr, size);
-}
-
-void iommu_flush_cache_page(void *addr, unsigned long npages)
-{
- __iommu_flush_cache(addr, PAGE_SIZE * npages);
-}
-
/* Allocate page table, return its machine address */
uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node)
{
@@ -183,7 +174,7 @@ uint64_t alloc_pgtable_maddr(unsigned lo
vaddr = __map_domain_page(cur_pg);
memset(vaddr, 0, PAGE_SIZE);
- iommu_flush_cache_page(vaddr, 1);
+ iommu_sync_cache(vaddr, PAGE_SIZE);
unmap_domain_page(vaddr);
cur_pg++;
}
@@ -216,7 +207,7 @@ static u64 bus_to_context_maddr(struct v
}
set_root_value(*root, maddr);
set_root_present(*root);
- iommu_flush_cache_entry(root, sizeof(struct root_entry));
+ iommu_sync_cache(root, sizeof(struct root_entry));
}
maddr = (u64) get_context_addr(*root);
unmap_vtd_domain_page(root_entries);
@@ -263,7 +254,7 @@ static u64 addr_to_dma_page_maddr(struct
*/
dma_set_pte_readable(*pte);
dma_set_pte_writable(*pte);
- iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
+ iommu_sync_cache(pte, sizeof(struct dma_pte));
}
if ( level == 2 )
@@ -640,7 +631,7 @@ static int __must_check dma_pte_clear_on
*flush_flags |= IOMMU_FLUSHF_modified;
spin_unlock(&hd->arch.mapping_lock);
- iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
+ iommu_sync_cache(pte, sizeof(struct dma_pte));
unmap_vtd_domain_page(page);
@@ -679,7 +670,7 @@ static void iommu_free_page_table(struct
iommu_free_pagetable(dma_pte_addr(*pte), next_level);
dma_clear_pte(*pte);
- iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
+ iommu_sync_cache(pte, sizeof(struct dma_pte));
}
unmap_vtd_domain_page(pt_vaddr);
@@ -1400,7 +1391,7 @@ int domain_context_mapping_one(
context_set_address_width(*context, agaw);
context_set_fault_enable(*context);
context_set_present(*context);
- iommu_flush_cache_entry(context, sizeof(struct context_entry));
+ iommu_sync_cache(context, sizeof(struct context_entry));
spin_unlock(&iommu->lock);
/* Context entry was previously non-present (with domid 0). */
@@ -1564,7 +1555,7 @@ int domain_context_unmap_one(
context_clear_present(*context);
context_clear_entry(*context);
- iommu_flush_cache_entry(context, sizeof(struct context_entry));
+ iommu_sync_cache(context, sizeof(struct context_entry));
iommu_domid= domain_iommu_domid(domain, iommu);
if ( iommu_domid == -1 )
@@ -1791,7 +1782,7 @@ static int __must_check intel_iommu_map_
*pte = new;
- iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
+ iommu_sync_cache(pte, sizeof(struct dma_pte));
spin_unlock(&hd->arch.mapping_lock);
unmap_vtd_domain_page(page);
@@ -1866,7 +1857,7 @@ int iommu_pte_flush(struct domain *d, ui
int iommu_domid;
int rc = 0;
- iommu_flush_cache_entry(pte, sizeof(struct dma_pte));
+ iommu_sync_cache(pte, sizeof(struct dma_pte));
for_each_drhd_unit ( drhd )
{
@@ -2724,7 +2715,7 @@ static int __init intel_iommu_quarantine
dma_set_pte_addr(*pte, maddr);
dma_set_pte_readable(*pte);
}
- iommu_flush_cache_page(parent, 1);
+ iommu_sync_cache(parent, PAGE_SIZE);
unmap_vtd_domain_page(parent);
parent = map_vtd_domain_page(maddr);
++++++ 5f046cca-x86-IOMMU-introduce-cache-sync-hook.patch ++++++
# Commit 91526b460e5009fc56edbd6809e66c327281faba
# Date 2020-07-07 14:38:34 +0200
# Author Roger Pau Monné <roger.pau(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/iommu: introduce a cache sync hook
The hook is only implemented for VT-d and it uses the already existing
iommu_sync_cache function present in VT-d code. The new hook is
added so that the cache can be flushed by code outside of VT-d when
using shared page tables.
Note that alloc_pgtable_maddr must use the now locally defined
sync_cache function, because IOMMU ops are not yet set up the first
time the function gets called during IOMMU initialization.
No functional change intended.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -43,7 +43,6 @@ void disable_qinval(struct vtd_iommu *io
int enable_intremap(struct vtd_iommu *iommu, int eim);
void disable_intremap(struct vtd_iommu *iommu);
-void iommu_sync_cache(const void *addr, unsigned int size);
int iommu_alloc(struct acpi_drhd_unit *drhd);
void iommu_free(struct acpi_drhd_unit *drhd);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -141,7 +141,7 @@ static int context_get_domain_id(struct
static int iommus_incoherent;
-void iommu_sync_cache(const void *addr, unsigned int size)
+static void sync_cache(const void *addr, unsigned int size)
{
int i;
static unsigned int clflush_size = 0;
@@ -174,7 +174,7 @@ uint64_t alloc_pgtable_maddr(unsigned lo
vaddr = __map_domain_page(cur_pg);
memset(vaddr, 0, PAGE_SIZE);
- iommu_sync_cache(vaddr, PAGE_SIZE);
+ sync_cache(vaddr, PAGE_SIZE);
unmap_domain_page(vaddr);
cur_pg++;
}
@@ -2763,6 +2763,7 @@ const struct iommu_ops __initconstrel in
.iotlb_flush_all = iommu_flush_iotlb_all,
.get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
.dump_p2m_table = vtd_dump_p2m_table,
+ .sync_cache = sync_cache,
};
const struct iommu_init_ops __initconstrel intel_iommu_init_ops = {
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -121,6 +121,13 @@ extern bool untrusted_msi;
int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
const uint8_t gvec);
+#define iommu_sync_cache(addr, size) ({ \
+ const struct iommu_ops *ops = iommu_get_ops(); \
+ \
+ if ( ops->sync_cache ) \
+ iommu_vcall(ops, sync_cache, addr, size); \
+})
+
#endif /* !__ARCH_X86_IOMMU_H__ */
/*
* Local variables:
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -250,6 +250,7 @@ struct iommu_ops {
int (*setup_hpet_msi)(struct msi_desc *);
int (*adjust_irq_affinities)(void);
+ void (*sync_cache)(const void *addr, unsigned int size);
#endif /* CONFIG_X86 */
int __must_check (*suspend)(void);
++++++ 5f046ce9-VT-d-sync_cache-misaligned-addresses.patch ++++++
# Commit b6d9398144f21718d25daaf8d72669a75592abc5
# Date 2020-07-07 14:39:05 +0200
# Author Roger Pau Monné <roger.pau(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
vtd: don't assume addresses are aligned in sync_cache
Current code in sync_cache assumes that the address passed in is
aligned to a cache line size. Fix the code to support passing in
arbitrary addresses not necessarily aligned to a cache line size.
This is part of XSA-321.
Reported-by: Jan Beulich <jbeulich(a)suse.com>
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -143,8 +143,8 @@ static int iommus_incoherent;
static void sync_cache(const void *addr, unsigned int size)
{
- int i;
- static unsigned int clflush_size = 0;
+ static unsigned long clflush_size = 0;
+ const void *end = addr + size;
if ( !iommus_incoherent )
return;
@@ -152,8 +152,9 @@ static void sync_cache(const void *addr,
if ( clflush_size == 0 )
clflush_size = get_cache_line_size();
- for ( i = 0; i < size; i += clflush_size )
- cacheline_flush((char *)addr + i);
+ addr -= (unsigned long)addr & (clflush_size - 1);
+ for ( ; addr < end; addr += clflush_size )
+ cacheline_flush((char *)addr);
}
/* Allocate page table, return its machine address */
++++++ 5f046cfd-x86-introduce-alternative_2.patch ++++++
# Commit 23570bce00ee6ba2139ece978ab6f03ff166e21d
# Date 2020-07-07 14:39:25 +0200
# Author Roger Pau Monné <roger.pau(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/alternative: introduce alternative_2
It's based on alternative_io_2 without inputs or outputs but with an
added memory clobber.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Acked-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/include/asm-x86/alternative.h
+++ b/xen/include/asm-x86/alternative.h
@@ -114,6 +114,11 @@ extern void alternative_branches(void);
#define alternative(oldinstr, newinstr, feature) \
asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory")
+#define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
+ asm volatile (ALTERNATIVE_2(oldinstr, newinstr1, feature1, \
+ newinstr2, feature2) \
+ : : : "memory")
+
/*
* Alternative inline assembly with input.
*
++++++ 5f046d1a-VT-d-optimize-CPU-cache-sync.patch ++++++
# Commit a64ea16522a73a13a0d66cfa4b66a9d3b95dd9d6
# Date 2020-07-07 14:39:54 +0200
# Author Roger Pau Monné <roger.pau(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
vtd: optimize CPU cache sync
Some VT-d IOMMUs are non-coherent, which requires a cache write back
in order for the changes made by the CPU to be visible to the IOMMU.
This cache write back was unconditionally done using clflush, but there are
other more efficient instructions to do so, hence implement support
for them using the alternative framework.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -68,7 +68,6 @@ int __must_check qinval_device_iotlb_syn
u16 did, u16 size, u64 addr);
unsigned int get_cache_line_size(void);
-void cacheline_flush(char *);
void flush_all_cache(void);
uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -31,6 +31,7 @@
#include <xen/pci_regs.h>
#include <xen/keyhandler.h>
#include <asm/msi.h>
+#include <asm/nops.h>
#include <asm/irq.h>
#include <asm/hvm/vmx/vmx.h>
#include <asm/p2m.h>
@@ -154,7 +155,42 @@ static void sync_cache(const void *addr,
addr -= (unsigned long)addr & (clflush_size - 1);
for ( ; addr < end; addr += clflush_size )
- cacheline_flush((char *)addr);
+/*
+ * The arguments to a macro must not include preprocessor directives. Doing so
+ * results in undefined behavior, so we have to create some defines here in
+ * order to avoid it.
+ */
+#if defined(HAVE_AS_CLWB)
+# define CLWB_ENCODING "clwb %[p]"
+#elif defined(HAVE_AS_XSAVEOPT)
+# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */
+#else
+# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */
+#endif
+
+#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr))
+#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT)
+# define INPUT BASE_INPUT
+#else
+# define INPUT(addr) "a" (addr), BASE_INPUT(addr)
+#endif
+ /*
+ * Note regarding the use of NOP_DS_PREFIX: it's faster to do a clflush
+ * + prefix than a clflush + nop, and hence the prefix is added instead
+ * of letting the alternative framework fill the gap by appending nops.
+ */
+ alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %[p]",
+ "data16 clflush %[p]", /* clflushopt */
+ X86_FEATURE_CLFLUSHOPT,
+ CLWB_ENCODING,
+ X86_FEATURE_CLWB, /* no outputs */,
+ INPUT(addr));
+#undef INPUT
+#undef BASE_INPUT
+#undef CLWB_ENCODING
+
+ alternative_2("", "sfence", X86_FEATURE_CLFLUSHOPT,
+ "sfence", X86_FEATURE_CLWB);
}
/* Allocate page table, return its machine address */
--- a/xen/drivers/passthrough/vtd/x86/vtd.c
+++ b/xen/drivers/passthrough/vtd/x86/vtd.c
@@ -51,11 +51,6 @@ unsigned int get_cache_line_size(void)
return ((cpuid_ebx(1) >> 8) & 0xff) * 8;
}
-void cacheline_flush(char * addr)
-{
- clflush(addr);
-}
-
void flush_all_cache()
{
wbinvd();
++++++ 5f046d2b-EPT-flush-cache-when-modifying-PTEs.patch ++++++
# Commit c23274fd0412381bd75068ebc9f8f8c90a4be748
# Date 2020-07-07 14:40:11 +0200
# Author Roger Pau Monné <roger.pau(a)citrix.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/ept: flush cache when modifying PTEs and sharing page tables
Modifications made to the page tables by EPT code need to be written
to memory when the page tables are shared with the IOMMU, as Intel
IOMMUs can be non-coherent and thus require changes to be written to
memory in order to be visible to the IOMMU.
In order to achieve this make sure data is written back to memory
after writing an EPT entry when the recalc bit is not set in
atomic_write_ept_entry. If such bit is set, the entry will be
adjusted and atomic_write_ept_entry will be called a second time
without the recalc bit set. Note that when splitting a super page the
new tables resulting from the split should also be written back.
Failure to do so can allow devices behind the IOMMU access to the
stale super page, or cause coherency issues as changes made by the
processor to the page tables are not visible to the IOMMU.
This allows removing the VT-d-specific iommu_pte_flush helper, since
the cache write back is now performed by atomic_write_ept_entry, and
hence iommu_iotlb_flush can be used to flush the IOMMU TLB. The newly
used method (iommu_iotlb_flush) can result in fewer flushes, since it
might sometimes legitimately be called with 0 flags, in which case it
becomes a no-op.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -58,6 +58,19 @@ static int atomic_write_ept_entry(struct
write_atomic(&entryptr->epte, new.epte);
+ /*
+ * The recalc field on the EPT is used to signal either that a
+ * recalculation of the EMT field is required (which doesn't affect the
+ * IOMMU), or a type change. Type changes can only be between ram_rw,
+ * logdirty and ioreq_server: changes to/from logdirty won't work well with
+ * an IOMMU anyway, as IOMMU #PFs are not synchronous and will lead to
+ * aborts, and changes to/from ioreq_server are already fully flushed
+ * before returning to guest context (see
+ * XEN_DMOP_map_mem_type_to_ioreq_server).
+ */
+ if ( !new.recalc && iommu_use_hap_pt(p2m->domain) )
+ iommu_sync_cache(entryptr, sizeof(*entryptr));
+
return 0;
}
@@ -278,6 +291,9 @@ static bool_t ept_split_super_page(struc
break;
}
+ if ( iommu_use_hap_pt(p2m->domain) )
+ iommu_sync_cache(table, EPT_PAGETABLE_ENTRIES * sizeof(ept_entry_t));
+
unmap_domain_page(table);
/* Even failed we should install the newly allocated ept page. */
@@ -337,6 +353,9 @@ static int ept_next_level(struct p2m_dom
if ( !next )
return GUEST_TABLE_MAP_FAILED;
+ if ( iommu_use_hap_pt(p2m->domain) )
+ iommu_sync_cache(next, EPT_PAGETABLE_ENTRIES * sizeof(ept_entry_t));
+
rc = atomic_write_ept_entry(p2m, ept_entry, e, next_level);
ASSERT(rc == 0);
}
@@ -821,7 +840,10 @@ out:
need_modify_vtd_table )
{
if ( iommu_use_hap_pt(d) )
- rc = iommu_pte_flush(d, gfn, &ept_entry->epte, order, vtd_pte_present);
+ rc = iommu_iotlb_flush(d, _dfn(gfn), (1u << order),
+ (iommu_flags ? IOMMU_FLUSHF_added : 0) |
+ (vtd_pte_present ? IOMMU_FLUSHF_modified
+ : 0));
else if ( need_iommu_pt_sync(d) )
rc = iommu_flags ?
iommu_legacy_map(d, _dfn(gfn), mfn, order, iommu_flags) :
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1884,53 +1884,6 @@ static int intel_iommu_lookup_page(struc
return 0;
}
-int iommu_pte_flush(struct domain *d, uint64_t dfn, uint64_t *pte,
- int order, int present)
-{
- struct acpi_drhd_unit *drhd;
- struct vtd_iommu *iommu = NULL;
- struct domain_iommu *hd = dom_iommu(d);
- bool_t flush_dev_iotlb;
- int iommu_domid;
- int rc = 0;
-
- iommu_sync_cache(pte, sizeof(struct dma_pte));
-
- for_each_drhd_unit ( drhd )
- {
- iommu = drhd->iommu;
- if ( !test_bit(iommu->index, &hd->arch.iommu_bitmap) )
- continue;
-
- flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
- iommu_domid= domain_iommu_domid(d, iommu);
- if ( iommu_domid == -1 )
- continue;
-
- rc = iommu_flush_iotlb_psi(iommu, iommu_domid,
- __dfn_to_daddr(dfn),
- order, !present, flush_dev_iotlb);
- if ( rc > 0 )
- {
- iommu_flush_write_buffer(iommu);
- rc = 0;
- }
- }
-
- if ( unlikely(rc) )
- {
- if ( !d->is_shutting_down && printk_ratelimit() )
- printk(XENLOG_ERR VTDPREFIX
- " d%d: IOMMU pages flush failed: %d\n",
- d->domain_id, rc);
-
- if ( !is_hardware_domain(d) )
- domain_crash(d);
- }
-
- return rc;
-}
-
static int __init vtd_ept_page_compatible(struct vtd_iommu *iommu)
{
u64 ept_cap, vtd_cap = iommu->cap;
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -97,10 +97,6 @@ static inline int iommu_adjust_irq_affin
: 0;
}
-/* While VT-d specific, this must get declared in a generic header. */
-int __must_check iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte,
- int order, int present);
-
static inline bool iommu_supports_x2apic(void)
{
return iommu_init_ops && iommu_init_ops->supports_x2apic
++++++ 5f046d5c-check-VCPUOP_register_vcpu_info-alignment.patch ++++++
# Commit 3fdc211b01b29f252166937238efe02d15cb5780
# Date 2020-07-07 14:41:00 +0200
# Author Julien Grall <jgrall(a)amazon.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
xen: Check the alignment of the offset passed via VCPUOP_register_vcpu_info
Currently a guest is able to register any guest physical address to use
for the vcpu_info structure as long as the structure can fit in the
rest of the frame.
This means a guest can provide an address that is not aligned to the
natural alignment of the structure.
On Arm 32-bit, unaligned accesses are completely forbidden by the
hypervisor. Any such access will result in a data abort, which is fatal.
On Arm 64-bit, unaligned accesses are only forbidden for atomic
operations. As the structure contains fields (such as evtchn_pending_self)
that are updated using atomic operations, any unaligned access will be
fatal as well.
While the misalignment is only fatal on Arm, a generic check is added
as an x86 guest shouldn't sensibly pass an unaligned address (this
would result in a split lock).
This is XSA-327.
Reported-by: Julien Grall <jgrall(a)amazon.com>
Signed-off-by: Julien Grall <jgrall(a)amazon.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini(a)kernel.org>
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1300,10 +1300,20 @@ int map_vcpu_info(struct vcpu *v, unsign
void *mapping;
vcpu_info_t *new_info;
struct page_info *page;
+ unsigned int align;
if ( offset > (PAGE_SIZE - sizeof(vcpu_info_t)) )
return -EINVAL;
+#ifdef CONFIG_COMPAT
+ if ( has_32bit_shinfo(d) )
+ align = alignof(new_info->compat);
+ else
+#endif
+ align = alignof(*new_info);
+ if ( offset & (align - 1) )
+ return -EINVAL;
+
if ( !mfn_eq(v->vcpu_info_mfn, INVALID_MFN) )
return -EINVAL;
++++++ 5f1a9916-x86-S3-put-data-sregs-into-known-state.patch ++++++
# Commit 55f8c389d4348cc517946fdcb10794112458e81e
# Date 2020-07-24 10:17:26 +0200
# Author Jan Beulich <jbeulich(a)suse.com>
# Committer Jan Beulich <jbeulich(a)suse.com>
x86/S3: put data segment registers into known state upon resume
wakeup_32 sets %ds and %es to BOOT_DS, while leaving %fs at whatever
wakeup_start set it to, and %gs at whatever the BIOS loaded into it.
All of this may end up confusing the first load_segments() to run on
the BSP after resume, in particular allowing a non-nul selector value
to be left in %fs.
Alongside %ss, also put all other data segment registers into the same
state that the boot and CPU bringup paths put them in.
Reported-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -66,6 +66,12 @@ ENTRY(__ret_point)
mov REF(saved_ss), %ss
LOAD_GREG(sp)
+ mov $__HYPERVISOR_DS64, %eax
+ mov %eax, %ds
+ mov %eax, %es
+ mov %eax, %fs
+ mov %eax, %gs
+
/* Reload code selector */
pushq $__HYPERVISOR_CS
leaq 1f(%rip),%rax
++++++ 5f21b9fd-x86-cpuid-APIC-bit-clearing.patch ++++++
# Commit 64219fa179c3e48adad12bfce3f6b3f1596cccbf
# Date 2020-07-29 19:03:41 +0100
# Author Fam Zheng <famzheng(a)amazon.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/cpuid: Fix APIC bit clearing
The bug is obvious here; other places in this function use
"cpufeat_mask" correctly.
Fixes: b648feff8ea2 ("xen/x86: Improvements to in-hypervisor cpuid sanity checks")
Signed-off-by: Fam Zheng <famzheng(a)amazon.com>
Reviewed-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -961,7 +961,7 @@ void guest_cpuid(const struct vcpu *v, u
{
/* Fast-forward MSR_APIC_BASE.EN. */
if ( vlapic_hw_disabled(vcpu_vlapic(v)) )
- res->d &= ~cpufeat_bit(X86_FEATURE_APIC);
+ res->d &= ~cpufeat_mask(X86_FEATURE_APIC);
/*
* PSE36 is not supported in shadow mode. This bit should be
++++++ 5f479d9e-x86-begin-to-support-MSR_ARCH_CAPS.patch ++++++
# Commit e32605b07ef2e01c9d05da9b2d5d7b8f9a5c7c1b
# Date 2020-08-27 12:48:46 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86: Begin to introduce support for MSR_ARCH_CAPS
... including serialisation/deserialisation logic and unit tests.
There is no current way to configure this MSR correctly for guests.
The toolstack side of this logic needs building, which is far easier to
do with it in place.
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -328,6 +328,11 @@ static void test_msr_deserialise_failure
.msr = { .idx = 0xce, .val = ~0ull },
.rc = -EOVERFLOW,
},
+ {
+ .name = "truncated val",
+ .msr = { .idx = 0x10a, .val = ~0ull },
+ .rc = -EOVERFLOW,
+ },
};
printf("Testing MSR deserialise failure:\n");
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -183,8 +183,10 @@ int guest_rdmsr(struct vcpu *v, uint32_t
break;
case MSR_ARCH_CAPABILITIES:
- /* Not implemented yet. */
- goto gp_fault;
+ if ( !cp->feat.arch_caps )
+ goto gp_fault;
+ *val = mp->arch_caps.raw;
+ break;
case MSR_INTEL_MISC_FEATURES_ENABLES:
*val = msrs->misc_features_enables.raw;
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -259,7 +259,7 @@ XEN_CPUFEATURE(CET_IBT, 9*32+20) /
XEN_CPUFEATURE(IBRSB, 9*32+26) /*A IBRS and IBPB support (used by Intel) */
XEN_CPUFEATURE(STIBP, 9*32+27) /*A STIBP */
XEN_CPUFEATURE(L1D_FLUSH, 9*32+28) /*S MSR_FLUSH_CMD and L1D flush. */
-XEN_CPUFEATURE(ARCH_CAPS, 9*32+29) /* IA32_ARCH_CAPABILITIES MSR */
+XEN_CPUFEATURE(ARCH_CAPS, 9*32+29) /*! IA32_ARCH_CAPABILITIES MSR */
XEN_CPUFEATURE(SSBD, 9*32+31) /*A MSR_SPEC_CTRL.SSBD available */
/* Intel-defined CPU features, CPUID level 0x00000007:1.eax, word 10 */
--- a/xen/include/xen/lib/x86/msr.h
+++ b/xen/include/xen/lib/x86/msr.h
@@ -3,7 +3,7 @@
#define XEN_LIB_X86_MSR_H
/* Maximum number of MSRs written when serialising msr_policy. */
-#define MSR_MAX_SERIALISED_ENTRIES 1
+#define MSR_MAX_SERIALISED_ENTRIES 2
/* MSR policy object for shared per-domain MSRs */
struct msr_policy
@@ -23,6 +23,28 @@ struct msr_policy
bool cpuid_faulting:1;
};
} platform_info;
+
+ /*
+ * 0x0000010a - MSR_ARCH_CAPABILITIES
+ *
+ * This is an Intel-only MSR, which provides miscellaneous enumeration,
+ * including those which indicate that microarchitectural sidechannels are
+ * fixed in hardware.
+ */
+ union {
+ uint32_t raw;
+ struct {
+ bool rdcl_no:1;
+ bool ibrs_all:1;
+ bool rsba:1;
+ bool skip_l1dfl:1;
+ bool ssb_no:1;
+ bool mds_no:1;
+ bool if_pschange_mc_no:1;
+ bool tsx_ctrl:1;
+ bool taa_no:1;
+ };
+ } arch_caps;
};
#ifdef __XEN__
--- a/xen/lib/x86/msr.c
+++ b/xen/lib/x86/msr.c
@@ -39,6 +39,7 @@ int x86_msr_copy_to_buffer(const struct
})
COPY_MSR(MSR_INTEL_PLATFORM_INFO, p->platform_info.raw);
+ COPY_MSR(MSR_ARCH_CAPABILITIES, p->arch_caps.raw);
#undef COPY_MSR
@@ -99,6 +100,7 @@ int x86_msr_copy_from_buffer(struct msr_
})
case MSR_INTEL_PLATFORM_INFO: ASSIGN(platform_info.raw); break;
+ case MSR_ARCH_CAPABILITIES: ASSIGN(arch_caps.raw); break;
#undef ASSIGN
++++++ 5f4cf06e-x86-Dom0-expose-MSR_ARCH_CAPS.patch ++++++
# Commit e46474278a0e87e2b32ad5dd5fc20e8d2cb0688b
# Date 2020-08-31 13:43:26 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/intel: Expose MSR_ARCH_CAPS to dom0
The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
workloads. While we're not in a position yet to offer MSR_ARCH_CAPS generally
to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
straight through.
This will cause PVH dom0's not to use KPTI by default, and all dom0's not to
use VERW flushing by default, and to use eIBRS in preference to retpoline on
recent Intel CPUs.
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -627,6 +627,14 @@ int init_domain_cpuid_policy(struct doma
recalculate_cpuid_policy(d);
+ /*
+ * Expose the "hardware speculation behaviour" bits of ARCH_CAPS to dom0,
+ * so dom0 can turn off workarounds as appropriate. Temporary, until the
+ * domain policy logic gains a better understanding of MSRs.
+ */
+ if ( is_hardware_domain(d) && boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+ p->feat.arch_caps = true;
+
return 0;
}
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -96,6 +96,22 @@ int init_domain_msr_policy(struct domain
if ( !opt_dom0_cpuid_faulting && is_control_domain(d) && is_pv_domain(d) )
mp->platform_info.cpuid_faulting = false;
+ /*
+ * Expose the "hardware speculation behaviour" bits of ARCH_CAPS to dom0,
+ * so dom0 can turn off workarounds as appropriate. Temporary, until the
+ * domain policy logic gains a better understanding of MSRs.
+ */
+ if ( is_hardware_domain(d) && boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+ {
+ uint64_t val;
+
+ rdmsrl(MSR_ARCH_CAPABILITIES, val);
+
+ mp->arch_caps.raw = val &
+ (ARCH_CAPS_RDCL_NO | ARCH_CAPS_IBRS_ALL | ARCH_CAPS_RSBA |
+ ARCH_CAPS_SSB_NO | ARCH_CAPS_MDS_NO | ARCH_CAPS_TAA_NO);
+ }
+
d->arch.msr = mp;
return 0;
++++++ 5f4cf96a-x86-PV-fix-SEGBASE_GS_USER_SEL.patch ++++++
# Commit afe018e041ec112d90a8b4e6ed607d22aa06f280
# Date 2020-08-31 14:21:46 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/pv: Fix multiple bugs with SEGBASE_GS_USER_SEL
The logic takes the segment selector unmodified from guest context. This
allowed the guest to load DPL0 descriptors into %gs. Fix up the RPL for
non-NUL selectors to be 3.
Xen's context switch logic skips saving the inactive %gs base, as it cannot be
modified by the guest behind Xen's back. This depends on Xen caching updates
to the inactive base, which was missing from this path.
The consequence is that, following SEGBASE_GS_USER_SEL, the next context
switch will restore the stale inactive %gs base, and corrupt vcpu state.
Rework the hypercall to update the cached idea of gs_base_user, and fix the
behaviour in the case of the AMD NUL selector bug to always zero the segment
base.
Reported-by: Andy Lutomirski <luto(a)kernel.org>
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -1056,17 +1056,54 @@ long do_set_segment_base(unsigned int wh
break;
case SEGBASE_GS_USER_SEL:
- __asm__ __volatile__ (
- " swapgs \n"
- "1: movl %k0,%%gs \n"
- " "safe_swapgs" \n"
- ".section .fixup,\"ax\" \n"
- "2: xorl %k0,%k0 \n"
- " jmp 1b \n"
- ".previous \n"
- _ASM_EXTABLE(1b, 2b)
- : "+r" (base) );
+ {
+ unsigned int sel = (uint16_t)base;
+
+ /*
+ * We wish to update the user %gs from the GDT/LDT. Currently, the
+ * guest kernel's GS_BASE is in context.
+ */
+ asm volatile ( "swapgs" );
+
+ if ( sel > 3 )
+ /* Fix up RPL for non-NUL selectors. */
+ sel |= 3;
+ else if ( boot_cpu_data.x86_vendor &
+ (X86_VENDOR_AMD | X86_VENDOR_HYGON) )
+ /* Work around NUL segment behaviour on AMD hardware. */
+ asm volatile ( "mov %[sel], %%gs"
+ :: [sel] "r" (FLAT_USER_DS32) );
+
+ /*
+ * Load the chosen selector, with fault handling.
+ *
+ * Errors ought to fail the hypercall, but that was never built in
+ * originally, and Linux will BUG() if this call fails.
+ *
+ * NUL the selector in the case of an error. This too needs to deal
+ * with the AMD NUL segment behaviour, but it is already a slowpath in
+ * #GP context so perform the flat load unconditionally to avoid
+ * complicated logic.
+ *
+ * Anyone wanting to check for errors from this hypercall should
+ * re-read %gs and compare against the input.
+ */
+ asm volatile ( "1: mov %[sel], %%gs\n\t"
+ ".section .fixup, \"ax\", @progbits\n\t"
+ "2: mov %k[flat], %%gs\n\t"
+ " xor %[sel], %[sel]\n\t"
+ " jmp 1b\n\t"
+ ".previous\n\t"
+ _ASM_EXTABLE(1b, 2b)
+ : [sel] "+r" (sel)
+ : [flat] "r" (FLAT_USER_DS32) );
+
+ /* Update the cache of the inactive base, as read from the GDT/LDT. */
+ v->arch.pv.gs_base_user = rdgsbase();
+
+ asm volatile ( safe_swapgs );
break;
+ }
default:
ret = -EINVAL;
++++++ 5f560c42-x86-PV-64bit-segbase-consistency.patch ++++++
# Commit a5eaac9245f4f382a3cd0e9710e9d1cba7db20e4
# Date 2020-09-07 11:32:34 +0100
# Author Andrew Cooper <andrew.cooper3(a)citrix.com>
# Committer Andrew Cooper <andrew.cooper3(a)citrix.com>
x86/pv: Fix consistency of 64bit segment bases
The comments in save_segments(), _toggle_guest_pt() and write_cr() are false.
The %fs and %gs bases can be updated at any time by the guest.
As a consequence, Xen's fs_base/etc tracking state is always stale when the
vcpu is in context, and must not be used to complete MSR_{FS,GS}_BASE reads, etc.
In particular, a sequence such as:
wrmsr(MSR_FS_BASE, 0x1ull << 32);
write_fs(__USER_DS);
base = rdmsr(MSR_FS_BASE);
will return the stale base, not the new base. This may cause a guest
kernel's context switching of userspace to malfunction.
Therefore:
* Update save_segments(), _toggle_guest_pt() and read_msr() to always read
the segment bases from hardware.
* Update write_cr(), write_msr() and do_set_segment_base() to not waste
time caching data which is instantly going to become stale again.
* Provide comments explaining when the tracking state is and isn't stale.
This bug has been present for 14 years, but several bugfixes since have built
on and extended the original flawed logic.
Fixes: ba9adb737ba ("Apply stricter checking to RDMSR/WRMSR emulations.")
Fixes: c42494acb2f ("x86: fix FS/GS base handling when using the fsgsbase feature")
Fixes: eccc170053e ("x86/pv: Don't have %cr4.fsgsbase active behind a guest kernels back")
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1546,6 +1546,16 @@ static void load_segments(struct vcpu *n
}
}
+/*
+ * Record all guest segment state. The guest can load segment selectors
+ * without trapping, which will also alter the 64bit FS/GS bases. Arbitrary
+ * changes to bases can also be made with the WR{FS,GS}BASE instructions, when
+ * enabled.
+ *
+ * Guests however cannot use SWAPGS, so there is no mechanism to modify the
+ * inactive GS base behind Xen's back. Therefore, Xen's copy of the inactive
+ * GS base is still accurate, and doesn't need reading back from hardware.
+ */
static void save_segments(struct vcpu *v)
{
struct cpu_user_regs *regs = &v->arch.user_regs;
@@ -1556,14 +1566,15 @@ static void save_segments(struct vcpu *v
regs->fs = read_sreg(fs);
regs->gs = read_sreg(gs);
- /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
- if ( (read_cr4() & X86_CR4_FSGSBASE) && !is_pv_32bit_vcpu(v) )
+ if ( !is_pv_32bit_vcpu(v) )
{
- v->arch.pv.fs_base = __rdfsbase();
+ unsigned long gs_base = rdgsbase();
+
+ v->arch.pv.fs_base = rdfsbase();
if ( v->arch.flags & TF_kernel_mode )
- v->arch.pv.gs_base_kernel = __rdgsbase();
+ v->arch.pv.gs_base_kernel = gs_base;
else
- v->arch.pv.gs_base_user = __rdgsbase();
+ v->arch.pv.gs_base_user = gs_base;
}
if ( regs->ds )
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -408,16 +408,19 @@ static void _toggle_guest_pt(struct vcpu
void toggle_guest_mode(struct vcpu *v)
{
+ unsigned long gs_base;
+
ASSERT(!is_pv_32bit_vcpu(v));
- /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
- if ( read_cr4() & X86_CR4_FSGSBASE )
- {
- if ( v->arch.flags & TF_kernel_mode )
- v->arch.pv.gs_base_kernel = __rdgsbase();
- else
- v->arch.pv.gs_base_user = __rdgsbase();
- }
+ /*
+ * Update the cached value of the GS base about to become inactive, as a
+ * subsequent context switch won't bother re-reading it.
+ */
+ gs_base = rdgsbase();
+ if ( v->arch.flags & TF_kernel_mode )
+ v->arch.pv.gs_base_kernel = gs_base;
+ else
+ v->arch.pv.gs_base_user = gs_base;
asm volatile ( "swapgs" );
_toggle_guest_pt(v);
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -779,17 +779,6 @@ static int write_cr(unsigned int reg, un
}
case 4: /* Write CR4 */
- /*
- * If this write will disable FSGSBASE, refresh Xen's idea of the
- * guest bases now that they can no longer change.
- */
- if ( (curr->arch.pv.ctrlreg[4] & X86_CR4_FSGSBASE) &&
- !(val & X86_CR4_FSGSBASE) )
- {
- curr->arch.pv.fs_base = __rdfsbase();
- curr->arch.pv.gs_base_kernel = __rdgsbase();
- }
-
curr->arch.pv.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
write_cr4(pv_make_cr4(curr));
ctxt_switch_levelling(curr);
@@ -838,15 +827,13 @@ static int read_msr(unsigned int reg, ui
case MSR_FS_BASE:
if ( is_pv_32bit_domain(currd) )
break;
- *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdfsbase()
- : curr->arch.pv.fs_base;
+ *val = rdfsbase();
return X86EMUL_OKAY;
case MSR_GS_BASE:
if ( is_pv_32bit_domain(currd) )
break;
- *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdgsbase()
- : curr->arch.pv.gs_base_kernel;
+ *val = rdgsbase();
return X86EMUL_OKAY;
case MSR_SHADOW_GS_BASE:
@@ -975,14 +962,12 @@ static int write_msr(unsigned int reg, u
if ( is_pv_32bit_domain(currd) || !is_canonical_address(val) )
break;
wrfsbase(val);
- curr->arch.pv.fs_base = val;
return X86EMUL_OKAY;
case MSR_GS_BASE:
if ( is_pv_32bit_domain(currd) || !is_canonical_address(val) )
break;
wrgsbase(val);
- curr->arch.pv.gs_base_kernel = val;
return X86EMUL_OKAY;
case MSR_SHADOW_GS_BASE:
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -1027,10 +1027,7 @@ long do_set_segment_base(unsigned int wh
{
case SEGBASE_FS:
if ( is_canonical_address(base) )
- {
wrfsbase(base);
- v->arch.pv.fs_base = base;
- }
else
ret = -EINVAL;
break;
@@ -1047,10 +1044,7 @@ long do_set_segment_base(unsigned int wh
case SEGBASE_GS_KERNEL:
if ( is_canonical_address(base) )
- {
wrgsbase(base);
- v->arch.pv.gs_base_kernel = base;
- }
else
ret = -EINVAL;
break;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -505,7 +505,24 @@ struct pv_vcpu
bool_t syscall32_disables_events;
bool_t sysenter_disables_events;
- /* Segment base addresses. */
+ /*
+ * 64bit segment bases.
+ *
+ * FS and the active GS are always stale when the vCPU is in context, as
+ * the guest can change them behind Xen's back with MOV SREG, or
+ * WR{FS,GS}BASE on capable hardware.
+ *
+ * The inactive GS base is never stale, as guests can't use SWAPGS to
+ * access it - all modification is performed by Xen either directly
+ * (hypercall, #GP emulation), or indirectly (toggle_guest_mode()).
+ *
+ * The vCPU context switch path is optimised based on this fact, so any
+ * path updating or swapping the inactive base must update the cached
+ * value as well.
+ *
+ * Which GS base is active and inactive depends on whether the vCPU is in
+ * user or kernel context.
+ */
unsigned long fs_base;
unsigned long gs_base_kernel;
unsigned long gs_base_user;
++++++ README.SUSE ++++++
++++ 704 lines (skipped)
++++++ aarch64-maybe-uninitialized.patch ++++++
Index: xen-4.12.0-testing/tools/libxl/libxl_arm_acpi.c
===================================================================
--- xen-4.12.0-testing.orig/tools/libxl/libxl_arm_acpi.c
+++ xen-4.12.0-testing/tools/libxl/libxl_arm_acpi.c
@@ -99,7 +99,7 @@ int libxl__get_acpi_size(libxl__gc *gc,
const libxl_domain_build_info *info,
uint64_t *out)
{
- uint64_t size;
+ uint64_t size = 0;
int rc = 0;
@@ -124,7 +124,7 @@ static int libxl__allocate_acpi_tables(l
struct acpitable acpitables[])
{
int rc;
- size_t size;
+ size_t size = 0;
acpitables[RSDP].addr = GUEST_ACPI_BASE;
acpitables[RSDP].size = sizeof(struct acpi_table_rsdp);
++++++ aarch64-rename-PSR_MODE_ELxx-to-match-linux-headers.patch ++++++
>From 98abe3b337e69371678859c4cfd19df61aebb0d9 Mon Sep 17 00:00:00 2001
From: Olaf Hering <olaf(a)aepfle.de>
Date: Sun, 2 Feb 2014 20:42:42 +0100
Subject: aarch64: rename PSR_MODE_ELxx to match linux headers
https://bugs.launchpad.net/linaro-aarch64/+bug/1169164
Signed-off-by: Olaf Hering <olaf(a)aepfle.de>
---
xen/include/public/arch-arm.h | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
Index: xen-4.13.0-testing/xen/include/public/arch-arm.h
===================================================================
--- xen-4.13.0-testing.orig/xen/include/public/arch-arm.h
+++ xen-4.13.0-testing/xen/include/public/arch-arm.h
@@ -371,13 +371,13 @@ typedef uint64_t xen_callback_t;
/* 64 bit modes */
#define PSR_MODE_BIT 0x10 /* Set iff AArch32 */
-#define PSR_MODE_EL3h 0x0d
-#define PSR_MODE_EL3t 0x0c
-#define PSR_MODE_EL2h 0x09
-#define PSR_MODE_EL2t 0x08
-#define PSR_MODE_EL1h 0x05
-#define PSR_MODE_EL1t 0x04
-#define PSR_MODE_EL0t 0x00
+#define PSR_MODE_EL3h 0x0000000d
+#define PSR_MODE_EL3t 0x0000000c
+#define PSR_MODE_EL2h 0x00000009
+#define PSR_MODE_EL2t 0x00000008
+#define PSR_MODE_EL1h 0x00000005
+#define PSR_MODE_EL1t 0x00000004
+#define PSR_MODE_EL0t 0x00000000
#define PSR_GUEST32_INIT (PSR_ABT_MASK|PSR_FIQ_MASK|PSR_IRQ_MASK|PSR_MODE_SVC)
#define PSR_GUEST64_INIT (PSR_ABT_MASK|PSR_FIQ_MASK|PSR_IRQ_MASK|PSR_MODE_EL1h)
++++++ baselibs.conf ++++++
xen-libs
++++++ bin-python3-conversion.patch ++++++
Index: xen-4.13.0-testing/tools/misc/xencons
===================================================================
--- xen-4.13.0-testing.orig/tools/misc/xencons
+++ xen-4.13.0-testing/tools/misc/xencons
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
##############################################
# Console client for Xen guest OSes
@@ -27,13 +27,13 @@ def __recv_from_sock(sock):
while not stop:
try:
data = sock.recv(1024)
- except socket.error, error:
+ except socket.error as error:
if error[0] != errno.EINTR:
raise
else:
try:
os.write(1, data)
- except os.error, error:
+ except os.error as error:
if error[0] != errno.EINTR:
raise
os.wait()
@@ -42,7 +42,7 @@ def __send_to_sock(sock):
while 1:
try:
data = os.read(0,1024)
- except os.error, error:
+ except os.error as error:
if error[0] != errno.EINTR:
raise
else:
@@ -50,7 +50,7 @@ def __send_to_sock(sock):
break
try:
sock.send(data)
- except socket.error, error:
+ except socket.error as error:
if error[0] == errno.EPIPE:
sys.exit(0)
if error[0] != errno.EINTR:
@@ -73,20 +73,20 @@ def connect(host,port):
if os.fork():
signal.signal(signal.SIGCHLD, __child_death)
- print "************ REMOTE CONSOLE: CTRL-] TO QUIT ********"
+ print("************ REMOTE CONSOLE: CTRL-] TO QUIT ********")
tcsetattr(0, TCSAFLUSH, nattrs)
try:
__recv_from_sock(sock)
finally:
tcsetattr(0, TCSAFLUSH, oattrs)
- print
- print "************ REMOTE CONSOLE EXITED *****************"
+ print()
+ print("************ REMOTE CONSOLE EXITED *****************")
else:
signal.signal(signal.SIGPIPE, signal.SIG_IGN)
__send_to_sock(sock)
if __name__ == '__main__':
if len(sys.argv) != 3:
- print sys.argv[0] + " <host> <port>"
+ print(sys.argv[0] + " <host> <port>")
sys.exit(1)
connect(str(sys.argv[1]),int(sys.argv[2]))
Index: xen-4.13.0-testing/tools/misc/xencov_split
===================================================================
--- xen-4.13.0-testing.orig/tools/misc/xencov_split
+++ xen-4.13.0-testing/tools/misc/xencov_split
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/python3
import sys, os, os.path as path, struct, errno
from optparse import OptionParser
@@ -51,7 +51,7 @@ def xencov_split(opts):
dir = opts.output_dir + path.dirname(fn)
try:
os.makedirs(dir)
- except OSError, e:
+ except OSError as e:
if e.errno == errno.EEXIST and os.path.isdir(dir):
pass
else:
@@ -89,8 +89,8 @@ def main():
if __name__ == "__main__":
try:
sys.exit(main())
- except Exception, e:
- print >>sys.stderr, "Error:", e
+ except Exception as e:
+ print("Error:", e, file=sys.stderr)
sys.exit(1)
except KeyboardInterrupt:
sys.exit(1)
Index: xen-4.13.0-testing/tools/misc/xenpvnetboot
===================================================================
--- xen-4.13.0-testing.orig/tools/misc/xenpvnetboot
+++ xen-4.13.0-testing/tools/misc/xenpvnetboot
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
#
# Copyright (C) 2010 Oracle. All rights reserved.
#
@@ -17,9 +17,9 @@ import time
import string
import random
import tempfile
-import commands
import subprocess
-import urlgrabber
+import subprocess
+import urllib.request as request
from optparse import OptionParser
@@ -58,7 +58,7 @@ def mount(dev, path, option=''):
else:
mountcmd = '/bin/mount'
cmd = ' '.join([mountcmd, option, dev, path])
- (status, output) = commands.getstatusoutput(cmd)
+ (status, output) = subprocess.getstatusoutput(cmd)
if status != 0:
raise RuntimeError('Command: (%s) failed: (%s) %s' % (cmd, status, output))
@@ -79,7 +79,7 @@ class Fetcher:
def prepare(self):
if not os.path.exists(self.tmpdir):
- os.makedirs(self.tmpdir, 0750)
+ os.makedirs(self.tmpdir, 0o750)
def cleanup(self):
pass
@@ -89,8 +89,8 @@ class Fetcher:
suffix = ''.join(random.sample(string.ascii_letters, 6))
local_name = os.path.join(self.tmpdir, 'xenpvboot.%s.%s' % (os.path.basename(filename), suffix))
try:
- return urlgrabber.urlgrab(url, local_name, copy_local=1)
- except Exception, err:
+ return request.urlretrieve(url, local_name)
+ except Exception as err:
raise RuntimeError('Cannot get file %s: %s' % (url, err))
@@ -155,7 +155,7 @@ class TFTPFetcher(Fetcher):
suffix = ''.join(random.sample(string.ascii_letters, 6))
local_name = os.path.join(self.tmpdir, 'xenpvboot.%s.%s' % (os.path.basename(filename), suffix))
cmd = '/usr/bin/tftp %s -c get %s %s' % (host, os.path.join(basedir, filename), local_name)
- (status, output) = commands.getstatusoutput(cmd)
+ (status, output) = subprocess.getstatusoutput(cmd)
if status != 0:
raise RuntimeError('Command: (%s) failed: (%s) %s' % (cmd, status, output))
return local_name
@@ -202,7 +202,7 @@ Supported locations:
if not opts.location and not opts.kernel and not opts.ramdisk:
if not opts.quiet:
- print >> sys.stderr, 'You should at least specify a location or kernel/ramdisk.'
+ print('You should at least specify a location or kernel/ramdisk.', file=sys.stderr)
parser.print_help(sys.stderr)
sys.exit(1)
@@ -228,14 +228,14 @@ Supported locations:
fetcher = TFTPFetcher(location, opts.output_directory)
else:
if not opts.quiet:
- print >> sys.stderr, 'Unsupported location: %s' % location
+ print('Unsupported location: %s' % location, file=sys.stderr)
sys.exit(1)
try:
fetcher.prepare()
- except Exception, err:
+ except Exception as err:
if not opts.quiet:
- print >> sys.stderr, str(err)
+ print(str(err), file=sys.stderr)
fetcher.cleanup()
sys.exit(1)
@@ -247,15 +247,15 @@ Supported locations:
for (kernel_path, _) in XEN_PATHS:
try:
kernel = fetcher.get_file(kernel_path)
- except Exception, err:
+ except Exception as err:
if not opts.quiet:
- print >> sys.stderr, str(err)
+ print(str(err), file=sys.stderr)
continue
break
if not kernel:
if not opts.quiet:
- print >> sys.stderr, 'Cannot get kernel from loacation: %s' % location
+ print('Cannot get kernel from location: %s' % location, file=sys.stderr)
sys.exit(1)
ramdisk = None
@@ -265,9 +265,9 @@ Supported locations:
for (_, ramdisk_path) in XEN_PATHS:
try:
ramdisk = fetcher.get_file(ramdisk_path)
- except Exception, err:
+ except Exception as err:
if not opts.quiet:
- print >> sys.stderr, str(err)
+ print(str(err), file=sys.stderr)
continue
break
finally:
@@ -280,7 +280,7 @@ Supported locations:
elif opts.output_format == 'simple0':
output = format_simple(kernel, ramdisk, opts.args, '\0')
else:
- print >> sys.stderr, 'Unknown output format: %s' % opts.output_format
+ print('Unknown output format: %s' % opts.output_format, file=sys.stderr)
sys.exit(1)
sys.stdout.flush()
Index: xen-4.13.0-testing/tools/python/scripts/convert-legacy-stream
===================================================================
--- xen-4.13.0-testing.orig/tools/python/scripts/convert-legacy-stream
+++ xen-4.13.0-testing/tools/python/scripts/convert-legacy-stream
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
@@ -39,16 +39,16 @@ def info(msg):
for line in msg.split("\n"):
syslog.syslog(syslog.LOG_INFO, line)
else:
- print msg
+ print(msg)
def err(msg):
"""Error message, routed to appropriate destination"""
if log_to_syslog:
for line in msg.split("\n"):
syslog.syslog(syslog.LOG_ERR, line)
- print >> sys.stderr, msg
+ print(msg, file=sys.stderr)
-class StreamError(StandardError):
+class StreamError(Exception):
"""Error with the incoming migration stream"""
pass
@@ -637,7 +637,7 @@ def open_file_or_fd(val, mode):
else:
return open(val, mode, 0)
- except StandardError, e:
+ except Exception as e:
if fd != -1:
err("Unable to open fd %d: %s: %s" %
(fd, e.__class__.__name__, e))
@@ -723,7 +723,7 @@ def main():
if __name__ == "__main__":
try:
sys.exit(main())
- except SystemExit, e:
+ except SystemExit as e:
sys.exit(e.code)
except KeyboardInterrupt:
sys.exit(1)
Index: xen-4.13.0-testing/tools/python/scripts/verify-stream-v2
===================================================================
--- xen-4.13.0-testing.orig/tools/python/scripts/verify-stream-v2
+++ xen-4.13.0-testing/tools/python/scripts/verify-stream-v2
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
""" Verify a v2 format migration stream """
@@ -25,7 +25,7 @@ def info(msg):
for line in msg.split("\n"):
syslog.syslog(syslog.LOG_INFO, line)
else:
- print msg
+ print(msg)
def err(msg):
"""Error message, routed to appropriate destination"""
@@ -33,7 +33,7 @@ def err(msg):
if log_to_syslog:
for line in msg.split("\n"):
syslog.syslog(syslog.LOG_ERR, line)
- print >> sys.stderr, msg
+ print(msg, file=sys.stderr)
def stream_read(_ = None):
"""Read from input"""
@@ -86,7 +86,7 @@ def read_stream(fmt):
err(traceback.format_exc())
return 1
- except StandardError:
+ except Exception:
err("Script Error:")
err(traceback.format_exc())
err("Please fix me")
@@ -114,7 +114,7 @@ def open_file_or_fd(val, mode, buffering
else:
return open(val, mode, buffering)
- except StandardError, e:
+ except Exception as e:
if fd != -1:
err("Unable to open fd %d: %s: %s" %
(fd, e.__class__.__name__, e))
@@ -168,7 +168,7 @@ def main():
if __name__ == "__main__":
try:
sys.exit(main())
- except SystemExit, e:
+ except SystemExit as e:
sys.exit(e.code)
except KeyboardInterrupt:
sys.exit(2)
Index: xen-4.13.0-testing/tools/xenmon/xenmon.py
===================================================================
--- xen-4.13.0-testing.orig/tools/xenmon/xenmon.py
+++ xen-4.13.0-testing/tools/xenmon/xenmon.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
#####################################################################
# xenmon is a front-end for xenbaked.
Index: xen-4.13.0-testing/tools/xentrace/xentrace_format
===================================================================
--- xen-4.13.0-testing.orig/tools/xentrace/xentrace_format
+++ xen-4.13.0-testing/tools/xentrace/xentrace_format
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# by Mark Williamson, (C) 2004 Intel Research Cambridge
@@ -7,8 +7,7 @@
import re, sys, string, signal, struct, os, getopt
def usage():
- print >> sys.stderr, \
- "Usage: " + sys.argv[0] + """ defs-file
+ print("Usage: " + sys.argv[0] + """ defs-file
Parses trace data in binary format, as output by Xentrace and
reformats it according to the rules in a file of definitions. The
rules in this file should have the format ({ and } show grouping
@@ -29,7 +28,7 @@ def usage():
this script may not be able to keep up with the output of xentrace
if it is piped directly. In these circumstances you should have
xentrace output to a file for processing off-line.
- """
+ """, file=sys.stderr)
sys.exit(1)
def read_defs(defs_file):
@@ -49,7 +48,7 @@ def read_defs(defs_file):
m = reg.match(line)
- if not m: print >> sys.stderr, "Bad format file" ; sys.exit(1)
+ if not m: print("Bad format file", file=sys.stderr) ; sys.exit(1)
defs[str(eval(m.group(1)))] = m.group(2)
@@ -83,8 +82,8 @@ interrupted = 0
try:
defs = read_defs(arg[0])
-except IOError, exn:
- print exn
+except IOError as exn:
+ print(exn)
sys.exit(1)
# structure of trace record (as output by xentrace):
@@ -211,7 +210,7 @@ while not interrupted:
if cpu >= len(last_tsc):
last_tsc += [0] * (cpu - len(last_tsc) + 1)
elif tsc < last_tsc[cpu] and tsc_in == 1:
- print "TSC stepped backward cpu %d ! %d %d" % (cpu,tsc,last_tsc[cpu])
+ print("TSC stepped backward cpu %d ! %d %d" % (cpu,tsc,last_tsc[cpu]))
# provide relative TSC
if last_tsc[cpu] > 0 and tsc_in == 1:
@@ -239,18 +238,20 @@ while not interrupted:
try:
- if defs.has_key(str(event)):
- print defs[str(event)] % args
+ if str(event) in defs:
+ print(defs[str(event)] % args)
else:
- if defs.has_key(str(0)): print defs[str(0)] % args
+ if str(0) in defs: print(defs[str(0)] % args)
except TypeError:
- if defs.has_key(str(event)):
- print defs[str(event)]
- print args
+ if str(event) in defs:
+ print(defs[str(event)])
+ print(args)
else:
- if defs.has_key(str(0)):
- print defs[str(0)]
- print args
+ if str(0) in defs:
+ print(defs[str(0)])
+ print(args)
- except IOError, struct.error: sys.exit()
+ except (IOError, struct.error): sys.exit(1)
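The hunks above all apply the same handful of Python 2-to-3 conversions: print as a function, `except ... as`, `commands.getstatusoutput` moved to `subprocess`, and `dict.has_key()` replaced by the `in` operator. A minimal sketch of those patterns side by side (the `run` helper is illustrative, not from the patched tools):

```python
import subprocess
import sys

def run(cmd):
    # subprocess.getstatusoutput() is the Python 3 home of the old
    # commands.getstatusoutput(); it returns the same (status, output) tuple.
    status, output = subprocess.getstatusoutput(cmd)
    if status != 0:
        raise RuntimeError('Command: (%s) failed: (%s) %s' % (cmd, status, output))
    return output

defs = {'1': 'event: %s'}
event = 1
# dict.has_key() is gone in Python 3; membership uses the `in` operator.
if str(event) in defs:
    msg = defs[str(event)] % ('args',)

try:
    run('exit 3')
except Exception as e:  # "except Exception, e" is a syntax error in Python 3
    print('Error:', e, file=sys.stderr)
```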
++++++ block-dmmd ++++++
#! /bin/bash
# Usage: block-dmmd [add args | remove args]
#
# the dmmd device syntax (in xl commands/configs) is something like:
# script=block-dmmd,md;/dev/md0;md;/dev/md1;lvm;/dev/vg1/lv1
# or
# script=block-dmmd,lvm;/dev/vg1/lv1;lvm;/dev/vg1/lv2;md;/dev/md0
# device pairs (type;dev) are processed in order, with the last device
# assigned to the VM
#
# Note - When using the libxl stack, the "script=block-dmmd" option
# is required. See man xl-disk-configuration(5) for more information.
#
# md devices can optionally:
# specify a config file through:
# md;/dev/md100(/var/xen/config/mdadm.conf)
# use an array name (mdadm -N option):
# md;My-MD-name;lvm;/dev/vg1/lv1
#
# Completely expressive syntax should be similar to:
# "format=raw, vdev=xvdb, access=rw, script=block-dmmd, \
# target=md;/dev/md0(/etc/mdadm.conf);lvm;/dev/vg1/lv1"
#
##
# History:
# 2017-07-10, mlatimer(a)suse.com:
# Modification to use syslog for progress messages by ldevulder(a)suse.com
# 2017-06-12, mlatimer(a)suse.com:
# Merge LVM improvements by loic.devulder(a)mpsa.com
# Document libxl "script=block-dmmd" syntax in examples
# Remove xm/xend references (e.g. parsed_timeout from xend-config.sxp)
# 2016-05-27, mlatimer(a)suse.com:
# Merge improvements by loic.devulder(a)mpsa.com. Highlights include:
# - Re-write and simplification to speed up the script!
# - Add some (useful) logging messages and comments
# Minor tweaks and logging improvements
# 2016-05-26, mlatimer(a)suse.com:
# Verify MD activation if mdadm returns 2
# 2016-05-20, mlatimer(a)suse.com:
# Strip leading "dmmd:" if present in xenstore params value
# 2013-07-03, loic.devulder(a)mpsa.com:
# Partial rewrite of the script for supporting MD activation by name
# 2009-06-09, mh(a)novell.com:
# Emit debugging messages into a temporary file; if no longer needed,
# just comment the exec I/O redirection below
# Make variables used in functions local to avoid global overridings
# Use vgscan and vgchange where required
# Use the C locale to avoid dealing with localized messages
# Assign output from assembling an MD device to a variable to aid
# debugging
# We do not want to deal with localized messages
# We use LC_ALL because LC_ALL supersedes LANG
# But we also use LANG because some applications may still use LANG...
export LC_ALL=C
export LANG=${LC_ALL}
# Loading common libraries
. $(dirname $0)/block-common.sh
# Constants
typeset -rx MDADM_BIN=/sbin/mdadm
typeset -rx LVCHANGE_BIN=/sbin/lvchange
typeset -rx PVSCAN_BIN=/sbin/pvscan
typeset -rx VGSCAN_BIN=/sbin/vgscan
typeset -rx VGCHANGE_BIN=/sbin/vgchange
typeset -rx CLVMD_BIN=/usr/sbin/clvmd
typeset -rx DATE_SEC="date +%s"
# We check for errors ourselves
set +e
function reload_clvm()
{
# If we are in cluster mode
if ps -e | grep -q [c]lvmd 2>/dev/null; then
# Logging message
log info "Synchronizing cLVM..."
# Synchronize cLVM
${CLVMD_BIN} -R > /dev/null 2>&1 \
|| return 1
fi
return 0
}
function run_mdadm()
{
local mdadm_cmd=$1
local msg
local rc
msg="$(${MDADM_BIN} ${mdadm_cmd} 2>&1)"
rc=$?
case "${msg}" in
*"has been started"* | *"already active"*)
return 0
;;
*"is already in use"*)
# Hmm, might be used by another device in this domU
# Leave it to upper layers to detect a real error
return 2
;;
*)
return ${rc}
;;
esac
# Normally we should not get here, but if this happens
# we have to return an error
return 1
}
function activate_md()
{
# Make it explicitly local
local par=$1
local cfg dev dev_path rc t mdadm_opts
if [[ ${par} == ${par%%(*} ]]; then
# No configuration file specified
dev=${par}
cfg=""
else
dev=${par%%(*}
t=${par#*(}
cfg="-c ${t%%)*}"
fi
# Look for the device name or alias
if [[ ${dev:0:1} == / ]]; then
dev_path=${dev%/*}
mdadm_opts=""
else
dev_path=/dev/md
mdadm_opts="-s -N"
fi
# Logging message
log info "Activating MD device ${dev}..."
# Is MD device already active?
# We need to use the full path name; an alias is not possible...
if [ -e ${dev_path}/${dev##*/} ]; then
${MDADM_BIN} -Q -D ${dev_path}/${dev##*/} 2>/dev/null \
| grep -iq state.*\:.*inactive || return 0
fi
# Activate MD device
run_mdadm "-A ${mdadm_opts} ${dev} ${cfg}"
rc=$?
# A return code of 2 can indicate the array configuration was incorrect
if [[ ${rc} == 2 ]]; then
# Logging message
log info "Verifying MD device ${dev} activation..."
# If the array is active, return 0, otherwise return an error
${MDADM_BIN} -Q -D ${dev_path}/${dev##*/} &>/dev/null && return 0 \
|| return 1
fi
return ${rc}
}
function deactivate_md()
{
local par=$1
local dev
if [[ ${par} == ${par%%(*} ]]; then
# No configuration file specified
dev=${par}
else
dev=${par%%(*}
fi
# Look for the device name or alias
if [[ ${dev:0:1} == / ]]; then
dev_path=${dev%/*}
else
dev_path=/dev/md
fi
# Logging message
log info "Deactivating MD device ${dev}..."
# We need the device name only while deactivating
${MDADM_BIN} -S ${dev_path}/${dev##*/} > /dev/null 2>&1
return $?
}
function lvm_action()
{
local action=$1
local dev=$2
local run_timeout=90
local end_time
# Logging message
log info "${action} LVM device ${dev}..."
# Set end_time for the loop
(( end_time = $(${DATE_SEC}) + run_timeout ))
while true; do
# Action depends on what the user asks
if [[ ${action} == activate ]]; then
# First scan for PVs and VGs
# We need this for using MD device as PV
${PVSCAN_BIN} > /dev/null 2>&1
${LVCHANGE_BIN} -aey ${dev} > /dev/null 2>&1 \
&& [[ -e ${dev} ]] \
&& return 0
elif [[ ${action} == deactivate ]]; then
${LVCHANGE_BIN} -aen ${dev} > /dev/null 2>&1 \
&& return 0
# If the LV is already deactivated we may be in an infinite loop
# So we need to test if the LV is still present
[[ -e ${dev} ]] || return 0
fi
# It seems that we had a problem during lvchange
# If we are in a cluster the problem may be due to a cLVM locking bug,
# so try to reload it
reload_clvm
# If it takes too long we need to return an error
if (( $(${DATE_SEC}) >= end_time )); then
log err "Failed to ${action} ${dev} within ${run_timeout} seconds"
return 1
fi
# Briefly sleep before restarting the loop
sleep 0.1
done
# Normally we should not get here, but if this happens
# we have to return an error
return 1
}
# Variables
typeset command=$1
typeset BP=100
typeset SP=${BP}
typeset VBD
typeset -a stack
function push()
{
local value="$1"
[[ -n "${value}" ]] \
&& stack[$((--SP))]="${value}"
return 0
}
function pop()
{
[[ "${SP}" != "${BP}" ]] \
&& VBD=${stack[$((SP++))]} \
|| VBD=""
return 0
}
function activate_dmmd()
{
case "$1" in
"md")
activate_md $2
return $?
;;
"lvm")
lvm_action activate $2
return $?
;;
esac
# Normally we should not get here, but if this happens
# we have to return an error
return 1
}
function deactivate_dmmd()
{
case "$1" in
"md")
deactivate_md $2
return $?
;;
"lvm")
lvm_action deactivate $2
return $?
;;
esac
# Normally we should not get here, but if this happens
# we have to return an error
return 1
}
function cleanup_stack()
{
while true; do
pop
[[ -z "${VBD}" ]] && break
deactivate_dmmd ${VBD}
done
}
function parse_par()
{
# Make these vars explicitly local
local ac par rc s t
ac=$1
par="$2"
par="${par};"
while true; do
t=${par%%;*}
[[ -z "${t}" ]] && return 0
par=${par#*;}
s=${par%%;*}
[[ -z "${s}" ]] && return 1
par=${par#*;}
if [[ "${ac}" == "activate" ]]; then
activate_dmmd ${t} ${s} \
|| return 1
fi
push "${t} ${s}"
done
}
case "${command}" in
"add")
p=$(xenstore-read ${XENBUS_PATH}/params) || true
claim_lock "dmmd"
dmmd=${p#dmmd:}
if ! parse_par activate "${dmmd}"; then
cleanup_stack
release_lock "dmmd"
exit 1
fi
lastparam=${dmmd##*;}
usedevice=${lastparam%(*}
xenstore-write ${XENBUS_PATH}/node "${usedevice}"
write_dev "${usedevice}"
release_lock "dmmd"
exit 0
;;
"remove")
p=$(xenstore-read ${XENBUS_PATH}/params) || true
claim_lock "dmmd"
dmmd=${p#dmmd:}
parse_par noactivate "${dmmd}"
cleanup_stack
release_lock "dmmd"
exit 0
;;
esac
# Normally we should not get here, but if this happens
# we have to return an error
exit 1
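`parse_par` above walks the semicolon-separated `type;dev` pairs in order, activating each one and handing the last device to the VM. The same walk, sketched in Python (the function name is illustrative, not part of the script):

```python
def parse_pairs(spec):
    # Split "type;dev;type;dev;..." into ordered (type, device) pairs,
    # stripping an optional leading "dmmd:" prefix as the script does.
    if spec.startswith('dmmd:'):
        spec = spec[len('dmmd:'):]
    items = [s for s in spec.split(';') if s]
    if len(items) % 2:
        raise ValueError('device list is not type;dev pairs: %r' % spec)
    return list(zip(items[0::2], items[1::2]))

pairs = parse_pairs('dmmd:md;/dev/md0(/etc/mdadm.conf);lvm;/dev/vg1/lv1')
last_device = pairs[-1][1]  # the device pair assigned to the VM
```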
++++++ block-npiv ++++++
#!/bin/bash
# Usage: block-npiv [add npiv | remove dev]
dir=$(dirname "$0")
. "$dir/block-npiv-common.sh"
. "$dir/block-common.sh"
#set -x
#command=$1
case "$command" in
add)
# Params is one big arg, with fields separated by hyphens:
# single path:
# VPWWPN-TGTWWPN-LUN#
# multipath:
# {VPWWPN1.VPWWPN2....VPWWPNx}-{TGTWWPN1.TGTWWPN2....TGTWWPNx}-LUN#
# arg 1 - VPORT's WWPN
# arg 2 - Target's WWPN
# arg 3 - LUN # on Target
# no wwn contains a leading 0x - it is a 16 character hex value
# You may want to optionally pick a specific adapter ?
par=`xenstore-read $XENBUS_PATH/params` || true
NPIVARGS=(${par//-/ })
wc=${#NPIVARGS[@]}
if [ $wc -eq 5 ]; then
# support old syntax
# FABRIC-VPWWPN-VPWWNN-TGTWWPN-LUN
VPORTWWPNS=${NPIVARGS[1]}
VPORTWWNNS=${NPIVARGS[2]}
TGTWWPNS=${NPIVARGS[3]}
LUN=${NPIVARGS[4]}
elif [ $wc -eq 3 ]; then
# new syntax
VPORTWWPNS=${NPIVARGS[0]}
TGTWWPNS=${NPIVARGS[1]}
LUN=${NPIVARGS[2]}
else
# wrong syntax
exit 1
fi
# Ensure we compare everything using lower-case hex characters
TGTWWPNS=`echo $TGTWWPNS | tr A-Z a-z |sed 's/[{.}]/ /g'`
VPORTWWPNS=`echo $VPORTWWPNS | tr A-Z a-z |sed 's/[{.}]/ /g'`
# Only one VPWWNN is supported
VPORTWWNN=`echo $VPORTWWNNS | tr A-Z a-z | sed -e 's/\..*//g' -e 's/{//'`
claim_lock "npiv"
paths=0
for VPORTWWPN in $VPORTWWPNS; do
find_vhost $VPORTWWPN
if test -z "$vhost" ; then
create_vport $VPORTWWPN $VPORTWWNN
if [ $? -ne 0 ] ; then exit 2; fi
sleep 8
find_vhost $VPORTWWPN
if test -z "$vhost" ; then exit 3; fi
fi
for TGTWWPN in $TGTWWPNS; do
find_sdev $vhost $TGTWWPN $LUN
if test -z "$dev"; then
echo "- - -" > /sys/class/scsi_host/$vhost/scan
sleep 2
find_sdev $vhost $TGTWWPN $LUN
fi
if test -z "$dev"; then
exit 4
fi
paths=$(($paths+1))
done
done
release_lock "npiv"
if test $paths -gt 1; then
xenstore-write $XENBUS_PATH/multipath 1
/etc/init.d/multipathd start
if test $? -ne 0 ; then exit 4; fi
dm=`multipath -l /dev/$dev | grep dm | cut -f2 -d' '`
else
xenstore-write $XENBUS_PATH/multipath 0
dm=$dev
fi
if test ! -z "$dm"; then
xenstore-write $XENBUS_PATH/node /dev/$dm
write_dev /dev/$dm
exit 0
fi
exit 4
;;
remove)
node=`xenstore-read $XENBUS_PATH/node` || true
multipath=`xenstore-read $XENBUS_PATH/multipath` || true
# this is really screwy. the first delete of a lun will
# terminate the entire vport (all luns)
if test $multipath = 1; then
par=`xenstore-read $XENBUS_PATH/params` || true
NPIVARGS=(${par//-/ })
wc=${#NPIVARGS[@]}
if [ $wc -eq 5 ]; then
# old syntax
# FABRIC-VPWWPN-VPWWNN-TGTWWPN-LUN
VPORTWWPNS=${NPIVARGS[1]}
elif [ $wc -eq 3 ]; then
# new syntax
VPORTWWPNS=${NPIVARGS[0]}
fi
VPORTWWPNS=`echo $VPORTWWPNS | tr A-Z a-z |sed 's/[{.}]/ /g'`
for VPORTWWPN in $VPORTWWPNS; do
find_vhost $VPORTWWPN
if test -z "$vhost" ; then exit 5; fi
flush_nodes_on_vhost $vhost
delete_vhost $vhost
done
else
dev=$node; dev=${dev#/dev/}
find_vhost_from_dev $dev
if test -z "$vhost" ; then exit 5; fi
flush_nodes_on_vhost $vhost
delete_vhost $vhost
fi
exit 0
;;
esac
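block-npiv packs its parameters into one hyphen-separated argument and accepts both the old five-field and the new three-field form, with WWPN sets optionally grouped as `{wwpn1.wwpn2}`. A Python sketch of that split (helper name illustrative):

```python
def parse_npiv_params(par):
    # Old syntax: FABRIC-VPWWPN-VPWWNN-TGTWWPN-LUN
    # New syntax: VPWWPN-TGTWWPN-LUN
    # WWPN fields may group several names as {wwpn1.wwpn2....wwpnN}.
    fields = par.split('-')
    if len(fields) == 5:
        _fabric, vports, _wwnn, tgts, lun = fields
    elif len(fields) == 3:
        vports, tgts, lun = fields
    else:
        raise ValueError('wrong syntax: %r' % par)
    # Normalize to lower-case hex and expand brace/dot groups, as the
    # script does with tr and sed.
    expand = lambda s: s.lower().strip('{}').split('.')
    return expand(vports), expand(tgts), lun

vp, tg, lun = parse_npiv_params(
    '{2100000000000001.2100000000000002}-2101000000000001-0')
```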
++++++ block-npiv-common.sh ++++++
# Look for the NPIV vport with the WWPN
# $1 contains the WWPN (assumes it does not contain a leading "0x")
find_vhost()
{
unset vhost
# look in upstream locations
for fchost in /sys/class/fc_vports/* ; do
if test -e $fchost/port_name ; then
wwpn=`cat $fchost/port_name | sed -e s/^0x//`
if test $wwpn = $1 ; then
# Note: makes the assumption the vport will always have an scsi_host child
vhost=`ls -d $fchost/device/host*`
vhost=`basename $vhost`
return
fi
fi
done
# look in vendor-specific locations
# Emulex - just looks like another scsi_host - so look at fc_hosts...
for fchost in /sys/class/fc_host/* ; do
if test -e $fchost/port_name ; then
wwpn=`cat $fchost/port_name | sed -e s/^0x//`
if test $wwpn = $1 ; then
# Note: makes the assumption the vport will always have an scsi_host child
vhost=`basename $fchost`
return
fi
fi
done
}
# Create a NPIV vport with WWPN
# $1 contains the VPORT WWPN
# $2 may contain the VPORT WWNN
# (assumes no name contains a leading "0x")
create_vport()
{
wwpn=$1
wwnn=$2
if [ -z "$wwnn" ]; then
# auto generate wwnn, follow FluidLabUpdateForEmulex.pdf
# Novell specific identifier
# byte 6 = 0 indicates WWNN, = 1 indicates WWPN
wwnn=${wwpn:0:6}"0"${wwpn:7}
fi
# find a base adapter with npiv support that is on the right fabric
# Look via upstream interfaces
for fchost in /sys/class/fc_host/* ; do
if test -e $fchost/vport_create ; then
# is the link up, w/ NPIV support ?
pstate=`cat $fchost/port_state`
ptype=`cat $fchost/port_type | cut -c 1-5`
if [ $pstate = "Online" -a $ptype = "NPort" ] ; then
vmax=`cat $fchost/max_npiv_vports`
vinuse=`cat $fchost/npiv_vports_inuse`
avail=`expr $vmax - $vinuse`
if [ $avail -gt 0 ] ; then
# create the vport
echo $wwpn":"$wwnn > $fchost/vport_create
if [ $? -eq 0 ] ; then
return 0
fi
# failed - so we'll just look for the next adapter
fi
fi
fi
done
# Look in vendor-specific locations
# Emulex: interfaces mirror upstream, but are under adapter scsi_host
for shost in /sys/class/scsi_host/* ; do
if [ -e $shost/vport_create ] ; then
fchost=`ls -d $shost/device/fc_host*`
# is the link up, w/ NPIV support ?
if [ -e $fchost/port_state ] ; then
pstate=`cat $fchost/port_state`
ptype=`cat $fchost/port_type | cut -c 1-5`
if [ $pstate = "Online" -a $ptype = "NPort" ] ; then
vmax=`cat $shost/max_npiv_vports`
vinuse=`cat $shost/npiv_vports_inuse`
avail=`expr $vmax - $vinuse`
if [ $avail -gt 0 ] ; then
# create the vport
echo $wwpn":"$wwnn > $shost/vport_create
if [ $? -eq 0 ] ; then
return 0
fi
# failed - so we'll just look for the next adapter
fi
fi
fi
fi
done
# BFA are under adapter scsi_host
for shost in /sys/class/scsi_host/* ; do
if [ -e $shost/vport_create ] ; then
fchost=`ls -d $shost/device/fc_host/*`
# is the link up, w/ NPIV support ?
if [ -e $fchost/port_state ] ; then
pstate=`cat $fchost/port_state`
ptype=`cat $fchost/port_type | cut -c 1-5`
if [ $pstate = "Online" -a $ptype = "NPort" ] ; then
# create the vport
echo $wwpn":"$wwnn > $shost/vport_create
if [ $? -eq 0 ] ; then
return 0
fi
# failed - so we'll just look for the next adapter
fi
fi
fi
done
return 1
}
# Look for the LUN on the indicated scsi_host (which is an NPIV vport)
# $1 is the scsi_host name (normalized to simply the hostX name)
# $2 is the WWPN of the tgt port the lun is on
# Note: this implies we don't support a multipath'd lun, or we
# are explicitly identifying a "path"
# $3 is the LUN number of the scsi device
find_sdev()
{
unset dev
hostno=${1/*host/}
for sdev in /sys/class/scsi_device/${hostno}:*:$3 ; do
if test -e $sdev/device/../fc_trans*/target${hostno}*/port_name ; then
tgtwwpn=`cat $sdev/device/../fc_trans*/target${hostno}*/port_name | sed -e s/^0x//`
if test $tgtwwpn = $2 ; then
if test -e $sdev/device/block* ; then
dev=`ls $sdev/device/block*`
dev=${dev##*/}
return
fi
fi
fi
done
}
# Look for the NPIV vhost based on a scsi "sdX" name
# $1 is the "sdX" name
find_vhost_from_dev()
{
unset vhost
hostno=`readlink /sys/block/$1/device`
hostno=${hostno##*/}
hostno=${hostno%%:*}
if test -z "$hostno" ; then return; fi
vhost="host"$hostno
}
# We're about to terminate a vhost based on a scsi device
# Flush all nodes on that vhost as they are about to go away
# $1 is the vhost
flush_nodes_on_vhost()
{
if test ! -x /sbin/blockdev ; then return; fi
hostno=${1/*host/}
for sdev in /sys/class/scsi_device/${hostno}:* ; do
if test -e $sdev/device/block* ; then
dev=`ls $sdev/device/block*`
dev="/dev/"$dev
if test -n "$dev"; then
blockdev --flushbufs $dev
fi
fi
done
}
# Terminate a NPIV vhost
# $1 is vhost
delete_vhost()
{
# use upstream interface
for vport in /sys/class/fc_vports/* ; do
if test -e $vport/device/$1 ; then
if test -e $vport/vport_delete ; then
echo "1" > $vport/vport_delete
if test $? -ne 0 ; then exit 6; fi
sleep 4
return
fi
fi
done
# use vendor specific interface
# Emulex
if test -e /sys/class/fc_host/$1/device/../scsi_host*/lpfc_drvr_version ; then
shost=`ls -1d /sys/class/fc_host/$1/device/../scsi_host* | sed s/.*scsi_host://`
vportwwpn=`cat /sys/class/fc_host/$1/port_name | sed s/^0x//`
vportwwnn=`cat /sys/class/fc_host/$1/node_name | sed s/^0x//`
echo "$vportwwpn:$vportwwnn" > /sys/class/scsi_host/$shost/vport_delete
if test $? -ne 0 ; then exit 6; fi
sleep 4
return
fi
# Qlogic
if test -e /sys/class/fc_host/$1/device/../scsi_host*/driver_version ; then
shost=`ls -1d /sys/class/fc_host/$1/device/../scsi_host* | sed s/.*scsi_host://`
vportwwpn=`cat /sys/class/fc_host/$1/port_name | sed s/^0x//`
vportwwnn=`cat /sys/class/fc_host/$1/node_name | sed s/^0x//`
echo "$vportwwpn:$vportwwnn" > /sys/class/scsi_host/$shost/vport_delete
if test $? -ne 0 ; then exit 6; fi
sleep 4
return
fi
# BFA
if test -e /sys/class/fc_host/$1/device/../scsi_host/*/driver_name ; then
shost=`ls -1d /sys/class/fc_host/$1/device/../scsi_host/* | sed s#.*scsi_host/##`
vportwwpn=`cat /sys/class/fc_host/$1/port_name | sed s/^0x//`
vportwwnn=`cat /sys/class/fc_host/$1/node_name | sed s/^0x//`
echo "$vportwwpn:$vportwwnn" > /sys/class/scsi_host/$shost/vport_delete
if test $? -ne 0 ; then exit 6; fi
sleep 4
return
fi
exit 6
}
vport_status()
{
# Look via upstream interfaces
for fchost in /sys/class/fc_host/* ; do
if test -e $fchost/vport_create ; then
vport_status_display $fchost $fchost
fi
done
# Look in vendor-specific locations
# Emulex: interfaces mirror upstream, but are under adapter scsi_host
for shost in /sys/class/scsi_host/* ; do
if [ -e $shost/vport_create ] ; then
fchost=`ls -d $shost/device/fc_host*`
vport_status_display $fchost $shost
fi
done
return 0
}
vport_status_display()
{
echo
echo "fc_host: " $2
echo "port_state: " `cat $1/port_state`
echo "port_type: " `cat $1/port_type`
echo "fabric_name: " `cat $1/fabric_name`
echo "max_npiv_vports: " `cat $2/max_npiv_vports`
echo "npiv_vports_inuse: " `cat $2/npiv_vports_inuse`
echo "modeldesc: " `cat $2/modeldesc`
echo "speed: " `cat $1/speed`
return 0
}
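`create_vport` derives a WWNN when the caller supplies only a WWPN, forcing the seventh hex digit to `0` per the byte-6 convention noted in its comments. The shell expansion `wwnn=${wwpn:0:6}"0"${wwpn:7}` translates directly to Python:

```python
def derive_wwnn(wwpn):
    # Mirror of wwnn=${wwpn:0:6}"0"${wwpn:7}: keep hex digits 0-5,
    # force digit 6 to '0', keep the remainder; length stays 16.
    return wwpn[:6] + '0' + wwpn[7:]
```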
++++++ block-npiv-vport ++++++
#!/bin/bash
# Usage: block-npiv-vport [create npivargs | delete vportwwpn | status]
dir=$(dirname "$0")
. "$dir/block-npiv-common.sh"
#set -x
command=$1
params=$2
case "$command" in
create)
# Params is one big arg, with fields separated by hyphens:
# FABRIC-VPWWPN-VPWWNN-TGTWWPN-LUN#
# arg 2 - Fabric Name
# arg 3 - VPORT's WWPN
# arg 4 - VPORT's WWNN
# arg 5 - Target's WWPN
# arg 6 - LUN # on Target
# no wwn contains a leading 0x - it is a 16 character hex value
# You may want to optionally pick a specific adapter ?
NPIVARGS=$params;
LUN=${NPIVARGS##*-*-*-*-}; NPIVARGS=${NPIVARGS%-*}
if test $LUN = $NPIVARGS ; then exit 1; fi
TGTWWPN=${NPIVARGS##*-*-*-}; NPIVARGS=${NPIVARGS%-*}
if test $TGTWWPN = $NPIVARGS ; then exit 1; fi
VPORTWWNN=${NPIVARGS##*-*-}; NPIVARGS=${NPIVARGS%-*}
if test $VPORTWWNN = $NPIVARGS ; then exit 1; fi
VPORTWWPN=${NPIVARGS##*-}; NPIVARGS=${NPIVARGS%-*}
if test $VPORTWWPN = $NPIVARGS ; then exit 1; fi
FABRICNM=$NPIVARGS
# Ensure we compare everything using lower-case hex characters
TGTWWPN=`echo $TGTWWPN | tr A-Z a-z`
VPORTWWPN=`echo $VPORTWWPN | tr A-Z a-z`
VPORTWWNN=`echo $VPORTWWNN | tr A-Z a-z`
FABRICNM=`echo $FABRICNM | tr A-Z a-z`
find_vhost $VPORTWWPN $FABRICNM
if test -z "$vhost" ; then
create_vport $FABRICNM $VPORTWWPN $VPORTWWNN
if [ $? -ne 0 ] ; then exit 2; fi
sleep 8
find_vhost $VPORTWWPN $FABRICNM
if test -z "$vhost" ; then exit 3; fi
fi
exit 0
;;
delete)
# Params is VPORT's WWPN
# no wwn contains a leading 0x - it is a 16 character hex value
VPORTWWPN=$params
# Ensure we compare everything using lower-case hex characters
VPORTWWPN=`echo $VPORTWWPN | tr A-Z a-z`
find_vhost $VPORTWWPN $FABRICNM
if test -z "$vhost" ; then exit 4; fi
delete_vhost $vhost
exit 0
;;
status)
vport_status
exit 0
;;
*)
echo "Usage: block-npiv-vport [create npivargs | delete vportwwpn | status]"
exit 1
;;
esac
++++++ boot.local.xenU ++++++
#! /bin/sh
#
# Copyright (c) 2014 SUSE GmbH Nuernberg, Germany. All rights reserved.
#
# Author: Werner Fink <werner(a)suse.de>, 1996
# Burchard Steinbild <bs(a)suse.de>, 1996
#
# /etc/init.d/boot.local
#
# script with local commands to be executed from init on system startup
#
#
# Here you should add things, that should happen directly after booting
# before we're going to the first run level.
#
date
# echo "$MACHINE: running $0 $*"
my_REDIRECT="$(echo $REDIRECT | sed 's#^/dev/##')"
my_DEVICE="$(echo $my_REDIRECT | sed 's#^tty##')"
my_SPEED="$(stty speed)"
# echo REDIRECT $REDIRECT $my_REDIRECT
# echo my_DEVICE $my_DEVICE
# echo my_SPEED $my_SPEED
# compose a line like that for inittab
# S0:12345:respawn:/sbin/agetty -L 9600 ttyS0 vt102
case $my_REDIRECT in
ttyS*)
echo adding this line to inittab
echo "$my_DEVICE:12345:respawn:/sbin/agetty -L $my_SPEED $my_REDIRECT vt102"
echo "$my_DEVICE:12345:respawn:/sbin/agetty -L $my_SPEED $my_REDIRECT vt102" >> /etc/inittab
echo $my_REDIRECT >> /etc/securetty
;;
hvc*)
echo adding this line to inittab
echo "$my_DEVICE:12345:respawn:/sbin/agetty -L $my_SPEED $my_REDIRECT vt320"
echo "$my_DEVICE:12345:respawn:/sbin/agetty -L $my_SPEED $my_REDIRECT vt320" >> /etc/inittab
echo $my_REDIRECT >> /etc/securetty
;;
*)
echo "no modification in inittab needed for: $my_REDIRECT"
;;
esac
telinit q
# Changes for Xen
test -f /lib/modules/`uname -r`/modules.dep || depmod -ae
CMDLINE=`cat /proc/cmdline | grep 'ip='`
if test ! -z "$CMDLINE"; then
OLDIFS=$IFS
IFS=":"
read ip oth mask gw hostname dev dhcp rest < /proc/cmdline
IFS=$OLDIFS
hostname $hostname
ip=`echo $ip | sed 's/ip= *//'`
if test ! -z "$ip"; then
if test -z "$mask"; then
if [ ${ip%/*} = $ip ]; then
ip="$ip/27"
fi
echo "ip addr add $ip dev $dev"
ip addr add $ip dev $dev
ip link set $dev up
else
ifconfig $dev $ip netmask $mask up
fi
fi
if test "${dhcp#dhcp}" != "$dhcp"; then
ifup-dhcp $dev
fi
fi
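The boot script above splits the kernel `ip=` parameter on `:` into the ip/oth/mask/gw/hostname/dev/dhcp fields it then acts on. Roughly, in Python (field names follow the script; the helper is illustrative):

```python
def parse_ip_param(cmdline):
    # Find the ip= word and split its value on ':' into the seven
    # fields the script reads; missing trailing fields become ''.
    for word in cmdline.split():
        if word.startswith('ip='):
            fields = (word[3:].split(':') + [''] * 7)[:7]
            return dict(zip(('ip', 'oth', 'mask', 'gw',
                             'hostname', 'dev', 'dhcp'), fields))
    return None

conf = parse_ip_param(
    'root=/dev/xvda1 ip=10.0.0.5::255.255.255.0:10.0.0.1:guest:eth0:off')
```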
++++++ boot.xen ++++++
#! /bin/sh
# Copyright (c) 2005-2006 SUSE Linux AG, Nuernberg, Germany.
# All rights reserved.
#
# /etc/init.d/boot.xen
#
# LSB compatible service control script; see http://www.linuxbase.org/spec/
#
### BEGIN INIT INFO
# Provides: Xen
# Required-Start: boot.localfs
# Should-Start: boot.localnet
# Required-Stop: boot.localfs
# Should-Stop:
# Default-Start: B
# Default-Stop:
# Short-Description: Switch on and off TLS depending on whether Xen is running
# Description: Xen takes a major performance hit from the way
# recent glibc (and gcc) set up the TLS offset, because it has to
# play segmentation tricks. This can be avoided by moving the
# TLS libraries out of the way.
### END INIT INFO
. /etc/rc.status
# Reset status of this service
rc_reset
case "$1" in
start)
echo -n "Starting Xen setup "
if test -d /proc/xen; then
export LD_ASSUME_KERNEL=2.4.21
echo -n "Xen running "
fi
if test -d /proc/xen -a -d /lib/tls; then
echo -n "move /lib/tls away "
mv /lib/tls /lib/tls.save
elif test ! -d /proc/xen -a -d /lib/tls.save; then
echo -n "move back /lib/tls "
mv /lib/tls.save /lib/tls
fi
rc_status -v
;;
stop)
# rc_status -v
;;
try-restart|condrestart)
$0 restart
# Remember status and be quiet
rc_status
;;
restart)
## Stop the service and regardless of whether it was
## running or not, start it again.
$0 start
# Remember status and be quiet
rc_status
;;
force-reload)
$0 try-restart
rc_status
;;
reload)
rc_failed 3
rc_status -v
;;
status)
echo -n "Checking for Xen "
# Return value is slightly different for the status command:
# 0 - service up and running
# 1 - service dead, but /var/run/ pid file exists
# 2 - service dead, but /var/lock/ lock file exists
# 3 - service not running (unused)
# 4 - service status unknown :-(
# 5--199 reserved (5--99 LSB, 100--149 distro, 150--199 appl.)
if test -d /proc/xen; then
if test -d /lib/tls; then
echo -n "Xen running, /lib/tls existing "
rc_failed 1
else
echo -n "Xen running, /lib/tls not existing "
fi
else
if test -d /lib/tls.save; then
echo -n "Xen not running, /lib/tls existing "
rc_failed 2
else
echo -n "Xen not running, /lib/tls not existing "
rc_failed 3
fi
fi
rc_status -v
;;
*)
echo "Usage: $0 {start|stop|status|try-restart|restart|force-reload|reload}"
exit 1
;;
esac
rc_exit
++++++ build-python3-conversion.patch ++++++
Index: xen-4.13.0-testing/Config.mk
===================================================================
--- xen-4.13.0-testing.orig/Config.mk
+++ xen-4.13.0-testing/Config.mk
@@ -82,7 +82,7 @@ EXTRA_INCLUDES += $(EXTRA_PREFIX)/includ
EXTRA_LIB += $(EXTRA_PREFIX)/lib
endif
-PYTHON ?= python
+PYTHON ?= python3
PYTHON_PREFIX_ARG ?= --prefix="$(prefix)"
# The above requires that prefix contains *no spaces*. This variable is here
# to permit the user to set PYTHON_PREFIX_ARG to '' to workaround this bug:
Index: xen-4.13.0-testing/tools/configure
===================================================================
--- xen-4.13.0-testing.orig/tools/configure
+++ xen-4.13.0-testing/tools/configure
@@ -6926,7 +6926,7 @@ then
fi;;
esac
if test -z "$PYTHON"; then :
- for ac_prog in python python3 python2
+ for ac_prog in python3 python python2
do
# Extract the first word of "$ac_prog", so it can be a program name with args.
set dummy $ac_prog; ac_word=$2
@@ -7065,15 +7065,15 @@ if test x"${PYTHONPATH}" = x"no"
then
as_fn_error $? "Unable to find $PYTHON, please install $PYTHON" "$LINENO" 5
fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for python version >= 2.6 " >&5
-$as_echo_n "checking for python version >= 2.6 ... " >&6; }
-`$PYTHON -c 'import sys; sys.exit(eval("sys.version_info < (2, 6)"))'`
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for python3 version >= 3.0 " >&5
+$as_echo_n "checking for python3 version >= 3.0 ... " >&6; }
+`$PYTHON -c 'import sys; sys.exit(eval("sys.version_info < (3, 0)"))'`
if test "$?" != "0"
then
python_version=`$PYTHON -V 2>&1`
{ $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
$as_echo "no" >&6; }
- as_fn_error $? "$python_version is too old, minimum required version is 2.6" "$LINENO" 5
+ as_fn_error $? "$python_version is too old, minimum required version is 3.0" "$LINENO" 5
else
{ $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
$as_echo "yes" >&6; }
Index: xen-4.13.0-testing/tools/python/test.py
===================================================================
--- xen-4.13.0-testing.orig/tools/python/test.py
+++ xen-4.13.0-testing/tools/python/test.py
@@ -1,4 +1,4 @@
-#! /usr/bin/env python2.3
+#!/usr/bin/python3
##############################################################################
#
# Copyright (c) 2001, 2002 Zope Corporation and Contributors.
@@ -289,9 +289,9 @@ class ImmediateTestResult(unittest._Text
def stopTest(self, test):
self._testtimes[test] = time.time() - self._testtimes[test]
if gc.garbage:
- print "The following test left garbage:"
- print test
- print gc.garbage
+ print("The following test left garbage:")
+ print(test)
+ print(gc.garbage)
# XXX Perhaps eat the garbage here, so that the garbage isn't
# printed for every subsequent test.
@@ -301,23 +301,23 @@ class ImmediateTestResult(unittest._Text
and
t not in self._threads)]
if new_threads:
- print "The following test left new threads behind:"
- print test
- print "New thread(s):", new_threads
+ print("The following test left new threads behind:")
+ print(test)
+ print("New thread(s):", new_threads)
def print_times(self, stream, count=None):
- results = self._testtimes.items()
+ results = list(self._testtimes.items())
results.sort(lambda x, y: cmp(y[1], x[1]))
if count:
n = min(count, len(results))
if n:
- print >>stream, "Top %d longest tests:" % n
+ print("Top %d longest tests:" % n, file=stream)
else:
n = len(results)
if not n:
return
for i in range(n):
- print >>stream, "%6dms" % int(results[i][1] * 1000), results[i][0]
+ print("%6dms" % int(results[i][1] * 1000), results[i][0], file=stream)
def _print_traceback(self, msg, err, test, errlist):
if self.showAll or self.dots or self._progress:
@@ -369,7 +369,7 @@ class ImmediateTestResult(unittest._Text
if self._progress:
self.stream.write("\r")
if self._debug:
- raise err[0], err[1], err[2]
+ raise err[0](err[1]).with_traceback(err[2])
self._print_traceback("Error in test %s" % test, err,
test, self.errors)
@@ -377,7 +377,7 @@ class ImmediateTestResult(unittest._Text
if self._progress:
self.stream.write("\r")
if self._debug:
- raise err[0], err[1], err[2]
+ raise err[0](err[1]).with_traceback(err[2])
self._print_traceback("Failure in test %s" % test, err,
test, self.failures)
@@ -480,11 +480,11 @@ class PathInit:
kind = functional and "FUNCTIONAL" or "UNIT"
if libdir:
extra = os.path.join(self.org_cwd, libdir)
- print "Running %s tests from %s" % (kind, extra)
+ print("Running %s tests from %s" % (kind, extra))
self.libdir = extra
sys.path.insert(0, extra)
else:
- print "Running %s tests from %s" % (kind, self.cwd)
+ print("Running %s tests from %s" % (kind, self.cwd))
# Make sure functional tests find ftesting.zcml
if functional:
config_file = 'ftesting.zcml'
@@ -492,7 +492,7 @@ class PathInit:
# We chdired into build, so ftesting.zcml is in the
# parent directory
config_file = os.path.join('..', 'ftesting.zcml')
- print "Parsing %s" % config_file
+ print("Parsing %s" % config_file)
from zope.app.tests.functional import FunctionalTestSetup
FunctionalTestSetup(config_file)
@@ -530,7 +530,7 @@ class TestFileFinder:
if not "__init__.py" in files:
if not files or files == ["CVS"]:
return
- print "not a package", dir
+ print("not a package", dir)
return
# Put matching files in matches. If matches is non-empty,
@@ -549,9 +549,9 @@ class TestFileFinder:
__import__(pkg)
# We specifically do not want to catch ImportError since that's useful
# information to know when running the tests.
- except RuntimeError, e:
+ except RuntimeError as e:
if VERBOSE:
- print "skipping %s because: %s" % (pkg, e)
+ print("skipping %s because: %s" % (pkg, e))
return
else:
self.files.extend(matches)
@@ -698,16 +698,16 @@ class TrackRefs:
ct = [(type2count[t] - self.type2count.get(t, 0),
type2all[t] - self.type2all.get(t, 0),
t)
- for t in type2count.iterkeys()]
+ for t in type2count.keys()]
ct.sort()
ct.reverse()
printed = False
for delta1, delta2, t in ct:
if delta1 or delta2:
if not printed:
- print "%-55s %8s %8s" % ('', 'insts', 'refs')
+ print("%-55s %8s %8s" % ('', 'insts', 'refs'))
printed = True
- print "%-55s %8d %8d" % (t, delta1, delta2)
+ print("%-55s %8d %8d" % (t, delta1, delta2))
self.type2count = type2count
self.type2all = type2all
@@ -729,25 +729,25 @@ def runner(files, test_filter, debug):
if TIMESFN:
r.print_times(open(TIMESFN, "w"))
if VERBOSE:
- print "Wrote timing data to", TIMESFN
+ print("Wrote timing data to", TIMESFN)
if TIMETESTS:
r.print_times(sys.stdout, TIMETESTS)
except:
if DEBUGGER:
- print "%s:" % (sys.exc_info()[0], )
- print sys.exc_info()[1]
+ print("%s:" % (sys.exc_info()[0], ))
+ print(sys.exc_info()[1])
pdb.post_mortem(sys.exc_info()[2])
else:
raise
def remove_stale_bytecode(arg, dirname, names):
- names = map(os.path.normcase, names)
+ names = list(map(os.path.normcase, names))
for name in names:
if name.endswith(".pyc") or name.endswith(".pyo"):
srcname = name[:-1]
if srcname not in names:
fullname = os.path.join(dirname, name)
- print "Removing stale bytecode file", fullname
+ print("Removing stale bytecode file", fullname)
os.unlink(fullname)
def main(module_filter, test_filter, libdir):
@@ -773,12 +773,12 @@ def main(module_filter, test_filter, lib
runner(files, test_filter, DEBUG)
gc.collect()
if gc.garbage:
- print "GARBAGE:", len(gc.garbage), gc.garbage
+ print("GARBAGE:", len(gc.garbage), gc.garbage)
return
if REFCOUNT:
prev = rc
rc = sys.gettotalrefcount()
- print "totalrefcount=%-8d change=%-6d" % (rc, rc - prev)
+ print("totalrefcount=%-8d change=%-6d" % (rc, rc - prev))
track.update()
else:
runner(files, test_filter, DEBUG)
@@ -801,7 +801,7 @@ def configure_logging():
else:
logging.basicConfig()
- if os.environ.has_key("LOGGING"):
+ if "LOGGING" in os.environ:
level = int(os.environ["LOGGING"])
logging.getLogger().setLevel(level)
@@ -865,8 +865,8 @@ def process_args(argv=None):
# import the config file
if os.path.isfile(config_filename):
- print 'Configuration file found.'
- execfile(config_filename, globals())
+ print('Configuration file found.')
+ exec(compile(open(config_filename).read(), config_filename, 'exec'), globals())
try:
@@ -884,9 +884,9 @@ def process_args(argv=None):
# fixme: add the long names
# fixme: add the extra documentation
# fixme: test for functional first!
- except getopt.error, msg:
- print msg
- print "Try `python %s -h' for more information." % argv[0]
+ except getopt.error as msg:
+ print(msg)
+ print("Try `python %s -h' for more information." % argv[0])
sys.exit(2)
for k, v in opts:
@@ -916,13 +916,13 @@ def process_args(argv=None):
RUN_UNIT = True
RUN_FUNCTIONAL = True
elif k in ("-h", "--help"):
- print __doc__
+ print(__doc__)
sys.exit(0)
elif k in ("-g", "--gc-threshold"):
GC_THRESHOLD = int(v)
elif k in ("-G", "--gc-option"):
if not v.startswith("DEBUG_"):
- print "-G argument must be DEBUG_ flag, not", repr(v)
+ print("-G argument must be DEBUG_ flag, not", repr(v))
sys.exit(1)
GC_FLAGS.append(v)
elif k in ('-k', '--keepbytecode'):
@@ -968,30 +968,30 @@ def process_args(argv=None):
import pychecker.checker
if REFCOUNT and not hasattr(sys, "gettotalrefcount"):
- print "-r ignored, because it needs a debug build of Python"
+ print("-r ignored, because it needs a debug build of Python")
REFCOUNT = False
if sys.version_info < ( 2,3,2 ):
- print """\
+ print("""\
ERROR: Your python version is not supported by Zope3.
- Zope3 needs Python 2.3.2 or greater. You are running:""" + sys.version
+ Zope3 needs Python 2.3.2 or greater. You are running:""" + sys.version)
sys.exit(1)
if GC_THRESHOLD is not None:
if GC_THRESHOLD == 0:
gc.disable()
- print "gc disabled"
+ print("gc disabled")
else:
gc.set_threshold(GC_THRESHOLD)
- print "gc threshold:", gc.get_threshold()
+ print("gc threshold:", gc.get_threshold())
if GC_FLAGS:
val = 0
for flag in GC_FLAGS:
v = getattr(gc, flag, None)
if v is None:
- print "Unknown gc flag", repr(flag)
- print gc.set_debug.__doc__
+ print("Unknown gc flag", repr(flag))
+ print(gc.set_debug.__doc__)
sys.exit(1)
val |= v
gcdebug |= v
@@ -1009,10 +1009,10 @@ def process_args(argv=None):
if BUILD_INPLACE:
cmd += "_ext -i"
if VERBOSE:
- print cmd
+ print(cmd)
sts = os.system(cmd)
if sts:
- print "Build failed", hex(sts)
+ print("Build failed", hex(sts))
sys.exit(1)
k = []
@@ -1027,9 +1027,9 @@ def process_args(argv=None):
if VERBOSE:
kind = functional and "FUNCTIONAL" or "UNIT"
if LEVEL == 0:
- print "Running %s tests at all levels" % kind
+ print("Running %s tests at all levels" % kind)
else:
- print "Running %s tests at level %d" % (kind, LEVEL)
+ print("Running %s tests at level %d" % (kind, LEVEL))
# This was to avoid functional tests outside of z3, but this doesn't really
# work right.
@@ -1073,20 +1073,20 @@ def process_args(argv=None):
globals=globals(), locals=vars())
r = tracer.results()
path = "/tmp/trace.%s" % os.getpid()
- import cPickle
+ import pickle
f = open(path, "wb")
- cPickle.dump(r, f)
+ pickle.dump(r, f)
f.close()
- print path
+ print(path)
r.write_results(show_missing=True,
summary=True, coverdir=coverdir)
else:
bad = main(MODULE_FILTER, TEST_FILTER, LIBDIR)
if bad:
sys.exit(1)
- except ImportError, err:
- print err
- print sys.path
+ except ImportError as err:
+ print(err)
+ print(sys.path)
raise
Index: xen-4.13.0-testing/tools/configure.ac
===================================================================
--- xen-4.13.0-testing.orig/tools/configure.ac
+++ xen-4.13.0-testing/tools/configure.ac
@@ -337,14 +337,14 @@ case "$host_os" in
freebsd*) ;;
*) AX_PATH_PROG_OR_FAIL([BASH], [bash]);;
esac
-AS_IF([test -z "$PYTHON"], [AC_CHECK_PROGS([PYTHON], [python python3 python2], err)])
+AS_IF([test -z "$PYTHON"], [AC_CHECK_PROGS([PYTHON], [python3 python python2], err)])
AS_IF([test "$PYTHON" = "err"], [AC_MSG_ERROR([No python interpreter found])])
AS_IF([echo "$PYTHON" | grep -q "^/"], [], [AC_PATH_PROG([PYTHON], [$PYTHON])])
PYTHONPATH=$PYTHON
PYTHON=`basename $PYTHONPATH`
AX_PATH_PROG_OR_FAIL([PYTHONPATH], [$PYTHON])
-AX_CHECK_PYTHON_VERSION([2], [6])
+AX_CHECK_PYTHON_VERSION([3], [0])
AS_IF([test "$cross_compiling" != yes], [
AX_CHECK_PYTHON_DEVEL()
Index: xen-4.13.0-testing/tools/libxl/idl.py
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/idl.py
+++ xen-4.13.0-testing/tools/libxl/idl.py
@@ -271,7 +271,7 @@ class KeyedUnion(Aggregate):
if not isinstance(keyvar_type, Enumeration):
raise ValueError
- kv_kwargs = dict([(x.lstrip('keyvar_'),y) for (x,y) in kwargs.items() if x.startswith('keyvar_')])
+ kv_kwargs = dict([(x.lstrip('keyvar_'),y) for (x,y) in list(kwargs.items()) if x.startswith('keyvar_')])
self.keyvar = Field(keyvar_type, keyvar_name, **kv_kwargs)
@@ -317,7 +317,7 @@ class Array(Type):
kwargs.setdefault('json_parse_type', 'JSON_ARRAY')
Type.__init__(self, namespace=elem_type.namespace, typename=elem_type.rawname + " *", **kwargs)
- lv_kwargs = dict([(x.lstrip('lenvar_'),y) for (x,y) in kwargs.items() if x.startswith('lenvar_')])
+ lv_kwargs = dict([(x.lstrip('lenvar_'),y) for (x,y) in list(kwargs.items()) if x.startswith('lenvar_')])
self.lenvar = Field(integer, lenvar_name, **lv_kwargs)
self.elem_type = elem_type
@@ -353,7 +353,7 @@ def parse(f):
globs = {}
locs = OrderedDict()
- for n,t in globals().items():
+ for n,t in list(globals().items()):
if isinstance(t, Type):
globs[n] = t
elif isinstance(t,type(object)) and issubclass(t, Type):
Index: xen-4.13.0-testing/tools/libxl/gentest.py
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/gentest.py
+++ xen-4.13.0-testing/tools/libxl/gentest.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/python3
from __future__ import print_function
Index: xen-4.13.0-testing/tools/libxl/gentypes.py
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/gentypes.py
+++ xen-4.13.0-testing/tools/libxl/gentypes.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/python3
from __future__ import print_function
Index: xen-4.13.0-testing/tools/ocaml/libs/xentoollog/genlevels.py
===================================================================
--- xen-4.13.0-testing.orig/tools/ocaml/libs/xentoollog/genlevels.py
+++ xen-4.13.0-testing/tools/ocaml/libs/xentoollog/genlevels.py
@@ -89,7 +89,7 @@ def gen_c(level):
def autogen_header(open_comment, close_comment):
s = open_comment + " AUTO-GENERATED FILE DO NOT EDIT " + close_comment + "\n"
s += open_comment + " autogenerated by \n"
- s += reduce(lambda x,y: x + " ", range(len(open_comment + " ")), "")
+ s += reduce(lambda x,y: x + " ", list(range(len(open_comment + " "))), "")
s += "%s" % " ".join(sys.argv)
s += "\n " + close_comment + "\n\n"
return s
Index: xen-4.13.0-testing/tools/include/xen-foreign/mkheader.py
===================================================================
--- xen-4.13.0-testing.orig/tools/include/xen-foreign/mkheader.py
+++ xen-4.13.0-testing/tools/include/xen-foreign/mkheader.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/python3
import sys, re;
from structs import unions, structs, defines;
Index: xen-4.13.0-testing/tools/include/xen-foreign/mkchecker.py
===================================================================
--- xen-4.13.0-testing.orig/tools/include/xen-foreign/mkchecker.py
+++ xen-4.13.0-testing/tools/include/xen-foreign/mkchecker.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/python3
import sys;
from structs import structs, compat_arches;
Index: xen-4.13.0-testing/xen/tools/gen-cpuid.py
===================================================================
--- xen-4.13.0-testing.orig/xen/tools/gen-cpuid.py
+++ xen-4.13.0-testing/xen/tools/gen-cpuid.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
import sys, os, re
@@ -135,7 +135,7 @@ def crunch_numbers(state):
common_1d = (FPU, VME, DE, PSE, TSC, MSR, PAE, MCE, CX8, APIC,
MTRR, PGE, MCA, CMOV, PAT, PSE36, MMX, FXSR)
- state.known = featureset_to_uint32s(state.names.keys(), nr_entries)
+ state.known = featureset_to_uint32s(list(state.names.keys()), nr_entries)
state.common_1d = featureset_to_uint32s(common_1d, 1)[0]
state.special = featureset_to_uint32s(state.raw_special, nr_entries)
state.pv = featureset_to_uint32s(state.raw_pv, nr_entries)
@@ -317,11 +317,11 @@ def crunch_numbers(state):
state.deep_deps[feat] = seen[1:]
- state.deep_features = featureset_to_uint32s(deps.keys(), nr_entries)
- state.nr_deep_deps = len(state.deep_deps.keys())
+ state.deep_features = featureset_to_uint32s(list(deps.keys()), nr_entries)
+ state.nr_deep_deps = len(list(state.deep_deps.keys()))
try:
- _tmp = state.deep_deps.iteritems()
+ _tmp = state.deep_deps.items()
except AttributeError:
_tmp = state.deep_deps.items()
@@ -329,10 +329,10 @@ def crunch_numbers(state):
state.deep_deps[k] = featureset_to_uint32s(v, nr_entries)
# Calculate the bitfield name declarations
- for word in xrange(nr_entries):
+ for word in range(nr_entries):
names = []
- for bit in xrange(32):
+ for bit in range(32):
name = state.names.get(word * 32 + bit, "")
Index: xen-4.13.0-testing/xen/tools/compat-build-source.py
===================================================================
--- xen-4.13.0-testing.orig/xen/tools/compat-build-source.py
+++ xen-4.13.0-testing/xen/tools/compat-build-source.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
import re,sys
Index: xen-4.13.0-testing/xen/tools/compat-build-header.py
===================================================================
--- xen-4.13.0-testing.orig/xen/tools/compat-build-header.py
+++ xen-4.13.0-testing/xen/tools/compat-build-header.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
import re,sys
Index: xen-4.13.0-testing/xen/tools/fig-to-oct.py
===================================================================
--- xen-4.13.0-testing.orig/xen/tools/fig-to-oct.py
+++ xen-4.13.0-testing/xen/tools/fig-to-oct.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
import sys
chars_per_line = 18
Index: xen-4.13.0-testing/tools/misc/xensymoops
===================================================================
--- xen-4.13.0-testing.orig/tools/misc/xensymoops
+++ xen-4.13.0-testing/tools/misc/xensymoops
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# An oops analyser for Xen
# Usage: xensymoops path-to-xen.s < oops-message
@@ -43,12 +43,12 @@ def read_oops():
return (eip_addr, stack_addresses)
def usage():
- print >> sys.stderr, """Usage: %s path-to-asm < oops-msg
+ print("""Usage: %s path-to-asm < oops-msg
The oops message should be fed to the standard input. The
command-line argument specifies the path to the Xen assembly dump
produced by \"make debug\". The location of EIP and the backtrace
will be output to standard output.
- """ % sys.argv[0]
+ """ % sys.argv[0], file=sys.stderr)
sys.exit()
##### main
@@ -99,7 +99,7 @@ while True:
# if this address was seen as a potential code address in the backtrace then
# record it in the backtrace list
- if stk_addrs.has_key(address):
+ if address in stk_addrs:
backtrace.append((stk_addrs[address], address, func))
# if this was the address that EIP...
@@ -107,12 +107,12 @@ while True:
eip_func = func
-print "EIP %s in function %s" % (eip_addr, eip_func)
-print "Backtrace:"
+print("EIP %s in function %s" % (eip_addr, eip_func))
+print("Backtrace:")
# sorting will order primarily by the first element of each tuple,
# i.e. the order in the original oops
backtrace.sort()
for (i, a, f) in backtrace:
- print "%s in function %s" % ( a, f )
+ print("%s in function %s" % ( a, f ))
++++++ disable-building-pv-shim.patch ++++++
--- xen-4.13.0-testing/xen/arch/x86/configs/pvshim_defconfig.orig 2019-10-14 09:46:44.567846243 -0600
+++ xen-4.13.0-testing/xen/arch/x86/configs/pvshim_defconfig 2019-10-14 09:47:17.722552005 -0600
@@ -2,8 +2,8 @@
CONFIG_PV=y
CONFIG_XEN_GUEST=y
CONFIG_PVH_GUEST=y
-CONFIG_PV_SHIM=y
-CONFIG_PV_SHIM_EXCLUSIVE=y
+CONFIG_PV_SHIM=n
+CONFIG_PV_SHIM_EXCLUSIVE=n
CONFIG_NR_CPUS=32
# Disable features not used by the PV shim
# CONFIG_SHADOW_PAGING is not set
++++++ etc_pam.d_xen-api ++++++
#%PAM-1.0
auth required pam_listfile.so onerr=fail item=user \
sense=allow file=/etc/xen/xenapiusers
auth include common-auth
account include common-account
password include common-password
session include common-session
++++++ gcc10-fixes.patch ++++++
References: bsc#1158414
For libxlu_pci.c
libxlu_pci.c: In function 'xlu_pci_parse_bdf':
libxlu_pci.c:32:18: error: 'func' may be used uninitialized in this function [-Werror=maybe-uninitialized]
32 | pcidev->func = func;
| ~~~~~~~~~~~~~^~~~~~
libxlu_pci.c:51:29: note: 'func' was declared here
51 | unsigned dom, bus, dev, func, vslot = 0;
| ^~~~
libxlu_pci.c:31:17: error: 'dev' may be used uninitialized in this function [-Werror=maybe-uninitialized]
31 | pcidev->dev = dev;
| ~~~~~~~~~~~~^~~~~
libxlu_pci.c:51:24: note: 'dev' was declared here
51 | unsigned dom, bus, dev, func, vslot = 0;
| ^~~
libxlu_pci.c:30:17: error: 'bus' may be used uninitialized in this function [-Werror=maybe-uninitialized]
30 | pcidev->bus = bus;
| ~~~~~~~~~~~~^~~~~
libxlu_pci.c:51:19: note: 'bus' was declared here
51 | unsigned dom, bus, dev, func, vslot = 0;
| ^~~
libxlu_pci.c:29:20: error: 'dom' may be used uninitialized in this function [-Werror=maybe-uninitialized]
29 | pcidev->domain = domain;
| ~~~~~~~~~~~~~~~^~~~~~~~
libxlu_pci.c:51:14: note: 'dom' was declared here
51 | unsigned dom, bus, dev, func, vslot = 0;
| ^~~
For kdd.c
kdd.c: In function 'kdd_tx':
kdd.c:408:30: error: array subscript 65534 is outside the bounds of an interior zero-length array 'uint8_t[0]' {aka 'unsigned char[0]'} [-Werror=zero-length-bounds]
408 | sum += s->txp.payload[i];
| ~~~~~~~~~~~~~~^~~
In file included from kdd.c:52:
kdd.h:326:17: note: while referencing 'payload'
326 | uint8_t payload[0];
| ^~~~~~~
cc1: all warnings being treated as errors
For ssl_tls.c
ssl_tls.c: In function 'ssl_session_reset':
ssl_tls.c:1778:5: warning: 'memset' used with length equal to number of elements without multiplication by element size [-Wmemset-elt-size]
1778 | memset( ssl->ctx_enc, 0, 128 );
| ^~~~~~
ssl_tls.c:1779:5: warning: 'memset' used with length equal to number of elements without multiplication by element size [-Wmemset-elt-size]
1779 | memset( ssl->ctx_dec, 0, 128 );
| ^~~~~~
ssl_tls.c: In function 'ssl_encrypt_buf':
ssl_tls.c:633:68: warning: this statement may fall through [-Wimplicit-fallthrough=]
633 | ssl->session->ciphersuite == SSL_RSA_CAMELLIA_256_SHA ||
ssl_tls.c:643:13: note: here
643 | default:
| ^~~~~~~
ssl_tls.c: In function 'ssl_decrypt_buf':
ssl_tls.c:738:68: warning: this statement may fall through [-Wimplicit-fallthrough=]
738 | ssl->session->ciphersuite == SSL_RSA_CAMELLIA_256_SHA ||
ssl_tls.c:748:13: note: here
748 | default:
| ^~~~~~~
For xenstored_core.h
ld: /home/abuild/rpmbuild/BUILD/xen-4.13.0-testing/stubdom/xenstore/xenstored.a(xenstored_watch.o):/home/abuild/rpmbuild/BUILD/xen-4.13.0-testing/stubdom/xenstore/xenstored_core.h:207: multiple definition of `xgt_handle'; /home/abuild/rpmbuild/BUILD/xen-4.13.0-testing/stubdom/xenstore/xenstored.a(xenstored_core.o):/home/abuild/rpmbuild/BUILD/xen-4.13.0-testing/stubdom/xenstore/xenstored_core.h:207: first defined here
For utils.h
ld: /home/abuild/rpmbuild/BUILD/xen-4.13.0-testing/stubdom/xenstore/xenstored.a(xenstored_watch.o):/home/abuild/rpmbuild/BUILD/xen-4.13.0-testing/stubdom/xenstore/utils.h:27: multiple definition of `xprintf'; /home/abuild/rpmbuild/BUILD/xen-4.13.0-testing/stubdom/xenstore/xenstored.a(xenstored_core.o):/home/abuild/rpmbuild/BUILD/xen-4.13.0-testing/stubdom/xenstore/utils.h:27: first defined here
For libxl_utils.c
specified bound 108 equals destination size [-Werror=stringop-truncation]
xenpmd.c: In function 'get_next_battery_file':
xenpmd.c:92:37: error: '%s' directive output may be truncated writing between 4 and 2147483645 bytes into a region of size 271 [-Werror=format-truncation=]
92 | #define BATTERY_STATE_FILE_PATH "/tmp/battery/%s/state"
| ^~~~~~~~~~~~~~~~~~~~~~~
xenpmd.c:117:52: note: in expansion of macro 'BATTERY_STATE_FILE_PATH'
117 | snprintf(file_name, sizeof(file_name), BATTERY_STATE_FILE_PATH,
| ^~~~~~~~~~~~~~~~~~~~~~~
Index: xen-4.13.0-testing/tools/libxl/libxlu_pci.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxlu_pci.c
+++ xen-4.13.0-testing/tools/libxl/libxlu_pci.c
@@ -22,6 +22,9 @@ static int hex_convert(const char *str,
return 0;
}
+#if __GNUC__ >= 10
+#pragma GCC diagnostic ignored "-Wmaybe-uninitialized"
+#endif
static int pcidev_struct_fill(libxl_device_pci *pcidev, unsigned int domain,
unsigned int bus, unsigned int dev,
unsigned int func, unsigned int vdevfn)
Index: xen-4.13.0-testing/tools/debugger/kdd/kdd.c
===================================================================
--- xen-4.13.0-testing.orig/tools/debugger/kdd/kdd.c
+++ xen-4.13.0-testing/tools/debugger/kdd/kdd.c
@@ -396,6 +396,9 @@ static void find_os(kdd_state *s)
*/
+#if __GNUC__ >= 10
+#pragma GCC diagnostic ignored "-Wzero-length-bounds"
+#endif
/* Send a serial packet */
static void kdd_tx(kdd_state *s)
{
Index: xen-4.13.0-testing/stubdom/polarssl.patch
===================================================================
--- xen-4.13.0-testing.orig/stubdom/polarssl.patch
+++ xen-4.13.0-testing/stubdom/polarssl.patch
@@ -62,3 +62,25 @@ diff -Naur polarssl-1.1.4/library/bignum
t_udbl r;
r = (t_udbl) X.p[i] << biL;
+--- polarssl-1.1.4/library/ssl_tls.c.orig 2012-05-30 01:39:36.000000000 -0600
++++ polarssl-1.1.4/library/ssl_tls.c 2020-03-10 10:17:26.270755351 -0600
+@@ -487,6 +487,9 @@ static void ssl_mac_sha1( unsigned char
+ sha1_finish( &sha1, buf + len );
+ }
+
++#if __GNUC__ >= 10
++#pragma GCC diagnostic ignored "-Wimplicit-fallthrough="
++#endif
+ /*
+ * Encryption/decryption functions
+ */
+@@ -1739,6 +1742,9 @@ int ssl_init( ssl_context *ssl )
+ return( 0 );
+ }
+
++#if __GNUC__ >= 10
++#pragma GCC diagnostic ignored "-Wmemset-elt-size"
++#endif
+ /*
+ * Reset an initialized and used SSL context for re-use while retaining
+ * all application-set variables, function pointers and data.
Index: xen-4.13.0-testing/tools/xenstore/xenstored_core.h
===================================================================
--- xen-4.13.0-testing.orig/tools/xenstore/xenstored_core.h
+++ xen-4.13.0-testing/tools/xenstore/xenstored_core.h
@@ -204,7 +204,11 @@ void finish_daemonize(void);
/* Open a pipe for signal handling */
void init_pipe(int reopen_log_pipe[2]);
+#if __GNUC__ >= 10
+extern xengnttab_handle **xgt_handle;
+#else
xengnttab_handle **xgt_handle;
+#endif
int remember_string(struct hashtable *hash, const char *str);
Index: xen-4.13.0-testing/tools/xenstore/utils.h
===================================================================
--- xen-4.13.0-testing.orig/tools/xenstore/utils.h
+++ xen-4.13.0-testing/tools/xenstore/utils.h
@@ -24,7 +24,11 @@ static inline bool strends(const char *a
void barf(const char *fmt, ...) __attribute__((noreturn));
void barf_perror(const char *fmt, ...) __attribute__((noreturn));
+#if __GNUC__ >= 10
+extern void (*xprintf)(const char *fmt, ...);
+#else
void (*xprintf)(const char *fmt, ...);
+#endif
#define eprintf(_fmt, _args...) xprintf("[ERR] %s" _fmt, __FUNCTION__, ##_args)
Index: xen-4.13.0-testing/tools/libxl/libxl_utils.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_utils.c
+++ xen-4.13.0-testing/tools/libxl/libxl_utils.c
@@ -1248,6 +1248,9 @@ int libxl__random_bytes(libxl__gc *gc, u
return ret;
}
+#if __GNUC__ >= 10
+#pragma GCC diagnostic ignored "-Wstringop-truncation"
+#endif
int libxl__prepare_sockaddr_un(libxl__gc *gc,
struct sockaddr_un *un, const char *path,
const char *what)
Index: xen-4.13.0-testing/tools/xenpmd/xenpmd.c
===================================================================
--- xen-4.13.0-testing.orig/tools/xenpmd/xenpmd.c
+++ xen-4.13.0-testing/tools/xenpmd/xenpmd.c
@@ -86,6 +86,9 @@ struct battery_status {
static struct xs_handle *xs;
+#if __GNUC__ >= 10
+#pragma GCC diagnostic ignored "-Wformat-truncation"
+#endif
#ifdef RUN_IN_SIMULATE_MODE
#define BATTERY_DIR_PATH "/tmp/battery"
#define BATTERY_INFO_FILE_PATH "/tmp/battery/%s/info"
++++++ hibernate.patch ++++++
Index: xen-4.8.0-testing/tools/libacpi/ssdt_s3.asl
===================================================================
--- xen-4.8.0-testing.orig/tools/libacpi/ssdt_s3.asl
+++ xen-4.8.0-testing/tools/libacpi/ssdt_s3.asl
@@ -16,13 +16,9 @@
DefinitionBlock ("SSDT_S3.aml", "SSDT", 2, "Xen", "HVM", 0)
{
- /* Must match piix emulation */
- Name (\_S3, Package (0x04)
- {
- 0x01, /* PM1a_CNT.SLP_TYP */
- 0x01, /* PM1b_CNT.SLP_TYP */
- 0x0, /* reserved */
- 0x0 /* reserved */
- })
+ /*
+ * Turn off support for s3 sleep state to deal with SVVP tests.
+ * This is what MSFT does on HyperV.
+ */
}
Index: xen-4.8.0-testing/tools/libacpi/ssdt_s4.asl
===================================================================
--- xen-4.8.0-testing.orig/tools/libacpi/ssdt_s4.asl
+++ xen-4.8.0-testing/tools/libacpi/ssdt_s4.asl
@@ -16,13 +16,9 @@
DefinitionBlock ("SSDT_S4.aml", "SSDT", 2, "Xen", "HVM", 0)
{
- /* Must match piix emulation */
- Name (\_S4, Package (0x04)
- {
- 0x00, /* PM1a_CNT.SLP_TYP */
- 0x00, /* PM1b_CNT.SLP_TYP */
- 0x00, /* reserved */
- 0x00 /* reserved */
- })
+ /*
+ * Turn off support for s4 sleep state to deal with SVVP tests.
+ * This is what MSFT does on HyperV.
+ */
}
++++++ ignore-ip-command-script-errors.patch ++++++
References: bsc#1172356
The bug is that virt-manager reports a failure even though the host
and guest have successfully added the network interface: the Xen
hotplug scripts exit with an error although the underlying command
succeeds.
The 'ip' commands abort the script, due to the 'set -e' in
xen-script-common.sh, with what appears to be an error condition.
However, each command actually succeeds when checked from the host
console, or when a sleep is inserted before it and the command is
run manually at the command line. This appears to be an artifact of
using 'set -e' everywhere.
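The workaround used in the hunks below can be reduced to a small sketch (the 'flaky' function is a hypothetical stand-in for an 'ip' command that exits non-zero): wrapping the command in a subshell with '|| true' swallows the exit status, so 'set -e' no longer terminates the script.

```shell
# Minimal sketch of the '(cmd || true)' guard against 'set -e'.
set -e

# Stand-in for an 'ip' command that exits non-zero; calling it bare
# at top level would abort the whole script under 'set -e'.
flaky() { return 1; }

guarded() {
    # Subshell plus '|| true' swallows the non-zero status, so the
    # 'set -e' shell keeps running past this point.
    ( flaky || true )
    echo "continued"
}
```

Calling 'guarded' prints "continued"; calling 'flaky' directly at top level would instead terminate the script with status 1, which is exactly the failure virt-manager was seeing.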
--- xen-4.13.1-testing.orig/tools/hotplug/Linux/xen-network-common.sh
+++ xen-4.13.1-testing/tools/hotplug/Linux/xen-network-common.sh
@@ -90,7 +90,7 @@ _setup_bridge_port() {
local virtual="$2"
# take interface down ...
- ip link set dev ${dev} down
+ (ip link set dev ${dev} down || true)
if [ $virtual -ne 0 ] ; then
# Initialise a dummy MAC address. We choose the numerically
@@ -101,7 +101,7 @@ _setup_bridge_port() {
fi
# ... and configure it
- ip address flush dev ${dev}
+ (ip address flush dev ${dev} || true)
}
setup_physical_bridge_port() {
@@ -138,11 +138,11 @@ add_to_bridge () {
return
fi
if [ "$legacy_tools" ]; then
- brctl addif ${bridge} ${dev}
+ (brctl addif ${bridge} ${dev} || true)
else
- ip link set "$dev" master "$bridge"
+ (ip link set "$dev" master "$bridge" || true)
fi
- ip link set dev ${dev} up
+ (ip link set dev ${dev} up || true)
}
# Usage: set_mtu bridge dev
++++++ init.pciback ++++++
#!/bin/bash
#
# Copyright (c) 2014 SUSE GmbH Nuernberg, Germany. All rights reserved.
#
# /etc/init.d/pciback
#
### BEGIN INIT INFO
# Provides: pciback
# Required-Start: $syslog $network
# Should-Start: $null
# Required-Stop: $syslog $network
# Should-Stop: $null
# Default-Start: 3 5
# Default-Stop: 0 1 2 6
# Description: bind PCI devices to pciback
### END INIT INFO
. /etc/rc.status
. /etc/sysconfig/pciback
rc_reset
load_pciback() {
if ! lsmod | grep -qi "pciback"
then
echo "Loading pciback ..."
modprobe pciback
fi
}
unload_pciback() {
if lsmod | grep -qi "pciback"
then
echo "Unloading pciback ..."
modprobe -r pciback
fi
}
bind_dev_to_pciback() {
for DEVICE in ${XEN_PCI_HIDE_LIST}
do
local DRV=`echo ${DEVICE} | /usr/bin/cut -d "," -f 1`
local PCIID=`echo ${DEVICE} | /usr/bin/cut -d "," -f 2`
if ! ls /sys/bus/pci/drivers/pciback/${PCIID} > /dev/null 2>&1
then
echo "Binding ${PCIID} ..."
if ls /sys/bus/pci/drivers/${DRV}/${PCIID} > /dev/null 2>&1
then
echo -n ${PCIID} > /sys/bus/pci/drivers/${DRV}/unbind
fi
echo -n ${PCIID} > /sys/bus/pci/drivers/pciback/new_slot
echo -n ${PCIID} > /sys/bus/pci/drivers/pciback/bind
fi
done
}
unbind_dev_from_pciback() {
for DEVICE in ${XEN_PCI_HIDE_LIST}
do
local DRV=`echo ${DEVICE} | /usr/bin/cut -d "," -f 1`
local PCIID=`echo ${DEVICE} | /usr/bin/cut -d "," -f 2`
if ls /sys/bus/pci/drivers/pciback/${PCIID} > /dev/null 2>&1
then
echo "Unbinding ${PCIID} ..."
echo -n ${PCIID} > /sys/bus/pci/drivers/pciback/unbind
fi
done
}
uname -r | grep -q xen && exit 0
case $1 in
start)
echo "Starting pciback ..."
echo
load_pciback
bind_dev_to_pciback
rc_status -v -r
;;
stop)
echo "Stopping pciback ..."
echo
unbind_dev_from_pciback
unload_pciback
rc_status -v
;;
reload|restart)
echo "Stopping pciback ..."
echo
unbind_dev_from_pciback
unload_pciback
echo "Starting pciback ..."
echo
load_pciback
bind_dev_to_pciback
;;
status)
if lsmod | grep -qi pciback
then
echo
echo "pciback: loaded"
echo
echo "Currently bound devices ..."
echo "-----------------------------"
ls /sys/bus/pci/drivers/pciback | grep ^0000
echo
else
echo "pciback: not loaded"
fi
;;
*)
echo "Usage: $0 [start|stop|restart|reload|status]"
exit 1
;;
esac
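For reference, the XEN_PCI_HIDE_LIST variable consumed by the init script above is a whitespace-separated list of driver,PCI-ID pairs. A minimal sketch of how one entry is split, using a hypothetical example entry (the real list lives in /etc/sysconfig/pciback):

```shell
# Sketch: how bind_dev_to_pciback splits one XEN_PCI_HIDE_LIST entry.
# Each entry has the form <current-driver>,<pci-id>; the entry below is
# a made-up example, not taken from any real sysconfig file.
DEVICE="e1000e,0000:00:19.0"
DRV=$(echo "${DEVICE}" | cut -d "," -f 1)
PCIID=$(echo "${DEVICE}" | cut -d "," -f 2)
echo "${DRV}"     # e1000e
echo "${PCIID}"   # 0000:00:19.0
```

The script then writes ${PCIID} to the unbind file of ${DRV} and to pciback's new_slot and bind files, exactly as in bind_dev_to_pciback above.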
++++++ init.xen_loop ++++++
# Increase the number of loopback devices available for vm creation
options loop max_loop=64
++++++ ipxe-enable-nics.patch ++++++
Index: xen-4.2.0-testing/tools/firmware/etherboot/Config
===================================================================
--- xen-4.2.0-testing.orig/tools/firmware/etherboot/Config
+++ xen-4.2.0-testing/tools/firmware/etherboot/Config
@@ -1,3 +1,4 @@
+NICS = rtl8139 8086100e eepro100 e1000 pcnet32 10ec8029
CFLAGS += -UPXE_DHCP_STRICT
CFLAGS += -DPXE_DHCP_STRICT
++++++ ipxe-no-error-logical-not-parentheses.patch ++++++
Index: xen-4.8.0-testing/tools/firmware/etherboot/patches/ipxe-no-error-logical-not-parentheses.patch
===================================================================
--- /dev/null
+++ xen-4.8.0-testing/tools/firmware/etherboot/patches/ipxe-no-error-logical-not-parentheses.patch
@@ -0,0 +1,11 @@
+--- ipxe/src/Makefile.housekeeping.orig 2015-03-12 12:15:50.054891858 +0000
++++ ipxe/src/Makefile.housekeeping 2015-03-12 12:16:05.978071221 +0000
+@@ -415,7 +415,7 @@
+ # Inhibit -Werror if NO_WERROR is specified on make command line
+ #
+ ifneq ($(NO_WERROR),1)
+-CFLAGS += -Werror
++CFLAGS += -Werror -Wno-logical-not-parentheses
+ ASFLAGS += --fatal-warnings
+ endif
+
Index: xen-4.8.0-testing/tools/firmware/etherboot/patches/series
===================================================================
--- xen-4.8.0-testing.orig/tools/firmware/etherboot/patches/series
+++ xen-4.8.0-testing/tools/firmware/etherboot/patches/series
@@ -1 +1,2 @@
boot_prompt_option.patch
+ipxe-no-error-logical-not-parentheses.patch
++++++ ipxe-use-rpm-opt-flags.patch ++++++
References: bsc#969377 - xen does not build with GCC 6
Index: xen-4.8.0-testing/tools/firmware/etherboot/patches/ipxe-use-rpm-opt-flags.patch
===================================================================
--- /dev/null
+++ xen-4.8.0-testing/tools/firmware/etherboot/patches/ipxe-use-rpm-opt-flags.patch
@@ -0,0 +1,11 @@
+--- ipxe/src/Makefile.orig 2016-03-04 15:48:15.000000000 -0700
++++ ipxe/src/Makefile 2016-03-04 15:48:40.000000000 -0700
+@@ -4,7 +4,7 @@
+ #
+
+ CLEANUP :=
+-CFLAGS :=
++CFLAGS := $(RPM_OPT_FLAGS) -Wno-error=array-bounds -Wno-nonnull-compare -Wno-unused-const-variable -Wno-misleading-indentation -Wno-shift-negative-value -Wno-implicit-fallthrough -Wno-nonnull
+ ASFLAGS :=
+ LDFLAGS :=
+ MAKEDEPS := Makefile
Index: xen-4.8.0-testing/tools/firmware/etherboot/patches/series
===================================================================
--- xen-4.8.0-testing.orig/tools/firmware/etherboot/patches/series
+++ xen-4.8.0-testing/tools/firmware/etherboot/patches/series
@@ -1,2 +1,3 @@
boot_prompt_option.patch
ipxe-no-error-logical-not-parentheses.patch
+ipxe-use-rpm-opt-flags.patch
++++++ libxc.migrate_tracking.patch ++++++
Track live migration state unconditionally in logfiles to see how long a domU was suspended.
Depends on libxc.sr.superpage.patch
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -69,20 +69,26 @@ int xc_domain_cacheflush(xc_interface *x
int xc_domain_pause(xc_interface *xch,
uint32_t domid)
{
+ int ret;
DECLARE_DOMCTL;
domctl.cmd = XEN_DOMCTL_pausedomain;
domctl.domain = domid;
- return do_domctl(xch, &domctl);
+ ret = do_domctl(xch, &domctl);
+ SUSEINFO("domid %u: %s returned %d", domid, __func__, ret);
+ return ret;
}
int xc_domain_unpause(xc_interface *xch,
uint32_t domid)
{
+ int ret;
DECLARE_DOMCTL;
domctl.cmd = XEN_DOMCTL_unpausedomain;
domctl.domain = domid;
- return do_domctl(xch, &domctl);
+ ret = do_domctl(xch, &domctl);
+ SUSEINFO("domid %u: %s returned %d", domid, __func__, ret);
+ return ret;
}
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -42,6 +42,11 @@
#include <xen-tools/libs.h>
+#define SUSEINFO(_m, _a...) do { int ERROR_errno = errno; \
+ xc_report(xch, xch->error_handler, XTL_ERROR, XC_ERROR_NONE, "SUSEINFO: " _m , ## _a ); \
+ errno = ERROR_errno; \
+ } while (0)
+
#if defined(HAVE_VALGRIND_MEMCHECK_H) && !defined(NDEBUG) && !defined(__MINIOS__)
/* Compile in Valgrind client requests? */
#include <valgrind/memcheck.h>
--- a/tools/libxc/xc_resume.c
+++ b/tools/libxc/xc_resume.c
@@ -284,7 +284,9 @@ out:
*/
int xc_domain_resume(xc_interface *xch, uint32_t domid, int fast)
{
- return (fast
+ int ret = (fast
? xc_domain_resume_cooperative(xch, domid)
: xc_domain_resume_any(xch, domid));
+ SUSEINFO("domid %u: %s%s returned %d", domid, __func__, fast ? " fast" : "", ret);
+ return ret;
}
--- a/tools/libxc/xc_sr_common.c
+++ b/tools/libxc/xc_sr_common.c
@@ -196,6 +196,65 @@ bool _xc_sr_bitmap_resize(struct xc_sr_b
return true;
}
+/* Write a two-character hex representation of 'byte' to digits[].
+ Pre-condition: sizeof(digits) >= 2 */
+static void byte_to_hex(char *digits, const uint8_t byte)
+{
+ uint8_t nybbel = byte >> 4;
+
+ if ( nybbel > 9 )
+ digits[0] = 'a' + nybbel-10;
+ else
+ digits[0] = '0' + nybbel;
+
+ nybbel = byte & 0x0f;
+ if ( nybbel > 9 )
+ digits[1] = 'a' + nybbel-10;
+ else
+ digits[1] = '0' + nybbel;
+}
+
+/* Convert an array of 16 unsigned bytes to a DCE/OSF formatted UUID
+ string.
+
+ Pre-condition: sizeof(dest) >= 37 */
+void sr_uuid_to_string(char *dest, const uint8_t *uuid)
+{
+ int i = 0;
+ char *p = dest;
+
+ for (; i < 4; i++ )
+ {
+ byte_to_hex(p, uuid[i]);
+ p += 2;
+ }
+ *p++ = '-';
+ for (; i < 6; i++ )
+ {
+ byte_to_hex(p, uuid[i]);
+ p += 2;
+ }
+ *p++ = '-';
+ for (; i < 8; i++ )
+ {
+ byte_to_hex(p, uuid[i]);
+ p += 2;
+ }
+ *p++ = '-';
+ for (; i < 10; i++ )
+ {
+ byte_to_hex(p, uuid[i]);
+ p += 2;
+ }
+ *p++ = '-';
+ for (; i < 16; i++ )
+ {
+ byte_to_hex(p, uuid[i]);
+ p += 2;
+ }
+ *p = '\0';
+}
+
/*
* Local variables:
* mode: C
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -195,6 +195,7 @@ struct xc_sr_context
int fd;
xc_dominfo_t dominfo;
+ char uuid[16*2+4+1];
union /* Common save or restore data. */
{
@@ -427,6 +428,8 @@ static inline int pfn_set_populated(stru
return 0;
}
+extern void sr_uuid_to_string(char *dest, const uint8_t *uuid);
+
struct xc_sr_record
{
uint32_t type;
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -608,6 +608,7 @@ static int restore(struct xc_sr_context
struct xc_sr_record rec;
int rc, saved_rc = 0, saved_errno = 0;
+ SUSEINFO("domid %u: %s %s start", ctx->domid, ctx->uuid, __func__);
IPRINTF("Restoring domain");
rc = setup(ctx);
@@ -684,6 +685,7 @@ static int restore(struct xc_sr_context
PERROR("Restore failed");
done:
+ SUSEINFO("domid %u: %s done", ctx->domid, __func__);
cleanup(ctx);
if ( saved_rc )
@@ -748,6 +750,7 @@ int xc_domain_restore(xc_interface *xch,
}
ctx.domid = dom;
+ sr_uuid_to_string(ctx.uuid, ctx.dominfo.handle);
if ( read_headers(&ctx) )
return -1;
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -852,6 +852,7 @@ static int save(struct xc_sr_context *ct
xc_interface *xch = ctx->xch;
int rc, saved_rc = 0, saved_errno = 0;
+ SUSEINFO("domid %u: %s %s start, %lu pages allocated", ctx->domid, ctx->uuid, __func__, ctx->dominfo.nr_pages);
IPRINTF("Saving domain %d, type %s",
ctx->domid, dhdr_type_to_str(guest_type));
@@ -964,6 +965,7 @@ static int save(struct xc_sr_context *ct
PERROR("Save failed");
done:
+ SUSEINFO("domid %u: %s done", ctx->domid, __func__);
cleanup(ctx);
if ( saved_rc )
@@ -1019,6 +1021,10 @@ static int suse_precopy_policy(struct pr
goto out;
}
/* Keep going */
+ if ( stats.dirty_count >= 0 )
+ SUSEINFO("domid %u: dirty pages %ld after iteration %u/%u",
+ suse_flags.ctx->domid,
+ suse_flags.dirty_count, stats.iteration, suse_flags.max_iters);
return XGS_POLICY_CONTINUE_PRECOPY;
out:
@@ -1032,6 +1038,8 @@ out:
return XGS_POLICY_ABORT;
}
suspend:
+ SUSEINFO("domid %u: suspending, remaining dirty pages %ld prior final transit",
+ suse_flags.ctx->domid, suse_flags.dirty_count);
return XGS_POLICY_STOP_AND_COPY;
}
@@ -1095,6 +1103,7 @@ int xc_domain_save_suse(xc_interface *xc
}
ctx.domid = dom;
+ sr_uuid_to_string(ctx.uuid, ctx.dominfo.handle);
if ( ctx.dominfo.hvm )
{
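For illustration only (not part of the patch), the DCE/OSF 8-4-4-4-12 grouping produced by sr_uuid_to_string in the patch above can be sketched in shell, formatting 16 hypothetical hex bytes with printf:

```shell
# Sketch: format 16 bytes as a DCE/OSF UUID string, mirroring the
# grouping done by sr_uuid_to_string (4-2-2-2-6 bytes, dash-separated).
# The byte values are arbitrary placeholders, not a real domain handle.
bytes="00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff"
set -- $bytes
printf '%s%s%s%s-%s%s-%s%s-%s%s-%s%s%s%s%s%s\n' "$@"
# 00112233-4455-6677-8899-aabbccddeeff
```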
++++++ libxc.sr.superpage.patch ++++++
++++ 905 lines (skipped)
++++++ libxl.LIBXL_HOTPLUG_TIMEOUT.patch ++++++
References: bsc#1120095
A domU with a large number of disks may run into the hardcoded
LIBXL_HOTPLUG_TIMEOUT limit, which is 40 seconds. This happens if the
preparation of each disk takes an unexpectedly large amount of time, so
that the total preparation time across all configured disks exceeds 40
seconds. The hotplug script which does the preparation takes a lock
before doing the actual work. Since the hotplug scripts for all disks
are spawned at nearly the same time, each one has to wait for the lock.
Due to this contention, the total execution time of a script can easily
exceed the timeout. In this case libxl will terminate the script because
it has to assume an error condition.
Example:
10 configured disks, each one takes 3 seconds within the critical
section. The total execution time will be 30 seconds, which is still
within the limit. With 5 additional configured disks, the total
execution time will be 45 seconds, which would trigger the timeout.
To handle such a setup without recompiling libxl, a special key/value
pair has to be created in xenstore prior to domain creation. This can be
done either manually, or at system startup.
If this systemd service file is placed in /etc/systemd/system/, and
activated, it will create the required entry in xenstore:
/etc/systemd/system # cat xen-LIBXL_HOTPLUG_TIMEOUT.service
[Unit]
Description=set global LIBXL_HOTPLUG_TIMEOUT
ConditionPathExists=/proc/xen/capabilities
Requires=xenstored.service
After=xenstored.service
Requires=xen-init-dom0.service
After=xen-init-dom0.service
Before=xencommons.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStartPre=/bin/grep -q control_d /proc/xen/capabilities
ExecStart=/usr/bin/xenstore-write /libxl/suse/per-device-LIBXL_HOTPLUG_TIMEOUT 5
[Install]
WantedBy=multi-user.target
/etc/systemd/system # systemctl enable xen-LIBXL_HOTPLUG_TIMEOUT.service
/etc/systemd/system # systemctl start xen-LIBXL_HOTPLUG_TIMEOUT.service
In this example the per-device value will be set to 5 seconds.
The change to libxl which handles this xenstore value also enables
additional logging if the key is found. That extra logging will show the
execution time of each script.
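As a back-of-the-envelope check (not part of the patch itself), the effective timeout scales with the device count. Assuming the per-device value of 5 seconds from the service file example and a hypothetical 15-disk domU:

```shell
# Sketch: effective hotplug timeout for the disk class, as computed by
# the patched libxl below (number of devices * per-device value).
per_device=5    # seconds; the value xenstore-write stores above
num_disks=15    # hypothetical domU configuration
echo $(( num_disks * per_device ))   # 75 seconds, vs. the hardcoded 40s
```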
Index: xen-4.13.0-testing/tools/libxl/libxl_aoutils.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_aoutils.c
+++ xen-4.13.0-testing/tools/libxl/libxl_aoutils.c
@@ -529,6 +529,8 @@ static void async_exec_timeout(libxl__eg
{
libxl__async_exec_state *aes = CONTAINER_OF(ev, *aes, time);
STATE_AO_GC(aes->ao);
+ char b[64];
+ libxl__suse_diff_timespec(&aes->start, b, sizeof(b));
if (!aes->rc)
aes->rc = rc;
@@ -536,7 +538,7 @@ static void async_exec_timeout(libxl__eg
libxl__ev_time_deregister(gc, &aes->time);
assert(libxl__ev_child_inuse(&aes->child));
- LOG(ERROR, "killing execution of %s because of timeout", aes->what);
+ LOG(ERROR, "killing execution of %s because of timeout%s", aes->what, b);
if (kill(aes->child.pid, SIGKILL)) {
LOGEV(ERROR, errno, "unable to kill %s [%ld]",
@@ -552,6 +554,10 @@ static void async_exec_done(libxl__egc *
{
libxl__async_exec_state *aes = CONTAINER_OF(child, *aes, child);
STATE_AO_GC(aes->ao);
+ char b[64];
+ libxl__suse_diff_timespec(&aes->start, b, sizeof(b));
+ if (b[0])
+ LOG(NOTICE, "finished execution of '%s'%s", aes->what, b);
libxl__ev_time_deregister(gc, &aes->time);
Index: xen-4.13.0-testing/tools/libxl/libxl_create.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_create.c
+++ xen-4.13.0-testing/tools/libxl/libxl_create.c
@@ -1116,6 +1116,7 @@ static void initiate_domain_create(libxl
* build info around just to know if the domain has a device model or not.
*/
store_libxl_entry(gc, domid, &d_config->b_info);
+ libxl__suse_domain_set_hotplug_timeout(gc, domid, d_config->num_disks, d_config->num_nics);
for (i = 0; i < d_config->num_disks; i++) {
ret = libxl__disk_devtype.set_default(gc, domid, &d_config->disks[i],
Index: xen-4.13.0-testing/tools/libxl/libxl_device.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_device.c
+++ xen-4.13.0-testing/tools/libxl/libxl_device.c
@@ -1212,7 +1212,7 @@ static void device_hotplug(libxl__egc *e
}
aes->ao = ao;
- aes->what = GCSPRINTF("%s %s", args[0], args[1]);
+ aes->what = GCSPRINTF("%s %s for %s", args[0], args[1], be_path);
aes->env = env;
aes->args = args;
aes->callback = device_hotplug_child_death_cb;
@@ -1221,6 +1221,15 @@ static void device_hotplug(libxl__egc *e
aes->stdfds[1] = 2;
aes->stdfds[2] = -1;
+ switch (aodev->dev->backend_kind) {
+ case LIBXL__DEVICE_KIND_VBD:
+ case LIBXL__DEVICE_KIND_VIF:
+ if (aodev->num_exec == 0)
+ libxl__suse_domain_get_hotplug_timeout(gc, aodev->dev->domid, aodev->dev->backend_kind, &aes->start, &aes->timeout_ms, be_path);
+ default:
+ break;
+ }
+
rc = libxl__async_exec_start(aes);
if (rc)
goto out;
Index: xen-4.13.0-testing/tools/libxl/libxl_event.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_event.c
+++ xen-4.13.0-testing/tools/libxl/libxl_event.c
@@ -858,27 +858,29 @@ static void devstate_callback(libxl__egc
{
EGC_GC;
libxl__ev_devstate *ds = CONTAINER_OF(xsw, *ds, w);
+ char b[64];
+ libxl__suse_diff_timespec(&ds->w.start, b, sizeof(b));
if (rc) {
if (rc == ERROR_TIMEDOUT)
- LOG(DEBUG, "backend %s wanted state %d "" timed out", ds->w.path,
- ds->wanted);
+ LOG(DEBUG, "backend %s wanted state %d "" timed out%s", ds->w.path,
+ ds->wanted, b);
goto out;
}
if (!sstate) {
- LOG(DEBUG, "backend %s wanted state %d"" but it was removed",
- ds->w.path, ds->wanted);
+ LOG(DEBUG, "backend %s wanted state %d"" but it was removed%s",
+ ds->w.path, ds->wanted, b);
rc = ERROR_INVAL;
goto out;
}
int got = atoi(sstate);
if (got == ds->wanted) {
- LOG(DEBUG, "backend %s wanted state %d ok", ds->w.path, ds->wanted);
+ LOG(DEBUG, "backend %s wanted state %d ok%s", ds->w.path, ds->wanted, b);
rc = 0;
} else {
- LOG(DEBUG, "backend %s wanted state %d"" still waiting state %d",
- ds->w.path, ds->wanted, got);
+ LOG(DEBUG, "backend %s wanted state %d"" still waiting state %d%s",
+ ds->w.path, ds->wanted, got, b);
return;
}
@@ -904,6 +906,8 @@ int libxl__ev_devstate_wait(libxl__ao *a
ds->w.path = state_path;
ds->w.timeout_ms = milliseconds;
ds->w.callback = devstate_callback;
+ rc = clock_gettime(CLOCK_MONOTONIC, &ds->w.start);
+ if (rc) goto out;
rc = libxl__xswait_start(gc, &ds->w);
if (rc) goto out;
Index: xen-4.13.0-testing/tools/libxl/libxl_internal.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_internal.c
+++ xen-4.13.0-testing/tools/libxl/libxl_internal.c
@@ -17,6 +17,97 @@
#include "libxl_internal.h"
+#define LIBXL_SUSE_PATH_TIMEOUT "/libxl/suse/per-device-LIBXL_HOTPLUG_TIMEOUT"
+#define LIBXL_SUSE_PATH_DISK_TIMEOUT "suse/disks-LIBXL_HOTPLUG_TIMEOUT"
+#define LIBXL_SUSE_PATH_NIC_TIMEOUT "suse/nics-LIBXL_HOTPLUG_TIMEOUT"
+
+void libxl__suse_domain_set_hotplug_timeout(libxl__gc *gc, uint32_t domid, long d, long n)
+{
+ char *path;
+ char *val, *p;
+ long v;
+
+ val = libxl__xs_read(gc, XBT_NULL, LIBXL_SUSE_PATH_TIMEOUT);
+ if (!val)
+ return;
+
+ v = strtol(val, NULL, 0);
+ if (v <= 0)
+ return;
+
+ path = libxl__xs_libxl_path(gc, domid);
+ if (d > 0) {
+ p = GCSPRINTF("%s/" LIBXL_SUSE_PATH_DISK_TIMEOUT, path);
+ LOGD(NOTICE, domid, "Setting %s to %ld*%ld=%ld", p, d, v, d*v);
+ libxl__xs_printf(gc, XBT_NULL, p, "%ld", d*v);
+ }
+ if (n > 0) {
+ p = GCSPRINTF("%s/" LIBXL_SUSE_PATH_NIC_TIMEOUT, path);
+ LOGD(NOTICE, domid, "Setting %s to %ld*%ld=%ld", p, n, v, n*v);
+ libxl__xs_printf(gc, XBT_NULL, p, "%ld", n*v);
+ }
+}
+
+void libxl__suse_domain_get_hotplug_timeout(libxl__gc *gc, uint32_t domid, libxl__device_kind kind, struct timespec *ts, int *timeout_ms, const char *be_path)
+{
+ char *path;
+ char *val, *p;
+ long v = 0;
+
+ path = libxl__xs_libxl_path(gc, domid);
+ if (!path)
+ return;
+
+ switch (kind) {
+ case LIBXL__DEVICE_KIND_VBD:
+ p = GCSPRINTF("%s/" LIBXL_SUSE_PATH_DISK_TIMEOUT, path);
+ break;
+ case LIBXL__DEVICE_KIND_VIF:
+ p = GCSPRINTF("%s/" LIBXL_SUSE_PATH_NIC_TIMEOUT, path);
+ break;
+ default:
+ return;
+ }
+ errno = 0;
+ val = libxl__xs_read(gc, XBT_NULL, p);
+ if (val)
+ v = strtol(val, NULL, 0);
+ LOGED(DEBUG, domid, "Got from '%s' = %ld from %s for %s", val?:"", v, p, be_path);
+ if (!val || v <= 0)
+ return;
+
+ if (v > (INT_MAX/1000))
+ v = (INT_MAX/1000);
+ v *= 1000;
+ LOGD(NOTICE, domid, "Replacing timeout %d with %ld for %s", *timeout_ms, v, be_path);
+ *timeout_ms = v;
+ if (clock_gettime(CLOCK_MONOTONIC, ts) < 0) {
+ LOGED(ERROR, domid, "clock_gettime failed for %s", be_path);
+ ts->tv_sec = ts->tv_nsec = 0;
+ }
+
+}
+
+void libxl__suse_diff_timespec(const struct timespec *old, char *b, size_t s)
+{
+ struct timespec new, diff;
+
+ if (old->tv_sec == 0 && old->tv_nsec == 0) {
+ *b = '\0';
+ return;
+ }
+ if (clock_gettime(CLOCK_MONOTONIC, &new))
+ new = *old;
+ if ((new.tv_nsec - old->tv_nsec) < 0) {
+ diff.tv_sec = new.tv_sec - old->tv_sec - 1;
+ diff.tv_nsec = new.tv_nsec - old->tv_nsec + (1000*1000*1000);
+ } else {
+ diff.tv_sec = new.tv_sec - old->tv_sec;
+ diff.tv_nsec = new.tv_nsec - old->tv_nsec;
+ }
+ snprintf(b, s, " (%ld.%09lds)", (long)diff.tv_sec, diff.tv_nsec);
+}
+
void libxl__alloc_failed(libxl_ctx *ctx, const char *func,
size_t nmemb, size_t size) {
#define M "libxl: FATAL ERROR: memory allocation failure"
Index: xen-4.13.0-testing/tools/libxl/libxl_internal.h
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_internal.h
+++ xen-4.13.0-testing/tools/libxl/libxl_internal.h
@@ -50,6 +50,7 @@
#include <sys/un.h>
#include <sys/file.h>
#include <sys/ioctl.h>
+#include <time.h>
#include <xenevtchn.h>
#include <xenstore.h>
@@ -1593,6 +1594,7 @@ struct libxl__xswait_state {
const char *what; /* for error msgs: noun phrase, what we're waiting for */
const char *path;
int timeout_ms; /* as for poll(2) */
+ struct timespec start;
libxl__xswait_callback *callback;
/* remaining fields are private to xswait */
libxl__ev_time time_ev;
@@ -2652,6 +2654,7 @@ struct libxl__async_exec_state {
char **args; /* execution arguments */
char **env; /* execution environment */
+ struct timespec start;
/* private */
libxl__ev_time time;
libxl__ev_child child;
@@ -4783,6 +4786,9 @@ _hidden int libxl__domain_pvcontrol(libx
#endif
+_hidden void libxl__suse_domain_set_hotplug_timeout(libxl__gc *gc, uint32_t domid, long d, long n);
+_hidden void libxl__suse_domain_get_hotplug_timeout(libxl__gc *gc, uint32_t domid, libxl__device_kind kind, struct timespec *ts, int *timeout_ms, const char *be_path);
+_hidden void libxl__suse_diff_timespec(const struct timespec *old, char *b, size_t s);
/*
* Local variables:
* mode: C
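The nanosecond borrow handling in libxl__suse_diff_timespec above can be sketched with plain shell arithmetic; the timestamps below are made-up values, not real clock readings:

```shell
# Sketch of the timespec subtraction in libxl__suse_diff_timespec:
# when new.tv_nsec < old.tv_nsec, borrow one second (10^9 ns).
old_s=10; old_ns=900000000     # hypothetical start time
new_s=12; new_ns=100000000     # hypothetical end time
if [ $(( new_ns - old_ns )) -lt 0 ]; then
    diff_s=$(( new_s - old_s - 1 ))
    diff_ns=$(( new_ns - old_ns + 1000000000 ))
else
    diff_s=$(( new_s - old_s ))
    diff_ns=$(( new_ns - old_ns ))
fi
printf ' (%d.%09ds)\n' "$diff_s" "$diff_ns"   # (1.200000000s)
```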
++++++ libxl.add-option-to-disable-disk-cache-flushes-in-qdisk.patch ++++++
https://bugzilla.novell.com/show_bug.cgi?id=879425
---
tools/libxl/libxl.c | 2 ++
tools/libxl/libxl.h | 12 ++++++++++++
tools/libxl/libxlu_disk.c | 2 ++
tools/libxl/libxlu_disk_i.h | 2 +-
tools/libxl/libxlu_disk_l.l | 1 +
5 files changed, 18 insertions(+), 1 deletion(-)
Index: xen-4.13.0-testing/docs/man/xl-disk-configuration.5.pod
===================================================================
--- xen-4.13.0-testing.orig/docs/man/xl-disk-configuration.5.pod
+++ xen-4.13.0-testing/docs/man/xl-disk-configuration.5.pod
@@ -344,6 +344,32 @@ can be used to disable "hole punching" f
were intentionally created non-sparse to avoid fragmentation of the
file.
+=item B<suse-diskcache-disable-flush>
+
+=over 4
+
+=item Description
+
+Request that the qemu block driver does not automatically flush written data to the backend storage.
+
+=item Supported values
+
+absent, present
+
+=item Mandatory
+
+No
+
+=item Default value
+
+absent
+
+=back
+
+This enables the '-disk cache=unsafe' mode inside qemu.
+In this mode writes to the underlying blockdevice are delayed.
+While using this option in production is dangerous, it improves performance during installation of a domU.
+
=back
Index: xen-4.13.0-testing/tools/libxl/libxl.h
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl.h
+++ xen-4.13.0-testing/tools/libxl/libxl.h
@@ -439,6 +439,21 @@
#define LIBXL_HAVE_CREATEINFO_PASSTHROUGH 1
/*
+ * The libxl_device_disk has no way to indicate that cache=unsafe is
+ * supposed to be used. Provide this knob without breaking the ABI.
+ * This is done by overloading struct libxl_device_disk->readwrite:
+ * readwrite == 0: disk is readonly, cache= does not matter
+ * readwrite == 1: disk is readwrite, backend driver may tweak cache=
+ * readwrite == MAGIC: disk is readwrite, backend driver should ignore
+ * flush requests from the frontend driver.
+ * Note: the macro with MAGIC is used by libvirt to decide if this patch is applied
+ */
+#define LIBXL_HAVE_LIBXL_DEVICE_DISK_DISABLE_FLUSH_MAGIC 0x00006000U
+#define LIBXL_HAVE_LIBXL_DEVICE_DISK_DISABLE_FLUSH_MASK 0xffff0fffU
+#define LIBXL_SUSE_IS_CACHE_UNSAFE(rw) (((rw) & ~LIBXL_HAVE_LIBXL_DEVICE_DISK_DISABLE_FLUSH_MASK) == LIBXL_HAVE_LIBXL_DEVICE_DISK_DISABLE_FLUSH_MAGIC)
+#define LIBXL_SUSE_SET_CACHE_UNSAFE(rw) (((rw) & LIBXL_HAVE_LIBXL_DEVICE_DISK_DISABLE_FLUSH_MASK) | LIBXL_HAVE_LIBXL_DEVICE_DISK_DISABLE_FLUSH_MAGIC)
+
+/*
* libxl ABI compatibility
*
* The only guarantee which libxl makes regarding ABI compatibility
Index: xen-4.13.0-testing/tools/libxl/libxl_disk.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_disk.c
+++ xen-4.13.0-testing/tools/libxl/libxl_disk.c
@@ -386,6 +386,8 @@ static void device_disk_add(libxl__egc *
flexarray_append_pair(back, "discard-enable",
libxl_defbool_val(disk->discard_enable) ?
"1" : "0");
+ if (LIBXL_SUSE_IS_CACHE_UNSAFE(disk->readwrite))
+ flexarray_append_pair(back, "suse-diskcache-disable-flush", "1");
flexarray_append(front, "backend-id");
flexarray_append(front, GCSPRINTF("%d", disk->backend_domid));
Index: xen-4.13.0-testing/tools/libxl/libxl_dm.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_dm.c
+++ xen-4.13.0-testing/tools/libxl/libxl_dm.c
@@ -984,14 +984,27 @@ enum {
LIBXL__COLO_SECONDARY,
};
+static const char *qemu_cache_mode(const libxl_device_disk *disk)
+{
+ static const char cache_directsync[] = "directsync";
+ static const char cache_writeback[] = "writeback";
+ static const char cache_unsafe[] = "unsafe";
+
+ if (LIBXL_SUSE_IS_CACHE_UNSAFE(disk->readwrite))
+ return cache_unsafe;
+ if (disk->direct_io_safe)
+ return cache_directsync;
+ return cache_writeback;
+}
+
static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char *target_path,
int unit, const char *format,
const libxl_device_disk *disk,
int colo_mode, const char **id_ptr)
{
char *drive = NULL;
- char *common = GCSPRINTF("if=none,readonly=%s,cache=writeback",
- disk->readwrite ? "off" : "on");
+ char *common = GCSPRINTF("if=none,readonly=%s,cache=%s",
+ disk->readwrite ? "off" : "on", qemu_cache_mode(disk));
const char *exportname = disk->colo_export;
const char *active_disk = disk->active_disk;
const char *hidden_disk = disk->hidden_disk;
@@ -1050,8 +1063,8 @@ static char *qemu_disk_ide_drive_string(
switch (colo_mode) {
case LIBXL__COLO_NONE:
drive = GCSPRINTF
- ("file=%s,if=ide,index=%d,media=disk,format=%s,cache=writeback",
- target_path, unit, format);
+ ("file=%s,if=ide,index=%d,media=disk,format=%s,cache=%s",
+ target_path, unit, format, qemu_cache_mode(disk));
break;
case LIBXL__COLO_PRIMARY:
/*
@@ -1064,13 +1077,14 @@ static char *qemu_disk_ide_drive_string(
* vote-threshold=1
*/
drive = GCSPRINTF(
- "if=ide,index=%d,media=disk,cache=writeback,driver=quorum,"
+ "if=ide,index=%d,media=disk,cache=%s,driver=quorum,"
"id=%s,"
"children.0.file.filename=%s,"
"children.0.driver=%s,"
"read-pattern=fifo,"
"vote-threshold=1",
- unit, exportname, target_path, format);
+ unit, qemu_cache_mode(disk),
+ exportname, target_path, format);
break;
case LIBXL__COLO_SECONDARY:
/*
@@ -1084,7 +1098,7 @@ static char *qemu_disk_ide_drive_string(
* file.backing.backing=exportname,
*/
drive = GCSPRINTF(
- "if=ide,index=%d,id=top-colo,media=disk,cache=writeback,"
+ "if=ide,index=%d,id=top-colo,media=disk,cache=%s,"
"driver=replication,"
"mode=secondary,"
"top-id=top-colo,"
@@ -1093,7 +1107,8 @@ static char *qemu_disk_ide_drive_string(
"file.backing.driver=qcow2,"
"file.backing.file.filename=%s,"
"file.backing.backing=%s",
- unit, active_disk, hidden_disk, exportname);
+ unit, qemu_cache_mode(disk),
+ active_disk, hidden_disk, exportname);
break;
default:
abort();
@@ -1881,8 +1896,8 @@ static int libxl__build_device_model_arg
return ERROR_INVAL;
}
flexarray_vappend(dm_args, "-drive",
- GCSPRINTF("file=%s,if=none,id=ahcidisk-%d,format=%s,cache=writeback",
- target_path, disk, format),
+ GCSPRINTF("file=%s,if=none,id=ahcidisk-%d,format=%s,cache=%s",
+ target_path, disk, format, qemu_cache_mode(&disks[i])),
"-device", GCSPRINTF("ide-hd,bus=ahci0.%d,unit=0,drive=ahcidisk-%d",
disk, disk), NULL);
continue;
Index: xen-4.13.0-testing/tools/libxl/libxlu_disk.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxlu_disk.c
+++ xen-4.13.0-testing/tools/libxl/libxlu_disk.c
@@ -79,6 +79,8 @@ int xlu_disk_parse(XLU_Config *cfg,
if (!disk->pdev_path || !strcmp(disk->pdev_path, ""))
disk->format = LIBXL_DISK_FORMAT_EMPTY;
}
+ if (disk->readwrite && dpc.suse_diskcache_disable_flush)
+ disk->readwrite = LIBXL_SUSE_SET_CACHE_UNSAFE(disk->readwrite);
if (!disk->vdev) {
xlu__disk_err(&dpc,0, "no vdev specified");
Index: xen-4.13.0-testing/tools/libxl/libxlu_disk_i.h
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxlu_disk_i.h
+++ xen-4.13.0-testing/tools/libxl/libxlu_disk_i.h
@@ -10,7 +10,7 @@ typedef struct {
void *scanner;
YY_BUFFER_STATE buf;
libxl_device_disk *disk;
- int access_set, had_depr_prefix;
+ int access_set, suse_diskcache_disable_flush, had_depr_prefix;
const char *spec;
} DiskParseContext;
Index: xen-4.13.0-testing/tools/libxl/libxlu_disk_l.l
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxlu_disk_l.l
+++ xen-4.13.0-testing/tools/libxl/libxlu_disk_l.l
@@ -196,6 +196,7 @@ colo-port=[^,]*,? { STRIP(','); setcolop
colo-export=[^,]*,? { STRIP(','); SAVESTRING("colo-export", colo_export, FROMEQUALS); }
active-disk=[^,]*,? { STRIP(','); SAVESTRING("active-disk", active_disk, FROMEQUALS); }
hidden-disk=[^,]*,? { STRIP(','); SAVESTRING("hidden-disk", hidden_disk, FROMEQUALS); }
+suse-diskcache-disable-flush,? { DPC->suse_diskcache_disable_flush = 1; }
/* the target magic parameter, eats the rest of the string */
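A sketch of the readwrite overloading declared in the libxl.h hunk above, with shell arithmetic standing in for the C macros (same MAGIC/MASK constants; the masking against 0xffffffff only compensates for shell's 64-bit arithmetic):

```shell
# Sketch of LIBXL_SUSE_SET_CACHE_UNSAFE / LIBXL_SUSE_IS_CACHE_UNSAFE:
# bits 12..15 of ->readwrite are overloaded to carry the cache=unsafe flag.
MAGIC=$(( 0x00006000 ))
MASK=$(( 0xffff0fff ))
rw=1                                  # ordinary readwrite disk
unsafe=$(( (rw & MASK) | MAGIC ))     # SET_CACHE_UNSAFE(rw)
printf '0x%08x\n' "$unsafe"           # 0x00006001
echo $(( (unsafe & ~MASK & 0xffffffff) == MAGIC ))   # 1, i.e. IS_CACHE_UNSAFE
```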
++++++ libxl.helper_done-crash.patch ++++++
From fb0f946726ff8aaa15b76bc3ec3b18878851a447 Mon Sep 17 00:00:00 2001
From: Olaf Hering <olaf(a)aepfle.de>
Date: Fri, 27 Sep 2019 18:06:12 +0200
Subject: libxl: fix crash in helper_done due to uninitialized data
A crash in helper_done, called from libxl_domain_suspend, was reported,
triggered by 'virsh migrate --live xen+ssh://host':
#1 helper_done (...) at libxl_save_callout.c:371
helper_failed
helper_stop
libxl__save_helper_abort
#2 check_all_finished (..., rc=-3) at libxl_stream_write.c:671
stream_done
stream_complete
write_done
dc->callback == write_done
efd->func == datacopier_writable
#3 afterpoll_internal (...) at libxl_event.c:1269
This is triggered by a failed poll, the actual error was:
libxl_aoutils.c:328:datacopier_writable: unexpected poll event 0x1c on fd 37 (should be POLLOUT) writing libxc header during copy of save v2 stream
In this case revents in datacopier_writable is POLLHUP|POLLERR|POLLOUT,
which triggers datacopier_callback. In helper_done,
shs->completion_callback is still zero. libxl__xc_domain_save fills
dss.sws.shs. But that function is only called after stream_header_done.
Any error before that will leave dss partly uninitialized.
Fix this crash by checking if ->completion_callback is valid.
Signed-off-by: Olaf Hering <olaf(a)aepfle.de>
---
tools/libxl/libxl_save_callout.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c
index 6452d70036..89a2f6ecf0 100644
--- a/tools/libxl/libxl_save_callout.c
+++ b/tools/libxl/libxl_save_callout.c
@@ -368,8 +368,9 @@ static void helper_done(libxl__egc *egc, libxl__save_helper_state *shs)
assert(!libxl__save_helper_inuse(shs));
shs->egc = egc;
- shs->completion_callback(egc, shs->caller_state,
- shs->rc, shs->retval, shs->errnoval);
+ if (shs->completion_callback)
+ shs->completion_callback(egc, shs->caller_state,
+ shs->rc, shs->retval, shs->errnoval);
shs->egc = 0;
}
++++++ libxl.libxl__domain_pvcontrol.patch ++++++
References: bsc#1161480
Fix xl shutdown for HVM without PV drivers
A return value of zero means no PV drivers. Restore the hunk which was removed.
Fixes commit b183e180bce93037d3ef385a8c2338bbfb7f23d9
Signed-off-by: Olaf Hering <olaf(a)aepfle.de>
---
tools/libxl/libxl_domain.c | 3 +++
1 file changed, 3 insertions(+)
Index: xen-4.13.1-testing/tools/libxl/libxl_domain.c
===================================================================
--- xen-4.13.1-testing.orig/tools/libxl/libxl_domain.c
+++ xen-4.13.1-testing/tools/libxl/libxl_domain.c
@@ -795,6 +795,9 @@ int libxl__domain_pvcontrol(libxl__egc *
if (rc < 0)
return rc;
+ if (!rc)
+ return ERROR_NOPARAVIRT;
+
shutdown_path = libxl__domain_pvcontrol_xspath(gc, domid);
if (!shutdown_path)
return ERROR_FAIL;
++++++ libxl.max_event_channels.patch ++++++
References: bsc#1167608
Remove the bound on max_event_channels.
The default of 1023 is too low for a three-digit number of vcpus.
It is difficult to make the value depend on the number of vcpus, and
adding devices at runtime also needs event channels.
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -224,7 +224,7 @@ int libxl__domain_build_info_setdefault(
b_info->iomem[i].gfn = b_info->iomem[i].start;
if (!b_info->event_channels)
- b_info->event_channels = 1023;
+ b_info->event_channels = -1U;
libxl__arch_domain_build_info_setdefault(gc, b_info);
libxl_defbool_setdefault(&b_info->dm_restrict, false);
++++++ libxl.pvscsi.patch ++++++
++++ 2538 lines (skipped)
++++++ libxl.set-migration-constraints-from-cmdline.patch ++++++
From 77deb80879859ed279e24a790ec08e9c5d37dd0e Mon Sep 17 00:00:00 2001
From: Olaf Hering <olaf(a)aepfle.de>
Date: Wed, 5 Feb 2014 14:37:53 +0100
Subject: libxl: set migration constraints from cmdline
Add new options to xl migrate to control the process of migration.
The intention is to optionally abort the migration if it takes too long
to migrate a busy guest due to the high number of new dirty pages.
Currently the guest is suspended to transfer the remaining dirty pages.
The suspend/resume cycle will cause a time jump. This transfer can take
a long time, which can confuse the guest if the time jump is too far.
The new options allow to override the built-in default values, which are
not changed by this patch.
--max_iters <number> Number of iterations before final suspend (default: 30)
--max_factor <factor> Max amount of memory to transfer before final suspend (default: 3*RAM)
--min_remaining <pages> Number of dirty pages before stop&copy (default: 50)
--abort_if_busy Abort migration instead of doing final suspend.
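Taken together, the options above compose into a single xl migrate invocation. The sketch below is for illustration only; the domain name, host name, and flag values are placeholders, not values taken from the patch:

```shell
# Sketch: compose an xl migrate command line with the new SUSE-specific
# constraint flags described above ("domU" and "dsthost" are placeholders).
cmd="xl migrate --max_iters 10 --max_factor 2 --min_remaining 100 --abort_if_busy domU dsthost"
echo "$cmd"
```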
The changes to libxl change the API, handle LIBXL_API_VERSION == 0x040200.
v8:
- merge --min_remaining changes
- tools/libxc: print stats if migration is aborted
- use special _suse version of lib calls to preserve ABI
v7:
- remove short options
- update description of --abort_if_busy in xl.1
- extend description of --abort_if_busy in xl help
- add comment to libxl_domain_suspend declaration, props is optional
v6:
- update the LIBXL_API_VERSION handling for libxl_domain_suspend
change it to an inline function if LIBXL_API_VERSION is defined to 4.2.0
- rename libxl_save_properties to libxl_domain_suspend_properties
- rename ->xlflags to ->flags within that struct
v5:
- adjust libxl_domain_suspend prototype, move flags, max_iters,
max_factor into a new, optional struct libxl_save_properties
- rename XCFLAGS_DOMSAVE_NOSUSPEND to XCFLAGS_DOMSAVE_ABORT_IF_BUSY
- rename LIBXL_SUSPEND_NO_FINAL_SUSPEND to LIBXL_SUSPEND_ABORT_IF_BUSY
- rename variables no_suspend to abort_if_busy
- rename option -N/--no_suspend to -A/--abort_if_busy
- update xl.1, extend description of -A option
v4:
- update default for no_suspend from None to 0 in XendCheckpoint.py:save
- update logoutput in setMigrateConstraints
- change xm migrate defaults from None to 0
- add new options to xl.1
- fix syntax error in XendDomain.py:domain_migrate_constraints_set
- fix xm migrate -N option name to match xl migrate
v3:
- move logic errors in libxl__domain_suspend and fixed help text in
cmd_table to separate patches
- fix syntax error in XendCheckpoint.py
- really pass max_iters and max_factor in libxl__xc_domain_save
- make libxl_domain_suspend_0x040200 declaration globally visible
- bump libxenlight.so SONAME from 2.0 to 2.1 due to changed
libxl_domain_suspend
v2:
- use LIBXL_API_VERSION and define libxl_domain_suspend_0x040200
- fix logic error in min_reached check in xc_domain_save
- add longopts
- update --help text
- correct description of migrate --help text
Signed-off-by: Olaf Hering <olaf(a)aepfle.de>
---
docs/man/xl.pod.1 | 20 +++++++++++++++++++
tools/libxc/include/xenguest.h | 7 ++++++
tools/libxc/xc_nomigrate.c | 10 +++++++++
tools/libxc/xc_sr_common.h | 1
tools/libxc/xc_sr_save.c | 22 +++++++++++++++------
tools/libxl/libxl.c | 29 ++++++++++++++++++++++++----
tools/libxl/libxl.h | 15 ++++++++++++++
tools/libxl/libxl_dom_save.c | 1
tools/libxl/libxl_internal.h | 4 +++
tools/libxl/libxl_save_callout.c | 4 ++-
tools/libxl/libxl_save_helper.c | 8 ++++---
tools/libxl/xl_cmdimpl.c | 40 +++++++++++++++++++++++++++++++++------
tools/libxl/xl_cmdtable.c | 23 ++++++++++++++--------
13 files changed, 156 insertions(+), 28 deletions(-)
Index: xen-4.13.0-testing/docs/man/xl.1.pod.in
===================================================================
--- xen-4.13.0-testing.orig/docs/man/xl.1.pod.in
+++ xen-4.13.0-testing/docs/man/xl.1.pod.in
@@ -490,6 +490,26 @@ Display huge (!) amount of debug informa
Leave the domain on the receive side paused after migration.
+=item B<--max_iters> I<number>
+
+Number of iterations before final suspend (default: 30)
+
+=item B<--max_factor> I<factor>
+
+Max amount of memory to transfer before final suspend (default: 3*RAM)
+
+=item B<--min_remaining>
+
+Number of remaining dirty pages. If the number of dirty pages drops that
+low the guest is suspended and the remaining pages are transferred to <host>.
+
+=item B<--abort_if_busy>
+
+Abort migration instead of doing final suspend/transfer/resume if the
+guest has still dirty pages after the number of iterations and/or the
+amount of RAM transferred. This avoids long periods of time where the
+guest is suspended.
+
=back
=item B<remus> [I<OPTIONS>] I<domain-id> I<host>
Index: xen-4.13.0-testing/tools/libxc/include/xenguest.h
===================================================================
--- xen-4.13.0-testing.orig/tools/libxc/include/xenguest.h
+++ xen-4.13.0-testing/tools/libxc/include/xenguest.h
@@ -29,6 +29,7 @@
#define XCFLAGS_HVM (1 << 2)
#define XCFLAGS_STDVGA (1 << 3)
#define XCFLAGS_CHECKPOINT_COMPRESS (1 << 4)
+#define XCFLAGS_DOMSAVE_ABORT_IF_BUSY (1 << 5)
#define X86_64_B_SIZE 64
#define X86_32_B_SIZE 32
@@ -131,10 +132,20 @@ typedef enum {
* doesn't use checkpointing
* @return 0 on success, -1 on failure
*/
+int xc_domain_save_suse(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
+ uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
+ uint32_t min_remaining,
+ struct save_callbacks* callbacks, int hvm,
+ xc_migration_stream_t stream_type, int recv_fd);
+static inline
int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
uint32_t flags /* XCFLAGS_xxx */,
struct save_callbacks* callbacks, int hvm,
- xc_migration_stream_t stream_type, int recv_fd);
+ xc_migration_stream_t stream_type, int recv_fd)
+{
+ return xc_domain_save_suse(xch,io_fd,dom,0,0,flags,0,callbacks,hvm,stream_type,recv_fd);
+}
+
/* callbacks provided by xc_domain_restore */
struct restore_callbacks {
Index: xen-4.13.0-testing/tools/libxc/xc_nomigrate.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxc/xc_nomigrate.c
+++ xen-4.13.0-testing/tools/libxc/xc_nomigrate.c
@@ -20,9 +20,11 @@
#include <xenctrl.h>
#include <xenguest.h>
-int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t flags,
- struct save_callbacks* callbacks, int hvm,
- xc_migration_stream_t stream_type, int recv_fd)
+int xc_domain_save_suse(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
+ uint32_t max_factor, uint32_t flags,
+ uint32_t min_remaining,
+ struct save_callbacks* callbacks, int hvm,
+ xc_migration_stream_t stream_type, int recv_fd)
{
errno = ENOSYS;
return -1;
Index: xen-4.13.0-testing/tools/libxc/xc_sr_save.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxc/xc_sr_save.c
+++ xen-4.13.0-testing/tools/libxc/xc_sr_save.c
@@ -525,6 +525,11 @@ static int send_memory_live(struct xc_sr
policy_decision = precopy_policy(*policy_stats, data);
x++;
+ if ( policy_decision == XGS_POLICY_ABORT )
+ {
+ rc = -1;
+ break;
+ }
if ( stats.dirty_count > 0 && policy_decision != XGS_POLICY_ABORT )
{
rc = update_progress_string(ctx, &progress_str);
@@ -545,6 +550,11 @@ static int send_memory_live(struct xc_sr
policy_decision = precopy_policy(*policy_stats, data);
+ if ( policy_decision == XGS_POLICY_ABORT )
+ {
+ rc = -1;
+ break;
+ }
if ( policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
break;
@@ -965,9 +975,71 @@ static int save(struct xc_sr_context *ct
return rc;
};
-int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
- uint32_t flags, struct save_callbacks* callbacks,
- int hvm, xc_migration_stream_t stream_type, int recv_fd)
+static struct suse_flags {
+ struct xc_sr_context *ctx;
+ unsigned long cnt;
+ uint32_t max_iters;
+ unsigned long max_factor;
+ long min_remaining;
+ long dirty_count;
+ uint32_t abort_if_busy;
+} suse_flags;
+
+static int suse_precopy_policy(struct precopy_stats stats, void *user)
+{
+ xc_interface *xch = suse_flags.ctx->xch;
+
+ suse_flags.cnt++;
+ errno = 0;
+ DBGPRINTF("%s: domU %u: #%lu iteration %u total_written %u dirty_count %ld",
+ __func__, suse_flags.ctx->domid, suse_flags.cnt, stats.iteration, stats.total_written, stats.dirty_count);
+
+ if ( stats.dirty_count >= 0 )
+ suse_flags.dirty_count = stats.dirty_count;
+
+ /* Stop loop after N iterations */
+ if ( stats.iteration > suse_flags.max_iters )
+ {
+ IPRINTF("%s: domU %u, too many iterations (%u/%u)",
+ __func__, suse_flags.ctx->domid, stats.iteration, suse_flags.max_iters);
+ goto out;
+ }
+ /* Suspend domU in case only few dirty pages remain */
+ if ( stats.dirty_count >= 0 && stats.dirty_count < suse_flags.min_remaining )
+ {
+ IPRINTF("%s: domU %u, dirty_count reached (%ld/%ld)",
+ __func__, suse_flags.ctx->domid, stats.dirty_count, suse_flags.min_remaining);
+ goto suspend;
+ }
+ /* Stop loop if too much memory was transferred (formula incorrect for ballooned domU) */
+ if ( stats.total_written > suse_flags.max_factor * suse_flags.ctx->save.p2m_size )
+ {
+ IPRINTF("%s: domU %u, too much memory transferred (%u/%lu)",
+ __func__, suse_flags.ctx->domid, stats.total_written, suse_flags.max_factor * suse_flags.ctx->save.p2m_size);
+ goto out;
+ }
+ /* Keep going */
+ return XGS_POLICY_CONTINUE_PRECOPY;
+
+out:
+ if ( suse_flags.abort_if_busy )
+ {
+ errno = EBUSY;
+ PERROR("%s: domU %u busy, dirty pages %ld/%lu after %u iterations, %u pages transferred",
+ __func__, suse_flags.ctx->domid,
+ suse_flags.dirty_count, suse_flags.ctx->save.p2m_size,
+ stats.iteration, stats.total_written);
+ return XGS_POLICY_ABORT;
+ }
+suspend:
+ return XGS_POLICY_STOP_AND_COPY;
+}
+
+int xc_domain_save_suse(xc_interface *xch, int io_fd, uint32_t dom,
+ uint32_t max_iters, uint32_t max_factor, uint32_t flags,
+ uint32_t min_remaining,
+ struct save_callbacks* callbacks, int hvm,
+ xc_migration_stream_t stream_type, int recv_fd)
{
struct xc_sr_context ctx =
{
@@ -982,6 +1054,19 @@ int xc_domain_save(xc_interface *xch, in
ctx.save.checkpointed = stream_type;
ctx.save.recv_fd = recv_fd;
+ if ( callbacks->precopy_policy )
+ {
+ errno = EBUSY;
+ PERROR("%s: precopy_policy already set (%p)", __func__, callbacks->precopy_policy);
+ return -1;
+ }
+ callbacks->precopy_policy = suse_precopy_policy;
+ suse_flags.ctx = &ctx;
+ suse_flags.max_iters = max_iters ? : 5;
+ suse_flags.max_factor = max_factor ? : 3;
+ suse_flags.min_remaining = min_remaining ? : 50;
+ suse_flags.abort_if_busy = !!(flags & XCFLAGS_DOMSAVE_ABORT_IF_BUSY);
+
/* If altering migration_stream update this assert too. */
assert(stream_type == XC_MIG_STREAM_NONE ||
stream_type == XC_MIG_STREAM_REMUS ||
Index: xen-4.13.0-testing/tools/libxl/libxl.h
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl.h
+++ xen-4.13.0-testing/tools/libxl/libxl.h
@@ -1647,8 +1647,23 @@ int libxl_domain_suspend(libxl_ctx *ctx,
int flags, /* LIBXL_SUSPEND_* */
const libxl_asyncop_how *ao_how)
LIBXL_EXTERNAL_CALLERS_ONLY;
+
+typedef struct {
+ int flags; /* LIBXL_SUSPEND_* */
+ int max_iters;
+ int max_factor;
+ int min_remaining;
+} libxl_domain_suspend_suse_properties;
+
+#define LIBXL_HAVE_DOMAIN_SUSPEND_SUSE
+int libxl_domain_suspend_suse(libxl_ctx *ctx, uint32_t domid, int fd,
+ const libxl_domain_suspend_suse_properties *props, /* optional */
+ const libxl_asyncop_how *ao_how)
+ LIBXL_EXTERNAL_CALLERS_ONLY;
+
#define LIBXL_SUSPEND_DEBUG 1
#define LIBXL_SUSPEND_LIVE 2
+#define LIBXL_SUSPEND_ABORT_IF_BUSY 4
/*
* Only suspend domain, do not save its state to file, do not destroy it.
Index: xen-4.13.0-testing/tools/libxl/libxl_dom_save.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_dom_save.c
+++ xen-4.13.0-testing/tools/libxl/libxl_dom_save.c
@@ -423,6 +423,7 @@ void libxl__domain_save(libxl__egc *egc,
dss->xcflags = (live ? XCFLAGS_LIVE : 0)
| (debug ? XCFLAGS_DEBUG : 0)
+ | (dss->xlflags & LIBXL_SUSPEND_ABORT_IF_BUSY ? XCFLAGS_DOMSAVE_ABORT_IF_BUSY : 0)
| (dss->hvm ? XCFLAGS_HVM : 0);
/* Disallow saving a guest with vNUMA configured because migration
Index: xen-4.13.0-testing/tools/libxl/libxl_domain.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_domain.c
+++ xen-4.13.0-testing/tools/libxl/libxl_domain.c
@@ -503,8 +503,9 @@ static void domain_suspend_cb(libxl__egc
}
-int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
- const libxl_asyncop_how *ao_how)
+static int do_libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd,
+ const libxl_domain_suspend_suse_properties *props,
+ const libxl_asyncop_how *ao_how)
{
AO_CREATE(ctx, domid, ao_how);
int rc;
@@ -524,9 +525,15 @@ int libxl_domain_suspend(libxl_ctx *ctx,
dss->domid = domid;
dss->fd = fd;
dss->type = type;
- dss->live = flags & LIBXL_SUSPEND_LIVE;
- dss->debug = flags & LIBXL_SUSPEND_DEBUG;
dss->checkpointed_stream = LIBXL_CHECKPOINTED_STREAM_NONE;
+ if (props) {
+ dss->live = props->flags & LIBXL_SUSPEND_LIVE;
+ dss->debug = props->flags & LIBXL_SUSPEND_DEBUG;
+ dss->max_iters = props->max_iters;
+ dss->max_factor = props->max_factor;
+ dss->min_remaining = props->min_remaining;
+ dss->xlflags = props->flags;
+ }
rc = libxl__fd_flags_modify_save(gc, dss->fd,
~(O_NONBLOCK|O_NDELAY), 0,
@@ -574,6 +581,20 @@ int libxl_domain_suspend_only(libxl_ctx
return AO_CREATE_FAIL(rc);
}
+int libxl_domain_suspend_suse(libxl_ctx *ctx, uint32_t domid, int fd,
+ const libxl_domain_suspend_suse_properties *props,
+ const libxl_asyncop_how *ao_how)
+{
+ return do_libxl_domain_suspend(ctx, domid, fd, props, ao_how);
+}
+
+int libxl_domain_suspend(libxl_ctx *ctx, uint32_t domid, int fd, int flags,
+ const libxl_asyncop_how *ao_how)
+{
+ libxl_domain_suspend_suse_properties props = { .flags = flags };
+ return do_libxl_domain_suspend(ctx, domid, fd, &props, ao_how);
+}
+
int libxl_domain_pause(libxl_ctx *ctx, uint32_t domid,
const libxl_asyncop_how *ao_how)
{
Index: xen-4.13.0-testing/tools/libxl/libxl_internal.h
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_internal.h
+++ xen-4.13.0-testing/tools/libxl/libxl_internal.h
@@ -3596,6 +3596,10 @@ struct libxl__domain_save_state {
/* private */
int rc;
int hvm;
+ int max_iters;
+ int max_factor;
+ int min_remaining;
+ int xlflags;
int xcflags;
libxl__domain_suspend_state dsps;
union {
Index: xen-4.13.0-testing/tools/libxl/libxl_save_callout.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_save_callout.c
+++ xen-4.13.0-testing/tools/libxl/libxl_save_callout.c
@@ -89,7 +89,9 @@ void libxl__xc_domain_save(libxl__egc *e
libxl__srm_callout_enumcallbacks_save(&shs->callbacks.save.a);
const unsigned long argnums[] = {
- dss->domid, dss->xcflags, dss->hvm, cbflags,
+ dss->domid,
+ dss->max_iters, dss->max_factor, dss->min_remaining,
+ dss->xcflags, dss->hvm, cbflags,
dss->checkpointed_stream,
};
Index: xen-4.13.0-testing/tools/libxl/libxl_save_helper.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_save_helper.c
+++ xen-4.13.0-testing/tools/libxl/libxl_save_helper.c
@@ -251,6 +251,9 @@ int main(int argc, char **argv)
io_fd = atoi(NEXTARG);
recv_fd = atoi(NEXTARG);
uint32_t dom = strtoul(NEXTARG,0,10);
+ uint32_t max_iters = strtoul(NEXTARG,0,10);
+ uint32_t max_factor = strtoul(NEXTARG,0,10);
+ uint32_t min_remaining = strtoul(NEXTARG,0,10);
uint32_t flags = strtoul(NEXTARG,0,10);
int hvm = atoi(NEXTARG);
unsigned cbflags = strtoul(NEXTARG,0,10);
@@ -262,8 +265,10 @@ int main(int argc, char **argv)
startup("save");
setup_signals(save_signal_handler);
- r = xc_domain_save(xch, io_fd, dom, flags, &helper_save_callbacks,
- hvm, stream_type, recv_fd);
+ r = xc_domain_save_suse(xch, io_fd, dom, max_iters, max_factor, flags,
+ min_remaining,
+ &helper_save_callbacks, hvm, stream_type,
+ recv_fd);
complete(r);
} else if (!strcmp(mode,"--restore-domain")) {
Index: xen-4.13.0-testing/tools/xl/xl_cmdtable.c
===================================================================
--- xen-4.13.0-testing.orig/tools/xl/xl_cmdtable.c
+++ xen-4.13.0-testing/tools/xl/xl_cmdtable.c
@@ -159,15 +159,22 @@ struct cmd_spec cmd_table[] = {
&main_migrate, 0, 1,
"Migrate a domain to another host",
"[options] <Domain> <host>",
- "-h Print this help.\n"
- "-C <config> Send <config> instead of config file from creation.\n"
- "-s <sshcommand> Use <sshcommand> instead of ssh. String will be passed\n"
- " to sh. If empty, run <host> instead of ssh <host> xl\n"
- " migrate-receive [-d -e]\n"
- "-e Do not wait in the background (on <host>) for the death\n"
- " of the domain.\n"
- "--debug Print huge (!) amount of debug during the migration process.\n"
- "-p Do not unpause domain after migrating it."
+ "-h Print this help.\n"
+ "-C <config> Send <config> instead of config file from creation.\n"
+ "-s <sshcommand> Use <sshcommand> instead of ssh. String will be passed\n"
+ " to sh. If empty, run <host> instead of ssh <host> xl\n"
+ " migrate-receive [-d -e]\n"
+ "-e Do not wait in the background (on <host>) for the death\n"
+ " of the domain.\n"
+ "--debug Print huge (!) amount of debug during the migration process.\n"
+ "-p Do not unpause domain after migrating it.\n"
+ "\n"
+ "SUSE Linux specific options:\n"
+ "--max_iters <number> Number of iterations before final suspend (default: 30)\n"
+ "--max_factor <factor> Max amount of memory to transfer before final suspend (default: 3*RAM).\n"
+ "--min_remaining <pages> Number of remaining dirty pages before final suspend (default: 50).\n"
+ "--abort_if_busy Abort migration instead of doing final suspend, if number\n"
+ " of iterations or amount of transferred memory is exceeded."
},
{ "restore",
&main_restore, 0, 1,
Index: xen-4.13.0-testing/tools/xl/xl_migrate.c
===================================================================
--- xen-4.13.0-testing.orig/tools/xl/xl_migrate.c
+++ xen-4.13.0-testing/tools/xl/xl_migrate.c
@@ -177,6 +177,8 @@ static void migrate_do_preamble(int send
}
static void migrate_domain(uint32_t domid, const char *rune, int debug,
+ int max_iters, int max_factor,
+ int min_remaining, int abort_if_busy,
const char *override_config_file)
{
pid_t child = -1;
@@ -185,7 +187,13 @@ static void migrate_domain(uint32_t domi
char *away_domname;
char rc_buf;
uint8_t *config_data;
- int config_len, flags = LIBXL_SUSPEND_LIVE;
+ int config_len;
+ libxl_domain_suspend_suse_properties props = {
+ .flags = LIBXL_SUSPEND_LIVE,
+ .max_iters = max_iters,
+ .max_factor = max_factor,
+ .min_remaining = min_remaining,
+ };
save_domain_core_begin(domid, override_config_file,
&config_data, &config_len);
@@ -204,10 +212,12 @@ static void migrate_domain(uint32_t domi
xtl_stdiostream_adjust_flags(logger, XTL_STDIOSTREAM_HIDE_PROGRESS, 0);
if (debug)
- flags |= LIBXL_SUSPEND_DEBUG;
- rc = libxl_domain_suspend(ctx, domid, send_fd, flags, NULL);
+ props.flags |= LIBXL_SUSPEND_DEBUG;
+ if (abort_if_busy)
+ props.flags |= LIBXL_SUSPEND_ABORT_IF_BUSY;
+ rc = libxl_domain_suspend_suse(ctx, domid, send_fd, &props, NULL);
if (rc) {
- fprintf(stderr, "migration sender: libxl_domain_suspend failed"
+ fprintf(stderr, "migration sender: libxl_domain_suspend_suse failed"
" (rc=%d)\n", rc);
if (rc == ERROR_GUEST_TIMEDOUT)
goto failed_suspend;
@@ -537,13 +547,18 @@ int main_migrate(int argc, char **argv)
char *rune = NULL;
char *host;
int opt, daemonize = 1, monitor = 1, debug = 0, pause_after_migration = 0;
+ int max_iters = 0, max_factor = 0, min_remaining = 0, abort_if_busy = 0;
static struct option opts[] = {
{"debug", 0, 0, 0x100},
+ {"max_iters", 1, 0, 0x101},
+ {"max_factor", 1, 0, 0x102},
+ {"min_remaining", 1, 0, 0x103},
+ {"abort_if_busy", 0, 0, 0x104},
{"live", 0, 0, 0x200},
COMMON_LONG_OPTS
};
- SWITCH_FOREACH_OPT(opt, "FC:s:ep", opts, "migrate", 2) {
+ SWITCH_FOREACH_OPT(opt, "FC:s:epM:m:A", opts, "migrate", 2) {
case 'C':
config_filename = optarg;
break;
@@ -563,6 +578,18 @@ int main_migrate(int argc, char **argv)
case 0x100: /* --debug */
debug = 1;
break;
+ case 0x101:
+ max_iters = atoi(optarg);
+ break;
+ case 0x102:
+ max_factor = atoi(optarg);
+ break;
+ case 0x103:
+ min_remaining = atoi(optarg);
+ break;
+ case 0x104:
+ abort_if_busy = 1;
+ break;
case 0x200: /* --live */
/* ignored for compatibility with xm */
break;
@@ -596,7 +623,8 @@ int main_migrate(int argc, char **argv)
pause_after_migration ? " -p" : "");
}
- migrate_domain(domid, rune, debug, config_filename);
+ migrate_domain(domid, rune, debug, max_iters, max_factor, min_remaining,
+ abort_if_busy, config_filename);
return EXIT_SUCCESS;
}
++++++ logrotate.conf ++++++
compress
missingok
notifempty
/var/log/xen/xen-hotplug.log {
rotate 2
size 100k
copytruncate
}
/var/log/xen/xl-*.log /var/log/xen/qemu-dm-*.log /var/log/xen/console/*.log {
rotate 4
dateext
dateformat -%Y%m%d-%H%M
size 2M
copytruncate
}
++++++ migration-python3-conversion.patch ++++++
Index: xen-4.10.0-testing/tools/python/xen/migration/legacy.py
===================================================================
--- xen-4.10.0-testing.orig/tools/python/xen/migration/legacy.py
+++ xen-4.10.0-testing/tools/python/xen/migration/legacy.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
Index: xen-4.10.0-testing/tools/python/xen/migration/libxc.py
===================================================================
--- xen-4.10.0-testing.orig/tools/python/xen/migration/libxc.py
+++ xen-4.10.0-testing/tools/python/xen/migration/libxc.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
@@ -87,23 +87,23 @@ rec_type_to_str = {
# page_data
PAGE_DATA_FORMAT = "II"
-PAGE_DATA_PFN_MASK = (long(1) << 52) - 1
-PAGE_DATA_PFN_RESZ_MASK = ((long(1) << 60) - 1) & ~((long(1) << 52) - 1)
+PAGE_DATA_PFN_MASK = (int(1) << 52) - 1
+PAGE_DATA_PFN_RESZ_MASK = ((int(1) << 60) - 1) & ~((int(1) << 52) - 1)
# flags from xen/public/domctl.h: XEN_DOMCTL_PFINFO_* shifted by 32 bits
PAGE_DATA_TYPE_SHIFT = 60
-PAGE_DATA_TYPE_LTABTYPE_MASK = (long(0x7) << PAGE_DATA_TYPE_SHIFT)
-PAGE_DATA_TYPE_LTAB_MASK = (long(0xf) << PAGE_DATA_TYPE_SHIFT)
-PAGE_DATA_TYPE_LPINTAB = (long(0x8) << PAGE_DATA_TYPE_SHIFT) # Pinned pagetable
-
-PAGE_DATA_TYPE_NOTAB = (long(0x0) << PAGE_DATA_TYPE_SHIFT) # Regular page
-PAGE_DATA_TYPE_L1TAB = (long(0x1) << PAGE_DATA_TYPE_SHIFT) # L1 pagetable
-PAGE_DATA_TYPE_L2TAB = (long(0x2) << PAGE_DATA_TYPE_SHIFT) # L2 pagetable
-PAGE_DATA_TYPE_L3TAB = (long(0x3) << PAGE_DATA_TYPE_SHIFT) # L3 pagetable
-PAGE_DATA_TYPE_L4TAB = (long(0x4) << PAGE_DATA_TYPE_SHIFT) # L4 pagetable
-PAGE_DATA_TYPE_BROKEN = (long(0xd) << PAGE_DATA_TYPE_SHIFT) # Broken
-PAGE_DATA_TYPE_XALLOC = (long(0xe) << PAGE_DATA_TYPE_SHIFT) # Allocate-only
-PAGE_DATA_TYPE_XTAB = (long(0xf) << PAGE_DATA_TYPE_SHIFT) # Invalid
+PAGE_DATA_TYPE_LTABTYPE_MASK = (int(0x7) << PAGE_DATA_TYPE_SHIFT)
+PAGE_DATA_TYPE_LTAB_MASK = (int(0xf) << PAGE_DATA_TYPE_SHIFT)
+PAGE_DATA_TYPE_LPINTAB = (int(0x8) << PAGE_DATA_TYPE_SHIFT) # Pinned pagetable
+
+PAGE_DATA_TYPE_NOTAB = (int(0x0) << PAGE_DATA_TYPE_SHIFT) # Regular page
+PAGE_DATA_TYPE_L1TAB = (int(0x1) << PAGE_DATA_TYPE_SHIFT) # L1 pagetable
+PAGE_DATA_TYPE_L2TAB = (int(0x2) << PAGE_DATA_TYPE_SHIFT) # L2 pagetable
+PAGE_DATA_TYPE_L3TAB = (int(0x3) << PAGE_DATA_TYPE_SHIFT) # L3 pagetable
+PAGE_DATA_TYPE_L4TAB = (int(0x4) << PAGE_DATA_TYPE_SHIFT) # L4 pagetable
+PAGE_DATA_TYPE_BROKEN = (int(0xd) << PAGE_DATA_TYPE_SHIFT) # Broken
+PAGE_DATA_TYPE_XALLOC = (int(0xe) << PAGE_DATA_TYPE_SHIFT) # Allocate-only
+PAGE_DATA_TYPE_XTAB = (int(0xf) << PAGE_DATA_TYPE_SHIFT) # Invalid
# x86_pv_info
X86_PV_INFO_FORMAT = "BBHI"
Index: xen-4.10.0-testing/tools/python/xen/migration/libxl.py
===================================================================
--- xen-4.10.0-testing.orig/tools/python/xen/migration/libxl.py
+++ xen-4.10.0-testing/tools/python/xen/migration/libxl.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
Index: xen-4.10.0-testing/tools/python/xen/migration/public.py
===================================================================
--- xen-4.10.0-testing.orig/tools/python/xen/migration/public.py
+++ xen-4.10.0-testing/tools/python/xen/migration/public.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
Index: xen-4.10.0-testing/tools/python/xen/migration/tests.py
===================================================================
--- xen-4.10.0-testing.orig/tools/python/xen/migration/tests.py
+++ xen-4.10.0-testing/tools/python/xen/migration/tests.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
Index: xen-4.10.0-testing/tools/python/xen/migration/verify.py
===================================================================
--- xen-4.10.0-testing.orig/tools/python/xen/migration/verify.py
+++ xen-4.10.0-testing/tools/python/xen/migration/verify.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
@@ -7,11 +7,11 @@ Common verification infrastructure for v
from struct import calcsize, unpack
-class StreamError(StandardError):
+class StreamError(Exception):
"""Error with the stream"""
pass
-class RecordError(StandardError):
+class RecordError(Exception):
"""Error with a record in the stream"""
pass
Index: xen-4.10.0-testing/tools/python/xen/migration/xl.py
===================================================================
--- xen-4.10.0-testing.orig/tools/python/xen/migration/xl.py
+++ xen-4.10.0-testing/tools/python/xen/migration/xl.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
++++++ pygrub-boot-legacy-sles.patch ++++++
Index: xen-4.13.0-testing/tools/pygrub/src/pygrub
===================================================================
--- xen-4.13.0-testing.orig/tools/pygrub/src/pygrub
+++ xen-4.13.0-testing/tools/pygrub/src/pygrub
@@ -453,7 +453,7 @@ class Grub:
self.cf.filename = f
break
if self.__dict__.get('cf', None) is None:
- raise RuntimeError("couldn't find bootloader config file in the image provided.")
+ return
f = fs.open_file(self.cf.filename)
# limit read size to avoid pathological cases
buf = f.read(FS_READ_MAX)
@@ -628,6 +628,20 @@ def run_grub(file, entry, fs, cfg_args):
g = Grub(file, fs)
+ # If missing config or grub has no menu entries to select, look for
+ # vmlinuz-xen and initrd-xen in /boot
+ if g.__dict__.get('cf', None) is None or len(g.cf.images) == 0 or re.search(r"xen(-pae)?\.gz",g.cf.images[0].kernel[1]):
+ if not list_entries:
+ chosencfg = { "kernel": None, "ramdisk": None, "args": "" }
+ chosencfg = sniff_xen_kernel(fs, incfg)
+ if chosencfg["kernel"] and chosencfg["ramdisk"]:
+ chosencfg["args"] = cfg_args
+ return chosencfg
+ if g.__dict__.get('cf', None) is None:
+ raise RuntimeError("couldn't find bootloader config file in the image provided.")
+ else:
+ return
+
if list_entries:
for i in range(len(g.cf.images)):
img = g.cf.images[i]
@@ -723,6 +737,19 @@ def sniff_netware(fs, cfg):
return cfg
+def sniff_xen_kernel(fs, cfg):
+ if not cfg["kernel"]:
+ if fs.file_exists('/boot/vmlinuz-xen'):
+ cfg["kernel"] = '/boot/vmlinuz-xen'
+ elif fs.file_exists('/boot/vmlinuz-xenpae'):
+ cfg["kernel"] = '/boot/vmlinuz-xenpae'
+ if cfg["kernel"] and not cfg["ramdisk"]:
+ if fs.file_exists('/boot/initrd-xen'):
+ cfg["ramdisk"] = '/boot/initrd-xen'
+ elif fs.file_exists('/boot/initrd-xenpae'):
+ cfg["ramdisk"] = '/boot/initrd-xenpae'
+ return cfg
+
def format_sxp(kernel, ramdisk, args):
s = "linux (kernel %s)" % repr(kernel)
if ramdisk:
@@ -806,7 +833,7 @@ if __name__ == "__main__":
debug = False
not_really = False
output_format = "sxp"
- output_directory = "/var/run/xen/pygrub"
+ output_directory = "/var/run/xen"
# what was passed in
incfg = { "kernel": None, "ramdisk": None, "args": "" }
++++++ pygrub-handle-one-line-menu-entries.patch ++++++
References: bsc#978413
The parsing code can't handle a single line menu entry.
For example: menuentry 'halt' { halt }
Force it to fall through where it will handle the closing brace.
Also change warning to debug to cut down on verbose output.
Index: xen-4.13.0-testing/tools/pygrub/src/GrubConf.py
===================================================================
--- xen-4.13.0-testing.orig/tools/pygrub/src/GrubConf.py
+++ xen-4.13.0-testing/tools/pygrub/src/GrubConf.py
@@ -150,7 +150,7 @@ class GrubImage(_GrubImage):
else:
logging.info("Ignored image directive %s" %(com,))
else:
- logging.warning("Unknown image directive %s" %(com,))
+ logging.debug("Unknown image directive %s" %(com,))
# now put the line in the list of lines
if replace is None:
@@ -309,7 +309,7 @@ class GrubConfigFile(_GrubConfigFile):
else:
logging.info("Ignored directive %s" %(com,))
else:
- logging.warning("Unknown directive %s" %(com,))
+ logging.debug("Unknown directive %s" %(com,))
if img:
self.add_image(GrubImage(title, img))
@@ -343,7 +343,7 @@ class Grub2Image(_GrubImage):
elif com.startswith('set:'):
pass
else:
- logging.warning("Unknown image directive %s" %(com,))
+ logging.debug("Unknown image directive %s" %(com,))
# now put the line in the list of lines
if replace is None:
@@ -408,7 +408,10 @@ class Grub2ConfigFile(_GrubConfigFile):
raise RuntimeError("syntax error: cannot nest menuentry (%d %s)" % (len(img),img))
img = []
title = title_match.group(1)
- continue
+ if not l.endswith('}'):
+ continue
+ # One line menuentry, Ex. menuentry 'halt' { halt }
+ l = '}'
if l.startswith("submenu"):
menu_level += 1
@@ -447,7 +450,7 @@ class Grub2ConfigFile(_GrubConfigFile):
elif com.startswith('set:'):
pass
else:
- logging.warning("Unknown directive %s" %(com,))
+ logging.debug("Unknown directive %s" %(com,))
if img is not None:
raise RuntimeError("syntax error: end of file with open menuentry(%d %s)" % (len(img),img))
++++++ pygrub-netware-xnloader.patch ++++++
Index: xen-4.13.0-testing/tools/pygrub/src/pygrub
===================================================================
--- xen-4.13.0-testing.orig/tools/pygrub/src/pygrub
+++ xen-4.13.0-testing/tools/pygrub/src/pygrub
@@ -27,6 +27,7 @@ import xenfsimage
import grub.GrubConf
import grub.LiloConf
import grub.ExtLinuxConf
+import xnloader
PYGRUB_VER = 0.6
FS_READ_MAX = 1024 * 1024
@@ -768,6 +769,8 @@ if __name__ == "__main__":
if len(data) == 0:
os.close(tfd)
del datafile
+ if file_to_read == "/nwserver/xnloader.sys":
+ xnloader.patch_netware_loader(ret)
return ret
try:
os.write(tfd, data)
++++++ replace-obsolete-network-configuration-commands-in-s.patch ++++++
From 5e1e18fde92bae1ae87f78d470e80b1ffc9350d1 Mon Sep 17 00:00:00 2001
From: Michal Kubecek <mkubecek(a)suse.cz>
Date: Wed, 26 Jul 2017 10:28:54 +0200
Subject: [PATCH] replace obsolete network configuration commands in scripts
Some scripts still use obsolete network configuration commands ifconfig and
brctl. Replace them by commands from iproute2 package.
---
README | 3 +--
tools/hotplug/Linux/colo-proxy-setup | 14 ++++++--------
tools/hotplug/Linux/remus-netbuf-setup | 3 ++-
tools/hotplug/Linux/vif-bridge | 7 ++++---
tools/hotplug/Linux/vif-nat | 2 +-
tools/hotplug/Linux/vif-route | 6 ++++--
tools/hotplug/Linux/vif2 | 6 +++---
tools/hotplug/Linux/xen-network-common.sh | 6 ++----
.../i386-dm/qemu-ifup-Linux | 5 +++--
9 files changed, 26 insertions(+), 26 deletions(-)
Index: xen-4.13.0-testing/README
===================================================================
--- xen-4.13.0-testing.orig/README
+++ xen-4.13.0-testing/README
@@ -57,8 +57,7 @@ provided by your OS distributor:
* Development install of GLib v2.0 (e.g. libglib2.0-dev)
* Development install of Pixman (e.g. libpixman-1-dev)
* pkg-config
- * bridge-utils package (/sbin/brctl)
- * iproute package (/sbin/ip)
+ * iproute package (/sbin/ip, /sbin/bridge)
* GNU bison and GNU flex
* GNU gettext
* ACPI ASL compiler (iasl)
Index: xen-4.13.0-testing/tools/hotplug/Linux/colo-proxy-setup
===================================================================
--- xen-4.13.0-testing.orig/tools/hotplug/Linux/colo-proxy-setup
+++ xen-4.13.0-testing/tools/hotplug/Linux/colo-proxy-setup
@@ -76,10 +76,16 @@ function teardown_primary()
function setup_secondary()
{
- do_without_error brctl delif $bridge $vifname
- do_without_error brctl addbr $forwardbr
- do_without_error brctl addif $forwardbr $vifname
- do_without_error brctl addif $forwardbr $forwarddev
+ if [ "$legacy_tools" ]; then
+ do_without_error brctl delif $bridge $vifname
+ do_without_error brctl addbr $forwardbr
+ do_without_error brctl addif $forwardbr $vifname
+ do_without_error brctl addif $forwardbr $forwarddev
+ else
+ do_without_error ip link add "$forwardbr" type bridge
+ do_without_error ip link set "$vifname" master "$forwardbr"
+ do_without_error ip link set "$forwarddev" master "$forwardbr"
+ fi
do_without_error ip link set dev $forwardbr up
do_without_error modprobe xt_SECCOLO
@@ -91,10 +97,16 @@ function setup_secondary()
function teardown_secondary()
{
- do_without_error brctl delif $forwardbr $forwarddev
- do_without_error brctl delif $forwardbr $vifname
- do_without_error brctl delbr $forwardbr
- do_without_error brctl addif $bridge $vifname
+ if [ "$legacy_tools" ]; then
+ do_without_error brctl delif $forwardbr $forwarddev
+ do_without_error brctl delif $forwardbr $vifname
+ do_without_error brctl delbr $forwardbr
+ do_without_error brctl addif $bridge $vifname
+ else
+ do_without_error ip link set "$forwarddev" nomaster
+ do_without_error ip link set "$vifname" master "$bridge"
+ do_without_error ip link del "$forwardbr"
+ fi
do_without_error iptables -t mangle -D PREROUTING -m physdev --physdev-in \
$vifname -j SECCOLO --index $index
Index: xen-4.13.0-testing/tools/hotplug/Linux/remus-netbuf-setup
===================================================================
--- xen-4.13.0-testing.orig/tools/hotplug/Linux/remus-netbuf-setup
+++ xen-4.13.0-testing/tools/hotplug/Linux/remus-netbuf-setup
@@ -76,6 +76,7 @@
#specific setup code such as renaming.
dir=$(dirname "$0")
. "$dir/xen-hotplug-common.sh"
+. "$dir/xen-network-common.sh"
findCommand "$@"
@@ -139,8 +140,16 @@ check_ifb() {
setup_ifb() {
- for ifb in `ifconfig -a -s|egrep ^ifb|cut -d ' ' -f1`
+ if [ "$legacy_tools" ]; then
+ ifbs=`ifconfig -a -s|egrep ^ifb|cut -d ' ' -f1`
+ else
+ ifbs=$(ip --oneline link show type ifb | cut -d ' ' -f2)
+ fi
+ for ifb in $ifbs
do
+ if [ ! "$legacy_tools" ]; then
+ ifb="${ifb%:}"
+ fi
check_ifb "$ifb" || continue
REMUS_IFB="$ifb"
break
Index: xen-4.13.0-testing/tools/hotplug/Linux/vif-bridge
===================================================================
--- xen-4.13.0-testing.orig/tools/hotplug/Linux/vif-bridge
+++ xen-4.13.0-testing/tools/hotplug/Linux/vif-bridge
@@ -40,7 +40,12 @@ bridge=$(xenstore_read_default "$XENBUS_
if [ -z "$bridge" ]
then
- bridge=$(brctl show | awk 'NR==2{print$1}')
+ if [ "$legacy_tools" ]; then
+ bridge=$(brctl show | awk 'NR==2{print$1}')
+ else
+ bridge=$(ip --oneline link show type bridge | awk '(NR == 1) { print $2; }')
+ bridge="${bridge%:}"
+ fi
if [ -z "$bridge" ]
then
@@ -89,8 +94,13 @@ case "$command" in
;;
offline)
- do_without_error brctl delif "$bridge" "$dev"
- do_without_error ifconfig "$dev" down
+ if [ "$legacy_tools" ]; then
+ do_without_error brctl delif "$bridge" "$dev"
+ do_without_error ifconfig "$dev" down
+ else
+ do_without_error ip link set "$dev" nomaster
+ do_without_error ip link set "$dev" down
+ fi
;;
add)
Index: xen-4.13.0-testing/tools/hotplug/Linux/vif-nat
===================================================================
--- xen-4.13.0-testing.orig/tools/hotplug/Linux/vif-nat
+++ xen-4.13.0-testing/tools/hotplug/Linux/vif-nat
@@ -174,7 +174,11 @@ case "$command" in
;;
offline)
[ "$dhcp" != 'no' ] && dhcp_down
- do_without_error ifconfig "${dev}" down
+ if [ "$legacy_tools" ]; then
+ do_without_error ifconfig "${dev}" down
+ else
+ do_without_error ip link set "${dev}" down
+ fi
;;
esac
Index: xen-4.13.0-testing/tools/hotplug/Linux/vif-route
===================================================================
--- xen-4.13.0-testing.orig/tools/hotplug/Linux/vif-route
+++ xen-4.13.0-testing/tools/hotplug/Linux/vif-route
@@ -25,7 +25,12 @@ case "${command}" in
add)
;&
online)
- ifconfig ${dev} ${main_ip} netmask 255.255.255.255 up
+ if [ "$legacy_tools" ]; then
+ ifconfig ${dev} ${main_ip} netmask 255.255.255.255 up
+ else
+ ip addr add "${main_ip}/32" dev "$dev"
+ fi
+ ip link set "$dev" up
echo 1 >/proc/sys/net/ipv4/conf/${dev}/proxy_arp
ipcmd='add'
cmdprefix=''
@@ -33,7 +38,12 @@ case "${command}" in
remove)
;&
offline)
- do_without_error ifdown ${dev}
+ if [ "$legacy_tools" ]; then
+ do_without_error ifdown ${dev}
+ else
+ do_without_error ip addr flush dev "$dev"
+ do_without_error ip link set "$dev" down
+ fi
ipcmd='del'
cmdprefix='do_without_error'
;;
Index: xen-4.13.0-testing/tools/hotplug/Linux/vif2
===================================================================
--- xen-4.13.0-testing.orig/tools/hotplug/Linux/vif2
+++ xen-4.13.0-testing/tools/hotplug/Linux/vif2
@@ -7,13 +7,22 @@ dir=$(dirname "$0")
bridge=$(xenstore_read_default "$XENBUS_PATH/bridge" "$bridge")
if [ -z "$bridge" ]
then
- nr_bridges=$(($(brctl show | cut -f 1 | grep -v "^$" | wc -l) - 1))
+ if [ "$legacy_tools" ]; then
+ nr_bridges=$(($(brctl show | cut -f 1 | grep -v "^$" | wc -l) - 1))
+ else
+ nr_bridges=$(ip --oneline link show type bridge | wc -l)
+ fi
if [ "$nr_bridges" != 1 ]
then
fatal "no bridge specified, and don't know which one to use ($nr_bridges found)"
fi
- bridge=$(brctl show | cut -d "
+ if [ "$legacy_tools" ]; then
+ bridge=$(brctl show | cut -d "
" -f 2 | cut -f 1)
+ else
+ bridge=$(ip --oneline link show type bridge | head -1 | cut -d ' ' -f2)
+ bridge="${bridge%:}"
+ fi
fi
command="$1"
Index: xen-4.13.0-testing/tools/hotplug/Linux/xen-network-common.sh
===================================================================
--- xen-4.13.0-testing.orig/tools/hotplug/Linux/xen-network-common.sh
+++ xen-4.13.0-testing/tools/hotplug/Linux/xen-network-common.sh
@@ -15,6 +15,12 @@
#
+# Use brctl and ifconfig on older systems
+legacy_tools=
+if [ -f /sbin/brctl -a -f /sbin/ifconfig ]; then
+ legacy_tools="true"
+fi
+
# Gentoo doesn't have ifup/ifdown, so we define appropriate alternatives.
# Other platforms just use ifup / ifdown directly.
@@ -111,9 +117,13 @@ create_bridge () {
# Don't create the bridge if it already exists.
if [ ! -e "/sys/class/net/${bridge}/bridge" ]; then
- brctl addbr ${bridge}
- brctl stp ${bridge} off
- brctl setfd ${bridge} 0
+ if [ "$legacy_tools" ]; then
+ brctl addbr ${bridge}
+ brctl stp ${bridge} off
+ brctl setfd ${bridge} 0
+ else
+ ip link add "$bridge" type bridge stp_state 0 forward_delay 0
+ fi
fi
}
@@ -127,7 +137,11 @@ add_to_bridge () {
ip link set dev ${dev} up || true
return
fi
- brctl addif ${bridge} ${dev}
+ if [ "$legacy_tools" ]; then
+ brctl addif ${bridge} ${dev}
+ else
+ ip link set "$dev" master "$bridge"
+ fi
ip link set dev ${dev} up
}
++++++ reproducible.patch ++++++
commit e4c8f21e198e739e279b274c17e9246ea9a6d8e5
Author: Bernhard M. Wiedemann <bwiedemann(a)suse.de>
Date: Wed Oct 24 09:50:26 2018 +0200
x86/efi: Do not insert timestamps in efi files
in order to make builds reproducible.
See https://reproducible-builds.org/ for why this is good.
We only add the option if ld understands it.
Signed-off-by: Bernhard M. Wiedemann <bwiedemann(a)suse.de>
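The probe works because an ld that does not understand the flag echoes the option name back in its error output. A sketch of that logic with fake linkers standing in for old and new binutils (function names are made up; the real test lives in Config.mk):

```shell
# Mirrors the Config.mk probe: grep finding the flag name in ld's output
# means ld rejected it ("n"); silence (or unrelated errors) means "y".
probe_timestamp_flag() {
    "$1" -mi386pep --no-insert-timestamp 2>&1 | \
        grep -q no-insert-timestamp && echo n || echo y
}

# Stand-ins for real linkers:
old_ld() { echo "old_ld: unrecognized option '--no-insert-timestamp'"; }
new_ld() { :; }

probe_timestamp_flag old_ld   # prints n
probe_timestamp_flag new_ld   # prints y
```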
Index: xen-4.13.0-testing/Config.mk
===================================================================
--- xen-4.13.0-testing.orig/Config.mk
+++ xen-4.13.0-testing/Config.mk
@@ -151,6 +151,14 @@ export XEN_HAS_BUILD_ID=y
build_id_linker := --build-id=sha1
endif
+ld-ver-timestamp = $(shell $(1) -mi386pep --no-insert-timestamp 2>&1 | \
+ grep -q no-insert-timestamp && echo n || echo y)
+ifeq ($(call ld-ver-timestamp,$(LD)),n)
+ld_no_insert_timestamp :=
+else
+ld_no_insert_timestamp := --no-insert-timestamp
+endif
+
ifndef XEN_HAS_CHECKPOLICY
CHECKPOLICY ?= checkpolicy
XEN_HAS_CHECKPOLICY := $(shell $(CHECKPOLICY) -h 2>&1 | grep -q xen && echo y || echo n)
Index: xen-4.13.0-testing/xen/arch/x86/Makefile
===================================================================
--- xen-4.13.0-testing.orig/xen/arch/x86/Makefile
+++ xen-4.13.0-testing/xen/arch/x86/Makefile
@@ -164,6 +164,7 @@ note.o: $(TARGET)-syms
EFI_LDFLAGS = $(patsubst -m%,-mi386pep,$(LDFLAGS)) --subsystem=10
EFI_LDFLAGS += --image-base=$(1) --stack=0,0 --heap=0,0 --strip-debug
+EFI_LDFLAGS += $(ld_no_insert_timestamp)
EFI_LDFLAGS += --section-alignment=0x200000 --file-alignment=0x20
EFI_LDFLAGS += --major-image-version=$(XEN_VERSION)
EFI_LDFLAGS += --minor-image-version=$(XEN_SUBVERSION)
++++++ stdvga-cache.patch ++++++
Index: xen-4.9.0-testing/xen/arch/x86/hvm/stdvga.c
===================================================================
--- xen-4.9.0-testing.orig/xen/arch/x86/hvm/stdvga.c
+++ xen-4.9.0-testing/xen/arch/x86/hvm/stdvga.c
@@ -166,7 +166,10 @@ static int stdvga_outb(uint64_t addr, ui
/* When in standard vga mode, emulate here all writes to the vram buffer
* so we can immediately satisfy reads without waiting for qemu. */
- s->stdvga = (s->sr[7] == 0x00);
+ s->stdvga =
+ (s->sr[7] == 0x00) && /* standard vga mode */
+ (s->gr[6] == 0x05); /* misc graphics register w/ MemoryMapSelect=1
+ * 0xa0000-0xaffff (64k region), AlphaDis=1 */
if ( !prev_stdvga && s->stdvga )
{
++++++ stubdom-have-iovec.patch ++++++
Because of commit 76eb7cef6b84ca804f4db340e23ad9c501767c32,
xc_private.h now contains a definition of iovec. This conflicts with
building qemu-traditional's xen_platform.c, which includes hw.h, which
includes qemu-common.h, which already has a definition of iovec.
Index: xen-4.12.0-testing/tools/libxc/xc_private.h
===================================================================
--- xen-4.12.0-testing.orig/tools/libxc/xc_private.h
+++ xen-4.12.0-testing/tools/libxc/xc_private.h
@@ -50,6 +50,8 @@
#endif
#if defined(__MINIOS__)
+#ifndef HAVE_IOVEC
+#define HAVE_IOVEC
/*
* MiniOS's libc doesn't know about sys/uio.h or writev().
* Declare enough of sys/uio.h to compile.
@@ -58,6 +60,7 @@ struct iovec {
void *iov_base;
size_t iov_len;
};
+#endif
#else
#include <sys/uio.h>
#endif
++++++ suse-xendomains-service.patch ++++++
xendomains: remove libvirtd conflict
Conflicting with libvirtd is fine for upstream, where xl/libxl is king.
But down the SUSE stream, we promote libvirt and all the libvirt-based
tools. If a user installs libvirt on their SUSE Xen host, then libvirt
should be king and override xendomains.
bsc#1015348
Index: xen-4.8.0-testing/tools/hotplug/Linux/systemd/xendomains.service.in
===================================================================
--- xen-4.8.0-testing.orig/tools/hotplug/Linux/systemd/xendomains.service.in
+++ xen-4.8.0-testing/tools/hotplug/Linux/systemd/xendomains.service.in
@@ -5,7 +5,6 @@ After=proc-xen.mount xenstored.service x
After=network-online.target
After=remote-fs.target
ConditionPathExists=/proc/xen/capabilities
-Conflicts=libvirtd.service
[Service]
Type=oneshot
++++++ suspend_evtchn_lock.patch ++++++
Fix the problem that the suspend event-channel lock file might be left
stale for some reason, such as a segmentation fault or other abnormal exit.
Once a stale lock file exists, it can affect a later save process.
This was discussed with upstream but, for various reasons, not accepted.
http://xen.1045712.n5.nabble.com/Re-PATCH-improve-suspend-evtchn-lock-proce…
Signed-off-by: Chunyan Liu <cyliu(a)suse.com>
Index: xen-4.10.0-testing/tools/libxc/xc_suspend.c
===================================================================
--- xen-4.10.0-testing.orig/tools/libxc/xc_suspend.c
+++ xen-4.10.0-testing/tools/libxc/xc_suspend.c
@@ -20,6 +20,10 @@
#include "xc_private.h"
#include "xenguest.h"
+#include <signal.h>
+#ifdef __MINIOS__
+extern int kill (__pid_t __pid, int __sig);
+#endif
#define SUSPEND_LOCK_FILE XEN_RUN_DIR "/suspend-evtchn-%d.lock"
@@ -35,6 +39,37 @@
#define SUSPEND_FILE_BUFLEN (sizeof(SUSPEND_LOCK_FILE) + 10)
+/* Clean up an obsolete suspend lock file that was not unlinked for some
+reason, so that the current process can take the lock */
+static void clean_obsolete_lock(int domid)
+{
+ int fd, pid, n;
+ char buf[128];
+ char suspend_file[256];
+
+ snprintf(suspend_file, sizeof(suspend_file), "%s_%d_lock.d",
+ SUSPEND_LOCK_FILE, domid);
+ fd = open(suspend_file, O_RDWR);
+
+ if (fd < 0)
+ return;
+
+ n = read(fd, buf, 127);
+
+ close(fd);
+
+ if (n > 0)
+ {
+ sscanf(buf, "%d", &pid);
+ /* pid does not exist, this lock file is obsolete, just delete it */
+ if ( kill(pid,0) )
+ {
+ unlink(suspend_file);
+ return;
+ }
+ }
+}
+
static void get_suspend_file(char buf[], uint32_t domid)
{
snprintf(buf, SUSPEND_FILE_BUFLEN, SUSPEND_LOCK_FILE, domid);
@@ -48,6 +83,7 @@ static int lock_suspend_event(xc_interfa
struct flock fl;
get_suspend_file(suspend_file, domid);
+ clean_obsolete_lock(domid);
*lockfd = -1;
@@ -97,6 +133,8 @@ static int lock_suspend_event(xc_interfa
if (fd >= 0)
close(fd);
+ unlink(suspend_file);
+
return -1;
}
++++++ sysconfig.pciback ++++++
## Path: System/Virtualization
## Type: string
## Default: ""
#
# Space delimited list of PCI devices to late bind to pciback
# Format: <driver>,<PCI ID>
#
#XEN_PCI_HIDE_LIST="e1000,0000:0b:00.0 e1000,0000:0b:00.1"
XEN_PCI_HIDE_LIST=""
++++++ tmp_build.patch ++++++
Note: During the make process we can't have both xenstore and
domu-xenstore linking the sub command files from /usr/bin.
For example,
xen-tools: /usr/bin/xenstore-ls -> xenstore
xen-tools-domU: /usr/bin/xenstore-ls -> domu-xenstore
Whichever package creates this link last overwrites the previous link
and breaks the packaging. For this reason this patch puts domu-xenstore
with its links in /bin so as not to interfere with the regular xenstore
links.
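The patch also switches from hard links to `ln -fs` with a bare target name, so each link resolves relative to its own directory rather than to a fixed path. A small demo of that behaviour in a throwaway directory (file names are illustrative):

```shell
# Relative symlink: the target is a name, not a path, so the link
# resolves within whatever directory it is installed into.
dir=$(mktemp -d)
echo real > "$dir/domu-xenstore"
ln -fs domu-xenstore "$dir/xenstore-ls"
target=$(readlink "$dir/xenstore-ls")
echo "$target"   # prints domu-xenstore
rm -rf "$dir"
```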
---
tools/xenstore/Makefile | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
Index: xen-4.11.1-testing/tools/xenstore/Makefile
===================================================================
--- xen-4.11.1-testing.orig/tools/xenstore/Makefile
+++ xen-4.11.1-testing/tools/xenstore/Makefile
@@ -93,6 +93,7 @@ $(CLIENTS_DOMU): xenstore
xenstore: xenstore_client.o $(LIBXENSTORE)
$(CC) $< $(LDFLAGS) $(LDLIBS_libxenstore) $(LDLIBS_libxentoolcore) $(SOCKET_LIBS) -o $@ $(APPEND_LDFLAGS)
+ $(CC) $< $(CFLAGS) $(LDFLAGS) -Wl,--build-id=sha1 -L. -lxenstore $(LDLIBS_libxentoolcore) $(SOCKET_LIBS) -o domu-$@
xenstore-control: xenstore_control.o $(LIBXENSTORE)
$(CC) $< $(LDFLAGS) $(LDLIBS_libxenstore) $(LDLIBS_libxentoolcore) $(SOCKET_LIBS) -o $@ $(APPEND_LDFLAGS)
@@ -172,10 +173,11 @@ endif
$(INSTALL_PROG) xenstore-control $(DESTDIR)$(bindir)
$(INSTALL_PROG) xenstore $(DESTDIR)$(bindir)
set -e ; for c in $(CLIENTS) ; do \
- ln -f $(DESTDIR)$(bindir)/xenstore $(DESTDIR)$(bindir)/$${c} ; \
+ ln -fs xenstore $(DESTDIR)$(bindir)/$${c} ; \
done
+ $(INSTALL_PROG) domu-xenstore $(DESTDIR)/bin
for client in $(CLIENTS_DOMU); do \
- $(INSTALL_PROG) $$client $(DESTDIR)$(bindir)/$${client/domu-}; \
+ ln -fs domu-xenstore $(DESTDIR)/bin/$${client/domu-} ; \
done
$(INSTALL_DIR) $(DESTDIR)$(libdir)
$(INSTALL_SHLIB) libxenstore.so.$(MAJOR).$(MINOR) $(DESTDIR)$(libdir)
++++++ vif-bridge-no-iptables.patch ++++++
Index: xen-4.5.0-testing/tools/hotplug/Linux/vif-bridge
===================================================================
--- xen-4.5.0-testing.orig/tools/hotplug/Linux/vif-bridge
+++ xen-4.5.0-testing/tools/hotplug/Linux/vif-bridge
@@ -93,7 +93,7 @@ case "$command" in
;;
esac
-handle_iptable
+#handle_iptable
call_hooks vif post
++++++ vif-bridge-tap-fix.patch ++++++
# HG changeset patch
# User Jim Fehlig <jfehlig(a)suse.com>
# Date 1319581952 21600
# Node ID 74da2a3a1db1476d627f42e4a99e9e720cc6774d
# Parent 6c583d35d76dda2236c81d9437ff9d57ab02c006
Prevent vif-bridge from adding user-created tap interfaces to a bridge
Exit vif-bridge script if there is no device info in xenstore, preventing
it from adding user-created taps to bridges.
Signed-off-by: Jim Fehlig <jfehlig(a)suse.com>
Index: xen-4.5.0-testing/tools/hotplug/Linux/vif-bridge
===================================================================
--- xen-4.5.0-testing.orig/tools/hotplug/Linux/vif-bridge
+++ xen-4.5.0-testing/tools/hotplug/Linux/vif-bridge
@@ -28,6 +28,13 @@
dir=$(dirname "$0")
. "$dir/vif-common.sh"
+mac=$(xenstore_read_default "$XENBUS_PATH/mac" "")
+if [ -z "$mac" ]
+then
+ log debug "No device details in $XENBUS_PATH, exiting."
+ exit 0
+fi
+
bridge=${bridge:-}
bridge=$(xenstore_read_default "$XENBUS_PATH/bridge" "$bridge")
++++++ vif-route.patch ++++++
References: bsc#985503
Index: xen-4.13.0-testing/tools/hotplug/Linux/vif-route
===================================================================
--- xen-4.13.0-testing.orig/tools/hotplug/Linux/vif-route
+++ xen-4.13.0-testing/tools/hotplug/Linux/vif-route
@@ -61,11 +61,13 @@ case "${type_if}" in
;;
esac
-# If we've been given a list of IP addresses, then add routes from dom0 to
-# the guest using those addresses.
-for addr in ${ip} ; do
- ${cmdprefix} ip route ${ipcmd} ${addr} dev ${dev} src ${main_ip} metric ${metric}
-done
+if [ "${ip}" ] && [ "${ipcmd}" ] ; then
+ # If we've been given a list of IP addresses, then add routes from dom0 to
+ # the guest using those addresses.
+ for addr in ${ip} ; do
+ ${cmdprefix} ip route ${ipcmd} ${addr} dev ${dev} src ${main_ip} metric ${metric}
+ done
+fi
handle_iptable
++++++ x86-cpufreq-report.patch ++++++
Index: xen-4.12.0-testing/xen/arch/x86/platform_hypercall.c
===================================================================
--- xen-4.12.0-testing.orig/xen/arch/x86/platform_hypercall.c
+++ xen-4.12.0-testing/xen/arch/x86/platform_hypercall.c
@@ -25,7 +25,7 @@
#include <xen/symbols.h>
#include <asm/current.h>
#include <public/platform.h>
-#include <acpi/cpufreq/processor_perf.h>
+#include <acpi/cpufreq/cpufreq.h>
#include <asm/edd.h>
#include <asm/mtrr.h>
#include <asm/io_apic.h>
@@ -807,6 +807,41 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA
ret = -EFAULT;
}
break;
+
+ case XENPF_get_cpu_freq:
+ case XENPF_get_cpu_freq_min:
+ case XENPF_get_cpu_freq_max:
+ {
+ struct vcpu *v;
+ const struct cpufreq_policy *policy;
+
+ if ( op->u.get_cpu_freq.vcpu >= current->domain->max_vcpus ||
+ !(v = current->domain->vcpu[op->u.get_cpu_freq.vcpu]) )
+ {
+ ret = -EINVAL;
+ break;
+ }
+
+ policy = per_cpu(cpufreq_cpu_policy, v->processor);
+ switch ( op->cmd & -!!policy )
+ {
+ case XENPF_get_cpu_freq:
+ op->u.get_cpu_freq.freq = policy->cur;
+ break;
+ case XENPF_get_cpu_freq_min:
+ op->u.get_cpu_freq.freq = policy->min;
+ break;
+ case XENPF_get_cpu_freq_max:
+ op->u.get_cpu_freq.freq = policy->max;
+ break;
+ default:
+ op->u.get_cpu_freq.freq = 0;
+ break;
+ }
+ if ( __copy_field_to_guest(u_xenpf_op, op, u.get_cpu_freq.freq) )
+ ret = -EFAULT;
+ }
+ break;
default:
ret = -ENOSYS;
Index: xen-4.12.0-testing/xen/include/public/platform.h
===================================================================
--- xen-4.12.0-testing.orig/xen/include/public/platform.h
+++ xen-4.12.0-testing/xen/include/public/platform.h
@@ -553,6 +553,16 @@ struct xenpf_core_parking {
typedef struct xenpf_core_parking xenpf_core_parking_t;
DEFINE_XEN_GUEST_HANDLE(xenpf_core_parking_t);
+#define XENPF_get_cpu_freq ('N' << 24)
+#define XENPF_get_cpu_freq_min (XENPF_get_cpu_freq + 1)
+#define XENPF_get_cpu_freq_max (XENPF_get_cpu_freq_min + 1)
+struct xenpf_get_cpu_freq {
+ /* IN variables */
+ uint32_t vcpu;
+ /* OUT variables */
+ uint32_t freq; /* in kHz */
+};
+
/*
* Access generic platform resources(e.g., accessing MSR, port I/O, etc)
* in unified way. Batch resource operations in one call are supported and
@@ -644,6 +654,7 @@ struct xen_platform_op {
struct xenpf_core_parking core_parking;
struct xenpf_resource_op resource_op;
struct xenpf_symdata symdata;
+ struct xenpf_get_cpu_freq get_cpu_freq;
uint8_t pad[128];
} u;
};
++++++ x86-ioapic-ack-default.patch ++++++
Change default IO-APIC ack mode for single IO-APIC systems to old-style.
Index: xen-4.13.0-testing/xen/arch/x86/io_apic.c
===================================================================
--- xen-4.13.0-testing.orig/xen/arch/x86/io_apic.c
+++ xen-4.13.0-testing/xen/arch/x86/io_apic.c
@@ -2029,7 +2029,10 @@ void __init setup_IO_APIC(void)
io_apic_irqs = ~PIC_IRQS;
printk("ENABLING IO-APIC IRQs\n");
- printk(" -> Using %s ACK method\n", ioapic_ack_new ? "new" : "old");
+ if (!directed_eoi_enabled && !ioapic_ack_forced) {
+ ioapic_ack_new = (nr_ioapics > 1);
+ printk(" -> Using %s ACK method\n", ioapic_ack_new ? "new" : "old");
+ }
if (ioapic_ack_new) {
ioapic_level_type.ack = irq_complete_move;
++++++ xen-arch-kconfig-nr_cpus.patch ++++++
Index: xen-4.12.0-testing/xen/arch/Kconfig
===================================================================
--- xen-4.12.0-testing.orig/xen/arch/Kconfig
+++ xen-4.12.0-testing/xen/arch/Kconfig
@@ -2,7 +2,7 @@
config NR_CPUS
int "Maximum number of physical CPUs"
range 1 4095
- default "256" if X86
+ default "1024" if X86
default "8" if ARM && RCAR3
default "4" if ARM && QEMU
default "4" if ARM && MPSOC
++++++ xen-destdir.patch ++++++
Index: xen-4.11.0-testing/tools/xenstore/Makefile
===================================================================
--- xen-4.11.0-testing.orig/tools/xenstore/Makefile
+++ xen-4.11.0-testing/tools/xenstore/Makefile
@@ -20,6 +20,7 @@ LDFLAGS += $(LDFLAGS-y)
CLIENTS := xenstore-exists xenstore-list xenstore-read xenstore-rm xenstore-chmod
CLIENTS += xenstore-write xenstore-ls xenstore-watch
+CLIENTS_DOMU := $(patsubst xenstore-%,domu-xenstore-%,$(CLIENTS))
XENSTORED_OBJS = xenstored_core.o xenstored_watch.o xenstored_domain.o
XENSTORED_OBJS += xenstored_transaction.o xenstored_control.o
@@ -57,7 +58,7 @@ endif
all: $(ALL_TARGETS)
.PHONY: clients
-clients: xenstore $(CLIENTS) xenstore-control
+clients: xenstore $(CLIENTS) $(CLIENTS_DOMU) xenstore-control
ifeq ($(CONFIG_SunOS),y)
xenstored_probes.h: xenstored_probes.d
@@ -87,6 +88,9 @@ xenstored.a: $(XENSTORED_OBJS)
$(CLIENTS): xenstore
ln -f xenstore $@
+$(CLIENTS_DOMU): xenstore
+ ln -f xenstore $@
+
xenstore: xenstore_client.o $(LIBXENSTORE)
$(CC) $< $(LDFLAGS) $(LDLIBS_libxenstore) $(LDLIBS_libxentoolcore) $(SOCKET_LIBS) -o $@ $(APPEND_LDFLAGS)
@@ -139,7 +143,7 @@ clean:
rm -f *.a *.o *.opic *.so* xenstored_probes.h
rm -f xenstored xs_random xs_stress xs_crashme
rm -f xs_tdb_dump xenstore-control init-xenstore-domain
- rm -f xenstore $(CLIENTS)
+ rm -f xenstore $(CLIENTS) $(CLIENTS_DOMU)
rm -f xenstore.pc
$(RM) $(DEPS_RM)
@@ -163,12 +167,16 @@ ifeq ($(XENSTORE_XENSTORED),y)
$(INSTALL_DIR) $(DESTDIR)$(sbindir)
$(INSTALL_DIR) $(DESTDIR)$(XEN_LIB_STORED)
$(INSTALL_PROG) xenstored $(DESTDIR)$(sbindir)
+ $(INSTALL_DIR) $(DESTDIR)/bin
endif
$(INSTALL_PROG) xenstore-control $(DESTDIR)$(bindir)
$(INSTALL_PROG) xenstore $(DESTDIR)$(bindir)
set -e ; for c in $(CLIENTS) ; do \
ln -f $(DESTDIR)$(bindir)/xenstore $(DESTDIR)$(bindir)/$${c} ; \
done
+ for client in $(CLIENTS_DOMU); do \
+ $(INSTALL_PROG) $$client $(DESTDIR)$(bindir)/$${client/domu-}; \
+ done
$(INSTALL_DIR) $(DESTDIR)$(libdir)
$(INSTALL_SHLIB) libxenstore.so.$(MAJOR).$(MINOR) $(DESTDIR)$(libdir)
ln -sf libxenstore.so.$(MAJOR).$(MINOR) $(DESTDIR)$(libdir)/libxenstore.so.$(MAJOR)
++++++ xen-dom0-modules.service ++++++
[Unit]
Description=Load dom0 backend drivers
ConditionPathExists=/proc/xen
Before=xenstored.service xen-watchdog.service
[Install]
WantedBy=multi-user.target
[Service]
Type=oneshot
RemainAfterExit=true
Environment=PATH=/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin:/bin
# dummy to have always one valid line
ExecStart=-/usr/bin/env uname -a
# modules listed in /usr/lib/modules.d/xen.conf
# load them manually to avoid usage of system-modules-load.service
++++++ xen-supportconfig ++++++
#!/bin/bash
#############################################################
# Name: Supportconfig Plugin for Xen
# Description: Gathers important troubleshooting information
# about Xen and its tools
# Author: Jim Fehlig <jfehlig(a)suse.com>
#############################################################
# TODO:
# - Anything needed for UEFI?
#
RCFILE="/usr/lib/supportconfig/resources/scplugin.rc"
GRUB2_CONF_FILES="/etc/default/grub"
XEN_CONF_FILES="/etc/xen/xl.conf /etc/sysconfig/xencommons /etc/sysconfig/xendomains"
XEN_SERVICES="xencommons xendomains xen-watchdog"
VM_CONF_FILES=""
XEN_LOG_FILES=""
if [ -s $RCFILE ]; then
if ! source $RCFILE; then
echo "ERROR: Initializing resource file: $RCFILE" >&2
exit 1
fi
fi
rpm_verify() {
thisrpm="$1"
local ret=0
echo
echo "#==[ Validating RPM ]=================================#"
if rpm -q "$thisrpm" >/dev/null 2>&1; then
echo "# rpm -V $thisrpm"
if rpm -V "$thisrpm"; then
echo "Status: Passed"
else
echo "Status: WARNING"
fi
else
echo "package $thisrpm is not installed"
ret=1
fi
echo
return $ret
}
# if no xen package we are done
if ! rpm_verify xen; then
echo "Skipped"
exit 0
fi
# if not a xen host (dom0) we are done
echo "#==[ Checking if booted Xen ]=================================#"
if [ ! -d /proc/xen ] || [ ! -e /proc/xen/capabilities ] || [ `cat /proc/xen/capabilities` != "control_d" ]; then
echo "No"
echo "Skipped"
exit 0
else
echo "Yes"
echo
fi
# basic system information:
plugin_command "uname -r"
for service in $XEN_SERVICES; do
plugin_command "systemctl status $service"
plugin_command "systemctl is-enabled $service"
done
plugin_command "lscpu"
plugin_command "xl info --numa"
plugin_command "xl list"
plugin_command "xl pci-assignable-list"
plugin_command "xenstore-ls"
plugin_command "ps -ef | grep xen"
# dump grub2-related conf files
pconf_files "$GRUB2_CONF_FILES"
# dump Xen-related conf files
pconf_files "$XEN_CONF_FILES"
# detailed system info:
plugin_command "xl list --long"
plugin_command "xl dmesg"
# network-related info often useful for debugging
if systemctl is-enabled NetworkManager.service > /dev/null 2>&1; then
echo "NOTE: NetworkManager should not be enabled on a Xen host"
fi
plugin_command "route -n"
plugin_command "arp -v"
plugin_command "ip link show type bridge"
plugin_command "bridge link show"
# list contents of common config and image directories
plugin_command "ls -alR /etc/xen/vm/"
plugin_command "ls -alR /etc/xen/auto/"
plugin_command "ls -alR /var/lib/xen/images/"
# dump VM-related conf files
test -d /etc/xen/vm && VM_CONF_FILES=$(find -L /etc/xen/vm/ -type f | sort)
pconf_files "$VM_CONF_FILES"
# dump log files
test -d /var/log/xen && XEN_LOG_FILES="$(find -L /var/log/xen/ -type f | grep 'log$' | sort)"
plog_files 0 "$XEN_LOG_FILES"
echo "Done"
++++++ xen.bug1026236.suse_vtsc_tolerance.patch ++++++
suse_vtsc_tolerance=<val>
Reference: bsc#1026236
To avoid emulation of vTSC after live migration or save/restore, allow a
clock frequency difference up to the specified value. If the frequency
is within the allowed range, TSC accesses by the domU are performed
at native speed; otherwise, TSC accesses are emulated. It is up to
the host admin to decide how much tolerance all running domUs can
actually handle. The default is zero tolerance.
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -43,6 +43,9 @@
static char __initdata opt_clocksource[10];
string_param("clocksource", opt_clocksource);
+static unsigned int __read_mostly opt_suse_vtsc_tolerance;
+integer_param("suse_vtsc_tolerance", opt_suse_vtsc_tolerance);
+
unsigned long __read_mostly cpu_khz; /* CPU clock frequency in kHz. */
DEFINE_SPINLOCK(rtc_lock);
unsigned long pit0_ticks;
@@ -2226,6 +2229,7 @@ int tsc_set_info(struct domain *d,
switch ( tsc_mode )
{
+ bool disable_vtsc;
case TSC_MODE_DEFAULT:
case TSC_MODE_ALWAYS_EMULATE:
d->arch.vtsc_offset = get_s_time() - elapsed_nsec;
@@ -2239,8 +2243,26 @@ int tsc_set_info(struct domain *d,
* When a guest is created, gtsc_khz is passed in as zero, making
* d->arch.tsc_khz == cpu_khz. Thus no need to check incarnation.
*/
+ disable_vtsc = d->arch.tsc_khz == cpu_khz;
+
+ if ( tsc_mode == TSC_MODE_DEFAULT && gtsc_khz &&
+ is_hvm_domain(d) && opt_suse_vtsc_tolerance )
+ {
+ long khz_diff;
+
+ khz_diff = ABS(((long)cpu_khz - gtsc_khz));
+ disable_vtsc = khz_diff <= opt_suse_vtsc_tolerance;
+
+ printk(XENLOG_G_INFO "d%d: host has %lu kHz,"
+ " domU expects %u kHz,"
+ " difference of %ld is %s tolerance of %u\n",
+ d->domain_id, cpu_khz, gtsc_khz, khz_diff,
+ disable_vtsc ? "within" : "outside",
+ opt_suse_vtsc_tolerance);
+ }
+
if ( tsc_mode == TSC_MODE_DEFAULT && host_tsc_is_safe() &&
- (d->arch.tsc_khz == cpu_khz ||
+ (disable_vtsc ||
(is_hvm_domain(d) &&
hvm_get_tsc_scaling_ratio(d->arch.tsc_khz))) )
{
++++++ xen.build-compare.doc_html.patch ++++++
The order of $(wildcard *) results is not deterministic.
Sort the input files to reduce build-compare noise.
---
docs/Makefile | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Index: xen-4.13.1-testing/docs/Makefile
===================================================================
--- xen-4.13.1-testing.orig/docs/Makefile
+++ xen-4.13.1-testing/docs/Makefile
@@ -191,7 +191,7 @@ uninstall: uninstall-man-pages uninstall
# Individual file build targets
html/index.html: $(DOC_HTML) $(CURDIR)/gen-html-index INDEX
- $(PERL) -w -- $(CURDIR)/gen-html-index -i INDEX html $(DOC_HTML)
+ $(PERL) -w -- $(CURDIR)/gen-html-index -i INDEX html $(sort $(DOC_HTML))
html/%.txt: %.txt
@$(INSTALL_DIR) $(@D)
@@ -206,8 +206,8 @@ html/hypercall/%/index.html: $(CURDIR)/x
$(INSTALL_DIR) $(@D)
$(PERL) -w $(CURDIR)/xen-headers -O $(@D) \
-T 'arch-$* - Xen public headers' \
- $(patsubst %,-X arch-%,$(filter-out $*,$(DOC_ARCHES))) \
- $(patsubst %,-X xen-%,$(filter-out $*,$(DOC_ARCHES))) \
+ $(sort $(patsubst %,-X arch-%,$(filter-out $*,$(DOC_ARCHES)))) \
+ $(sort $(patsubst %,-X xen-%,$(filter-out $*,$(DOC_ARCHES)))) \
$(EXTRA_EXCLUDE) \
$(XEN_ROOT)/xen include/public include/xen/errno.h
++++++ xen.libxl.dmmd.patch ++++++
References: bsc#954872
---
tools/libxl/libxl.c | 4 ++++
tools/libxl/libxl_device.c | 3 ++-
tools/libxl/libxl_dm.c | 34 +++++++++++++++++++++++++++++-----
tools/libxl/libxlu_disk_l.l | 2 ++
4 files changed, 37 insertions(+), 6 deletions(-)
Index: xen-4.13.0-testing/tools/libxl/libxl_disk.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_disk.c
+++ xen-4.13.0-testing/tools/libxl/libxl_disk.c
@@ -178,7 +178,7 @@ static int libxl__device_disk_setdefault
return rc;
}
-static int libxl__device_from_disk(libxl__gc *gc, uint32_t domid,
+int libxl__device_from_disk(libxl__gc *gc, uint32_t domid,
const libxl_device_disk *disk,
libxl__device *device)
{
@@ -336,6 +336,10 @@ static void device_disk_add(libxl__egc *
rc = ERROR_FAIL;
goto out;
case LIBXL_DISK_BACKEND_QDISK:
+ if (disk->script) {
+ script = libxl__abs_path(gc, disk->script, libxl__xen_script_dir_path());
+ flexarray_append_pair(back, "script", script);
+ }
flexarray_append(back, "params");
flexarray_append(back, GCSPRINTF("%s:%s",
libxl__device_disk_string_of_format(disk->format),
Index: xen-4.13.0-testing/tools/libxl/libxl_device.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_device.c
+++ xen-4.13.0-testing/tools/libxl/libxl_device.c
@@ -326,7 +326,8 @@ static int disk_try_backend(disk_try_bac
return 0;
case LIBXL_DISK_BACKEND_QDISK:
- if (a->disk->script) goto bad_script;
+ LOG(DEBUG, "Disk vdev=%s, uses script=%s on %s backend",
+ a->disk->vdev, a->disk->script, libxl_disk_backend_to_string(backend));
return backend;
default:
@@ -343,11 +344,6 @@ static int disk_try_backend(disk_try_bac
libxl_disk_format_to_string(a->disk->format));
return 0;
- bad_script:
- LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
- a->disk->vdev, libxl_disk_backend_to_string(backend));
- return 0;
-
bad_colo:
LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo",
a->disk->vdev, libxl_disk_backend_to_string(backend));
Index: xen-4.13.0-testing/tools/libxl/libxl_dm.c
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_dm.c
+++ xen-4.13.0-testing/tools/libxl/libxl_dm.c
@@ -1162,6 +1162,30 @@ out:
return rc;
}
+static void libxl__suse_node_to_path(libxl__gc *gc, int domid, const libxl_device_disk *dp, const char **pdev_path)
+{
+ libxl_ctx *ctx = libxl__gc_owner(gc);
+ char *be_path, *node;
+ libxl__device device;
+ libxl_device_disk disk;
+ int rc;
+
+ disk = *dp;
+ rc = libxl__device_from_disk(gc, domid, &disk, &device);
+ if (rc) {
+ LIBXL__LOG(ctx, LIBXL__LOG_WARNING, "libxl__device_from_disk failed %d", rc);
+ return;
+ }
+ be_path = libxl__device_backend_path(gc, &device);
+
+ node = libxl__xs_read(gc, XBT_NULL, libxl__sprintf(gc, "%s/node", be_path));
+ if (!node)
+ return;
+
+ LIBXL__LOG(ctx, LIBXL__LOG_WARNING, "replacing '%s' with '%s' from %s/node, just for qemu-xen", *pdev_path, node, be_path);
+ *pdev_path = node;
+}
+
static int libxl__build_device_model_args_new(libxl__gc *gc,
const char *dm, int guest_domid,
const libxl_domain_config *guest_config,
@@ -1795,9 +1819,11 @@ static int libxl__build_device_model_arg
libxl__device_disk_dev_number(disks[i].vdev, &disk, &part);
const char *format;
char *drive;
- const char *target_path = NULL;
+ const char *target_path = disks[i].pdev_path;
int colo_mode;
+ libxl__suse_node_to_path(gc, guest_domid, disks + i, &target_path);
+
if (dev_number == -1) {
LOGD(WARN, guest_domid, "unable to determine"" disk number for %s",
disks[i].vdev);
Index: xen-4.13.0-testing/tools/libxl/libxlu_disk_l.l
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxlu_disk_l.l
+++ xen-4.13.0-testing/tools/libxl/libxlu_disk_l.l
@@ -230,6 +230,8 @@ target=.* { STRIP(','); SAVESTRING("targ
free(newscript);
}
+dmmd:/.* { DPC->had_depr_prefix=1; DEPRECATE(0); }
+npiv:/.* { DPC->had_depr_prefix=1; DEPRECATE(0); }
tapdisk:/.* { DPC->had_depr_prefix=1; DEPRECATE(0); }
tap2?:/.* { DPC->had_depr_prefix=1; DEPRECATE(0); }
aio:/.* { DPC->had_depr_prefix=1; DEPRECATE(0); }
Index: xen-4.13.0-testing/tools/libxl/libxl_internal.h
===================================================================
--- xen-4.13.0-testing.orig/tools/libxl/libxl_internal.h
+++ xen-4.13.0-testing/tools/libxl/libxl_internal.h
@@ -2042,6 +2042,10 @@ struct libxl__cpuid_policy {
char *policy[4];
};
+_hidden int libxl__device_from_disk(libxl__gc *gc, uint32_t domid,
+ const libxl_device_disk *disk,
+ libxl__device *device);
+
/* Calls poll() again - useful to check whether a signaled condition
* is still true. Cannot fail. Returns currently-true revents. */
_hidden short libxl__fd_poll_recheck(libxl__egc *egc, int fd, short events);
++++++ xen.stubdom.newlib.patch ++++++
# HG changeset patch
# Parent 02ec826cab1e4acb25b364a180a1597ace1149f9
stubdom: fix errors in newlib
rpm post-build-checks found a few code bugs in newlib and marked them as
errors. Add another newlib patch and apply it during the stubdom build.
I: A function uses a 'return;' statement, but has actually a value
to return, like an integer ('return 42;') or similar.
W: xen voidreturn ../../../../newlib-1.16.0/libgloss/i386/cygmon-gmon.c:117, 125, 146, 157, 330
I: Program is using implicit definitions of special functions.
these functions need to use their correct prototypes to allow
the lightweight buffer overflow checking to work.
- Implicit memory/string functions need #include <string.h>.
- Implicit *printf functions need #include <stdio.h>.
- Implicit *printf functions need #include <stdio.h>.
- Implicit *read* functions need #include <unistd.h>.
- Implicit *recv* functions need #include <sys/socket.h>.
E: xen implicit-fortify-decl ../../../../newlib-1.16.0/libgloss/i386/cygmon-gmon.c:119
I: Program returns random data in a function
E: xen no-return-in-nonvoid-function ../../../../newlib-1.16.0/libgloss/i386/cygmon-gmon.c:362
Signed-off-by: Olaf Hering <olaf(a)aepfle.de>
Index: xen-4.12.0-testing/stubdom/Makefile
===================================================================
--- xen-4.12.0-testing.orig/stubdom/Makefile
+++ xen-4.12.0-testing/stubdom/Makefile
@@ -88,6 +88,8 @@ newlib-$(NEWLIB_VERSION): newlib-$(NEWLI
patch -d $@ -p0 < newlib-chk.patch
patch -d $@ -p1 < newlib-stdint-size_max-fix-from-1.17.0.patch
patch -d $@ -p1 < newlib-disable-texinfo.patch
+ patch -d $@ -p1 < newlib-cygmon-gmon.patch
+ patch -d $@ -p1 < newlib-makedoc.patch
find $@ -type f | xargs perl -i.bak \
-pe 's/\b_(tzname|daylight|timezone)\b/$$1/g'
touch $@
Index: xen-4.12.0-testing/stubdom/newlib-cygmon-gmon.patch
===================================================================
--- /dev/null
+++ xen-4.12.0-testing/stubdom/newlib-cygmon-gmon.patch
@@ -0,0 +1,60 @@
+
+I: A function uses a 'return;' statement, but has actually a value
+ to return, like an integer ('return 42;') or similar.
+W: xen voidreturn ../../../../newlib-1.16.0/libgloss/i386/cygmon-gmon.c:117, 125, 146, 157, 330
+
+I: Program is using implicit definitions of special functions.
+ these functions need to use their correct prototypes to allow
+ the lightweight buffer overflow checking to work.
+ - Implicit memory/string functions need #include <string.h>.
+ - Implicit *printf functions need #include <stdio.h>.
+ - Implicit *printf functions need #include <stdio.h>.
+ - Implicit *read* functions need #include <unistd.h>.
+ - Implicit *recv* functions need #include <sys/socket.h>.
+E: xen implicit-fortify-decl ../../../../newlib-1.16.0/libgloss/i386/cygmon-gmon.c:119
+
+I: Program returns random data in a function
+E: xen no-return-in-nonvoid-function ../../../../newlib-1.16.0/libgloss/i386/cygmon-gmon.c:362
+
+---
+ libgloss/i386/cygmon-gmon.c | 6 +++++-
+ 1 file changed, 5 insertions(+), 1 deletion(-)
+
+Index: newlib-1.16.0/libgloss/i386/cygmon-gmon.c
+===================================================================
+--- newlib-1.16.0.orig/libgloss/i386/cygmon-gmon.c
++++ newlib-1.16.0/libgloss/i386/cygmon-gmon.c
+@@ -61,6 +61,8 @@
+ static char sccsid[] = "@(#)gmon.c 5.3 (Berkeley) 5/22/91";
+ #endif /* not lint */
+
++#include <string.h>
++#include <unistd.h>
+ #define DEBUG
+ #ifdef DEBUG
+ #include <stdio.h>
+@@ -89,7 +91,7 @@ static int s_scale;
+
+ extern int errno;
+
+-int
++void
+ monstartup(lowpc, highpc)
+ char *lowpc;
+ char *highpc;
+@@ -199,6 +201,7 @@ _mcleanup()
+
+ static char already_setup = 0;
+
++void
+ _mcount()
+ {
+ register char *selfpc;
+@@ -341,6 +344,7 @@ overflow:
+ * profiling is what mcount checks to see if
+ * all the data structures are ready.
+ */
++void
+ moncontrol(mode)
+ int mode;
+ {
Index: xen-4.12.0-testing/stubdom/newlib-makedoc.patch
===================================================================
--- /dev/null
+++ xen-4.12.0-testing/stubdom/newlib-makedoc.patch
@@ -0,0 +1,10 @@
+--- newlib-1.16.0/newlib/doc/makedoc.c.orig 2015-04-08 11:56:39.283090914 +0200
++++ newlib-1.16.0/newlib/doc/makedoc.c 2015-04-08 11:56:51.245227742 +0200
+@@ -39,6 +39,7 @@
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <ctype.h>
++#include <string.h>
+
+ #define DEF_SIZE 5000
+ #define STACK 50
++++++ xen2libvirt.py ++++++
#!/usr/bin/python3
#
# Copyright (C) 2014 SUSE LINUX Products GmbH, Nuernberg, Germany.
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library. If not, see
# <http://www.gnu.org/licenses/>.
#
# Authors:
# Jim Fehlig <jfehlig(a)suse.com>
#
# Read native Xen configuration format, convert to libvirt domXML, and
# import (virsh define <xml>) into libvirt.
import sys
import os
import argparse
import re
from xml.etree import ElementTree
try:
import libvirt
except ImportError:
print('Unable to import the libvirt module. Is libvirt-python installed?')
sys.exit(1)
parser = argparse.ArgumentParser(description='Import Xen domain configuration into libvirt')
parser.add_argument('-c', '--convert-only', help='Convert Xen domain configuration into libvirt domXML, but do not import into libvirt', action='store_true', dest='convert_only')
parser.add_argument('-r', '--recursive', help='Operate recursively on all Xen domain configurations rooted at path', action='store_true')
parser.add_argument('-f', '--format', help='Format of Xen domain configuration. Supported formats are xm and sexpr', choices=['xm', 'sexpr'], default=None)
parser.add_argument('-v', '--verbose', help='Print information about the import process', action='store_true')
parser.add_argument('path', help='Path to Xen domain configuration')
def print_verbose(msg):
if args.verbose:
print(msg)
def check_config(path, config):
isbinary = os.system('file -b ' + path + ' | grep text > /dev/null')
if isbinary:
print('Skipping %s (not a valid Xen configuration file)' % path)
return 'unknown'
for line in config.splitlines():
if len(line) == 0 or line.startswith('#'):
continue
if line.startswith('<domain'):
# XML is not a supported conversion format
break
if line.startswith('(domain'):
print('Found sexpr formatted file %s' % path)
return 'sexpr'
if '=' in line:
print('Found xm formatted file %s' % path)
return 'xm'
break
print('Skipping %s (not a valid Xen configuration file)' % path)
return 'unknown'
def import_domain(conn, path, format=None, convert_only=False):
f = open(path, 'r')
config = f.read()
print_verbose('Xen domain configuration read from %s:\n %s' % (path, config))
if format is None:
format = check_config(path, config)
if format == 'sexpr':
print_verbose('scrubbing domid from configuration')
config = re.sub(r"\(domid [0-9]*\)", "", config)
print_verbose('scrubbed sexpr:\n %s' % config)
xml = conn.domainXMLFromNative('xen-sxpr', config, 0)
elif format == 'xm':
xml = conn.domainXMLFromNative('xen-xm', config, 0)
else:
# Return to continue on to next file (if recursive)
return
f.close()
# domUloader is no longer available in SLES12, replace with pygrub
tree = ElementTree.fromstring(xml)
bl = tree.find('.//bootloader')
if bl is not None and bl.text is not None and 'domUloader' in bl.text:
bl.text = 'pygrub'
xml = ElementTree.tostring(tree)
print_verbose('Successfully converted Xen domain configuration to '
'libvirt domXML:\n %s' % xml)
if convert_only:
print(xml)
else:
print_verbose('Importing converted libvirt domXML into libvirt...')
dom = conn.defineXML(xml.decode("utf-8"))
if dom is None:
print('Failed to define domain from converted domXML')
sys.exit(1)
print_verbose('domXML successfully imported into libvirt')
args = parser.parse_args()
path = args.path
# Connect to libvirt
conn = libvirt.open(None)
if conn is None:
print('Failed to open connection to the hypervisor')
sys.exit(1)
if args.recursive:
try:
for root, dirs, files in os.walk(path):
for name in files:
abs_name = os.path.join(root, name)
print_verbose('Processing file %s' % abs_name)
import_domain(conn, abs_name, args.format, args.convert_only)
except IOError:
print('Failed to open/read path %s' % path)
sys.exit(1)
else:
import_domain(conn, args.path, args.format, args.convert_only)
++++++ xen_maskcalc.py ++++++
#!/usr/bin/python3
# Xen Mask Calculator - Calculate CPU masking information based on cpuid(1)
# Copyright (C) 2017 Armando Vega
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
import argparse
import sys
import os
EAX1_MATCH = '0x00000001 0x00:'
EAX7_MATCH = '0x00000007 0x00:'
EXP_LINELN = 76
libxl_names_ecx1 = []
libxl_names_edx1 = []
libvirt_names_ecx1 = []
libvirt_names_edx1 = []
libxl_names_ebx7 = []
libxl_names_ecx7 = []
libvirt_names_ebx7 = []
libvirt_names_ecx7 = []
def fill_ecx1(bit, libxl, libvirt):
if libxl_names_ecx1[bit]:
print("ecx bit %s already set: libxl %s libvirt %s. Ignoring %s/%s\n" % (bit, libxl_names_ecx1[bit], libvirt_names_ecx1[bit], libxl, libvirt))
return
libxl_names_ecx1[bit] = libxl
libvirt_names_ecx1[bit] = libvirt
def fill_edx1(bit, libxl, libvirt):
if libxl_names_edx1[bit]:
print("edx bit %s already set: libxl %s libvirt %s. Ignoring %s/%s\n" % (bit, libxl_names_edx1[bit], libvirt_names_edx1[bit], libxl, libvirt))
return
libxl_names_edx1[bit] = libxl
libvirt_names_edx1[bit] = libvirt
def fill_ebx7(bit, libxl, libvirt):
if libxl_names_ebx7[bit]:
print("ebx bit %s already set: libxl %s libvirt %s. Ignoring %s/%s\n" % (bit, libxl_names_ebx7[bit], libvirt_names_ebx7[bit], libxl, libvirt))
return
libxl_names_ebx7[bit] = libxl
libvirt_names_ebx7[bit] = libvirt
def fill_ecx7(bit, libxl, libvirt):
if libxl_names_ecx7[bit]:
print("ecx bit %s already set: libxl %s libvirt %s. Ignoring %s/%s\n" % (bit, libxl_names_ecx7[bit], libvirt_names_ecx7[bit], libxl, libvirt))
return
libxl_names_ecx7[bit] = libxl
libvirt_names_ecx7[bit] = libvirt
def fill_bit_names():
for i in range(0,32):
libxl_names_ecx1.append(None)
libxl_names_edx1.append(None)
libxl_names_ebx7.append(None)
libxl_names_ecx7.append(None)
libvirt_names_ecx1.append(None)
libvirt_names_edx1.append(None)
libvirt_names_ebx7.append(None)
libvirt_names_ecx7.append(None)
fill_ecx1(0, "sse3", "pni")
fill_ecx1(1, "pclmulqdq", "pclmuldq")
fill_ecx1(2, "dtes64", "dtes64")
fill_ecx1(3, "monitor", "monitor")
fill_ecx1(4, "dscpl", "ds_cpl")
fill_ecx1(5, "vmx", "vmx")
fill_ecx1(6, "smx", "smx")
fill_ecx1(7, "est", "est")
fill_ecx1(8, "tm2", "tm2")
fill_ecx1(9, "ssse3", "ssse3")
fill_ecx1(10, "cntxid", "cid")
fill_ecx1(12, "fma", "fma")
fill_ecx1(13, "cmpxchg16", "cx16")
fill_ecx1(14, "xtpr", "xtpr")
fill_ecx1(15, "pdcm", "pdcm")
fill_ecx1(17, "pcid", "pcid")
fill_ecx1(18, "dca", "dca")
fill_ecx1(19, "sse4_1", "sse4.1")
fill_ecx1(20, "sse4_2", "sse4.2")
fill_ecx1(21, "x2apic", "x2apic")
fill_ecx1(22, "movbe", "movbe")
fill_ecx1(23, "popcnt", "popcnt")
fill_ecx1(24, "tsc-deadline", "tsc-deadline")
fill_ecx1(25, "aes", "aes")
fill_ecx1(26, "xsave", "xsave")
fill_ecx1(27, "osxsave", "osxsave")
fill_ecx1(28, "avx", "avx")
fill_ecx1(29, "f16c", "f16c")
fill_ecx1(30, "rdrand", "rdrand")
fill_ecx1(31, "hypervisor", "hypervisor")
fill_edx1(0, "fpu", "fpu")
fill_edx1(1, "vme", "vme")
fill_edx1(2, "de", "de")
fill_edx1(3, "pse", "pse")
fill_edx1(4, "tsc", "tsc")
fill_edx1(5, "msr", "msr")
fill_edx1(6, "pae", "pae")
fill_edx1(7, "mce", "mce")
fill_edx1(8, "cmpxchg8", "cx8")
fill_edx1(9, "apic", "apic")
fill_edx1(11, "sysenter", "sep")
fill_edx1(12, "mtrr", "mtrr")
fill_edx1(13, "pge", "pge")
fill_edx1(14, "mca", "mca")
fill_edx1(15, "cmov", "cmov")
fill_edx1(16, "pat", "pat")
fill_edx1(17, "pse36", "pse36")
fill_edx1(18, "psn", "pn")
fill_edx1(19, "clfsh", "clflush")
fill_edx1(21, "ds", "ds")
fill_edx1(22, "acpi", "acpi")
fill_edx1(23, "mmx", "mmx")
fill_edx1(24, "fxsr", "fxsr")
fill_edx1(25, "sse", "sse")
fill_edx1(26, "sse2", "sse2")
fill_edx1(27, "ss", "ss")
fill_edx1(28, "htt", "ht")
fill_edx1(29, "tm", "tm")
fill_edx1(30, "ia64", "ia64")
fill_edx1(31, "pbe", "pbe")
fill_ebx7(0, "fsgsbase", "fsgsbase")
fill_ebx7(1, "tsc_adjust", "tsc_adjust")
fill_ebx7(3, "bmi1", "bmi1")
fill_ebx7(4, "hle", "hle")
fill_ebx7(5, "avx2", "avx2")
fill_ebx7(7, "smep", "smep")
fill_ebx7(8, "bmi2", "bmi2")
fill_ebx7(9, "erms", "erms")
fill_ebx7(10, "invpcid", "invpcid")
fill_ebx7(11, "rtm", "rtm")
fill_ebx7(12, "cmt", "cmt")
fill_ebx7(14, "mpx", "mpx")
fill_ebx7(16, "avx512f", "avx512f")
fill_ebx7(17, "avx512dq", "avx512dq")
fill_ebx7(18, "rdseed", "rdseed")
fill_ebx7(19, "adx", "adx")
fill_ebx7(20, "smap", "smap")
fill_ebx7(21, "avx512-ifma", "avx512-ifma")
fill_ebx7(23, "clflushopt", "clflushopt")
fill_ebx7(24, "clwb", "clwb")
fill_ebx7(26, "avx512pf", "avx512pf")
fill_ebx7(27, "avx512er", "avx512er")
fill_ebx7(28, "avx512cd", "avx512cd")
fill_ebx7(29, "sha", "sha")
fill_ebx7(30, "avx512bw", "avx512bw")
fill_ebx7(31, "avx512vl", "avx512vl")
fill_ecx7(0, "prefetchwt1", "prefetchwt1")
fill_ecx7(1, "avx512-vbmi", "avx512-vbmi")
fill_ecx7(2, "umip", "umip")
fill_ecx7(3, "pku", "pku")
fill_ecx7(4, "ospke", "ospke")
fill_ecx7(6, "avx512-vbmi2", "avx512-vbmi2")
fill_ecx7(8, "gfni", "gfni")
fill_ecx7(9, "vaes", "vaes")
fill_ecx7(10, "vpclmulqdq", "vpclmulqdq")
fill_ecx7(11, "avx512-vnni", "avx512-vnni")
fill_ecx7(12, "avx512-bitalg", "avx512-bitalg")
fill_ecx7(14, "avx512-vpopcntdq", "avx512-vpopcntdq")
fill_ecx7(22, "rdpid", "rdpid")
fill_ecx7(25, "cldemote", "cldemote")
def get_register_mask(regs):
""" Take a list of register values and return the calculated mask """
reg_n = len(regs)
mask = ''
for idx in range(32):
counter = 0
for reg in regs:
counter += 1 if (reg & (1 << idx) > 0) else 0
# if we have all 1s or all 0s we don't mask the bit
if counter == reg_n or counter == 0:
mask = mask + 'x'
else:
mask = mask + '0'
# we calculated the mask in reverse, so we reverse it again
return mask[::-1]
def print_xl_masking_config(nodes):
""" Take a dictionary of nodes containing their registers and print out CPUID masking configuration for xl """
nomasking = 'x' * 32
libxl = []
libvirt = []
eax1_ecx_regs = []
eax1_edx_regs = []
eax7_ebx_regs = []
eax7_ecx_regs = []
for node in nodes:
eax1_ecx_regs.append(nodes[node]['eax1_ecx'])
eax1_edx_regs.append(nodes[node]['eax1_edx'])
eax7_ebx_regs.append(nodes[node]['eax7_ebx'])
eax7_ecx_regs.append(nodes[node]['eax7_ecx'])
# Get masks for the EAX1 and EAX7 registers
eax1_ecx_mask = get_register_mask(eax1_ecx_regs)
eax1_edx_mask = get_register_mask(eax1_edx_regs)
eax7_ebx_mask = get_register_mask(eax7_ebx_regs)
eax7_ecx_mask = get_register_mask(eax7_ecx_regs)
# Build the xl CPUID config
cpuid_config = 'cpuid = [\n "0x00000001:ecx=' + eax1_ecx_mask
if eax1_edx_mask != nomasking:
cpuid_config += ',edx=' + eax1_edx_mask
cpuid_config += '",\n'
cpuid_config += ' "0x00000007,0x00:ebx=' + eax7_ebx_mask
if eax7_ecx_mask != nomasking:
cpuid_config += ',ecx=' + eax7_ecx_mask
cpuid_config += '"\n'
cpuid_config += ']'
print(cpuid_config)
bitnum = len(eax1_ecx_mask)
while bitnum > 0:
bitnum -= 1
bitval = eax1_ecx_mask[len(eax1_ecx_mask) - 1 - bitnum]
if bitval == "0" and libxl_names_ecx1[bitnum]:
libxl.append(libxl_names_ecx1[bitnum] + "=0")
libvirt.append(libvirt_names_ecx1[bitnum])
bitnum = len(eax1_edx_mask)
while bitnum > 0:
bitnum -= 1
bitval = eax1_edx_mask[len(eax1_edx_mask) - 1 - bitnum]
if bitval == "0" and libxl_names_edx1[bitnum]:
libxl.append(libxl_names_edx1[bitnum] + "=0")
libvirt.append(libvirt_names_edx1[bitnum])
bitnum = len(eax7_ebx_mask)
while bitnum > 0:
bitnum -= 1
bitval = eax7_ebx_mask[len(eax7_ebx_mask) - 1 - bitnum]
if bitval == "0" and libxl_names_ebx7[bitnum]:
libxl.append(libxl_names_ebx7[bitnum] + "=0")
libvirt.append(libvirt_names_ebx7[bitnum])
bitnum = len(eax7_ecx_mask)
while bitnum > 0:
bitnum -= 1
bitval = eax7_ecx_mask[len(eax7_ecx_mask) - 1 - bitnum]
if bitval == "0" and libxl_names_ecx7[bitnum]:
libxl.append(libxl_names_ecx7[bitnum] + "=0")
libvirt.append(libvirt_names_ecx7[bitnum])
if len(libxl) > 0:
output = "cpuid = [ host"
for i in libxl:
output += "," + i
output += " ]"
print(output)
print("<domain>")
print(" <cpu>")
for i in libvirt:
print(" <feature policy='optional' name='%s' />" % i)
print(" </cpu>")
print("</domain>")
def print_verbose_masking_info(nodes):
""" Take a dictionary of nodes containing their registers and print out verbose mask derivation information """
eax1_ecx_regs = []
eax1_edx_regs = []
eax7_ebx_regs = []
eax7_ecx_regs = []
for node in nodes:
eax1_ecx_regs.append(nodes[node]['eax1_ecx'])
eax1_edx_regs.append(nodes[node]['eax1_edx'])
eax7_ebx_regs.append(nodes[node]['eax7_ebx'])
eax7_ecx_regs.append(nodes[node]['eax7_ecx'])
print("")
print('== Detailed mask derivation info ==')
print("")
print('EAX1 ECX registers:')
for reg in eax1_ecx_regs:
print('{0:032b}'.format(reg))
print('================================')
print(get_register_mask(eax1_ecx_regs))
print("")
print('EAX1 EDX registers:')
for reg in eax1_edx_regs:
print('{0:032b}'.format(reg))
print('================================')
print(get_register_mask(eax1_edx_regs))
print("")
print('EAX7,0 EBX registers:')
for reg in eax7_ebx_regs:
print('{0:032b}'.format(reg))
print('================================')
print(get_register_mask(eax7_ebx_regs))
print("")
print('EAX7,0 ECX registers:')
for reg in eax7_ecx_regs:
print('{0:032b}'.format(reg))
print('================================')
print(get_register_mask(eax7_ecx_regs))
if __name__ == '__main__':
epilog = """The individual 'node_files' are generated with 'cpuid -1r':
server1~$ cpuid -1r > node1
server2~$ cpuid -1r > node2
server3~$ cpuid -1r > node3
~$ {0} node1 node2 node3
Use 'zypper install cpuid' to install the cpuid package.
Note: Run 'cpuid' with NATIVE boot instead of dom0 to get the complete cpuid values.
Xen hides some bits from dom0!
""".format(sys.argv[0])
parser = argparse.ArgumentParser(
formatter_class=argparse.RawDescriptionHelpFormatter,
description='A utility that calculates a Xen CPUID difference mask',
epilog=epilog
)
parser.add_argument('node_files', nargs='*', help='Filenames of XEN node CPUID outputs')
parser.add_argument('-v', '--verbose', action='store_true', help='Get detailed mask derivation information')
args = parser.parse_args()
if len(args.node_files) < 2:
print('Need at least 2 files to do the comparison!')
parser.print_help()
sys.exit(1)
fill_bit_names()
nodes = dict()
for node in args.node_files:
if os.path.isfile(node):
try:
f = open(node)
except IOError as e:
print("I/O error({0}): {1}".format(e.errno, e.strerror))
sys.exit(1)
else:
lines = [line.strip() for line in f]
eax1 = ''
eax7 = ''
# try to match the lines containing interesting registers
# EAX1 - Processor Info and Feature Bits
# EAX7 - Extended features
for line in lines:
if line.startswith(EAX1_MATCH):
eax1 = line
elif line.startswith(EAX7_MATCH):
eax7 = line
# if we get garbled data we should probably just give up
if len(eax1) < EXP_LINELN or len(eax7) < EXP_LINELN:
print('ERROR: invalid data format in file : ' + node)
sys.exit(1)
# check if we can actually parse the strings into integers
try:
eax1_ecx = int(eax1.split()[4].split('=')[1], 0)
eax1_edx = int(eax1.split()[5].split('=')[1], 0)
eax7_ebx = int(eax7.split()[3].split('=')[1], 0)
eax7_ecx = int(eax7.split()[4].split('=')[1], 0)
except ValueError:
print('ERROR: invalid data format in file: ' + node)
sys.exit(1)
nodes[node] = dict()
nodes[node]['eax1_ecx'] = eax1_ecx
nodes[node]['eax1_edx'] = eax1_edx
nodes[node]['eax7_ebx'] = eax7_ebx
nodes[node]['eax7_ecx'] = eax7_ecx
f.close()
else:
print('File not found: ' + node)
sys.exit(1)
print_xl_masking_config(nodes)
if args.verbose:
print_verbose_masking_info(nodes)
++++++ xenapiusers ++++++
root
++++++ xencommons.service ++++++
[Unit]
Description=xencommons
ConditionPathExists=/proc/xen/capabilities
# Avoid errors from systemd-modules-load.service
Requires=xen-dom0-modules.service
After=xen-dom0-modules.service
# Pull in all upstream service files
Requires=proc-xen.mount
After=proc-xen.mount
Requires=xenstored.service
After=xenstored.service
Requires=xenconsoled.service
After=xenconsoled.service
Requires=xen-init-dom0.service
After=xen-init-dom0.service
Requires=xen-qemu-dom0-disk-backend.service
After=xen-qemu-dom0-disk-backend.service
# Make sure network (for bridge) and remote mounts (for xendomains) are available ...
After=network-online.target
After=remote-fs.target
# ... for libvirt and xendomains
Before=xendomains.service libvirtd.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStartPre=/bin/grep -q control_d /proc/xen/capabilities
ExecStart=/usr/bin/xenstore-ls -f
ExecStartPost=/bin/sh -c 'mv -vf /var/log/xen/xen-boot.log /var/log/xen/xen-boot.prev.log ; /usr/sbin/xl dmesg > /var/log/xen/xen-boot.log'
[Install]
WantedBy=multi-user.target
++++++ xenconsole-no-multiple-connections.patch ++++++
Index: xen-4.8.0-testing/tools/console/client/main.c
===================================================================
--- xen-4.8.0-testing.orig/tools/console/client/main.c
+++ xen-4.8.0-testing/tools/console/client/main.c
@@ -101,6 +101,7 @@ static int get_pty_fd(struct xs_handle *
* Assumes there is already a watch set in the store for this path. */
{
struct timeval tv;
+ struct flock lock;
fd_set watch_fdset;
int xs_fd = xs_fileno(xs), pty_fd = -1;
int start, now;
@@ -124,6 +125,14 @@ static int get_pty_fd(struct xs_handle *
pty_fd = open(pty_path, O_RDWR | O_NOCTTY);
if (pty_fd == -1)
warn("Could not open tty `%s'", pty_path);
+ else {
+ memset(&lock, 0, sizeof(lock));
+ lock.l_type = F_WRLCK;
+ lock.l_whence = SEEK_SET;
+ if (fcntl(pty_fd, F_SETLK, &lock) != 0)
+ err(errno, "Could not lock tty '%s'",
+ pty_path);
+ }
}
free(pty_path);
}
++++++ xendomains-wait-disks.LICENSE ++++++
++++ 674 lines (skipped)
++++++ xendomains-wait-disks.README.md ++++++
# xen-tools-xendomains-wait-disk
[xendomains.service](https://github.com/xen-project/xen/blob/RELEASE-4.13.0/… has problems
with disks that appear only later in the boot process (or even after booting is complete). This project creates a service that
loops over all disks the domUs will use and waits for them to appear.
xendomains-wait-disk.service launches a script that reads both /etc/xen/auto/ configurations and /var/lib/xen/save/ dumps.
From those files, it extracts which disks are needed for all domUs that will be started (respecting /etc/sysconfig/xendomains
settings). After that, it simply loops waiting for those disks to appear. A timeout (5 min) configured in
xendomains-wait-disk.service prevents it from blocking the boot process forever.
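The wait loop at the heart of the service can be sketched as follows (a minimal illustration rather than the shipped script; the function name and the default timeout of 300 seconds are assumptions for this sketch):

```shell
# Wait for a required disk node to appear, polling once per second.
# Returns 0 as soon as the path exists, 1 once the timeout expires.
wait_for_disk() {
    local disk=$1 timeout=${2:-300} i=0
    while [ "$i" -lt "$timeout" ]; do
        [ -e "$disk" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}
```

The real script runs one such loop per disk in a background subshell and `wait`s on all of them, so a single missing disk delays only its own check.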
There are two known cases where this project is useful:
## degraded mdadm RAID
mdadm RAID arrays are assembled by [udev rules](https://github.com/neilbrown/mdadm/blob/master/udev-md-raid-assembly….
However, an array is only assembled when it is healthy. When a member is still missing, a [timer](https://github.com/neilbrown/mdadm/blob/master/systemd/mdadm-last-re… is started that will try to assemble the RAID anyway after 30s, even if degraded. This timer does not prevent xendomains from being started. So, if a domU depends on an MD RAID that is degraded (i.e. a RAID 1 missing one disk), xendomains.service will be started before those 30s have passed and that domU will fail.
An alternative solution would be to add extra hard dependencies to xendomains.service for each required disk (Requires=xxx.device). However, this solution introduces a bigger problem. Without the hard dependencies, if a single RAID is degraded, only the domUs that depend on it fail. With Requires=xxx.device, xendomains will never start if
a RAID could not be assembled even after 30s (i.e. a RAID 5 with two missing disks).
With xendomains-wait-disk.service, xendomains.service is blocked for up to 5 min waiting for the MD RAIDs used by domUs. If the wait times out, xendomains.service
starts anyway.
## iSCSI disks
A domU that uses an iSCSI disk (mapped by the host OS) also fails to start during boot: open-iscsi.service returns before it connects to the remote target and rescans the
iSCSI disks. As in the mdadm RAID case, xendomains.service is started too early and any domU that depends on iSCSI disks will fail.
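Finding the disks a domU needs works on the JSON that `xl create --dryrun` prints; a simplified stand-in for the sed pipeline in the shipped script (the sample `pdev_path` value below is hypothetical) looks like:

```shell
# Print every pdev_path value found in an xl JSON config read from stdin.
# Simplified: the shipped script additionally restricts matching to the
# "disks" array of the config.
extract_disks() {
    sed -n 's/.*"pdev_path": "\([^"]*\)".*/\1/p'
}
```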
++++++ xendomains-wait-disks.sh ++++++
#!/bin/bash
#
# Generates xendomains unit
#
read_conf_from_file() {
${sbindir}/xl create --quiet --dryrun --defconfig "$1"
}
big2littleendian_32bit(){
echo ${1:6:2}${1:4:2}${1:2:2}${1:0:2}
}
read_hex() {
local out_var=$1; shift
local input=$1; shift
local pos_var=$1; shift
local length=$1; shift
local hex=$(dd bs=1 skip=${!pos_var} count=$length status=none <$input | xxd -p -c$length -l$length)
read -r $pos_var <<<"$((${!pos_var} + $length))"
read -r $out_var <<<"$hex"
}
hex2dec() {
local hex=$1; shift
local little_endian=$1; shift
if $little_endian; then
hex=$(big2littleendian_32bit $hex)
fi
echo $((0x$hex))
}
read_conf_from_image(){
local pos=0 length=0
local magic_header byte_order mandatory_flags optional_flags optional_data_len config_len config_json
read_hex magic_header $1 pos 32
# "Xen saved domain, xl format\n \0 \r"
if [ "$magic_header" != "58656e20736176656420646f6d61696e2c20786c20666f726d61740a2000200d" ]; then
log $err "Unknown file format in $1. Wrong magic header: '0x$magic_header'"
return 1
fi
read_hex byte_order $1 pos 4
case "$byte_order" in
04030201) little_endian=true;;
01020304) little_endian=false;;
*) log $err "Unknown byte order 0x$byte_order in $1"; return 1;;
esac
#define XL_MANDATORY_FLAG_JSON (1U << 0) /* config data is in JSON format */
#define XL_MANDATORY_FLAG_STREAMv2 (1U << 1) /* stream is v2 */
read_hex mandatory_flags $1 pos 4
if [ "$(($(hex2dec $mandatory_flags $little_endian) & 0x3))" -ne 3 ]; then
log $err "Unknown config format or stream version. Mandatory flags are 0x$mandatory_flags"
return 1
fi
read_hex optional_flags $1 pos 4
read_hex optional_data_len $1 pos 4
optional_data_len=$(hex2dec $optional_data_len $little_endian)
# Not used here, but the saved memory dump begins at offset $((pos+optional_data_len))
read_hex config_len $1 pos 4
config_len=$(hex2dec $config_len $little_endian)
# null terminated string
read_hex config_json $1 pos $config_len
xxd -p -r <<<"$config_json"
}
log() {
local msg_loglevel=$1; shift
if [ "$msg_loglevel" -gt "$LOGLEVEL" ]; then
return 0
fi
echo "$@" >&2
}
emerg=0; alert=1; crit=2; err=3
warning=4; notice=5; info=6; debug=7
LOGLEVEL=${LOGLEVEL:-4}
if [ "$SYSTEMD_LOG_LEVEL" ]; then
LOGLEVEL=${!SYSTEMD_LOG_LEVEL}
fi
log $debug "Using loglevel $LOGLEVEL"
trap "log $err Error on \$LINENO: \$(caller)" ERR
log $debug "loading /etc/xen/scripts/hotplugpath.sh..."
. /etc/xen/scripts/hotplugpath.sh
#log $debug "testing for ${sbindir}/xl..."
#CMD=${sbindir}/xl
#if ! $CMD list &> /dev/null; then
# log $err "${sbindir}/xl list failed!"
# log $err "$($CMD list &>&1)"
# exit $?
#fi
#log $debug "${sbindir}/xl list OK!"
log $debug "loading /etc/sysconfig/xendomains..."
XENDOM_CONFIG=/etc/sysconfig/xendomains
if ! test -r $XENDOM_CONFIG; then
echo "$XENDOM_CONFIG does not exist" >&2;
exit 6
fi
. $XENDOM_CONFIG
doms_conf=()
doms_restore=()
doms_source=()
log $debug "Reading saved domains..."
if [ "$XENDOMAINS_RESTORE" = "true" ] && [ -d "$XENDOMAINS_SAVE" ]; then
for dom in $XENDOMAINS_SAVE/*; do
log $debug "Trying $dom..."
if ! [ -r $dom ] ; then
log $debug "Not readable $dom..."
continue
fi
log $debug "Reading conf from $dom..."
if ! dom_conf=$(read_conf_from_image $dom); then
log $err "Cannot read conf from $dom"
continue
fi
log $debug "Adding $dom to the list"
doms_conf+=("$dom_conf")
doms_restore+=(true)
doms_source+=("$dom")
done
fi
log $debug "Reading auto domains..."
if [ -d "$XENDOMAINS_AUTO" ]; then
for dom in $XENDOMAINS_AUTO/*; do
log $debug "Trying $dom..."
if ! [ -r $dom ] ; then
log $debug "Not readable $dom..."
continue
fi
log $debug "Reading conf from $dom..."
if ! dom_conf=$(read_conf_from_file $dom); then
log $err "Cannot read conf from $dom"
continue
fi
log $debug "Adding $dom to the list"
doms_conf+=("$dom_conf")
doms_restore+=(false)
doms_source+=("$dom")
done
fi
log $debug "We have ${#doms_conf[*]} domains to check"
for i in ${!doms_conf[*]}; do
log $debug "Doing dom $i..."
dom_conf="${doms_conf[i]}"
dom_restore="${doms_restore[i]}"
dom_source="${doms_source[i]}"
dom_name=$(sed -n 's/^.*(name \(.*\))$/\1/p;s/^.*"name": "\(.*\)",$/\1/p' <<<"$dom_conf")
readarray -t required_disks <<<"$(sed -n -e '/^ "disks": \[/,/ \],/{ /"pdev_path":/ { s/.*"pdev_path": "//;s/".*//p } }' <<<"$dom_conf")"
log $debug "dom $i is named $dom_name..."
for disk in "${required_disks[@]}"; do
disk_control_var=control_$(tr -d -c '[a-zA-Z0-9_]' <<<"$disk")
if [ "${!disk_control_var:-0}" -eq 1 ]; then
log $debug "$disk for $dom_name is already being checked"
continue
fi
declare $disk_control_var=1
log $debug "waiting for $disk for $dom_name"
(
j=0 found_loglevel=$debug
while true; do
if [ -e "$disk" ]; then
log $found_loglevel "disk $disk found (after $j seconds)"
exit 0
fi
if [ "$(( j++ % 5))" -eq 0 ]; then
log $warning "still waiting for $disk for $dom_name..."
found_loglevel=$warning
fi
sleep 1
done
) &
done
done
wait
log $debug "Exiting normally"
++++++ xenstore-launch.patch ++++++
References: bsc#1131811
When the xenstored service is started it exits successfully, but systemd seems to
lose track of the service and reports an error, causing other xen services to fail.
This patch is a workaround that gives systemd time to acknowledge a successful start
of xenstored. The real fix is believed to be needed in systemd.
diff --git a/tools/hotplug/Linux/launch-xenstore.in b/tools/hotplug/Linux/launch-xenstore.in
index 991dec8d25..eb3d7c964c 100644
--- a/tools/hotplug/Linux/launch-xenstore.in
+++ b/tools/hotplug/Linux/launch-xenstore.in
@@ -79,6 +79,7 @@ test -f @CONFIG_DIR@/@CONFIG_LEAF_DIR@/xencommons && . @CONFIG_DIR@/@CONFIG_LEAF
echo -n Starting $XENSTORE_DOMAIN_KERNEL...
${LIBEXEC_BIN}/init-xenstore-domain $XENSTORE_DOMAIN_ARGS || exit 1
systemd-notify --ready 2>/dev/null
+ systemd-notify --booted 2>/dev/null && sleep 60
exit 0
}
++++++ xenstore-run-in-studomain.patch ++++++
References: fate#323663 - Run Xenstore in stubdomain
Index: xen-4.10.0-testing/tools/hotplug/Linux/init.d/sysconfig.xencommons.in
===================================================================
--- xen-4.10.0-testing.orig/tools/hotplug/Linux/init.d/sysconfig.xencommons.in
+++ xen-4.10.0-testing/tools/hotplug/Linux/init.d/sysconfig.xencommons.in
@@ -16,7 +16,7 @@
#
# Changing this requires a reboot to take effect.
#
-#XENSTORETYPE=daemon
+#XENSTORETYPE=domain
## Type: string
## Default: xenstored
@@ -67,7 +67,7 @@ XENSTORED_ARGS=
#
# xenstore domain memory size in MiB.
# Only evaluated if XENSTORETYPE is "domain".
-#XENSTORE_DOMAIN_SIZE=8
+#XENSTORE_DOMAIN_SIZE=32
## Type: string
## Default: not set, no autoballooning of xenstore domain
@@ -78,7 +78,7 @@ XENSTORED_ARGS=
# - combination of both in form of <val>:<frac> (e.g. 8:1/100), resulting
# value will be the higher of both specifications
# Only evaluated if XENSTORETYPE is "domain".
-#XENSTORE_MAX_DOMAIN_SIZE=
+#XENSTORE_MAX_DOMAIN_SIZE=1/100
## Type: string
## Default: ""
Index: xen-4.10.0-testing/tools/hotplug/Linux/launch-xenstore.in
===================================================================
--- xen-4.10.0-testing.orig/tools/hotplug/Linux/launch-xenstore.in
+++ xen-4.10.0-testing/tools/hotplug/Linux/launch-xenstore.in
@@ -48,7 +48,7 @@ test_xenstore && exit 0
test -f @CONFIG_DIR@/@CONFIG_LEAF_DIR@/xencommons && . @CONFIG_DIR@/@CONFIG_LEAF_DIR@/xencommons
-[ "$XENSTORETYPE" = "" ] && XENSTORETYPE=daemon
+[ "$XENSTORETYPE" = "" ] && XENSTORETYPE=domain
/bin/mkdir -p @XEN_RUN_DIR@
@@ -72,9 +72,10 @@ test -f @CONFIG_DIR@/@CONFIG_LEAF_DIR@/x
[ "$XENSTORETYPE" = "domain" ] && {
[ -z "$XENSTORE_DOMAIN_KERNEL" ] && XENSTORE_DOMAIN_KERNEL=@LIBEXEC@/boot/xenstore-stubdom.gz
XENSTORE_DOMAIN_ARGS="$XENSTORE_DOMAIN_ARGS --kernel $XENSTORE_DOMAIN_KERNEL"
- [ -z "$XENSTORE_DOMAIN_SIZE" ] && XENSTORE_DOMAIN_SIZE=8
+ [ -z "$XENSTORE_DOMAIN_SIZE" ] && XENSTORE_DOMAIN_SIZE=32
XENSTORE_DOMAIN_ARGS="$XENSTORE_DOMAIN_ARGS --memory $XENSTORE_DOMAIN_SIZE"
- [ -z "$XENSTORE_MAX_DOMAIN_SIZE" ] || XENSTORE_DOMAIN_ARGS="$XENSTORE_DOMAIN_ARGS --maxmem $XENSTORE_MAX_DOMAIN_SIZE"
+ [ -z "$XENSTORE_MAX_DOMAIN_SIZE" ] && XENSTORE_MAX_DOMAIN_SIZE="1/100"
+ XENSTORE_DOMAIN_ARGS="$XENSTORE_DOMAIN_ARGS --maxmem $XENSTORE_MAX_DOMAIN_SIZE"
echo -n Starting $XENSTORE_DOMAIN_KERNEL...
${LIBEXEC_BIN}/init-xenstore-domain $XENSTORE_DOMAIN_ARGS || exit 1
++++++ xl-conf-default-bridge.patch ++++++
Index: xen-4.4.0-testing/tools/examples/xl.conf
===================================================================
--- xen-4.4.0-testing.orig/tools/examples/xl.conf
+++ xen-4.4.0-testing/tools/examples/xl.conf
@@ -30,7 +30,7 @@
#vif.default.script="vif-bridge"
# default bridge device to use with vif-bridge hotplug scripts
-#vif.default.bridge="xenbr0"
+vif.default.bridge="br0"
# Reserve a claim of memory when launching a guest. This guarantees immediate
# feedback whether the guest can be launched due to memory exhaustion
++++++ xl-conf-disable-autoballoon.patch ++++++
--- xen-4.12.0-testing/tools/examples/xl.conf.orig 2019-03-11 06:17:17.586380817 -0600
+++ xen-4.12.0-testing/tools/examples/xl.conf 2019-03-11 06:17:31.314553910 -0600
@@ -3,7 +3,7 @@
# Control whether dom0 is ballooned down when xen doesn't have enough
# free memory to create a domain. "auto" means only balloon if dom0
# starts with all the host's memory.
-#autoballoon="auto"
+autoballoon="off"
# full path of the lockfile used by xl during domain creation
#lockfile="/var/lock/xl"
++++++ xnloader.py ++++++
# NetWare-specific operations
#
# Copyright (c) 2013 Suse Linux Products.
# Author: Charles Arnold <carnold(a)suse.com>
#
# This software may be freely redistributed under the terms of the GNU
# general public license.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# 51 Franklin St, Boston, MA 02110
# Binary patching of xnloader.sys
# For launching NetWare on Xen 4.2 and newer
import os, sys, base64
CODE_OFFSET=0x49F5
NUMBER_OF_CODE_BYTES=17
ORIGINAL_CODE="BA00080000C786FC1F0000FFFFFFFF31C9"
PATCHED_CODE="BAF8070000834C961CFFB9080000009090"
XNLOADER_SYS_MD5SUM="eb76cce2a2d45928ea2bf26e01430af2"
def patch_netware_loader(loader):
    """Open the given xnloader.sys file and patch the relevant code hunk."""
    # domUloader calls this with all kernels so perhaps this is not the NetWare loader
    md5sum_cmd = 'md5sum ' + loader
    p = os.popen(md5sum_cmd)
    sum = p.read().split()[0]
    p.close()
    if sum != XNLOADER_SYS_MD5SUM:
        return
    try:
        fd = os.open(loader, os.O_RDWR)
    except Exception as e:
        print(e, file=sys.stderr)
        raise
    # Validate minimum size for I/O
    stat = os.fstat(fd)
    if stat.st_size < CODE_OFFSET+NUMBER_OF_CODE_BYTES:
        os.close(fd)
        return
    # Seek to location of code hunk
    os.lseek(fd, CODE_OFFSET, os.SEEK_SET)
    # Read code bytes at offset
    buf = os.read(fd, NUMBER_OF_CODE_BYTES)
    code_as_hex = base64.b16encode(buf)
    code_as_hex = code_as_hex.decode('utf-8')
    if code_as_hex == ORIGINAL_CODE:
        # Seek back to start location of the code hunk
        os.lseek(fd, CODE_OFFSET, os.SEEK_SET)
        # Convert the PATCHED_CODE string to raw binary
        code_as_bin = base64.b16decode(PATCHED_CODE)
        # Write the patched code
        os.write(fd, code_as_bin)
    os.close(fd)
++++++ xsa286-1.patch ++++++
x86/mm: split L4 and L3 parts of the walk out of do_page_walk()
The L3 one at least is going to be re-used by a subsequent patch, and
splitting the L4 one then as well seems only natural.
This is part of XSA-286.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: George Dunlap <george.dunlap(a)citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
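The refactoring pattern this patch introduces — turning one monolithic multi-level walk into per-level helpers that each return either the entry or an empty entry — can be modelled outside of Xen. The sketch below is purely illustrative (hypothetical names, dicts standing in for page tables, `None` for `l4e_empty()`/`l3e_empty()`), not Xen's actual types:

```python
EMPTY = None  # stand-in for l4e_empty() / l3e_empty()

def is_canonical(addr):
    return 0 <= addr < 2**48  # simplified canonical-address check

def l4_offset(addr):
    return (addr >> 39) & 0x1ff

def l3_offset(addr):
    return (addr >> 30) & 0x1ff

def walk_get_l4e(root, addr):
    """Fetch the level-4 entry for addr, or EMPTY if the walk can't start."""
    if not is_canonical(addr):
        return EMPTY
    return root.get(l4_offset(addr), EMPTY)

def walk_get_l3e(root, addr):
    """Built on the L4 helper, as the patch does, so each level is reusable."""
    l4e = walk_get_l4e(root, addr)
    if l4e is EMPTY:  # analogue of the _PAGE_PRESENT check
        return EMPTY
    return l4e.get(l3_offset(addr), EMPTY)
```

The point of the split is exactly what the commit message states: a later patch can call `walk_get_l3e()` (and below) without duplicating the upper-level logic.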
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -44,26 +44,47 @@ unsigned int __read_mostly m2p_compat_vs
l2_pgentry_t *compat_idle_pg_table_l2;
-void *do_page_walk(struct vcpu *v, unsigned long addr)
+static l4_pgentry_t page_walk_get_l4e(pagetable_t root, unsigned long addr)
{
- unsigned long mfn = pagetable_get_pfn(v->arch.guest_table);
- l4_pgentry_t l4e, *l4t;
- l3_pgentry_t l3e, *l3t;
- l2_pgentry_t l2e, *l2t;
- l1_pgentry_t l1e, *l1t;
+ unsigned long mfn = pagetable_get_pfn(root);
+ l4_pgentry_t *l4t, l4e;
- if ( !is_pv_vcpu(v) || !is_canonical_address(addr) )
- return NULL;
+ if ( !is_canonical_address(addr) )
+ return l4e_empty();
l4t = map_domain_page(_mfn(mfn));
l4e = l4t[l4_table_offset(addr)];
unmap_domain_page(l4t);
+
+ return l4e;
+}
+
+static l3_pgentry_t page_walk_get_l3e(pagetable_t root, unsigned long addr)
+{
+ l4_pgentry_t l4e = page_walk_get_l4e(root, addr);
+ l3_pgentry_t *l3t, l3e;
+
if ( !(l4e_get_flags(l4e) & _PAGE_PRESENT) )
- return NULL;
+ return l3e_empty();
l3t = map_l3t_from_l4e(l4e);
l3e = l3t[l3_table_offset(addr)];
unmap_domain_page(l3t);
+
+ return l3e;
+}
+
+void *do_page_walk(struct vcpu *v, unsigned long addr)
+{
+ l3_pgentry_t l3e;
+ l2_pgentry_t l2e, *l2t;
+ l1_pgentry_t l1e, *l1t;
+ unsigned long mfn;
+
+ if ( !is_pv_vcpu(v) )
+ return NULL;
+
+ l3e = page_walk_get_l3e(v->arch.guest_table, addr);
mfn = l3e_get_pfn(l3e);
if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) || !mfn_valid(_mfn(mfn)) )
return NULL;
++++++ xsa286-2.patch ++++++
x86/mm: check page types in do_page_walk()
For page table entries read to be guaranteed valid, transiently locking
the pages and validating their types is necessary. Note that guest use
of linear page tables is intentionally not taken into account here, as
ordinary data (guest stacks) can't possibly live inside page tables.
This is part of XSA-286.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: George Dunlap <george.dunlap(a)citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
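The safety pattern here — transiently lock the page, read an entry only if the page's recorded type matches the level being walked, then unlock — can be sketched in miniature. Everything below is a hypothetical model (a boolean flag stands in for `page_lock()`, a string for the `PGT_type_mask` bits), not Xen code:

```python
class Page:
    """Toy page_info: a type tag plus a non-reentrant lock flag."""
    def __init__(self, type_, entries):
        self.type = type_          # e.g. "l4", "l3"
        self.entries = entries
        self.locked = False

    def lock(self):
        if self.locked:
            return False           # analogue of page_lock() failing
        self.locked = True
        return True

    def unlock(self):
        self.locked = False

def get_entry(page, idx, expected_type):
    """Read entries[idx] only while locked and only if the type matches."""
    entry = None
    if not page.lock():
        return None                # caller sees an "empty" entry
    if page.type == expected_type: # the PGT_type_mask check
        entry = page.entries.get(idx)
    page.unlock()
    return entry
```

A mismatched type or a failed lock both degrade to an empty entry, which is exactly how the patched helpers report "can't safely read this".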
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -46,15 +46,29 @@ l2_pgentry_t *compat_idle_pg_table_l2;
static l4_pgentry_t page_walk_get_l4e(pagetable_t root, unsigned long addr)
{
- unsigned long mfn = pagetable_get_pfn(root);
- l4_pgentry_t *l4t, l4e;
+ mfn_t mfn = pagetable_get_mfn(root);
+ /* current's root page table can't disappear under our feet. */
+ bool need_lock = !mfn_eq(mfn, pagetable_get_mfn(current->arch.guest_table));
+ struct page_info *pg;
+ l4_pgentry_t l4e = l4e_empty();
if ( !is_canonical_address(addr) )
return l4e_empty();
- l4t = map_domain_page(_mfn(mfn));
- l4e = l4t[l4_table_offset(addr)];
- unmap_domain_page(l4t);
+ pg = mfn_to_page(mfn);
+ if ( need_lock && !page_lock(pg) )
+ return l4e_empty();
+
+ if ( (pg->u.inuse.type_info & PGT_type_mask) == PGT_l4_page_table )
+ {
+ l4_pgentry_t *l4t = map_domain_page(mfn);
+
+ l4e = l4t[l4_table_offset(addr)];
+ unmap_domain_page(l4t);
+ }
+
+ if ( need_lock )
+ page_unlock(pg);
return l4e;
}
@@ -62,14 +76,26 @@ static l4_pgentry_t page_walk_get_l4e(pa
static l3_pgentry_t page_walk_get_l3e(pagetable_t root, unsigned long addr)
{
l4_pgentry_t l4e = page_walk_get_l4e(root, addr);
- l3_pgentry_t *l3t, l3e;
+ mfn_t mfn = l4e_get_mfn(l4e);
+ struct page_info *pg;
+ l3_pgentry_t l3e = l3e_empty();
if ( !(l4e_get_flags(l4e) & _PAGE_PRESENT) )
return l3e_empty();
- l3t = map_l3t_from_l4e(l4e);
- l3e = l3t[l3_table_offset(addr)];
- unmap_domain_page(l3t);
+ pg = mfn_to_page(mfn);
+ if ( !page_lock(pg) )
+ return l3e_empty();
+
+ if ( (pg->u.inuse.type_info & PGT_type_mask) == PGT_l3_page_table )
+ {
+ l3_pgentry_t *l3t = map_domain_page(mfn);
+
+ l3e = l3t[l3_table_offset(addr)];
+ unmap_domain_page(l3t);
+ }
+
+ page_unlock(pg);
return l3e;
}
@@ -77,44 +103,67 @@ static l3_pgentry_t page_walk_get_l3e(pa
void *do_page_walk(struct vcpu *v, unsigned long addr)
{
l3_pgentry_t l3e;
- l2_pgentry_t l2e, *l2t;
- l1_pgentry_t l1e, *l1t;
- unsigned long mfn;
+ l2_pgentry_t l2e = l2e_empty();
+ l1_pgentry_t l1e = l1e_empty();
+ mfn_t mfn;
+ struct page_info *pg;
if ( !is_pv_vcpu(v) )
return NULL;
l3e = page_walk_get_l3e(v->arch.guest_table, addr);
- mfn = l3e_get_pfn(l3e);
- if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) || !mfn_valid(_mfn(mfn)) )
+ mfn = l3e_get_mfn(l3e);
+ if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) || !mfn_valid(mfn) )
return NULL;
if ( (l3e_get_flags(l3e) & _PAGE_PSE) )
{
- mfn += PFN_DOWN(addr & ((1UL << L3_PAGETABLE_SHIFT) - 1));
+ mfn = mfn_add(mfn, PFN_DOWN(addr & ((1UL << L3_PAGETABLE_SHIFT) - 1)));
goto ret;
}
- l2t = map_domain_page(_mfn(mfn));
- l2e = l2t[l2_table_offset(addr)];
- unmap_domain_page(l2t);
- mfn = l2e_get_pfn(l2e);
- if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) || !mfn_valid(_mfn(mfn)) )
+ pg = mfn_to_page(mfn);
+ if ( !page_lock(pg) )
+ return NULL;
+
+ if ( (pg->u.inuse.type_info & PGT_type_mask) == PGT_l2_page_table )
+ {
+ const l2_pgentry_t *l2t = map_domain_page(mfn);
+
+ l2e = l2t[l2_table_offset(addr)];
+ unmap_domain_page(l2t);
+ }
+
+ page_unlock(pg);
+
+ mfn = l2e_get_mfn(l2e);
+ if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) || !mfn_valid(mfn) )
return NULL;
if ( (l2e_get_flags(l2e) & _PAGE_PSE) )
{
- mfn += PFN_DOWN(addr & ((1UL << L2_PAGETABLE_SHIFT) - 1));
+ mfn = mfn_add(mfn, PFN_DOWN(addr & ((1UL << L2_PAGETABLE_SHIFT) - 1)));
goto ret;
}
- l1t = map_domain_page(_mfn(mfn));
- l1e = l1t[l1_table_offset(addr)];
- unmap_domain_page(l1t);
- mfn = l1e_get_pfn(l1e);
- if ( !(l1e_get_flags(l1e) & _PAGE_PRESENT) || !mfn_valid(_mfn(mfn)) )
+ pg = mfn_to_page(mfn);
+ if ( !page_lock(pg) )
+ return NULL;
+
+ if ( (pg->u.inuse.type_info & PGT_type_mask) == PGT_l1_page_table )
+ {
+ const l1_pgentry_t *l1t = map_domain_page(mfn);
+
+ l1e = l1t[l1_table_offset(addr)];
+ unmap_domain_page(l1t);
+ }
+
+ page_unlock(pg);
+
+ mfn = l1e_get_mfn(l1e);
+ if ( !(l1e_get_flags(l1e) & _PAGE_PRESENT) || !mfn_valid(mfn) )
return NULL;
ret:
- return map_domain_page(_mfn(mfn)) + (addr & ~PAGE_MASK);
+ return map_domain_page(mfn) + (addr & ~PAGE_MASK);
}
/*
++++++ xsa286-3.patch ++++++
x86/mm: avoid using linear page tables in map_guest_l1e()
Replace the linear L2 table access by an actual page walk.
This is part of XSA-286.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: George Dunlap <george.dunlap(a)citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -40,11 +40,14 @@ l1_pgentry_t *map_guest_l1e(unsigned lon
if ( unlikely(!__addr_ok(linear)) )
return NULL;
- /* Find this l1e and its enclosing l1mfn in the linear map. */
- if ( __copy_from_user(&l2e,
- &__linear_l2_table[l2_linear_offset(linear)],
- sizeof(l2_pgentry_t)) )
+ if ( unlikely(!(current->arch.flags & TF_kernel_mode)) )
+ {
+ ASSERT_UNREACHABLE();
return NULL;
+ }
+
+ /* Find this l1e and its enclosing l1mfn. */
+ l2e = page_walk_get_l2e(current->arch.guest_table, linear);
/* Check flags that it will be safe to read the l1e. */
if ( (l2e_get_flags(l2e) & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT )
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -100,6 +100,34 @@ static l3_pgentry_t page_walk_get_l3e(pa
return l3e;
}
+l2_pgentry_t page_walk_get_l2e(pagetable_t root, unsigned long addr)
+{
+ l3_pgentry_t l3e = page_walk_get_l3e(root, addr);
+ mfn_t mfn = l3e_get_mfn(l3e);
+ struct page_info *pg;
+ l2_pgentry_t l2e = l2e_empty();
+
+ if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) ||
+ (l3e_get_flags(l3e) & _PAGE_PSE) )
+ return l2e_empty();
+
+ pg = mfn_to_page(mfn);
+ if ( !page_lock(pg) )
+ return l2e_empty();
+
+ if ( (pg->u.inuse.type_info & PGT_type_mask) == PGT_l2_page_table )
+ {
+ l2_pgentry_t *l2t = map_domain_page(mfn);
+
+ l2e = l2t[l2_table_offset(addr)];
+ unmap_domain_page(l2t);
+ }
+
+ page_unlock(pg);
+
+ return l2e;
+}
+
void *do_page_walk(struct vcpu *v, unsigned long addr)
{
l3_pgentry_t l3e;
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -577,7 +577,9 @@ void audit_domains(void);
void make_cr3(struct vcpu *v, mfn_t mfn);
void update_cr3(struct vcpu *v);
int vcpu_destroy_pagetables(struct vcpu *);
+
void *do_page_walk(struct vcpu *v, unsigned long addr);
+l2_pgentry_t page_walk_get_l2e(pagetable_t root, unsigned long addr);
int __sync_local_execstate(void);
++++++ xsa286-4.patch ++++++
x86/mm: avoid using linear page tables in guest_get_eff_kern_l1e()
First of all drop guest_get_eff_l1e() entirely - there's no actual user
of it: pv_ro_page_fault() has a guest_kernel_mode() conditional around
its only call site.
Then replace the linear L1 table access by an actual page walk.
This is part of XSA-286.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: George Dunlap <george.dunlap(a)citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -59,27 +59,6 @@ l1_pgentry_t *map_guest_l1e(unsigned lon
}
/*
- * Read the guest's l1e that maps this address, from the kernel-mode
- * page tables.
- */
-static l1_pgentry_t guest_get_eff_kern_l1e(unsigned long linear)
-{
- struct vcpu *curr = current;
- const bool user_mode = !(curr->arch.flags & TF_kernel_mode);
- l1_pgentry_t l1e;
-
- if ( user_mode )
- toggle_guest_pt(curr);
-
- l1e = guest_get_eff_l1e(linear);
-
- if ( user_mode )
- toggle_guest_pt(curr);
-
- return l1e;
-}
-
-/*
* Map a guest's LDT page (covering the byte at @offset from start of the LDT)
* into Xen's virtual range. Returns true if the mapping changed, false
* otherwise.
--- a/xen/arch/x86/pv/mm.h
+++ b/xen/arch/x86/pv/mm.h
@@ -5,19 +5,19 @@ l1_pgentry_t *map_guest_l1e(unsigned lon
int new_guest_cr3(mfn_t mfn);
-/* Read a PV guest's l1e that maps this linear address. */
-static inline l1_pgentry_t guest_get_eff_l1e(unsigned long linear)
+/*
+ * Read the guest's l1e that maps this address, from the kernel-mode
+ * page tables.
+ */
+static inline l1_pgentry_t guest_get_eff_kern_l1e(unsigned long linear)
{
- l1_pgentry_t l1e;
+ l1_pgentry_t l1e = l1e_empty();
ASSERT(!paging_mode_translate(current->domain));
ASSERT(!paging_mode_external(current->domain));
- if ( unlikely(!__addr_ok(linear)) ||
- __copy_from_user(&l1e,
- &__linear_l1_table[l1_linear_offset(linear)],
- sizeof(l1_pgentry_t)) )
- l1e = l1e_empty();
+ if ( likely(__addr_ok(linear)) )
+ l1e = page_walk_get_l1e(current->arch.guest_table, linear);
return l1e;
}
--- a/xen/arch/x86/pv/ro-page-fault.c
+++ b/xen/arch/x86/pv/ro-page-fault.c
@@ -357,7 +357,7 @@ int pv_ro_page_fault(unsigned long addr,
bool mmio_ro;
/* Attempt to read the PTE that maps the VA being accessed. */
- pte = guest_get_eff_l1e(addr);
+ pte = guest_get_eff_kern_l1e(addr);
/* We are only looking for read-only mappings */
if ( ((l1e_get_flags(pte) & (_PAGE_PRESENT | _PAGE_RW)) != _PAGE_PRESENT) )
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -128,6 +128,62 @@ l2_pgentry_t page_walk_get_l2e(pagetable
return l2e;
}
+/*
+ * For now no "set_accessed" parameter, as all callers want it set to true.
+ * For now also no "set_dirty" parameter, as all callers deal with r/o
+ * mappings, and we don't want to set the dirty bit there (conflicts with
+ * CET-SS). However, as there are CPUs which may set the dirty bit on r/o
+ * PTEs, the logic below tolerates the bit becoming set "behind our backs".
+ */
+l1_pgentry_t page_walk_get_l1e(pagetable_t root, unsigned long addr)
+{
+ l2_pgentry_t l2e = page_walk_get_l2e(root, addr);
+ mfn_t mfn = l2e_get_mfn(l2e);
+ struct page_info *pg;
+ l1_pgentry_t l1e = l1e_empty();
+
+ if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) ||
+ (l2e_get_flags(l2e) & _PAGE_PSE) )
+ return l1e_empty();
+
+ pg = mfn_to_page(mfn);
+ if ( !page_lock(pg) )
+ return l1e_empty();
+
+ if ( (pg->u.inuse.type_info & PGT_type_mask) == PGT_l1_page_table )
+ {
+ l1_pgentry_t *l1t = map_domain_page(mfn);
+
+ l1e = l1t[l1_table_offset(addr)];
+
+ if ( (l1e_get_flags(l1e) & (_PAGE_ACCESSED | _PAGE_PRESENT)) ==
+ _PAGE_PRESENT )
+ {
+ l1_pgentry_t ol1e = l1e;
+
+ l1e_add_flags(l1e, _PAGE_ACCESSED);
+ /*
+ * Best effort only; with the lock held the page shouldn't
+ * change anyway, except for the dirty bit to perhaps become set.
+ */
+ while ( cmpxchg(&l1e_get_intpte(l1t[l1_table_offset(addr)]),
+ l1e_get_intpte(ol1e), l1e_get_intpte(l1e)) !=
+ l1e_get_intpte(ol1e) &&
+ !(l1e_get_flags(l1e) & _PAGE_DIRTY) )
+ {
+ l1e_add_flags(ol1e, _PAGE_DIRTY);
+ l1e_add_flags(l1e, _PAGE_DIRTY);
+ }
+ }
+
+ unmap_domain_page(l1t);
+ }
+
+ page_unlock(pg);
+
+ return l1e;
+}
+
void *do_page_walk(struct vcpu *v, unsigned long addr)
{
l3_pgentry_t l3e;
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -580,6 +580,7 @@ int vcpu_destroy_pagetables(struct vcpu
void *do_page_walk(struct vcpu *v, unsigned long addr);
l2_pgentry_t page_walk_get_l2e(pagetable_t root, unsigned long addr);
+l1_pgentry_t page_walk_get_l1e(pagetable_t root, unsigned long addr);
int __sync_local_execstate(void);
++++++ xsa286-5.patch ++++++
x86/mm: avoid using top level linear page tables in {,un}map_domain_page()
Move the page table recursion two levels down. This entails avoiding
freeing the recursive mapping prematurely in free_perdomain_mappings().
This is part of XSA-286.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: George Dunlap <george.dunlap(a)citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -65,7 +65,8 @@ void __init mapcache_override_current(st
#define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
#define MAPCACHE_L2_ENTRIES (mapcache_l2_entry(MAPCACHE_ENTRIES - 1) + 1)
#define MAPCACHE_L1ENT(idx) \
- __linear_l1_table[l1_linear_offset(MAPCACHE_VIRT_START + pfn_to_paddr(idx))]
+ ((l1_pgentry_t *)(MAPCACHE_VIRT_START | \
+ ((L2_PAGETABLE_ENTRIES - 1) << L2_PAGETABLE_SHIFT)))[idx]
void *map_domain_page(mfn_t mfn)
{
@@ -235,6 +236,7 @@ int mapcache_domain_init(struct domain *
{
struct mapcache_domain *dcache = &d->arch.pv.mapcache;
unsigned int bitmap_pages;
+ int rc;
ASSERT(is_pv_domain(d));
@@ -243,8 +245,10 @@ int mapcache_domain_init(struct domain *
return 0;
#endif
+ BUILD_BUG_ON(MAPCACHE_VIRT_START & ((1 << L3_PAGETABLE_SHIFT) - 1));
BUILD_BUG_ON(MAPCACHE_VIRT_END + PAGE_SIZE * (3 +
- 2 * PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long))) >
+ 2 * PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long))) +
+ (1U << L2_PAGETABLE_SHIFT) >
MAPCACHE_VIRT_START + (PERDOMAIN_SLOT_MBYTES << 20));
bitmap_pages = PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long));
dcache->inuse = (void *)MAPCACHE_VIRT_END + PAGE_SIZE;
@@ -253,9 +257,25 @@ int mapcache_domain_init(struct domain *
spin_lock_init(&dcache->lock);
- return create_perdomain_mapping(d, (unsigned long)dcache->inuse,
- 2 * bitmap_pages + 1,
- NIL(l1_pgentry_t *), NULL);
+ rc = create_perdomain_mapping(d, (unsigned long)dcache->inuse,
+ 2 * bitmap_pages + 1,
+ NIL(l1_pgentry_t *), NULL);
+ if ( !rc )
+ {
+ /*
+ * Install mapping of our L2 table into its own last slot, for easy
+ * access to the L1 entries via MAPCACHE_L1ENT().
+ */
+ l3_pgentry_t *l3t = __map_domain_page(d->arch.perdomain_l3_pg);
+ l3_pgentry_t l3e = l3t[l3_table_offset(MAPCACHE_VIRT_END)];
+ l2_pgentry_t *l2t = map_l2t_from_l3e(l3e);
+
+ l2e_get_intpte(l2t[L2_PAGETABLE_ENTRIES - 1]) = l3e_get_intpte(l3e);
+ unmap_domain_page(l2t);
+ unmap_domain_page(l3t);
+ }
+
+ return rc;
}
int mapcache_vcpu_init(struct vcpu *v)
@@ -346,7 +366,7 @@ mfn_t domain_page_map_to_mfn(const void
else
{
ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
- pl1e = &__linear_l1_table[l1_linear_offset(va)];
+ pl1e = &MAPCACHE_L1ENT(PFN_DOWN(va - MAPCACHE_VIRT_START));
}
return l1e_get_mfn(*pl1e);
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6024,6 +6024,10 @@ void free_perdomain_mappings(struct doma
{
struct page_info *l1pg = l2e_get_page(l2tab[j]);
+ /* mapcache_domain_init() installs a recursive entry. */
+ if ( l1pg == l2pg )
+ continue;
+
if ( l2e_get_flags(l2tab[j]) & _PAGE_AVAIL0 )
{
l1_pgentry_t *l1tab = __map_domain_page(l1pg);
++++++ xsa286-6.patch ++++++
x86/mm: restrict use of linear page tables to shadow mode code
Other code does not require them to be set up anymore, so restrict when
to populate the respective L4 slot and reduce visibility of the
accessors.
While with the removal of all uses the vulnerability is actually fixed,
removing the creation of the linear mapping adds an extra layer of
protection. Similarly reducing visibility of the accessors mostly
eliminates the risk of undue re-introduction of uses of the linear
mappings.
This is (not strictly) part of XSA-286.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: George Dunlap <george.dunlap(a)citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1750,9 +1750,10 @@ void init_xen_l4_slots(l4_pgentry_t *l4t
l4t[l4_table_offset(PCI_MCFG_VIRT_START)] =
idle_pg_table[l4_table_offset(PCI_MCFG_VIRT_START)];
- /* Slot 258: Self linear mappings. */
+ /* Slot 258: Self linear mappings (shadow pt only). */
ASSERT(!mfn_eq(l4mfn, INVALID_MFN));
l4t[l4_table_offset(LINEAR_PT_VIRT_START)] =
+ !shadow_mode_external(d) ? l4e_empty() :
l4e_from_mfn(l4mfn, __PAGE_HYPERVISOR_RW);
/* Slot 259: Shadow linear mappings (if applicable) .*/
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -135,6 +135,15 @@ enum {
# define GUEST_PTE_SIZE 4
#endif
+/* Where to find each level of the linear mapping */
+#define __linear_l1_table ((l1_pgentry_t *)(LINEAR_PT_VIRT_START))
+#define __linear_l2_table \
+ ((l2_pgentry_t *)(__linear_l1_table + l1_linear_offset(LINEAR_PT_VIRT_START)))
+#define __linear_l3_table \
+ ((l3_pgentry_t *)(__linear_l2_table + l2_linear_offset(LINEAR_PT_VIRT_START)))
+#define __linear_l4_table \
+ ((l4_pgentry_t *)(__linear_l3_table + l3_linear_offset(LINEAR_PT_VIRT_START)))
+
/******************************************************************************
* Auditing routines
*/
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -833,9 +833,6 @@ void __init paging_init(void)
machine_to_phys_mapping_valid = 1;
- /* Set up linear page table mapping. */
- l4e_write(&idle_pg_table[l4_table_offset(LINEAR_PT_VIRT_START)],
- l4e_from_paddr(__pa(idle_pg_table), __PAGE_HYPERVISOR_RW));
return;
nomem:
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -193,7 +193,7 @@ extern unsigned char boot_edid_info[128]
*/
#define PCI_MCFG_VIRT_START (PML4_ADDR(257))
#define PCI_MCFG_VIRT_END (PCI_MCFG_VIRT_START + PML4_ENTRY_BYTES)
-/* Slot 258: linear page table (guest table). */
+/* Slot 258: linear page table (monitor table, HVM only). */
#define LINEAR_PT_VIRT_START (PML4_ADDR(258))
#define LINEAR_PT_VIRT_END (LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
/* Slot 259: linear page table (shadow table). */
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -274,19 +274,6 @@ void copy_page_sse2(void *, const void *
#define vmap_to_mfn(va) _mfn(l1e_get_pfn(*virt_to_xen_l1e((unsigned long)(va))))
#define vmap_to_page(va) mfn_to_page(vmap_to_mfn(va))
-#endif /* !defined(__ASSEMBLY__) */
-
-/* Where to find each level of the linear mapping */
-#define __linear_l1_table ((l1_pgentry_t *)(LINEAR_PT_VIRT_START))
-#define __linear_l2_table \
- ((l2_pgentry_t *)(__linear_l1_table + l1_linear_offset(LINEAR_PT_VIRT_START)))
-#define __linear_l3_table \
- ((l3_pgentry_t *)(__linear_l2_table + l2_linear_offset(LINEAR_PT_VIRT_START)))
-#define __linear_l4_table \
- ((l4_pgentry_t *)(__linear_l3_table + l3_linear_offset(LINEAR_PT_VIRT_START)))
-
-
-#ifndef __ASSEMBLY__
extern root_pgentry_t idle_pg_table[ROOT_PAGETABLE_ENTRIES];
extern l2_pgentry_t *compat_idle_pg_table_l2;
extern unsigned int m2p_compat_vstart;
++++++ xsa333.patch ++++++
From: Andrew Cooper <andrew.cooper3(a)citrix.com>
Subject: x86/pv: Handle the Intel-specific MSR_MISC_ENABLE correctly
This MSR doesn't exist on AMD hardware, and switching away from the safe
functions in the common MSR path was an erroneous change.
Partially revert the change.
This is XSA-333.
Fixes: 4fdc932b3cc ("x86/Intel: drop another 32-bit leftover")
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
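The difference between `rdmsrl()` and `rdmsr_safe()` is that the safe variant reports failure instead of faulting when the MSR doesn't exist (as MSR_MISC_ENABLE doesn't on AMD). A hedged Python model of the fixed control flow (a dict stands in for the MSR file; names are illustrative only):

```python
def rdmsr_safe(msrs, reg):
    """Return (err, value); err is True when the MSR is absent,
    rather than faulting like the non-safe accessor would."""
    if reg not in msrs:
        return True, 0
    return False, msrs[reg]

def read_misc_enable(msrs):
    """The patched read_msr() shape: bail out on error before using the value."""
    err, val = rdmsr_safe(msrs, "MISC_ENABLE")
    if err:
        return None  # falls through to the default handling in the real code
    return val       # guest_misc_enable() filtering elided
```

On "hardware" lacking the MSR the read now degrades gracefully instead of crashing, which is the substance of the partial revert.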
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -891,7 +891,8 @@ static int read_msr(unsigned int reg, ui
return X86EMUL_OKAY;
case MSR_IA32_MISC_ENABLE:
- rdmsrl(reg, *val);
+ if ( rdmsr_safe(reg, *val) )
+ break;
*val = guest_misc_enable(*val);
return X86EMUL_OKAY;
@@ -1031,7 +1032,8 @@ static int write_msr(unsigned int reg, u
break;
case MSR_IA32_MISC_ENABLE:
- rdmsrl(reg, temp);
+ if ( rdmsr_safe(reg, temp) )
+ break;
if ( val != guest_misc_enable(temp) )
goto invalid;
return X86EMUL_OKAY;
++++++ xsa334.patch ++++++
xen/memory: Don't skip the RCU unlock path in acquire_resource()
In the case that an HVM Stubdomain makes an XENMEM_acquire_resource hypercall,
the FIXME path will bypass rcu_unlock_domain() on the way out of the function.
Move the check to the start of the function. This does change the behaviour
of the get-size path for HVM Stubdomains, but that functionality is currently
broken and unused anyway, as well as being quite useless to entities which
can't actually map the resource anyway.
This is XSA-334.
Fixes: 83fa6552ce ("common: add a new mappable resource type: XENMEM_resource_grant_table")
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
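The bug class fixed here is a lock-balance leak: an early `return` on a path entered after the lock was taken skips the unlock. Moving the check ahead of acquisition removes the bad path entirely. A minimal sketch of the fixed shape, with a depth counter standing in for the RCU domain lock (all names hypothetical):

```python
EACCES = 13

class RcuLock:
    """Counts lock depth so an imbalance is observable."""
    def __init__(self):
        self.depth = 0
    def lock(self):
        self.depth += 1
    def unlock(self):
        self.depth -= 1

def acquire_resource(caller, rcu):
    """Fixed shape: do the capability check before taking any lock,
    so every exit path after rcu.lock() goes through rcu.unlock()."""
    if caller.get("translated") and not caller.get("hwdom"):
        return -EACCES      # early exit: no lock held yet
    rcu.lock()
    try:
        return 0            # real work elided
    finally:
        rcu.unlock()
```

After any call, balanced or denied, the lock depth is back to zero — the property the original FIXME placement violated.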
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1057,6 +1057,14 @@ static int acquire_resource(
xen_pfn_t mfn_list[32];
int rc;
+ /*
+ * FIXME: Until foreign pages inserted into the P2M are properly
+ * reference counted, it is unsafe to allow mapping of
+ * resource pages unless the caller is the hardware domain.
+ */
+ if ( paging_mode_translate(currd) && !is_hardware_domain(currd) )
+ return -EACCES;
+
if ( copy_from_guest(&xmar, arg, 1) )
return -EFAULT;
@@ -1113,14 +1121,6 @@ static int acquire_resource(
xen_pfn_t gfn_list[ARRAY_SIZE(mfn_list)];
unsigned int i;
- /*
- * FIXME: Until foreign pages inserted into the P2M are properly
- * reference counted, it is unsafe to allow mapping of
- * resource pages unless the caller is the hardware domain.
- */
- if ( !is_hardware_domain(currd) )
- return -EACCES;
-
if ( copy_from_guest(gfn_list, xmar.frame_list, xmar.nr_frames) )
rc = -EFAULT;
++++++ xsa336.patch ++++++
x86/vpt: fix race when migrating timers between vCPUs
The current vPT code will migrate the emulated timers between vCPUs
(change the pt->vcpu field) while just holding the destination lock,
either from create_periodic_time or pt_adjust_global_vcpu_target if
the global target is adjusted. Changing the periodic_timer vCPU field
in this way creates a race where a third party could grab the lock in
the unlocked region of pt_adjust_global_vcpu_target (or before
create_periodic_time performs the vcpu change) and then release the
lock from a different vCPU, creating a locking imbalance.
Introduce a per-domain rwlock in order to protect periodic_time
migration between vCPU lists. Taking the lock in read mode prevents
any timer from being migrated to a different vCPU, while taking it in
write mode allows performing migration of timers across vCPUs. The
per-vcpu locks are still used to protect all the other fields from the
periodic_timer struct.
Note that such migration shouldn't happen frequently, and hence
there's no performance drop as a result of such locking.
This is XSA-336.
Reported-by: Igor Druzhinin <igor.druzhinin(a)citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Tested-by: Igor Druzhinin <igor.druzhinin(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
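The locking scheme described above — take the per-domain migration lock first, then re-read `pt->vcpu` and take that vCPU's lock, while migration takes the same lock exclusively — can be modelled compactly. Python's stdlib has no read/write lock, so a plain `Lock` stands in here (in Xen it is an rwlock, so concurrent readers don't serialise); all names are illustrative:

```python
import threading

class Timer:
    def __init__(self, domain, vcpu):
        self.domain = domain
        self.vcpu = vcpu

def pt_lock(pt, vcpu_locks, migrate_lock):
    """Take the per-domain migration lock first, then (re-)read pt.vcpu
    and take that vCPU's lock -- pt.vcpu can no longer change under us."""
    migrate_lock.acquire()        # read_lock() in the real patch
    v = pt.vcpu                   # now stable
    vcpu_locks[v].acquire()
    return v

def pt_unlock(v, vcpu_locks, migrate_lock):
    vcpu_locks[v].release()
    migrate_lock.release()

def migrate_timer(pt, new_vcpu, migrate_lock):
    """Migration takes the lock exclusively (write_lock() in the patch),
    so no reader can hold a stale pt.vcpu across the change."""
    with migrate_lock:
        pt.vcpu = new_vcpu
```

The key detail, mirrored from the patch's comment, is that `pt.vcpu` is read only after the migration lock is held — the stale-vCPU window of the old retry loop is gone.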
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -646,6 +646,8 @@ int hvm_domain_initialise(struct domain
/* need link to containing domain */
d->arch.hvm.pl_time->domain = d;
+ rwlock_init(&d->arch.hvm.pl_time->pt_migrate);
+
/* Set the default IO Bitmap. */
if ( is_hardware_domain(d) )
{
--- a/xen/arch/x86/hvm/vpt.c
+++ b/xen/arch/x86/hvm/vpt.c
@@ -152,23 +152,32 @@ static int pt_irq_masked(struct periodic
return 1;
}
-static void pt_lock(struct periodic_time *pt)
+static void pt_vcpu_lock(struct vcpu *v)
{
- struct vcpu *v;
+ read_lock(&v->domain->arch.hvm.pl_time->pt_migrate);
+ spin_lock(&v->arch.hvm.tm_lock);
+}
- for ( ; ; )
- {
- v = pt->vcpu;
- spin_lock(&v->arch.hvm.tm_lock);
- if ( likely(pt->vcpu == v) )
- break;
- spin_unlock(&v->arch.hvm.tm_lock);
- }
+static void pt_vcpu_unlock(struct vcpu *v)
+{
+ spin_unlock(&v->arch.hvm.tm_lock);
+ read_unlock(&v->domain->arch.hvm.pl_time->pt_migrate);
+}
+
+static void pt_lock(struct periodic_time *pt)
+{
+ /*
+ * We cannot use pt_vcpu_lock here, because we need to acquire the
+ * per-domain lock first and then (re-)fetch the value of pt->vcpu, or
+ * else we might be using a stale value of pt->vcpu.
+ */
+ read_lock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);
+ spin_lock(&pt->vcpu->arch.hvm.tm_lock);
}
static void pt_unlock(struct periodic_time *pt)
{
- spin_unlock(&pt->vcpu->arch.hvm.tm_lock);
+ pt_vcpu_unlock(pt->vcpu);
}
static void pt_process_missed_ticks(struct periodic_time *pt)
@@ -218,7 +227,7 @@ void pt_save_timer(struct vcpu *v)
if ( v->pause_flags & VPF_blocked )
return;
- spin_lock(&v->arch.hvm.tm_lock);
+ pt_vcpu_lock(v);
list_for_each_entry ( pt, head, list )
if ( !pt->do_not_freeze )
@@ -226,7 +235,7 @@ void pt_save_timer(struct vcpu *v)
pt_freeze_time(v);
- spin_unlock(&v->arch.hvm.tm_lock);
+ pt_vcpu_unlock(v);
}
void pt_restore_timer(struct vcpu *v)
@@ -234,7 +243,7 @@ void pt_restore_timer(struct vcpu *v)
struct list_head *head = &v->arch.hvm.tm_list;
struct periodic_time *pt;
- spin_lock(&v->arch.hvm.tm_lock);
+ pt_vcpu_lock(v);
list_for_each_entry ( pt, head, list )
{
@@ -247,7 +256,7 @@ void pt_restore_timer(struct vcpu *v)
pt_thaw_time(v);
- spin_unlock(&v->arch.hvm.tm_lock);
+ pt_vcpu_unlock(v);
}
static void pt_timer_fn(void *data)
@@ -308,7 +317,7 @@ int pt_update_irq(struct vcpu *v)
int irq, pt_vector = -1;
bool level;
- spin_lock(&v->arch.hvm.tm_lock);
+ pt_vcpu_lock(v);
earliest_pt = NULL;
max_lag = -1ULL;
@@ -338,7 +347,7 @@ int pt_update_irq(struct vcpu *v)
if ( earliest_pt == NULL )
{
- spin_unlock(&v->arch.hvm.tm_lock);
+ pt_vcpu_unlock(v);
return -1;
}
@@ -346,7 +355,7 @@ int pt_update_irq(struct vcpu *v)
irq = earliest_pt->irq;
level = earliest_pt->level;
- spin_unlock(&v->arch.hvm.tm_lock);
+ pt_vcpu_unlock(v);
switch ( earliest_pt->source )
{
@@ -393,7 +402,7 @@ int pt_update_irq(struct vcpu *v)
time_cb *cb = NULL;
void *cb_priv;
- spin_lock(&v->arch.hvm.tm_lock);
+ pt_vcpu_lock(v);
/* Make sure the timer is still on the list. */
list_for_each_entry ( pt, &v->arch.hvm.tm_list, list )
if ( pt == earliest_pt )
@@ -403,7 +412,7 @@ int pt_update_irq(struct vcpu *v)
cb_priv = pt->priv;
break;
}
- spin_unlock(&v->arch.hvm.tm_lock);
+ pt_vcpu_unlock(v);
if ( cb != NULL )
cb(v, cb_priv);
@@ -440,12 +449,12 @@ void pt_intr_post(struct vcpu *v, struct
if ( intack.source == hvm_intsrc_vector )
return;
- spin_lock(&v->arch.hvm.tm_lock);
+ pt_vcpu_lock(v);
pt = is_pt_irq(v, intack);
if ( pt == NULL )
{
- spin_unlock(&v->arch.hvm.tm_lock);
+ pt_vcpu_unlock(v);
return;
}
@@ -454,7 +463,7 @@ void pt_intr_post(struct vcpu *v, struct
cb = pt->cb;
cb_priv = pt->priv;
- spin_unlock(&v->arch.hvm.tm_lock);
+ pt_vcpu_unlock(v);
if ( cb != NULL )
cb(v, cb_priv);
@@ -465,12 +474,12 @@ void pt_migrate(struct vcpu *v)
struct list_head *head = &v->arch.hvm.tm_list;
struct periodic_time *pt;
- spin_lock(&v->arch.hvm.tm_lock);
+ pt_vcpu_lock(v);
list_for_each_entry ( pt, head, list )
migrate_timer(&pt->timer, v->processor);
- spin_unlock(&v->arch.hvm.tm_lock);
+ pt_vcpu_unlock(v);
}
void create_periodic_time(
@@ -489,7 +498,7 @@ void create_periodic_time(
destroy_periodic_time(pt);
- spin_lock(&v->arch.hvm.tm_lock);
+ write_lock(&v->domain->arch.hvm.pl_time->pt_migrate);
pt->pending_intr_nr = 0;
pt->do_not_freeze = 0;
@@ -539,7 +548,7 @@ void create_periodic_time(
init_timer(&pt->timer, pt_timer_fn, pt, v->processor);
set_timer(&pt->timer, pt->scheduled);
- spin_unlock(&v->arch.hvm.tm_lock);
+ write_unlock(&v->domain->arch.hvm.pl_time->pt_migrate);
}
void destroy_periodic_time(struct periodic_time *pt)
@@ -564,30 +573,20 @@ void destroy_periodic_time(struct period
static void pt_adjust_vcpu(struct periodic_time *pt, struct vcpu *v)
{
- int on_list;
-
ASSERT(pt->source == PTSRC_isa || pt->source == PTSRC_ioapic);
if ( pt->vcpu == NULL )
return;
- pt_lock(pt);
- on_list = pt->on_list;
- if ( pt->on_list )
- list_del(&pt->list);
- pt->on_list = 0;
- pt_unlock(pt);
-
- spin_lock(&v->arch.hvm.tm_lock);
+ write_lock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);
pt->vcpu = v;
- if ( on_list )
+ if ( pt->on_list )
{
- pt->on_list = 1;
+ list_del(&pt->list);
list_add(&pt->list, &v->arch.hvm.tm_list);
-
migrate_timer(&pt->timer, v->processor);
}
- spin_unlock(&v->arch.hvm.tm_lock);
+ write_unlock(&pt->vcpu->domain->arch.hvm.pl_time->pt_migrate);
}
void pt_adjust_global_vcpu_target(struct vcpu *v)
--- a/xen/include/asm-x86/hvm/vpt.h
+++ b/xen/include/asm-x86/hvm/vpt.h
@@ -134,6 +134,13 @@ struct pl_time { /* platform time */
struct RTCState vrtc;
struct HPETState vhpet;
struct PMTState vpmt;
+ /*
+ * rwlock to prevent periodic_time vCPU migration. Take the lock in read
+ * mode in order to prevent the vcpu field of periodic_time from changing.
+ * Lock must be taken in write mode when changes to the vcpu field are
+ * performed, as it allows exclusive access to all the timers of a domain.
+ */
+ rwlock_t pt_migrate;
/* guest_time = Xen sys time + stime_offset */
int64_t stime_offset;
/* Ensures monotonicity in appropriate timer modes. */
++++++ xsa337-1.patch ++++++
x86/msi: get rid of read_msi_msg
It's safer and faster to just use the cached last written
(untranslated) MSI message stored in msi_desc for the single user that
calls read_msi_msg.
This also prevents relying on the data read from the device MSI
registers in order to figure out the index into the IOMMU interrupt
remapping table, which is not safe.
This is XSA-337.
Requested-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -183,54 +183,6 @@ void msi_compose_msg(unsigned vector, co
MSI_DATA_VECTOR(vector);
}
-static bool read_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
-{
- switch ( entry->msi_attrib.type )
- {
- case PCI_CAP_ID_MSI:
- {
- struct pci_dev *dev = entry->dev;
- int pos = entry->msi_attrib.pos;
- uint16_t data;
-
- msg->address_lo = pci_conf_read32(dev->sbdf,
- msi_lower_address_reg(pos));
- if ( entry->msi_attrib.is_64 )
- {
- msg->address_hi = pci_conf_read32(dev->sbdf,
- msi_upper_address_reg(pos));
- data = pci_conf_read16(dev->sbdf, msi_data_reg(pos, 1));
- }
- else
- {
- msg->address_hi = 0;
- data = pci_conf_read16(dev->sbdf, msi_data_reg(pos, 0));
- }
- msg->data = data;
- break;
- }
- case PCI_CAP_ID_MSIX:
- {
- void __iomem *base = entry->mask_base;
-
- if ( unlikely(!msix_memory_decoded(entry->dev,
- entry->msi_attrib.pos)) )
- return false;
- msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- msg->data = readl(base + PCI_MSIX_ENTRY_DATA_OFFSET);
- break;
- }
- default:
- BUG();
- }
-
- if ( iommu_intremap )
- iommu_read_msi_from_ire(entry, msg);
-
- return true;
-}
-
static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
entry->msg = *msg;
@@ -302,10 +254,7 @@ void set_msi_affinity(struct irq_desc *d
ASSERT(spin_is_locked(&desc->lock));
- memset(&msg, 0, sizeof(msg));
- if ( !read_msi_msg(msi_desc, &msg) )
- return;
-
+ msg = msi_desc->msg;
msg.data &= ~MSI_DATA_VECTOR_MASK;
msg.data |= MSI_DATA_VECTOR(desc->arch.vector);
msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
++++++ xsa337-2.patch ++++++
x86/MSI-X: restrict reading of table/PBA bases from BARs
When assigned to less trusted or un-trusted guests, devices may change
state behind our backs (they may e.g. get reset by means we may not know
about). Therefore we should avoid reading BARs from hardware once a
device is no longer owned by Dom0. Furthermore when we can't read a BAR,
or when we read zero, we shouldn't instead use the caller-provided
address unless that caller can be trusted.
Re-arrange the logic in msix_capability_init() such that only Dom0 (and
only if the device isn't DomU-owned yet) or calls through
PHYSDEVOP_prepare_msix will actually result in the reading of the
respective BAR register(s). Additionally do so only as long as in-use
table entries are known (note that invocation of PHYSDEVOP_prepare_msix
counts as a "pseudo" entry). In all other uses the value already
recorded will get used instead.
Clear the recorded values in _pci_cleanup_msix() as well as on the one
affected error path. (Adjust this error path to also avoid blindly
disabling MSI-X when it was enabled on entry to the function.)
While moving around variable declarations (in many cases to reduce their
scopes), also adjust some of their types.
This is part of XSA-337.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Roger Pau Monné <roger.pau(a)citrix.com>
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -769,16 +769,14 @@ static int msix_capability_init(struct p
{
struct arch_msix *msix = dev->msix;
struct msi_desc *entry = NULL;
- int vf;
u16 control;
u64 table_paddr;
u32 table_offset;
- u8 bir, pbus, pslot, pfunc;
u16 seg = dev->seg;
u8 bus = dev->bus;
u8 slot = PCI_SLOT(dev->devfn);
u8 func = PCI_FUNC(dev->devfn);
- bool maskall = msix->host_maskall;
+ bool maskall = msix->host_maskall, zap_on_error = false;
unsigned int pos = pci_find_cap_offset(seg, bus, slot, func,
PCI_CAP_ID_MSIX);
@@ -820,43 +818,45 @@ static int msix_capability_init(struct p
/* Locate MSI-X table region */
table_offset = pci_conf_read32(dev->sbdf, msix_table_offset_reg(pos));
- bir = (u8)(table_offset & PCI_MSIX_BIRMASK);
- table_offset &= ~PCI_MSIX_BIRMASK;
+ if ( !msix->used_entries &&
+ (!msi ||
+ (is_hardware_domain(current->domain) &&
+ (dev->domain == current->domain || dev->domain == dom_io))) )
+ {
+ unsigned int bir = table_offset & PCI_MSIX_BIRMASK, pbus, pslot, pfunc;
+ int vf;
+ paddr_t pba_paddr;
+ unsigned int pba_offset;
- if ( !dev->info.is_virtfn )
- {
- pbus = bus;
- pslot = slot;
- pfunc = func;
- vf = -1;
- }
- else
- {
- pbus = dev->info.physfn.bus;
- pslot = PCI_SLOT(dev->info.physfn.devfn);
- pfunc = PCI_FUNC(dev->info.physfn.devfn);
- vf = PCI_BDF2(dev->bus, dev->devfn);
- }
-
- table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
- WARN_ON(msi && msi->table_base != table_paddr);
- if ( !table_paddr )
- {
- if ( !msi || !msi->table_base )
+ if ( !dev->info.is_virtfn )
{
- pci_conf_write16(dev->sbdf, msix_control_reg(pos),
- control & ~PCI_MSIX_FLAGS_ENABLE);
- xfree(entry);
- return -ENXIO;
+ pbus = bus;
+ pslot = slot;
+ pfunc = func;
+ vf = -1;
+ }
+ else
+ {
+ pbus = dev->info.physfn.bus;
+ pslot = PCI_SLOT(dev->info.physfn.devfn);
+ pfunc = PCI_FUNC(dev->info.physfn.devfn);
+ vf = PCI_BDF2(dev->bus, dev->devfn);
}
- table_paddr = msi->table_base;
- }
- table_paddr += table_offset;
- if ( !msix->used_entries )
- {
- u64 pba_paddr;
- u32 pba_offset;
+ table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
+ WARN_ON(msi && msi->table_base != table_paddr);
+ if ( !table_paddr )
+ {
+ if ( !msi || !msi->table_base )
+ {
+ pci_conf_write16(dev->sbdf, msix_control_reg(pos),
+ control & ~PCI_MSIX_FLAGS_ENABLE);
+ xfree(entry);
+ return -ENXIO;
+ }
+ table_paddr = msi->table_base;
+ }
+ table_paddr += table_offset & ~PCI_MSIX_BIRMASK;
msix->table.first = PFN_DOWN(table_paddr);
msix->table.last = PFN_DOWN(table_paddr +
@@ -875,7 +875,18 @@ static int msix_capability_init(struct p
BITS_TO_LONGS(msix->nr_entries) - 1);
WARN_ON(rangeset_overlaps_range(mmio_ro_ranges, msix->pba.first,
msix->pba.last));
+
+ zap_on_error = true;
+ }
+ else if ( !msix->table.first )
+ {
+ pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);
+ xfree(entry);
+ return -ENODATA;
}
+ else
+ table_paddr = (msix->table.first << PAGE_SHIFT) +
+ (table_offset & ~PCI_MSIX_BIRMASK & ~PAGE_MASK);
if ( entry )
{
@@ -886,8 +897,15 @@ static int msix_capability_init(struct p
if ( idx < 0 )
{
- pci_conf_write16(dev->sbdf, msix_control_reg(pos),
- control & ~PCI_MSIX_FLAGS_ENABLE);
+ if ( zap_on_error )
+ {
+ msix->table.first = 0;
+ msix->pba.first = 0;
+
+ control &= ~PCI_MSIX_FLAGS_ENABLE;
+ }
+
+ pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);
xfree(entry);
return idx;
}
@@ -1076,9 +1094,14 @@ static void _pci_cleanup_msix(struct arc
if ( rangeset_remove_range(mmio_ro_ranges, msix->table.first,
msix->table.last) )
WARN();
+ msix->table.first = 0;
+ msix->table.last = 0;
+
if ( rangeset_remove_range(mmio_ro_ranges, msix->pba.first,
msix->pba.last) )
WARN();
+ msix->pba.first = 0;
+ msix->pba.last = 0;
}
}
++++++ xsa338.patch ++++++
evtchn: relax port_is_valid()
To avoid ports potentially becoming invalid behind the back of certain
other functions (due to ->max_evtchns shrinking) because of
- a guest invoking evtchn_reset() while a 2nd vCPU opens new
channels in parallel (see also XSA-343),
- alloc_unbound_xen_event_channel() produced channels living above the
2-level range (see also XSA-342),
drop the max_evtchns check from port_is_valid(). For a port for which
the function once returned "true", the returned value may not turn into
"false" later on. The function's result may only depend on bounds which
can only ever grow (which is the case for d->valid_evtchns).
This also eliminates a false sense of safety, utilized by some of the
users (see again XSA-343): Without a suitable lock held, d->max_evtchns
may change at any time, and hence deducing that certain other operations
are safe when port_is_valid() returned true is not legitimate. The
opportunities to abuse this may get widened by the change here
(depending on guest and host configuration), but will be taken care of
by the other XSA.
This is XSA-338.
Fixes: 48974e6ce52e ("evtchn: use a per-domain variable for the max number of event channels")
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Stefano Stabellini <sstabellini(a)kernel.org>
Reviewed-by: Julien Grall <jgrall(a)amazon.com>
--- a/xen/include/xen/event.h
+++ b/xen/include/xen/event.h
@@ -107,8 +107,6 @@ void notify_via_xen_event_channel(struct
static inline bool_t port_is_valid(struct domain *d, unsigned int p)
{
- if ( p >= d->max_evtchns )
- return 0;
return p < read_atomic(&d->valid_evtchns);
}
++++++ xsa339.patch ++++++
x86/pv: Avoid double exception injection
There is at least one path (SYSENTER with NT set, Xen converts to #GP) which
ends up injecting the #GP fault twice, first in compat_sysenter(), and then a
second time in compat_test_all_events(), due to the stale TBF_EXCEPTION left
in TRAPBOUNCE_flags.
The guest kernel sees the second fault first, which is a kernel level #GP
pointing at the head of the #GP handler, and is therefore a userspace
trigger-able DoS.
This particular bug has bitten us several times before, so rearrange
{compat_,}create_bounce_frame() to clobber TRAPBOUNCE on success, rather than
leaving this task to one area of code which isn't used uniformly.
Other scenarios which might result in a double injection (e.g. two calls
directly to compat_create_bounce_frame) will now crash the guest, which is far
more obvious than letting the kernel run with corrupt state.
This is XSA-339.
Fixes: fdac9515607b ("x86: clear EFLAGS.NT in SYSENTER entry path")
Signed-off-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
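The essence of the fix — the injection path consumes and clobbers the trapbounce state itself, rather than trusting every caller to clear TBF_EXCEPTION afterwards — can be modeled in a few lines of C. This is a hypothetical user-space model, not the actual entry.S assembly; field and function names are illustrative only.

```c
#include <assert.h>

/* Minimal model of TRAPBOUNCE state; field names are illustrative. */
struct trapbounce {
    unsigned char flags;
    unsigned long eip;
};
#define TBF_EXCEPTION 1

static int injections;  /* counts deliveries to the "guest" */

/* On success, clobber the state so a second dispatch is a no-op. */
static int create_bounce_frame(struct trapbounce *tb)
{
    if ( !(tb->flags & TBF_EXCEPTION) )
        return 0;          /* nothing pending: no second injection */
    injections++;          /* deliver the event exactly once */
    tb->flags = 0;         /* the fix: consume state at the point of use */
    tb->eip = 0;
    return 1;
}
```

Before the fix, the equivalent of `tb->flags = 0` lived in one caller only, so a second dispatch through another path re-delivered the stale event.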
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -78,7 +78,6 @@ compat_process_softirqs:
sti
.Lcompat_bounce_exception:
call compat_create_bounce_frame
- movb $0, TRAPBOUNCE_flags(%rdx)
jmp compat_test_all_events
ALIGN
@@ -349,7 +348,13 @@ __UNLIKELY_END(compat_bounce_null_select
movl %eax,UREGS_cs+8(%rsp)
movl TRAPBOUNCE_eip(%rdx),%eax
movl %eax,UREGS_rip+8(%rsp)
+
+ /* Trapbounce complete. Clobber state to avoid an erroneous second injection. */
+ xor %eax, %eax
+ mov %ax, TRAPBOUNCE_cs(%rdx)
+ mov %al, TRAPBOUNCE_flags(%rdx)
ret
+
.section .fixup,"ax"
.Lfx13:
xorl %edi,%edi
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -90,7 +90,6 @@ process_softirqs:
sti
.Lbounce_exception:
call create_bounce_frame
- movb $0, TRAPBOUNCE_flags(%rdx)
jmp test_all_events
ALIGN
@@ -495,6 +494,11 @@ UNLIKELY_START(z, create_bounce_frame_ba
jmp asm_domain_crash_synchronous /* Does not return */
__UNLIKELY_END(create_bounce_frame_bad_bounce_ip)
movq %rax,UREGS_rip+8(%rsp)
+
+ /* Trapbounce complete. Clobber state to avoid an erroneous second injection. */
+ xor %eax, %eax
+ mov %rax, TRAPBOUNCE_eip(%rdx)
+ mov %al, TRAPBOUNCE_flags(%rdx)
ret
.pushsection .fixup, "ax", @progbits
++++++ xsa340.patch ++++++
xen/evtchn: Add missing barriers when accessing/allocating an event channel
While the allocation of a bucket is always performed with the per-domain
lock, the bucket may be accessed without the lock taken (for instance, see
evtchn_send()).
Instead, such sites rely on port_is_valid() to return a non-zero value
when the port has a struct evtchn associated with it. The function will
mostly check whether the port is less than d->valid_evtchns, as all the
buckets/event channels should be allocated up to that point.
Unfortunately, a compiler is free to re-order the assignments in
evtchn_allocate_port(), so it would be possible to have d->valid_evtchns
updated before the new bucket has finished being allocated.
Additionally, on Arm, even if this were compiled "correctly", the
processor could still re-order the memory accesses.
Add a write memory barrier in the allocation side and a read memory
barrier when the port is valid to prevent any re-ordering issue.
This is XSA-340.
Signed-off-by: Julien Grall <jgrall(a)amazon.com>
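The publish/consume ordering this patch establishes can be sketched in user-space C11, using release/acquire atomics as a stand-in for Xen's smp_wmb()/smp_rmb(). This is an illustrative model under stated assumptions, not Xen code; `allocate_bucket` and the bucket array are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the bucket array and d->valid_evtchns. */
#define EVTCHNS_PER_BUCKET 64

static int *buckets[16];
static atomic_uint valid_evtchns;

/*
 * Writer side (cf. evtchn_allocate_port): set up the bucket first, then
 * advance the count with release semantics, playing the smp_wmb() role.
 */
static void allocate_bucket(unsigned int idx)
{
    buckets[idx] = calloc(EVTCHNS_PER_BUCKET, sizeof(int));
    atomic_store_explicit(&valid_evtchns,
                          (idx + 1) * EVTCHNS_PER_BUCKET,
                          memory_order_release);
}

/*
 * Reader side (cf. port_is_valid): the acquire load plays the smp_rmb()
 * role, so seeing the port as valid implies the bucket setup is visible.
 */
static int port_is_valid(unsigned int port)
{
    return port < atomic_load_explicit(&valid_evtchns,
                                       memory_order_acquire);
}
```

Without the release/acquire (wmb/rmb) pairing, a reader could observe the enlarged count yet still see a NULL bucket pointer — exactly the race the patch closes.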
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -178,6 +178,13 @@ int evtchn_allocate_port(struct domain *
return -ENOMEM;
bucket_from_port(d, port) = chn;
+ /*
+ * d->valid_evtchns is used to check whether the bucket can be
+ * accessed without the per-domain lock. Therefore,
+ * d->valid_evtchns should be seen *after* the new bucket has
+ * been setup.
+ */
+ smp_wmb();
write_atomic(&d->valid_evtchns, d->valid_evtchns + EVTCHNS_PER_BUCKET);
}
--- a/xen/include/xen/event.h
+++ b/xen/include/xen/event.h
@@ -107,7 +107,17 @@ void notify_via_xen_event_channel(struct
static inline bool_t port_is_valid(struct domain *d, unsigned int p)
{
- return p < read_atomic(&d->valid_evtchns);
+ if ( p >= read_atomic(&d->valid_evtchns) )
+ return false;
+
+ /*
+ * The caller will usually access the event channel afterwards and
+     * may be done without taking the per-domain lock. This barrier
+     * pairs with the smp_wmb() barrier in evtchn_allocate_port().
+ */
+ smp_rmb();
+
+ return true;
}
static inline struct evtchn *evtchn_from_port(struct domain *d, unsigned int p)
++++++ xsa342.patch ++++++
evtchn/x86: enforce correct upper limit for 32-bit guests
The recording of d->max_evtchns in evtchn_2l_init(), in particular with
the limited set of callers of the function, is insufficient. Neither for
PV nor for HVM guests is the bitness known at domain_create() time, yet
the upper bound in 2-level mode depends upon guest bitness. Recording
too high a limit "allows" x86 32-bit domains to open not properly usable
event channels, management of which (inside Xen) would then result in
corruption of the shared info and vCPU info structures.
Keep the upper limit dynamic for the 2-level case, introducing a helper
function to retrieve the effective limit. This helper is now supposed to
be private to the event channel code. The uses in do_poll() and
domain_dump_evtchn_info() weren't consistent with port uses elsewhere
and hence get switched to port_is_valid().
Furthermore FIFO mode's setup_ports() gets adjusted to loop only up to
the prior ABI limit, rather than all the way up to the new one.
Finally a word on the change to do_poll(): Accessing ->max_evtchns
without holding a suitable lock was never safe, as it as well as
->evtchn_port_ops may change behind do_poll()'s back. Using
port_is_valid() instead somewhat widens the window for potential abuse,
until we've dealt with the race altogether (see XSA-343).
This is XSA-342.
Fixes: 48974e6ce52e ("evtchn: use a per-domain variable for the max number of event channels")
Reported-by: Julien Grall <jgrall(a)amazon.com>
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Stefano Stabellini <sstabellini(a)kernel.org>
Reviewed-by: Julien Grall <jgrall(a)amazon.com>
--- a/xen/common/event_2l.c
+++ b/xen/common/event_2l.c
@@ -103,7 +103,6 @@ static const struct evtchn_port_ops evtc
void evtchn_2l_init(struct domain *d)
{
d->evtchn_port_ops = &evtchn_port_ops_2l;
- d->max_evtchns = BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d);
}
/*
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -151,7 +151,7 @@ static void free_evtchn_bucket(struct do
int evtchn_allocate_port(struct domain *d, evtchn_port_t port)
{
- if ( port > d->max_evtchn_port || port >= d->max_evtchns )
+ if ( port > d->max_evtchn_port || port >= max_evtchns(d) )
return -ENOSPC;
if ( port_is_valid(d, port) )
@@ -1396,13 +1396,11 @@ static void domain_dump_evtchn_info(stru
spin_lock(&d->event_lock);
- for ( port = 1; port < d->max_evtchns; ++port )
+ for ( port = 1; port_is_valid(d, port); ++port )
{
const struct evtchn *chn;
char *ssid;
- if ( !port_is_valid(d, port) )
- continue;
chn = evtchn_from_port(d, port);
if ( chn->state == ECS_FREE )
continue;
--- a/xen/common/event_fifo.c
+++ b/xen/common/event_fifo.c
@@ -478,7 +478,7 @@ static void cleanup_event_array(struct d
d->evtchn_fifo = NULL;
}
-static void setup_ports(struct domain *d)
+static void setup_ports(struct domain *d, unsigned int prev_evtchns)
{
unsigned int port;
@@ -488,7 +488,7 @@ static void setup_ports(struct domain *d
* - save its pending state.
* - set default priority.
*/
- for ( port = 1; port < d->max_evtchns; port++ )
+ for ( port = 1; port < prev_evtchns; port++ )
{
struct evtchn *evtchn;
@@ -546,6 +546,8 @@ int evtchn_fifo_init_control(struct evtc
if ( !d->evtchn_fifo )
{
struct vcpu *vcb;
+ /* Latch the value before it changes during setup_event_array(). */
+ unsigned int prev_evtchns = max_evtchns(d);
for_each_vcpu ( d, vcb ) {
rc = setup_control_block(vcb);
@@ -562,8 +564,7 @@ int evtchn_fifo_init_control(struct evtc
goto error;
d->evtchn_port_ops = &evtchn_port_ops_fifo;
- d->max_evtchns = EVTCHN_FIFO_NR_CHANNELS;
- setup_ports(d);
+ setup_ports(d, prev_evtchns);
}
else
rc = map_control_block(v, gfn, offset);
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1434,7 +1434,7 @@ static long do_poll(struct sched_poll *s
goto out;
rc = -EINVAL;
- if ( port >= d->max_evtchns )
+ if ( !port_is_valid(d, port) )
goto out;
rc = 0;
--- a/xen/include/xen/event.h
+++ b/xen/include/xen/event.h
@@ -105,6 +105,12 @@ void notify_via_xen_event_channel(struct
#define bucket_from_port(d, p) \
((group_from_port(d, p))[((p) % EVTCHNS_PER_GROUP) / EVTCHNS_PER_BUCKET])
+static inline unsigned int max_evtchns(const struct domain *d)
+{
+ return d->evtchn_fifo ? EVTCHN_FIFO_NR_CHANNELS
+ : BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d);
+}
+
static inline bool_t port_is_valid(struct domain *d, unsigned int p)
{
if ( p >= read_atomic(&d->valid_evtchns) )
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -382,7 +382,6 @@ struct domain
/* Event channel information. */
struct evtchn *evtchn; /* first bucket only */
struct evtchn **evtchn_group[NR_EVTCHN_GROUPS]; /* all other buckets */
- unsigned int max_evtchns; /* number supported by ABI */
unsigned int max_evtchn_port; /* max permitted port number */
unsigned int valid_evtchns; /* number of allocated event channels */
spinlock_t event_lock;
++++++ xsa343-1.patch ++++++
evtchn: evtchn_reset() may not succeed with still-open ports
While the function closes all ports, it does so without holding any
lock, and hence racing requests may be issued causing new ports to get
opened. This would have been problematic in particular if such a newly
opened port had a port number above the new implementation limit (i.e.
when switching from FIFO to 2-level) after the reset, as prior to
"evtchn: relax port_is_valid()" this could have led to e.g.
evtchn_close()'s "BUG_ON(!port_is_valid(d2, port2))" to trigger.
Introduce a counter of active ports and check that it's (still) no
larger than the number of Xen internally used ones after obtaining the
necessary lock in evtchn_reset().
As to the access model of the new {active,xen}_evtchns fields - while
all writes get done using write_atomic(), reads ought to use
read_atomic() only when outside of a suitably locked region.
Note that as of now evtchn_bind_virq() and evtchn_bind_ipi() don't have
a need to call check_free_port().
This is part of XSA-343.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Stefano Stabellini <sstabellini(a)kernel.org>
Reviewed-by: Julien Grall <jgrall(a)amazon.com>
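The counter discipline described above — writers update under the event lock using atomic stores, while lock-free readers use atomic loads — can be sketched in user-space C with a pthread mutex standing in for d->event_lock. This is a hedged model of the idea, not Xen's implementation; `open_port` and `reset_allowed` are hypothetical helpers.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical model of the patch's counters and locking discipline. */
static pthread_mutex_t event_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint active_evtchns;   /* all in-use ports */
static atomic_uint xen_evtchns;      /* Xen-internal ports */

/* Writers update the counters only while holding event_lock
 * (cf. write_atomic() under d->event_lock in the patch). */
static void open_port(int xen_owned)
{
    pthread_mutex_lock(&event_lock);
    atomic_store(&active_evtchns, atomic_load(&active_evtchns) + 1);
    if ( xen_owned )
        atomic_store(&xen_evtchns, atomic_load(&xen_evtchns) + 1);
    pthread_mutex_unlock(&event_lock);
}

/* evtchn_reset()'s new check: refuse while any port beyond the
 * Xen-internal ones is still open. */
static int reset_allowed(void)
{
    int ok;

    pthread_mutex_lock(&event_lock);
    ok = atomic_load(&active_evtchns) <= atomic_load(&xen_evtchns);
    pthread_mutex_unlock(&event_lock);
    return ok;
}
```

Readers that skip the lock would use the atomic loads alone (cf. read_atomic()), accepting a momentarily stale value; the check that matters is always made with the lock held.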
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -188,6 +188,8 @@ int evtchn_allocate_port(struct domain *
write_atomic(&d->valid_evtchns, d->valid_evtchns + EVTCHNS_PER_BUCKET);
}
+ write_atomic(&d->active_evtchns, d->active_evtchns + 1);
+
return 0;
}
@@ -211,11 +213,26 @@ static int get_free_port(struct domain *
return -ENOSPC;
}
+/*
+ * Check whether a port is still marked free, and if so update the domain
+ * counter accordingly. To be used on function exit paths.
+ */
+static void check_free_port(struct domain *d, evtchn_port_t port)
+{
+ if ( port_is_valid(d, port) &&
+ evtchn_from_port(d, port)->state == ECS_FREE )
+ write_atomic(&d->active_evtchns, d->active_evtchns - 1);
+}
+
void evtchn_free(struct domain *d, struct evtchn *chn)
{
/* Clear pending event to avoid unexpected behavior on re-bind. */
evtchn_port_clear_pending(d, chn);
+ if ( consumer_is_xen(chn) )
+ write_atomic(&d->xen_evtchns, d->xen_evtchns - 1);
+ write_atomic(&d->active_evtchns, d->active_evtchns - 1);
+
/* Reset binding to vcpu0 when the channel is freed. */
chn->state = ECS_FREE;
chn->notify_vcpu_id = 0;
@@ -258,6 +275,7 @@ static long evtchn_alloc_unbound(evtchn_
alloc->port = port;
out:
+ check_free_port(d, port);
spin_unlock(&d->event_lock);
rcu_unlock_domain(d);
@@ -351,6 +369,7 @@ static long evtchn_bind_interdomain(evtc
bind->local_port = lport;
out:
+ check_free_port(ld, lport);
spin_unlock(&ld->event_lock);
if ( ld != rd )
spin_unlock(&rd->event_lock);
@@ -488,7 +507,7 @@ static long evtchn_bind_pirq(evtchn_bind
struct domain *d = current->domain;
struct vcpu *v = d->vcpu[0];
struct pirq *info;
- int port, pirq = bind->pirq;
+ int port = 0, pirq = bind->pirq;
long rc;
if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
@@ -536,6 +555,7 @@ static long evtchn_bind_pirq(evtchn_bind
arch_evtchn_bind_pirq(d, pirq);
out:
+ check_free_port(d, port);
spin_unlock(&d->event_lock);
return rc;
@@ -1011,10 +1031,10 @@ int evtchn_unmask(unsigned int port)
return 0;
}
-
int evtchn_reset(struct domain *d)
{
unsigned int i;
+ int rc = 0;
if ( d != current->domain && !d->controller_pause_count )
return -EINVAL;
@@ -1024,7 +1044,9 @@ int evtchn_reset(struct domain *d)
spin_lock(&d->event_lock);
- if ( d->evtchn_fifo )
+ if ( d->active_evtchns > d->xen_evtchns )
+ rc = -EAGAIN;
+ else if ( d->evtchn_fifo )
{
/* Switching back to 2-level ABI. */
evtchn_fifo_destroy(d);
@@ -1033,7 +1055,7 @@ int evtchn_reset(struct domain *d)
spin_unlock(&d->event_lock);
- return 0;
+ return rc;
}
static long evtchn_set_priority(const struct evtchn_set_priority *set_priority)
@@ -1219,10 +1241,9 @@ int alloc_unbound_xen_event_channel(
spin_lock(&ld->event_lock);
- rc = get_free_port(ld);
+ port = rc = get_free_port(ld);
if ( rc < 0 )
goto out;
- port = rc;
chn = evtchn_from_port(ld, port);
rc = xsm_evtchn_unbound(XSM_TARGET, ld, chn, remote_domid);
@@ -1238,7 +1259,10 @@ int alloc_unbound_xen_event_channel(
spin_unlock(&chn->lock);
+ write_atomic(&ld->xen_evtchns, ld->xen_evtchns + 1);
+
out:
+ check_free_port(ld, port);
spin_unlock(&ld->event_lock);
return rc < 0 ? rc : port;
@@ -1314,6 +1338,7 @@ int evtchn_init(struct domain *d, unsign
return -EINVAL;
}
evtchn_from_port(d, 0)->state = ECS_RESERVED;
+ write_atomic(&d->active_evtchns, 0);
#if MAX_VIRT_CPUS > BITS_PER_LONG
d->poll_mask = xzalloc_array(unsigned long, BITS_TO_LONGS(d->max_vcpus));
@@ -1340,6 +1365,8 @@ void evtchn_destroy(struct domain *d)
for ( i = 0; port_is_valid(d, i); i++ )
evtchn_close(d, i, 0);
+ ASSERT(!d->active_evtchns);
+
clear_global_virq_handlers(d);
evtchn_fifo_destroy(d);
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -384,6 +384,16 @@ struct domain
struct evtchn **evtchn_group[NR_EVTCHN_GROUPS]; /* all other buckets */
unsigned int max_evtchn_port; /* max permitted port number */
unsigned int valid_evtchns; /* number of allocated event channels */
+ /*
+ * Number of in-use event channels. Writers should use write_atomic().
+ * Readers need to use read_atomic() only when not holding event_lock.
+ */
+ unsigned int active_evtchns;
+ /*
+ * Number of event channels used internally by Xen (not subject to
+ * EVTCHNOP_reset). Read/write access like for active_evtchns.
+ */
+ unsigned int xen_evtchns;
spinlock_t event_lock;
const struct evtchn_port_ops *evtchn_port_ops;
struct evtchn_fifo_domain *evtchn_fifo;
++++++ xsa343-2.patch ++++++
evtchn: convert per-channel lock to be IRQ-safe
... in order for send_guest_{global,vcpu}_virq() to be able to make use
of it.
This is part of XSA-343.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Acked-by: Julien Grall <jgrall(a)amazon.com>
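The double-lock helper this patch reworks relies on a classic deadlock-avoidance rule: always acquire the lower-addressed lock first, and take the lock only once when both arguments name the same channel. A minimal user-space sketch with pthread mutexes (interrupt flags are modeled away; the real code uses spin_lock_irqsave()):

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative stand-in for struct evtchn; only the lock matters here. */
struct evtchn {
    pthread_mutex_t lock;
};

/* Address-ordered acquisition prevents an ABBA deadlock between two
 * CPUs locking the same pair of channels in opposite argument order. */
static void double_evtchn_lock(struct evtchn *lchn, struct evtchn *rchn)
{
    if ( lchn <= rchn )
    {
        pthread_mutex_lock(&lchn->lock);
        if ( lchn != rchn )            /* same channel: lock only once */
            pthread_mutex_lock(&rchn->lock);
    }
    else
    {
        pthread_mutex_lock(&rchn->lock);
        pthread_mutex_lock(&lchn->lock);
    }
}

static void double_evtchn_unlock(struct evtchn *lchn, struct evtchn *rchn)
{
    if ( lchn != rchn )
        pthread_mutex_unlock(&lchn->lock);
    pthread_mutex_unlock(&rchn->lock);
}
```

The `lchn <= rchn` comparison (rather than `<`) is what lets the helper also handle `lchn == rchn` without a self-deadlock — the same refinement the patch makes.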
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -248,6 +248,7 @@ static long evtchn_alloc_unbound(evtchn_
int port;
domid_t dom = alloc->dom;
long rc;
+ unsigned long flags;
d = rcu_lock_domain_by_any_id(dom);
if ( d == NULL )
@@ -263,14 +264,14 @@ static long evtchn_alloc_unbound(evtchn_
if ( rc )
goto out;
- spin_lock(&chn->lock);
+ spin_lock_irqsave(&chn->lock, flags);
chn->state = ECS_UNBOUND;
if ( (chn->u.unbound.remote_domid = alloc->remote_dom) == DOMID_SELF )
chn->u.unbound.remote_domid = current->domain->domain_id;
evtchn_port_init(d, chn);
- spin_unlock(&chn->lock);
+ spin_unlock_irqrestore(&chn->lock, flags);
alloc->port = port;
@@ -283,26 +284,32 @@ static long evtchn_alloc_unbound(evtchn_
}
-static void double_evtchn_lock(struct evtchn *lchn, struct evtchn *rchn)
+static unsigned long double_evtchn_lock(struct evtchn *lchn,
+ struct evtchn *rchn)
{
- if ( lchn < rchn )
+ unsigned long flags;
+
+ if ( lchn <= rchn )
{
- spin_lock(&lchn->lock);
- spin_lock(&rchn->lock);
+ spin_lock_irqsave(&lchn->lock, flags);
+ if ( lchn != rchn )
+ spin_lock(&rchn->lock);
}
else
{
- if ( lchn != rchn )
- spin_lock(&rchn->lock);
+ spin_lock_irqsave(&rchn->lock, flags);
spin_lock(&lchn->lock);
}
+
+ return flags;
}
-static void double_evtchn_unlock(struct evtchn *lchn, struct evtchn *rchn)
+static void double_evtchn_unlock(struct evtchn *lchn, struct evtchn *rchn,
+ unsigned long flags)
{
- spin_unlock(&lchn->lock);
if ( lchn != rchn )
- spin_unlock(&rchn->lock);
+ spin_unlock(&lchn->lock);
+ spin_unlock_irqrestore(&rchn->lock, flags);
}
static long evtchn_bind_interdomain(evtchn_bind_interdomain_t *bind)
@@ -312,6 +319,7 @@ static long evtchn_bind_interdomain(evtc
int lport, rport = bind->remote_port;
domid_t rdom = bind->remote_dom;
long rc;
+ unsigned long flags;
if ( rdom == DOMID_SELF )
rdom = current->domain->domain_id;
@@ -347,7 +355,7 @@ static long evtchn_bind_interdomain(evtc
if ( rc )
goto out;
- double_evtchn_lock(lchn, rchn);
+ flags = double_evtchn_lock(lchn, rchn);
lchn->u.interdomain.remote_dom = rd;
lchn->u.interdomain.remote_port = rport;
@@ -364,7 +372,7 @@ static long evtchn_bind_interdomain(evtc
*/
evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn);
- double_evtchn_unlock(lchn, rchn);
+ double_evtchn_unlock(lchn, rchn, flags);
bind->local_port = lport;
@@ -387,6 +395,7 @@ int evtchn_bind_virq(evtchn_bind_virq_t
struct domain *d = current->domain;
int virq = bind->virq, vcpu = bind->vcpu;
int rc = 0;
+ unsigned long flags;
if ( (virq < 0) || (virq >= ARRAY_SIZE(v->virq_to_evtchn)) )
return -EINVAL;
@@ -424,14 +433,14 @@ int evtchn_bind_virq(evtchn_bind_virq_t
chn = evtchn_from_port(d, port);
- spin_lock(&chn->lock);
+ spin_lock_irqsave(&chn->lock, flags);
chn->state = ECS_VIRQ;
chn->notify_vcpu_id = vcpu;
chn->u.virq = virq;
evtchn_port_init(d, chn);
- spin_unlock(&chn->lock);
+ spin_unlock_irqrestore(&chn->lock, flags);
v->virq_to_evtchn[virq] = bind->port = port;
@@ -448,6 +457,7 @@ static long evtchn_bind_ipi(evtchn_bind_
struct domain *d = current->domain;
int port, vcpu = bind->vcpu;
long rc = 0;
+ unsigned long flags;
if ( domain_vcpu(d, vcpu) == NULL )
return -ENOENT;
@@ -459,13 +469,13 @@ static long evtchn_bind_ipi(evtchn_bind_
chn = evtchn_from_port(d, port);
- spin_lock(&chn->lock);
+ spin_lock_irqsave(&chn->lock, flags);
chn->state = ECS_IPI;
chn->notify_vcpu_id = vcpu;
evtchn_port_init(d, chn);
- spin_unlock(&chn->lock);
+ spin_unlock_irqrestore(&chn->lock, flags);
bind->port = port;
@@ -509,6 +519,7 @@ static long evtchn_bind_pirq(evtchn_bind
struct pirq *info;
int port = 0, pirq = bind->pirq;
long rc;
+ unsigned long flags;
if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
return -EINVAL;
@@ -541,14 +552,14 @@ static long evtchn_bind_pirq(evtchn_bind
goto out;
}
- spin_lock(&chn->lock);
+ spin_lock_irqsave(&chn->lock, flags);
chn->state = ECS_PIRQ;
chn->u.pirq.irq = pirq;
link_pirq_port(port, chn, v);
evtchn_port_init(d, chn);
- spin_unlock(&chn->lock);
+ spin_unlock_irqrestore(&chn->lock, flags);
bind->port = port;
@@ -569,6 +580,7 @@ int evtchn_close(struct domain *d1, int
struct evtchn *chn1, *chn2;
int port2;
long rc = 0;
+ unsigned long flags;
again:
spin_lock(&d1->event_lock);
@@ -668,14 +680,14 @@ int evtchn_close(struct domain *d1, int
BUG_ON(chn2->state != ECS_INTERDOMAIN);
BUG_ON(chn2->u.interdomain.remote_dom != d1);
- double_evtchn_lock(chn1, chn2);
+ flags = double_evtchn_lock(chn1, chn2);
evtchn_free(d1, chn1);
chn2->state = ECS_UNBOUND;
chn2->u.unbound.remote_domid = d1->domain_id;
- double_evtchn_unlock(chn1, chn2);
+ double_evtchn_unlock(chn1, chn2, flags);
goto out;
@@ -683,9 +695,9 @@ int evtchn_close(struct domain *d1, int
BUG();
}
- spin_lock(&chn1->lock);
+ spin_lock_irqsave(&chn1->lock, flags);
evtchn_free(d1, chn1);
- spin_unlock(&chn1->lock);
+ spin_unlock_irqrestore(&chn1->lock, flags);
out:
if ( d2 != NULL )
@@ -705,13 +717,14 @@ int evtchn_send(struct domain *ld, unsig
struct evtchn *lchn, *rchn;
struct domain *rd;
int rport, ret = 0;
+ unsigned long flags;
if ( !port_is_valid(ld, lport) )
return -EINVAL;
lchn = evtchn_from_port(ld, lport);
- spin_lock(&lchn->lock);
+ spin_lock_irqsave(&lchn->lock, flags);
/* Guest cannot send via a Xen-attached event channel. */
if ( unlikely(consumer_is_xen(lchn)) )
@@ -746,7 +759,7 @@ int evtchn_send(struct domain *ld, unsig
}
out:
- spin_unlock(&lchn->lock);
+ spin_unlock_irqrestore(&lchn->lock, flags);
return ret;
}
@@ -1238,6 +1251,7 @@ int alloc_unbound_xen_event_channel(
{
struct evtchn *chn;
int port, rc;
+ unsigned long flags;
spin_lock(&ld->event_lock);
@@ -1250,14 +1264,14 @@ int alloc_unbound_xen_event_channel(
if ( rc )
goto out;
- spin_lock(&chn->lock);
+ spin_lock_irqsave(&chn->lock, flags);
chn->state = ECS_UNBOUND;
chn->xen_consumer = get_xen_consumer(notification_fn);
chn->notify_vcpu_id = lvcpu;
chn->u.unbound.remote_domid = remote_domid;
- spin_unlock(&chn->lock);
+ spin_unlock_irqrestore(&chn->lock, flags);
write_atomic(&ld->xen_evtchns, ld->xen_evtchns + 1);
@@ -1280,11 +1294,12 @@ void notify_via_xen_event_channel(struct
{
struct evtchn *lchn, *rchn;
struct domain *rd;
+ unsigned long flags;
ASSERT(port_is_valid(ld, lport));
lchn = evtchn_from_port(ld, lport);
- spin_lock(&lchn->lock);
+ spin_lock_irqsave(&lchn->lock, flags);
if ( likely(lchn->state == ECS_INTERDOMAIN) )
{
@@ -1294,7 +1309,7 @@ void notify_via_xen_event_channel(struct
evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
}
- spin_unlock(&lchn->lock);
+ spin_unlock_irqrestore(&lchn->lock, flags);
}
void evtchn_check_pollers(struct domain *d, unsigned int port)
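The hunks above convert the per-channel lock to spin_lock_irqsave() and have double_evtchn_lock() hand back the saved flags. The deadlock-avoidance core of such a dual lock is taking both locks in one global order; a minimal userspace sketch with pthread mutexes (all names here are toy stand-ins, and the interrupt-flags handling has no userspace equivalent, so it is omitted):

```c
#include <assert.h>
#include <pthread.h>

/* Toy stand-in for struct evtchn; not Xen's type. */
struct toy_chn {
    pthread_mutex_t lock;
};

/*
 * Acquire two per-channel locks in a globally consistent order (by
 * address), so two CPUs locking the same pair can never deadlock by
 * grabbing them in opposite orders. Xen's double_evtchn_lock() does
 * the same and additionally saves the local interrupt flags.
 */
static void toy_double_lock(struct toy_chn *a, struct toy_chn *b)
{
    if (a > b) { struct toy_chn *t = a; a = b; b = t; }
    pthread_mutex_lock(&a->lock);
    if (a != b)
        pthread_mutex_lock(&b->lock);
}

static void toy_double_unlock(struct toy_chn *a, struct toy_chn *b)
{
    /* Release order is irrelevant; just drop both. */
    pthread_mutex_unlock(&a->lock);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
}
```

Ordering by address guarantees that any two callers locking the same pair acquire the locks in the same sequence, regardless of argument order.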
++++++ xsa343-3.patch ++++++
evtchn: address races with evtchn_reset()
Neither d->evtchn_port_ops nor max_evtchns(d) may be used in an entirely
lock-less manner, as both may be changed by a racing evtchn_reset(). In the
common case, at least one of the domain's event lock or the per-channel
lock needs to be held. In the specific case of the inter-domain sending
by evtchn_send() and notify_via_xen_event_channel(), holding the other
side's per-channel lock is sufficient, as the channel can't change state
without both per-channel locks held. Without such a channel changing
state, evtchn_reset() can't complete successfully.
Lock-free accesses continue to be permitted for the shim (calling some
otherwise internal event channel functions), as this happens while the
domain is in effectively single-threaded mode. Special care also needs
taking for the shim's marking of in-use ports as ECS_RESERVED (allowing
use of such ports in the shim case is okay because switching into and
hence also out of FIFO mode is impossible there).
As a side effect, certain operations on Xen bound event channels which
were mistakenly permitted so far (e.g. unmask or poll) will be refused
now.
This is part of XSA-343.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Acked-by: Julien Grall <jgrall(a)amazon.com>
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2474,14 +2474,24 @@ static void dump_irqs(unsigned char key)
for ( i = 0; i < action->nr_guests; )
{
+ struct evtchn *evtchn;
+ unsigned int pending = 2, masked = 2;
+
d = action->guest[i++];
pirq = domain_irq_to_pirq(d, irq);
info = pirq_info(d, pirq);
+ evtchn = evtchn_from_port(d, info->evtchn);
+ local_irq_disable();
+ if ( spin_trylock(&evtchn->lock) )
+ {
+ pending = evtchn_is_pending(d, evtchn);
+ masked = evtchn_is_masked(d, evtchn);
+ spin_unlock(&evtchn->lock);
+ }
+ local_irq_enable();
printk("d%d:%3d(%c%c%c)%c",
- d->domain_id, pirq,
- evtchn_port_is_pending(d, info->evtchn) ? 'P' : '-',
- evtchn_port_is_masked(d, info->evtchn) ? 'M' : '-',
- info->masked ? 'M' : '-',
+ d->domain_id, pirq, "-P?"[pending],
+ "-M?"[masked], info->masked ? 'M' : '-',
i < action->nr_guests ? ',' : '\n');
}
}
--- a/xen/arch/x86/pv/shim.c
+++ b/xen/arch/x86/pv/shim.c
@@ -660,8 +660,11 @@ void pv_shim_inject_evtchn(unsigned int
if ( port_is_valid(guest, port) )
{
struct evtchn *chn = evtchn_from_port(guest, port);
+ unsigned long flags;
+ spin_lock_irqsave(&chn->lock, flags);
evtchn_port_set_pending(guest, chn->notify_vcpu_id, chn);
+ spin_unlock_irqrestore(&chn->lock, flags);
}
}
--- a/xen/common/event_2l.c
+++ b/xen/common/event_2l.c
@@ -63,8 +63,10 @@ static void evtchn_2l_unmask(struct doma
}
}
-static bool evtchn_2l_is_pending(const struct domain *d, evtchn_port_t port)
+static bool evtchn_2l_is_pending(const struct domain *d,
+ const struct evtchn *evtchn)
{
+ evtchn_port_t port = evtchn->port;
unsigned int max_ports = BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d);
ASSERT(port < max_ports);
@@ -72,8 +74,10 @@ static bool evtchn_2l_is_pending(const s
guest_test_bit(d, port, &shared_info(d, evtchn_pending)));
}
-static bool evtchn_2l_is_masked(const struct domain *d, evtchn_port_t port)
+static bool evtchn_2l_is_masked(const struct domain *d,
+ const struct evtchn *evtchn)
{
+ evtchn_port_t port = evtchn->port;
unsigned int max_ports = BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d);
ASSERT(port < max_ports);
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -156,8 +156,9 @@ int evtchn_allocate_port(struct domain *
if ( port_is_valid(d, port) )
{
- if ( evtchn_from_port(d, port)->state != ECS_FREE ||
- evtchn_port_is_busy(d, port) )
+ const struct evtchn *chn = evtchn_from_port(d, port);
+
+ if ( chn->state != ECS_FREE || evtchn_is_busy(d, chn) )
return -EBUSY;
}
else
@@ -774,6 +775,7 @@ void send_guest_vcpu_virq(struct vcpu *v
unsigned long flags;
int port;
struct domain *d;
+ struct evtchn *chn;
ASSERT(!virq_is_global(virq));
@@ -784,7 +786,10 @@ void send_guest_vcpu_virq(struct vcpu *v
goto out;
d = v->domain;
- evtchn_port_set_pending(d, v->vcpu_id, evtchn_from_port(d, port));
+ chn = evtchn_from_port(d, port);
+ spin_lock(&chn->lock);
+ evtchn_port_set_pending(d, v->vcpu_id, chn);
+ spin_unlock(&chn->lock);
out:
spin_unlock_irqrestore(&v->virq_lock, flags);
@@ -813,7 +818,9 @@ void send_guest_global_virq(struct domai
goto out;
chn = evtchn_from_port(d, port);
+ spin_lock(&chn->lock);
evtchn_port_set_pending(d, chn->notify_vcpu_id, chn);
+ spin_unlock(&chn->lock);
out:
spin_unlock_irqrestore(&v->virq_lock, flags);
@@ -823,6 +830,7 @@ void send_guest_pirq(struct domain *d, c
{
int port;
struct evtchn *chn;
+ unsigned long flags;
/*
* PV guests: It should not be possible to race with __evtchn_close(). The
@@ -837,7 +845,9 @@ void send_guest_pirq(struct domain *d, c
}
chn = evtchn_from_port(d, port);
+ spin_lock_irqsave(&chn->lock, flags);
evtchn_port_set_pending(d, chn->notify_vcpu_id, chn);
+ spin_unlock_irqrestore(&chn->lock, flags);
}
static struct domain *global_virq_handlers[NR_VIRQS] __read_mostly;
@@ -1034,12 +1044,15 @@ int evtchn_unmask(unsigned int port)
{
struct domain *d = current->domain;
struct evtchn *evtchn;
+ unsigned long flags;
if ( unlikely(!port_is_valid(d, port)) )
return -EINVAL;
evtchn = evtchn_from_port(d, port);
+ spin_lock_irqsave(&evtchn->lock, flags);
evtchn_port_unmask(d, evtchn);
+ spin_unlock_irqrestore(&evtchn->lock, flags);
return 0;
}
@@ -1449,8 +1462,8 @@ static void domain_dump_evtchn_info(stru
printk(" %4u [%d/%d/",
port,
- evtchn_port_is_pending(d, port),
- evtchn_port_is_masked(d, port));
+ evtchn_is_pending(d, chn),
+ evtchn_is_masked(d, chn));
evtchn_port_print_state(d, chn);
printk("]: s=%d n=%d x=%d",
chn->state, chn->notify_vcpu_id, chn->xen_consumer);
--- a/xen/common/event_fifo.c
+++ b/xen/common/event_fifo.c
@@ -296,23 +296,26 @@ static void evtchn_fifo_unmask(struct do
evtchn_fifo_set_pending(v, evtchn);
}
-static bool evtchn_fifo_is_pending(const struct domain *d, evtchn_port_t port)
+static bool evtchn_fifo_is_pending(const struct domain *d,
+ const struct evtchn *evtchn)
{
- const event_word_t *word = evtchn_fifo_word_from_port(d, port);
+ const event_word_t *word = evtchn_fifo_word_from_port(d, evtchn->port);
return word && guest_test_bit(d, EVTCHN_FIFO_PENDING, word);
}
-static bool_t evtchn_fifo_is_masked(const struct domain *d, evtchn_port_t port)
+static bool_t evtchn_fifo_is_masked(const struct domain *d,
+ const struct evtchn *evtchn)
{
- const event_word_t *word = evtchn_fifo_word_from_port(d, port);
+ const event_word_t *word = evtchn_fifo_word_from_port(d, evtchn->port);
return !word || guest_test_bit(d, EVTCHN_FIFO_MASKED, word);
}
-static bool_t evtchn_fifo_is_busy(const struct domain *d, evtchn_port_t port)
+static bool_t evtchn_fifo_is_busy(const struct domain *d,
+ const struct evtchn *evtchn)
{
- const event_word_t *word = evtchn_fifo_word_from_port(d, port);
+ const event_word_t *word = evtchn_fifo_word_from_port(d, evtchn->port);
return word && guest_test_bit(d, EVTCHN_FIFO_LINKED, word);
}
--- a/xen/include/asm-x86/event.h
+++ b/xen/include/asm-x86/event.h
@@ -47,4 +47,10 @@ static inline bool arch_virq_is_global(u
return true;
}
+#ifdef CONFIG_PV_SHIM
+# include <asm/pv/shim.h>
+# define arch_evtchn_is_special(chn) \
+ (pv_shim && (chn)->port && (chn)->state == ECS_RESERVED)
+#endif
+
#endif
--- a/xen/include/xen/event.h
+++ b/xen/include/xen/event.h
@@ -133,6 +133,24 @@ static inline struct evtchn *evtchn_from
return bucket_from_port(d, p) + (p % EVTCHNS_PER_BUCKET);
}
+/*
+ * "usable" as in "by a guest", i.e. Xen consumed channels are assumed to be
+ * taken care of separately where used for Xen's internal purposes.
+ */
+static bool evtchn_usable(const struct evtchn *evtchn)
+{
+ if ( evtchn->xen_consumer )
+ return false;
+
+#ifdef arch_evtchn_is_special
+ if ( arch_evtchn_is_special(evtchn) )
+ return true;
+#endif
+
+ BUILD_BUG_ON(ECS_FREE > ECS_RESERVED);
+ return evtchn->state > ECS_RESERVED;
+}
+
/* Wait on a Xen-attached event channel. */
#define wait_on_xen_event_channel(port, condition) \
do { \
@@ -165,19 +183,24 @@ int evtchn_reset(struct domain *d);
/*
* Low-level event channel port ops.
+ *
+ * All hooks have to be called with a lock held which prevents the channel
+ * from changing state. This may be the domain event lock, the per-channel
+ * lock, or in the case of sending interdomain events also the other side's
+ * per-channel lock. Exceptions apply in certain cases for the PV shim.
*/
struct evtchn_port_ops {
void (*init)(struct domain *d, struct evtchn *evtchn);
void (*set_pending)(struct vcpu *v, struct evtchn *evtchn);
void (*clear_pending)(struct domain *d, struct evtchn *evtchn);
void (*unmask)(struct domain *d, struct evtchn *evtchn);
- bool (*is_pending)(const struct domain *d, evtchn_port_t port);
- bool (*is_masked)(const struct domain *d, evtchn_port_t port);
+ bool (*is_pending)(const struct domain *d, const struct evtchn *evtchn);
+ bool (*is_masked)(const struct domain *d, const struct evtchn *evtchn);
/*
* Is the port unavailable because it's still being cleaned up
* after being closed?
*/
- bool (*is_busy)(const struct domain *d, evtchn_port_t port);
+ bool (*is_busy)(const struct domain *d, const struct evtchn *evtchn);
int (*set_priority)(struct domain *d, struct evtchn *evtchn,
unsigned int priority);
void (*print_state)(struct domain *d, const struct evtchn *evtchn);
@@ -193,38 +216,67 @@ static inline void evtchn_port_set_pendi
unsigned int vcpu_id,
struct evtchn *evtchn)
{
- d->evtchn_port_ops->set_pending(d->vcpu[vcpu_id], evtchn);
+ if ( evtchn_usable(evtchn) )
+ d->evtchn_port_ops->set_pending(d->vcpu[vcpu_id], evtchn);
}
static inline void evtchn_port_clear_pending(struct domain *d,
struct evtchn *evtchn)
{
- d->evtchn_port_ops->clear_pending(d, evtchn);
+ if ( evtchn_usable(evtchn) )
+ d->evtchn_port_ops->clear_pending(d, evtchn);
}
static inline void evtchn_port_unmask(struct domain *d,
struct evtchn *evtchn)
{
- d->evtchn_port_ops->unmask(d, evtchn);
+ if ( evtchn_usable(evtchn) )
+ d->evtchn_port_ops->unmask(d, evtchn);
}
-static inline bool evtchn_port_is_pending(const struct domain *d,
- evtchn_port_t port)
+static inline bool evtchn_is_pending(const struct domain *d,
+ const struct evtchn *evtchn)
{
- return d->evtchn_port_ops->is_pending(d, port);
+ return evtchn_usable(evtchn) && d->evtchn_port_ops->is_pending(d, evtchn);
}
-static inline bool evtchn_port_is_masked(const struct domain *d,
- evtchn_port_t port)
+static inline bool evtchn_port_is_pending(struct domain *d, evtchn_port_t port)
{
- return d->evtchn_port_ops->is_masked(d, port);
+ struct evtchn *evtchn = evtchn_from_port(d, port);
+ bool rc;
+ unsigned long flags;
+
+ spin_lock_irqsave(&evtchn->lock, flags);
+ rc = evtchn_is_pending(d, evtchn);
+ spin_unlock_irqrestore(&evtchn->lock, flags);
+
+ return rc;
+}
+
+static inline bool evtchn_is_masked(const struct domain *d,
+ const struct evtchn *evtchn)
+{
+ return !evtchn_usable(evtchn) || d->evtchn_port_ops->is_masked(d, evtchn);
+}
+
+static inline bool evtchn_port_is_masked(struct domain *d, evtchn_port_t port)
+{
+ struct evtchn *evtchn = evtchn_from_port(d, port);
+ bool rc;
+ unsigned long flags;
+
+ spin_lock_irqsave(&evtchn->lock, flags);
+ rc = evtchn_is_masked(d, evtchn);
+ spin_unlock_irqrestore(&evtchn->lock, flags);
+
+ return rc;
}
-static inline bool evtchn_port_is_busy(const struct domain *d,
- evtchn_port_t port)
+static inline bool evtchn_is_busy(const struct domain *d,
+ const struct evtchn *evtchn)
{
return d->evtchn_port_ops->is_busy &&
- d->evtchn_port_ops->is_busy(d, port);
+ d->evtchn_port_ops->is_busy(d, evtchn);
}
static inline int evtchn_port_set_priority(struct domain *d,
@@ -233,6 +285,8 @@ static inline int evtchn_port_set_priori
{
if ( !d->evtchn_port_ops->set_priority )
return -ENOSYS;
+ if ( !evtchn_usable(evtchn) )
+ return -EACCES;
return d->evtchn_port_ops->set_priority(d, evtchn, priority);
}
++++++ xsa344-1.patch ++++++
evtchn: arrange for preemption in evtchn_destroy()
Especially closing of fully established interdomain channels can take
quite some time, due to the locking involved. Therefore we shouldn't
assume we can clean up still active ports all in one go. Besides adding
the necessary preemption check, also avoid pointlessly starting from
(or now really ending at) 0; 1 is the lowest numbered port which may
need closing.
Since we're now reducing ->valid_evtchns, free_xen_event_channel(),
and (at least to be on the safe side) notify_via_xen_event_channel()
need to cope with attempts to close / unbind from / send through already
closed (and no longer valid, as per port_is_valid()) ports.
This is part of XSA-344.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Acked-by: Julien Grall <jgrall(a)amazon.com>
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -770,12 +770,14 @@ int domain_kill(struct domain *d)
return domain_kill(d);
d->is_dying = DOMDYING_dying;
argo_destroy(d);
- evtchn_destroy(d);
gnttab_release_mappings(d);
vnuma_destroy(d->vnuma);
domain_set_outstanding_pages(d, 0);
/* fallthrough */
case DOMDYING_dying:
+ rc = evtchn_destroy(d);
+ if ( rc )
+ break;
rc = domain_relinquish_resources(d);
if ( rc != 0 )
break;
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -1297,7 +1297,16 @@ int alloc_unbound_xen_event_channel(
void free_xen_event_channel(struct domain *d, int port)
{
- BUG_ON(!port_is_valid(d, port));
+ if ( !port_is_valid(d, port) )
+ {
+ /*
+ * Make sure ->is_dying is read /after/ ->valid_evtchns, pairing
+ * with the spin_barrier() and BUG_ON() in evtchn_destroy().
+ */
+ smp_rmb();
+ BUG_ON(!d->is_dying);
+ return;
+ }
evtchn_close(d, port, 0);
}
@@ -1309,7 +1318,17 @@ void notify_via_xen_event_channel(struct
struct domain *rd;
unsigned long flags;
- ASSERT(port_is_valid(ld, lport));
+ if ( !port_is_valid(ld, lport) )
+ {
+ /*
+ * Make sure ->is_dying is read /after/ ->valid_evtchns, pairing
+ * with the spin_barrier() and BUG_ON() in evtchn_destroy().
+ */
+ smp_rmb();
+ ASSERT(ld->is_dying);
+ return;
+ }
+
lchn = evtchn_from_port(ld, lport);
spin_lock_irqsave(&lchn->lock, flags);
@@ -1380,8 +1399,7 @@ int evtchn_init(struct domain *d, unsign
return 0;
}
-
-void evtchn_destroy(struct domain *d)
+int evtchn_destroy(struct domain *d)
{
unsigned int i;
@@ -1390,14 +1408,29 @@ void evtchn_destroy(struct domain *d)
spin_barrier(&d->event_lock);
/* Close all existing event channels. */
- for ( i = 0; port_is_valid(d, i); i++ )
+ for ( i = d->valid_evtchns; --i; )
+ {
evtchn_close(d, i, 0);
+ /*
+ * Avoid preempting when called from domain_create()'s error path,
+ * and don't check too often (choice of frequency is arbitrary).
+ */
+ if ( i && !(i & 0x3f) && d->is_dying != DOMDYING_dead &&
+ hypercall_preempt_check() )
+ {
+ write_atomic(&d->valid_evtchns, i);
+ return -ERESTART;
+ }
+ }
+
ASSERT(!d->active_evtchns);
clear_global_virq_handlers(d);
evtchn_fifo_destroy(d);
+
+ return 0;
}
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -136,7 +136,7 @@ struct evtchn
} __attribute__((aligned(64)));
int evtchn_init(struct domain *d, unsigned int max_port);
-void evtchn_destroy(struct domain *d); /* from domain_kill */
+int evtchn_destroy(struct domain *d); /* from domain_kill */
void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */
struct waitqueue_vcpu;
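The reworked evtchn_destroy() above walks ports from the top down, checks for preemption only every 64 ports, and records its progress in d->valid_evtchns before returning -ERESTART. A minimal userspace sketch of that loop shape (the variable names, the errno-style value, and the close operation are all illustrative stand-ins, not Xen's):

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ERESTART 85               /* illustrative value, not Xen's */

static unsigned int valid_ports;      /* stand-in for d->valid_evtchns */
static unsigned int closed;           /* counts the toy close operations */
static bool preempt_pending;          /* stand-in for hypercall_preempt_check() */

/*
 * Close ports valid_ports-1 down to 1, bailing out with -TOY_ERESTART
 * at most once per 64 ports; shrinking valid_ports first means a
 * rerun resumes exactly where this invocation stopped.
 */
static int toy_destroy(void)
{
    for (unsigned int i = valid_ports; --i; )
    {
        closed++;                     /* stand-in for evtchn_close(d, i, 0) */

        /* Check for preemption only every 64 ports (frequency arbitrary). */
        if ( i && !(i & 0x3f) && preempt_pending )
        {
            valid_ports = i;          /* record the resume point */
            return -TOY_ERESTART;
        }
    }
    return 0;
}
```

With 200 valid ports, an undisturbed run closes ports 199 down to 1; a preempted run stops at the first multiple of 64 it reaches and finishes the remainder on the next call.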
++++++ xsa344-2.patch ++++++
evtchn: arrange for preemption in evtchn_reset()
Like for evtchn_destroy(), looping over all possible event channels to
close them can take a significant amount of time. Unlike done there, we
can't alter domain properties (i.e. d->valid_evtchns) here. Borrow, in a
lightweight form, the paging domctl continuation concept, redirecting
the continuations to different sub-ops. Just like there, this is to
allow for predictable overall results of the involved sub-ops:
Racing requests should either complete or be refused.
Note that a domain can't interfere with an already started (by a remote
domain) reset, due to being paused. It can prevent a remote reset from
happening by leaving a reset unfinished, but that's only going to affect
itself.
This is part of XSA-344.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Acked-by: Julien Grall <jgrall(a)amazon.com>
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1214,7 +1214,7 @@ void domain_unpause_except_self(struct d
domain_unpause(d);
}
-int domain_soft_reset(struct domain *d)
+int domain_soft_reset(struct domain *d, bool resuming)
{
struct vcpu *v;
int rc;
@@ -1228,7 +1228,7 @@ int domain_soft_reset(struct domain *d)
}
spin_unlock(&d->shutdown_lock);
- rc = evtchn_reset(d);
+ rc = evtchn_reset(d, resuming);
if ( rc )
return rc;
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -572,12 +572,22 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
}
case XEN_DOMCTL_soft_reset:
+ case XEN_DOMCTL_soft_reset_cont:
if ( d == current->domain ) /* no domain_pause() */
{
ret = -EINVAL;
break;
}
- ret = domain_soft_reset(d);
+ ret = domain_soft_reset(d, op->cmd == XEN_DOMCTL_soft_reset_cont);
+ if ( ret == -ERESTART )
+ {
+ op->cmd = XEN_DOMCTL_soft_reset_cont;
+ if ( !__copy_field_to_guest(u_domctl, op, cmd) )
+ ret = hypercall_create_continuation(__HYPERVISOR_domctl,
+ "h", u_domctl);
+ else
+ ret = -EFAULT;
+ }
break;
case XEN_DOMCTL_destroydomain:
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -1057,7 +1057,7 @@ int evtchn_unmask(unsigned int port)
return 0;
}
-int evtchn_reset(struct domain *d)
+int evtchn_reset(struct domain *d, bool resuming)
{
unsigned int i;
int rc = 0;
@@ -1065,11 +1065,40 @@ int evtchn_reset(struct domain *d)
if ( d != current->domain && !d->controller_pause_count )
return -EINVAL;
- for ( i = 0; port_is_valid(d, i); i++ )
+ spin_lock(&d->event_lock);
+
+ /*
+ * If we are resuming, then start where we stopped. Otherwise, check
+ * that a reset operation is not already in progress, and if none is,
+ * record that this is now the case.
+ */
+ i = resuming ? d->next_evtchn : !d->next_evtchn;
+ if ( i > d->next_evtchn )
+ d->next_evtchn = i;
+
+ spin_unlock(&d->event_lock);
+
+ if ( !i )
+ return -EBUSY;
+
+ for ( ; port_is_valid(d, i); i++ )
+ {
evtchn_close(d, i, 1);
+ /* NB: Choice of frequency is arbitrary. */
+ if ( !(i & 0x3f) && hypercall_preempt_check() )
+ {
+ spin_lock(&d->event_lock);
+ d->next_evtchn = i;
+ spin_unlock(&d->event_lock);
+ return -ERESTART;
+ }
+ }
+
spin_lock(&d->event_lock);
+ d->next_evtchn = 0;
+
if ( d->active_evtchns > d->xen_evtchns )
rc = -EAGAIN;
else if ( d->evtchn_fifo )
@@ -1204,7 +1233,8 @@ long do_event_channel_op(int cmd, XEN_GU
break;
}
- case EVTCHNOP_reset: {
+ case EVTCHNOP_reset:
+ case EVTCHNOP_reset_cont: {
struct evtchn_reset reset;
struct domain *d;
@@ -1217,9 +1247,13 @@ long do_event_channel_op(int cmd, XEN_GU
rc = xsm_evtchn_reset(XSM_TARGET, current->domain, d);
if ( !rc )
- rc = evtchn_reset(d);
+ rc = evtchn_reset(d, cmd == EVTCHNOP_reset_cont);
rcu_unlock_domain(d);
+
+ if ( rc == -ERESTART )
+ rc = hypercall_create_continuation(__HYPERVISOR_event_channel_op,
+ "ih", EVTCHNOP_reset_cont, arg);
break;
}
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -1152,7 +1152,10 @@ struct xen_domctl {
#define XEN_DOMCTL_iomem_permission 20
#define XEN_DOMCTL_ioport_permission 21
#define XEN_DOMCTL_hypercall_init 22
-#define XEN_DOMCTL_arch_setup 23 /* Obsolete IA64 only */
+#ifdef __XEN__
+/* #define XEN_DOMCTL_arch_setup 23 Obsolete IA64 only */
+#define XEN_DOMCTL_soft_reset_cont 23
+#endif
#define XEN_DOMCTL_settimeoffset 24
#define XEN_DOMCTL_getvcpuaffinity 25
#define XEN_DOMCTL_real_mode_area 26 /* Obsolete PPC only */
--- a/xen/include/public/event_channel.h
+++ b/xen/include/public/event_channel.h
@@ -74,6 +74,9 @@
#define EVTCHNOP_init_control 11
#define EVTCHNOP_expand_array 12
#define EVTCHNOP_set_priority 13
+#ifdef __XEN__
+#define EVTCHNOP_reset_cont 14
+#endif
/* ` } */
typedef uint32_t evtchn_port_t;
--- a/xen/include/xen/event.h
+++ b/xen/include/xen/event.h
@@ -171,7 +171,7 @@ void evtchn_check_pollers(struct domain
void evtchn_2l_init(struct domain *d);
/* Close all event channels and reset to 2-level ABI. */
-int evtchn_reset(struct domain *d);
+int evtchn_reset(struct domain *d, bool resuming);
/*
* Low-level event channel port ops.
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -394,6 +394,8 @@ struct domain
* EVTCHNOP_reset). Read/write access like for active_evtchns.
*/
unsigned int xen_evtchns;
+ /* Port to resume from in evtchn_reset(), when in a continuation. */
+ unsigned int next_evtchn;
spinlock_t event_lock;
const struct evtchn_port_ops *evtchn_port_ops;
struct evtchn_fifo_domain *evtchn_fifo;
@@ -663,7 +665,7 @@ int domain_shutdown(struct domain *d, u8
void domain_resume(struct domain *d);
void domain_pause_for_debugger(void);
-int domain_soft_reset(struct domain *d);
+int domain_soft_reset(struct domain *d, bool resuming);
int vcpu_start_shutdown_deferral(struct vcpu *v);
void vcpu_end_shutdown_deferral(struct vcpu *v);
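The entry logic of the reworked evtchn_reset() above folds three cases into the computed starting port: a fresh reset begins at port 1 and records itself as in progress via d->next_evtchn, a continuation resumes at the stored port, and a racing fresh reset sees a non-zero d->next_evtchn and gets 0 (mapped to -EBUSY). A userspace sketch of just that selection (hypothetical names, locking omitted):

```c
#include <assert.h>
#include <stdbool.h>

static unsigned int next_evtchn;   /* 0 <=> no reset in progress */

/*
 * Pick the port a reset should start from. A fresh reset with no
 * reset pending computes !0 == 1, the lowest port worth closing, and
 * stores it to mark the reset as in progress. A continuation simply
 * reloads the stored port. A fresh reset racing an unfinished one
 * computes !non-zero == 0, which the caller treats as "busy".
 */
static unsigned int reset_start_port(bool resuming)
{
    unsigned int i = resuming ? next_evtchn : !next_evtchn;

    if ( i > next_evtchn )
        next_evtchn = i;           /* fresh reset: record it as started */
    return i;                      /* 0 maps to -EBUSY in evtchn_reset() */
}
```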
++++++ xsa345-1.patch ++++++
x86/mm: Refactor map_pages_to_xen to have only a single exit path
We will soon need to perform clean-ups before returning.
No functional change.
This is part of XSA-345.
Signed-off-by: Wei Liu <wei.liu2(a)citrix.com>
Signed-off-by: Hongyan Xia <hongyxia(a)amazon.com>
Signed-off-by: George Dunlap <george.dunlap(a)citrix.com>
Acked-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5181,6 +5181,7 @@ int map_pages_to_xen(
l2_pgentry_t *pl2e, ol2e;
l1_pgentry_t *pl1e, ol1e;
unsigned int i;
+ int rc = -ENOMEM;
#define flush_flags(oldf) do { \
unsigned int o_ = (oldf); \
@@ -5201,7 +5202,8 @@ int map_pages_to_xen(
l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
if ( !pl3e )
- return -ENOMEM;
+ goto out;
+
ol3e = *pl3e;
if ( cpu_has_page1gb &&
@@ -5289,7 +5291,7 @@ int map_pages_to_xen(
pl2e = alloc_xen_pagetable();
if ( pl2e == NULL )
- return -ENOMEM;
+ goto out;
for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
l2e_write(pl2e + i,
@@ -5318,7 +5320,7 @@ int map_pages_to_xen(
pl2e = virt_to_xen_l2e(virt);
if ( !pl2e )
- return -ENOMEM;
+ goto out;
if ( ((((virt >> PAGE_SHIFT) | mfn_x(mfn)) &
((1u << PAGETABLE_ORDER) - 1)) == 0) &&
@@ -5361,7 +5363,7 @@ int map_pages_to_xen(
{
pl1e = virt_to_xen_l1e(virt);
if ( pl1e == NULL )
- return -ENOMEM;
+ goto out;
}
else if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
{
@@ -5388,7 +5390,7 @@ int map_pages_to_xen(
pl1e = alloc_xen_pagetable();
if ( pl1e == NULL )
- return -ENOMEM;
+ goto out;
for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
l1e_write(&pl1e[i],
@@ -5532,7 +5534,10 @@ int map_pages_to_xen(
#undef flush_flags
- return 0;
+ rc = 0;
+
+ out:
+ return rc;
}
int populate_pt_range(unsigned long virt, unsigned long nr_mfns)
++++++ xsa345-2.patch ++++++
x86/mm: Refactor modify_xen_mappings to have one exit path
We will soon need to perform clean-ups before returning.
No functional change.
This is part of XSA-345.
Signed-off-by: Wei Liu <wei.liu2(a)citrix.com>
Signed-off-by: Hongyan Xia <hongyxia(a)amazon.com>
Signed-off-by: George Dunlap <george.dunlap(a)citrix.com>
Acked-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5564,6 +5564,7 @@ int modify_xen_mappings(unsigned long s,
l1_pgentry_t *pl1e;
unsigned int i;
unsigned long v = s;
+ int rc = -ENOMEM;
/* Set of valid PTE bits which may be altered. */
#define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
@@ -5605,7 +5606,8 @@ int modify_xen_mappings(unsigned long s,
/* PAGE1GB: shatter the superpage and fall through. */
pl2e = alloc_xen_pagetable();
if ( !pl2e )
- return -ENOMEM;
+ goto out;
+
for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
l2e_write(pl2e + i,
l2e_from_pfn(l3e_get_pfn(*pl3e) +
@@ -5660,7 +5662,8 @@ int modify_xen_mappings(unsigned long s,
/* PSE: shatter the superpage and try again. */
pl1e = alloc_xen_pagetable();
if ( !pl1e )
- return -ENOMEM;
+ goto out;
+
for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
l1e_write(&pl1e[i],
l1e_from_pfn(l2e_get_pfn(*pl2e) + i,
@@ -5789,7 +5792,10 @@ int modify_xen_mappings(unsigned long s,
flush_area(NULL, FLUSH_TLB_GLOBAL);
#undef FLAGS_MASK
- return 0;
+ rc = 0;
+
+ out:
+ return rc;
}
#undef flush_area
++++++ xsa345-3.patch ++++++
x86/mm: Prevent some races in hypervisor mapping updates
map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency. This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.
Unfortunately, while some potential races are handled correctly,
others are not. These include:
1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.
Take the following example:
Suppose L3[N] points to L2. And suppose we have two processors, A and
B.
* A walks the pagetables, gets a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B Frees L2
* A writes L2[M] #
This race is exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: If you call virt_to_xen_l2e
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; same
with virt_to_xen_l1e on a virtual address within either an L3 or L2
superpage.
So take the following example:
* A reads pl3e and discovers it to point to an L2.
* B replaces L3[N] with a 1GiB mapping
* A calls virt_to_xen_l2e() and hits the BUG_ON() #
2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.
Take the following example:
Suppose L3[N] points to L2. And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.
* A walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* B walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #
Fix this by grabbing a lock for the entirety of the mapping update
operation.
Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.
There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed). Without being able to guarantee that the L2 table wasn't
freed, however, that means every read would need to be considered
potentially unsafe. Thinking carefully about that is probably
something that wants to be done in public, not under time pressure.
This is part of XSA-345.
Signed-off-by: Hongyan Xia <hongyxia(a)amazon.com>
Signed-off-by: George Dunlap <george.dunlap(a)citrix.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2161,6 +2161,50 @@ void page_unlock(struct page_info *page)
current_locked_page_set(NULL);
}
+/*
+ * L3 table locks:
+ *
+ * Used for serialization in map_pages_to_xen() and modify_xen_mappings().
+ *
+ * For Xen PT pages, the page->u.inuse.type_info is unused and it is safe to
+ * reuse the PGT_locked flag. This lock is taken only when we move down to L3
+ * tables and below, since L4 (and above, for 5-level paging) is still globally
+ * protected by map_pgdir_lock.
+ *
+ * PV MMU update hypercalls call map_pages_to_xen while holding a page's page_lock().
+ * This has two implications:
+ * - We cannot reuse current_locked_page_* for debugging
+ * - To avoid the chance of deadlock, even for different pages, we
+ * must never grab page_lock() after grabbing l3t_lock(). This
+ * includes any page_lock()-based locks, such as
+ * mem_sharing_page_lock().
+ *
+ * Also note that we grab the map_pgdir_lock while holding the
+ * l3t_lock(), so to avoid deadlock we must avoid grabbing them in
+ * reverse order.
+ */
+static void l3t_lock(struct page_info *page)
+{
+ unsigned long x, nx;
+
+ do {
+ while ( (x = page->u.inuse.type_info) & PGT_locked )
+ cpu_relax();
+ nx = x | PGT_locked;
+ } while ( cmpxchg(&page->u.inuse.type_info, x, nx) != x );
+}
+
+static void l3t_unlock(struct page_info *page)
+{
+ unsigned long x, nx, y = page->u.inuse.type_info;
+
+ do {
+ x = y;
+ BUG_ON(!(x & PGT_locked));
+ nx = x & ~PGT_locked;
+ } while ( (y = cmpxchg(&page->u.inuse.type_info, x, nx)) != x );
+}
+
#ifdef CONFIG_PV
/*
* PTE flags that a guest may change without re-validating the PTE.
@@ -5171,6 +5215,23 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned l
flush_area_local((const void *)v, f) : \
flush_area_all((const void *)v, f))
+#define L3T_INIT(page) (page) = ZERO_BLOCK_PTR
+
+#define L3T_LOCK(page) \
+ do { \
+ if ( locking ) \
+ l3t_lock(page); \
+ } while ( false )
+
+#define L3T_UNLOCK(page) \
+ do { \
+ if ( locking && (page) != ZERO_BLOCK_PTR ) \
+ { \
+ l3t_unlock(page); \
+ (page) = ZERO_BLOCK_PTR; \
+ } \
+ } while ( false )
+
int map_pages_to_xen(
unsigned long virt,
mfn_t mfn,
@@ -5182,6 +5243,7 @@ int map_pages_to_xen(
l1_pgentry_t *pl1e, ol1e;
unsigned int i;
int rc = -ENOMEM;
+ struct page_info *current_l3page;
#define flush_flags(oldf) do { \
unsigned int o_ = (oldf); \
@@ -5197,13 +5259,20 @@ int map_pages_to_xen(
} \
} while (0)
+ L3T_INIT(current_l3page);
+
while ( nr_mfns != 0 )
{
- l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
+ l3_pgentry_t *pl3e, ol3e;
+ L3T_UNLOCK(current_l3page);
+
+ pl3e = virt_to_xen_l3e(virt);
if ( !pl3e )
goto out;
+ current_l3page = virt_to_page(pl3e);
+ L3T_LOCK(current_l3page);
ol3e = *pl3e;
if ( cpu_has_page1gb &&
@@ -5537,6 +5606,7 @@ int map_pages_to_xen(
rc = 0;
out:
+ L3T_UNLOCK(current_l3page);
return rc;
}
@@ -5565,6 +5635,7 @@ int modify_xen_mappings(unsigned long s,
unsigned int i;
unsigned long v = s;
int rc = -ENOMEM;
+ struct page_info *current_l3page;
/* Set of valid PTE bits which may be altered. */
#define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
@@ -5573,11 +5644,22 @@ int modify_xen_mappings(unsigned long s,
ASSERT(IS_ALIGNED(s, PAGE_SIZE));
ASSERT(IS_ALIGNED(e, PAGE_SIZE));
+ L3T_INIT(current_l3page);
+
while ( v < e )
{
- l3_pgentry_t *pl3e = virt_to_xen_l3e(v);
+ l3_pgentry_t *pl3e;
- if ( !pl3e || !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
+ L3T_UNLOCK(current_l3page);
+
+ pl3e = virt_to_xen_l3e(v);
+ if ( !pl3e )
+ goto out;
+
+ current_l3page = virt_to_page(pl3e);
+ L3T_LOCK(current_l3page);
+
+ if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
{
/* Confirm the caller isn't trying to create new mappings. */
ASSERT(!(nf & _PAGE_PRESENT));
@@ -5795,9 +5877,13 @@ int modify_xen_mappings(unsigned long s,
rc = 0;
out:
+ L3T_UNLOCK(current_l3page);
return rc;
}
+#undef L3T_LOCK
+#undef L3T_UNLOCK
+
#undef flush_area
int destroy_xen_mappings(unsigned long s, unsigned long e)
++++++ xsa346-1.patch ++++++
IOMMU: suppress "iommu_dont_flush_iotlb" when about to free a page
Deferring flushes to a single, wide range one - as is done when
handling XENMAPSPACE_gmfn_range - is okay only as long as
pages don't get freed ahead of the eventual flush. While the only
function setting the flag (xenmem_add_to_physmap()) suggests by its name
that it's only mapping new entries, in reality the way
xenmem_add_to_physmap_one() works means an unmap would happen not only
for the page being moved (but not freed) but, if the destination GFN is
populated, also for the page being displaced from that GFN. Collapsing
the two flushes for this GFN into just one (and even more so deferring
it to a batched invocation) is not correct.
This is part of XSA-346.
Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ")
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Paul Durrant <paul(a)xen.org>
Acked-by: Julien Grall <jgrall(a)amazon.com>
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -292,6 +292,7 @@ int guest_remove_page(struct domain *d,
p2m_type_t p2mt;
#endif
mfn_t mfn;
+ bool *dont_flush_p, dont_flush;
int rc;
#ifdef CONFIG_X86
@@ -378,8 +379,18 @@ int guest_remove_page(struct domain *d,
return -ENXIO;
}
+ /*
+ * Since we're likely to free the page below, we need to suspend
+ * xenmem_add_to_physmap()'s suppressing of IOMMU TLB flushes.
+ */
+ dont_flush_p = &this_cpu(iommu_dont_flush_iotlb);
+ dont_flush = *dont_flush_p;
+ *dont_flush_p = false;
+
rc = guest_physmap_remove_page(d, _gfn(gmfn), mfn, 0);
+ *dont_flush_p = dont_flush;
+
/*
* With the lack of an IOMMU on some platforms, domains with DMA-capable
* device must retrieve the same pfn when the hypercall populate_physmap
++++++ xsa346-2.patch ++++++
IOMMU: hold page ref until after deferred TLB flush
When moving around a page via XENMAPSPACE_gmfn_range, deferring the TLB
flush for the "from" GFN range requires that the page remains allocated
to the guest until the TLB flush has actually occurred. Otherwise a
parallel hypercall to remove the page would only flush the TLB for the
GFN it has been moved to, but not the one it was mapped at originally.
This is part of XSA-346.
Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ")
Reported-by: Julien Grall <jgrall(a)amazon.com>
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Acked-by: Julien Grall <jgrall(a)amazon.com>
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -1407,7 +1407,7 @@ void share_xen_page_with_guest(struct pa
int xenmem_add_to_physmap_one(
struct domain *d,
unsigned int space,
- union xen_add_to_physmap_batch_extra extra,
+ union add_to_physmap_extra extra,
unsigned long idx,
gfn_t gfn)
{
@@ -1480,10 +1480,6 @@ int xenmem_add_to_physmap_one(
break;
}
case XENMAPSPACE_dev_mmio:
- /* extra should be 0. Reserved for future use. */
- if ( extra.res0 )
- return -EOPNOTSUPP;
-
rc = map_dev_mmio_region(d, gfn, 1, _mfn(idx));
return rc;
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4662,7 +4662,7 @@ static int handle_iomem_range(unsigned l
int xenmem_add_to_physmap_one(
struct domain *d,
unsigned int space,
- union xen_add_to_physmap_batch_extra extra,
+ union add_to_physmap_extra extra,
unsigned long idx,
gfn_t gpfn)
{
@@ -4746,9 +4746,20 @@ int xenmem_add_to_physmap_one(
rc = guest_physmap_add_page(d, gpfn, mfn, PAGE_ORDER_4K);
put_both:
- /* In the XENMAPSPACE_gmfn case, we took a ref of the gfn at the top. */
+ /*
+ * In the XENMAPSPACE_gmfn case, we took a ref of the gfn at the top.
+ * We also may need to transfer ownership of the page reference to our
+ * caller.
+ */
if ( space == XENMAPSPACE_gmfn )
+ {
put_gfn(d, gfn);
+ if ( !rc && extra.ppage )
+ {
+ *extra.ppage = page;
+ page = NULL;
+ }
+ }
if ( page )
put_page(page);
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -814,13 +814,12 @@ int xenmem_add_to_physmap(struct domain
{
unsigned int done = 0;
long rc = 0;
- union xen_add_to_physmap_batch_extra extra;
+ union add_to_physmap_extra extra = {};
+ struct page_info *pages[16];
ASSERT(paging_mode_translate(d));
- if ( xatp->space != XENMAPSPACE_gmfn_foreign )
- extra.res0 = 0;
- else
+ if ( xatp->space == XENMAPSPACE_gmfn_foreign )
extra.foreign_domid = DOMID_INVALID;
if ( xatp->space != XENMAPSPACE_gmfn_range )
@@ -835,7 +834,10 @@ int xenmem_add_to_physmap(struct domain
xatp->size -= start;
if ( is_iommu_enabled(d) )
+ {
this_cpu(iommu_dont_flush_iotlb) = 1;
+ extra.ppage = &pages[0];
+ }
while ( xatp->size > done )
{
@@ -847,8 +849,12 @@ int xenmem_add_to_physmap(struct domain
xatp->idx++;
xatp->gpfn++;
+ if ( extra.ppage )
+ ++extra.ppage;
+
/* Check for continuation if it's not the last iteration. */
- if ( xatp->size > ++done && hypercall_preempt_check() )
+ if ( (++done > ARRAY_SIZE(pages) && extra.ppage) ||
+ (xatp->size > done && hypercall_preempt_check()) )
{
rc = start + done;
break;
@@ -858,6 +864,7 @@ int xenmem_add_to_physmap(struct domain
if ( is_iommu_enabled(d) )
{
int ret;
+ unsigned int i;
this_cpu(iommu_dont_flush_iotlb) = 0;
@@ -866,6 +873,15 @@ int xenmem_add_to_physmap(struct domain
if ( unlikely(ret) && rc >= 0 )
rc = ret;
+ /*
+ * Now that the IOMMU TLB flush was done for the original GFN, drop
+ * the page references. The 2nd flush below is fine to make later, as
+ * whoever removes the page again from its new GFN will have to do
+ * another flush anyway.
+ */
+ for ( i = 0; i < done; ++i )
+ put_page(pages[i]);
+
ret = iommu_iotlb_flush(d, _dfn(xatp->gpfn - done), done,
IOMMU_FLUSHF_added | IOMMU_FLUSHF_modified);
if ( unlikely(ret) && rc >= 0 )
@@ -879,6 +895,8 @@ static int xenmem_add_to_physmap_batch(s
struct xen_add_to_physmap_batch *xatpb,
unsigned int extent)
{
+ union add_to_physmap_extra extra = {};
+
if ( unlikely(xatpb->size < extent) )
return -EILSEQ;
@@ -890,6 +908,19 @@ static int xenmem_add_to_physmap_batch(s
!guest_handle_subrange_okay(xatpb->errs, extent, xatpb->size - 1) )
return -EFAULT;
+ switch ( xatpb->space )
+ {
+ case XENMAPSPACE_dev_mmio:
+ /* res0 is reserved for future use. */
+ if ( xatpb->u.res0 )
+ return -EOPNOTSUPP;
+ break;
+
+ case XENMAPSPACE_gmfn_foreign:
+ extra.foreign_domid = xatpb->u.foreign_domid;
+ break;
+ }
+
while ( xatpb->size > extent )
{
xen_ulong_t idx;
@@ -902,8 +933,7 @@ static int xenmem_add_to_physmap_batch(s
extent, 1)) )
return -EFAULT;
- rc = xenmem_add_to_physmap_one(d, xatpb->space,
- xatpb->u,
+ rc = xenmem_add_to_physmap_one(d, xatpb->space, extra,
idx, _gfn(gpfn));
if ( unlikely(__copy_to_guest_offset(xatpb->errs, extent, &rc, 1)) )
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -588,8 +588,22 @@ void scrub_one_page(struct page_info *);
&(d)->xenpage_list : &(d)->page_list)
#endif
+union add_to_physmap_extra {
+ /*
+ * XENMAPSPACE_gmfn: When deferring TLB flushes, a page reference needs
+ * to be kept until after the flush, so the page can't get removed from
+ * the domain (and re-used for another purpose) beforehand. By passing
+ * non-NULL, the caller of xenmem_add_to_physmap_one() indicates it wants
+ * to have ownership of such a reference transferred in the success case.
+ */
+ struct page_info **ppage;
+
+ /* XENMAPSPACE_gmfn_foreign */
+ domid_t foreign_domid;
+};
+
int xenmem_add_to_physmap_one(struct domain *d, unsigned int space,
- union xen_add_to_physmap_batch_extra extra,
+ union add_to_physmap_extra extra,
unsigned long idx, gfn_t gfn);
int xenmem_add_to_physmap(struct domain *d, struct xen_add_to_physmap *xatp,
++++++ xsa347-1.patch ++++++
AMD/IOMMU: convert amd_iommu_pte from struct to union
This is to add a "raw" counterpart to the bitfield equivalent. Take the
opportunity and
- convert fields to bool / unsigned int,
- drop the naming of the reserved field,
- shorten the names of the ignored ones.
This is part of XSA-347.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3(a)citrix.com>
Reviewed-by: Paul Durrant <paul(a)xen.org>
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -38,7 +38,7 @@ static unsigned int pfn_to_pde_idx(unsig
static unsigned int clear_iommu_pte_present(unsigned long l1_mfn,
unsigned long dfn)
{
- struct amd_iommu_pte *table, *pte;
+ union amd_iommu_pte *table, *pte;
unsigned int flush_flags;
table = map_domain_page(_mfn(l1_mfn));
@@ -52,7 +52,7 @@ static unsigned int clear_iommu_pte_pres
return flush_flags;
}
-static unsigned int set_iommu_pde_present(struct amd_iommu_pte *pte,
+static unsigned int set_iommu_pde_present(union amd_iommu_pte *pte,
unsigned long next_mfn,
unsigned int next_level, bool iw,
bool ir)
@@ -87,7 +87,7 @@ static unsigned int set_iommu_pte_presen
int pde_level,
bool iw, bool ir)
{
- struct amd_iommu_pte *table, *pde;
+ union amd_iommu_pte *table, *pde;
unsigned int flush_flags;
table = map_domain_page(_mfn(pt_mfn));
@@ -178,7 +178,7 @@ void iommu_dte_set_guest_cr3(struct amd_
static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
unsigned long pt_mfn[], bool map)
{
- struct amd_iommu_pte *pde, *next_table_vaddr;
+ union amd_iommu_pte *pde, *next_table_vaddr;
unsigned long next_table_mfn;
unsigned int level;
struct page_info *table;
@@ -458,7 +458,7 @@ int __init amd_iommu_quarantine_init(str
unsigned long end_gfn =
1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT);
unsigned int level = amd_iommu_get_paging_mode(end_gfn);
- struct amd_iommu_pte *table;
+ union amd_iommu_pte *table;
if ( hd->arch.root_table )
{
@@ -489,7 +489,7 @@ int __init amd_iommu_quarantine_init(str
for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
{
- struct amd_iommu_pte *pde = &table[i];
+ union amd_iommu_pte *pde = &table[i];
/*
* PDEs are essentially a subset of PTEs, so this function
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -390,7 +390,7 @@ static void deallocate_next_page_table(s
static void deallocate_page_table(struct page_info *pg)
{
- struct amd_iommu_pte *table_vaddr;
+ union amd_iommu_pte *table_vaddr;
unsigned int index, level = PFN_ORDER(pg);
PFN_ORDER(pg) = 0;
@@ -405,7 +405,7 @@ static void deallocate_page_table(struct
for ( index = 0; index < PTE_PER_TABLE_SIZE; index++ )
{
- struct amd_iommu_pte *pde = &table_vaddr[index];
+ union amd_iommu_pte *pde = &table_vaddr[index];
if ( pde->mfn && pde->next_level && pde->pr )
{
@@ -557,7 +557,7 @@ static void amd_dump_p2m_table_level(str
paddr_t gpa, int indent)
{
paddr_t address;
- struct amd_iommu_pte *table_vaddr;
+ const union amd_iommu_pte *table_vaddr;
int index;
if ( level < 1 )
@@ -573,7 +573,7 @@ static void amd_dump_p2m_table_level(str
for ( index = 0; index < PTE_PER_TABLE_SIZE; index++ )
{
- struct amd_iommu_pte *pde = &table_vaddr[index];
+ const union amd_iommu_pte *pde = &table_vaddr[index];
if ( !(index % 2) )
process_pending_softirqs();
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h
@@ -465,20 +465,23 @@ union amd_iommu_x2apic_control {
#define IOMMU_PAGE_TABLE_U32_PER_ENTRY (IOMMU_PAGE_TABLE_ENTRY_SIZE / 4)
#define IOMMU_PAGE_TABLE_ALIGNMENT 4096
-struct amd_iommu_pte {
- uint64_t pr:1;
- uint64_t ignored0:4;
- uint64_t a:1;
- uint64_t d:1;
- uint64_t ignored1:2;
- uint64_t next_level:3;
- uint64_t mfn:40;
- uint64_t reserved:7;
- uint64_t u:1;
- uint64_t fc:1;
- uint64_t ir:1;
- uint64_t iw:1;
- uint64_t ignored2:1;
+union amd_iommu_pte {
+ uint64_t raw;
+ struct {
+ bool pr:1;
+ unsigned int ign0:4;
+ bool a:1;
+ bool d:1;
+ unsigned int ign1:2;
+ unsigned int next_level:3;
+ uint64_t mfn:40;
+ unsigned int :7;
+ bool u:1;
+ bool fc:1;
+ bool ir:1;
+ bool iw:1;
+ unsigned int ign2:1;
+ };
};
/* Paging modes */
++++++ xsa347-2.patch ++++++
AMD/IOMMU: update live PTEs atomically
Updating a live PTE bitfield by bitfield risks the compiler re-ordering
the individual updates as well as splitting individual updates into
multiple memory writes. Construct the new entry fully in a local
variable, do the check to determine the flushing needs on the thus
established new entry, and then write the new entry by a single insn.
Similarly using memset() to clear a PTE is unsafe, as the order of
writes the function does is, at least in principle, undefined.
This is part of XSA-347.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Paul Durrant <paul(a)xen.org>
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -45,7 +45,7 @@ static unsigned int clear_iommu_pte_pres
pte = &table[pfn_to_pde_idx(dfn, 1)];
flush_flags = pte->pr ? IOMMU_FLUSHF_modified : 0;
- memset(pte, 0, sizeof(*pte));
+ write_atomic(&pte->raw, 0);
unmap_domain_page(table);
@@ -57,26 +57,30 @@ static unsigned int set_iommu_pde_presen
unsigned int next_level, bool iw,
bool ir)
{
+ union amd_iommu_pte new = {}, old;
unsigned int flush_flags = IOMMU_FLUSHF_added;
- if ( pte->pr &&
- (pte->mfn != next_mfn ||
- pte->iw != iw ||
- pte->ir != ir ||
- pte->next_level != next_level) )
- flush_flags |= IOMMU_FLUSHF_modified;
-
/*
* FC bit should be enabled in PTE, this helps to solve potential
* issues with ATS devices
*/
- pte->fc = !next_level;
+ new.fc = !next_level;
+
+ new.mfn = next_mfn;
+ new.iw = iw;
+ new.ir = ir;
+ new.next_level = next_level;
+ new.pr = true;
+
+ old.raw = read_atomic(&pte->raw);
+ old.ign0 = 0;
+ old.ign1 = 0;
+ old.ign2 = 0;
+
+ if ( old.pr && old.raw != new.raw )
+ flush_flags |= IOMMU_FLUSHF_modified;
- pte->mfn = next_mfn;
- pte->iw = iw;
- pte->ir = ir;
- pte->next_level = next_level;
- pte->pr = 1;
+ write_atomic(&pte->raw, new.raw);
return flush_flags;
}
++++++ xsa347-3.patch ++++++
AMD/IOMMU: ensure suitable ordering of DTE modifications
DMA and interrupt translation should be enabled only after other
applicable DTE fields have been written. Similarly when disabling
translation or when moving a device between domains, translation should
first be disabled, before other entry fields get modified. Note however
that the "moving" aspect doesn't apply to the interrupt remapping side,
as domain specifics are maintained in the IRTEs here, not the DTE. We
also never disable interrupt remapping once it got enabled for a device
(the respective argument passed is always the immutable iommu_intremap).
This is part of XSA-347.
Signed-off-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Paul Durrant <paul(a)xen.org>
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -107,11 +107,18 @@ void amd_iommu_set_root_page_table(struc
uint64_t root_ptr, uint16_t domain_id,
uint8_t paging_mode, bool valid)
{
+ if ( valid || dte->v )
+ {
+ dte->tv = false;
+ dte->v = true;
+ smp_wmb();
+ }
dte->domain_id = domain_id;
dte->pt_root = paddr_to_pfn(root_ptr);
dte->iw = true;
dte->ir = true;
dte->paging_mode = paging_mode;
+ smp_wmb();
dte->tv = true;
dte->v = valid;
}
@@ -134,6 +141,7 @@ void amd_iommu_set_intremap_table(
}
dte->ig = false; /* unmapped interrupts result in i/o page faults */
+ smp_wmb();
dte->iv = valid;
}
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -120,7 +120,10 @@ static void amd_iommu_setup_domain_devic
/* Undo what amd_iommu_disable_domain_device() may have done. */
ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
if ( dte->it_root )
+ {
dte->int_ctl = IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED;
+ smp_wmb();
+ }
dte->iv = iommu_intremap;
dte->ex = ivrs_dev->dte_allow_exclusion;
dte->sys_mgt = MASK_EXTR(ivrs_dev->device_flags, ACPI_IVHD_SYSTEM_MGMT);
Hello community,
here is the log from the commit of package xen for openSUSE:Leap:15.2:Update checked in at 2020-10-31 00:23:25
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.2:Update/xen (Old)
and /work/SRC/openSUSE:Leap:15.2:Update/.xen.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "xen"
Sat Oct 31 00:23:25 2020 rev:4 rq:844482 version:unknown
Changes:
--------
New Changes file:
NO CHANGES FILE!!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ _link ++++++
--- /var/tmp/diff_new_pack.llkN8G/_old 2020-10-31 00:23:34.807629301 +0100
+++ /var/tmp/diff_new_pack.llkN8G/_new 2020-10-31 00:23:34.807629301 +0100
@@ -1 +1 @@
-<link package='xen.14321' cicount='copy' />
+<link package='xen.14764' cicount='copy' />
Hello community,
here is the log from the commit of package corosync for openSUSE:Leap:15.1:Update checked in at 2020-10-31 00:23:10
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.1:Update/corosync (Old)
and /work/SRC/openSUSE:Leap:15.1:Update/.corosync.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "corosync"
Sat Oct 31 00:23:10 2020 rev:3 rq:844451 version:unknown
Changes:
--------
New Changes file:
NO CHANGES FILE!!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ _link ++++++
--- /var/tmp/diff_new_pack.pPYqJk/_old 2020-10-31 00:23:11.715615378 +0100
+++ /var/tmp/diff_new_pack.pPYqJk/_new 2020-10-31 00:23:11.715615378 +0100
@@ -1 +1 @@
-<link package='corosync.12411' cicount='copy' />
+<link package='corosync.14693' cicount='copy' />
Hello community,
here is the log from the commit of package 00Meta for openSUSE:Leap:15.2:Images checked in at 2020-10-30 22:30:42
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.2:Images/00Meta (Old)
and /work/SRC/openSUSE:Leap:15.2:Images/.00Meta.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "00Meta"
Fri Oct 30 22:30:42 2020 rev:575 rq: version:unknown
Changes:
--------
New Changes file:
NO CHANGES FILE!!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ version_totest ++++++
--- /var/tmp/diff_new_pack.PsPVvk/_old 2020-10-30 22:30:44.067369622 +0100
+++ /var/tmp/diff_new_pack.PsPVvk/_new 2020-10-30 22:30:44.067369622 +0100
@@ -1 +1 @@
-31.215
\ No newline at end of file
+31.216
\ No newline at end of file
Hello community,
here is the log from the commit of package MozillaThunderbird for openSUSE:Leap:15.2:Update checked in at 2020-10-30 21:35:40
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Leap:15.2:Update/MozillaThunderbird (Old)
and /work/SRC/openSUSE:Leap:15.2:Update/.MozillaThunderbird.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "MozillaThunderbird"
Fri Oct 30 21:35:40 2020 rev:4 rq:844992 version:unknown
Changes:
--------
New Changes file:
NO CHANGES FILE!!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ _link ++++++
--- /var/tmp/diff_new_pack.SLhGy5/_old 2020-10-30 21:35:43.280855447 +0100
+++ /var/tmp/diff_new_pack.SLhGy5/_new 2020-10-30 21:35:43.284855449 +0100
@@ -1 +1 @@
-<link package='MozillaThunderbird.13939' cicount='copy' />
+<link package='MozillaThunderbird.14813' cicount='copy' />
Hello community,
here is the log from the commit of package 000update-repos for openSUSE:Factory checked in at 2020-10-30 21:07:17
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/000update-repos (Old)
and /work/SRC/openSUSE:Factory/.000update-repos.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "000update-repos"
Fri Oct 30 21:07:17 2020 rev:1368 rq: version:unknown
Changes:
--------
New Changes file:
NO CHANGES FILE!!!
New:
----
15.1:update_1604079059.packages.xz
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
Hello community,
here is the log from the commit of package 000update-repos for openSUSE:Factory checked in at 2020-10-30 21:07:06
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/000update-repos (Old)
and /work/SRC/openSUSE:Factory/.000update-repos.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "000update-repos"
Fri Oct 30 21:07:06 2020 rev:1367 rq: version:unknown
Changes:
--------
New Changes file:
NO CHANGES FILE!!!
New:
----
factory:non-oss_2444.1.packages.xz
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------