[opensuse] leap15.1 - multi-path route not working
(same old story - I am migrating a mail cluster to 15.1 ... this is still the first node, "tattoo16")

For alternating output over two routers (we call them "frontends"), we use a multipath route defined like this:

    ip route add default table fe1fe2 nexthop via 10.0.1.145 nexthop via 10.0.2.145

This has worked fine for more than ten years: new connections alternate between 10.0.1.145 and 10.0.2.145.

Now on Leap 15.1 I get no alternation - I always get 10.0.2.145. I have tried swapping the nexthops:

    ip route add default table fe1fe2 nexthop via 10.0.2.145 nexthop via 10.0.1.145

and then I always get 10.0.1.145.

I am still researching it; I wonder if some default setting has changed since this was set up in 2006 or 2007.

Just in case, this is what the route table looks like:

    # ip route show table fe1fe2
    default
            nexthop via 10.0.2.145 dev ipip1 weight 1
            nexthop via 10.0.1.145 dev ipip0 weight 1
    10.0.1.144/30 dev ipip0 scope link src 10.0.1.146
    10.0.2.144/30 dev ipip1 scope link src 10.0.2.146
    127.0.0.0/8 dev lo scope link

I switched back to kernel 3.16.7 and here it works as expected. (Our test system cluster is currently on 3.16.7.)

--
Per Jessen, Zürich (17.8°C)
http://www.hostsuisse.com/ - virtual servers, made in Switzerland.

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
Per Jessen wrote:
(same old story - I am migrating a mail cluster to 15.1 ... this is still the first node, "tattoo16") [snip] I switched back to kernel 3.16.7 and here it works as expected. (our test system cluster is currently on 3.16.7).
Having tried out a few different kernels, I have determined that it worked in 4.1.39 and stopped working in kernel 4.4.179.

--
Per Jessen, Zürich (19.6°C)
http://www.dns24.ch/ - free dynamic DNS, made in Switzerland.
On Thu, Jul 11, 2019 at 11:12 AM Per Jessen <per@computer.org> wrote:
(same old story - I am migrating a mail cluster to 15.1 ... this is still the first node, "tattoo16")
For alternating output over two routers (we call them "frontends"), we use a multipath route defined like this:
ip route add default table fe1fe2 nexthop via 10.0.1.145 nexthop via 10.0.2.145
How do you ensure kernel will use table fe1fe2 when computing next hop?
This has worked fine for more than ten years, new connections alternate between 10.0.1.145 and 10.0.2.145.
Now on Leap 15.1 I get no alternation - I always get 10.0.2.145. I have tried swapping the nexthops:
ip route add default table fe1fe2 nexthop via 10.0.2.145 nexthop via 10.0.1.145
and then I always get 10.0.1.145.
I do not have Leap at the moment; quick testing with SLES15 GA (and the default routing table) shows that "ip route get" returns alternate gateways for different addresses. (I believe it is no longer round robin; rather, it computes a deterministic hash over the IP address.)
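A quick way to repeat that kind of check, sketched under assumptions: the destination addresses are documentation-range placeholders (198.51.100.0/24), the mark value is hypothetical, and the first loop goes through whatever table the rules select for unmarked traffic.

```shell
# Ask the kernel which nexthop it would pick for several destinations.
# With hash-based selection the answer may vary by destination address
# but should stay the same when the same address is queried again.
for dst in 198.51.100.1 198.51.100.2 198.51.100.3 198.51.100.4; do
    ip route get "$dst"
done

# Look up the route as a marked packet would see it
# (0x1 is a placeholder mark, not a value taken from this thread).
ip route get 198.51.100.1 mark 0x1
```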
Andrei Borzenkov wrote:
On Thu, Jul 11, 2019 at 11:12 AM Per Jessen <per@computer.org> wrote:
(same old story - I am migrating a mail cluster to 15.1 ... this is still the first node, "tattoo16")
For alternating output over two routers (we call them "frontends"), we use a multipath route defined like this:
ip route add default table fe1fe2 nexthop via 10.0.1.145 nexthop via 10.0.2.145
How do you ensure kernel will use table fe1fe2 when computing next hop?
I set a firewall mark (given the right conditions), then I use a rule to select the right table.
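A minimal sketch of that kind of setup, for illustration only. The mark value 0x1 and the match on outgoing SMTP (TCP port 25) are assumptions, since the actual conditions are not given in this thread. It also assumes the name fe1fe2 is already defined in /etc/iproute2/rt_tables.

```shell
# Mark locally generated packets that should go out via the frontends.
# The port 25 match and the mark 0x1 are placeholders, not the real conditions.
iptables -t mangle -A OUTPUT -p tcp --dport 25 -j MARK --set-mark 0x1

# Send marked packets to the fe1fe2 routing table.
ip rule add fwmark 0x1 table fe1fe2

# The multipath default route from the original post.
ip route add default table fe1fe2 nexthop via 10.0.1.145 nexthop via 10.0.2.145

# Check the rule and the table contents.
ip rule show
ip route show table fe1fe2
```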
I do not have Leap at the moment, quick testing with SLES15 GA (and default routing table) - "ip route get" returns alternate gateways for different addresses (I believe it is not round robin anymore, rather it computes deterministic hash over IP address).
Ah, that might well be the issue - the same IP address would then always be directed via the same route.

/Per
On Thu, Jul 11, 2019 at 1:51 PM Per Jessen <per@computer.org> wrote:
Ah, that might well be the issue - the same IP address would then always be directed via the same route.
https://codecave.cc/multipath-routing-in-linux-part-2.html may be of some interest.
Andrei Borzenkov wrote:
On Thu, Jul 11, 2019 at 1:51 PM Per Jessen <per@computer.org> wrote:
Ah, that might well be the issue - the same IP address would then always be directed via the same route.
https://codecave.cc/multipath-routing-in-linux-part-2.html may be of some interest.
Yeah, I've just read it :-)

Hmm. I'm going to try that net.ipv4.fib_multipath_hash_policy setting - it looks like it will use the source port too, which might be sufficient to get alternating routes.

Thanks for the hint about deterministic hashes, very helpful!

/Per
Per Jessen wrote:
Andrei Borzenkov wrote:
On Thu, Jul 11, 2019 at 1:51 PM Per Jessen <per@computer.org> wrote:
Ah, that might well be the issue - the same IP address would then always be directed via the same route.
https://codecave.cc/multipath-routing-in-linux-part-2.html may be of some interest.
Yeah, I've just read it :-)
Hmm. I'm going to try that net.ipv4.fib_multipath_hash_policy setting - it looks like it will use the source port too, that might be sufficient to get alternating routes.
So far I can't make it work. I have changed the hashing policy to '1', which should enable L4 hashing, which should also take the source port into account. I've even double-checked the code :-)

Andrei, you wouldn't happen to know how to look at the hash value used? Afaict, it doesn't change - if I do e.g. 10 or 20 new sessions one after another, they all pick the same route.

--
Per Jessen, Zürich (21.6°C)
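For reference, a minimal sketch of the policy change described here. It assumes a kernel that exposes the net.ipv4.fib_multipath_hash_policy sysctl named in this thread; the file name under /etc/sysctl.d is an arbitrary example, not something taken from the thread.

```shell
# Show the current policy (0 = layer 3 addresses only, 1 = layer 4, ports included).
sysctl net.ipv4.fib_multipath_hash_policy

# Switch the running system to L4 hashing (needs root).
sysctl -w net.ipv4.fib_multipath_hash_policy=1

# Persist the setting across reboots; the file name is a hypothetical choice.
echo 'net.ipv4.fib_multipath_hash_policy = 1' > /etc/sysctl.d/90-multipath-hash.conf
```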
11.07.2019 15:53, Per Jessen wrote:
Andrei, you wouldn't happen to know how to look at the hash value used? Afaict, it doesn't change - if I do e.g. 10 or 20 new sessions one after another, they all pick the same route.
Displaying the hash itself, which is the return value of a kernel function, is relatively easy - a function return is a well-defined trace point. Displaying its arguments or internal variables seems to be more complicated and requires real programming in whatever trace language is used.

But it does look like the computed hash is the same every time.

    tw:/usr/share/bcc/tools # cat /proc/sys/net/ipv4/fib_multipath_hash_policy
    0
    tw:/usr/share/bcc/tools # ./trace 'r::fib_multipath_hash "ret = %d", retval'
    PID     TID     COMM     FUNC                -
    5971    5971    telnet   fib_multipath_hash  ret = 1229346044
    5971    5971    telnet   fib_multipath_hash  ret = 107210842
    5971    5971    telnet   fib_multipath_hash  ret = 107210842
    5972    5972    telnet   fib_multipath_hash  ret = 1229346044
    5972    5972    telnet   fib_multipath_hash  ret = 107210842
    5972    5972    telnet   fib_multipath_hash  ret = 107210842
    ^C
    tw:/usr/share/bcc/tools #
    tw:/usr/share/bcc/tools # echo 1 > /proc/sys/net/ipv4/fib_multipath_hash_policy
    tw:/usr/share/bcc/tools # cat /proc/sys/net/ipv4/fib_multipath_hash_policy
    1
    tw:/usr/share/bcc/tools # ./trace 'r::fib_multipath_hash "ret = %d", retval'
    PID     TID     COMM     FUNC                -
    5998    5998    telnet   fib_multipath_hash  ret = 1123988857
    5998    5998    telnet   fib_multipath_hash  ret = 840030644
    5998    5998    telnet   fib_multipath_hash  ret = 178211998
    1014    1014    ntpd     fib_multipath_hash  ret = 159405026
    5999    5999    telnet   fib_multipath_hash  ret = 1123988857
    5999    5999    telnet   fib_multipath_hash  ret = 840030644
    5999    5999    telnet   fib_multipath_hash  ret = 1223581618
    ^C
    tw:/usr/share/bcc/tools #

Here "telnet" is called twice in different terminals with the same destination address. I am not sure where the two other calls to the hash function come from; maybe some internal library resolver or similar.

Anyway, the resulting hashes seem to be always the same. I am afraid I am not deep enough into the kernel routing code to make any useful comment here.
Andrei Borzenkov wrote:
11.07.2019 15:53, Per Jessen wrote:
Andrei, you wouldn't happen to know how to look at the hash value used? Afaict, it doesn't change - if I do e.g. 10 or 20 new sessions one after another, they all pick the same route.
Displaying hash itself which is return value of kernel function is relatively easy - function return is well defined trace point. Displaying its arguments or internal variables seems to be more complicated and requires real programming in whatever trace language is used.
But it does look like computed hash is the same every time.
Here "telnet" is called twice in different terminals with the same destination address. I am not sure where two other calls to hash come from, may be some internal library resolver or similar.
Anyway, resulting hashes seem to be always the same. I am afraid I'm not as deep into kernel routing code to make any useful comment here.
Thanks anyway - I really appreciate it. I'll have to learn how to use that trace functionality.

I hate to leave this topic for now, but tomorrow I'm off on vacation, back again at the beginning of August. This multipath issue is a real show-stopper; I hope there will be a setting I can tweak (or that I'm just doing something wrong).

--
Per Jessen, Zürich (18.1°C)
12.07.2019 6:56, Andrei Borzenkov wrote:
11.07.2019 15:53, Per Jessen wrote:
Andrei, you wouldn't happen to know how to look at the hash value used? Afaict, it doesn't change - if I do e.g. 10 or 20 new sessions one after another, they all pick the same route.
Displaying hash itself which is return value of kernel function is relatively easy - function return is well defined trace point. Displaying its arguments or internal variables seems to be more complicated and requires real programming in whatever trace language is used.
...
Here "telnet" is called twice in different terminals with the same destination address. I am not sure where two other calls to hash come from, may be some internal library resolver or similar.
I did become curious and tried to capture the input to the hash function.

    bor@tw:~> sudo ./fibhashsnoop
    PID    COMM    SADDR           DADDR        HASH
    3882   telnet  00000000:0      4a7d8363:23  20767294
    3882   telnet  0a00020f:0      4a7d8363:23  736f1ef2
    3882   telnet  0a00020f:56546  4a7d8363:23  396252f9
    3887   telnet  00000000:0      4a7d8363:23  20767294
    3887   telnet  0a00020f:0      4a7d8363:23  736f1ef2
    3887   telnet  0a00020f:56548  4a7d8363:23  34d6e578
    3892   telnet  00000000:0      4a7d8363:23  20767294
    3892   telnet  0a00020f:0      4a7d8363:23  736f1ef2
    3892   telnet  0a00020f:56550  4a7d8363:23  7d96dbcf
    3901   telnet  00000000:0      4a7d8363:23  20767294
    3901   telnet  0a00020f:0      4a7d8363:23  736f1ef2
    3901   telnet  0a00020f:56552  4a7d8363:23  783155e1
    3906   telnet  00000000:0      4a7d8363:23  20767294
    3906   telnet  0a00020f:0      4a7d8363:23  736f1ef2
    3906   telnet  0a00020f:56554  4a7d8363:23  33cb2d99
    ^Cbor@tw:~>

All numbers are hex (I am lazy). So it appears the kernel calls the routing code several times; assuming the last call (with actual port numbers) determines the final route, two consecutive invocations select the same gateway. So in the worst case you may have bad luck.

The max hash value is 0x7fffffff; with equal weights each gateway gets half of all values, i.e. a hash below 0x3fffffff selects one gateway, a hash above it the other. So in the worst case all your traffic is indeed hashed to one gateway only.
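To read the SADDR and DADDR columns as dotted-quad addresses, a small helper along these lines may be useful. It is a sketch that assumes the tool prints each address as a 32-bit hex value with the first octet in the most significant byte; hex_to_ipv4 is a hypothetical name, not part of fibhashsnoop.

```shell
# Convert a 32-bit hex value (without 0x prefix) to dotted-quad notation.
# Assumption: the most significant byte is the first octet.
hex_to_ipv4() {
    value=$(( 0x$1 ))
    printf '%d.%d.%d.%d\n' \
        $(( (value >> 24) & 255 )) \
        $(( (value >> 16) & 255 )) \
        $(( (value >> 8) & 255 )) \
        $(( value & 255 ))
}

hex_to_ipv4 0a00020f   # source address column      -> 10.0.2.15
hex_to_ipv4 4a7d8363   # destination address column -> 74.125.131.99
```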
Anyway, resulting hashes seem to be always the same. I am afraid I'm not as deep into kernel routing code to make any useful comment here.
Still true :) In particular I have no idea why the routing decision is apparently performed three times.
12.07.2019 21:10, Andrei Borzenkov wrote:
12.07.2019 6:56, Andrei Borzenkov wrote:
11.07.2019 15:53, Per Jessen wrote:
Andrei, you wouldn't happen to know how to look at the hash value used? Afaict, it doesn't change - if I do e.g. 10 or 20 new sessions one after another, they all pick the same route.
Displaying hash itself which is return value of kernel function is relatively easy - function return is well defined trace point. Displaying its arguments or internal variables seems to be more complicated and requires real programming in whatever trace language is used.
...
Here "telnet" is called twice in different terminals with the same destination address. I am not sure where two other calls to hash come from, may be some internal library resolver or similar.
This happens in the kernel: it calls the routing code several times when the application has not bound the socket to a local address (it needs to compute the outgoing device in order to select the local address to use - chicken and egg).
I did become curious and tried to capture input to hash function.
All numbers are hex (I am lazy). So it appears kernel calls routing code several times, assuming the last one (with actual port numbers) determines the final route - two consecutive invocations select the same gateway. So in the worst case you may have bad luck.
The max hash value is 0x7fffffff; with equal weights each gateway gets half of all values, i.e. a hash below 0x3fffffff selects one gateway, a hash above it the other. So in the worst case all your traffic is indeed hashed to one gateway only.
It seems to work for me - the kernel actually attempts to send traffic via different gateways, and this correlates with the hash value.

    tw:/home/bor # ./fibhashsnoop
    PID    COMM    SADDR           DADDR        HASH
    4812   telnet  0a00020f:60414  4a7d8363:23  38aa371d
    4817   telnet  0a00020f:60416  4a7d8363:23  465d2b13
    4822   telnet  0a00020f:60418  4a7d8363:23  4f222ac6
    4827   telnet  0a00020f:60420  4a7d8363:23  ff99190
    4832   telnet  0a00020f:60422  4a7d8363:23  145c021e
    ^Ctw:/home/bor #

    tw:/usr/share/bcc/tools # tshark -i enp0s4 -o column.format:'"No.","%m","Time","%t","Source","%s","Destination","%d","Protocol","%p","Info","%i","src","%uhs","dst","%uhd"'
    Running as user "root" and group "root". This could be dangerous.

    ** (process:4792): WARNING **: 10:01:55.457: Preference "column.format" has been converted to "(null).format"
    Save your preferences to make this change permanent.
    Capturing on 'enp0s4'
        1 0.000000000 10.0.2.15 → 74.125.131.99 TCP 60414 → 23 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=1605324308 TSecr=0 WS=128 52:54:00:12:34:56 → 52:55:0a:00:02:02
        2 1.502272623 10.0.2.15 → 74.125.131.99 TCP 60416 → 23 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=1605325810 TSecr=0 WS=128 52:54:00:12:34:56 → 52:55:0a:00:02:03
        3 2.960769688 10.0.2.15 → 74.125.131.99 TCP 60418 → 23 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=1605327269 TSecr=0 WS=128 52:54:00:12:34:56 → 52:55:0a:00:02:03
        4 4.473304376 10.0.2.15 → 74.125.131.99 TCP 60420 → 23 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=1605328781 TSecr=0 WS=128 52:54:00:12:34:56 → 52:55:0a:00:02:02
        7 5.851979498 10.0.2.15 → 74.125.131.99 TCP 60422 → 23 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=1605330160 TSecr=0 WS=128 52:54:00:12:34:56 → 52:55:0a:00:02:02
    ^C10 packets captured
    tw:/usr/share/bcc/tools #

As you can see, connections go either to the gateway with MAC 52:55:0a:00:02:02 or to the one with MAC 52:55:0a:00:02:03, according to the hash value.

Disclaimer - this is TW, so it may be a Leap kernel issue.
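The mapping from hash value to gateway can be sketched as follows. This is an illustration under assumptions, not the kernel code: it assumes a 31-bit hash (maximum 0x7fffffff), two nexthops of equal weight, and that the lower half of the range maps to the first nexthop. The labels gw1 and gw2 and the function name pick_nexthop are placeholders; the five hash values are the ones from the fibhashsnoop run above.

```shell
# Pick a nexthop for a hash given as hex digits without the 0x prefix.
# Assumption: values up to 0x3fffffff select the first nexthop,
# anything above selects the second.
pick_nexthop() {
    value=$(( 0x$1 ))
    if [ "$value" -le $(( 0x3fffffff )) ]; then
        echo gw1
    else
        echo gw2
    fi
}

# The five hashes captured for source ports 60414 to 60422.
for h in 38aa371d 465d2b13 4f222ac6 ff99190 145c021e; do
    printf '%s -> %s\n' "$h" "$(pick_nexthop "$h")"
done
```

Run against those five hashes, the first, fourth and fifth fall in the lower half and the second and third in the upper half, which lines up with the two destination MAC addresses in the capture.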
Anyway, resulting hashes seem to be always the same. I am afraid I'm not as deep into kernel routing code to make any useful comment here.
Still true :) In particular I have no idea why routing decision is apparently performed three times.
participants (2): Andrei Borzenkov, Per Jessen