summaryrefslogtreecommitdiff
path: root/net/ipv4
Commit message (Collapse)AuthorAgeFilesLines
* tcp: timestamp SYN+DATA messagesEric Dumazet2014-03-102-0/+7
| | | | | | | | | | | | | | | | | | | | | | All skb in socket write queue should be properly timestamped. In case of FastOpen, we special case the SYN+DATA 'message' as we queue in socket wrote queue the two fallback skbs: 1) SYN message by itself. 2) DATA segment by itself. We should make sure these skbs have proper timestamps. Add a WARN_ON_ONCE() to eventually catch future violations. Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: do not leak non zero tstamp in output packetsEric Dumazet2014-03-071-0/+2
| | | | | | | | | | | | | | | | Usage of skb->tstamp should remain private to TCP stack (only set on packets on write queue, not on cloned ones) Otherwise, packets given to loopback interface with a non null tstamp can confuse netif_rx() / net_timestamp_check() Other possibility would be to clear tstamp in loopback_xmit(), as done in skb_scrub_packet() Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: Use NET_ADD_STATS instead of NET_ADD_STATS_BH in tcp_event_new_data_sent()David S. Miller2014-03-061-2/+2
| | | | | | | | | | Can be invoked from non-BH context. Based upon a patch by Eric Dumazet. Fixes: f19c29e3e391 ("tcp: snmp stats for Fast Open, SYN rtx, and data pkts") Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-03-059-69/+81
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/wireless/ath/ath9k/recv.c drivers/net/wireless/mwifiex/pcie.c net/ipv6/sit.c The SIT driver conflict consists of a bug fix being done by hand in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper was created (netdev_alloc_pcpu_stats()) which takes care of this. The two wireless conflicts were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointerXin Long2014-03-031-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | when ip_tunnel process multicast packets, it may check if the packet is looped back packet though 'rt_is_output_route(skb_rtable(skb))' in ip_tunnel_rcv(), but before that , skb->_skb_refdst has been dropped in iptunnel_pull_header(), so which leads to a panic. fix the bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681 Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix bogus RTT on special retransmissionYuchung Cheng2014-03-032-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RTT may be bogus with tall loss probe (TLP) when a packet is retransmitted and latter (s)acked without TCPCB_SACKED_RETRANS flag. For example, TLP calls __tcp_retransmit_skb() instead of tcp_retransmit_skb(). The skb timestamps are updated but the sacked flag is not marked with TCPCB_SACKED_RETRANS. As a result we'll get bogus RTT in tcp_clean_rtx_queue() or in tcp_sacktag_one() on spurious retransmission. The fix is to apply the sticky flag TCP_EVER_RETRANS to enforce Karn's check on RTT sampling. However this will disable F-RTO if timeout occurs after TLP, by resetting undo_marker in tcp_enter_loss(). We relax this check to only if any pending retransmists are still in-flight. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: tcp: use NET_INC_STATS()Eric Dumazet2014-02-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES can only be incremented in tcp_transmit_skb() from softirq (incoming message or timer activation), it is better to use NET_INC_STATS() instead of NET_INC_STATS_BH() as tcp_transmit_skb() can be called from process context. This will avoid copy/paste confusion when/if we want to add other SNMP counters in tcp_transmit_skb() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: ipv6: better estimate tunnel header cut for correct ufo handlingHannes Frederic Sowa2014-02-251-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the UFO fragmentation process does not correctly handle inner UDP frames. (The following tcpdumps are captured on the parent interface with ufo disabled while tunnel has ufo enabled, 2000 bytes payload, mtu 1280, both sit device): IPv6: 16:39:10.031613 IP (tos 0x0, ttl 64, id 3208, offset 0, flags [DF], proto IPv6 (41), length 1300) 192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 1240) 2001::1 > 2001::8: frag (0x00000001:0|1232) 44883 > distinct: UDP, length 2000 16:39:10.031709 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPv6 (41), length 844) 192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 784) 2001::1 > 2001::8: frag (0x00000001:0|776) 58979 > 46366: UDP, length 5471 We can see that fragmentation header offset is not correctly updated. (fragmentation id handling is corrected by 916e4cf46d0204 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")). IPv4: 16:39:57.737761 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPIP (4), length 1296) 192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57034, offset 0, flags [none], proto UDP (17), length 1276) 192.168.99.1.35961 > 192.168.99.2.distinct: UDP, length 2000 16:39:57.738028 IP (tos 0x0, ttl 64, id 3210, offset 0, flags [DF], proto IPIP (4), length 792) 192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57035, offset 0, flags [none], proto UDP (17), length 772) 192.168.99.1.13531 > 192.168.99.2.20653: UDP, length 51109 In this case fragmentation id is incremented and offset is not updated. First, I aligned inet_gso_segment and ipv6_gso_segment: * align naming of flags * ipv6_gso_segment: setting skb->encapsulation is unnecessary, as we always ensure that the state of this flag is left untouched when returning from upper gso segmenation function * ipv6_gso_segment: move skb_reset_inner_headers below updating the fragmentation header data, we don't care for updating fragmentation header data * remove currently unneeded comment indicating skb->encapsulation might get changed by upper gso_segment callback (gre and udp-tunnel reset encapsulation after segmentation on each fragment) If we encounter an IPIP or SIT gso skb we now check for the protocol == IPPROTO_UDP and that we at least have already traversed another ip(6) protocol header. The reason why we have to special case GSO_IPIP and GSO_SIT is that we reset skb->encapsulation to 0 while skb_mac_gso_segment the inner protocol of GSO_UDP_TUNNEL or GSO_GRE packets. Reported-by: Wolfgang Walter <linux@stwm.de> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: reduce the bloat caused by tcp_is_cwnd_limited()Eric Dumazet2014-02-241-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_is_cwnd_limited() allows GSO/TSO enabled flows to increase their cwnd to allow a full size (64KB) TSO packet to be sent. Non GSO flows only allow an extra room of 3 MSS. For most flows with a BDP below 10 MSS, this results in a bloat of cwnd reaching 90, and an inflate of RTT. Thanks to TSO auto sizing, we can restrict the bloat to the number of MSS contained in a TSO packet (tp->xmit_size_goal_segs), to keep original intent without performance impact. Because we keep cwnd small, it helps to keep TSO packet size to their optimal value. Example for a 10Mbit flow, with low TCP Small queue limits (no more than 2 skb in qdisc/device tx ring) Before patch : lpk51:~# ./ss -i dst lpk52:44862 | grep cwnd cubic wscale:6,6 rto:215 rtt:15.875/2.5 mss:1448 cwnd:96 ssthresh:96 send 70.1Mbps unacked:14 rcv_space:29200 After patch : lpk51:~# ./ss -i dst lpk52:52916 | grep cwnd cubic wscale:6,6 rto:206 rtt:5.206/0.036 mss:1448 cwnd:15 ssthresh:14 send 33.4Mbps unacked:4 rcv_space:29200 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Van Jacobson <vanj@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net-tcp: fastopen: fix high order allocationsEric Dumazet2014-02-222-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes two bugs in fastopen : 1) The tcp_sendmsg(..., @size) argument was ignored. Code was relying on user not fooling the kernel with iovec mismatches 2) When MTU is about 64KB, tcp_send_syn_data() attempts order-5 allocations, which are likely to fail when memory gets fragmented. Fixes: 783237e8daf13 ("net-tcp: Fast Open client - sending SYN-data") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Tested-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sit: fix panic with route cache in ip tunnelsNicolas Dichtel2014-02-201-3/+4
| | | | | | | | | | | | | | | | | | | | | | Bug introduced by commit 7d442fab0a67 ("ipv4: Cache dst in tunnels"). Because sit code does not call ip_tunnel_init(), the dst_cache was not initialized. CC: Tom Herbert <therbert@google.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ip_tunnel: Move ip_tunnel_get_stats64 into ip_tunnel_core.cDavid S. Miller2014-02-202-46/+46
| | | | | | | | | | | | | | net/built-in.o:(.rodata+0x1707c): undefined reference to `ip_tunnel_get_stats64' Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2014-02-192-5/+2
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: * Fix nf_trace in nftables if XT_TRACE=n, from Florian Westphal. * Don't use the fast payload operation in nf_tables if the length is not power of 2 or it is not aligned, from Nikolay Aleksandrov. * Fix missing break statement the inet flavour of nft_reject, which results in evaluating IPv4 packets with the IPv6 evaluation routine, from Patrick McHardy. * Fix wrong kconfig symbol in nft_meta to match the routing realm, from Paul Bolle. * Allocate the NAT null binding when creating new conntracks via ctnetlink to avoid that several packets race at initializing the the conntrack NAT extension, original patch from Florian Westphal, revisited version from me. * Fix DNAT handling in the snmp NAT helper, the same handling was being done for SNAT and DNAT and 2.4 already contains that fix, from Francois-Xavier Le Bail. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=nFlorian Westphal2014-02-171-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get lots of "TRACE: filter:output:policy:1 IN=..." warnings as several places will leave skb->nf_trace uninitialised. Unlike iptables tracing functionality is not conditional in nftables, so always copy/zero nf_trace setting when nftables is enabled. Move this into __nf_copy() helper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_nat_snmp_basic: fix duplicates in if/else branchesFX Le Bail2014-02-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | The solution was found by Patrick in 2.4 kernel sources. Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | tcp: snmp stats for Fast Open, SYN rtx, and data pktsYuchung Cheng2014-03-035-3/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the following snmp stats: TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse the remote does not accept it or the attempts timed out. TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down retransmissions into SYN, fast-retransmits, timeout retransmits, etc. TCPOrigDataSent: number of outgoing packets with original data (excluding retransmission but including data-in-SYN). This counter is different from TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is more useful to track the TCP retransmission rate. Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to TCPFastOpenPassive. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Lawrence Brakmo <brakmo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'master' of ↵David S. Miller2014-02-279-172/+575
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== This is the rework of the IPsec virtual tunnel interface for ipv4 to support inter address family tunneling and namespace crossing. The only change to the last RFC version is a compile fix for an odd configuration where CONFIG_XFRM is set but CONFIG_INET is not set. 1) Add and use a IPsec protocol multiplexer. 2) Add xfrm_tunnel_skb_cb to the skb common buffer to store a receive callback there. 3) Make vti work with i_key set by not including the i_key when comupting the hash for the tunnel lookup in case of vti tunnels. 4) Update ip_vti to use it's own receive hook. 5) Remove xfrm_tunnel_notifier, this is replaced by the IPsec protocol multiplexer. 6) We need to be protocol family indepenent, so use the on xfrm_lookup returned dst_entry instead of the ipv4 rtable in vti_tunnel_xmit(). 7) Add support for inter address family tunneling. 8) Check if the tunnel endpoints of the xfrm state and the vti interface are matching and return an error otherwise. 8) Enable namespace crossing tor vti devices. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | vti4: Enable namespace changingSteffen Klassert2014-02-251-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | vti4 is now fully namespace aware, so allow namespace changing for vti devices Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | vti4: Check the tunnel endpoints of the xfrm state and the vti interfaceSteffen Klassert2014-02-251-5/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The tunnel endpoints of the xfrm_state we got from the xfrm_lookup must match the tunnel endpoints of the vti interface. This patch ensures this matching. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | vti4: Support inter address family tunneling.Steffen Klassert2014-02-251-14/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With this patch we can tunnel ipv6 traffic via a vti4 interface. A vti4 interface can now have an ipv6 address and ipv6 traffic can be routed via a vti4 interface. The resulting traffic is xfrm transformed and tunneled throuhg ipv4 if matching IPsec policies and states are present. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | vti4: Use the on xfrm_lookup returned dst_entry directlySteffen Klassert2014-02-251-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to be protocol family indepenent to support inter addresss family tunneling with vti. So use a dst_entry instead of the ipv4 rtable in vti_tunnel_xmit. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | xfrm4: Remove xfrm_tunnel_notifierSteffen Klassert2014-02-251-68/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This was used from vti and is replaced by the IPsec protocol multiplexer hooks. It is now unused, so remove it. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | vti: Update the ipv4 side to use it's own receive hook.Steffen Klassert2014-02-251-47/+187
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With this patch, vti uses the IPsec protocol multiplexer to register it's own receive side hooks for ESP, AH and IPCOMP. Vti now does the following on receive side: 1. Do an input policy check for the IPsec packet we received. This is required because this packet could be already prosecces by IPsec, so an inbuond policy check is needed. 2. Mark the packet with the i_key. The policy and the state must match this key now. Policy and state belong to the outer namespace and policy enforcement is done at the further layers. 3. Call the generic xfrm layer to do decryption and decapsulation. 4. Wait for a callback from the xfrm layer to properly clean the skb to not leak informations on namespace and to update the device statistics. On transmit side: 1. Mark the packet with the o_key. The policy and the state must match this key now. 2. Do a xfrm_lookup on the original packet with the mark applied. 3. Check if we got an IPsec route. 4. Clean the skb to not leak informations on namespace transitions. 5. Attach the dst_enty we got from the xfrm_lookup to the skb. 6. Call dst_output to do the IPsec processing. 7. Do the device statistics. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | ip_tunnel: Make vti work with i_key setSteffen Klassert2014-02-251-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Vti uses the o_key to mark packets that were transmitted or received by a vti interface. Unfortunately we can't apply different marks to in and outbound packets with only one key availabe. Vti interfaces typically use wildcard selectors for vti IPsec policies. On forwarding, the same output policy will match for both directions. This generates a loop between the IPsec gateways until the ttl of the packet is exceeded. The gre i_key/o_key are usually there to find the right gre tunnel during a lookup. When vti uses the i_key to mark packets, the tunnel lookup does not work any more because vti does not use the gre keys as a hash key for the lookup. This patch workarounds this my not including the i_key when comupting the hash for the tunnel lookup in case of vti tunnels. With this we have separate keys available for the transmitting and receiving side of the vti interface. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | xfrm: Add xfrm_tunnel_skb_cb to the skb common bufferSteffen Klassert2014-02-251-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IPsec vti_rcv needs to remind the tunnel pointer to check it later at the vti_rcv_cb callback. So add this pointer to the IPsec common buffer, initialize it and check it to avoid transport state matching of a tunneled packet. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | ipcomp4: Use the IPsec protocol multiplexer APISteffen Klassert2014-02-251-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | Switch ipcomp4 to use the new IPsec protocol multiplexer. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | ah4: Use the IPsec protocol multiplexer APISteffen Klassert2014-02-251-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | Switch ah4 to use the new IPsec protocol multiplexer. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | esp4: Use the IPsec protocol multiplexer APISteffen Klassert2014-02-251-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | Switch esp4 to use the new IPsec protocol multiplexer. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | xfrm4: Add IPsec protocol multiplexerSteffen Klassert2014-02-253-10/+269
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch add an IPsec protocol multiplexer. With this it is possible to add alternative protocol handlers as needed for IPsec virtual tunnel interfaces. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | | tcp: switch rtt estimations to usec resolutionEric Dumazet2014-02-2614-160/+158
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Upcoming congestion controls for TCP require usec resolution for RTT estimations. Millisecond resolution is simply not enough these days. FQ/pacing in DC environments also require this change for finer control and removal of bimodal behavior due to the current hack in tcp_update_pacing_rate() for 'small rtt' TCP_CONG_RTT_STAMP is no longer needed. As Julian Anastasov pointed out, we need to keep user compatibility : tcp_metrics used to export RTT and RTTVAR in msec resolution, so we added RTT_US and RTTVAR_US. An iproute2 patch is needed to use the new attributes if provided by the kernel. In this example ss command displays a srtt of 32 usecs (10Gbit link) lpk51:~# ./ss -i dst lpk52 Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port tcp ESTAB 0 1 10.246.11.51:42959 10.246.11.52:64614 cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448 cwnd:10 send 3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559 Updated iproute2 ip command displays : lpk51:~# ./ip tcp_metrics | grep 10.246.11.52 10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source 10.246.11.51 Old binary displays : lpk51:~# ip tcp_metrics | grep 10.246.11.52 10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source 10.246.11.51 With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Larry Brakmo <brakmo@google.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMITHannes Frederic Sowa2014-02-262-7/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IP_PMTUDISC_INTERFACE has a design error: because it does not allow the generation of fragments if the interface mtu is exceeded, it is very hard to make use of this option in already deployed name server software for which I introduced this option. This patch adds yet another new IP_MTU_DISCOVER option to not honor any path mtu information and not accepting new icmp notifications destined for the socket this option is enabled on. But we allow outgoing fragmentation in case the packet size exceeds the outgoing interface mtu. As such this new option can be used as a drop-in replacement for IP_PMTUDISC_DONT, which is currently in use by most name server software making the adoption of this option very smooth and easy. The original advantage of IP_PMTUDISC_INTERFACE is still maintained: ignoring incoming path MTU updates and not honoring discovered path MTUs in the output path. Fixes: 482fc6094afad5 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE") Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | ipv4: use ip_skb_dst_mtu to determine mtu in ip_fragmentHannes Frederic Sowa2014-02-261-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket is attached to the skb (in case of forwarding) or determines the mtu like we do in ip_finish_output, which actually checks if we should branch to ip_fragment. Thus use the same function to determine the mtu here, too. This is important for the introduction of IP_PMTUDISC_OMIT, where we want the packets getting cut in pieces of the size of the outgoing interface mtu. IPv6 already does this correctly. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: tcp: add mib counters to track zero window transitionsFlorian Westphal2014-02-262-1/+14
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Three counters are added: - one to track when we went from non-zero to zero window - one to track the reverse - one counter incremented when we want to announce zero window, but can't because we would shrink current window. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'master' of ↵David S. Miller2014-02-241-11/+42
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== 1) Introduce skb_to_sgvec_nomark function to add further data to the sg list without calling sg_unmark_end first. Needed to add extended sequence number informations. From Fan Du. 2) Add IPsec extended sequence numbers support to the Authentication Header protocol for ipv4 and ipv6. From Fan Du. 3) Make the IPsec flowcache namespace aware, from Fan Du. 4) Avoid creating temporary SA for every packet when no key manager is registered. From Horia Geanta. 5) Support filtering of SA dumps to show only the SAs that match a given filter. From Nicolas Dichtel. 6) Remove caching of xfrm_policy_sk_bundles. The cached socket policy bundles are never used, instead we create a new cache entry whenever xfrm_lookup() is called on a socket policy. Most protocols cache the used routes to the socket, so this caching is not needed. 7) Fix a forgotten SADB_X_EXT_FILTER length check in pfkey, from Nicolas Dichtel. 8) Cleanup error handling of xfrm_state_clone. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | {IPv4,xfrm} Add ESN support for AH ingress partFan Du2014-02-121-5/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch add esn support for AH input stage by attaching upper 32bits sequence number right after packet payload as specified by RFC 4302. Then the ICV value will guard upper 32bits sequence number as well when packet getting in. Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | {IPv4,xfrm} Add ESN support for AH egress partFan Du2014-02-121-6/+20
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch add esn support for AH output stage by attaching upper 32bits sequence number right after packet payload as specified by RFC 4302. Then the ICV value will guard upper 32bits sequence number as well when packet going out. Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | tcp: use zero-window when free_space is lowFlorian Westphal2014-02-191-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the kernel tries to announce a zero window when free_space is below the current receiver mss estimate. When a sender is transmitting small packets and reader consumes data slowly (or not at all), receiver might be unable to shrink the receive win because a) we cannot withdraw already-commited receive window, and, b) we have to round the current rwin up to a multiple of the wscale factor, else we would shrink the current window. This causes the receive buffer to fill up until the rmem limit is hit. When this happens, we start dropping packets. Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards rmem[2] even if socket is not being read from. As we cannot avoid the "current_win is rounded up to multiple of mss" issue [we would violate a) above] at least try to prevent the receive buf growth towards tcp_rmem[2] limit by attempting to move to zero-window announcement when free_space becomes less than 1/16 of the current allowed receive buffer maximum. If tcp_rmem[2] is large, this will increase our chances to get a zero-window announcement out in time. Reproducer: On server: $ nc -l -p 12345 <suspend it: CTRL-Z> Client: #!/usr/bin/env python import socket import time sock = socket.socket() sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) sock.connect(("192.168.4.1", 12345)); while True: sock.send('A' * 23) time.sleep(0.005) socket buffer on server-side will grow until tcp_rmem[2] is hit, at which point the client rexmits data until -EDTIMEOUT: tcp_data_queue invokes tcp_try_rmem_schedule which will call tcp_prune_queue which calls tcp_clamp_window(). And that function will grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2]. Thanks to Eric Dumazet for running regression tests. Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsgHannes Frederic Sowa2014-02-194-4/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case we decide in udp6_sendmsg to send the packet down the ipv4 udp_sendmsg path because the destination is either of family AF_INET or the destination is an ipv4 mapped ipv6 address, we don't honor the maybe specified ipv4 mapped ipv6 address in IPV6_PKTINFO. We simply can check for this option in ip_cmsg_send because no calls to ipv6 module functions are needed to do so. Reported-by: Gert Doering <gert@space.net> Cc: Tore Anderson <tore@fud.no> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-02-193-7/+79
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/bonding/bond_3ad.h drivers/net/bonding/bond_main.c Two minor conflicts in bonding, both of which were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: fix counter in_slow_totDuan Jiong2014-02-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | since commit 89aef8921bf("ipv4: Delete routing cache."), the counter in_slow_tot can't work correctly. The counter in_slow_tot increase by one when fib_lookup() return successfully in ip_route_input_slow(), but actually the dst struct maybe not be created and cached, so we can increase in_slow_tot after the dst struct is created. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: distinguish EHOSTUNREACH from the ENETUNREACHDuan Jiong2014-02-161-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | since commit 251da413("ipv4: Cache ip_error() routes even when not forwarding."), the counter IPSTATS_MIB_INADDRERRORS can't work correctly, because the value of err was always set to ENETUNREACH. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: ipconfig.c: add parentheses in an if statementFX Le Bail2014-02-141-1/+1
| | | | | | | | | | | | | | | | | | | | | Even if the 'time_before' macro expand with parentheses, the look is bad. Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: ip, ipv6: handle gso skbs in forwarding pathFlorian Westphal2014-02-131-2/+69
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Marcelo Ricardo Leitner reported problems when the forwarding link path has a lower mtu than the incoming one if the inbound interface supports GRO. Given: Host <mtu1500> R1 <mtu1200> R2 Host sends tcp stream which is routed via R1 and R2. R1 performs GRO. In this case, the kernel will fail to send ICMP fragmentation needed messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu checks in forward path. Instead, Linux tries to send out packets exceeding the mtu. When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does not fragment the packets when forwarding, and again tries to send out packets exceeding R1-R2 link mtu. This alters the forwarding dstmtu checks to take the individual gso segment lengths into account. For ipv6, we send out pkt too big error for gso if the individual segments are too big. For ipv4, we either send icmp fragmentation needed, or, if the DF bit is not set, perform software segmentation and let the output path create fragments when the packet is leaving the machine. It is not 100% correct as the error message will contain the headers of the GRO skb instead of the original/segmented one, but it seems to work fine in my (limited) tests. Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid sofware segmentation. However it turns out that skb_segment() assumes skb nr_frags is related to mss size so we would BUG there. I don't want to mess with it considering Herbert and Eric disagree on what the correct behavior should be. Hannes Frederic Sowa notes that when we would shrink gso_size skb_segment would then also need to deal with the case where SKB_MAX_FRAGS would be exceeded. This uses sofware segmentation in the forward path when we hit ipv4 non-DF packets and the outgoing link mtu is too small. Its not perfect, but given the lack of bug reports wrt. GRO fwd being broken this is a rare case anyway. Also its not like this could not be improved later once the dust settles. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip_tunnel: return more precise errno value when adding tunnel failsFlorian Westphal2014-02-171-5/+10
| | | | | | | | | | | | | | | | | | | | | | Currently this always returns ENOBUFS, because the return value of __ip_tunnel_create is discarded. A more common failure is a duplicate name (EEXIST). Propagate the real error code so userspace can display a more meaningful error message. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: add pacing_rate information into tcp_infoEric Dumazet2014-02-141-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add two new fields to struct tcp_info, to report sk_pacing_rate and sk_max_pacing_rate to monitoring applications, as ss from iproute2. User exported fields are 64bit, even if kernel is currently using 32bit fields. lpaa5:~# ss -i .. skmem:(r0,rb357120,t0,tb2097152,f1584,w1980880,o0,bl0) ts sack cubic wscale:6,6 rto:400 rtt:0.875/0.75 mss:1448 cwnd:1 ssthresh:12 send 13.2Mbps pacing_rate 3336.2Mbps unacked:15 retrans:1/5448 lost:15 rcv_space:29200 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: introduce netdev_alloc_pcpu_stats() for driversWANG Cong2014-02-141-8/+2
| | | | | | | | | | | | | | | | | | There are many drivers calling alloc_percpu() to allocate pcpu stats and then initializing ->syncp. So just introduce a helper function for them. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: ip_forward: perform skb->pkt_type check at the beginningDenis Kirjanov2014-02-131-3/+4
| | | | | | | | | | | | | | | | | | | | Packets which have L2 address different from ours should be already filtered before entering into ip_forward(). Perform that check at the beginning to avoid processing such packets. Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: remove unnecessary return'sstephen hemminger2014-02-132-2/+0
| | | | | | | | | | | | | | | | | | | | | | One of my pet coding style peeves is the practice of adding extra return; at the end of function. Kill several instances of this in network code. I suppose some coccinelle wizardy could do this automatically. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: remove unused min_cwnd member of tcp_congestion_opsStanislav Fomichev2014-02-139-18/+0
|/ | | | | | | | | | Commit 684bad110757 "tcp: use PRR to reduce cwin in CWR state" removed all calls to min_cwnd, so we can safely remove it. Also, remove tcp_reno_min_cwnd because it was only used for min_cwnd. Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: tsq: fix nonagle handlingJohn Ogness2014-02-101-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 46d3ceabd8d9 ("tcp: TCP Small Queues") introduced a possible regression for applications using TCP_NODELAY. If TCP session is throttled because of tsq, we should consult tp->nonagle when TX completion is done and allow us to send additional segment, especially if this segment is not a full MSS. Otherwise this segment is sent after an RTO. [edumazet] : Cooked the changelog, added another fix about testing sk_wmem_alloc twice because TX completion can happen right before setting TSQ_THROTTLED bit. This problem is particularly visible with recent auto corking, but might also be triggered with low tcp_limit_output_bytes values or NIC drivers delaying TX completion by hundred of usec, and very low rtt. Thomas Glanzmann for example reported an iscsi regression, caused by tcp auto corking making this bug quite visible. Fixes: 46d3ceabd8d9 ("tcp: TCP Small Queues") Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Thomas Glanzmann <thomas@glanzmann.de> Signed-off-by: David S. Miller <davem@davemloft.net>