From 2b5f9dad32ed19e8db3b0f10a84aa824a219803b Mon Sep 17 00:00:00 2001
From: Andrei Vagin <avagin@gmail.com>
Date: Tue, 20 Sep 2022 17:31:19 -0700
Subject: fs/exec: switch timens when a task gets a new mm

Changing a time namespace requires remapping a vvar page, so we don't want
to allow doing that if any other tasks can use the same mm.

Currently, we install a time namespace when a task is created with a new
vm. exec() is another case when a task gets a new mm and so it can switch
a time namespace safely, but it isn't handled now.

One more issue of the current interface is that clone() with CLONE_VM isn't
allowed if the current task has unshared a time namespace
(timens_for_children doesn't match the current timens).

Both these issues make some inconvenience for users. For example, Alexey
and Florian reported that posix_spawn() uses vfork+exec and this pattern
doesn't work with time namespaces due to the both described issues.
LXC needed to workaround the exec() issue by calling setns.

In the commit 133e2d3e81de5 ("fs/exec: allow to unshare a time namespace on
vfork+exec"), we tried to fix these issues with minimal impact on UAPI. But
it adds extra complexity and some undesirable side effects. Eric suggested
fixing the issues properly because here are all the reasons to suppose that
there are no users that depend on the old behavior.

Cc: Alexey Izbyshev <izbyshev@ispras.ru>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Origin-author: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220921003120.209637-1-avagin@google.com
---
 fs/exec.c | 5 +++++
 1 file changed, 5 insertions(+)

(limited to 'fs/exec.c')

diff --git a/fs/exec.c b/fs/exec.c
index 32dc8cf5fceb..34e6a2e53268 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -64,6 +64,7 @@
 #include <linux/io_uring.h>
 #include <linux/syscall_user_dispatch.h>
 #include <linux/coredump.h>
+#include <linux/time_namespace.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1297,6 +1298,10 @@ int begin_new_exec(struct linux_binprm * bprm)
 
 	bprm->mm = NULL;
 
+	retval = exec_task_namespaces();
+	if (retval)
+		goto out_unlock;
+
 #ifdef CONFIG_POSIX_TIMERS
 	spin_lock_irq(&me->sighand->siglock);
 	posix_cpu_timers_exit(me);
-- 
cgit v1.2.1


From 275498a98b1fe77deebddfc4f8986c0cf2c3ced7 Mon Sep 17 00:00:00 2001
From: Kees Cook <keescook@chromium.org>
Date: Tue, 18 Oct 2022 00:17:24 -0700
Subject: exec: Add comments on check_unsafe_exec() fs counting

Add some comments about what the fs counting is doing in
check_unsafe_exec() and how it relates to the call graph.
Specifically, we can't force an unshare of the fs because
of at least Chrome:
https://lore.kernel.org/lkml/86CE201B-5632-4BB7-BCF6-7CB2C2895409@chromium.org/

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20221018071537.never.662-kees@kernel.org
---
 fs/exec.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

(limited to 'fs/exec.c')

diff --git a/fs/exec.c b/fs/exec.c
index 34e6a2e53268..b3e87bd50962 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1573,6 +1573,12 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	if (task_no_new_privs(current))
 		bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
 
+	/*
+	 * If another task is sharing our fs, we cannot safely
+	 * suid exec because the differently privileged task
+	 * will be able to manipulate the current directory, etc.
+	 * It would be nice to force an unshare instead...
+	 */
 	t = p;
 	n_fs = 1;
 	spin_lock(&p->fs->lock);
@@ -1753,6 +1759,7 @@ static int search_binary_handler(struct linux_binprm *bprm)
 	return retval;
 }
 
+/* binfmt handlers will call back into begin_new_exec() on success. */
 static int exec_binprm(struct linux_binprm *bprm)
 {
 	pid_t old_pid, old_vpid;
@@ -1811,6 +1818,11 @@ static int bprm_execve(struct linux_binprm *bprm,
 	if (retval)
 		return retval;
 
+	/*
+	 * Check for unsafe execution states before exec_binprm(), which
+	 * will call back into begin_new_exec(), into bprm_creds_from_file(),
+	 * where setuid-ness is evaluated.
+	 */
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
 
-- 
cgit v1.2.1


From 8f6e3f9e5a0f58e458a348b7e36af11d0e9702af Mon Sep 17 00:00:00 2001
From: Kees Cook <keescook@chromium.org>
Date: Tue, 18 Oct 2022 00:14:20 -0700
Subject: binfmt: Fix whitespace issues

Fix the annoying whitespace issues that have been following these files
around for years.

Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/r/20221018071350.never.230-kees@kernel.org
---
 fs/exec.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'fs/exec.c')

diff --git a/fs/exec.c b/fs/exec.c
index b3e87bd50962..1644bf10fb9d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -172,7 +172,7 @@ SYSCALL_DEFINE1(uselib, const char __user *, library)
 exit:
 	fput(file);
 out:
-  	return error;
+	return error;
 }
 #endif /* #ifdef CONFIG_USELIB */
 
-- 
cgit v1.2.1


From bfb4a2b95875a47a01234f2de113ec089d524e71 Mon Sep 17 00:00:00 2001
From: Rolf Eike Beer <eb@emlix.com>
Date: Wed, 19 Oct 2022 09:32:35 +0200
Subject: exec: simplify initial stack size expansion

I had a hard time trying to understand completely why it is using vm_end in
one side of the expression and vm_start in the other one, and using
something in the "if" clause that is not an exact copy of what is used
below. The whole point is that the stack_size variable that was used in the
"if" clause is the difference between vm_start and vm_end, which is not far
away but makes this thing harder to read than it must be.

Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/2017429.gqNitNVd0C@mobilepool36.emlix.com
---
 fs/exec.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

(limited to 'fs/exec.c')

diff --git a/fs/exec.c b/fs/exec.c
index 1644bf10fb9d..9585bc1bc970 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -843,16 +843,13 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	 * will align it up.
 	 */
 	rlim_stack = bprm->rlim_stack.rlim_cur & PAGE_MASK;
+
+	stack_expand = min(rlim_stack, stack_size + stack_expand);
+
 #ifdef CONFIG_STACK_GROWSUP
-	if (stack_size + stack_expand > rlim_stack)
-		stack_base = vma->vm_start + rlim_stack;
-	else
-		stack_base = vma->vm_end + stack_expand;
+	stack_base = vma->vm_start + stack_expand;
 #else
-	if (stack_size + stack_expand > rlim_stack)
-		stack_base = vma->vm_end - rlim_stack;
-	else
-		stack_base = vma->vm_start - stack_expand;
+	stack_base = vma->vm_end - stack_expand;
 #endif
 	current->mm->start_stack = bprm->p;
 	ret = expand_stack(vma, stack_base);
-- 
cgit v1.2.1


From cd57e443831d8eeb083c7165bce195d886e216d4 Mon Sep 17 00:00:00 2001
From: Kees Cook <keescook@chromium.org>
Date: Thu, 17 Nov 2022 16:31:55 -0800
Subject: exec: Remove FOLL_FORCE for stack setup

It does not appear that FOLL_FORCE should be needed for setting up the
stack pages. They are allocated using the nascent brpm->vma, which was
newly created with VM_STACK_FLAGS, which an arch can override, but they
all appear to include VM_WRITE | VM_MAYWRITE. Remove FOLL_FORCE.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/lkml/202211171439.CDE720EAD@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/exec.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'fs/exec.c')

diff --git a/fs/exec.c b/fs/exec.c
index 9585bc1bc970..870a707b5d3b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -200,7 +200,7 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 {
 	struct page *page;
 	int ret;
-	unsigned int gup_flags = FOLL_FORCE;
+	unsigned int gup_flags = 0;
 
 #ifdef CONFIG_STACK_GROWSUP
 	if (write) {
-- 
cgit v1.2.1