traffic_replay: Improve assign_groups() performance with large domains

When assigning 10,000 users to 15 groups each (on average), assign_groups() would take over 30 seconds. This did not include any DB operations whatsoever. This patch improves things, so that it takes less than a second in the same situation.

The problem was that the code was looping ~23 million times where the 'random.random() < probability * 10000' condition was not met. Individual group/user probabilities get lower as the number of groups/users increases, so with large numbers of users the calculated probability was very small most of the time and didn't meet the threshold.

This patch changes it so we can select a user/group in one go, avoiding the need to loop multiple times.

Basically we distribute the users (or groups) between 0.0 and 1.0, so that each user has their own 'slice', and this slice is proportional to their weighted probability. random.random() generates a value between 0.0 and 1.0, so we can use this to pick a 'slice' (or rather, we use this as an index into the list, using bisect.bisect()). Users/groups with larger probabilities end up with larger slices, so are more likely to get picked.

The end result is roughly the same distribution as before, although the first 10 or so user/groups seem to get picked more frequently, so the weighted-probability calculations may need tweaking some more.

Signed-off-by: Tim Beale <timbeale@catalyst.net.nz>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
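
The 'slice' idea described above can be tried outside Samba with a minimal standalone sketch. It mirrors the shape of the patched code but is not the Samba module itself: pick_index() and its optional r argument are hypothetical helpers added here so a pick can be reproduced with a fixed value.

```python
import bisect
import random


def cumulative_distribution(weights):
    """Return running totals of the weights, scaled so the last entry is 1.0."""
    total = sum(weights)
    cumulative = 0.0
    dist = []
    for weight in weights:
        cumulative += weight
        dist.append(cumulative / total)
    return dist


def pick_index(dist, r=None):
    """Pick a zero-based index from a cumulative distribution.

    r is a hypothetical hook for testing; by default a fresh
    random.random() value in [0.0, 1.0) is used.
    """
    if r is None:
        r = random.random()
    # bisect.bisect() returns the position of the first entry greater
    # than r, which is the index of the slice that contains r.
    return bisect.bisect(dist, r)


# Three made-up weights: the first item owns the slice [0.0, 0.5), the
# second [0.5, 0.75) and the third [0.75, 1.0).
dist = cumulative_distribution([0.5, 0.25, 0.25])
```

Each pick costs one random number and one binary search, so no pass is ever thrown away, however small the individual weights become.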
author: Tim Beale <timbeale@catalyst.net.nz> 2018-10-15 16:24:00 +1300
committer: Tim Beale <timbeale@samba.org> 2018-11-04 23:55:16 +0100
commit: 898e6b4332e4641ed8377ff2db398a295c37cebf (patch)
tree: 0c57c0af2790e962bb15d31283c27f8b19795da5 /python/samba/emulate
parent: 18740ec0dd5c0ed59fa03b2d9d0d34ea11436b00 (diff)
download: samba-898e6b4332e4641ed8377ff2db398a295c37cebf.tar.gz
1 files changed, 36 insertions, 12 deletions
diff --git a/python/samba/emulate/traffic.py b/python/samba/emulate/traffic.py
index 4e2c8f38014..8eb10ee8819 100644
--- a/python/samba/emulate/traffic.py
+++ b/python/samba/emulate/traffic.py
@@ -52,6 +52,7 @@ from samba import gensec
 from samba import sd_utils
 from samba.compat import get_string
 from samba.logger import get_samba_logger
+import bisect
 
 SLEEP_OVERHEAD = 3e-4
 
@@ -1824,34 +1825,57 @@ class GroupAssignments(object):
                                               users_added,
                                               group_memberships)
 
+    def cumulative_distribution(self, weights):
+        # make sure the probabilities conform to a cumulative distribution
+        # spread between 0.0 and 1.0. Dividing by the weighted total gives each
+        # probability a proportional share of 1.0. Higher probabilities get a
+        # bigger share, so are more likely to be picked. We use the cumulative
+        # value, so we can use random.random() as a simple index into the list
+        dist = []
+        total = sum(weights)
+        cumulative = 0.0
+        for probability in weights:
+            cumulative += probability
+            dist.append(cumulative / total)
+        return dist
+
     def generate_user_distribution(self, n):
         """Probability distribution of a user belonging to a group.
         """
-        self.user_dist = []
+        # Assign a weighted probability to each user. Probability decreases
+        # as the user-ID increases
+        weights = []
         for x in range(1, n + 1):
             p = 1 / (x + 0.001)
-            self.user_dist.append(p)
+            weights.append(p)
 
-        self.num_users = n
+        # convert the weights to a cumulative distribution between 0.0 and 1.0
+        self.user_dist = self.cumulative_distribution(weights)
 
     def generate_group_distribution(self, n):
         """Probability distribution of a group containing a user."""
-        self.group_dist = []
+
+        # Assign a weighted probability to each group. Probability decreases
+        # as the group-ID increases
+        weights = []
         for x in range(1, n + 1):
             p = 1 / (x**1.3)
-            self.group_dist.append(p)
+            weights.append(p)
 
-        self.num_groups = n
+        # convert the weights to a cumulative distribution between 0.0 and 1.0
+        self.group_dist = self.cumulative_distribution(weights)
 
     def generate_random_membership(self):
         """Returns a randomly generated user-group membership"""
-        while True:
-            user        = random.randint(0, self.num_users - 1)
-            group       = random.randint(0, self.num_groups - 1)
-            probability = self.group_dist[group] * self.user_dist[user]
 
-            if random.random() < probability * 10000:
-                return user, group
+        # the list items are cumulative distribution values between 0.0 and
+        # 1.0, which makes random() a handy way to index the list to get a
+        # weighted random user/group. (Here the user/group returned are
+        # zero-based array indexes)
+        user = bisect.bisect(self.user_dist, random.random())
+        group = bisect.bisect(self.group_dist, random.random())
+
+        return user, group
 
     def assign_groups(self, number_of_groups, groups_added,
                       number_of_users, users_added, group_memberships):
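
The loop removed from generate_random_membership() picked a user and a group uniformly and then kept the pair only if random.random() fell below the scaled product of their weights. The sketch below estimates how often a single pass of that loop would have succeeded, using the same weight formulas as the diff. acceptance_rate() is a hypothetical helper written for this illustration, not part of Samba, and it does not attempt to reproduce the exact figures quoted in the commit message, since the number of groups in that test is not stated.

```python
def acceptance_rate(num_users, num_groups, scale=10000):
    """Chance that one pass of the old loop returns a (user, group) pair.

    Averages min(1, user_weight * group_weight * scale) over every
    pair, because the old loop chose both indexes uniformly at random.
    This brute-force version is O(num_users * num_groups).
    """
    user_weights = [1 / (x + 0.001) for x in range(1, num_users + 1)]
    group_weights = [1 / (x ** 1.3) for x in range(1, num_groups + 1)]

    accepted = 0.0
    for group_weight in group_weights:
        for user_weight in user_weights:
            # random.random() < p always succeeds once p reaches 1.0
            accepted += min(1.0, group_weight * user_weight * scale)
    return accepted / (num_users * num_groups)
```

The expected number of passes per returned membership is 1 / acceptance_rate(...). With a small domain every pass succeeds, and the rate drops as more low-weight users and groups are added, which is why the old loop spent most of its time on rejected pairs.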