README.rst


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296

OSProfiler
==========

OSProfiler is an OpenStack cross-project profiling library.


Background
----------

OpenStack consists of multiple projects. Each project, in turn, is composed of
multiple services. To process some request, e.g. to boot a virtual machine,
OpenStack uses multiple services from different projects. In the case something
works too slowly, it's extremely complicated to understand what exactly goes
wrong and to locate the bottleneck.

To resolve this issue, we introduce a tiny but powerful library,
**osprofiler**, that is going to be used by all OpenStack projects and their
python clients. To be able to generate 1 trace per request, that goes through
all involved services, and builds a tree of calls (see an
`example <http://pavlovic.me/rally/profiler/>`_).


Why not cProfile and etc?
-------------------------

**The scope of this library is quite different:**

* We are interested in getting one trace of points from different service,
  not tracing all python calls inside one process.

* This library should be easy integratable in OpenStack. This means that:

  * It shouldn't require too many changes in code bases of integrating
    projects.

  * We should be able to turn it off fully.

  * We should be able to keep it turned on in lazy mode in production
    (e.g. admin should be able to "trace" on request).


OSprofiler API version 0.2.5
----------------------------

There are a couple of things that you should know about API before using it.


* **4 ways to add a new trace point**

    .. parsed-literal::

        from osprofiler import profiler

        def some_func():
            profiler.start("point_name", {"any_key": "with_any_value"})
            # your code
            profiler.stop({"any_info_about_point": "in_this_dict"})


        @profiler.Trace("point_name",
                        info={"any_info_about_point": "in_this_dict"},
                        hide_args=False)
        def some_func2(*args, **kwargs):
            # If you need to hide args in profile info, put hide_args=True
            pass

        def some_func3():
            with profiler.trace("point_name",
                                info={"any_key": "with_any_value"}):
                # some code here

        @profiler.trace_cls("point_name", info={}, hide_args=False,
                            trace_private=False)
        class TracedClass(object):

            def traced_method(self):
                pass

            def _traced_only_if_trace_private_true(self):
                 pass

* **How profiler works?**

  * **@profiler.Trace()** and **profiler.trace()** are just syntax sugar,
    that just calls **profiler.start()** & **profiler.stop()** methods.

  * Every call of **profiler.start()** & **profiler.stop()** sends to
    **collector** 1 message. It means that every trace point creates 2 records
    in the collector. *(more about collector & records later)*

  * Nested trace points are supported. The sample below produces 2 trace points:

      .. parsed-literal::

          profiler.start("parent_point")
          profiler.start("child_point")
          profiler.stop()
          profiler.stop()

      The implementation is quite simple. Profiler has one stack that contains
      ids of all trace points. E.g.:

      .. parsed-literal::

          profiler.start("parent_point") # trace_stack.push(<new_uuid>)
                                         # send to collector -> trace_stack[-2:]

          profiler.start("parent_point") # trace_stack.push(<new_uuid>)
                                         # send to collector -> trace_stack[-2:]
          profiler.stop()                # send to collector -> trace_stack[-2:]
                                         # trace_stack.pop()

          profiler.stop()                # send to collector -> trace_stack[-2:]
                                         # trace_stack.pop()

      It's simple to build a tree of nested trace points, having
      **(parent_id, point_id)** of all trace points.

* **Process of sending to collector**

  Trace points contain 2 messages (start and stop). Messages like below are
  sent to a collector:

  .. parsed-literal::
    {
        "name": <point_name>-(start|stop)
        "base_id": <uuid>,
        "parent_id": <uuid>,
        "trace_id": <uuid>,
        "info": <dict>
    }

   * base_id - <uuid> that is equal for all trace points that belong
               to one trace, this is done to simplify the process of retrieving
               all trace points related to one trace from collector
   * parent_id - <uuid> of parent trace point
   * trace_id - <uuid> of current trace point
   * info - it's dictionary that contains user information passed via calls of
            profiler **start()** & **stop()** methods.


* **Setting up the collector.**

    The profiler doesn't include a trace point collector. The user/developer
    should instead provide a method that sends messages to a collector. Let's
    take a look at a trivial sample, where the collector is just a file:

    .. parsed-literal::

        import json

        from osprofiler import notifier

        def send_info_to_file_collector(info, context=None):
            with open("traces", "a") as f:
                f.write(json.dumps(info))

        notifier.set(send_info_to_file_collector)

    So now on every **profiler.start()** and **profiler.stop()** call we will
    write info about the trace point to the end of the **traces** file.


* **Initialization of profiler.**

    If profiler is not initialized, all calls to **profiler.start()** and
    **profiler.stop()** will be ignored.

    Initialization is a quite simple procedure.

    .. parsed-literal::

        from osprofiler import profiler

        profiler.init("SECRET_HMAC_KEY", base_id=<uuid>, parent_id=<uuid>)

   ``SECRET_HMAC_KEY`` - will be discussed later, because it's related to the
    integration of OSprofiler & OpenStack.

    **base_id** and **trace_id** will be used to initialize stack_trace in
    profiler, e.g. stack_trace = [base_id, trace_id].

Integration with OpenStack
--------------------------

There are 4 topics related to integration OSprofiler & `OpenStack`_:

* **What we should use as a centralized collector?**

  We decided to use `Ceilometer`_, because:

  * It's already integrated in OpenStack, so it's quite simple to send
    notifications to it from all projects.

  * There is an OpenStack API in Ceilometer that allows us to retrieve all
    messages related to one trace. Take a look at
    *osprofiler.parsers.ceilometer:get_notifications*


* **How to setup profiler notifier?**

  We decided to use olso.messaging Notifier API, because:

  * `oslo.messaging`_ is integrated in all projects

  * It's the simplest way to send notification to Ceilometer, take a
    look at: *osprofiler.notifiers.messaging.Messaging:notify* method

  * We don't need to add any new `CONF`_ options in projects


* **How to initialize profiler, to get one trace across all services?**

    To enable cross service profiling we actually need to do send from caller
    to callee (base_id & trace_id). So callee will be able to init its profiler
    with these values.

    In case of OpenStack there are 2 kinds of interaction between 2 services:

    * REST API

        It's well known that there are python clients for every project,
        that generate proper HTTP requests, and parse responses to objects.

        These python clients are used in 2 cases:

        * User access -> OpenStack

        * Service from Project 1 would like to access Service from Project 2


        So what we need is to:

        * Put in python clients headers with trace info (if profiler is inited)

        * Add `OSprofiler WSGI middleware`_ to service, that initializes
          profiler, if there are special trace headers, that are signed by HMAC
          from api-paste.ini


        Actually the algorithm is a bit more complex. The Python client will
        also sign the trace info with a `HMAC`_ key passed to profiler.init,
        and on reception the WSGI middleware will check that it's signed with
        the **same** HMAC key that is specified in api-paste.ini. This ensures
        that only the user that knows the HMAC key in api-paste.ini can init
        a profiler properly and send trace info that will be actually
        processed. This ensures that trace info that is sent in that does
        **not** pass the HMAC validation will be discarded.


    * RPC API

        RPC calls are used for interaction between services of one project.
        It's well known that projects are using `oslo.messaging`_ to deal with
        RPC. It's very good, because projects deal with RPC in similar way.

        So there are 2 required changes:

        * On callee side put in request context trace info (if profiler was
          initialized)

        * On caller side initialize profiler, if there is trace info in request
          context.

        * Trace all methods of callee API (can be done via profiler.trace_cls).


* **What points should be tracked by default?**

   I think that for all projects we should include by default 5 kinds of points:

   * All HTTP calls - helps to get information about: what HTTP requests were
     done, duration of calls (latency of service), information about projects
     involved in request.

   * All RPC calls - helps to understand duration of parts of request related
     to different services in one project. This information is essential to
     understand which service produce the bottleneck.

   * All DB API calls - in some cases slow DB query can produce bottleneck. So
     it's quite useful to track how much time request spend in DB layer.

   * All driver calls - in case of nova, cinder and others we have vendor
     drivers. Duration

   * ALL SQL requests (turned off by default, because it produce a lot of
     traffic)

.. _CONF: http://docs.openstack.org/developer/oslo.config/
.. _HMAC: http://en.wikipedia.org/wiki/Hash-based_message_authentication_code
.. _OpenStack: http://openstack.org/
.. _Ceilometer: https://wiki.openstack.org/wiki/Ceilometer
.. _oslo.messaging: https://pypi.python.org/pypi/oslo.messaging
.. _OSprofiler WSGI middleware: https://github.com/stackforge/osprofiler/blob/master/osprofiler/web.py