.. _cloudsearch_tut:

===============================================
An Introduction to boto's Cloudsearch interface
===============================================

This tutorial focuses on the boto interface to AWS' Cloudsearch_. It assumes
that you have boto already downloaded and installed.

.. _Cloudsearch: http://aws.amazon.com/cloudsearch/

Creating a Connection
---------------------
The first step in accessing CloudSearch is to create a connection to the service.

The recommended method of doing this is as follows::

    >>> import boto.cloudsearch
    >>> conn = boto.cloudsearch.connect_to_region("us-west-2",
    ...             aws_access_key_id='<aws access key>',
    ...             aws_secret_access_key='<aws secret key>')

At this point, the variable conn will point to a CloudSearch connection object
in the us-west-2 region. CloudSearch is only available in a limited set of
regions; check the AWS documentation for the current list. In this example,
the AWS access key and AWS secret key are
passed in to the method explicitly. Alternatively, you can set the environment
variables:

* `AWS_ACCESS_KEY_ID` - Your AWS Access Key ID
* `AWS_SECRET_ACCESS_KEY` - Your AWS Secret Access Key

and then simply call::

   >>> import boto.cloudsearch
   >>> conn = boto.cloudsearch.connect_to_region("us-west-2")

In either case, conn will point to the Connection object which we will use
throughout the remainder of this tutorial.
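
Boto can also pick up your credentials from a boto config file such as
~/.boto, so that keys never need to appear in code or environment variables.
A minimal example of the relevant section::

    [Credentials]
    aws_access_key_id = <aws access key>
    aws_secret_access_key = <aws secret key>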

Creating a Domain
-----------------

Once you have a connection established with the CloudSearch service, you will
want to create a domain. A domain encapsulates the data that you wish to index,
as well as indexes and metadata relating to it::

    >>> from boto.cloudsearch.domain import Domain
    >>> domain = Domain(conn, conn.create_domain('demo'))

This domain can be used to control access policies, indexes, and the actual
document service, which you will use to index and search.
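
If the domain already exists, for example on a later run of your application,
you can build the same wrapper from the domain's description instead of
creating it again. A minimal sketch, assuming describe_domains accepts a list
of domain names and returns the matching domain status dictionaries::

    >>> data = conn.describe_domains(['demo'])
    >>> domain = Domain(conn, data[0])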

Setting access policies
-----------------------

Before you can connect to a document service, you need to set the correct
access properties.  For example, if you were connecting from 192.168.1.0, you
could give yourself access as follows::

    >>> our_ip = '192.168.1.0'

    >>> # Allow our IP address to access the document and search services
    >>> policy = domain.get_access_policies()
    >>> policy.allow_search_ip(our_ip)
    >>> policy.allow_doc_ip(our_ip)

You can use the :py:meth:`allow_search_ip
<boto.cloudsearch.optionstatus.ServicePoliciesStatus.allow_search_ip>` and
:py:meth:`allow_doc_ip <boto.cloudsearch.optionstatus.ServicePoliciesStatus.allow_doc_ip>`
methods to give different CIDR blocks access to searching and the document
service respectively.
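
To revoke access later, the policy object's corresponding disallow methods can
be used. A minimal sketch, assuming disallow_search_ip and disallow_doc_ip are
available and take the same CIDR argument::

    >>> policy.disallow_search_ip(our_ip)
    >>> policy.disallow_doc_ip(our_ip)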

Creating index fields
---------------------

Each domain can have up to twenty index fields which are indexed by the
CloudSearch service. For each index field, you will need to specify whether
it's a text or unsigned integer ('uint') field, as well as, optionally, a
default value::

    >>> # Create a 'text' index field called 'username'
    >>> uname_field = domain.create_index_field('username', 'text')

    >>> # Epoch time of when the user last did something
    >>> time_field = domain.create_index_field('last_activity',
    ...                                        'uint',
    ...                                        default=0)

It is also possible to mark an index field as a facet. Doing so allows a search
query to return categories into which results can be grouped, or to create
drill-down categories::

    >>> # But it would be neat to drill down into different countries
    >>> loc_field = domain.create_index_field('location', 'text', facet=True)

Finally, you can also mark a field so that its contents can be returned
directly in your search results by using the result option::

    >>> # Directly insert user snippets in our results
    >>> snippet_field = domain.create_index_field('snippet', 'text', result=True)

You can add up to 20 index fields in this manner::

    >>> follower_field = domain.create_index_field('follower_count',
    ...                                            'uint',
    ...                                            default=0)

Adding Documents to the Index
-----------------------------

Now, we can add some documents to our new search domain. First, you will need a
document service object through which documents are submitted::

    >>> doc_service = domain.get_document_service()

For this example, we will use a pre-populated list of sample content for our
import. You would normally pull such data from your database or another
document store::

    >>> users = [
        {
            'id': 1,
            'username': 'dan',
            'last_activity': 1334252740,
            'follower_count': 20,
            'location': 'USA',
            'snippet': 'Dan likes watching sunsets and rock climbing',
        },
        {
            'id': 2,
            'username': 'dankosaur',
            'last_activity': 1334252904,
            'follower_count': 1,
            'location': 'UK',
            'snippet': 'Likes to dress up as a dinosaur.',
        },
        {
            'id': 3,
            'username': 'danielle',
            'last_activity': 1334252969,
            'follower_count': 100,
            'location': 'DE',
            'snippet': 'Just moved to Germany!'
        },
        {
            'id': 4,
            'username': 'daniella',
            'last_activity': 1334253279,
            'follower_count': 7,
            'location': 'USA',
            'snippet': 'Just like Dan, I like to watch a good sunset, but heights scare me.',
        }
    ]

When adding documents to our document service, we will batch them together. You
can schedule a document to be added by using the :py:meth:`add
<boto.cloudsearch.document.DocumentServiceConnection.add>` method. Whenever you are adding a
document, you must provide a unique ID, a version ID, and the actual document
to be indexed. In this case, we are using the user ID as our unique ID. The
version ID is used to determine which is the latest version of an object to be
indexed. If you wish to update a document, you must use a higher version ID. In
this case, we are using the time of the user's last activity as a version
number::

    >>> for user in users:
    ...     doc_service.add(user['id'], user['last_activity'], user)

When you are ready to send the batched request to the document service, you can
do so with the :py:meth:`commit
<boto.cloudsearch.document.DocumentServiceConnection.commit>` method. Note that
CloudSearch charges per 1,000 batch uploads, and each batch upload must be
under 5 MB::

    >>> result = doc_service.commit()

The result is an instance of :py:class:`CommitResponse
<boto.cloudsearch.document.CommitResponse>`, which wraps the plain dictionary
response in a convenient object (i.e. result.adds, result.deletes) and raises
an exception if any of our documents were not actually committed.
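
For example, you can check how many add and delete operations were accepted;
the counts below assume the four sample users above were all committed
successfully::

    >>> result.adds
    4
    >>> result.deletes
    0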

After you have successfully committed some documents to CloudSearch, you must
call :py:meth:`clear_sdf
<boto.cloudsearch.document.DocumentServiceConnection.clear_sdf>` before reusing
the same document service connection, so that its internal cache of batched
documents is cleared.
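
For example, to reuse the same connection for a second batch of documents::

    >>> doc_service.clear_sdf()
    >>> # doc_service can now accept a fresh batch of add/delete calls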

Searching Documents
-------------------

Now, let's try performing a search. First, we will need a
SearchServiceConnection::

    >>> search_service = domain.get_search_service()

A standard search will return documents which contain the exact words being
searched for::

    >>> results = search_service.search(q="dan")
    >>> results.hits
    2
    >>> map(lambda x: x['id'], results)
    [u'1', u'4']

The standard search does not look at word order::

    >>> results = search_service.search(q="dinosaur dress")
    >>> results.hits
    1
    >>> map(lambda x: x['id'], results)
    [u'2']

It's also possible to do more complex queries using the bq argument (Boolean
Query). When you are using bq, your search terms must be enclosed in single
quotes::

    >>> results = search_service.search(bq="'dan'")
    >>> results.hits
    2
    >>> map(lambda x: x['id'], results)
    [u'1', u'4']

When you are using boolean queries, it's also possible to use wildcards to
extend your search to all words which start with your search terms::

    >>> results = search_service.search(bq="'dan*'")
    >>> results.hits
    4
    >>> map(lambda x: x['id'], results)
    [u'1', u'2', u'3', u'4']

The boolean query also allows you to create more complex queries. You can OR
terms together using "|", AND terms together using "+" or a space, and you can
remove words from the query using the "-" operator::

    >>> results = search_service.search(bq="'watched|moved'")
    >>> results.hits
    2
    >>> map(lambda x: x['id'], results)
    [u'3', u'4']
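
For example, the "-" operator can be combined with a wildcard to exclude
documents from a match; with the sample data above, this should drop the
dinosaur enthusiast from the 'dan*' results::

    >>> results = search_service.search(bq="'dan* -dinosaur'")
    >>> map(lambda x: x['id'], results)
    [u'1', u'3', u'4']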

By default, the search will return 10 results, but it is possible to adjust
this by using the size argument as follows::

    >>> results = search_service.search(bq="'dan*'", size=2)
    >>> results.hits
    4
    >>> map(lambda x: x['id'], results)
    [u'1', u'2']

It is also possible to offset the start of the search by using the start
argument as follows::

    >>> results = search_service.search(bq="'dan*'", start=2)
    >>> results.hits
    4
    >>> map(lambda x: x['id'], results)
    [u'3', u'4']
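
The size and start arguments can be combined to page through a larger result
set. A simple sketch, assuming the ordering seen in the examples above::

    >>> first_page = search_service.search(bq="'dan*'", size=2)
    >>> for start in range(0, first_page.hits, 2):
    ...     page = search_service.search(bq="'dan*'", start=start, size=2)
    ...     print map(lambda x: x['id'], page)
    [u'1', u'2']
    [u'3', u'4']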


Ordering search results and rank expressions
--------------------------------------------

If your search query is going to return many results, it is useful to be able
to sort them. You can order your search results by using the rank argument. You
are able to sort on any field which has the result option turned on::

    >>> results = search_service.search(bq=query, rank=['-follower_count'])

You can also create your own rank expressions to sort your results according to
other criteria, such as showing the most recently active users, or combining
the recency score with the text_relevance::

    >>> domain.create_rank_expression('recently_active', 'last_activity')

    >>> domain.create_rank_expression('activish',
    ...   'text_relevance + ((follower_count/(time() - last_activity))*1000)')

    >>> results = search_service.search(bq=query, rank=['-recently_active'])

Viewing and Adjusting Stemming for a Domain
-------------------------------------------

A stemming dictionary maps related words to a common stem. A stem is
typically the root or base word from which variants are derived. For
example, run is the stem of running and ran. During indexing, Amazon
CloudSearch uses the stemming dictionary when it performs
text-processing on text fields. At search time, the stemming
dictionary is used to perform text-processing on the search
request. This enables matching on variants of a word. For example, if
you map the term running to the stem run and then search for running,
the request matches documents that contain run as well as running.

To get the current stemming dictionary defined for a domain, use the
:py:meth:`get_stemming <boto.cloudsearch.domain.Domain.get_stemming>` method::

    >>> stems = domain.get_stemming()
    >>> stems
    {u'stems': {}}
    >>>

This returns a dictionary object that can be manipulated directly to
add additional stems for your search domain by adding pairs of term:stem
to the stems dictionary::

    >>> stems['stems']['running'] = 'run'
    >>> stems['stems']['ran'] = 'run'
    >>> stems
    {u'stems': {u'ran': u'run', u'running': u'run'}}
    >>>

This has changed the value locally.  To update the information in
Amazon CloudSearch, you need to save the data::

    >>> stems.save()

You can also access certain CloudSearch-specific attributes related to
the stemming dictionary defined for your domain::

    >>> stems.status
    u'RequiresIndexDocuments'
    >>> stems.creation_date
    u'2012-05-01T12:12:32Z'
    >>> stems.update_date
    u'2012-05-01T12:12:32Z'
    >>> stems.update_version
    19
    >>>

The status indicates that, because you have changed the stems associated
with the domain, you will need to re-index the documents in the domain
before the new stems are used.
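
Re-indexing can be requested through the connection. A minimal sketch, assuming
the connection exposes an index_documents method that takes the domain name::

    >>> conn.index_documents('demo')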

Viewing and Adjusting Stopwords for a Domain
--------------------------------------------

Stopwords are words that should typically be ignored both during
indexing and at search time because they are either insignificant or
so common that including them would result in a massive number of
matches.

To view the stopwords currently defined for your domain, use the
:py:meth:`get_stopwords <boto.cloudsearch.domain.Domain.get_stopwords>` method::

    >>> stopwords = domain.get_stopwords()
    >>> stopwords
    {u'stopwords': [u'a',
     u'an',
     u'and',
     u'are',
     u'as',
     u'at',
     u'be',
     u'but',
     u'by',
     u'for',
     u'in',
     u'is',
     u'it',
     u'of',
     u'on',
     u'or',
     u'the',
     u'to',
     u'was']}
    >>>

You can add additional stopwords by simply appending the values to the
list::

    >>> stopwords['stopwords'].append('foo')
    >>> stopwords['stopwords'].append('bar')
    >>> stopwords

Similarly, you could remove currently defined stopwords from the list.
To save the changes, use the :py:meth:`save
<boto.cloudsearch.optionstatus.OptionStatus.save>` method::

    >>> stopwords.save()
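
Removing a stopword works the same way; for example::

    >>> stopwords['stopwords'].remove('was')
    >>> stopwords.save()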

The stopwords object has attributes similar to those described above for
stemming, which provide additional information about the stopwords in your
domain.


Viewing and Adjusting Synonyms for a Domain
-------------------------------------------

You can configure synonyms for terms that appear in the data you are
searching. That way, if a user searches for the synonym rather than
the indexed term, the results will include documents that contain the
indexed term.

If you want two terms to match the same documents, you must define
them as synonyms of each other. For example::

    cat, feline
    feline, cat

To view the synonyms currently defined for your domain, use the
:py:meth:`get_synonyms <boto.cloudsearch.domain.Domain.get_synonyms>` method::

    >>> synonyms = domain.get_synonyms()
    >>> synonyms
    {u'synonyms': {}}
    >>>

You can define new synonyms by adding new term:synonyms entries to the
synonyms dictionary object::

    >>> synonyms['synonyms']['cat'] = ['feline', 'kitten']
    >>> synonyms['synonyms']['dog'] = ['canine', 'puppy']

To save the changes, use the :py:meth:`save
<boto.cloudsearch.optionstatus.OptionStatus.save>` method::

    >>> synonyms.save()

The synonyms object has attributes similar to those described above for
stemming, which provide additional information about the synonyms in your
domain.

Deleting Documents
------------------

It is also possible to delete documents::

    >>> import time
    >>> from datetime import datetime

    >>> doc_service = domain.get_document_service()

    >>> # Again we'll cheat and use the current epoch time as our version number

    >>> doc_service.delete(4, int(time.mktime(datetime.utcnow().timetuple())))
    >>> doc_service.commit()