author | Marcel Amirault <ravlen@gmail.com> | 2019-05-05 13:57:21 +0000 |
---|---|---|
committer | Achilleas Pipinellis <axil@gitlab.com> | 2019-05-05 13:57:21 +0000 |
commit | 0207468401e41dcea537fa2d48f395a5fa7b3300 (patch) | |
tree | 5c7fb2dee472bb57bce8e64bd3105cdf0af75ea4 /doc/development/elasticsearch.md | |
parent | 6f54ced40dadf268b6d1525c413f9ce52930f1fe (diff) | |
download | gitlab-ce-0207468401e41dcea537fa2d48f395a5fa7b3300.tar.gz |
Docs: Merge EE doc/development to CE
Diffstat (limited to 'doc/development/elasticsearch.md')
-rw-r--r-- | doc/development/elasticsearch.md | 166 |
1 files changed, 166 insertions, 0 deletions
diff --git a/doc/development/elasticsearch.md b/doc/development/elasticsearch.md
new file mode 100644
index 00000000000..0c9e7908713
--- /dev/null
+++ b/doc/development/elasticsearch.md
@@ -0,0 +1,166 @@
+# Elasticsearch knowledge **[STARTER ONLY]**
+
+This area is to maintain a compendium of useful information when working with Elasticsearch.
+
+Information on how to enable Elasticsearch and perform the initial indexing is kept in https://docs.gitlab.com/ee/integration/elasticsearch.html#enabling-elasticsearch
+
+## Initial installation on OS X
+
+It is recommended to use the Docker image. After installing Docker you can immediately spin up an instance with
+
+```
+docker run --name elastic56 -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:5.6.12
+```
+
+and use `docker stop elastic56` and `docker start elastic56` to stop/start it.
+
+### Installing on the host
+
+We currently only support Elasticsearch [5.6 to 6.x](https://docs.gitlab.com/ee/integration/elasticsearch.html#requirements).
+
+Version 5.6 is available on Homebrew and is the recommended version to use in order to test compatibility.
+
+```
+brew install elasticsearch@5.6
+```
+
+There is no need to install any plugins.
+
+## New repo indexer (beta)
+
+If you're interested in working with the new beta repo indexer, all you need to do is:
+
+- git clone git@gitlab.com:gitlab-org/gitlab-elasticsearch-indexer.git
+- make
+- make install
+
+This adds `gitlab-elasticsearch-indexer` to `$GOPATH/bin`; please make sure that directory is in your `$PATH`. After that GitLab will find it and you'll be able to enable it in the admin settings area.
+
+**note:** `make` will not recompile the executable unless you do `make clean` beforehand.
+
+## Helpful rake tasks
+
+- `gitlab:elastic:test:index_size`: Tells you how much space the current index is using, as well as how many documents are in the index.
+- `gitlab:elastic:test:index_size_change`: Outputs index size, reindexes, and outputs index size again. Useful when testing improvements to indexing size.
+
+Additionally, if you need large repos or multiple forks for testing, please consider [following these instructions](https://docs.gitlab.com/ee/development/rake_tasks.html#extra-project-seed-options).
+
+## How does it work?
+
+The Elasticsearch integration depends on an external indexer. We ship a [Ruby indexer](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/bin/elastic_repo_indexer) by default but are also working on an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a rake task, but after this is done GitLab itself will trigger reindexing when required via `after_` callbacks on create, update, and destroy that are inherited from [/ee/app/models/concerns/elastic/application_search.rb](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/app/models/concerns/elastic/application_search.rb).
+
+All indexing after the initial one is done via `ElasticIndexerWorker` (Sidekiq jobs).
+
+Search queries are generated by the concerns found in [ee/app/models/concerns/elastic](https://gitlab.com/gitlab-org/gitlab-ee/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control, and have been a historic source of security bugs, so please pay close attention to them!
+
+## Existing Analyzers/Tokenizers/Filters
+These are all defined in https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/elasticsearch/git/model.rb
+
+### Analyzers
+#### `path_analyzer`
+Used when indexing blobs' paths. Uses the `path_tokenizer` and the `lowercase` and `asciifolding` filters.
+
+Please see the `path_tokenizer` explanation below for an example.
+
+#### `sha_analyzer`
+Used in blobs and commits. Uses the `sha_tokenizer` and the `lowercase` and `asciifolding` filters.
+
+Please see the `sha_tokenizer` explanation below for an example.
+
+#### `code_analyzer`
+Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters: `code`, `edgeNGram_filter`, `lowercase`, and `asciifolding`.
+
+The `whitespace` tokenizer was selected in order to have more control over how tokens are split. For example, the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` in order to be properly searched.
+
+Please see the `code` filter for an explanation of how tokens are split.
+
+#### `code_search_analyzer`
+Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.
+
+### Tokenizers
+#### `sha_tokenizer`
+This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to allow SHAs to be searchable by any prefix of them (minimum of 5 characters).
+
+example:
+
+`240c29dc7e` becomes:
+- `240c2`
+- `240c29`
+- `240c29d`
+- `240c29dc`
+- `240c29dc7`
+- `240c29dc7e`
+
+#### `path_tokenizer`
+This is a custom tokenizer that uses the [`path_hierarchy` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pathhierarchy-tokenizer.html) with `reverse: true` in order to allow searches to find paths no matter how much or how little of the path is given as input.
+
+example:
+
+`'/some/path/application.js'` becomes:
+- `'/some/path/application.js'`
+- `'some/path/application.js'`
+- `'path/application.js'`
+- `'application.js'`
+
+### Filters
+#### `code`
+Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.
+
+Patterns:
+- `"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCased and lowerCamelCased strings as separate tokens
+- `"(\\d+)"`: extracts digits
+- `"(?=([\\p{Lu}]+[\\p{L}]+))"`: captures CamelCased strings recursively. Ex: `ThisIsATest` => `[ThisIsATest, IsATest, ATest, Test]`
+- `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
+- `"'((?:\\'|[^']|\\')*)'"`: same as above, for single-quotes
+- `'\.([^.]+)(?=\.|\s|\Z)'`: separate terms with periods in-between
+- `'\/?([^\/]+)(?=\/|\b)'`: separate path terms `like/this/one`
+
+#### `edgeNGram_filter`
+Uses an [Edge NGram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenfilter.html) to allow inputs with only parts of a token to find the token. For example, it would turn `glasses` into its prefixes, from `gl` up to `glasses`, which would allow a search for "`glass`" to find the original token `glasses`.
+
+## Gotchas
+
+- Searches can have their own analyzers. Remember to check when editing analyzers.
+- `Character` filters (as opposed to token filters) always replace the original character, so they're not a good choice as they can hinder exact searches.
+
+## Troubleshooting
+
+### Getting "flood stage disk watermark [95%] exceeded"
+
+You might get an error such as
+
+```
+[2018-10-31T15:54:19,762][WARN ][o.e.c.r.a.DiskThresholdMonitor] [pval5Ct]
+   flood stage disk watermark [95%] exceeded on
+   [pval5Ct7SieH90t5MykM5w][pval5Ct][/usr/local/var/lib/elasticsearch/nodes/0] free: 56.2gb[3%],
+   all indices on this node will be marked read-only
+```
+
+This is because you've exceeded the disk space threshold: Elasticsearch thinks you don't have enough disk space left, based on the default 95% threshold.
+
+In addition, the `read_only_allow_delete` setting will be set to `true`. It will block indexing, `forcemerge`, etc.
+
+```
+curl "http://localhost:9200/gitlab-development/_settings?pretty"
+```
+
+Add this to your `elasticsearch.yml` file:
+
+```
+# turn off the disk allocator
+cluster.routing.allocation.disk.threshold_enabled: false
+```
+
+_or_
+
+```
+# set your own limits
+cluster.routing.allocation.disk.threshold_enabled: true
+cluster.routing.allocation.disk.watermark.flood_stage: 5gb # ES 6.x only
+cluster.routing.allocation.disk.watermark.low: 15gb
+cluster.routing.allocation.disk.watermark.high: 10gb
+```
+
+Restart Elasticsearch, and the `read_only_allow_delete` will clear on its own.
+
+_from "Disk-based Shard Allocation | Elasticsearch Reference" [5.6](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/disk-allocator.html#disk-allocator) and [6.x](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/disk-allocator.html)_
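
If the `read_only_allow_delete` block described in the Troubleshooting section is still set after a restart, it can also be reset through the index settings API by setting `index.blocks.read_only_allow_delete` to `null`. The following is a minimal sketch and not part of the commit above. It assumes the local node at `http://localhost:9200` and the `gitlab-development` index from the `curl` command in the diff; the helper names `build_unblock_request` and `clear_read_only_block` are hypothetical.

```python
import json
import urllib.request

# Assumptions: a local development node and the index name used in the
# curl command from the diff; adjust both for your environment.
ES_URL = "http://localhost:9200"
INDEX = "gitlab-development"


def build_unblock_request(es_url=ES_URL, index=INDEX):
    """Build (but do not send) a PUT that resets the read-only block."""
    # A JSON null resets the setting to its default, which removes the block.
    body = json.dumps({"index.blocks.read_only_allow_delete": None})
    return urllib.request.Request(
        url=f"{es_url}/{index}/_settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def clear_read_only_block(es_url=ES_URL, index=INDEX):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_unblock_request(es_url, index)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `clear_read_only_block()` against a running node sends the request; `build_unblock_request()` on its own is enough to inspect the URL, method, and body that would be sent.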