diff options
author | Yorick Peterse <yorickpeterse@gmail.com> | 2016-03-04 13:08:24 +0100 |
---|---|---|
committer | Yorick Peterse <yorickpeterse@gmail.com> | 2016-03-08 11:28:50 +0100 |
commit | 059df2256eb2ce5ddbbf43161eb8e9fca733bce4 (patch) | |
tree | 74d80c038cb25a4fbfa82b52fa0c842cb8f6f58b | |
parent | a19a9faba94a6ea6367032cbd001bcc166160835 (diff) | |
download | gitlab-ce-059df2256eb2ce5ddbbf43161eb8e9fca733bce4.tar.gz |
Added basic SQL guidelinessql-guide
[ci skip]
-rw-r--r-- | doc/development/README.md | 1 | ||||
-rw-r--r-- | doc/development/sql.md | 219 |
2 files changed, 220 insertions, 0 deletions
diff --git a/doc/development/README.md b/doc/development/README.md index b9a0d81e5ba..f5c3107ff44 100644 --- a/doc/development/README.md +++ b/doc/development/README.md @@ -9,4 +9,5 @@ - [Rake tasks](rake_tasks.md) for development - [Shell commands](shell_commands.md) in the GitLab codebase - [Sidekiq debugging](sidekiq_debugging.md) +- [SQL guidelines](sql.md) for SQL guidelines - [UI guide](ui_guide.md) for building GitLab with existing css styles and elements diff --git a/doc/development/sql.md b/doc/development/sql.md new file mode 100644 index 00000000000..23fd7604957 --- /dev/null +++ b/doc/development/sql.md @@ -0,0 +1,219 @@ +# SQL Query Guidelines + +This document describes various guidelines to follow when writing SQL queries, +either using ActiveRecord/Arel or raw SQL queries. + +## Using LIKE Statements + +The most common way to search for data is using the `LIKE` statement. For +example, to get all issues with a title starting with "WIP:" you'd write the +following query: + +```sql +SELECT * +FROM issues +WHERE title LIKE 'WIP:%'; +``` + +On PostgreSQL the `LIKE` statement is case-sensitive. On MySQL this depends on +the case-sensitivity of the collation, which is usually case-insensitive. To +perform a case-insensitive `LIKE` on PostgreSQL you have to use `ILIKE` instead. +This statement in turn isn't supported on MySQL. + +To work around this problem you should write `LIKE` queries using Arel instead +of raw SQL fragments as Arel automatically uses `ILIKE` on PostgreSQL and `LIKE` +on MySQL. This means that instead of this: + +```ruby +Issue.where('title LIKE ?', 'WIP:%') +``` + +You'd write this instead: + +```ruby +Issue.where(Issue.arel_table[:title].matches('WIP:%')) +``` + +Here `matches` generates the correct `LIKE` / `ILIKE` statement depending on the +database being used. + +If you need to chain multiple `OR` conditions you can also do this using Arel: + +```ruby +table = Issue.arel_table + +Issue.where(table[:title].matches('WIP:%').or(table[:foo].matches('WIP:%'))) +``` + +For PostgreSQL this produces: + +```sql +SELECT * +FROM issues +WHERE (title ILIKE 'WIP:%' OR foo ILIKE 'WIP:%') +``` + +In turn for MySQL this produces: + +```sql +SELECT * +FROM issues +WHERE (title LIKE 'WIP:%' OR foo LIKE 'WIP:%') +``` + +## LIKE & Indexes + +Neither PostgreSQL nor MySQL use any indexes when using `LIKE` / `ILIKE` with a +wildcard at the start. For example, this will not use any indexes: + +```sql +SELECT * +FROM issues +WHERE title ILIKE '%WIP:%'; +``` + +Because the value for `ILIKE` starts with a wildcard the database is not able to +use an index as it doesn't know where to start scanning the indexes. + +MySQL provides no known solution to this problem. Luckily PostgreSQL _does_ +provide a solution: trigram GIN indexes. These indexes can be created as +follows: + +```sql +CREATE INDEX [CONCURRENTLY] index_name_here +ON table_name +USING GIN(column_name gin_trgm_ops); +``` + +The key here is the `GIN(column_name gin_trgm_ops)` part. This creates a [GIN +index][gin-index] with the operator class set to `gin_trgm_ops`. These indexes +_can_ be used by `ILIKE` / `LIKE` and can lead to greatly improved performance. +One downside of these indexes is that they can easily get quite large (depending +on the amount of data indexed). + +To keep naming of these indexes consistent please use the following naming +pattern: + + index_TABLE_on_COLUMN_trigram + +For example, a GIN/trigram index for `issues.title` would be called +`index_issues_on_title_trigram`. + +Due to these indexes taking quite some time to be built they should be built +concurrently. This can be done by using `CREATE INDEX CONCURRENTLY` instead of +just `CREATE INDEX`. Concurrent indexes can _not_ be created inside a +transaction. Transactions for migrations can be disabled using the following +pattern: + +```ruby +class MigrationName < ActiveRecord::Migration + disable_ddl_transaction! +end +``` + +For example: + +```ruby +class AddUsersLowerUsernameEmailIndexes < ActiveRecord::Migration + disable_ddl_transaction! + + def up + return unless Gitlab::Database.postgresql? + + execute 'CREATE INDEX CONCURRENTLY index_on_users_lower_username ON users (LOWER(username));' + execute 'CREATE INDEX CONCURRENTLY index_on_users_lower_email ON users (LOWER(email));' + end + + def down + return unless Gitlab::Database.postgresql? + + remove_index :users, :index_on_users_lower_username + remove_index :users, :index_on_users_lower_email + end +end +``` + +## Plucking IDs + +This can't be stressed enough: **never** use ActiveRecord's `pluck` to pluck a +set of values into memory only to use them as an argument for another query. For +example, this will make the database **very** sad: + +```ruby +projects = Project.all.pluck(:id) + +MergeRequest.where(source_project_id: projects) +``` + +Instead you can just use sub-queries which perform far better: + +```ruby +MergeRequest.where(source_project_id: Project.all.select(:id)) +``` + +The _only_ time you should use `pluck` is when you actually need to operate on +the values in Ruby itself (e.g. write them to a file). In almost all other cases +you should ask yourself "Can I not just use a sub-query?". + +## Use UNIONs + +UNIONs aren't very commonly used in most Rails applications but they're very +powerful and useful. In most applications queries tend to use a lot of JOINs to +get related data or data based on certain criteria, but JOIN performance can +quickly deteriorate as the data involved grows. + +For example, if you want to get a list of projects where the name contains a +value _or_ the name of the namespace contains a value most people would write +the following query: + +```sql +SELECT * +FROM projects +JOIN namespaces ON namespaces.id = projects.namespace_id +WHERE projects.name ILIKE '%gitlab%' +OR namespaces.name ILIKE '%gitlab%'; +``` + +Using a large database this query can easily take around 800 milliseconds to +run. Using a UNION we'd write the following instead: + +```sql +SELECT projects.* +FROM projects +WHERE projects.name ILIKE '%gitlab%' + +UNION + +SELECT projects.* +FROM projects +JOIN namespaces ON namespaces.id = projects.namespace_id +WHERE namespaces.name ILIKE '%gitlab%'; +``` + +This query in turn only takes around 15 milliseconds to complete while returning +the exact same records. + +This doesn't mean you should start using UNIONs everywhere, but it's something +to keep in mind when using lots of JOINs in a query and filtering out records +based on the joined data. + +GitLab comes with a `Gitlab::SQL::Union` class that can be used to build a UNION +of multiple `ActiveRecord::Relation` objects. You can use this class as +follows: + +```ruby +union = Gitlab::SQL::Union.new([projects, more_projects, ...]) + +Project.from("(#{union.to_sql}) projects") +``` + +## Ordering by Creation Date + +When ordering records based on the time they were created you can simply order +by the `id` column instead of ordering by `created_at`. Because IDs are always +unique and incremented in the order that rows are created this will produce the +exact same results. This also means there's no need to add an index on +`created_at` to ensure consistent performance as `id` is already indexed by +default. + +[gin-index]: http://www.postgresql.org/docs/current/static/gin.html |