Notes

Often when talking about optimizing SQL queries in web applications we focus on application-layer concerns like N+1 queries, but occasionally we need to dive down into the database layer itself to optimize our queries. In this video, thoughtbot CTO Joe Ferris takes us on a tour of working directly with Postgres to understand and optimize the structure of our database and our queries, cleaning up that last mile of database performance concerns.

Application vs Database Optimization

Previously we've discussed how to optimize database interactions when working from the application layer, specifically in the Advanced ActiveRecord Querying trail and the Advanced Query Optimization episode of the Weekly Iteration. Those episodes are a great starting point for improving your application's performance around database calls, as making multiple DB calls will almost always cost significantly more time than a single less efficient query.

That said, once you've cleaned up your query logic and optimized your code to make a small number of queries for only the data you need, the next step is to jump into the database itself, see what information is available there, and learn how to fine-tune things.

Postgres SQL Console (psql)

Rather than working from the Rails console, for this video we'll be working directly in the Postgres prompt known as psql. For a detailed overview of the functionality and options available in psql be sure to check out the official psql docs, but for now we can get by with only the basics.

From the psql prompt we can do any of the following:

  • Issue SQL statements like select * from users;
  • Issue schema modification commands like create index ...
  • Issue special psql commands like \q for "quit" (these special meta-commands are prefixed with \)

One particularly useful psql command is \d, which will "describe" the database by listing its tables. You can also use \d with a table name to have Postgres describe that table for you.

upcase_development=# \d statuses;
                                         Table "public.statuses"
      Column       |            Type             |                       Modifiers
-------------------+-----------------------------+-------------------------------------------------------
 id                | integer                     | not null default nextval('statuses_id_seq'::regclass)
 completeable_id   | integer                     | not null
 user_id           | integer                     | not null
 state             | character varying(255)      | not null default 'In Progress'::character varying
 created_at        | timestamp without time zone |
 updated_at        | timestamp without time zone |
 completeable_type | character varying(255)      | not null
Indexes:
    "statuses_pkey" PRIMARY KEY, btree (id)

When working with Rails and Heroku, there are convenience shortcuts for connecting a psql instance to your database:

# connect to local dev postgres repl
$ bin/rails db

# connect to psql repl for Heroku instance
$ heroku pg:psql

Explaining Queries

When trying to get a better understanding of how Postgres actually runs a given SQL query, explain is our best friend. Again, the official Postgres docs for explain cover all the advanced options and flags, but for now we can dive in with some more approachable examples.

Prefixing a query with explain will cause Postgres to output a summary of how it would run the query, without actually running it. This behaves somewhat like the --pretend flag provided by some command line utilities. From the output of explain you can observe how Postgres plans to execute the query, including whether or not an index will be used and the estimated cost of each step in the query plan. The explain keyword can be placed in front of any query to instruct Postgres to describe what it would do.

upcase_development=# explain select * from statuses where user_id = 524;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on statuses  (cost=0.00..7873.16 rows=70 width=45)
   Filter: (user_id = 524)
(2 rows)

Explain Analyze

When using explain, Postgres will not actually perform the query, but instead it will only provide information about how it would perform the query if executed. Going a step further, we can use explain analyze which will perform the query, and as such can provide additional detail:

upcase_development=# explain analyze select * from statuses where user_id = 524;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on statuses  (cost=0.00..7873.16 rows=70 width=45) (actual time=0.016..35.358 rows=153 loops=1)
   Filter: (user_id = 524)
   Rows Removed by Filter: 358260
 Planning time: 0.067 ms
 Execution time: 35.389 ms
(5 rows)

Comparing the output between the two, we can see that with explain analyze, we end up seeing data about the total execution time, the number of rows removed by the filter, and the number of loops. All of these data points can help direct efforts in optimizing the query.

Looking at Explain Output

Postgres has great documentation covering how to interpret the output of explain, but at its core you're looking for any standout elements in the query plan.

Rows Removed by Filter

If the output shows that there was a filter on the query (as there will be whenever there is a where clause) and that many rows were removed by the filter, then this is a great indication that an index could be useful.

If the scan ended up returning most of the rows, an index likely won't help, but if many rows were removed by the query's conditions, an index could improve performance.

Optimizing a Query With an Index

In the above query we filtered out over 350,000 rows with our where clause, which hints that adding an index could be a big help here. We can try this out directly from psql to test that hypothesis:

upcase_development=# create index statuses_user_id on statuses (user_id);

Following the addition of the index, we can re-run our query analysis to see if that improved things:

upcase_development=# explain analyze select * from statuses where user_id = 524;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Index Scan using statuses_user_id on statuses  (cost=0.42..193.78 rows=70 width=45) (actual time=0.021..0.119 rows=153 loops=1)
   Index Cond: (user_id = 524)
 Planning time: 0.071 ms
 Execution time: 0.145 ms
(4 rows)

And as expected in this case the index did provide a big win, reducing the execution time from 35.389ms down to 0.145ms (almost a 250X improvement!).

As a general rule, adding an index to a foreign key (as we did above by indexing user_id on the statuses table) will account for a significant share of the optimizations we can perform in a typical application with relational data (which is most applications).

Adding a Compound Index

In some cases we have columns that tend to be queried together, such as completeable_id and completeable_type here, and we may want to add a composite index that combines those two attributes.
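
As a rough sketch (the index name, the column order, and the example values below are illustrative rather than taken from the video), a compound index can be created and exercised directly from psql:

upcase_development=# create index statuses_completeable on statuses (completeable_type, completeable_id);

upcase_development=# explain analyze select * from statuses where completeable_type = 'Exercise' and completeable_id = 42;

With both columns in the where clause, the resulting plan should show an Index Scan using the compound index rather than a Seq Scan.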

Even with the slight increase in query time seen in the video, using indexes is often still a good idea, as they will help keep query times low even as the data in your system grows over time.

Choosing the Right Indexes

One caveat to multi-column indexes is that they are only useful when a query constrains the leading column(s) of the index. If we were to query a table with a multi-column index but only used a non-leading column, Postgres would generally not be able to use the index to perform the query efficiently.
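
For example (illustrative, and assuming the compound index's leading column is completeable_type), a query that filters only on completeable_id generally can't make effective use of that index and will typically fall back to a sequential scan unless another suitable index exists:

upcase_development=# explain select * from statuses where completeable_id = 42;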

Again we see that performance optimizations should always be driven by benchmarking and real-world data about how your production application is actually used.

For each use case in your application you may want to add an additional index that helps to optimize that query, but then we run into the issue of having to keep multiple indexes up to date, which will increase write times. Thus, we need to use data from our application to weigh these trade-offs and achieve the best overall performance and user experience.

Why Is Postgres Not Using My Index?

Postgres is very smart about performing queries, and in some cases this means it will opt to not use an index, even when a relevant index exists. It's possible that an earlier portion of the query has filtered down the possible results so much that using an index for further filtering would be more costly, and thus Postgres will opt not to use the index. In general we can trust Postgres to do the right thing, but this might not always be the thing we expected. Again, this points to the ever-present need to benchmark and test any performance optimizations we make.
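
As an illustrative example (the exact plan will depend on your data), a condition that matches most of the table will usually be planned as a Seq Scan even though user_id is indexed, because reading nearly every row through the index would cost more than scanning the table directly:

upcase_development=# explain select * from statuses where user_id > 0;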

To Sum Up

When you run into a slow query (and you know it is the query that is slow, not the controller action or the full request), you can use explain and explain analyze to better understand how your database is actually performing that query. It is best to use production-like data wherever possible, or at least data within an order of magnitude of production row counts and query times.

From there, form a hypothesis about what can be done to improve the query time. Often the solution will involve adding an index. Lastly, be sure to test that the change actually did improve the query time. Some indexes will have a minimal impact on query time or may not even be used, and all of them incur some cost when writing data, so it's critical to validate any speedups. Database tuning can be a complex science, but it can pay off immensely in keeping query performance under control in your application.