Managing complex business logic in SQL using Common Table Expressions (CTEs)

Let’s say you have a data-intensive app with a bunch of complicated querying needs. As things progress and your needs expand, the data access code might start getting more or less unmaintainable.

It now takes a lot of time to come up with efficient ways to query data, and you need to duplicate querying code in multiple places. Or, you’re using higher-level abstraction such as object-relational mapping (ORM), but start bumping into performance limits when doing complex queries.

Sounds familiar so far? If any of the stuff above resonates, keep reading. I’ll present some ideas for approaching these challenges with the help of SQL’s Common Table Expressions, aka CTEs, with practical guidance.

So what are Common Table Expressions?

SQL’s Common Table Expressions or CTEs for short, also known as “WITH statements”, are query-scoped temporary result sets. You could also think of them as non-persistent table-like views. They have existed in the standard since SQL:1999 and have been around in any serious relational database engine for a long time already.

Especially if you write database queries directly in SQL, or use thin enough abstractions on top of it, you might have encountered CTEs already.

CTE benefits

CTEs are basically just a named, structured way to express queries in SQL. The most often enjoyed CTE benefits are related to improved readability: they allow you to separate certain concerns into dedicated named queries, instead of nested inline subselects (sometimes also called derived tables).

CTEs can be valuable for moving computation closer to your data to reduce the number of individual queries and round trips between your application code and database. Complicated joins could be split up into named CTEs for readability. CTEs can reduce duplication by reuse, i.e. referring to one multiple times within a single query. You could also abstract away specific implementation details behind a unified “interface”, to output columns with matching data types and naming, even with different sources of data under the hood.

As for performance, subqueries and CTEs should, at least in theory, be on the same line. Possible exceptions include duplication of subselects, whereas a CTE can be defined and computed only once. However, this could depend on the database engine’s query planner to choose whether identical subselects are actually run only once.

Sometimes you might need to give the planner more hints about how to execute your query, for example by explicitly telling it whether or not to materialize the CTE query results into a temporary memory table (but lose indexing benefits). This level of control is not present when using subselects, where the alternative would be to use temporary tables for in-memory materialization.

There’s one case that requires CTE usage if dealing with native SQL, though: recursive queries. Without CTEs, if the parent-child hierarchy depth can be arbitrary, you would need to implement recursion and stop condition checks in code that would issue multiple separate queries. This could sometimes imply more or less significant performance penalties.

CTEs might also prove useful in some less-commonly-needed cases, such as ordered UPDATEs from another table.

When are CTEs not too useful?

CTEs might be more difficult to utilize, and therefore less convenient in general, if you are using a deeper ORM-like database abstraction layer, such as JPA/Hibernate (Java and some other JVM languages), Entity Framework (.Net), or ActiveRecord (Ruby and others). You might at least need to supply raw SQL for declaring CTEs, because e.g., JPA’s JPQL abstraction doesn’t support CTEs.

CTEs as abstractions: example

Let’s concentrate on CTE benefits for writing readable and maintainable SQL code, the declarative way. The following example might not be very useful in any realistic business scenario on its own, but the idea is to demonstrate some CTE benefits using a compact example that should be easy enough to digest.

Boring stuff first: data setup. Pretend that we have a database of rectangular and cylindrical boxes. They can contain or be contained by other boxes of similar shape, but so that cylindrical ones can’t contain rectangular ones or vice versa. It is possible to rearrange boxes into or out of other boxes, with some constraints. On the querying side, we would like to get various numerical statistics about them.

CREATE TABLE rectangular_boxes (
  id           INT NOT NULL PRIMARY KEY,
  container_id INT CONSTRAINT container_id_fkey REFERENCES rectangular_boxes,
                                -- All boxes must have positive dimensions:
  width        INT NOT NULL CONSTRAINT positive_width  CHECK (width > 0),
  depth        INT NOT NULL CONSTRAINT positive_depth  CHECK (depth > 0),
  height       INT NOT NULL CONSTRAINT positive_height CHECK (height > 0)
);

CREATE TABLE cylindrical_boxes (
  id           INT NOT NULL PRIMARY KEY,
  container_id INT CONSTRAINT container_id_fkey REFERENCES cylindrical_boxes,
                                -- All boxes must have positive dimensions:
  radius       INT NOT NULL CONSTRAINT positive_radius CHECK (radius > 0),
  height       INT NOT NULL CONSTRAINT positive_height CHECK (height > 0)
);

-- We could also consider adding some extra triggers to prevent cyclic
-- container_id hierarchies, or fitting too big stuff within a container.
-- But let's not at this time for the simplicity of this example, as those
-- go beyond the main point here.

Then some sample data to start with:

INSERT INTO rectangular_boxes
  (id, container_id, width, depth, height)
VALUES
  (1,  null,         40,    30,    70    ), --outer box
  (2,  1,            30,    29,    45    ),
  (3,  2,            10,    15,    40    ),
  (4,  3,            2,     2,     5     ),
  (5,  null,         18,    26,    10    ), --other outer box
  (6,  3,            6,     8,     4     ),
  (7,  5,            10,    20,    5     );

INSERT INTO cylindrical_boxes
  (id, container_id, radius, height)
VALUES
  (10,  null,         10,    30    ), --outer box
  (11,  10,           9,     20    ),
  (12,  11,           3,     3     ),
  (13,  null,         15,    50    ); --other outer box, nothing inside it

For the following code snippets, I’ll use TypeScript with Slonik, a PostgreSQL client offering validation and type safety. The code here is hopefully approachable enough even if you’ve never heard about Slonik, provided you know your SQL — the main thing to note is that all native SQL is wrapped inside those sql...`` template strings.

Unified box interface

Let’s start with a simple case: we’d like to treat different types of boxes in a similar way, without code duplication or explicit conditional processing. We’d like to manage the boxes without necessarily knowing what exact kind of boxes they are. There would be two different queries for both types of boxes:

import { sql } from 'slonik'

const rectangularBoxesSelect = sql`
  SELECT
    *,
    width * height * depth AS volume
  FROM rectangular_boxes
`

const cylindricalBoxesSelect = sql`
  SELECT
    id,
    container_id,
    radius AS width,
    radius AS depth,
    height,
    2 * pi() * radius * height AS volume
  FROM cylindrical_boxes
`

Then we can build queries for different kinds of boxes and get unified-looking results that can be treated similarly.

const allBoxesSelect = sql`
  WITH
    rectangular_normalized AS (${rectangularBoxesSelect}),
    cylindrical_normalized AS (${cylindricalBoxesSelect})

  SELECT * FROM rectangular_normalized
  UNION
  SELECT * FROM cylindrical_normalized
`

// Query for boxes that we think are big enough (depends on all_boxes):
const bigBoxesSelect = sql`
  SELECT * FROM all_boxes WHERE volume > 7000
`

// Query for boxes that we think are tall enough (depends on all_boxes):
const tallBoxesSelect = sql`
  SELECT * FROM all_boxes WHERE height > 30
`

// Query for tall and thin (=not big) boxes, using previous building blocks:
const tallAndThinBoxesSelect = sql`
  WITH
    all_boxes AS (${allBoxesSelect}), --dependency for big & tall selects
    big_boxes AS (${bigBoxesSelect}),
    tall_boxes AS (${tallBoxesSelect})

  SELECT * FROM tall_boxes
  WHERE id NOT IN (SELECT id FROM big_boxes);
`

//=> returns boxes with ID 3 (rectangular) and 13 (cylindrical) for demo data

Now we have a number of SELECTs that can be neatly used as building blocks for more specific queries. The final SELECTs can be kept short and tidy, using named intermediate table-like CTE names.

At this point, you might point out that hey, all this can be done using subselects, too! And you would be absolutely right as no WITH syntax would be strictly necessary to get this stuff done. But using subselects, you would need to either repeat that all_boxes select expression, or less-conveniently join with the subselect-derived table, instead of referring to a once-only defined CTE we now have.

That said, in these simple examples subselects might not be too ugly or repetitive yet. But when things get more complicated, the main select can get too hairy for its own good without proper CTE use.

Contained boxes traversal

Next, towards the deeper end: recursive CTEs. As mentioned early in this post, this is the only way to do recursion in native SQL. Also without CTEs, the generic box abstraction approach presented above would be lost, and we’d need to handle things in TypeScript using multiple queries instead of only one.

Now, let’s say we want to find out facts about boxes that are contained within a given outermost box. Either directly inside the outer box, or inside one of the contained boxes, or inside the contained boxes’ contained boxes, and so on. Let’s call this a “box tree” structure, a hierarchy with the outer box being the parent item and all contained boxes its descendants in the tree.

function containedBoxesTreeCte(args: { cteName: string, sourceSelect: any}) {
  const cte = sql.identifier([args.cteName])
  return sql`
    WITH RECURSIVE ${cte} AS (
      WITH source_boxes AS (${args.sourceSelect})

      -- The anchor term: all outer boxes, i.e. any source_box
      SELECT
        source_boxes.*,
        id AS outermost_id -- exposed outside of this CTE
      FROM source_boxes

      UNION ALL

      -- The recursion term: descendants of the outermost boxes
      SELECT
        source_boxes.*,
        outermost_id -- passed unchanged to all the contained boxes
      FROM source_boxes
      JOIN ${cte} ON source_boxes.container_id = contained_boxes_tree.id
    )
  `
}

Then we can use the CTE for stats-finding queries for rectangular and cylindrical boxes, for which the volumes are calculated in different ways:

function totalContainedVolumeSelect(parentIds: number[]) {
  return sql`
    ${containedBoxesTreeCte({
      cteName: 'contained_boxes_tree',
      sourceSelect: allBoxesSelect
    })}

    SELECT sum(volume)
    FROM contained_boxes_tree
    WHERE outermost_id = ANY(${sql.array(parentIds, 'int4')}) --> INT (4 bytes)
  `
}

(See Appendix 1 for a working example using this totalContainedVolume function.)

Pretty neat, right? Now we define the tree-querying logic in one place in the source code, and can drop it to any place it’s needed. The final query can be written in an easy-to-understand way as the detailed implementation is abstracted away into a table-like, queryable thing.

CTE abstractions in practice at Swarmia

Our setup and tech stack consist of PostgreSQL 14 as the database engine (running on Google Cloud Platform). As a noteworthy CTE-related thing, this version introduced the new CYCLE syntax for built-in recursive cycle detection, though this was doable also earlier with some more difficulty.

On the programming language side, we use TypeScript in the backend, with the Slonik library already mentioned in the example.

The product itself combines issue tracker data with pull requests and provides team and organization level insights about software development productivity and investments. In practice, the data querying and aggregation needs are in many cases quite complicated.

Notable use cases for CTEs at Swarmia:

Making issue tracker issues from more than one source, and their status changes, look same to the queries so they don’t need to care which particular issue tracker is the actual data source; see Appendix 2 for a concrete code sample
Applying common types of filtering and data enrichment logic to lists of things like pull requests
Querying for parent/child hierarchies of items using recursion, not too unlike what was shown in the second example
Determining general team memberships based on multiple team data sources
Applying common time series logic for varying data aggregation purposes
In many cases, using CTEs as a nicer-structured alternative to subqueries, even if not sharing these CTEs between any other queries

Using CTEs has enabled us to build complex queries from multiple simpler parts that are in many cases also testable in isolation. This means there are less impediments for adding new pieces of business logic when needed.

Summary

As a syntactic construct, CTEs can be a great addition to your toolbox for making code more maintainable. They could be especially useful if you choose to practice the philosophy of getting computations closer to actual data. CTEs also bring some additional extras that are otherwise not possible or easy to do with standard SQL.

Appendix 1: runnable example code for `box.ts`

import { createPool } from 'slonik'

async function main() {
  // Configure connection pool for local DB "box" with login "box:pass"
  const pool = await createPool('postgres://box:pass@localhost/box')

  await pool.connect(async conn => {
    await conn.oneFirst(sql`SELECT 1`) // check DB connection: expect one row
    console.log('Connected')
  })

  // Test selecting tall and thin boxes
  console.log('\nTall and thin:', await pool.many(tallAndThinBoxesSelect))

  // Test selecting contained boxes by parent
  console.log(
    '\nTotal contained volume:',
    // Expect exactly one row, and return the first result column as a number
    await pool.oneFirst<number>(
      // note: box id = 4 a descendant of id = 1 as well
      totalContainedVolumeSelect([1, 4, 10])
    )
  )
}

// Run it! (With "void", to explicitly ignore returned promise)
void main()

You could test the recursive CTE code by having database set up as per the createPool configuration above. Then have all code snippets concatenated into box.ts, and run:

npx ts-node box.ts

Expected output:

Connected

Tall and thin: [
  {
    id: 3,
    container_id: 2,
    width: 10,
    depth: 15,
    height: 40,
    volume: 6000
  },
  {
    id: 13,
    container_id: null,
    width: 15,
    depth: 15,
    height: 50,
    volume: 4712.38898038469
  }
]

Total contained volume: 132454.4776152108175

Appendix 2: unifying issue tracker data from two different sources

Here’s a snippet from actual Swarmia production code, with only cosmetic edits for display purposes, to demonstrate how we deal with different issue tracker data while still being able to treat them the same way further down the road.

const jiraIssueToPlainSwarmiaIssueCte = (organizationId: string) => sql`
  SELECT
    jira_issues.id,
    jira_issues.jira_id AS source_id,
    jira_issues.organization_id,
    jira_issues.jira_project_id AS project_id,
    'Jira' AS source,
    jira_issues.jira_team_ids as source_team_ids,
    jira_issues.author_identity_id,
    jira_issues.assignee_identity_id,
    jira_issues.parent_issue_id AS parent_issue_id,
    jira_issues.jira_key AS issue_key,
    jira_issues.summary AS title,
    jira_installations.base_url || '/browse/' || jira_issues.jira_key AS web_url,
    jira_issues.issue_type AS source_issue_type,
    jira_issues.issue_type AS swarmia_issue_type,
    jira_issues.labels AS labels,
    jira_issues.jira_created_at AS created_at,
    jira_issues.jira_updated_at AS updated_at
  FROM jira_issues
  JOIN jira_installations ON jira_installations.organization_id = jira_issues.organization_id
  JOIN jira_projects ON jira_projects.id = jira_issues.jira_project_id
  WHERE jira_installations.disabled_at IS NULL
    AND jira_issues.organization_id = ${organizationId}
    AND (jira_installations.sync_only_project_keys IS NULL
         OR jira_projects.jira_key = ANY(jira_installations.sync_only_project_keys))
`

const linearIssueToPlainSwarmiaIssueCte = (organizationId: string) => sql`
  SELECT
    linear_issues.id,
    linear_issues.linear_id AS source_id,
    linear_issues.organization_id,
    linear_issues.linear_project_id AS project_id
    'Linear' AS source,
    author_identity_id,
    assignee_identity_id,
    COALESCE(parent_issue_id, linear_project_id) AS parent_issue_id,
    linear_issues.key AS issue_key,
    title,
    url AS web_url,
    CASE
      WHEN ARRAY['bug', 'defect'] && ARRAY(SELECT LOWER(UNNEST(linear_issues.labels)))
      THEN 'Bug'
      ELSE 'Story'
    END AS swarmia_issue_type,
    'Issue' AS source_issue_type,
    labels AS labels,
    linear_issues.linear_created_at AS created_at,
    linear_issues.linear_updated_at AS updated_at,
  FROM linear_issues
  LEFT JOIN linear_team_settings ON linear_team_settings.linear_team_id = linear_issues.linear_team_id
  LEFT JOIN linear_teams ON linear_teams.id = linear_issues.linear_team_id
  WHERE linear_issues.organization_id = ${organizationId}
    AND linear_teams.deleted_at IS NULL
  GROUP BY linear_issues.id

  UNION

  -- Also regard Linear's projects as "issues" for data normalization purposes
  SELECT
    linear_projects.id,
    linear_projects.linear_id AS source_id,
    linear_projects.organization_id,
    linear_projects.id AS project_id
    'Linear' AS source,
    author_identity_id,
    NULL AS assignee_identity_id,
    NULL AS parent_issue_id,
    'Project' AS issue_key, -- Linear Projects don't have an "issue key" but we need one
    linear_projects.name AS title,
    'https://linear.app/project/' || slug_id AS web_url,
    'Epic' AS swarmia_issue_type,
    'Project' AS source_issue_type,
    ARRAY[]::text[] AS labels,
    linear_projects.linear_created_at AS created_at,
    linear_projects.linear_updated_at AS updated_at,
  FROM linear_projects
  LEFT JOIN linear_teams_projects ON linear_teams_projects.linear_project_id = linear_projects.id
  LEFT JOIN linear_team_settings ON linear_team_settings.linear_team_id = linear_teams_projects.linear_team_id
  LEFT JOIN linear_teams ON linear_teams.id = linear_teams_projects.linear_team_id
  WHERE linear_projects.organization_id = ${organizationId}
    AND linear_teams.deleted_at IS NULL
  GROUP BY linear_projects.id
`

Managing complex business logic in SQL using Common Table Expressions (CTEs)

So what are Common Table Expressions?

CTE benefits

When are CTEs not too useful?

CTEs as abstractions: example

Unified box interface

Contained boxes traversal

CTE abstractions in practice at Swarmia

Summary

Appendix 1: runnable example code for `box.ts`

Appendix 2: unifying issue tracker data from two different sources

More content from Swarmia

Engineering benchmarks: A guide to improving time to deploy

What good Continuous Integration (CI) looks like

Managing complex business logic in SQL using Common Table Expressions (CTEs)

So what are Common Table Expressions?

CTE benefits

When are CTEs not too useful?

CTEs as abstractions: example

Unified box interface

Contained boxes traversal

CTE abstractions in practice at Swarmia

Summary

Appendix 1: runnable example code for box.ts

Appendix 2: unifying issue tracker data from two different sources

More content from Swarmia

Engineering benchmarks: A guide to improving time to deploy

What good Continuous Integration (CI) looks like

Appendix 1: runnable example code for `box.ts`