Reasoning about “complicated” Postgres SQL queries

TL;DR

When building non-trivial SQL queries, it helps to:

State your objective in as “declarative” a way as possible
Work from the “inside out”. When using subqueries in SQL it helps to find the innermost nested subquery and work your way outwards to figure out what is going on. Similarly, I find that when building an SQL query it helps to “wrap” subqueries and work your way to an objective

Scenario

At CyberSift we maintain a threat feed service we codenamed “Cerebrum”. One of Cerebrum’s functions is to curate a mapping IP Addresses to their corresponding ASN information. So a Postgres table similar to the following exists:

         timestamp         |   prefix    | description | asn_number 
---------------------------+-------------+-------------+-----------
2019-10-15 03:46:08.636552 | 2.56.9.0/24 |  CLOWN-AS   |  208076  |

We keep historical data, so a given prefix may have multiple entries. However we periodically need to clean the table and keep only the latest entry for any given prefix. So the objective would be:

For any given prefix, keep the row with the latest timestamp

A better objective definition

In order to arrive to our final objective, we first need to break it down to express it more effectively in SQL. My reasoning landed on these objectives, in order:

List any prefixes that have more than two rows (i.e. duplicates)
List the LATEST entry for all the duplicate entries
Delete any duplicate rows which ARE NOT in the latest entries list

The job is to define SQL statements for each of the above

SQL: List any prefixes that have more than two rows

“List” should immediately bring to mind the “SELECT” operation:

SELECT prefix,COUNT(*) FROM asn GROUP BY prefix HAVING COUNT(*)>1

This is relatively straightforward. Group rows by prefix, and only show those having more than one entry:

prefix | count
------------+----
2.56.9.0/24 | 3

let’s wrap the query in a subquery because we’ll need it for the following objective. We end up with something like:

(SELECT prefix,count(*) FROM asn GROUP BY prefix HAVING count()>1) as sub

SQL: List the LATEST entry for all the duplicate entries

This objective depends on the first. When we have such a dependency, “INNER JOIN” is a good place to start. The dependency is on our previous table which we aliased as “sub” above, so we know we’re starting with:

INNER JOIN (SELECT prefix,COUNT(*) FROM asn GROUP BY prefix HAVING COUNT(*)>1) as sub

“INNER JOIN” requires two tables to tango, and a “join on” column… which column to use to join/merge your two tables. One table is the “sub” table we just built before, and the other table is our asn table, so we end up with:

SELECT asn.prefix,asn.timestamp FROM asn INNER JOIN 
   (SELECT prefix,COUNT(*) FROM asn GROUP BY prefix HAVING COUNT(*)>1) 
   AS sub 
ON asn.prefix=sub.prefix ORDER BY asn.prefix,asn.timestamp DESC

We “join” on the prefix, and select the resulting prefixes and timestamps. This gives us a list of prefixes, and timestamps sorted with most recent on top. How do we select only the latest entry? SELECT DISTINCT ON was my lesson for the day. This operator selects only the first matching row for a given condition.

SELECT DISTINCT ON (asn.prefix) asn.prefix,asn.timestamp FROM asn INNER JOIN 
   (SELECT prefix,COUNT(*) FROM asn GROUP BY prefix HAVING COUNT(*)>1) 
   AS sub 
ON asn.prefix=sub.prefix ORDER BY asn.prefix,asn.timestamp DESC

That gives us the latest entry for all duplicate prefixes. Just like before, let’s wrap all this in a subquery, which I am calling “latest” below:

(
 SELECT DISTINCT ON (asn.prefix) asn.prefix,asn.timestamp FROM asn INNER JOIN 
   (SELECT prefix,COUNT(*) FROM asn GROUP BY prefix HAVING COUNT(*)>1) 
   AS sub 
 ON asn.prefix=sub.prefix ORDER BY asn.prefix,asn.timestamp DESC
) AS latest

SQL: Delete any duplicate rows which ARE NOT in the latest entries list (1)

Tip: Before actually deleting any data it’s a good idea to instead use SELECT until you’re sure the data being returned is that you actually want to delete…

Once again this objective depends on the previous result, so an attentive reader already knows that we’re again going to have to wrap our progress so far in an INNER JOIN, leaving us with the starting point:

INNER JOIN (
   SELECT DISTINCT ON (asn.prefix) asn.prefix,asn.timestamp FROM asn INNER JOIN 
     (SELECT prefix,COUNT(*) FROM asn GROUP BY prefix HAVING COUNT(*)>1) 
     AS sub 
   ON asn.prefix=sub.prefix ORDER BY asn.prefix,asn.timestamp DESC
  ) AS latest

What do we join with however? We need to join again on prefix, however this time we need to select those rows which are NOT in our “latest” table

SELECT current.timestamp,current.prefix FROM asn AS current 
 INNER JOIN (
   SELECT DISTINCT ON (asn.prefix) asn.prefix,asn.timestamp FROM asn INNER JOIN 
     (SELECT prefix,COUNT(*) FROM asn GROUP BY prefix HAVING COUNT(*)>1) 
     AS sub 
   ON asn.prefix=sub.prefix ORDER BY asn.prefix,asn.timestamp DESC
   ) AS latest
ON current.prefix=latest.prefix WHERE current.prefix=latest.prefix AND current.timestamp!=latest.timestamp

The first new line means we’re going to select the timestamp and prefix of all our ASNs and name the resulting table “current“. We join this “current” table with our previous “latest” table on prefix (since this is our common field on which our dependency relies on) however we use the WHERE operator to filter on those entries which have the same prefix, but not the latest timestamp

All together this gives is a list of entries which are duplicates, and not the latest entry.

Now that we’re confident the returned data is what we expect, time to delete these entries…

SQL: Delete any duplicate rows which ARE NOT in the latest entries list (2)

Deleting is the easiest part… and the most dangerous hence my TIP above… Now that we have the list of rows we’d like to delete, we wrap our entire query so far into a DELETE operation condition:

DELETE FROM asn WHERE (timestamp,prefix) IN (
  SELECT current.timestamp,current.prefix FROM asn AS current 
   INNER JOIN (
     SELECT DISTINCT ON (asn.prefix) asn.prefix,asn.timestamp FROM asn INNER JOIN 
       (SELECT prefix,COUNT(*) FROM asn GROUP BY prefix HAVING COUNT(*)>1) 
       AS sub 
     ON asn.prefix=sub.prefix ORDER BY asn.prefix,asn.timestamp DESC
     ) AS latest
  ON current.prefix=latest.prefix WHERE current.prefix=latest.prefix 
  AND current.timestamp!=latest.timestamp
);

The DELETE operator has a condition specified in the WHERE expression. This essentially states that we’d like to delete any entry where the timestamp and prefix field are IN the table we’ve built using the select statements so far.

Conclusion

I realize this is a hyper-specific example, but I hope it does give an insight into how we build a non-trivial statement from the “inside out”. We broke our objective into smaller atomic operations, which we then wrapped into subqueries for use in join operations to finally get what we required