Monday, January 18, 2010

Common Table Expressions – Dispensing with the Sub-Query

The Common Table Expression (CTE) is a construct that was introduced in SQL Server 2005. It allows you to define a SELECT statement outside of your main query and then reference it within the query. It’s a great replacement for sub-queries, which can be difficult to debug and maintain, especially once their nested. You can think of CTEs as SQL views that exist only within the scope of a particular SQL statement.

Let’s start with an query example similar to one we were asked to do recently. Let’s imagine you’re asked to write an inventory analysis report that requires the following columns:

  • Item ID
  • Item Description
  • Item Class
  • Standard Cost
  • Quantity On Hand
  • Count Sold – The total number of items sold.
  • Average Unit Price – The average unit price that the
  • Unique Customers Count – The count if individual customers who have purchased the item
  • Percentage of Customers – The percentage of all customers who have purchased the item

The report will have start and end date range parameters (Quantity On Hand will be the current regardless of the date range.

Item ID, Item Description, Item Class and Standard Cost come from the item master table. Quantity On Hand comes from the item quantity master table. Count Sold, Average Unit Price, Count of Unique Customers and Percentage of Customers come from a combination of the Sales Order and Customer Master tables.

Now I suspect there may be a way to do this without sub-queries using aggregates and CASE statements but that’s a challenge for another day. When solving a problem I like to break it into manageable chunks. In this case I’d looking at breaking this into two pieces:

  • Returning the core item information, Item ID, Item Description, Standard Cost and Quantity On Hand.
  • Returning the sales information, Count Sold and Average Unit Price, Unique Customer Count and Percentage of Customers.

Let’s start with the core item info. That query would look as follows:

SELECT
        ItemNumber = ISNULL(ItemMaster.ITEMNMBR, QuantityMaster.ITEMNMBR)
        ,ItemDescription = ItemMaster.ITEMDESC
        ,ItemClass = ItemMaster.ITMCLSCD
        ,StandardCost = ItemMaster.STNDCOST
        ,QuantityOnHand = ISNULL(QuantityMaster.QTYONHND, 0)
    FROM IV00101 AS ItemMaster
        FULL JOIN IV00102 AS QuantityMaster
            ON ItemMaster.ITEMNMBR = QuantityMaster.ITEMNMBR
    WHERE ISNULL(QuantityMaster.RCRDTYPE, 1) = 1

Here we’re joining the item master (IV00101) with the item quantity master (IV00102). A couple of notes:

  • We’re using a FULL JOIN to ensure we pick up all items. I have found examples where there are records in the item master or quantity master but not a corresponding record in the other.
  • We’re restricting the quantity master records to a record type (RCRDTYPE) of 1, which only returns quantities that apply to all sites.

Next we’ll look at the statistics. The query would look as follows:

SELECT
        ItemNumber = SOPLineItems.ITEMNMBR
        ,CountSold = SUM(SOPLineItems.QUANTITY)
        ,AverageUnitPrice =
            CASE
                WHEN SUM(SOPLineItems.QUANTITY) = 0 THEN 0.00
                ELSE SUM(SOPLineItems.QUANTITY * SOPLineItems.UNITPRCE) / SUM(SOPLineItems.QUANTITY)
            END
        ,CustomerCount = COUNT(DISTINCT CUSTNMBR)
        ,CustomerPercentage = CAST(COUNT(DISTINCT CUSTNMBR) AS float) /
            CAST ((SELECT COUNT(*) FROM RM00101) AS float)
    FROM SOP30300 AS SOPLineItems
        INNER JOIN SOP30200 AS SOPHeaders
            ON SOPLineItems.SOPTYPE = SOPHeaders.SOPTYPE
                AND SOPLineItems.SOPNUMBE = SOPHeaders.SOPNUMBE
    WHERE SOPHeaders.SOPTYPE = 3
        AND SOPHeaders.GLPOSTDT BETWEEN @StartDate AND @EndDate
    GROUP BY SOPLineItems.ITEMNMBR

Here we’re pulling the data from SOP line items and SOP headers with a total count from the customer master. A couple of notes:

  • We’re using a CASE statement in the Average Unit Price to ensure we don’t get a divide by zero error.
  • We’re filtering the rows on the SOP header GL posting date (GLPOSTDT).
  • We’re getting the item customer count using the COUNT(DISTINCT ) construct.
  • While getting the customer percentage we’re explicitly casting the counts as floats so that it doesn’t perform integer division and only return 0 and 1.

So traditionally, we would combine these together as two sub-queries as below:

SELECT
        ItemNumber = Items.ItemNumber
        ,ItemClass = Items.ItemClass
        ,ItemDescription = Items.ItemDescription
        ,StandardCost = Items.StandardCost
        ,QuantityOnHand = Items.QuantityOnHand
        ,CountSold = ISNULL(ItemStatistics.CountSold, 0)
        ,AverageUnitPrice = ISNULL(ItemStatistics.AverageUnitPrice, 0)
        ,CustomerCount = ISNULL(ItemStatistics.CustomerCount, 0)
        ,CustomerPercentage = ISNULL(ItemStatistics.CustomerPercentage, 0)
    FROM (
            SELECT
                    ItemNumber = ISNULL(ItemMaster.ITEMNMBR, QuantityMaster.ITEMNMBR)
                    ,ItemDescription = ItemMaster.ITEMDESC
                    ,ItemClass = ItemMaster.ITMCLSCD
                    ,StandardCost = ItemMaster.STNDCOST
                    ,QuantityOnHand = ISNULL(QuantityMaster.QTYONHND, 0)
                FROM IV00101 AS ItemMaster
                    FULL JOIN IV00102 AS QuantityMaster
                        ON ItemMaster.ITEMNMBR = QuantityMaster.ITEMNMBR
                WHERE ISNULL(QuantityMaster.RCRDTYPE, 1) = 1) AS Items
        LEFT JOIN (
                SELECT
                        ItemNumber = SOPLineItems.ITEMNMBR
                        ,CountSold = SUM(SOPLineItems.QUANTITY)
                        ,AverageUnitPrice =
                            CASE
                                WHEN SUM(SOPLineItems.QUANTITY) = 0 THEN 0.00
                                ELSE SUM(SOPLineItems.QUANTITY * SOPLineItems.UNITPRCE) / SUM(SOPLineItems.QUANTITY)
                            END
                        ,CustomerCount = COUNT(DISTINCT CUSTNMBR)
                        ,CustomerPercentage = CAST(COUNT(DISTINCT CUSTNMBR) AS float) /
                            CAST ((SELECT COUNT(*) FROM RM00101) AS float)
                    FROM SOP30300 AS SOPLineItems
                        INNER JOIN SOP30200 AS SOPHeaders
                            ON SOPLineItems.SOPTYPE = SOPHeaders.SOPTYPE
                                AND SOPLineItems.SOPNUMBE = SOPHeaders.SOPNUMBE
                    WHERE SOPHeaders.SOPTYPE = 3
                        AND SOPHeaders.GLPOSTDT BETWEEN @StartDate AND @EndDate
                    GROUP BY SOPLineItems.ITEMNMBR) AS ItemStatistics
                        ON Items.ItemNumber = ItemStatistics.ItemNumber

Although this works, it’s starting to get a little difficult to read. If there were additional sub-queries, particularly nested ones, it would be much harder to follow and debug. If we’re using CTEs we would create the query below:

-- Items common table expression returns item list with quantities
-- for all sites.
WITH Items
AS (
    SELECT
            ItemNumber = ISNULL(ItemMaster.ITEMNMBR, QuantityMaster.ITEMNMBR)
            ,ItemDescription = ItemMaster.ITEMDESC
            ,ItemClass = ItemMaster.ITMCLSCD
            ,StandardCost = ItemMaster.STNDCOST
            ,QuantityOnHand = ISNULL(QuantityMaster.QTYONHND, 0)
        FROM IV00101 AS ItemMaster
            FULL JOIN IV00102 AS QuantityMaster
                ON ItemMaster.ITEMNMBR = QuantityMaster.ITEMNMBR
        WHERE ISNULL(QuantityMaster.RCRDTYPE, 1) = 1
),

-- Item statistics common table expression returns the relevant
-- statistics for each item.
ItemStatistics
AS (
    SELECT
            ItemNumber = SOPLineItems.ITEMNMBR
            ,CountSold = SUM(SOPLineItems.QUANTITY)
            ,AverageUnitPrice =
                CASE
                    WHEN SUM(SOPLineItems.QUANTITY) = 0 THEN 0.00
                    ELSE SUM(SOPLineItems.QUANTITY * SOPLineItems.UNITPRCE) / SUM(SOPLineItems.QUANTITY)
                END
            ,CustomerCount = COUNT(DISTINCT CUSTNMBR)
            ,CustomerPercentage = CAST(COUNT(DISTINCT CUSTNMBR) AS float) /
                CAST ((SELECT COUNT(*) FROM RM00101) AS float)
        FROM SOP30300 AS SOPLineItems
            INNER JOIN SOP30200 AS SOPHeaders
                ON SOPLineItems.SOPTYPE = SOPHeaders.SOPTYPE
                    AND SOPLineItems.SOPNUMBE = SOPHeaders.SOPNUMBE
        WHERE SOPHeaders.SOPTYPE = 3
            AND SOPHeaders.GLPOSTDT BETWEEN @StartDate AND @EndDate
        GROUP BY SOPLineItems.ITEMNMBR
)

-- Final query joins the two CTEs together.
SELECT
        ItemNumber = Items.ItemNumber
        ,ItemClass = Items.ItemClass
        ,ItemDescription = Items.ItemDescription
        ,StandardCost = Items.StandardCost
        ,QuantityOnHand = Items.QuantityOnHand
        ,CountSold = ISNULL(ItemStatistics.CountSold, 0)
        ,AverageUnitPrice = ISNULL(ItemStatistics.AverageUnitPrice, 0)
        ,CustomerCount = ISNULL(ItemStatistics.CustomerCount, 0)
        ,CustomerPercentage = ISNULL(ItemStatistics.CustomerPercentage, 0)
    FROM Items
        LEFT JOIN ItemStatistics
            ON Items.ItemNumber = ItemStatistics.ItemNumber

In the query above we have two common table expressions, Items and ItemStatistics. We declare the first CTE using the WITH clause. Subsequent CTEs are separated with a comma. Note that all CTEs must be declared at the beginning of the query. The final query simply joins the two CTEs together

The biggest advantage to CTEs is their readability, manageability and the ease of troubleshooting. It’s much easier to break the query down and troubleshoot its individual components than to do the same with nested sub-queries. Other things to note about CTEs:

  • A CTE can reference other CTEs defined above it.
  • A query can join to a CTE multiple times. i.e., wherever you might use the same sub-query multiple times you can now define it once in a CTE and simply reference the CTE multiple times.
  • You can reference parameters and variables in a CTE and thus optimize your queries by filtering out records prior to joining them in your main query.
  • If you have any statements above a CTE, such as DECLARE or SET statements in a stored procedure, you must terminate those prior statements using a semicolon.

So let us bid farewell to the much vaunted and occasionally overused sub-query and hello to our new friend, the Common Table Expression.

Next up, the Recursive Common Table Expression.

No comments:

Post a Comment