27 May

We'll take the CASE

CASE WHEN is used to create categorical variables, aggregate data into a single column with multiple filtering conditions, and to calculate counts/percentages.

  • Basic CASE statements
    • Ex. We want to know the number of home wins, away wins and ties in our dataset.
    • Ex. Idenitfy the home team as Bayern Munich, Schalke 04, or neither
    • Ex. Identify all matches from FC Barcelona with their respective opponent and outcome of the game.
  • In CASE things get more complex
    • Add multiple logical conditions to your WHEN clause
    • What's NULL? If there are some observations in your table that are not specified in the CASE WHEN statement, it will add a NULL value to it.
    • Filtering your CASE statement. Ex. obtaining every detail on every match where Bologna won.
  • CASE WHEN with aggregate functions.
    • with COUNT. 
      • Remark. In this case we are using the id column int THEN statement, but in fact you could use any value (like "random text" or 10008000) as you are just counting your observations
    • with SUM. 
    • with AVG.
    • percentages.
    • Remarks
      • All matches that have nothing to do with Liverpool, or ties in Liverpool games as marked as NULL and therefore not taken into account in the aggregates functions.
      • We could show rounded results, as shown in the following example
    • Ex. Number of matches for each season by country
    • In R or Python, you have the ability to calculate a SUM of logical values (i.e., TRUE/FALSE) directly. In SQL, you have to convert these values into 1 and 0 before calculating a sum. This can be done using a CASE statement. 
      • Calculating percent with CASE and AVG


Short and Simple Subqueries

We will see all about subqueries in the SELECT, FROM and WHERE clauses.

A subquery is a query nested inside another query. These are useful for intermediary transformations. Remark: They can be evaluated independently from the outer query. Subqueries are processed only once.

What do you do with subqueries? 

Why subequeries? 


  • WHERE clause. These are useful for filtering results based on information you'd have to calculate separately before hand.
    • Ex. Which matches in the 2012/2013 season scored home goals higher than overall average?
      • Without a subquery
      • With a subquery
    • Subquery for generating a filtering list. Ex. Which teams are part of Poland's league?
  • FROM clause.
    • Ex. Answering "Which 3 teams has the highest average of home goals scored?"
      • 1st. You will create the query that will become your subquery. For each team we now know their average home goals as well as the team's name
      • 2nd. Put it all together
    • Things to remember
    • Ex. 
      • Count of matches in each country where the total goals was higher than 10. 
      • Let's find out more about these games
  • SELECT clause.
    • Remark. You cannot include an aggregate value in an ungrouped SQL query
    • Ex. For each season we can extract their number of matches as well as the overall number of matches across seasons.
    • Ex. For each match we get the total goals and their difference from the average amount of goals in the season.
    • Things to remember
  • Subqueries everywhere and best practices. You can add as many subqueries as you want, however queries can easily be hard to read.
    • Best practices.
      • Format your queries
      • Annotate you queries. Specify the use of each part of your query
      • Ident your queries
      • Is that subquery really necessary? maintain only those subqueries that are necessary
      • Properly filter each subquery.
    • Examples
      • Average goals for each stage in the season along with the overall average of goals
      • Stages with an average home goals greater than the overall average home goals
      • We will now add a subquery in SELECT to compare the average number of goals scored in each stage to the total. 


Correlated queries, nested queries, and common table expressions

  • Correlated subqueries. 
    • Ex. In the WHERE and SELECT clause, respectively
    • Simple vs Correlated subqueries
    • Ex. Examine matches with scores that are extreme outliers for each country -- above 3 times the average score! 
    • Ex. What was the highest scoring match for each country, in each season? 
  • Nested subqueries. Some typesof questions you answer may requiere multiple layers of transformation and filtering of data before you extract it into the main query.
    • Ex. Let's answer the following question: How does each month's total goals differ from the average monthly total of goals scored?
    • Correlated nested subqueries
      • Ex. What is each country's average goals scored in a match in the season 2011/2012 season?
    • Ex. Examine the highest total number of goals in each season, overall, and during July across all seasons. 
    • Ex. What's the average number of matches per season where a team scored 5 or more goals? How does this differ by country? Let's use a nested, correlated subquery (in FROM) to perform this operation.
    • Remark. Nesting subqueries and performing your transformations one step at a time, adding it to a subquery, and then performing the next set of transformations is often the easiest way to yield accurate information about your data. 
  • Common table expressions. Common table expressions (CTEs) are a method for improving readability and accessibility of information in subqueries. These are a special type of subquery that are declared ahead of your main query.
    • You name the subquery using the WITH statement, and then reference it by name later in the FROM statement as if it were any other table in your database.
    • Ex. Without and with a CTE
    • Ex. Declare a CTE that calculates the total goals from matches in August of the 2013/2014 season.
    • Why CTEs
    • Remark. You can list multiple CTEs just with a comma in between each CTE, and no comma after the last one
  • Deciding on techniques to use. There is a lot of overlap between use cases for joins, subqueries, and common table expressions; that is, these techniques can be used interchangeably without sacrifying query time or the accuracy of the output.
    • Differentiating techniques.
      • Joins: Combine 2+ tables. For simple operations/Aggregations.
      • Correlated subqueries: To combine info between a subquery and a table (or another subquery). Avoid limits of joins, namely that you can't join two separate column in another at a time. However, correlated subqueries take high processing time.
      • Multiple/Nested subqueries: For multi-step transformations. Improve accuracy and reproducibility.
      • Common table expressions. Organize subqueries sequentially, and they can reference other CTEs.
    • So which do I use?
    • Different use cases
    • Ex. 


Window Functions

One limitation until now is that you have to group results when using aggregate functions, thus you cannot compare aggregate values to non-aggregate data. Windows functions work similar to subqueries in 

Window functions are a class of functions that perform calculations on a result set that has already been generated (a window). You can use these to perform aggregate calculations without having to group your data. 

You can use them for:

  • Running totals
  • Rankings
  • Moving averages


  • It's OVER.
    OVER() tells SQL to "pass this aggregate value over this existing result set"
    • Ex. How many goals were socred in each match in 2011/2012, and how did that compare to the average?. 1. Using a subquery in SELECT vs 2. Using the clause common to all windows functions- the OVER clause.
      • Remark. Using this OVER clause is a lot faster in runtime.
    • RANK creates a column numbering your data set from lowest to highest (or hight to low), based on a column that you specify
      • Ex. Create a data set of ranked matches in desc order according to which leagues, on average, score the most goals in a match. 
        • Remark. Unlike a subquery in SELECT, your window function will apply the filter that you include in your WHERE clause.


  • OVER with a PARTITION
    The real thing of Window functions that differentiates them from subqueries in SELECT are in the functions you can add within the OVER clause
    • PARTITION BY. A partition allows you to calculate separate value for different categories established in a partition. Also calculate different calculation in the same column
    • Ex.
      •  
    • PARTITION by multiple columns. You can also use PARTITION to calculate values broken out by multiple columns
    • Ex. (The result set returns the average goals scored broken out by season and country)
    • Considerations
  • Sliding Windows.
    Window functions can also be used to calculate information that changes with each subsequent row in a data set.A sliding funtion contains specific functions within the OVER clause to specify the data you want to use in your calulations. The general syntax isYou use the phrase ROWS BETWEEN to indicate that you plan on slicing information in your window function for each row in the data set an then you specify the starting and finishing point of the calculation. For the start and finish space you can specify any of the shown keywords.
    • PRECEDING and FOLLOWING are used to specify the number of rows before, or after, the current row that you want to include in a calculation.
    • UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING tell SQL that you want to include every row since the beginning, or the end, of the data set in your calculations.
    • CURRENT ROW tells SQL that you want to stop your calculation at the current row.

    • Ex. Create a column calculating the total number of goals scored across the season, with a final total listed in the last row
    • Using the PRECEDING statement, you also have the ability to calculate sliding windows with a more limited frame. For ex. 
    • Remark. There are a wide variety of sliding windows you can use to calculate information in your query.


  • Bringing it all together.
    Case study!
    Who defeated Manchester United in the 2013/2014 season?
Comments
* The email will not be published on the website.
I BUILT MY SITE FOR FREE USING