GROUP BY SQL: how to group and analyze data in SQL

November 11, 2023
By Schenia T

GROUP BY is an essential SQL clause for grouping and analyzing data efficiently. Its main purpose is to simplify the analysis of large data sets, so that users can extract meaningful information and build reports based on specific criteria. GROUP BY collects records that share the same values into groups, and it is combined with other clauses such as SELECT, WHERE and ORDER BY to specify the grouping criteria and the information to extract and display in the results.

Throughout this article, we will cover the fundamental concepts of GROUP BY: its basic syntax, aggregation functions, common uses, advanced features such as subqueries and HAVING, and optimization. Together, they offer a solid foundation for a more in-depth understanding of GROUP BY and its application in data analysis.

GROUP BY can be used in any SQL database, such as a MySQL database included with Copahost website hosting.

Basic GROUP BY syntax

The GROUP BY syntax in SQL is used to group the results of a query according to the specified columns. The basic syntax of GROUP BY is as follows:

SELECT column1, column2, ...
FROM table_name
GROUP BY column1, column2, ...;

In this example, the query selects column1, column2, and so on from table_name and groups the results by the columns listed after GROUP BY. Here are some important points about the GROUP BY syntax:

We must use GROUP BY in conjunction with a SELECT query.
GROUP BY comes after the FROM clause (and after WHERE, if there is one) and before HAVING and ORDER BY.
The columns or expressions listed after GROUP BY are the grouping columns. They do not need parentheses, and there is no default aggregation: every column in the SELECT list should either appear in GROUP BY or be wrapped in an aggregation function such as COUNT(*).
We can use GROUP BY to group data using one or more columns. If only one column is specified, the results contain one row for each distinct value of that column. If multiple columns are specified, the results contain one row for each distinct combination of values in those columns.
When we use GROUP BY with aggregation functions, such as SUM, COUNT, AVG, MAX, or MIN, these functions are applied to each grouped subset of data.
GROUP BY is a standard feature of SQL and is supported by all major database management systems (DBMS).
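The difference between grouping by one column and grouping by several can be seen in a small runnable sketch. It uses Python's built-in sqlite3 module with an in-memory database so that no server is needed; the employees table and its sample rows are made up for illustration, and the SQL itself should work unchanged in MySQL.

```python
import sqlite3

# In-memory database with a small, hypothetical employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, department TEXT, city TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Ana", "Sales", "Recife", 3000),
        ("Bruno", "Sales", "Recife", 3400),
        ("Carla", "Sales", "Natal", 2800),
        ("Davi", "IT", "Recife", 5000),
        ("Elisa", "IT", "Natal", 5200),
    ],
)

# One grouping column: one result row per distinct department.
by_department = conn.execute(
    "SELECT department, COUNT(*) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()

# Two grouping columns: one result row per distinct (department, city) pair.
by_department_city = conn.execute(
    "SELECT department, city, COUNT(*) FROM employees "
    "GROUP BY department, city ORDER BY department, city"
).fetchall()

print(by_department)
print(by_department_city)
```

With this sample data, the first query returns two rows (one per department) and the second returns four (one per department and city pair).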

Common uses of GROUP BY

GROUP BY is a clause in the SQL language that we use to group rows of data in a table based on one or more columns. Some of the common uses of GROUP BY include:

Data grouping:

SELECT department, city, COUNT(*) AS employees
FROM employees
GROUP BY department, city;

This code lists the number of employees in each department and city.

Row count:

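SELECT department, COUNT(*) AS total
FROM employees
GROUP BY department;

This code counts the total number of rows in each department. The department column is included in the SELECT list so that each count can be matched to its department.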

Calculate quantiles:

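SELECT department, city,
       AVG(salary) AS average,
       STDDEV(salary) AS standard_deviation,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median
FROM employees
GROUP BY department, city;

This code calculates the mean, standard deviation and median salary in each department and city. Function names vary between systems: STDDEV is accepted by MySQL, PostgreSQL and Oracle, while SQL Server calls it STDEV. PERCENTILE_CONT with WITHIN GROUP works as an aggregate in PostgreSQL and Oracle, but it is not available in MySQL.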
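When the database has no percentile or standard deviation function, the same per-group figures can be computed on the client. The following is a minimal sketch under that assumption, using Python's sqlite3 and statistics modules; the table contents are invented for illustration. Note that statistics.stdev is the sample standard deviation, whereas STDDEV in SQL is the population or the sample version depending on the system.

```python
import sqlite3
import statistics
from collections import defaultdict

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, city TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Sales", "Recife", 3000),
        ("Sales", "Recife", 3400),
        ("Sales", "Recife", 3800),
        ("IT", "Recife", 5000),
        ("IT", "Recife", 5200),
    ],
)

# Collect the salaries of each (department, city) group.
salaries = defaultdict(list)
for department, city, salary in conn.execute(
    "SELECT department, city, salary FROM employees"
):
    salaries[(department, city)].append(salary)

# Compute mean, sample standard deviation and median per group.
summary = {
    group: {
        "average": statistics.mean(values),
        "standard_deviation": statistics.stdev(values),
        "median": statistics.median(values),
    }
    for group, values in salaries.items()
}

for group, stats in summary.items():
    print(group, stats)
```

Every group in the sample has at least two rows, because statistics.stdev raises an error for a single value.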

Data analysis:

SELECT department, city, COUNT(*) AS employees,
       SUM(salary) AS total_salaries
FROM employees
GROUP BY department, city;

This code lists the number of employees and total salaries in each department and city.

Generating reports using GROUP BY:

SELECT department, city, COUNT(*) AS employees,
       SUM(salary) AS total_salaries,
       AVG(salary) AS average,
       STDDEV(salary) AS standard_deviation
FROM employees
GROUP BY department, city;

This code lists the number of employees, total salaries, average and standard deviation of salary in each department and city.

Identifying trends using GROUP BY:

SELECT department, city, COUNT(*) AS employees,
       SUM(salary) AS total_salaries,
       AVG(salary) AS average,
       STDDEV(salary) AS standard_deviation,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median,
       LAG(AVG(salary), 1, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS previous_average
FROM employees
GROUP BY department, city;

This code lists the number of employees, total salaries, average, standard deviation and median of salary in each department and city. Because a grouped query returns one row per group, LAG() cannot read the salary of an individual employee. Here it reads the average salary of the previous city within the same department, with cities ordered by average salary, and returns 0 for the first city.

Pattern discovery using GROUP BY:

SELECT department, city, COUNT(*) AS employees,
       SUM(salary) AS total_salaries,
       AVG(salary) AS average,
       STDDEV(salary) AS standard_deviation,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median,
       LAG(AVG(salary), 1, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS previous_average,
       LEAD(AVG(salary), 1, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS next_average
FROM employees
GROUP BY department, city;

This code lists the number of employees, total salaries, average, standard deviation and median of salary in each department and city. We use LAG() and LEAD() to obtain the average salary of the previous and of the next city within the same department, with cities ordered by average salary. Both functions work on the grouped rows, so they compare groups and not individual employees.
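A grouped query returns one row per group, so LAG() and LEAD() compare groups rather than individual employees. The sketch below makes the two steps explicit by aggregating in a common table expression first and applying the window functions to the result. It assumes Python's sqlite3 module with SQLite 3.25 or later (the first version with window functions); the table, the alias names and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, city TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Sales", "Natal", 2000), ("Sales", "Natal", 3000),
        ("Sales", "Recife", 3000), ("Sales", "Recife", 4000),
        ("Sales", "Joao Pessoa", 4000), ("Sales", "Joao Pessoa", 5000),
        ("IT", "Recife", 6000),
    ],
)

query = """
WITH city_stats AS (
    -- Step 1: one row per department and city.
    SELECT department, city, AVG(salary) AS average
    FROM employees
    GROUP BY department, city
)
-- Step 2: compare each city with its neighbours inside the department.
SELECT department, city, average,
       LAG(average, 1, 0)  OVER (PARTITION BY department ORDER BY average) AS previous_average,
       LEAD(average, 1, 0) OVER (PARTITION BY department ORDER BY average) AS next_average
FROM city_stats
ORDER BY department, average
"""
trend = conn.execute(query).fetchall()

for row in trend:
    print(row)
```

The first city of each department has no predecessor and the last has no successor, so those positions receive the default value 0.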

Identifying gaps:

SELECT department, city, COUNT(*) AS employees,
       SUM(salary) AS total_salaries,
       AVG(salary) AS average,
       STDDEV(salary) AS standard_deviation,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median,
       LAG(AVG(salary), 1, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS previous_average,
       LEAD(AVG(salary), 1, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS next_average,
       LAG(AVG(salary), 2, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS previous_average_2,
       LEAD(AVG(salary), 2, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS next_average_2
FROM employees
GROUP BY department, city;

This code lists the number of employees, total salaries, mean, standard deviation and median of salary in each department and city, together with the average salary of the neighbouring cities in the same department. We use LAG() and LEAD() with an offset of 1 to obtain the previous and next averages, and with an offset of 2 to obtain the averages two positions before and after. A large difference between neighbouring values points to a gap.

Anomaly detection:

SELECT department, city, COUNT(*) AS employees,
       SUM(salary) AS total_salaries,
       AVG(salary) AS average,
       STDDEV(salary) AS standard_deviation,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median,
       LAG(AVG(salary), 1, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS previous_average,
       LEAD(AVG(salary), 1, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS next_average,
       LAG(AVG(salary), 2, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS previous_average_2,
       LEAD(AVG(salary), 2, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS next_average_2,
       CASE
         WHEN PERCENT_RANK() OVER (ORDER BY AVG(salary)) > 0.9 THEN 'Anomaly'
         ELSE 'Normal'
       END AS anomaly
FROM employees
GROUP BY department, city;

This code lists the number of employees, total salaries, mean, standard deviation and median of salary in each department and city, the neighbouring averages at offsets 1 and 2, and an anomaly flag based on the 0.9 quantile. PERCENT_RANK() places the average salary of each group on a scale from 0 to 1 among all groups. Therefore, if a group ranks above 0.9, the CASE expression labels it an anomaly. Otherwise, we consider it normal.
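One way to flag groups above the 0.9 quantile is to rank the group averages with PERCENT_RANK(). The sketch below illustrates that approach; it is an assumption about how such a rule could be written, not the only way to detect anomalies. It uses Python's sqlite3 module (SQLite 3.25 or later for window functions) with invented data in which one group clearly stands out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, city TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Sales", "Recife", 3000), ("Sales", "Recife", 3200),
        ("Sales", "Natal", 2900),
        ("IT", "Recife", 3300), ("IT", "Recife", 3500),
        ("IT", "Natal", 3200),
        ("Board", "Recife", 9000), ("Board", "Recife", 11000),
    ],
)

query = """
WITH group_stats AS (
    SELECT department, city, AVG(salary) AS average
    FROM employees
    GROUP BY department, city
)
SELECT department, city, average,
       CASE
         -- PERCENT_RANK() is 0 for the lowest average and 1 for the highest.
         WHEN PERCENT_RANK() OVER (ORDER BY average) > 0.9 THEN 'Anomaly'
         ELSE 'Normal'
       END AS anomaly
FROM group_stats
ORDER BY average
"""
rows = conn.execute(query).fetchall()

# Keep only the groups that were flagged.
flagged = [(department, city) for department, city, _, flag in rows if flag == "Anomaly"]
print(flagged)
```

With only a handful of groups, the top-ranked group is always flagged, so in practice the threshold is meaningful only when there are many groups to compare.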

Future salary forecast using GROUP BY:

SELECT department, city, COUNT(*) AS employees,
       SUM(salary) AS total_salaries,
       AVG(salary) AS average,
       STDDEV(salary) AS standard_deviation,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median,
       LAG(AVG(salary), 1, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS previous_average,
       LEAD(AVG(salary), 1, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS next_average,
       LAG(AVG(salary), 2, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS previous_average_2,
       LEAD(AVG(salary), 2, 0) OVER (PARTITION BY department ORDER BY AVG(salary)) AS next_average_2
       -- standard SQL has no ARIMA function: the forecast is fitted outside the database
FROM employees
GROUP BY department, city;

This code lists the number of employees, total salaries, mean, standard deviation and median of salary in each department and city, together with the neighbouring averages at offsets 1 and 2. These per-group figures are the input for a salary forecast. Standard SQL has no built-in ARIMA function, so a model such as ARIMA has to be fitted outside the database, in a statistics tool, using the results of this query.

GROUP BY advanced features

The advanced features of GROUP BY in SQL allow you to perform more complex and detailed analyses on large volumes of data. Here are some examples of how we use these features:

1. Using aggregation functions in GROUP BY SQL:

By using aggregation functions with GROUP BY, we can perform more complex calculations with the grouped data. Each function is evaluated once per group: SUM calculates the total of a column within each group, COUNT returns the number of rows in each group, AVG calculates the average of a column within each group, MAX finds the maximum value in each group, and MIN finds the minimum value in each group.

Example:

SELECT Country, Region, SUM(Sales) AS TotalSales
FROM Sales
GROUP BY Country, Region

This example will present the total sum of sales for each country and region combination.
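To see all five aggregation functions evaluated once per group, here is a small sketch using Python's built-in sqlite3 module. It reuses the Sales table and column names from the example; the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Country TEXT, Region TEXT, Sales REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [
        ("USA", "East", 100),
        ("USA", "East", 200),
        ("USA", "West", 300),
        ("Brazil", "North", 50),
        ("Brazil", "North", 150),
    ],
)

# Each aggregate is computed separately for every (Country, Region) group.
totals = conn.execute(
    """
    SELECT Country, Region,
           SUM(Sales)  AS TotalSales,
           COUNT(*)    AS NumberOfRows,
           AVG(Sales)  AS AverageSales,
           MAX(Sales)  AS LargestSale,
           MIN(Sales)  AS SmallestSale
    FROM Sales
    GROUP BY Country, Region
    ORDER BY Country, Region
    """
).fetchall()

for row in totals:
    print(row)
```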

2. Using aggregation functions with subqueries in GROUP BY:

In addition to using simple aggregation functions, we can use subqueries to perform more complex calculations. This allows us to perform more detailed analyses on our data.

Example:

SELECT Country, Region, (SELECT AVG(Sales) FROM Sales WHERE Country = 'USA') AS USASales
FROM Sales
GROUP BY Country, Region

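In this example, we are using a subquery to calculate the average sales for the United States only, and then presenting this average next to each country and region combination. The subquery does not depend on the group, so the same value is repeated on every row and can serve as a reference for comparison.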
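The behaviour of the scalar subquery can be checked with a short sketch in Python's sqlite3 module. The data is made up, and an AverageSales column is added next to USASales (an addition for illustration) so that each group can be compared with the reference value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Country TEXT, Region TEXT, Sales REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [
        ("USA", "East", 100),
        ("USA", "East", 200),
        ("USA", "West", 300),
        ("Brazil", "North", 50),
        ("Brazil", "North", 150),
    ],
)

# The subquery is evaluated independently of the grouping,
# so USASales carries the same value on every result row.
comparison = conn.execute(
    """
    SELECT Country, Region,
           AVG(Sales) AS AverageSales,
           (SELECT AVG(Sales) FROM Sales WHERE Country = 'USA') AS USASales
    FROM Sales
    GROUP BY Country, Region
    ORDER BY Country, Region
    """
).fetchall()

for row in comparison:
    print(row)
```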

3. Using HAVING as an additional filter condition after GROUP BY in SQL:

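HAVING is a clause that we use after GROUP BY to filter groups of data based on a specific condition. Unlike WHERE, which filters individual rows before they are grouped, HAVING is evaluated after the aggregation, so it lets us keep only the groups that meet a certain condition.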

Example:

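SELECT Country, Region, AVG(Sales) AS AverageSales
FROM Sales
GROUP BY Country, Region
HAVING AVG(Sales) > 100000

In this example, we are using HAVING to keep only those data groups where average sales are greater than 100,000. MySQL also accepts the alias in the condition (HAVING AverageSales > 100000), but repeating the aggregation function works in other database systems as well.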
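The following sketch shows HAVING removing whole groups after aggregation. It uses Python's sqlite3 module with invented rows, and the threshold is lowered to 120 to suit the small sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Country TEXT, Region TEXT, Sales REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [
        ("USA", "East", 100),
        ("USA", "East", 200),
        ("USA", "West", 300),
        ("Brazil", "North", 50),
        ("Brazil", "North", 150),
    ],
)

# All groups, before any group-level filter.
all_groups = conn.execute(
    "SELECT Country, Region, AVG(Sales) FROM Sales "
    "GROUP BY Country, Region ORDER BY Country, Region"
).fetchall()

# HAVING is applied to the aggregated value of each group.
high_groups = conn.execute(
    "SELECT Country, Region, AVG(Sales) AS AverageSales FROM Sales "
    "GROUP BY Country, Region "
    "HAVING AVG(Sales) > 120 "
    "ORDER BY Country, Region"
).fetchall()

print(all_groups)
print(high_groups)
```

The Brazil and North group has an average of 100, so it appears in the first result but not in the second.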

In short, aggregation functions, subqueries and the HAVING clause allow us to perform more precise calculations on grouped data and to filter the results according to our needs.

GROUP BY optimization

GROUP BY optimization is an important part of SQL query performance, as this is one of the most used features in queries that involve data analysis. Here are some general SQL query optimization strategies that can improve GROUP BY performance:

Utilize indexes: Indexes can be used to speed up data fetching while executing a query. Check the execution plan to confirm that the indexes on the columns used in the GROUP BY are actually being used.
Table partitioning: Partitioning tables can help reduce query execution time, especially if the table is large. Depending on the system, partitions are defined with a partition function (SQL Server) or with a PARTITION BY clause in CREATE TABLE (MySQL, PostgreSQL).
Reduce the number of rows before the GROUP BY: If possible, try to reduce the number of rows being processed before the GROUP BY. This can be done using a WHERE clause or using a subquery.
Use efficient aggregation functions: Ensure that the aggregation functions only process the values you need. SQL has no SUMIF function, but you can sum only the values that meet a certain condition with SUM(CASE WHEN ... THEN ... END), or filter the rows with WHERE when every aggregate shares the same condition.
Use subqueries: Instead of using one large, complex query, try breaking the query into smaller, simpler subqueries. This makes the query easier to read and, in some cases, can reduce execution time.
Use datasets: Instead of using a large, complex table, try dividing the data into smaller, simpler datasets.
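Some of these strategies can be tried in a small sketch: an index on the grouping columns, a WHERE filter before the GROUP BY, and a conditional sum written with CASE. It uses Python's sqlite3 module, so the execution plan shown is SQLite's and will look different in MySQL (where EXPLAIN plays the same role); the table, the index name and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, city TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Ana", "Sales", "Recife", 3000),
        ("Bruno", "Sales", "Recife", 3400),
        ("Carla", "Sales", "Natal", 2800),
        ("Davi", "IT", "Recife", 5000),
        ("Elisa", "IT", "Natal", 5200),
    ],
)

# Index covering the filter column and the grouping column.
conn.execute(
    "CREATE INDEX idx_employees_city_department ON employees (city, department)"
)

# Filter with WHERE so fewer rows reach the GROUP BY.
filtered_sql = (
    "SELECT department, COUNT(*) FROM employees "
    "WHERE city = 'Recife' GROUP BY department ORDER BY department"
)
filtered = conn.execute(filtered_sql).fetchall()

# Conditional aggregation: sum only the salaries that meet a condition.
conditional = conn.execute(
    "SELECT department, "
    "       SUM(CASE WHEN city = 'Recife' THEN salary ELSE 0 END) "
    "FROM employees GROUP BY department ORDER BY department"
).fetchall()

# Inspect the plan; the last column describes each step,
# and its exact wording depends on the SQLite version.
plan = conn.execute("EXPLAIN QUERY PLAN " + filtered_sql).fetchall()
for step in plan:
    print(step[-1])

print(filtered)
print(conditional)
```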

Schenia T

Data scientist, passionate about technology tools and games. Undergraduate student in Statistics at UFPB. Her hobbies are binge-watching series, enjoying good music while working or cooking, going to the movies and learning new things!