How to use hash join in SQL Server?

A hash join is an efficient physical join operator in SQL Server, particularly useful for combining large, unsorted datasets or when one of the join inputs is significantly smaller than the other.

What is a Hash Join?

A hash join is an algorithm SQL Server uses to combine rows from two inputs (often tables or results of previous operations) based on an equality condition. It operates in two main phases:

Build Phase: SQL Server first scans or computes the entire build input. This is typically the smaller of the two inputs. It then constructs a hash table in memory for the build input. Each row from the build input is inserted into a hash bucket, determined by a hash value computed for its hash key (the join column(s)). If the entire build input is smaller than the available memory, all its rows can be inserted directly into this in-memory hash table, making the process very fast and efficient.
Probe Phase: Next, SQL Server scans the second input, called the probe input. For each row in the probe input, it calculates a hash value for its join key and then probes the hash table created in the build phase. If a match is found in the hash table, the corresponding rows are combined and returned as part of the query result.

This method is highly effective for large datasets, especially when neither input has useful indexes on the join columns or when the data is unsorted.

When Does SQL Server Use a Hash Join?

SQL Server's query optimizer automatically chooses the most efficient join algorithm (nested loops, merge join, or hash join) based on various factors. A hash join is typically chosen in the following scenarios:

Large, Unsorted Inputs: It's often preferred for joining large tables where neither input is sorted on the join key, or when no suitable non-clustered indexes exist for a nested loops join.
Size Disparity: When one input (the build input) is significantly smaller than the other (the probe input), making it feasible to build a hash table in memory for the smaller input.
Equality Joins: Hash joins are designed specifically for equality join predicates (e.g., ON A.ID = B.ID). They are not suitable for non-equality joins (e.g., <, >, LIKE).
Absence of Suitable Indexes: If there are no efficient indexes on the join columns that would make a nested loops join performant, or if a merge join is not possible due to unsorted data, the optimizer may choose a hash join.

How to Explicitly Force a Hash Join

While relying on the optimizer is generally recommended, you can explicitly suggest or force SQL Server to use a hash join using query hints. This is typically done for specific performance tuning scenarios where you've identified that a hash join performs better than the optimizer's default choice.

Using `OPTION (HASH JOIN)`

This hint forces all join operations in the query to use the hash join algorithm.

Syntax Example:

SELECT
    p.ProductName,
    o.OrderDate
FROM
    Production.Products AS p
INNER JOIN
    Sales.OrderDetails AS od ON p.ProductID = od.ProductID
INNER JOIN
    Sales.Orders AS o ON od.OrderID = o.OrderID
OPTION (HASH JOIN);

Using `INNER HASH JOIN`

You can also specify the HASH JOIN hint directly between the tables in the FROM clause if you only want to force a specific join to be a hash join.

Syntax Example:

SELECT
    e.FirstName,
    d.DepartmentName
FROM
    HumanResources.Employees AS e
INNER HASH JOIN
    HumanResources.Departments AS d ON e.DepartmentID = d.DepartmentID;

Important Note: Forcing join hints should be done with caution. It can override potentially better choices by the optimizer and may lead to performance degradation if not used appropriately. Always test thoroughly.

Understanding Hash Join Types

The efficiency of a hash join heavily depends on whether the build input can fit entirely into memory. SQL Server manages this through different types of hash joins:

In-Memory Hash Join: This is the most efficient and ideal scenario. The entire build input fits into available memory, allowing all rows to be hashed and stored in RAM. This leads to very fast probe operations as there's no disk I/O involved for the hash table itself.
Graceful Hash Join: If the build input is too large to fit entirely into memory, SQL Server employs a graceful hash join. It partitions both the build and probe inputs into multiple files on disk (in tempdb). It then processes these smaller partitions one by one, building an in-memory hash table for each partition. This reduces memory pressure but introduces I/O to tempdb.
Recursive Hash Join: In extreme cases, if even a single partition from a graceful hash join is too large to fit into memory, SQL Server might resort to a recursive hash join. This involves further subdividing the partitions, potentially leading to multiple passes over tempdb. This indicates severe memory pressure and can be a significant performance bottleneck.

Advantages and Disadvantages of Hash Join

Feature	Advantages	Disadvantages
Performance	Excellent for joining large, unsorted datasets.	Can be slow if memory is insufficient, leading to `tempdb` spills.
	Efficient for non-indexed equality join columns.	Not ideal for small inputs where a Nested Loops Join might be faster.
Memory Usage	Optimal when the build input fits completely in memory.	Memory-intensive; excessive memory grants can impact other queries.
Data Skew	Generally handles data skew better than a Merge Join in certain scenarios.	Can still suffer performance issues if hash bucket collisions are severe.
Join Type	Exclusively suited for equality joins (`=`).	Not applicable for non-equality joins (`<`, `>`, `LIKE`).
Input Sorting	Does not require inputs to be sorted, avoiding sorting overhead.	Builds a hash table, which adds an initial processing step.

Practical Considerations and Best Practices

Monitor tempdb Usage: Frequent tempdb spills by hash joins (indicating graceful or recursive hash joins in execution plans) often point to insufficient memory or suboptimal query design. Monitor tempdb activity and look for "Workfiles" or "Worktables" in query plans.
Examine Execution Plans: Always review the graphical execution plan to understand which join operator SQL Server is using and why. Pay close attention to warnings about memory grants or spills.
Memory Grant Feedback: SQL Server 2017 and later versions include memory grant feedback, which can dynamically adjust the memory granted to hash join operators over time to improve performance and reduce spills.
Avoid Forcing Joins Unnecessarily: Trust the query optimizer unless you have a compelling, tested reason to override its decision. Query hints should be used judiciously and after thorough testing.
Index Strategically: While hash joins don't require indexes, having appropriate indexes can sometimes allow the optimizer to choose a Nested Loops or Merge Join, which might be more efficient for specific scenarios.

By understanding how hash joins work and their typical use cases, you can better interpret SQL Server's query plans and diagnose performance issues related to join operations. For more in-depth information, consult the official Microsoft Learn documentation on Join Hints.