Keywords: SQL Server | ROW_NUMBER() | CROSS APPLY | Gaps-and-Islands | Time Series Analysis
Abstract: This paper delves into the classic Gaps-and-Islands problem in SQL Server when handling employee department change histories. Through a detailed case study, it demonstrates how to combine the ROW_NUMBER() window function with CROSS APPLY operations to identify continuous time periods and generate start and end dates for each department. The article explains the core algorithm logic, including data sorting, group identification, and endpoint calculation, while providing complete executable code examples. This method avoids simple partitioning limitations and is suitable for complex time-series data analysis scenarios.
In database applications, handling continuous intervals in time-series data, known as the Gaps-and-Islands problem, is a common challenge. This paper explores how to use the ROW_NUMBER() window function and CROSS APPLY operations in SQL Server 2008 R2 and later versions to address employee department change time periods, based on a specific case study.
Problem Background and Data Example
Assume an EmployeeHistory table that records employee department and supervisor changes over time. Sample data is as follows:
EmployeeID Date DepartmentID SupervisorID
10001 20130101 001 10009
10001 20130909 001 10019
10001 20131201 002 10018
10001 20140501 002 10017
10001 20141001 001 10015
10001 20141201 001 10014
Employee 10001 changes departments from 001 to 002 and back to 001. The goal is to output the start and end dates for each continuous department period:
EmployeeID DateStart DateEnd DepartmentID
10001 20130101 20131201 001
10001 20131201 20141001 002
10001 20141001 NULL 001
A simple ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID ORDER BY Date) cannot directly solve this, as department IDs repeat.
Core Solution Overview
The best answer employs a multi-step approach, combining temporary tables, ROW_NUMBER(), and CROSS APPLY. Key steps include: data sorting and initialization, group identification calculation, end date determination, and result filtering.
Detailed Implementation Steps
First, create a test table and insert data:
CREATE TABLE Source
(
EmployeeID int,
DateStarted date,
DepartmentID int
)
INSERT INTO Source
VALUES
(10001,'2013-01-01',001),
(10001,'2013-09-09',001),
(10001,'2013-12-01',002),
(10001,'2014-05-01',002),
(10001,'2014-10-01',001),
(10001,'2014-12-01',001)
Next, use ROW_NUMBER() to sort the data and initialize a temporary table:
SELECT *,
ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS EntryRank,
newid() as GroupKey,
CAST(NULL AS date) AS EndDate
INTO #RankedData
FROM Source
;
Here, EntryRank numbers each employee record in chronological order, GroupKey is used for subsequent grouping, and EndDate is initialized as NULL.
Group Identification Calculation
Use CROSS APPLY and NOT EXISTS conditions to determine the start group key for each record:
UPDATE #RankedData
SET GroupKey = beginDate.GroupKey
FROM #RankedData sup
CROSS APPLY
(
SELECT TOP 1 GroupKey
FROM #RankedData sub
WHERE sub.EmployeeID = sup.EmployeeID AND
sub.DepartmentID = sup.DepartmentID AND
NOT EXISTS
(
SELECT *
FROM #RankedData bot
WHERE bot.EmployeeID = sup.EmployeeID AND
bot.EntryRank BETWEEN sub.EntryRank AND sup.EntryRank AND
bot.DepartmentID <> sup.DepartmentID
)
ORDER BY DateStarted ASC
) beginDate (GroupKey);
This step ensures that records within the same continuous time period share the same GroupKey, achieved by checking for department changes in between.
End Date Determination
Calculate the end date for each time period:
UPDATE #RankedData
SET EndDate = nextGroup.DateStarted
FROM #RankedData sup
CROSS APPLY
(
SELECT TOP 1 DateStarted
FROM #RankedData sub
WHERE sub.EmployeeID = sup.EmployeeID AND
sub.DepartmentID <> sup.DepartmentID AND
sub.EntryRank > sup.EntryRank
ORDER BY EntryRank ASC
) nextGroup (DateStarted);
Here, EndDate is set to the start date of the next record with a different department, with the last period as NULL.
Result Filtering and Output
Finally, filter the first record of each group for output:
SELECT * FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY GroupKey ORDER BY EntryRank ASC) AS GroupRank FROM #RankedData
) FinalRanking
WHERE GroupRank = 1
ORDER BY EntryRank;
The output matches the expected results, showing continuous time periods for each department.
Supplementary References and Optimizations
Other answers mention using self-join methods, such as:
;WITH x
AS (SELECT *,
Row_number()
OVER(
partition BY employeeid
ORDER BY datestart) rn
FROM employeehistory)
SELECT *
FROM x x1
LEFT OUTER JOIN x x2
ON x1.rn = x2.rn + 1
This method identifies change points by joining adjacent rows but may require additional processing to aggregate time periods. In contrast, the main solution is clearer and more extensible.
Performance and Applicability Analysis
This solution is suitable for SQL Server 2008 and later, leveraging the efficiency of window functions and CROSS APPLY. For large datasets, consider indexing optimizations, such as creating indexes on EmployeeID and DateStarted. Time complexity primarily depends on sorting and join operations, typically O(n log n).
Conclusion
By combining ROW_NUMBER() and CROSS APPLY, the Gaps-and-Islands problem can be effectively solved, especially in complex time-series scenarios like department changes. This approach not only provides accurate start and end date calculations but also demonstrates the flexible application of SQL Server's advanced features. In practice, adjustments can be made based on specific needs, such as handling multiple employees or adding more filter conditions.