Keywords: PostgreSQL | Random Sampling | Performance Optimization | Large Table Query | Index Scanning
Abstract: This article provides an in-depth exploration of optimized random row selection techniques for large-scale data tables in PostgreSQL. By analyzing performance bottlenecks of traditional ORDER BY RANDOM() methods, it presents efficient algorithms based on index scanning, detailing various technical solutions including ID space random sampling, recursive CTE for gap handling, and TABLESAMPLE system sampling. The article includes complete function implementations and performance comparisons, offering professional guidance for random queries on billion-row tables.
Problem Background and Performance Challenges
Random row selection is a common but challenging requirement when dealing with massive data tables. Traditional methods like SELECT * FROM table WHERE random() < 0.01 and SELECT * FROM table ORDER BY random() LIMIT 1000, while syntactically simple, face severe performance issues on tables with millions of rows. The former must scan the entire table, while the latter must scan it and then sort it—both operations consume substantial computational resources and time on large datasets.
Core Principles of Efficient Random Sampling
For large tables with numeric ID columns and few gaps, index scanning can be leveraged to avoid full table operations. The basic approach involves generating random numbers within the ID space and then quickly locating corresponding rows through indexes. The key to this method lies in accurately estimating table size and ID distribution range, using appropriate buffers to handle potential gap issues.
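To make the examples below concrete, a hypothetical test table can serve as a frame of reference. The names, row count, and gap ratio here are illustrative assumptions, not measurements from the article:

```sql
-- Hypothetical test setup: ~5.1M IDs with ~2% gaps to mimic deleted rows.
CREATE TABLE big (id int PRIMARY KEY, payload text);

INSERT INTO big
SELECT g, 'row ' || g
FROM   generate_series(1, 5100000) g
WHERE  g % 50 <> 0;   -- leave every 50th ID missing

ANALYZE big;          -- refresh reltuples / relpages for the estimates below
```

The PRIMARY KEY is essential: it provides the index that every sampling query below relies on for fast ID lookups.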
Basic Query Implementation
First, obtain table statistics:
SELECT count(*)          AS ct,       -- exact row count (slow on huge tables)
       min(id)           AS min_id,
       max(id)           AS max_id,
       max(id) - min(id) AS id_span
FROM   big;
For extremely large tables, where even count(*) is expensive, an estimate derived from the planner's statistics can replace the exact count (it stays accurate as long as autovacuum or ANALYZE keeps the statistics current):
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS ct
FROM   pg_class                       -- rows per page × current page count
WHERE  oid = 'big'::regclass;
Core Sampling Algorithm
Simplified efficient random sampling query:
SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g  -- 10% surplus to absorb gaps and dupes
   ) r
JOIN   big USING (id)                 -- index lookup; IDs in gaps drop out
LIMIT  1000;
This query generates 1100 random numbers in the ID space (a 10% surplus over the 1000 rows requested), removes duplicate candidates with DISTINCT, and retrieves the actual rows through an index join; random IDs that fall into gaps simply drop out of the join. Because the table is never scanned or sorted as a whole, performance improves dramatically.
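The constants 1 and 5100000 assume IDs starting at 1 and a span of about 5.1 million. Where that does not hold, the statistics gathered earlier can be folded in; a hedged sketch (the +1 widens the range so max_id itself can be drawn):

```sql
-- Parameterized variant: derive the offset and span from the table
-- itself instead of hard-coding them.
SELECT *
FROM  (
   SELECT DISTINCT min_id + trunc(random() * (id_span + 1))::integer AS id
   FROM   generate_series(1, 1100) g
   CROSS  JOIN (SELECT min(id)           AS min_id,
                       max(id) - min(id) AS id_span
                FROM   big) s
   ) r
JOIN   big USING (id)
LIMIT  1000;
```

The one-row statistics subquery is computed once via the index (min/max lookups), so the extra cost is negligible.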
Recursive CTE Enhanced Solution
When the ID space contains more gaps, a single oversampled batch may come up short. A recursive CTE keeps drawing additional random IDs until enough rows have been found:
WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 3% surplus for gaps
      LIMIT  1030                      -- hint for the query planner
      ) r
   JOIN   big b USING (id)             -- IDs falling into gaps drop out

   UNION                               -- also eliminates duplicates
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- recurse until enough rows accumulate
      LIMIT  999                       -- fewer than the outer LIMIT
      ) r
   JOIN   big b USING (id)
   )
TABLE  random_pick
LIMIT  1000;
Encapsulation as Reusable Function
Specialized function for fixed tables:
CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;    -- oversample to compensate for gaps
   _estimate int := (                  -- planner-based row-count estimate
      SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
      FROM   pg_class
      WHERE  oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus                -- hint for the query planner
         ) r (id)
      JOIN   big USING (id)             -- IDs falling into gaps drop out

      UNION                             -- also eliminates duplicates
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick
         LIMIT  _limit                  -- top up until enough rows are found
         ) r (id)
      JOIN   big USING (id)
      )
   TABLE  random_pick
   LIMIT  _limit;
END
$func$;
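Once created, the function is called like any set-returning function. For example (assuming the big table from the earlier sections):

```sql
-- Draw 1000 random rows with the default 3% surplus:
SELECT * FROM f_random_sample();

-- Draw 500 rows, oversampling 5% to cover a gappier ID space:
SELECT * FROM f_random_sample(500, 1.05);
```

Note that ROWS 1000 in the function header is only a planner hint for the expected result size; it does not limit the actual output.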
Generic Function Implementation
Universal version supporting any table:
CREATE OR REPLACE FUNCTION f_random_sample(_tbl_type anyelement,
                                           _id text = 'id',
                                           _limit int = 1000,
                                           _gaps real = 1.03)
  RETURNS SETOF anyelement
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   _tbl      text := pg_typeof(_tbl_type)::text;  -- table name from row type
   _estimate int  := (SELECT (reltuples / relpages
                            * (pg_relation_size(oid) / 8192))::bigint
                      FROM   pg_class
                      WHERE  oid = _tbl::regclass);
BEGIN
   RETURN QUERY EXECUTE format(
   $$
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   generate_series(1, $2) g
         LIMIT  $2
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)

      UNION
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   random_pick
         LIMIT  $3
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)
      )
   TABLE  random_pick
   LIMIT  $3;
   $$
 , _tbl, _id
   )
   USING _estimate               -- $1: estimated max ID
       , (_limit * _gaps)::int   -- $2: surplus sample size
       , _limit;                 -- $3: requested row count
END
$func$;
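The generic version is called by passing a NULL cast to the target table's row type, which is how the function derives both its result type and the table name. A sketch (my_table and my_id are placeholder names):

```sql
-- Sample 1000 rows from big (row type big, ID column "id"):
SELECT * FROM f_random_sample(NULL::big);

-- Sample 100 rows from another table whose key column is named differently:
SELECT * FROM f_random_sample(NULL::my_table, 'my_id', 100);
```

This polymorphic-parameter trick (anyelement in, SETOF anyelement out) is what lets one function body serve any table with a numeric ID column.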
Alternative Solution Comparison
PostgreSQL 9.5+ provides TABLESAMPLE SYSTEM syntax:
SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
This method is extremely fast, but because SYSTEM sampling selects whole data blocks (pages), rows that are physically stored together are sampled together; the result is not truly random at the row level. With the tsm_system_rows extension installed, an exact row count can be requested instead:
SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
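tsm_system_rows ships with PostgreSQL's contrib modules, so enabling it is a one-time statement per database:

```sql
-- Install the sampling method (requires the contrib package):
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

-- Exactly 1000 rows, chosen by block-level sampling:
SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
```

SYSTEM_ROWS shares the block-level caveat of SYSTEM: it reads a small number of random pages and returns consecutive rows from them, trading row-level randomness for speed.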
Materialized View Strategy
For workloads where repeated calls may return the same random set, consider a materialized view:
CREATE MATERIALIZED VIEW random_sample AS
SELECT * FROM f_random_sample(1000);
Periodically refresh the view to obtain new random samples, querying directly from the materialized view for optimal performance.
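A refresh re-runs the sampling function and replaces the stored rows; scheduling it (e.g. via cron) is outside the scope of this article. A minimal sketch:

```sql
-- Replace the stored sample with a freshly drawn one:
REFRESH MATERIALIZED VIEW random_sample;

-- Reads are then plain table scans of at most 1000 rows until the next refresh:
SELECT * FROM random_sample LIMIT 100;
```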
Performance Analysis and Application Scenarios
The index scan-based method performs best when the ID distribution is uniform with few gaps; its cost depends only on the number of rows requested, not on table size. The recursive CTE version tolerates moderate gaps but needs its recursion kept bounded by the surplus and LIMIT settings. The TABLESAMPLE methods win on raw speed but sacrifice row-level randomness. The materialized view approach suits batch scenarios with relaxed freshness requirements.
Best Practice Recommendations
In practical applications, choose the appropriate solution based on specific needs: prioritize index scan-based methods for scenarios requiring strict randomness; evaluate TABLESAMPLE for extremely high-performance requirements; materialized views are optimal for periodic batch processing. Additionally, maintain ID column indexes and ensure statistical information accuracy, as these factors directly impact sampling algorithm effectiveness.