How to get row counts for all tables in PostgreSQL Database


In my previous post, “Table Statistics vs. SELECT COUNT: What to Choose?”, I covered two approaches to finding row counts, using table statistics and using SELECT COUNT(*), from a non-technical, mostly business-requirements perspective.

The debate centers on whether to rely on the table statistics or to compute the exact row count with SELECT COUNT(*) FROM on the table.

The main reason for questioning the accuracy of table statistics is that they are often gathered from a sample (controlled by the default_statistics_target parameter or ALTER TABLE ... SET STATISTICS) and may simply be out of date. If the table is not read-only and serves an actively used application, the statistics can never be fully up to date.

On the other hand, what’s wrong with computing the exact row count of each table?

The performance overhead alone is a sufficient reason to avoid it.

The concern arises when row counts are queried for all tables at once, not just for a single table.

Now I want to question this practice from a technical perspective and analyze whether it really has such a catastrophic impact in PostgreSQL.

I will use the “AdventureWorks” database for testing on PostgreSQL 17.2, running in Docker on a MacBook M1. To create the test environment, you can use the commands provided below.
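For reference, a minimal Docker command along these lines should bring up a PostgreSQL 17.2 instance; the container name, password, and port mapping below are illustrative placeholders rather than the exact values I used.

# Start a throwaway PostgreSQL 17.2 container (illustrative settings)
docker run --name pg17 \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  -d postgres:17.2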

Test connection.
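A simple way to verify the connection, assuming the illustrative container settings above:

# Should return the server version string
psql -h localhost -U postgres -c "SELECT version();"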

Download the Adventure Works database from the Microsoft GitHub repo.

Download AdventureWorks-for-Postgres from its GitHub repo.

Extract both .zip files and copy all of the CSV files into the same folder (/Users/osmandinc/Downloads/AdventureWorks-oltp-install-script in my case) that also contains the update_csvs.rb file and install.sql.

Modify the CSVs to work with Postgres by running:
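The AdventureWorks-for-Postgres repository ships an update_csvs.rb script for this step; running it from the folder containing the CSV files should look roughly like this (Ruby must be installed):

# Run from the folder that holds the CSVs, update_csvs.rb and install.sql
cd /Users/osmandinc/Downloads/AdventureWorks-oltp-install-script
ruby update_csvs.rb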

Create the database and tables, then import the data.
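Following the repository’s instructions, the database is created and install.sql is run against it; something along these lines (connection options may differ in your setup):

# Create the database and load schema plus data
psql -h localhost -U postgres -c 'CREATE DATABASE "Adventureworks";'
psql -h localhost -U postgres -d Adventureworks < install.sql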

First, I will create a test table with exactly 21,350,068 rows to analyze whether table size impacts query execution statistics. The goal is to determine how many buffers (whether from the cache or from disk) are required to calculate the total row count for a large table. When retrieving the row count from catalog statistics using pg_stat_all_tables, the actual table size is irrelevant, since the query accesses metadata rather than the table itself.
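The exact statement used to build the test table is not shown here, but one plausible way to arrive at exactly 21,350,068 rows is to inflate person.person (19,972 rows in AdventureWorks) by cross-joining it with generate_series(1, 1069), since 19,972 × 1,069 = 21,350,068. Treat the sketch below as an assumption, not the original script; person.big_table is the table name used later in the post.

-- Hypothetical construction of the large test table
CREATE TABLE person.big_table AS
SELECT p.*
FROM person.person AS p
CROSS JOIN generate_series(1, 1069) AS g(i);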

With just 12 blocks (7 blocks accessed from the catalog during the query planning phase + 5 blocks accessed from the shared buffers during execution), we can retrieve the row count for a single large table efficiently. But what happens when we add two more large tables, each containing 20 million records? Unsurprisingly, even in this case it is not a big deal. Since all values are retrieved from the catalog, the table size does not play a role. The actual n_live_tup values are maintained by the stats collector, which monitors table modifications (inserts, updates, and deletes) in near real time. Querying this catalog view does not involve scanning the table, so buffer gets are minimal.
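The buffer figures above come from running the catalog query under EXPLAIN (ANALYZE, BUFFERS); the exact query text is my assumption, but the idea is a sketch like this:

-- Row count from the stats collector for a single table
EXPLAIN (ANALYZE, BUFFERS)
SELECT n_live_tup
FROM pg_stat_all_tables
WHERE schemaname = 'person'
  AND relname = 'big_table';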

Until PostgreSQL 15, the statistics collector received updates through a UDP socket and stored the collected data in temporary files that were regularly written to disk. These files could become quite large and were updated frequently, often within fractions of a second, which made this approach a potential bottleneck. Since PostgreSQL 15, statistics changes are first collected locally by each backend and stored as “pending.” These pending statistics represent data that has been gathered but not yet added to the shared statistics system. After a short delay, typically following a commit or when a timeout occurs, the pending statistics are flushed to shared memory. For more information, refer to “How PostgreSQL improved statistics collector in the latest version” by the Fujitsu team.

Retrieving the row counts for three tables does not result in a significant change in performance or resource usage compared to retrieving the count for a single table (27 block accesses).

Determining the row counts for all tables in the sample schemas required minimal buffer gets (71 block accesses), even when dealing with 71 tables, only three of which are significantly large.
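For reference, the all-tables variant of the statistics-based query looks roughly like this (the schema list matches the AdventureWorks schemas used later in the post):

-- Row counts for every table in the sample schemas, from the stats collector
SELECT schemaname, relname, n_live_tup
FROM pg_stat_all_tables
WHERE schemaname IN ('person', 'sales', 'purchasing', 'production', 'humanresources')
ORDER BY n_live_tup DESC;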

This time we will use the reltuples column of the pg_class catalog.

In the official PostgreSQL 17 documentation (Chapter 51), it is described as follows:

Number of live rows in the table. This is only an estimate used by the planner. It is updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX. If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.

This time, with just 3 block accesses from the shared buffers, we can retrieve the row count information for a single large table efficiently.
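A minimal sketch of the reltuples-based lookup for the single large table; extending the WHERE clause to all schemas (and filtering on relkind = 'r') gives the all-tables version used further below:

-- Planner's row estimate for one table, straight from pg_class
SELECT n.nspname            AS schemaname,
       c.relname            AS table_name,
       c.reltuples::bigint  AS row_estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'person'
  AND c.relname = 'big_table';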

In that case, the ANALYZE command is no longer strictly necessary, thanks to the “Autovacuum for Insert-Only Tables” feature available since PostgreSQL 13, provided autovacuum is enabled.

With the parameters autovacuum_vacuum_insert_scale_factor (default: 0.2) and autovacuum_vacuum_insert_threshold (default: 1000), introduced in PostgreSQL 13, the behavior somewhat resembles Oracle Database 12c’s online statistics gathering (_optimizer_gather_stats_on_load). If the number of newly inserted rows exceeds autovacuum_vacuum_insert_threshold plus autovacuum_vacuum_insert_scale_factor times the table’s row count (1000 + 20% of the rows with the defaults), autovacuum will trigger a vacuum of the table, which also refreshes its statistics. To some extent this is similar to Oracle’s approach, where statistics are updated immediately after bulk data loads or direct-path inserts. In PostgreSQL, however, the update occurs during the first autovacuum cycle (default autovacuum_naptime = 1 minute) after the insert activity exceeds the defined thresholds.
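You can check how these knobs are configured on your instance with a quick look at pg_settings:

-- Current autovacuum settings relevant to insert-only tables
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum',
               'autovacuum_naptime',
               'autovacuum_vacuum_insert_threshold',
               'autovacuum_vacuum_insert_scale_factor');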

For three large tables: 19 block accesses.

For all tables (71 tables): 45 block accesses.

It is also worth mentioning that increasing the sample size improves the accuracy of the reltuples column, but this comes at the cost of longer execution times for the ANALYZE command and increased storage usage in the pg_statistic system catalog. With the default settings (default_statistics_target = 100), ANALYZE computes the row count from a random sample of 300 × default_statistics_target rows (300 is a constant chosen based on statistical theory; see the source code). For the default value of 100, this results in 30,000 rows being sampled. Up to 30,000 rows, it will therefore provide the exact row count, provided the table is not changing.
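As an illustration (the target value of 500 below is arbitrary), the sample size can be raised for a single ANALYZE run at the session level:

-- Temporarily enlarge the ANALYZE sample: 300 * 500 = 150,000 rows
SET default_statistics_target = 500;
ANALYZE person.big_table;
RESET default_statistics_target;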

Next, I will try the other method: dynamically executing SELECT COUNT(*) for each table and computing the row counts on the fly. While this method provides exact counts, it requires scanning the data directly, which may result in increased resource consumption, especially for larger tables. This comparison will help highlight the trade-offs between relying on table statistics and direct query execution for row count retrieval.

First, it’s a common misconception that COUNT(1) might be faster than COUNT(*) because the latter seems to process an entire row. This is not the case: in this context the * is not related to selecting all columns (as in SELECT *) and has no special meaning.

I will begin the testing with a single large table, person.big_table, to serve as a baseline for understanding how buffer gets are affected by table size. To ensure a full table scan (sequential scan) on each large table, I did not create any indexes. Once this baseline is established, I will expand the test to include multiple tables to evaluate the overall performance overhead.

I will create a temporary function and call it for each table.

For a single large table, executing a dynamic query to calculate the row count resulted in 149,830 total block accesses (2 in the planning phase, 1,936 shared buffer cache accesses (logical reads) in the execution phase, and 147,892 disk reads (physical reads)). In contrast, retrieving the count from the table statistics required just 21 buffer gets (16 in the planning phase, 5 in the execution phase). That is a difference of nearly 7,134 times for retrieving almost the same information.

Now, let’s extend this test to three large tables to show the impact more clearly.

The calculation 449,482 ÷ 27 = 16,647 reveals that when the tables are queried directly for their row counts, this method performs roughly 16,000 times worse in this case. The results are almost the same as with Oracle for three large tables.

CREATE FUNCTION pg_temp.get_rows(p_schemaname text, p_tablename text)
RETURNS bigint AS $$
DECLARE
    result_rowcount bigint;
BEGIN
    -- %I quotes the identifiers safely; no extra double quotes are needed
    EXECUTE format('SELECT COUNT(*) FROM %I.%I', p_schemaname, p_tablename)
       INTO result_rowcount;
    RETURN result_rowcount;
END;
$$ LANGUAGE plpgsql;

-- Call the temporary function for every table in the sample schemas
SELECT
    schemaname,
    tablename AS table_name,
    pg_temp.get_rows(schemaname, tablename) AS cnt
FROM
    pg_tables
WHERE
    schemaname IN ('person', 'sales', 'purchasing', 'production', 'humanresources');

Additionally, the pg_stat_statements.shared_blks_read and pg_stat_statements.shared_blks_hit column values confirm the results we have observed so far.

shared_blks_read : 443,315 (data coming from disk or from the operating system cache)

shared_blks_hit : 16,023 + 2 = 16,025 (data coming from the shared buffers)

Instead of creating a function, we can also use the query_to_xml function, which executes the query passed in as its query parameter and maps the result set to XML, similar to XMLQUERY in Oracle.
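A sketch of this approach, assuming the same AdventureWorks schemas: query_to_xml runs the generated COUNT(*) per table, and xpath pulls the value back out of the XML.

-- Exact counts for all tables without a helper function
SELECT table_schema,
       table_name,
       (xpath('/row/cnt/text()',
              query_to_xml(format('SELECT count(*) AS cnt FROM %I.%I',
                                  table_schema, table_name),
                           false, true, '')))[1]::text::bigint AS cnt
FROM information_schema.tables
WHERE table_schema IN ('person', 'sales', 'purchasing', 'production', 'humanresources')
  AND table_type = 'BASE TABLE';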

Another method for estimating row counts is block sampling, but its accuracy may decrease if the blocks of the table contain significantly different numbers of rows. Block sampling can be performed using the TABLESAMPLE clause, introduced in PostgreSQL 9.5.

PostgreSQL provides two built-in sampling methods: SYSTEM and BERNOULLI.

SYSTEM sampling selects entire table blocks and uses the rows within them, making it faster but less evenly distributed.
BERNOULLI sampling randomly selects individual rows from the entire table, offering a more uniform distribution but at a higher performance cost.

The following query demonstrates SYSTEM sampling with a 10% sample.
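A sketch of such a query against the test table: since SYSTEM (10) reads roughly 10% of the blocks, the sampled count is scaled back up by a factor of 10.

-- Estimated row count from a 10% block sample
SELECT count(*) * 10 AS estimated_rows
FROM person.big_table TABLESAMPLE SYSTEM (10);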

This method is somewhat limited, but it can be useful in scenarios where the statistics in pg_class are outdated or insufficient, such as when autovacuum is not running or when dealing with temporary tables.

By the way, the PostgreSQL planner uses a calculation to estimate the number of rows (as seen in the source code) that is conceptually similar to the one provided below. It simply multiplies reltuples/relpages by the current number of pages.

This method can be particularly useful for large tables. As the amount of data grows, the average number of rows per physical page tends to stabilize more reliably than the total row count. Multiplying this average rows-per-page value by the current number of pages occupied by the table therefore yields a reasonably fresh estimate.
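A minimal sketch of this planner-style estimate for the sample schemas: pg_relation_size divided by the block size gives the current page count, and NULLIF guards against never-analyzed tables where relpages is 0.

-- reltuples/relpages * current page count, per table
SELECT n.nspname AS schemaname,
       c.relname AS table_name,
       (c.reltuples / NULLIF(c.relpages, 0)
        * (pg_relation_size(c.oid) / current_setting('block_size')::int))::bigint AS estimated_rows
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname IN ('person', 'sales', 'purchasing', 'production', 'humanresources');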

There is also a generic implementation of this logic that works for any query you want to estimate: you can ask the planner to return the estimated row count without executing the query. This is explained in detail on the Count Estimate page of the PostgreSQL Wiki.
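The wiki’s idea, roughly, is to wrap EXPLAIN in a function and parse the estimated row count out of the plan text. The sketch below follows that approach; treat it as an approximation of the wiki version rather than a verbatim copy.

-- Planner-based row estimate for an arbitrary query
CREATE OR REPLACE FUNCTION pg_temp.count_estimate(p_query text)
RETURNS bigint AS $$
DECLARE
    rec  record;
    rows bigint;
BEGIN
    -- Run EXPLAIN (not the query itself) and pick out the planner's row estimate
    FOR rec IN EXECUTE 'EXPLAIN ' || p_query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql;

-- Example: estimated row count without actually counting
SELECT pg_temp.count_estimate('SELECT 1 FROM person.big_table');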

Depesz’s explain plan visualization provides a much clearer and more intuitive representation of query execution plans, making it easier to understand database performance.

Option 1 : Relying on Stats Collector: 66 Total Block Accesses.

Option 3 : Using Select Count(*) from – Direct Query : 459,340 Total Block Accesses

Option 4 : Using Block Sampling with TABLESAMPLE : 45,886 Total Block Accesses

Option 5 : Acting Like the Planner : 45 Total Block Accesses

This analysis highlights the trade-offs between using table statistics and directly querying row counts with SELECT COUNT(*). While table statistics provide a quick and efficient way to estimate row counts with minimal resource consumption, their accuracy depends on how recently and how thoroughly the statistics were gathered. Direct row count queries, on the other hand, offer precise results but come at a significant performance cost, because they have to read the entire table (or, at best, an entire index), especially for large tables or when multiple tables are queried.

Using catalog views such as pg_class and pg_stat_all_tables, and relying on the information maintained by the stats collector and the autovacuum daemon, performed much better than directly querying each table.

In our tests on PostgreSQL 17 with the sample database (AdventureWorks):

Direct row count queries with SELECT COUNT(*) resulted in 149,832 block accesses for a single table with 21 million records and up to 449,482 block gets for three large tables: an overhead of approximately 16,000 times compared to using the n_live_tup value from pg_stat_all_tables, which relies on the stats collector, and about 23,000 times when compared to using the reltuples value from pg_class, which depends on the autovacuum daemon.

This technical perspective demonstrates that, while direct row count queries can be useful in certain scenarios, their heavy performance impact makes them unsuitable for almost all operations, particularly in large-scale environments with tables containing millions of rows. Relying on up-to-date table statistics is the more practical choice for most applications, especially when performance is a key priority.

Option 1, Option 2, and Option 5 all yielded similar results. In my humble opinion, if autovacuum is enabled (as it should be; sticking with the default settings is generally advisable), mimicking the planner’s logic by multiplying reltuples/relpages by the current number of pages (retrieved via pg_relation_size) is a practical and effective way to estimate row counts.

As a best practice, consider the following: enable autovacuum and leverage the autovacuum_vacuum_insert_threshold and autovacuum_vacuum_insert_scale_factor parameters to keep statistics accurate and reduce the need for manual vacuum operations.

Hope it helps.

