Table Statistics vs. SELECT COUNT(*): What to Choose?

Finding the number of rows for each table in a schema or database is worth discussing because every DBA encounters this question. It is a very hot topic, one of the Top 20 Most Viewed Questions on AskTOM and on Stack Overflow. Although it is often discouraged due to its performance overhead, and Connor McDonald objects to it in his video How to get a row count for each table in the database, stating, “I can’t think of a reason why you would ever want to do it, so please just don’t do it”, it is still commonly used by developers as a way to measure the size of application data (unfortunately, much like lines of code) or as a comparison method during data migrations.

The debate centers on whether to rely on table statistics or to compute the exact row count with a “SELECT COUNT(*) FROM” query against the table.

The main reason for questioning the accuracy of table statistics is that they are often gathered using a sample (controlled by ESTIMATE_PERCENT in Oracle, or by default_statistics_target and per-column STATISTICS settings in PostgreSQL), or they may simply be stale. If the table is not in a read-only state and serves an actively used application, the statistics can never be truly up to date.

On the other hand, what’s wrong with computing the exact row count of the table?

The performance overhead alone is a sufficient reason to avoid it. The concern arises when row counts are queried for all tables at once, not for a single table.
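To make the distinction concrete, here is a minimal sketch of the statistics-based approach in both databases. These dictionary views are standard, but the values are only as fresh as the last statistics gathering in Oracle or the last ANALYZE/autovacuum in PostgreSQL:

    -- Oracle: NUM_ROWS as of the last statistics gathering, no table scan.
    SELECT table_name, num_rows, last_analyzed
    FROM   user_tables
    ORDER  BY num_rows DESC NULLS LAST;

    -- PostgreSQL: the planner's estimate maintained by ANALYZE/autovacuum.
    -- (reltuples can be -1 before the first ANALYZE on newer versions.)
    SELECT relname, reltuples::bigint AS estimated_rows
    FROM   pg_class
    WHERE  relkind = 'r';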

First, I will question this practice from a non-technical, business-requirements perspective.

1. Data Validation and Integrity Checking for a Migration:

In Extract, Transform, and Load (ETL) processes, data migration projects, or logical backups, row counts are used to check whether the source and destination systems contain the same number of rows after a transfer. Partial or failed data loads can be identified quickly by comparing row counts.

Example Use Case:
If 5 million rows are transferred but the destination contains only 4.9 million, the discrepancy strongly indicates a transfer issue. Row counts can surface such differences immediately, but what if the counts match while the underlying column values differ or are updated during the process? In such cases, is an exact-match test still required, and to what extent should the row count alone be trusted? As a first check for detecting such issues, the row count may be used.

Then, just ensure that statistics are gathered on both sides and rely on the table statistics. Why bother using SELECT COUNT(*) FROM?

If your answer is accuracy, then you need the DBMS_COMPARISON package or another method that relies on hashes.
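As a rough sketch of the hash idea (the table and column names here are hypothetical), a per-row hash aggregated over the whole table changes whenever any column value differs, which a bare row count cannot detect:

    -- Hypothetical Oracle example: detect value differences, not just
    -- row-count differences. SALES and its columns are assumed names.
    SELECT COUNT(*)                                            AS row_cnt,
           SUM(ORA_HASH(id || '|' || status || '|' || amount)) AS content_hash
    FROM   sales;
    -- Run the same query against the destination copy and compare both
    -- values. For a rigorous comparison over a database link,
    -- DBMS_COMPARISON can scan both tables and report row differences.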

2. Capacity Planning:

For estimating future growth, knowing the number of rows in each table may help determine whether partitioning, compression, or archiving strategies are necessary. It may also inform decisions about dedicated storage or tablespace usage.

Example Use Case:
A DBA notices a table growing drastically in row count. The trend prompts them to implement partitioning or archiving before hitting storage or performance limits.

Then, just ensure that statistics are gathered and rely on the table statistics. Why bother using SELECT COUNT(*) FROM?
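One low-cost way to track this growth is to snapshot the statistics-based counts on a schedule. A minimal Oracle sketch, with an assumed history table name:

    -- Hypothetical growth-tracking sketch: snapshot NUM_ROWS from the
    -- dictionary into a history table, e.g. via a daily scheduler job.
    CREATE TABLE row_count_history (
      snap_time  TIMESTAMP DEFAULT SYSTIMESTAMP,
      table_name VARCHAR2(128),
      num_rows   NUMBER
    );

    INSERT INTO row_count_history (table_name, num_rows)
    SELECT table_name, num_rows
    FROM   user_tables
    WHERE  num_rows IS NOT NULL;
    COMMIT;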

3. Anomaly Detection:

Being aware of the row count of an important table and monitoring changes over time can help detect anomalies or potential issues. Sudden spikes or drops in row counts might indicate unexpected transactions. For read-only tables, a changing row count could signal unauthorized modifications.

Example Use Case:
A DBA tracks the row count of a read-only audit table and detects that rows are being deleted, revealing a potential security breach.

Then, just ensure that statistics are gathered for the table and rely on the table statistics. Why bother using SELECT COUNT(*) FROM for a read-only table that doesn’t change?
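Building on the row_count_history snapshot table sketched in the previous section, a simple anomaly check could compare the two most recent snapshots. Note that a deletion only becomes visible here after statistics are regathered; Oracle's USER_TAB_MODIFICATIONS view tracks inserts, updates, and deletes between gatherings and can catch changes earlier:

    -- Hypothetical anomaly check: flag tables whose statistics-based
    -- count dropped between the two most recent snapshots (a read-only
    -- table should never shrink).
    SELECT cur.table_name,
           prev.num_rows AS previous_cnt,
           cur.num_rows  AS current_cnt
    FROM   row_count_history cur
    JOIN   row_count_history prev
      ON   prev.table_name = cur.table_name
    WHERE  cur.snap_time  = (SELECT MAX(snap_time) FROM row_count_history)
    AND    prev.snap_time = (SELECT MAX(snap_time) FROM row_count_history
                             WHERE  snap_time < cur.snap_time)
    AND    cur.num_rows < prev.num_rows;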

4. Identifying Unused or Rarely Used Tables:

Row counts may be useful for detecting tables that are empty or infrequently updated: tables with zero rows may be candidates for deletion or archiving when combined with access logs.

Example Use Case:
In a legacy schema with 1000+ tables, a DBA queries row counts to identify and clean up unused tables.

Then, just ensure that statistics are gathered for these tables and rely on the table statistics. Why bother using SELECT COUNT(*) FROM?
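A quick statistics-based triage query for this, with LAST_ANALYZED included so stale entries are visible before anything is dropped:

    -- Statistics-based cleanup candidates (Oracle): empty tables.
    -- NUM_ROWS is only as current as LAST_ANALYZED, so verify first.
    SELECT table_name, num_rows, last_analyzed
    FROM   user_tables
    WHERE  num_rows = 0
    ORDER  BY last_analyzed;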

5. Performance Tuning – Index Decisions:

Row counts can guide performance tuning: small tables might not need indexes or partitioning, while large tables might require more advanced indexing strategies, such as choosing a partitioning scheme that enables partition pruning. Row counts are also a factor when analyzing optimizer behavior.

Example Use Case:
For a table with 100 rows, a full table scan might be more efficient than an indexed lookup. Row counts may help guide query plan design and optimization.

Then, just ensure that statistics are gathered for that table (the optimizer needs them to make better decisions anyway) and rely on the table statistics. Why bother using SELECT COUNT(*) FROM?
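As an illustrative sketch (the thresholds below are arbitrary assumptions, not recommendations), statistics-based counts can bucket tables for tuning review:

    -- Hypothetical tuning triage (Oracle): bucket tables by NUM_ROWS.
    SELECT table_name,
           num_rows,
           CASE
             WHEN num_rows < 1000     THEN 'small: full scan likely fine'
             WHEN num_rows < 10000000 THEN 'medium: review indexing'
             ELSE                          'large: consider partitioning'
           END AS size_bucket
    FROM   user_tables
    WHERE  num_rows IS NOT NULL
    ORDER  BY num_rows DESC;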

6. Analyzing Usage Trends for Auditing Purposes:

Organizations may need row counts for compliance with data regulations, and auditors may request usage trends.

Example Use Case:
During a regulatory audit, a company presents row counts of its transactional data tables to provide a quick snapshot of data distribution over time.

For populating this periodically recorded row-count data, the table statistics can be used. Why bother using SELECT COUNT(*) FROM?
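With the row_count_history snapshots sketched in the capacity-planning section, a trend report for auditors reduces to a simple aggregation:

    -- Hypothetical audit trend report: monthly row-count trend per
    -- table, built from the row_count_history snapshots.
    SELECT table_name,
           TRUNC(snap_time, 'MM') AS snap_month,
           MAX(num_rows)          AS max_rows_in_month
    FROM   row_count_history
    GROUP  BY table_name, TRUNC(snap_time, 'MM')
    ORDER  BY table_name, snap_month;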

There could be many more examples like these.

In all the examples that come to mind, I can’t find a compelling reason not to rely on table statistics. Up until now, I have approached this issue from a non-technical, mostly business-requirements perspective.

This article examines the common use of row counts for all tables in database management, focusing on the trade-off between using table statistics and the performance impact of “SELECT COUNT(*)”. While row counts can be useful for tasks like data validation, capacity planning, and anomaly detection, the performance overhead of exact counts often makes them impractical for large tables.

For most use cases, table statistics are a more efficient and sufficient approach. However, when precise accuracy is needed, alternative methods like hashing or specialized tools should be considered.

Now, I will analyze both Oracle and PostgreSQL to determine whether this truly has such a catastrophic impact from a technical standpoint, and show how these approaches perform in both environments. I will also show another way to get an almost current row count at a low cost; a quick sketch follows the links below, and the details are in these posts:

For Oracle:

Finding Row Counts for All Tables in Oracle Database

For PostgreSQL:

Finding Row Counts for All Tables in PostgreSQL Database
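As a quick preview of that low-cost approach (a sketch using standard dictionary views; the linked posts go into detail and caveats):

    -- Oracle: NUM_ROWS adjusted by DML tracked since the last gather.
    -- (DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO flushes pending counters.)
    SELECT t.table_name,
           t.num_rows + NVL(m.inserts, 0) - NVL(m.deletes, 0) AS approx_rows
    FROM   user_tables t
    LEFT   JOIN user_tab_modifications m
      ON   m.table_name = t.table_name
     AND   m.partition_name IS NULL;

    -- PostgreSQL: live-tuple estimate maintained by the statistics collector.
    SELECT relname, n_live_tup
    FROM   pg_stat_user_tables;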

Hope it helps. See you on the next post.

