
Fastest way to count exact number of rows in a very large table?

I have come across articles that state that SELECT COUNT(*) FROM TABLE_NAME will be slow when the table has lots of rows and lots of columns.

I have a table that might contain even billions of rows [it has approximately 15 columns]. Is there a better way to get the EXACT count of the number of rows of a table?

Please consider the following before your answer:

I am looking for a database vendor independent solution. It is OK if it covers MySQL, Oracle, MS SQL Server. But if there is really no database vendor independent solution then I will settle for different solutions for different database vendors.

I cannot use any other external tool to do this. I am mainly looking for a SQL based solution.

I cannot normalize my database design any further. It is already in 3NF and moreover a lot of code has already been written around it.

Wouldn't we all hope that this particular construct has been optimized by our database vendor?
@Swaranga, can you elucidate a bit more on what this database maintenance purpose is that must know the exact number of rows in the table? I can't imagine. And as Kevin says, if there was a quicker way than COUNT(*) then the DBMS vendor would (should) surely re-implement COUNT(*) to use it...
Surely if the table is being written to often then your exact count will only be exact for a particular point in time, and may not even be accurate if other processes are writing to the table, unless you put a table lock on the query.
You could use insert and delete triggers to keep a rolling count?
If insert trigger is too expensive, but delete trigger is affordable, see my answer stackoverflow.com/a/39295280/199364 for an approach, for tables with auto-increment id: keep track of count and last id counted each day, have delete trigger decrement count when appropriate.

Felipe Augusto

Simple answer:

Database vendor independent solution = use the standard = COUNT(*)

There are approximate SQL Server solutions but don't use COUNT(*) = out of scope

Notes:

COUNT(1) = COUNT(*) = COUNT(PrimaryKey) just in case

Edit:

SQL Server example (1.4 billion rows, 12 columns)

SELECT COUNT(*) FROM MyBigtable WITH (NOLOCK)
-- NOLOCK here is for me only to let me test for this answer: no more, no less

1 run, 5:46 minutes, count = 1,401,659,700

--Note, sp_spaceused uses this DMV
SELECT
   Total_Rows= SUM(st.row_count)
FROM
   sys.dm_db_partition_stats st
WHERE
    object_name(object_id) = 'MyBigtable' AND (index_id < 2)

2 runs, both under 1 second, count = 1,401,659,670

The second one returns fewer rows, so it is wrong here. It would be the same or higher depending on writes (deletes are done out of hours here)


Nope, COUNT(*) = COUNT(key) is just wrong. If there is no NOT NULL constraint, they can differ (in results as well as in the execution plan).
@zerkmsby: For COUNT(key) I meant COUNT(primarykey) which should be non nullable. I'll clarify
with (NOLOCK) is not something that makes it safe to run on production, and it can lead to an inaccurate count. When you use that hint, sure, it prevents locks, but the side effects on a production box are that you can count rows twice in some situations or skip rows in others. NOLOCK is better used on a table that is not being written to, because it allows "dirty reads". Don't advise people to use that hint unless they fully understand the consequences.
@mishrsud The only accurate query is SELECT COUNT(*), but it's slow. You can either have exact and slow, or rough and quick. What you do will depend on what's more important for the purpose you need the count for. NOLOCK might include or indeed exclude rows that are mid-transaction or moving pages for whatever reason.
@gbn very nice solution; can you tell me what the use of index_id < 2 is?
salbahra

The fastest way by far on MySQL is:

SHOW TABLE STATUS;

You will instantly get all your tables with the row count (which is the total) along with plenty of extra information if you want.


Smart way. With this you can get the row count of multiple tables in one query.
Did you run this on a DB with tables holding ~a billion entries, like @gbn, and note the time?
Which value is the total row count for all tables in the database? And these are approximate; what if you want exact row count values?
This doesn't work at all for an exact count; on InnoDB, for example, the storage engine reads a few rows and extrapolates to guess the number of rows.
It's good for getting a total count without a query condition, but what if I want the result with a query condition?
JakeJ

I got this script from another StackOverflow question/answer:

SELECT SUM(p.rows) FROM sys.partitions AS p
  INNER JOIN sys.tables AS t
  ON p.[object_id] = t.[object_id]
  INNER JOIN sys.schemas AS s
  ON s.[schema_id] = t.[schema_id]
  WHERE t.name = N'YourTableNameHere'
  AND s.name = N'dbo'
  AND p.index_id IN (0,1);

My table has 500 million records and the above returns in less than 1ms. Meanwhile,

SELECT COUNT(id) FROM MyTable

takes a full 39 minutes, 52 seconds!

They yield the exact same number of rows (in my case, exactly 519326012).

I do not know if that would always be the case.


Can you add a parameter to get the row count with this query? For example: SELECT COUNT(1) FROM TABLENAME WHERE ColumnField = '1', combined with your query?
That is the count - the number of rows (records) is the "count" in this case. "500 million records" was an approximate number, and "519326012" was the exact number of rows, or count. Rows = records = count.
When I do a count on table (select count(*) from table) vs this solution, the count of the latter is 11 less.
Hi, is there some way to do the same with two tables related by an inner join?
jams

You can try sp_spaceused (Transact-SQL):

Displays the number of rows, disk space reserved, and disk space used by a table, indexed view, or Service Broker queue in the current database, or displays the disk space reserved and used by the whole database.


Won't sp_spaceused give me an approximate count?
FYI: This uses sys.dm_db_partition_stats internally
Denis de Bernardy

I have come across articles that state that SELECT COUNT(*) FROM TABLE_NAME will be slow when the table has lots of rows and lots of columns.

That depends on the database. Some speed up counts, for instance by keeping track of whether rows are live or dead in the index, allowing for an index only scan to extract the number of rows. Others do not, and consequently require visiting the whole table and counting live rows one by one. Either will be slow for a huge table.

Note that you can generally extract a good estimate by using query optimization tools, table statistics, etc. In the case of PostgreSQL, for instance, you could parse the output of EXPLAIN SELECT COUNT(*) FROM yourtable and get a reasonably good estimate of the number of rows; a minimal sketch of that approach is shown below. Which brings me to your second question.
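For example, a small PL/pgSQL helper can pull the planner's row estimate out of EXPLAIN (FORMAT JSON) output. This is only a sketch, not part of the original answer: the function name row_estimate is made up, and the result is the planner's statistics-based estimate, not an exact count.

-- PostgreSQL sketch: extract the planner's estimated row count from EXPLAIN output.
-- Pass the bare query (e.g. 'SELECT * FROM yourtable'), not the COUNT(*) query,
-- so the top plan node's estimate is the table's row estimate.
CREATE OR REPLACE FUNCTION row_estimate(query text) RETURNS bigint AS $$
DECLARE
    plan json;
BEGIN
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;
$$ LANGUAGE plpgsql;

-- Usage:
SELECT row_estimate('SELECT * FROM yourtable');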

I have a table that might contain even billions of rows [it has approximately 15 columns]. Is there a better way to get the EXACT count of the number of rows of a table?

Seriously? :-) You really mean the exact count from a table with billions of rows? Are you really sure? :-)

If you really do, you could keep track of the total using triggers, but mind concurrency and deadlocks if you do.


It's a lucky thing that Google managers are more reasonable than your boss... Picture how slow it would be if it returned the exact number of search results for each of your queries instead of sticking to an estimate number.
At least you empathize with me. How about an only Oracle solution? That will reduce my issue to an extent. Presently the customer is using Oracle; so if I come up with a workaround only for Oracle, that will do [for the time being]. :)
Well, you could always keep a counter updated using a trigger. Mind the concurrency if you do, though. :-)
"Yes Denis, the exact count is required. :(" - well I can only speculate. Does the db maintenance process find out that there are 42,123,876 rows in table A and then create 42,123,876 empty rows in table B, and then loop through table A and update the rows in table B...? Or is it crazier than that? ;-)
Transaction 2 cannot begin before transaction 1 has committed. Without the "counts table" update, many update transactions could run in parallel. With the "counts table", each transaction has to "obtain a ticket" for updating its count. So transactions start queueing up at the ticket machine (the scheduler deciding who will be the next to get a lock on the counts table).
Jesse Webb

Is there a better way to get the EXACT count of the number of rows of a table?

To answer your question simply, No.

If you need a DBMS independent way of doing this, the fastest way will always be:

SELECT COUNT(*) FROM TableName

Some DBMS vendors may have quicker ways which will work for their systems only. Some of these options are already posted in other answers.

COUNT(*) should be optimized by the DBMS (at least any PROD worthy DB) anyway, so don't try to bypass their optimizations.

On a side note: I am sure many of your other queries also take a long time to finish because of your table size. Any performance concerns should probably be addressed by thinking about your schema design with speed in mind. I realize you said that it is not an option to change but it might turn out that 10+ minute queries aren't an option either. 3rd NF is not always the best approach when you need speed, and sometimes data can be partitioned in several tables if the records don't have to be stored together. Something to think about...


Thierry

I found this good article, SQL Server – HOW-TO: quickly retrieve accurate row count for table, from martijnh1, which gives a good recap of each scenario.

I need this expanded to where I can provide a count based on a specific condition; when I figure that part out, I'll update this answer further.

In the meantime, here are the details from the article:

Method 1:

Query:

SELECT COUNT(*) FROM Transactions 

Comments:

Performs a full table scan. Slow on large tables.

Method 2:

Query:

SELECT CONVERT(bigint, rows) 
FROM sysindexes 
WHERE id = OBJECT_ID('Transactions') 
AND indid < 2 

Comments:

Fast way to retrieve row count. Depends on statistics and is inaccurate.

To correct it, run DBCC UPDATEUSAGE(Database) WITH COUNT_ROWS, which can take significant time for large tables.

Method 3:

Query:

SELECT CAST(p.rows AS float) 
FROM sys.tables AS tbl 
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and
idx.index_id < 2 
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int) 
AND p.index_id=idx.index_id 
WHERE ((tbl.name=N'Transactions' 
AND SCHEMA_NAME(tbl.schema_id)='dbo')) 

Comments:

This is the way SQL Server Management Studio counts rows (look at table properties, Storage, Row Count). Very fast, but still an approximate number of rows.

Method 4:

Query:

SELECT SUM (row_count) 
FROM sys.dm_db_partition_stats 
WHERE object_id=OBJECT_ID('Transactions')    
AND (index_id=0 or index_id=1); 

Comments:

A quick operation (although not as fast as method 2) and, equally important, reliable.


Thanks! Really useful tip. I do not have permission to view system tables, so method 4 is not for me. However, method 3 is good enough.
Just a note for method 3: it should be SUM(CAST(p.rows AS FLOAT)); otherwise, for partitioned tables we get n rows in the output.
Alireza Maddah

If your SQL Server edition is 2005/2008, you can use DMVs to calculate the row count in a table:

-- Shows all user tables and row counts for the current database 
-- Remove is_ms_shipped = 0 check to include system objects 
-- i.index_id < 2 restricts to the heap (0) or clustered index (1) 
SELECT o.name, 
 ddps.row_count 
FROM sys.indexes AS i 
 INNER JOIN sys.objects AS o ON i.OBJECT_ID = o.OBJECT_ID 
 INNER JOIN sys.dm_db_partition_stats AS ddps ON i.OBJECT_ID = ddps.OBJECT_ID 
 AND i.index_id = ddps.index_id 
WHERE i.index_id < 2 
 AND o.is_ms_shipped = 0 
ORDER BY o.NAME 

For SQL Server 2000 database engine, sysindexes will work, but it is strongly advised to avoid using it in future editions of SQL Server as it may be removed in the near future.

Sample code taken from: How To Get Table Row Counts Quickly And Painlessly


This is approximate not exact: see my answer please
Do you know an example where this is not accurate? AFAIK, it does not depend on updated statistics.
Mainsh S

I use (Oracle):

select /*+ parallel(a) */  count(1) from table_name a;

john rains

I'm nowhere near as expert as others who have answered, but I was having an issue with a procedure I was using to select a random row from a table (not overly relevant here): I needed to know the number of rows in my reference table to calculate the random index. Using the traditional COUNT(*) or COUNT(1) works, but I was occasionally getting up to 2 seconds for my query to run. So instead (for my table named 'tbl_HighOrder') I am using:

Declare @max int

Select @max = Row_Count
From sys.dm_db_partition_stats
Where Object_Name(Object_Id) = 'tbl_HighOrder'

It works great and query times in Management Studio are zero.


FWIW, you should mention WHICH database vendor you are using; I think the statement would be slightly different depending on vendor.
Kaliyug Antagonist

Well, late by 5 years and unsure if it helps:

I was trying to count the number of rows in a SQL Server table using SQL Server Management Studio and ran into an overflow error (plain COUNT returns an int, which overflows past roughly 2.1 billion rows), so I used the following:

select count_big(1) FROM [dbname].[dbo].[FactSampleValue];

The result:

24296650578 rows


Branko Dimitrijevic

Not exactly a DBMS-agnostic solution, but at least your client code won't see the difference...

Create another table T with just one row and one integer field N¹, and create an INSERT TRIGGER that just executes:

UPDATE T SET N = N + 1

Also create a DELETE TRIGGER that executes:

UPDATE T SET N = N - 1

A DBMS worth its salt will guarantee the atomicity of the operations above², and N will contain the accurate count of rows at all times, which is then super-quick to get by simply:

SELECT N FROM T

While triggers are DBMS-specific, selecting from T isn't and your client code won't need to change for each supported DBMS.

However, this can have some scalability issues if the table is INSERT or DELETE-intensive, especially if you don't COMMIT immediately after INSERT/DELETE.

¹ These names are just placeholders - use something more meaningful in production.

² I.e. N cannot be changed by a concurrent transaction between reading and writing to N, as long as both reading and writing are done in a single SQL statement.
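For illustration only, here is a minimal sketch of what those triggers might look like in SQL Server syntax (table and trigger names are placeholders, per note 1). Note that SQL Server triggers fire once per statement, so the sketch adds or subtracts the number of affected rows rather than 1:

-- One-row counter table, initialized with the current row count
CREATE TABLE T (N bigint NOT NULL);
INSERT INTO T (N) SELECT COUNT_BIG(*) FROM MyBigTable;
GO
CREATE TRIGGER trg_MyBigTable_Insert ON MyBigTable AFTER INSERT AS
    UPDATE T SET N = N + (SELECT COUNT_BIG(*) FROM inserted);
GO
CREATE TRIGGER trg_MyBigTable_Delete ON MyBigTable AFTER DELETE AS
    UPDATE T SET N = N - (SELECT COUNT_BIG(*) FROM deleted);
GO
-- Reading the current count is then just:
SELECT N FROM T;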


This can be a great hack, especially in the long run.
Mike Woodhouse

I don't think there is a general always-fastest solution: some RDBMSs/versions have a specific optimization for SELECT COUNT(*) that uses faster options, while others simply table-scan. For the second group you'd need to go to the documentation/support sites, which will probably require some more specific query to be written, usually one that hits an index in some way.

EDIT:

Here's a thought that might work, depending on your schema and distribution of data: do you have an indexed column that holds an increasing value, say a numeric increasing ID, or even a timestamp or date? Then, assuming deletes don't happen, it should be possible to store the count up to some recent value (yesterday's date, the highest ID value at some recent sample point) and add the count beyond that, which should resolve very quickly in the index. Very dependent on values and indices, of course, but applicable to pretty much any version of any DBMS.


I would very much hope that any decent DBMS would use an index for SELECT COUNT(*). Even MySQL apparently does it...
assuming deletes don't happen - seriously?? ;p
sunitkatkar

I am late to this question, but here is what you can do with MySQL (as I use MySQL). I am sharing my observations here:

1) SELECT COUNT(*) AS TOTAL_ROWS FROM <TABLE_NAME>

Result: row count 508534. Console output: Affected rows: 0, Found rows: 1, Warnings: 0, Duration for 1 query: 0.125 sec. It takes a while for a table with a large number of rows, but the row count is very exact.

2) SHOW TABLE STATUS or SHOW TABLE STATUS WHERE NAME="<TABLE_NAME>"

Result: row count 511235. Console output: Affected rows: 0, Found rows: 1, Warnings: 0, Duration for 1 query: 0.250 sec. Summary: the row count is not exact.

3) SELECT * FROM information_schema.tables WHERE table_schema = DATABASE();

Result: row count 507806. Console output: Affected rows: 0, Found rows: 48, Warnings: 0, Duration for 1 query: 1.701 sec. The row count is not exact.

I am not a MySQL or database expert, but I have found that for very large tables, you can use option 2 or 3 and get a 'fair idea' of how many rows are present.

I needed to get these row counts for displaying some stats on the UI. With the above queries, I knew that the total rows were more than 500,000, so I came up with showing stats like "More than 500,000 rows" without showing exact number of rows.

Maybe I have not really answered the OP's question, but I am sharing what I did in a situation where such statistics were needed. In my case, showing the approximate rows was acceptable and so the above worked for me.
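A related sketch (my own addition to the same idea): if an approximate figure for one specific table is enough, the estimate can be read from the data dictionary directly, instead of listing every table as option 3 does. TABLE_ROWS is an estimate for InnoDB, just like SHOW TABLE STATUS.

-- Approximate row count for a single table (MySQL); not exact for InnoDB
SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'your_table_name';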


Yangmun Choi

A literally insane answer, but if you have some kind of replication system set up (for a system with a billion rows, I hope you do), you can use a rough estimator (like MAX(pk)), divide that value by the number of slaves you have, and run several queries in parallel.

For the most part, you'd partition the queries across slaves based on the best key (or the primary key, I guess), along these lines (we're going to use 250000000 as our rows per slave):

-- First slave
SELECT COUNT(pk) FROM t WHERE pk < 250000000
-- Ith slave, where 2 <= I <= N - 1
SELECT COUNT(pk) FROM t WHERE pk >= (I-1)*250000000 AND pk < I*250000000
-- Last slave
SELECT COUNT(pk) FROM t WHERE pk >= (N-1)*250000000

But you need SQL only. What a bust. Ok, so let's say you're a sadomasochist. On the master (or closest slave) you'd most likely need to create a table for this:

CREATE TABLE counter_table (minpk integer, maxpk integer, cnt integer, slaveid integer)

So instead of only having the selects running in your slaves, you'd have to do an insert, akin to this:

INSERT INTO counter_table VALUES ((I-1)*250000000, I*250000000, (SELECT COUNT(pk) FROM ... ), @@SLAVE_ID)

You may run into issues with slaves writing to a table on master. You may need to get even more sadis- I mean, creative:

-- A table per slave!
INSERT INTO counter_table_slave_I VALUES (...)

You should in the end have a slave that is last in the path traversed by the replication graph, relative to the first slave. That slave should now have all the other counter values, plus its own. But by the time you've finished, rows have probably been added, so you'd have to insert another row compensating for the difference between the recorded max pk in your counter_table and the current max pk.

At that point, you'd have to do an aggregate function to figure out what the total rows are, but that's easier since you'd be running it on at most the "number of slaves you have and change" rows.

If you're in the situation where you have separate tables in the slaves, you can UNION to get all the rows you need.

SELECT SUM(cnt) FROM (
    SELECT * FROM counter_table_slave_1
      UNION ALL
    SELECT * FROM counter_table_slave_2
      UNION ALL
    ...
  ) AS all_slave_counts

Or you know, be a bit less insane and migrate your data to a distributed processing system, or maybe use a Data Warehousing solution (which will give you awesome data crunching in the future too).

Do note, this does depend on how well your replication is set up. Since the primary bottleneck will most likely be persistent storage, if you have cruddy storage or poorly segregated data stores with heavy neighbor noise, this will probably run you slower than just waiting for a single SELECT COUNT(*) ...

But if you have good replication, then your speed gains should be directly related to the number of slaves. In fact, if it takes 10 minutes to run the counting query alone, and you have 8 slaves, you'd cut your time to less than a couple of minutes. Maybe an hour to iron out the details of this solution.

Of course, you'd never really get an amazingly accurate answer since this distributed solving introduces a bit of time where rows can be deleted and inserted, but you can try to get a distributed lock of rows at the same instance and get a precise count of the rows in the table for a particular moment in time.

Actually, this seems impossible, since you're basically stuck with an SQL-only solution, and I don't think you're provided a mechanism to run a sharded and locked query across multiple slaves, instantly. Maybe if you had control of the replication log file... which means you'd literally be spinning up slaves for this purpose, which is no doubt slower than just running the count query on a single machine anyway.

So there's my two 2013 pennies.


ToolmakerSteve

If an insert trigger is too expensive to use, but a delete trigger could be afforded, and there is an auto-increment id, then count the entire table once, and remember the count as last-count together with the last-counted-id.

Then each day you only need to count the rows with id > last-counted-id, add that to last-count, and store the new last-counted-id.

The delete trigger would decrement last-count if the id of the deleted record is <= last-counted-id.


.. sorry don't have time to show the SQL that would be used (my SQL is rusty). If anyone wants to edit my answer to add SQL, that would be great!
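Taking up that invitation, here is a minimal sketch of the idea in SQL Server syntax. The table row_count_checkpoint and all other names are made up for illustration, and the id column is assumed to be the auto-increment key:

-- One-row checkpoint: the last id that was counted and the count up to that id
CREATE TABLE row_count_checkpoint (last_counted_id bigint NOT NULL, last_count bigint NOT NULL);
INSERT INTO row_count_checkpoint (last_counted_id, last_count)
SELECT ISNULL(MAX(id), 0), COUNT_BIG(*) FROM MyBigTable;
GO
-- Delete trigger: decrement only for deleted rows that had already been counted
CREATE TRIGGER trg_MyBigTable_DeleteDecrement ON MyBigTable AFTER DELETE AS
    UPDATE row_count_checkpoint
    SET last_count = last_count - (SELECT COUNT_BIG(*) FROM deleted d
                                   WHERE d.id <= row_count_checkpoint.last_counted_id);
GO
-- Daily refresh: fold in the rows inserted since the checkpoint, then advance it
UPDATE row_count_checkpoint
SET last_count = last_count + (SELECT COUNT_BIG(*) FROM MyBigTable
                               WHERE id > row_count_checkpoint.last_counted_id),
    last_counted_id = (SELECT MAX(id) FROM MyBigTable);

-- Current total = checkpointed count + rows added since the checkpoint
SELECT last_count + (SELECT COUNT_BIG(*) FROM MyBigTable
                     WHERE id > row_count_checkpoint.last_counted_id) AS total_rows
FROM row_count_checkpoint;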
KevinS

If you have a typical table structure with an auto-incrementing primary key column in which rows are never deleted, the following will be the fastest way to determine the record count and should work similarly across most ANSI compliant databases:

SELECT TOP(1) <primarykeyfield> FROM <table> ORDER BY <primarykeyfield> DESC;

I work with MS SQL tables containing billions of rows that require sub-second response times for data, including record counts. A similar SELECT COUNT(*) would take minutes to process by comparison.
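Note that TOP(1) is SQL Server syntax; a standard-SQL variant of the same idea, which most engines typically resolve with a single index seek on the primary key, is simply:

-- Same idea in standard SQL: the highest key value stands in for the row count.
-- Same caveat: gaps from rollbacks or deletes make this an upper bound, not an exact count.
SELECT MAX(<primarykeyfield>) FROM <table>;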


Not entirely true - what if an INSERT transaction is rolled back? That primary key value would be absent, so the actual record count would be one less than the maximum value.
There could be gaps in the sequence, usually a result of rollbacks.
Actually, there is a modification of this answer that might be significantly faster than count(*), if a database vendor has not sufficiently optimized count(*): Each day keep track of the last auto-index and its corresponding count, then ask for a count of records past that. Can also handle deletes if add a trigger on delete that decrements the previous total, if deleted record id <= that last auto-index.
Abhishek B Patel

For SQL Server, try this:

SELECT T.name, 
       I.rows AS [ROWCOUNT] 
FROM   sys.tables AS T 
       INNER JOIN sys.sysindexes AS I 
               ON T.object_id = I.id AND I.indid < 2 
WHERE T.name = 'Your_Table_Name'
ORDER  BY I.rows DESC 

Liam
select rows from sysindexes
where id = Object_ID('TableName') and indid < 2

Ravi Kiran

On a very large table of mine,

SELECT COUNT(1) FROM TableLarge 

takes 37 seconds whereas

SELECT COUNT_BIG(1) FROM TableLarge

takes 4 seconds.


EvilTeach

Put an index on some column. That should allow the optimizer to perform a full scan of the index blocks, instead of a full scan of the table. That will cut your IO costs way down. Look at the execution plan before and after. Then measure wall clock time both ways.


If a table has billions of rows without an index on any column, then there will be widespread performance issues, far beyond the need expressed in the original question... but good that you mention it (assume nothing!) :)
Diogo Ferreira

If you are using Oracle, how about this (assuming the table stats are updated):

select table_name, num_rows, last_analyzed from user_tables where table_name = '<TABLE_NAME>'

last_analyzed will show the time when stats were last gathered.


Dorian

With PostgreSQL:

SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'table_name'

gotqn

With SQL Server 2019, you can use APPROX_COUNT_DISTINCT, which:

returns the approximate number of unique non-null values in a group

and from the docs:

APPROX_COUNT_DISTINCT is designed for use in big data scenarios and is optimized for the following conditions: Access of data sets that are millions of rows or higher and Aggregation of a column or columns that have many distinct values

Also, the function

implementation guarantees up to a 2% error rate within a 97% probability

requires less memory than an exhaustive COUNT DISTINCT operation

given the smaller memory footprint is less likely to spill memory to disk compared to a precise COUNT DISTINCT operation.

The algorithm behind the implementation is HyperLogLog.
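As a rough way to use it for this question's row count (rather than a distinct count of an arbitrary column), you could point it at a unique key; the table and column names below are just placeholders:

-- SQL Server 2019+: approximate count of distinct key values,
-- which approximates the row count when Id is unique. Approximate by design.
SELECT APPROX_COUNT_DISTINCT(Id) AS approx_row_count
FROM dbo.MyBigTable;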


iminiki

Use COUNT_BIG() for fetching the count of records in a very large table.

SELECT COUNT_BIG(*) FROM TABLENAME;

SenSei

In SQL Server 2016, I can just open the table properties and select the 'Storage' tab; this gives me the row count, disk space used by the table, index space used, etc.


He was looking for a database-vendor-independent solution. Also, this requires a GUI and cannot be automated, and it is not faster than COUNT(*).
Liam

Maybe a bit late, but this might help others for MSSQL:

;WITH RecordCount AS (
    SELECT ROW_NUMBER() OVER (ORDER BY COLUMN_NAME) AS [RowNumber]
    FROM TABLE_NAME
)
SELECT MAX(RowNumber) FROM RecordCount

This is significantly WORSE than COUNT(*), unless we're VERY lucky and the optimizer manages to optimize it to a COUNT(*); why ask it to SORT on a random column?!?
