
Why does a higher LIMIT offset slow the query down in MySQL?

Scenario in short: a table with more than 16 million records [2GB in size]. The higher the LIMIT offset in a SELECT, the slower the query becomes when using ORDER BY *primary_key*.

So

SELECT * FROM large ORDER BY `id`  LIMIT 0, 30 

takes far less time than

SELECT * FROM large ORDER BY `id` LIMIT 10000, 30 

Both queries only order 30 records, and it's the same either way, so the overhead is not from ORDER BY. Fetching the latest 30 rows currently takes around 180 seconds. How can I optimize this simple query?

NOTE: I'm the author. MySQL doesn't use the index (PRIMARY) in the above cases. See the answer below by user "Quassnoi" for an explanation.
A related link: We need tool support for keyset pagination. If you’d like to know what happens inside the database when using offset or keyset pagination, have a look at those slides.

Elzo Valugi

I had the exact same problem myself. Given that you want to collect a large amount of this data and not a specific set of 30, you'll probably be running a loop and incrementing the offset by 30.

So what you can do instead is:

Hold the last id of the previous set of 30 rows (e.g. lastId = 530), then add the condition WHERE id > lastId LIMIT 0, 30.

So you can always have a ZERO offset. You will be amazed by the performance improvement.
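A minimal sketch of the idea, using the `large` table and `id` column from the question (530 is just the example value mentioned above):

-- First batch: no offset needed.
SELECT * FROM large ORDER BY id LIMIT 30;

-- Every following batch: remember the last id of the previous batch
-- (lastId, e.g. 530) and start strictly after it, so the offset stays zero.
SELECT * FROM large WHERE id > 530 ORDER BY id LIMIT 30;

Because the WHERE id > lastId condition is resolved through the index, MySQL can seek straight to the starting row instead of counting off all the skipped ones.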


It may not be obvious to all that this only works if your result set is sorted by that key, in ascending order (for descending order the same idea works, but change > lastId to < lastId). It doesn't matter whether it's the primary key or another field (or group of fields).
Just a note that limit/offset is often used in paginated results, and holding lastId is simply not possible because the user can jump to any page, not always the next page. In other words, the offset often needs to be calculated dynamically based on page and limit, instead of following a continuous pattern.
I talk at more length about "remembering where you left off" in mysql.rjweb.org/doc.php/pagination
Man, you are a lifesaver. I have 5 million rows that needed around 90 minutes to process with offset and limit; with your answer it only needs 9 minutes. Thank you!
@Lanti Let's assume that Page 563 begins at offset 563 * 30 = 16890, since in the OP's example 30 is the page size and assume page numbering starts from 0. Further assume that column id is unique and is indexed. Then execute select id from large order by id limit 16889, 1 to read the id of the last row of Page 562. This should be reasonably efficient since only the index is involved. Now you have the "lastId" to proceed with selecting the next page.
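In SQL, that two-step jump might look like this (the id 17012 returned by step 1 is a purely hypothetical value; the offsets follow the numbers in the comment above):

-- Step 1: index-only lookup of the last id of page 562 (offset 16889 = 563*30 - 1).
SELECT id FROM large ORDER BY id LIMIT 16889, 1;   -- suppose it returns 17012

-- Step 2: keyset query for page 563, using that id as lastId.
SELECT * FROM large WHERE id > 17012 ORDER BY id LIMIT 30;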
Quassnoi

It's normal that higher offsets slow the query down, since the query needs to count off the first OFFSET + LIMIT records (and take only LIMIT of them). The higher this value is, the longer the query runs.

The query cannot go right to OFFSET because, first, the records can be of different length, and, second, there can be gaps from deleted records. It needs to check and count each record on its way.

Assuming that id is the primary key of a MyISAM table, or a unique non-primary key field on an InnoDB table, you can speed it up by using this trick:

SELECT  t.* 
FROM    (
        SELECT  id
        FROM    mytable
        ORDER BY
                id
        LIMIT 10000, 30
        ) q
JOIN    mytable t
ON      t.id = q.id

See this article:

MySQL ORDER BY / LIMIT performance: late row lookups
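Applied to the table from the question, the same late-row-lookup trick would look roughly like this; the outer ORDER BY is added because the join by itself does not guarantee the output order:

SELECT  t.*
FROM    (
        -- index-only scan: counting off the first 10000 entries is cheap here
        SELECT  id
        FROM    large
        ORDER BY
                id
        LIMIT 10000, 30
        ) q
JOIN    large t
ON      t.id = q.id
ORDER BY
        t.id

The expensive full-row lookups then happen only for the 30 matching ids.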


MySQL "early row lookup" behavior was the answer why it's talking so long. By the trick you provided, only matched ids (by the index directly) are bound, saving unneeded row lookups of too many records. That did the trick, hooray!
@harald: what exactly do you mean by "not work"? This is a pure performance improvement. If there is no index usable by ORDER BY or the index covers all fields you need, you don't need this workaround.
@f055: the answer says "speed up", not "make instant". Have you read the very first sentence of the answer?
Is it possible to run something like this for InnoDB?
@Lanti: please post it as a separate question and don't forget to tag it with postgresql. This is a MySQL-specific answer.
Riedsio

MySQL cannot go directly to the 10000th record (or the 80000th byte, as you're suggesting) because it cannot assume that the records are packed/ordered like that (or that the ids are continuous values from 1 to 10000). Although it might be that way in actuality, MySQL cannot assume that there are no holes/gaps/deleted ids.

So, as bobs noted, MySQL has to fetch 10000 rows (or traverse the first 10000 entries of the index on id) before finding the 30 to return.

EDIT: To illustrate my point

Note that although

SELECT * FROM large ORDER BY id LIMIT 10000, 30 

would be slow(er),

SELECT * FROM large WHERE id >  10000 ORDER BY id LIMIT 30 

would be fast(er), and would return the same results provided that there are no missing ids (i.e. gaps).
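If there are gaps, one way to keep the result identical to LIMIT 10000, 30 (a hedged sketch, borrowing the boundary-id lookup idea from the comments above) is to derive the boundary id from the index instead of assuming id = offset:

-- The scalar subquery reads only the index to find the id of the 10000th row;
-- the outer query then returns the same 30 rows as LIMIT 10000, 30, gaps or not.
SELECT *
FROM   large
WHERE  id > (SELECT id FROM large ORDER BY id LIMIT 9999, 1)
ORDER BY id
LIMIT  30;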


This is correct. But since it's limited by "id", why does it take so long when that id is within an index (the primary key)? The optimizer should refer to that index directly and then fetch the rows with matching ids (which came from that index).
If you used a WHERE clause on id, it could go right to that mark. However, if you put a limit on it, ordered by id, it's just a relative counter from the beginning, so it has to traverse the whole way.
Very good article eversql.com/…
Worked for me @Riedsio Thanks.
sym

I found an interesting way to optimize SELECT queries with ORDER BY id LIMIT X, Y. I have 35 million rows, so it took about 2 minutes to find a range of rows.

Here is the trick :

SELECT id, name, address, phone
FROM customers
WHERE id > 990
ORDER BY id LIMIT 1000;

Just adding a WHERE clause with the last id you got increases the performance a lot. For me it went from 2 minutes to 1 second :)

Other interesting tricks here : http://www.iheavy.com/2013/06/19/3-ways-to-optimize-for-paging-in-mysql/

It works with strings too.
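For example (a hypothetical sketch: the name column and the boundary value 'Miller' are not from the original query, and name is assumed to be indexed):

-- Same keyset idea on a string column: remember the last name of the
-- previous batch and start strictly after it.
SELECT id, name, address, phone
FROM   customers
WHERE  name > 'Miller'
ORDER BY name
LIMIT  1000;

If the string column is not unique, rows that share the boundary value can be skipped; adding id as a tie-breaker in both the ORDER BY and the comparison avoids that.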


This only works for tables where no data is deleted.
@miro That's only true if you are working under the assumption that your query can do lookups at random pages, which I don't believe this poster is assuming. While I don't like this method for most real world cases, this will work with gaps as long as you are always basing it off the last id obtained.
bobs

The time-consuming part of the two queries is retrieving the rows from the table. Logically speaking, in the LIMIT 0, 30 version, only 30 rows need to be retrieved. In the LIMIT 10000, 30 version, 10000 rows are evaluated and 30 rows are returned. Some optimization can be done by the data-reading process, but consider the following:

What if you had a WHERE clause in the queries? The engine must return all rows that qualify, and then sort the data, and finally get the 30 rows.

Also consider the case where rows are not processed in the ORDER BY sequence. All qualifying rows must be sorted to determine which rows to return.
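As a hypothetical illustration (the status and created_at columns are invented for the example and assumed to be unindexed):

-- MySQL must first find every row with status = 'active', then sort all of
-- them by created_at, and only then discard the first 10000 rows and return 30.
SELECT *
FROM   large
WHERE  status = 'active'
ORDER BY created_at
LIMIT  10000, 30;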


Just wondering why it consumes time to fetch those 10000 rows. The index used on that field (id, which is a primary key) should make retrieving those rows as fast as seeking that PK index to record no. 10000, which in turn is supposed to be as fast as seeking the file to that offset multiplied by the index record length (i.e., seeking to 10000 * 8 = byte no. 80000, given that 8 is the index record length).
@Rahman - The only way to count past the 10000 rows is to step over them one by one. This may just involve an index, but still index rows take time to step through. There is no MyISAM or InnoDB structure that can correctly (in all cases) "seek" to record 10000. The 10000*8 suggestion assumes (1) MyISAM, (2) FIXED length record, and (3) never any deletes from the table. Anyway, MyISAM indexes are BTrees, so it would not work.
As this answer stated, I believe, the really slow part is the row lookup, not traversing the indexes (which of course will add up as well, but nowhere near as much as the row lookups on disk). Based on the workaround queries provided for this issue, I believe the row lookups tend to happen if you are selecting columns outside of the index -- even if they are not part of the order by or where clause. I haven't found a reason why this is necessary, but it appears to be why some of the workarounds help.
I believe the delay is caused by counting the entries in the index tree, as opposed to finding the starting index (for which the SQL index tree is optimized and gets pointed close to the target row without going through particular rows). The next part, reading the number of rows, is equally "slow" when using WHERE id > x. But the latter is useless in most real-world applications anyway.
ch271828n

For those who are interested in a comparison and figures :)

Experiment 1: The dataset contains about 100 million rows. Each row contains several BIGINT and TINYINT columns, as well as two TEXT fields (deliberately) containing about 1k chars each.

Blue := SELECT * FROM post ORDER BY id LIMIT {offset}, 5

Orange := @Quassnoi's method. SELECT t.* FROM (SELECT id FROM post ORDER BY id LIMIT {offset}, 5) AS q JOIN post t ON t.id = q.id

Of course, the third method, ... WHERE id>xxx LIMIT 0,5, does not appear here since it should be constant time.

Experiment 2: Similar thing, except that one row only has 3 BIGINTs.

green := the blue before

red := the orange before

https://i.stack.imgur.com/bjxjA.png


Is your id primary key or non-primary key field?
@ospider primary imho