MariaDB Connector/J / CONJ-1122

OutOfMemoryError when iterating over a result set and modifying rows on the same connection

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Not a Bug
    • Affects Version: 3.1.4
    • Fix Version: N/A
    • Component: Other
    • Labels: None
    • Environment: Windows 10
      MariaDB 10.6.5
      Java 17
      SpringBoot 3.1.1, Hibernate 6.2.5

    Description

      Background: I created a test project that demonstrates various ways in the Jakarta Persistence API (JPA) to iterate over an arbitrary number of entities read from the database,
      and possibly modify them, without running into an OutOfMemoryError (OOME). The project is written in Kotlin & SpringBoot and connects to a MariaDB database. I was able to iterate over many (even millions of) entities (rows) from the database without any memory issues, even when explicitly lowering the heap size to only 200M. But when the iterated entities were also modified in the loop, with a higher count (e.g. 100 000), an OutOfMemoryError occurred (I guess the complete ResultSet was read into memory). I first suspected that the ORM framework (Hibernate) had some hidden bug, so I tried the same in plain JDBC, but the OutOfMemoryError occurred there too! Then I tried connecting to an Oracle database and, to my surprise, everything worked correctly - whether using JPA or JDBC! So my conclusion is that there is either some bug in the MariaDB JDBC driver OR something else must also be set in the connection properties.

      Attached is the complete project. Just unzip it, import it into an IDE (set JDK 17) and run the Gradle build. Set -Xmx200M in order to detect possible memory problems early. Follow README.md.
      Basically, create 1 000 000 objects first and then call the `jdbcresultset` endpoint with count = 100 000 and modifyData=true -> you will get an OutOfMemoryError.

      The most important class is MassDataService. It has various functions that show the possible ways to iterate over objects in JPA (resultList, offsetPaging, resultStream, stream) and in JDBC:

         
      //pass 100 000, 1000, 1, true -> OutOfMemoryError
      fun jdbcResultSet(count: Int, flushAfter: Int, fetchSize: Int, modifyData: Boolean): Int {
          val connection = dataSource.connection // take a connection from the connection pool
          val autoCommit = connection.autoCommit
          connection.autoCommit = false
          try {
              val limitClause = "limit $count" // MariaDB
              //val limitClause = "fetch first $count rows only" // Oracle
              val ps = connection.prepareStatement("select sys_pk, amount, name, description, valid from something $limitClause")
              //    ResultSet.TYPE_FORWARD_ONLY,       // default TYPE_FORWARD_ONLY; TYPE_SCROLL_INSENSITIVE -> OOME
              //    ResultSet.CONCUR_READ_ONLY,        // default CONCUR_READ_ONLY -> PreparedStatement.executeBatch() gets stuck!; CONCUR_UPDATABLE -> OOME
              //    ResultSet.HOLD_CURSORS_OVER_COMMIT)
              ps.fetchSize = fetchSize // if not set to a positive number, an OOME occurs!
              printStatementInfo(ps)
              val rs = ps.executeQuery()
              var i = 0
              val toUpdate = mutableListOf<Something>()
              while (rs.next()) {
                  i++
                  val something = Something(
                      sysPk = rs.getInt("sys_pk"),
                      sysId = null,
                      name = rs.getString("name"),
                      description = rs.getString("description"),
                      amount = rs.getInt("amount"),
                      valid = rs.getBoolean("valid"))
                  if (i % flushAfter == 0) {
                      println("Entity $i: sysPk = ${something.sysPk}, name = ${something.name}")
                  }
                  if (modifyData) {
                      something.amount = something.amount + 1
                      toUpdate.add(something)
                      if (i % flushAfter == 0) {
                          update(toUpdate, connection) // update the rows one by one
                          //batchUpdate(toUpdate, connection) // alternative: update the rows in batches
                          toUpdate.clear()
                      }
                  }
                  if (i == count) {
                      println("desired count reached..stopping")
                      break
                  }
              }
              rs.close()
              ps.close()
          } catch (ex: SQLException) {
              connection.rollback()
              println("Error: " + ex.message)
          } finally {
              if (autoCommit) {
                  connection.autoCommit = true
              }
              connection.close() // return the connection to the pool no matter what
          }
          return 0
      }
       
      private fun update(toUpdate: List<Something>, connection: Connection) {
          var numUpdated = 0
          val upds = connection.prepareStatement("update something set amount = ?, name = ?, description = ?, valid = ? where sys_pk = ?")
          for (something in toUpdate) {
              upds.setInt(1, something.amount)
              upds.setString(2, something.name)
              upds.setString(3, something.description)
              upds.setBoolean(4, something.valid)
              upds.setInt(5, something.sysPk!!)
              numUpdated += upds.executeUpdate() // OOME even on the first call!!!
          }
          upds.close()
          println("updated $numUpdated record(s)")
      }
       
       
      
      

      Attachments

        Activity

          Diego Dupin added a comment -

          Thanks for the detailed test project, I think that's the first time I've seen a test example so well prepared!

          The problem is just that the Oracle and MariaDB connectors differ in their default fetch size (Oracle is the only connector I know of that has a default fetch size, of 10 rows).

          Having a fetch size has some limitations: the JDBC spec permits going back to previous result-set rows, setting a fetch size removes that possibility, and it decreases performance for small result sets.

          For a big result set, the fetch size has to be set in order to avoid loading the whole result set into memory.

          This can be done in different ways:
          globally: hibernate.jdbc.fetch_size: 10
          per query: query.setHint("org.hibernate.fetchSize", "10")

          as in your example:

              @Transactional
              fun resultList(count: Int, action: (Something, Int) -> Unit): Int {
                  val query = em.createQuery("select s from Something s", Something::class.java)
                  query.setHint("org.hibernate.fetchSize", "10")
                  query.maxResults = count
                  val resultList = query.resultList
                  resultList.forEachIndexed { i, s ->
                      action(s, i)
                  }
                  return count
              }
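
          The streamed variant benefits from the same hint. Below is a minimal sketch of how the project's resultStream() function could apply it (an assumption for illustration - `em` as the injected EntityManager and the hint value are not taken verbatim from the attached project):

              @Transactional
              fun resultStream(count: Int, action: (Something, Int) -> Unit): Int {
                  val query = em.createQuery("select s from Something s", Something::class.java)
                  query.setHint("org.hibernate.fetchSize", "10") // stream rows in chunks of 10
                  query.maxResults = count
                  var i = 0
                  // Stream implements AutoCloseable, so `use` releases the underlying cursor
                  query.resultStream.use { stream ->
                      stream.forEach { s -> action(s, i++) }
                  }
                  return i
              }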
          

          Radek Wikturna added a comment - edited

          I'm really surprised that this bug was closed as 'Not a Bug'!
          It took me a long time to investigate the issue, and I provided code that easily reproduces the problem (either in JPA or in plain JDBC).

          > For a big result set, the fetch size has to be set in order to avoid loading the whole result set into memory.

          Yes, I know. That's why I'm setting the fetch size (see the line `ps.fetchSize = fetchSize` in the function jdbcResultSet() and also in other JPA functions like getResultStream()), where a value from a function argument is passed (any positive value will work). That is, with the fetch size set, you really can iterate over any number of rows without getting an OOME - but you will get an OOME when you also modify them.
          Please try it yourself - create mass data by calling fun create(1 000 000), then call fun jdbcResultSet(100 000, 1000, 1, true) -> you will get an OOME when the execution reaches the line `numUpdated += upds.executeUpdate()` in fun update().

          Maybe I shouldn't have included all the possible functions like resultList() - I admit I'm not setting the fetch size there (it wouldn't help anyway) because I wanted to test how many records can be iterated with this function before an OOME occurs. So please just focus on fun jdbcResultSet(), which simulates what JPA/Hibernate does under the hood (e.g. when you call fun getResultStream()).

          Please reopen the issue. I'm sure we can find a way to understand each other.

          I have also previously read this article from a Hibernate guru, which gives some advice about setting the fetch size on MySQL to Integer.MIN_VALUE or setting the connection property useCursorFetch - neither works on MariaDB:
          https://vladmihalcea.com/whats-new-in-jpa-2-2-stream-the-result-of-a-query-execution/

          Diego Dupin added a comment -

          There are two things that tell the connector how to handle memory: the fetch size hint and the result-set type.
          Using the ResultSet.TYPE_FORWARD_ONLY result-set type indicates that the result set can only be read forward, so after calling resultset.next(), the previous row can be removed from memory, because previous data won't be read anymore. TYPE_SCROLL_INSENSITIVE means the opposite: after reading the 1,000,000th row, for example, the application can still go back to the first row using resultset.absolute(1) => all the rows must stay in memory, and that can lead to an OOME.

          The Oracle and MariaDB protocols differ when using a fetch size. For a fetch size of X, MariaDB will retrieve the whole result set, putting X rows in memory at a time, while Oracle will only retrieve X rows and then, when more rows are needed, ask for the next X rows. (To be complete, MariaDB has another exchange protocol that permits retrieving data the same way, but since it is around 2x less performant than the current implementation, it's better this way.)

          The point is that those two parameters - fetch size and result-set type - are how the connector knows how memory must be handled.
          So when dealing with a large number of rows, the only solution is to use the TYPE_FORWARD_ONLY result-set type and to set the fetch size, as in the sketch below.
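
          A minimal sketch of that combination in plain JDBC (Kotlin; `connection` and the `something` table are reused from the reporter's example):

              // Streaming-friendly combination: forward-only cursor + positive fetch size.
              // TYPE_FORWARD_ONLY lets the driver discard rows that have already been consumed;
              // fetchSize caps how many rows are buffered at a time.
              val ps = connection.prepareStatement(
                  "select sys_pk, amount, name, description, valid from something",
                  ResultSet.TYPE_FORWARD_ONLY,
                  ResultSet.CONCUR_READ_ONLY)
              ps.fetchSize = 10
              val rs = ps.executeQuery()
              while (rs.next()) {
                  // process the current row; earlier rows are eligible for garbage collection
              }
              rs.close()
              ps.close()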

          Oracle by default sets the fetch size to 10, while MariaDB loads the complete result set (fetch size of 0). The thing is that loading the complete result set is a bit faster for small result sets. Small result sets are the norm, so I think we have a good default.
          Of course, that's always debatable, and we can discuss further whether these are good defaults or whether we lack documentation on this.

          I won't talk about the MySQL connector's setFetchSize(Integer.MIN_VALUE), which in fact corresponds to setFetchSize(1), because that doesn't follow the JDBC specification (the fetch size must be >= 0).

          Radek Wikturna added a comment - edited

          I really appreciate the detailed answer giving insight into some technical details that need to be considered when implementing the driver.
          However, there seems to be some misunderstanding - I'm doing everything you suggested (setting the fetch size, using only ResultSet.TYPE_FORWARD_ONLY) but I still get an OOME when modifying the data in the loop. Can you explain why the line `numUpdated += upds.executeUpdate()` in the update() function fails with an OOME?

          I suspect you have not actually run the code. I admit that my testing code is long and has many functions, and that may be confusing. Therefore, I've reduced it to the bare minimum (there's a new zip file) and now you only need to:

          • create a database (by default named `massdata`), optionally modifying the url
          • set -Xmx100M
          • run the program
          • it will create 200 000 rows in the `Something` table
          • then it will iterate over those records -> this works
          • then it will iterate over those records and modify them in the same loop -> this fails with an OOME!

          Please try to run it, and only then tell me that I'm doing something wrong.

          Vladislav Vaintroub added a comment - edited

          While I do not know the details of the driver implementation, one thing I do know is that you can't execute any other query as long as the result set from COM_QUERY/COM_STMT_EXECUTE has not been fully read. The protocol does not allow starting a new command before the old one is finished.
          So the JDBC driver might simply have to read the result set fully into memory before the update, so that it can execute your update.
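
          A sketch of that failure mode (Kotlin; illustrative only - the schema comes from the reporter's project, and whether the driver drains eagerly is exactly the behavior under discussion):

              // Both statements share one connection. With a positive fetchSize the SELECT
              // streams, but as soon as another command is sent on the same connection, the
              // driver must first drain ALL unread SELECT rows into memory -> OOME for big tables.
              val conn = dataSource.connection
              val select = conn.prepareStatement("select sys_pk, amount from something")
              select.fetchSize = 10
              val rs = select.executeQuery()
              while (rs.next()) {
                  val update = conn.prepareStatement("update something set amount = ? where sys_pk = ?")
                  update.setInt(1, rs.getInt("amount") + 1)
                  update.setInt(2, rs.getInt("sys_pk"))
                  update.executeUpdate() // forces the remaining SELECT rows to be buffered first
                  update.close()
              }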

          Now, Roderick, did you succeed with your program using Oracle's MySQL driver, using any of the tricks that the Hibernate guru revealed?

          Diego Dupin added a comment - edited

          Ah, I've finally understood your problem: basically, while streaming a result set, you use the same connection to run an update command. In order to execute it, the connector must first finish reading all the remaining rows of the result set into memory, and that results in the OOME.

          Basically, for your example, changing

              if (i % flushAfter == 0) {
                  update(toUpdate, connection) // update the rows one by one
                  //batchUpdate(toUpdate, connection) // alternative: update the rows in batches
                  toUpdate.clear()
              }
          

          to

              if (i % flushAfter == 0) {
                  val connection2 = dataSource.connection
                  try {
                      update(toUpdate, connection2) // update the rows one by one, on a second connection
                      //batchUpdate(toUpdate, connection2) // alternative: update the rows in batches
                      toUpdate.clear()
                  } finally {
                      connection2.close()
                  }
              }
          

          To explain a bit more, the exchanges between the MariaDB connector and the database when using a fetch size are:

          • the connector sends the SELECT command to the database
          • the database sends 200 000 rows
          • the connector reads fetch-size rows at a time, the remainder still sitting in the socket cache or not yet sent by the server
          • when the connector is asked to run another command, the driver loads all the remaining data into memory before sending the new command

          Oracle works differently. Example with the default fetch size of 10:

          • the connector sends the SELECT command to the database
          • the database sends 10 rows
          • the connector reads 10 rows, then asks for the next 10 rows
          • ...
          • when the driver is asked to run another command, the connector can send it and read the result

          The Oracle solution is much slower because it requires many exchanges between client and server, but it does not have the limitation the MariaDB connector has here.

          This needs at least documentation. A consolidated version of the workaround follows below.
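
          Putting the whole workaround together - a minimal self-contained sketch in Kotlin (`dataSource` and the `something` table come from the reporter's project; the function and helper names are illustrative):

              import java.sql.ResultSet
              import javax.sql.DataSource

              // Stream rows on one connection and write on a second one, so the streaming
              // result set never has to be drained into memory mid-iteration.
              fun incrementAmounts(dataSource: DataSource, batchSize: Int) {
                  dataSource.connection.use { readConn ->
                      val ps = readConn.prepareStatement(
                          "select sys_pk, amount from something",
                          ResultSet.TYPE_FORWARD_ONLY,
                          ResultSet.CONCUR_READ_ONLY)
                      ps.fetchSize = batchSize // stream instead of loading the full result set
                      val rs = ps.executeQuery()
                      val batch = mutableListOf<Pair<Int, Int>>() // (sysPk, newAmount)
                      while (rs.next()) {
                          batch.add(rs.getInt("sys_pk") to rs.getInt("amount") + 1)
                          if (batch.size == batchSize) {
                              flush(dataSource, batch) // separate connection -> the read stream is untouched
                              batch.clear()
                          }
                      }
                      if (batch.isNotEmpty()) flush(dataSource, batch)
                      rs.close()
                      ps.close()
                  }
              }

              private fun flush(dataSource: DataSource, batch: List<Pair<Int, Int>>) {
                  dataSource.connection.use { writeConn ->
                      writeConn.prepareStatement("update something set amount = ? where sys_pk = ?").use { upd ->
                          for ((sysPk, newAmount) in batch) {
                              upd.setInt(1, newAmount)
                              upd.setInt(2, sysPk)
                              upd.addBatch()
                          }
                          upd.executeBatch() // one round trip per batch on the write connection
                      }
                  }
              }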


          People

            Assignee: Diego Dupin
            Reporter: Radek Wikturna
            Votes: 0
            Watchers: 3

