Status: Closed
Affects Version/s: 1.5.2
Fix Version/s: 1.5.5
Environment:
MariaDB server version: 5.5.44-MariaDB
Java version: "1.8.0_45", Java(TM) SE Runtime Environment (build 1.8.0_45-b14), Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
version_comment: mariadb.org binary distribution
version_compile_machine: AMD64
version_compile_os: Win64
net_buffer_length: 16384
max_allowed_packet: 16777216
Importing large files using LOAD DATA LOCAL INFILE with Connector/J fails: either the JVM runs out of memory or the connection is reset by the MariaDB server.
We get a "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" when the file to import is 1.4 GiB or larger (I haven't determined the exact threshold, but 1.17 GiB does not cause this particular error). Files larger than 16 MiB fail with a "java.sql.SQLNonTransientConnectionException: Could not read resultset: Connection reset by peer: socket write error". Small files (15.6 MiB) work fine.
The query that is run is the following:
The INFILE stream is set using code similar to this:
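The actual query and stream-handling code are in the ticket attachments; a hypothetical sketch of the shape described might look like the following. The table name, column list, and delimiters are assumptions, and it presumes the Connector/J 1.x `MariaDbStatement.setLocalInfileInputStream` API (it needs the driver jar and a live server, so it is an illustration, not runnable in isolation):

```java
import java.io.InputStream;
import java.sql.Connection;
import java.sql.Statement;

import org.mariadb.jdbc.MariaDbStatement;

public class LoadDataSketch {
    // Hypothetical statement; the real query is in the attachments.
    static final String LOAD_SQL =
        "LOAD DATA LOCAL INFILE 'dummy.csv' INTO TABLE user_data "
      + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
      + "(model, user_value)";

    static void importCsv(Connection conn, InputStream csv) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // Connector/J reads from this stream instead of opening
            // the file named in the statement.
            stmt.unwrap(MariaDbStatement.class).setLocalInfileInputStream(csv);
            stmt.execute(LOAD_SQL);
        }
    }
}
```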
The table definition and example code are attached to this ticket.
From what I've seen so far, it seems that the Connector/J driver attempts to allocate a buffer (in org.mariadb.jdbc.internal.stream.PacketOutputStream::increase(int)) large enough to fit the entire file before sending anything to the server. For large files this allocation itself fails; for smaller files the send seems to fail because the resulting buffer exceeds "max_allowed_packet".
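The streaming behavior being asked for can be illustrated with a bounded-buffer read loop: memory use stays at the chunk size no matter how large the input is. This is a standalone sketch (the chunk size stands in for a max_allowed_packet-sized packet), not the driver's actual code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedReader {
    /** Consumes the stream in chunks of at most maxChunk bytes and
     *  returns how many chunks (i.e. "packets") were read. Only one
     *  maxChunk-sized buffer is ever allocated. */
    static int countChunks(InputStream in, int maxChunk) throws IOException {
        byte[] buf = new byte[maxChunk];
        int chunks = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            if (n > 0) chunks++;  // each full or partial read = one chunk
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // 100 bytes read in 16-byte chunks: six full chunks plus one
        // partial chunk of 4 bytes.
        InputStream in = new ByteArrayInputStream(new byte[100]);
        System.out.println(countChunks(in, 16)); // prints 7
    }
}
```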
In both error cases, no rows were created and nothing was logged in the access or error log.
I have tested by simulating imports with different file sizes; please see the attached Java file.
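The simulation approach can be approximated with a synthetic stream that produces an arbitrary number of bytes without any backing file or heap allocation, so multi-GiB imports can be exercised cheaply. The class below is a hypothetical stand-in, not the attached test code:

```java
import java.io.InputStream;
import java.util.Arrays;

/** Synthesizes `size` dummy bytes on the fly; a multi-GiB LOAD DATA
 *  input can be simulated without a multi-GiB file or buffer. */
class FakeDataStream extends InputStream {
    private long remaining;

    FakeDataStream(long size) {
        remaining = size;
    }

    @Override
    public int read() {
        if (remaining <= 0) return -1;
        remaining--;
        return 'x'; // dummy payload byte
    }

    @Override
    public int read(byte[] b, int off, int len) {
        if (remaining <= 0) return -1;
        int n = (int) Math.min(len, remaining);
        Arrays.fill(b, off, off + n, (byte) 'x');
        remaining -= n;
        return n;
    }
}
```

Passing `new FakeDataStream(1_500L * 1024 * 1024)` to the import code reproduces the 1.4+ GiB case without generating a real file.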
The INFILE should be streamed to the server without any issues, even if it does not fit within a single packet or in the JVM heap.
We have a system that generates data from several sources. This data is spread across several tables and databases. Users of this data need to add their own additional information (columns) to records that we have created. This additional data is something only the user has knowledge of; it is not available in the sources our data is generated from. The additional data is only required when the user performs queries, so the part of the system that generates data is never aware of it. One way to think of it is that the user maps every unique value of some column to some (non-unique) value of their choosing.
The process works like this: the user identifies some property (column) available in our databases they want to attach additional data to, for example cellphone model. The user then creates a file in a special format that maps some number of cellphone models to the data they wish to add. The file does not have to contain every value that exists in the database; it may even reference values that do not currently exist. The user then notifies our system that they wish to add the data.
The system in turn does some processing of the file, for example transforming the values to the appropriate format. The end result is that the system produces a CSV file that will then be inserted into one or more MariaDB databases. The data is placed in its own table; our original data is never modified, partly because the data-generation part of the system has no knowledge of the additional data.
After the data has been inserted into the databases, the system performs a JOIN on the created tables whenever a user asks for this additional data.
Currently this data insertion is performed by batching inserts in small volumes, but this is too slow, so we are attempting to use LOAD DATA INFILE.