[CONJ-375] Load data infile with large files fails with OutOfMemoryError - Jira

XML

Word

Printable

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.2
Fix Version/s: 1.5.5
Component/s: Other
Labels:
- infile
Environment:

Hide
MariaDB server version: 5.5.44-MariaDB
Java version: java version "1.8.0_45" Java(TM) SE Runtime Environment (build 1.8.0_45-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
version_comment: mariadb.org binary distribution
version_compile_machine: AMD64
version_compile_os: Win64
net_buffer_length: 16384
max_allowed_packet: 16777216

Show
MariaDB server version: 5.5.44-MariaDB Java version: java version "1.8.0_45" Java(TM) SE Runtime Environment (build 1.8.0_45-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode) version_comment: mariadb.org binary distribution version_compile_machine: AMD64 version_compile_os: Win64 net_buffer_length: 16384 max_allowed_packet: 16777216

Description

Summary:
Importing data using LOAD DATA LOCAL INFILE with large files using Connector/J fails with either memory running out or the connection being reset by the MariaDB service.

Detailed problem description:
We get a "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" when the file to import is 1.4 GiB or larger (I haven't determined the exact number, but 1.17 GiB does not cause this particular error). A "java.sql.SQLNonTransientConnectionException: Could not read resultset: Connection reset by peer: socket write error" when the file to import is larger than 16 MiB. Small files (15.6 MiB) work fine.

The query that is run is the following:

"LOAD DATA LOCAL INFILE 'ignoredFileName'

INTO TABLE `test`.`infile`

CHARACTER SET UTF8

COLUMNS TERMINATED BY ',' ENCLOSED BY '\"' ESCAPED BY '\\'

LINES TERMINATED BY '\n'

IGNORE 1 LINES

(`a`, `b`)"

The INFILE stream is set using code similar to this:

Connection conn = DriverManager.getConnection([...]);

MariaDbStatement stmt = conn.createStatement().unwrap(org.mariadb.jdbc.MariaDbStatement.class);

Stmt.setLocalFileInputStream([...]);

Stmt.execute([SQL from above]);

The table definition and example code is attached to this ticket.

From what I've seen so far it seems that the Connector/J driver attempts to allocate a buffer (org.mariadb.jdbc.internal.stream.PacketOutputStream::increase(int)) large enough to fit the entire file before sending anything to the server. This allocation fails for large files. For smaller files it seems to fail because the "max_allowed_packet" is smaller than the buffer.

No rows were created and nothing was logged in the access or error log in the two cases that caused errors to occur.

I have tested simulating imports with different file sizes, please see the attached java file.

Expected behavior:
The INFILE should be streamed to the server(s) without any issues, even if it does not fit within a single packet or in the JVM heap.

Background information:
We have a system that generates data from several sources. This data is spread across several tables and databases. Users of this data have the need to add their own additional information (columns) to records that we have created. This additional data is something only the user has knowledge of, it is not available in the sources our data is generated from. The additional data is only required when the user performs queries, so the part of the system that is generating data is never aware of the additional data. One way to think of it is that the user maps every unique value of some column to some (non-unique) value of their choosing.

The process works like this: the user identifies some property (column) available in our databases they want to add additional data to, for example cellphone model. The user then creates a file in a special format that maps some number of cellphone models to the data they wish to add. The file does not have to contain every value that exists in the database, it may even reference values that currently does not exist. The user then notifies our system that they wish to add the data.

The system in turn does some processing of the file, for example transforming the values to the appropriate format. The end result is that the system produces a CSV file that will be then be inserted to one or more MariaDB databases. The data is placed in its own table, our original data is never modified, this is partly because the data generation part of the system does not have knowledge of the data.

After the data has been inserted to the databases the system will perform a JOIN on the created tables whenever a user asks for this additional data.

Currently this data insertion is performed by batching inserts in small volumes, this is too slow, thus we are attempting to use LOAD DATA INFILE.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

load_infile_test_file_sizes.zip
3 kB
2016-10-20 12:05
create_table.sql
1 kB
2016-10-20 12:05

Activity

People

Assignee:: Diego Dupin

Reporter:: Andreas Palm

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2016-10-20 12:07

Updated:: 2016-10-25 12:19

Resolved:: 2016-10-25 12:01

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.