MariaDB Connector/J / CONJ-375

Load data infile with large files fails with OutOfMemoryError


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.2
    • Fix Version/s: 1.5.5
    • Component/s: Other
    • Labels:
    • Environment:

      Description

      Summary:
      Importing data using LOAD DATA LOCAL INFILE with large files through Connector/J fails with either an OutOfMemoryError or the connection being reset by the MariaDB server.

      Detailed problem description:
      We get a "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" when the file to import is 1.4 GiB or larger (I haven't determined the exact threshold, but 1.17 GiB does not cause this particular error). We get a "java.sql.SQLNonTransientConnectionException: Could not read resultset: Connection reset by peer: socket write error" when the file to import is larger than 16 MiB. Small files (15.6 MiB) work fine.

      The query that is run is the following:

      "LOAD DATA LOCAL INFILE 'ignoredFileName'
      INTO TABLE `test`.`infile`
      CHARACTER SET UTF8
      COLUMNS TERMINATED BY ',' ENCLOSED BY '\"' ESCAPED BY '\\'
      LINES TERMINATED BY '\n'
      IGNORE 1 LINES
      (`a`, `b`)"
      

      The INFILE stream is set using code similar to this:

      Connection conn = DriverManager.getConnection([...]);
      MariaDbStatement stmt = conn.createStatement().unwrap(org.mariadb.jdbc.MariaDbStatement.class);
      stmt.setLocalFileInputStream([...]);
      stmt.execute([SQL from above]);
      

      The table definition and example code are attached to this ticket.

      From what I've seen so far, the Connector/J driver attempts to allocate a buffer (org.mariadb.jdbc.internal.stream.PacketOutputStream::increase(int)) large enough to fit the entire file before sending anything to the server. For large files this allocation itself fails; for smaller files the query fails because the buffer exceeds "max_allowed_packet".
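      If the driver flushed fixed-size chunks instead of growing a single buffer, the file would never need to fit in the heap. A minimal, DB-free sketch of that idea (the 1 MiB chunk size and the ByteArrayOutputStream sink are illustrative stand-ins, not Connector/J internals):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: copy an InputStream to a sink in bounded chunks, so memory use
// stays at MAX_CHUNK regardless of the total stream size.
public class ChunkedStreamSketch {
    static final int MAX_CHUNK = 1 << 20; // 1 MiB; would stay below max_allowed_packet

    // Returns the number of chunks ("packets") that were written.
    static int streamInChunks(InputStream in, ByteArrayOutputStream out) throws IOException {
        byte[] buf = new byte[MAX_CHUNK];
        int packets = 0;
        int read;
        while ((read = in.read(buf)) != -1) {
            out.write(buf, 0, read); // stand-in for sending one wire packet
            packets++;
        }
        return packets;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[3 * MAX_CHUNK + 1];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        int packets = streamInChunks(new ByteArrayInputStream(data), sink);
        System.out.println("packets=" + packets + " bytes=" + sink.size());
    }
}
```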

      No rows were created and nothing was logged in the access or error log in the two cases that caused errors to occur.

      I have tested simulating imports with different file sizes; please see the attached Java file.
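      For reference, a throwaway generator along these lines can produce CSV input of a given approximate size (the two-column `a,b` layout matches the query above; the file name and target size are arbitrary):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for simulating imports of different sizes: builds a
// CSV (header plus two-column rows) until roughly targetBytes is reached.
public class CsvSizeSimulator {
    static String buildCsv(long targetBytes) {
        StringBuilder sb = new StringBuilder("a,b\n");
        long row = 0;
        while (sb.length() < targetBytes) {
            sb.append(row).append(",\"value-").append(row).append("\"\n");
            row++;
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("infile-test", ".csv");
        Files.write(file, buildCsv(16 * 1024).getBytes(StandardCharsets.UTF_8));
        System.out.println("wrote " + Files.size(file) + " bytes to " + file);
    }
}
```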

      Expected behavior:
      The INFILE should be streamed to the server(s) without any issues, even if it does not fit within a single packet or in the JVM heap.

      Background information:
      We have a system that generates data from several sources. This data is spread across several tables and databases. Users of this data need to add their own additional information (columns) to records that we have created. This additional data is something only the user has knowledge of; it is not available in the sources our data is generated from. The additional data is only required when the user performs queries, so the part of the system that is generating data is never aware of it. One way to think of it is that the user maps every unique value of some column to some (non-unique) value of their choosing.

      The process works like this: the user identifies some property (column) available in our databases they want to add additional data to, for example cellphone model. The user then creates a file in a special format that maps some number of cellphone models to the data they wish to add. The file does not have to contain every value that exists in the database; it may even reference values that currently do not exist. The user then notifies our system that they wish to add the data.

      The system in turn does some processing of the file, for example transforming the values to the appropriate format. The end result is that the system produces a CSV file that will then be inserted into one or more MariaDB databases. The data is placed in its own table; our original data is never modified, partly because the data generation part of the system has no knowledge of the additional data.

      After the data has been inserted into the databases, the system will perform a JOIN on the created tables whenever a user asks for this additional data.

      Currently this data insertion is performed by batching inserts in small volumes. This is too slow, so we are attempting to use LOAD DATA INFILE instead.
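      The batched-insert approach mentioned above can be sketched without a database by treating the flush callback as a stand-in for PreparedStatement.executeBatch() (the batch size and String row type are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// DB-free sketch of inserting rows in small batches: rows accumulate until
// batchSize is reached, then the whole batch is handed to the flusher.
public class SmallBatchInserter {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    private final Consumer<List<String>> flusher; // stand-in for executeBatch()

    public SmallBatchInserter(int batchSize, Consumer<List<String>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    public void add(String row) {
        pending.add(row);
        if (pending.size() >= batchSize) flush();
    }

    public void flush() {
        if (pending.isEmpty()) return;
        flusher.accept(new ArrayList<>(pending));
        pending.clear();
    }

    public static void main(String[] args) {
        AtomicInteger flushes = new AtomicInteger();
        SmallBatchInserter inserter = new SmallBatchInserter(100, batch -> flushes.incrementAndGet());
        for (int i = 0; i < 250; i++) inserter.add(i + ",value-" + i);
        inserter.flush(); // flush the trailing partial batch
        System.out.println("flushes=" + flushes.get()); // 100 + 100 + 50 -> 3 flushes
    }
}
```

      Each flush is a server round trip, which is why this scales poorly next to a single streamed LOAD DATA INFILE.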

        Attachments

          Activity

            People

            Assignee:
            diego dupin Diego Dupin
            Reporter:
            andreaspalm.polystar Andreas Palm
            Votes:
            0
            Watchers:
            2

