[MCOL-1267] cpimport: simultaneous calls for the same table fail because of lock Created: 2018-03-13 Updated: 2019-05-16 Resolved: 2019-05-01 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | cpimport |
| Affects Version/s: | 1.0.13, 1.1.2 |
| Fix Version/s: | Icebox |
| Type: | Bug | Priority: | Major |
| Reporter: | David Hall (Inactive) | Assignee: | David Hall (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Sprint: | 2018-06, 2018-07, 2018-08, 2018-09, 2018-10, 2018-11, 2018-12, 2018-13, 2018-14, 2018-15, 2018-16, 2018-17, 2018-18, 2018-19, 2018-20, 2018-21 |
| Description |
|
If two cpimport jobs for the same single table are submitted close enough together that they run concurrently, there's a chance both will fail, possibly because of a bug in the lock system. The theory is that cpimport without a -P <pm list> option tries to lock the table without a dbroot list. TableLockServer::lock() then doesn't find the lock that already exists because there's no dbroot overlap. Recommend we modify TableLockInfo::overlaps(), which lock() uses, to return true if the dbrootlist is empty(). |
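The suspected gap can be sketched in a few lines. This is only an illustration of the theory above, not the actual ColumnStore code: `LockInfoSketch`, `overlapsStrict()` and `overlapsProposed()` are hypothetical names, and treating an empty list on either side as a conflict is one reading of the recommendation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for TableLockInfo; the field names are assumptions.
struct LockInfoSketch {
    std::uint32_t tableOID;
    std::vector<std::uint32_t> dbrootList;
};

// Suspected current behaviour: a conflict is reported only when the held
// lock and the request share at least one dbroot. Two empty lists never
// share an element, so a second lock request would slip through.
bool overlapsStrict(const LockInfoSketch& held,
                    const std::vector<std::uint32_t>& requested) {
    return std::any_of(requested.begin(), requested.end(),
                       [&](std::uint32_t dbroot) {
                           return std::find(held.dbrootList.begin(),
                                            held.dbrootList.end(),
                                            dbroot) != held.dbrootList.end();
                       });
}

// Proposed behaviour (assumed interpretation): an empty dbroot list on
// either side is treated as covering every dbroot, so it always conflicts.
bool overlapsProposed(const LockInfoSketch& held,
                      const std::vector<std::uint32_t>& requested) {
    if (held.dbrootList.empty() || requested.empty())
        return true;
    return overlapsStrict(held, requested);
}
```

With a held lock that has no dbroot list and a request that also has none, `overlapsStrict()` reports no conflict while `overlapsProposed()` does; for non-empty lists the two functions agree.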
| Comments |
| Comment by David Hall (Inactive) [ 2018-03-21 ] |
|
I haven't been able to reproduce the problem. I have set up a system as close as possible to the one they're using, given my hardware constraints. Table locking always catches the problem. I do, however, understand what's happening to cause the duplicate cpimports. Looking at what they gave us, it is apparent that they are running a similar colxml | cpimport pair for two different databases at exactly the same time. They also use the same job number. This means both colxml runs write to the same job file, so whichever runs second (by lottery) wipes out the first job file. Both cpimports then run against the same resulting job file at the same time. While they have different directories for the temp data for each job, the place where the job file itself is written is determined by <WriteEngine><BulkRoot> in columnstore.xml. They have now corrected the problem of using the same job number, and the problem should not recur. However, it is still a mystery how they got past table locking. |
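The job-file collision can be illustrated with a small sketch. The path layout used here (`<BulkRoot>/job/Job_<id>.xml`) is an assumption made for illustration, and `jobFilePath()` is a hypothetical helper, not a ColumnStore function. The point is only that the database name plays no part in the path, so two jobs with the same number resolve to the same file.

```cpp
#include <string>

// Hypothetical helper: builds the job file path from the BulkRoot setting
// and the job number only. The "/job/Job_<id>.xml" layout is an assumption.
std::string jobFilePath(const std::string& bulkRoot, int jobId) {
    return bulkRoot + "/job/Job_" + std::to_string(jobId) + ".xml";
}
```

Calling `jobFilePath()` twice with the same BulkRoot and job number returns identical paths regardless of which database each colxml run targets, which is why distinct job numbers avoid the overwrite.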
| Comment by David Hall (Inactive) [ 2018-03-21 ] |
|
The problem appears to occur on a system with separate UM and PM nodes when concurrent colxml | cpimport jobs are run with the same job number. I haven't been able to reproduce the issue: table locking always catches the problem. Either there is some other factor I haven't figured out, or the timing must be different. |