[MCOL-930] Cannot execute queries Created: 2017-09-18  Updated: 2018-11-28

Status: Closed
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: 1.0.9
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jimmy Atauje Hidalgo Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Red Hat Enterprise Linux Server release 7.2 (Maipo)


Attachments: File columnstoreSupportReport.columnstore-1.tar.gz    
Issue Links:
Blocks
is blocked by MCOL-1278 occasionally IDB-2031: Blocks are mis... Closed
Duplicate
duplicates MCOL-1662 WriteEngine bulk methods do not versi... Closed
Relates
relates to MCOL-984 Error 1815 after several executions o... Closed

 Description   

When executing a simple summary query, it generally works but sometimes fails. Once a query fails,
it will not work again until I run "select calFlushCache();" or restart the ColumnStore system.

Is there a workaround for this?

My errors are like these:

[root@server6653 ~]# grep "error" /var/log/mariadb/columnstore/err.log

Sep 18 13:10:08 localhost PrimProc[29337]: 08.069259 |0|0|0| C 28 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/columncommand.cpp error on projectResultRG for oid 529299 lbid 1363416840: input rids 8192, output rids 8190
Sep 18 13:18:22 localhost PrimProc[29337]: 22.491177 |0|0|0| C 28 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/columncommand.cpp error on projectResultRG for oid 529318 lbid 1363475008: input rids 8192, output rids 8174
Sep 18 13:18:22 localhost PrimProc[29337]: 22.495557 |0|0|0| C 28 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/columncommand.cpp error on projectResultRG for oid 529318 lbid 1363475009: input rids 8192, output rids 8173
Sep 18 13:18:22 localhost PrimProc[29337]: 22.502594 |0|0|0| C 28 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/columncommand.cpp error on projectResultRG for oid 529318 lbid 1362927232: input rids 8192, output rids 8191
Sep 18 13:18:22 localhost PrimProc[29337]: 22.523705 |0|0|0| C 28 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/columncommand.cpp error on projectResultRG for oid 529318 lbid 1363475010: input rids 8192, output rids 8174
Sep 18 13:18:22 localhost PrimProc[29337]: 22.911010 |0|0|0| C 28 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/columncommand.cpp error on projectResultRG for oid 529318 lbid 1363475011: input rids 8192, output rids 8184
Sep 18 13:18:22 localhost PrimProc[29337]: 22.916014 |0|0|0| C 28 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/columncommand.cpp error on projectResultRG for oid 529318 lbid 1363475012: input rids 8192, output rids 8176
Sep 18 13:18:22 localhost PrimProc[29337]: 22.918348 |0|0|0| C 28 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/columncommand.cpp error on projectResultRG for oid 529318 lbid 1362927233: input rids 8192, output rids 8191
Sep 18 13:18:22 localhost PrimProc[29337]: 22.921142 |0|0|0| C 28 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/columncommand.cpp error on projectResultRG for oid 529318 lbid 1363475013: input rids 8192, output rids 8168
Sep 18 13:20:09 localhost PrimProc[29337]: 09.841498 |0|0|0| C 28 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/columncommand.cpp error on projectResultRG for oid 529299 lbid 1363417222: input rids 8192, output rids 8191
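
To see which column OIDs are affected and how many row IDs go missing per block, the lines above can be grouped by oid. The following is a minimal sketch, not part of ColumnStore: the function name summarize_project_errors and the regular expression are assumptions derived only from the sample lines in this ticket.

```python
import re
from collections import defaultdict

# Pattern for the PrimProc message shown above. The field layout is inferred
# from the sample lines in this ticket only, not from ColumnStore documentation.
PROJECT_ERROR = re.compile(
    r"error on projectResultRG for oid (?P<oid>\d+) lbid (?P<lbid>\d+): "
    r"input rids (?P<rids_in>\d+), output rids (?P<rids_out>\d+)"
)


def summarize_project_errors(lines):
    """Group projectResultRG errors by column OID.

    Returns {oid: {"errors": count, "missing_rids": total, "lbids": set}}.
    Lines that do not match the pattern are ignored.
    """
    summary = defaultdict(lambda: {"errors": 0, "missing_rids": 0, "lbids": set()})
    for line in lines:
        match = PROJECT_ERROR.search(line)
        if match is None:
            continue
        entry = summary[int(match["oid"])]
        entry["errors"] += 1
        # Difference between the rids requested and the rids actually projected.
        entry["missing_rids"] += int(match["rids_in"]) - int(match["rids_out"])
        entry["lbids"].add(int(match["lbid"]))
    return dict(summary)
```

Passing the lines of err.log to summarize_project_errors would give one entry per OID; in the excerpt above only OIDs 529299 and 529318 appear.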

Additionally, I get these errors at random times:

Sep 18 13:03:18 localhost [29337]: 18.682927 |0|0|0| W 28 CAL0000: BRP::getBlock(): got a BRM lookup error. LBID=5490343937 ver= SCN: 0#012 Txns: txn=0 vbFlg=0
Sep 18 13:03:18 localhost PrimProc[29337]: 18.688682 |0|0|0| W 28 CAL0000: IDB-2035: An internal error occurred. Check the error log file & contact support.
Sep 18 13:03:18 localhost PrimProc[29337]: 18.732647 |0|0|0| W 28 CAL0000: BRP::getBlock(): got a BRM lookup error. LBID=5490343937 ver= SCN: 0#012 Txns: txn=0 vbFlg=0
Sep 18 13:03:18 localhost PrimProc[29337]: 18.783959 |0|0|0| W 28 CAL0000: BRP::getBlock(): got a BRM lookup error. LBID=5490343937 ver= SCN: 0#012 Txns: txn=0 vbFlg=0



 Comments   
Comment by David Thompson (Inactive) [ 2017-09-19 ]

Hi, is it possible for you to attach or send over separately a support report, as that will give us the info we need to triage further:
https://mariadb.com/kb/en/library/system-troubleshooting-mariadb-columnstore/#mariadb-columnstore-support-tool

You can also send this directly to me if you don't feel comfortable attaching to this jira.

Also if you can describe the symptoms a bit more that might help. Will a given query work for some time and then fail or will a given query always fail and then make the system unstable? Anything else going on around the same time when problems occur? The logs in the support report may tell us this but always good to get a human perspective.

Comment by Jimmy Atauje Hidalgo [ 2017-09-27 ]

Hi, the error did not occur for several days, but today it happened again.

I have attached the support report. I removed some archived logs because the file was too big to attach.

The scenario is like this:

  • We store information in 1 table per day.
  • There is always a single thread executing LOAD DATA INFILE into the daily table.
  • At the same time there may be other threads doing aggregation on the daily table.
  • The error only shows up in these daily tables.
  • Table has approximately 1,800,000,000 rows at the end of the day.
  • The error doesn't affect performance.
  • The errors only go away if I restart the server or flush the cache, for example:

[root@elastic103 space]# mcsmysql -D cdrdatos

MariaDB [cdrdatos]> SELECT access_point_name_NI,SUM(duration) FROM cdr20170926 GROUP BY access_point_name_NI;
ERROR 1815 (HY000): Internal error: An unexpected condition within the query caused an internal processing error within InfiniDB. Please check the log files for more details. Additional Information: error in BatchPrimitiveProces

..... after server restart .....

MariaDB [cdrdatos]> SELECT access_point_name_NI,SUM(duration) FROM cdr20170926 GROUP BY access_point_name_NI;
+----------------------+---------------+
| access_point_name_NI | SUM(duration) |
+----------------------+---------------+
| record1              |          1000 |
| record2              |          2000 |
| record3              |          3000 |
| record4              |          4000 |
+----------------------+---------------+

Any suggestion is appreciated.
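
The flush-and-retry workaround described in this comment could be automated on the client side. This is only a sketch of a stopgap, not a fix for the underlying bug. It assumes a Python DB-API style cursor; the helper name query_with_cache_flush and the check for "1815" in the exception text are assumptions, and only calFlushCache() comes from this ticket.

```python
def query_with_cache_flush(cursor, query, retries=1):
    """Run query; if it fails with error 1815, flush the cache and retry.

    cursor is any DB-API style cursor (execute/fetchall). Errors other than
    1815, and a 1815 error on the last attempt, are re-raised unchanged.
    """
    for attempt in range(retries + 1):
        try:
            cursor.execute(query)
            return cursor.fetchall()
        except Exception as exc:
            # Assumption: the driver includes the error number in the message,
            # as in "ERROR 1815 (HY000): Internal error: ...".
            if "1815" not in str(exc) or attempt == retries:
                raise
            # Workaround named by the reporter: flush the ColumnStore cache.
            cursor.execute("SELECT calFlushCache()")
            cursor.fetchall()  # drain the result of the flush call
```

With the default retries=1, a failing query is attempted at most twice, with one cache flush in between.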

Comment by David Thompson (Inactive) [ 2017-09-27 ]

Thanks, that is definitely helpful. What I can see today is that around 12:44, ExeMgr crashed:
Sep 27 12:44:23 localhost joblist[652]: 23.827443 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.

You can also see this by the fact it has a later start date by running mcsadmin getProcessStatus (ProcMon will restart it if it detects a crash).

I don't know if it's the cause or just a long running query but the following query never finished:
Sep 27 12:41:36 localhost ExeMgr[1055]: 36.345880 |574921|0|0| D 16 CAL0041: Start SQL statement: select served_MSISDN fono, hour(DATE_ADD(record_opening_time, INTERVAL duration SECOND)) hora, sum(losd_time_usage) t, sum(losd_datavolume_fbc_uplink + losd_datavolume_fbc_downlink) datavolume from cdrdatos.cdr20170925 where DATE_ADD(record_opening_time, INTERVAL duration SECOND) BETWEEN '2017-09-25 00:00:00' and '2017-09-25 23:59:59' and served_MSISDN regexp '^519[0-9]{8}$' and (uli_ci + uli_sac + uli_ecgi) in ( 10900, 10901, 10902, 10903, 10908, 10909, 62201, 10905, 10906, 10907, 20361, 20362, 20363, 20365, 20366, 20367, 57170, 57176, 57177, 62200, 62206, 62207, 130839143, 130839141, 130839142, 132152421 ) and (uli_lac + uli_tai) in (1183,1412,14121) group by served_MSISDN, hour(DATE_ADD(record_opening_time, INTERVAL duration SECOND)) order by served_MSISDN, hour(DATE_ADD(record_opening_time, INTERVAL duration SECOND)) asc limit 10; |cdrdatos|

This is likely tied in with the original errors you reported, but I don't see any reported between 12:41 and 12:47. If it's OK to do on your system, can you try running the above query manually and see if that triggers any error? It's possible it could be exhausting memory, but you have a very beefy machine so this seems less likely.

One thing I can say is that we have fixed some similar stability bugs in 1.0.10 and 1.0.11, so it may be worthwhile to consider upgrading to 1.0.11 first.

Comment by Sasha V [ 2017-10-24 ]

I reported a similar issue for version 1.1.0.

As a possible workaround, one may try replacing the SMALLINT columns with INT.

Comment by Jimmy Atauje Hidalgo [ 2017-10-30 ]

Thanks for your help! I have many data types in my table (bigint, tinyint, smallint, char, varchar), and reading your issue, someone said that it will happen whenever the columns are not all the same. Does this mean all my columns need to be integer, for example?

Comment by Sasha V [ 2017-10-30 ]

According to LinuxJedi, all columns should have the same width (in bytes) as the first column. Since "one cannot currently use CHANGE COLUMN to change the definition of that column," I used DROP/CREATE to change the column data type.
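
The width rule quoted above can be checked against a table definition before rebuilding it. The sketch below only restates that rule as reported in this comment and has not been verified against ColumnStore internals. The helper name columns_differing_from_first is hypothetical, and the width table assumes the on-disk width equals the nominal MariaDB integer width; CHAR and VARCHAR are left out because their width depends on the declared length.

```python
# Nominal byte widths of MariaDB integer types. Assumption: ColumnStore's
# on-disk width matches these values. CHAR/VARCHAR are deliberately omitted.
INT_WIDTHS = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def columns_differing_from_first(columns):
    """Return the names of columns whose width differs from the first column.

    columns is a list of (name, type) pairs in table order. Columns whose type
    is not in INT_WIDTHS are also reported, since their width is unknown here.
    """
    if not columns:
        return []
    first_width = INT_WIDTHS.get(columns[0][1].upper())
    mismatched = []
    for name, col_type in columns[1:]:
        width = INT_WIDTHS.get(col_type.upper())
        if width is None or width != first_width:
            mismatched.append(name)
    return mismatched
```

For a definition such as [("a", "SMALLINT"), ("b", "INT"), ("c", "SMALLINT")] (made-up column names and types), only "b" would be reported.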

Comment by Jimmy Atauje Hidalgo [ 2017-10-30 ]

Guys, correct me if I am wrong:
I found the commit: basically, what I need to confirm from this is whether, if I put the smallest data type of my table in the first column, the error in calculating the row id will be gone?

Comment by Sasha V [ 2017-12-12 ]

That is a great example of thinking outside the box! However, as discussed in MCOL-1105, ADD COLUMN can only put the column at the end of the table.

Generated at Thu Feb 08 02:24:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.