[MCOL-3976] Amazon S3 needs to support use of IAM roles Created: 2020-05-01  Updated: 2021-05-03  Resolved: 2020-10-12

Status: Closed
Project: MariaDB ColumnStore
Component/s: Storage Manager
Affects Version/s: None
Fix Version/s: 5.4.1

Type: New Feature Priority: Critical
Reporter: David Hill (Inactive) Assignee: Patrick LeBlanc (Inactive)
Resolution: Done Votes: 0
Labels: CustomerRequest

Issue Links:
Relates
relates to MCOL-4386 IAM Role on EC2 instances Closed
Sprint: 2020-8

 Description   

ColumnStore on Amazon, in the 1.2 and earlier releases, supported both access keys and IAM roles. The customer uses IAM roles in production and requests that S3 storage support an IAM role setup.

Currently, only access keys are supported, via the configuration file.

From the customer:

It must be adapted to use an IAM instance profile or we can not use S3 storage.



 Comments   
Comment by David Hill (Inactive) [ 2020-06-10 ]

Amazon offers two ways to authorize configuration updates: access keys, or roles, which let the customer control the permission level and the commands a user can run. This customer has always used a role instead of the access key, which is the only method currently supported via the SM config file.

So they need role configuration to be supported, or they will not be able to use the 1.4+ releases.
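
For illustration, here is a minimal Python sketch of how the two credential styles could be told apart in an SM-style config file. It is not StorageManager code. The key names (iam_role_name, aws_access_key_id, aws_secret_access_key, sts_region, sts_endpoint) are taken from the StorageManager error message quoted later in this ticket; the [S3] section name and the helper auth_mode() are assumptions made for the example. That error message lists the access-key fields alongside the role name, so this ticket does not settle whether keys are still needed when a role is configured; the sketch only reports which fields are set.

```python
import configparser

# Hypothetical config fragment. The [S3] section name is an assumption;
# the key names come from the StorageManager error message in this ticket,
# and the values are the ones used in the QA test run.
SAMPLE = """
[S3]
iam_role_name = S3-test
sts_region = us-west-2
sts_endpoint = sts.us-west-2.amazonaws.com
"""


def auth_mode(cnf_text: str, section: str = "S3") -> str:
    """Return 'iam_role', 'access_key', or 'none' for the given config text."""
    parser = configparser.ConfigParser()
    parser.read_string(cnf_text)
    if not parser.has_section(section):
        return "none"
    s3 = parser[section]
    # A non-empty role name is treated as role-based configuration.
    if s3.get("iam_role_name", "").strip():
        return "iam_role"
    # Otherwise both halves of the access key pair must be present.
    key_id = s3.get("aws_access_key_id", "").strip()
    secret = s3.get("aws_secret_access_key", "").strip()
    if key_id and secret:
        return "access_key"
    return "none"


print(auth_mode(SAMPLE))  # prints "iam_role"
```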

Comment by David Hill (Inactive) [ 2020-06-10 ]

More info that might help:

https://mariadb.com/kb/en/installing-and-configuring-a-columnstore-system-using-the-amazon-ami/#amazon-iam-role

Comment by Daniel Lee (Inactive) [ 2020-10-10 ]

Build tested: Drone builds. ColumnStore: 907, cmapi: 283

Tested local cpimport and S3 source cpimport
IAM role used: S3-test
STS Region: us-west-2
STS Endpoint: sts.us-west-2.amazonaws.com

With STS region and endpoint specified

Installation: PASSED
Sanity test (LDI using local cpimport, 1gb lineitem), S3 cpimport 1gb orders. PASSED

AWS S3 storage is used, bucket = dleeqadbroot1, objects = 295, total size = 493284230 (470 MB)

1st 1g lineitem local cpimport test successful
2nd lineitem local cpimport test FAILED

[centos8:root~]# /usr/bin/cpimport mytest lineitem /data/qa/source/dbt3/1g/lineitem.tbl
2020-10-09 23:16:50 (7098) INFO : Running distributed import (mode 1) on all PMs...
2020-10-09 23:25:24 (7098) INFO : For table mytest.lineitem: 6001215 rows processed and 6001215 rows inserted.
2020-10-09 23:25:24 (7098) INFO : Bulk load completed, total run time : 514.093 seconds

[centos8:root~]# /usr/bin/cpimport mytest lineitem /data/qa/source/dbt3/1g/lineitem.tbl
2020-10-10 00:21:21 (8190) INFO : Running distributed import (mode 1) on all PMs...
2020-10-10 00:22:44 (8190) ERR : Received a Cpimport Failure from PM1
2020-10-10 00:22:44 (8190) INFO : Please verify error log files in PM1
2020-10-10 00:22:44 (8190) INFO : Canceling outstanding cpimports

[centos8:root~]# cat err.log
Oct 10 00:21:55 centos-8 StorageManager[5845]: S3Storage::getConnection(): ERROR: ms3_init_assume_role. Verify iam_role_name = S3-test, aws_access_key_id, aws_secret_access_key values. Also check sts_region and sts_endpoint if configured.
Oct 10 00:21:55 centos-8 StorageManager[5845]: S3Storage::getConnection(): ms3_error: server says 'Couldn't resolve host name' role name = S3-test
Oct 10 00:21:57 centos-8 configcpp[8218]: 57.292114 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 00:21:58 centos-8 configcpp[8218]: 58.296044 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 00:21:59 centos-8 configcpp[8218]: 59.296465 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
.
.
.
.
Oct 10 00:22:41 centos-8 configcpp[8218]: 41.525869 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 00:22:42 centos-8 configcpp[8218]: 42.527265 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 00:22:42 centos-8 cpimport.bin[8218]: 42.527779 |0|0|0| E 34 CAL0087: BulkLoad Error: Backup error writing backup for column OID-3032; DBRoot-1; partition-0; segment-0; Unable to rename compressed chunk bulk backup file. ; Success
Oct 10 00:22:42 centos-8 cpimport.bin[8218]: 42.528037 |0|0|0| E 34 CAL0087: BulkLoad Error: Error in pre-processing the job file for table mytest.lineitem
Oct 10 00:22:42 centos-8 writeengineserver[6210]: 42.558702 |0|0|0| E 32 CAL0000: pushing data : PIPE error .........Broken pipe
Oct 10 00:22:44 centos-8 writeengineserver[6210]: 44.670002 |0|0|0| E 32 CAL0000: 9765 : cpimport exit on failure (signal -1)
Oct 10 00:22:44 centos-8 writeenginesplit[8190]: 44.672718 |0|0|0| E 33 CAL0000: #033[0;31mReceived a Cpimport Failure from PM1#033[0m
Oct 10 00:22:44 centos-8 writeenginesplit[8190]: 44.673207 |0|0|0| E 33 CAL0087: BulkLoad Error: #033[0;31mReceived a Cpimport Failure from PM1#033[0m
Oct 10 00:25:49 centos-8 configcpp[6033]: 49.396311 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 00:25:50 centos-8 configcpp[6033]: 50.413124 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
.
.
.
Oct 10 00:48:56 centos-8 configcpp[6149]: 56.574334 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 00:48:57 centos-8 configcpp[6149]: 57.576610 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 00:48:58 centos-8 configcpp[6149]: 58.578203 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'

On PM1, from ‘top’ command

4447 root 20 0 1929636 63044 20600 S 2.0 0.3 2:08.18 python3
8190 root 20 0 1653100 41312 23824 S 0.7 0.2 0:05.11 cpimport
572 root 20 0 94188 15360 13980 S 0.3 0.1 0:02.22 systemd-journal
7041 root 20 0 0 0 0 I 0.3 0.0 0:02.82 kworker/0:2-events
8584 root 20 0 153256 6128 4764 S 0.3 0.0 0:00.01 sshd
8614 root 20 0 64516 4496 3828 R 0.3 0.0 0:00.09 top

There is not much activity on PM1, and the failed cpimport job returned to the system prompt after 26 minutes with a core-dumped error:

[centos8:root~]# /usr/bin/cpimport mytest lineitem /data/qa/source/dbt3/1g/lineitem.tbl
2020-10-10 00:21:21 (8190) INFO : Running distributed import (mode 1) on all PMs...
2020-10-10 00:22:44 (8190) ERR : Received a Cpimport Failure from PM1
2020-10-10 00:22:44 (8190) INFO : Please verify error log files in PM1
2020-10-10 00:22:44 (8190) INFO : Canceling outstanding cpimports

caught an exception: Table lock save file failure
terminate called after throwing an instance of 'std::runtime_error'
what(): Table lock save file failure
Aborted (core dumped)

Comment by Daniel Lee (Inactive) [ 2020-10-10 ]

Tried another test run and got a different error on the first cpimport:

[centos8:root~]# /usr/bin/cpimport mytest lineitem /data/qa/source/dbt3/1g/lineitem.tbl
2020-10-10 01:43:24 (7409) INFO : Running distributed import (mode 1) on all PMs...
caught an exception: Table lock save file failure
terminate called after throwing an instance of 'std::runtime_error'
what(): Table lock save file failure
Aborted (core dumped)

This time the err.log file did not contain the IAM error:

[centos8:root~]# cat err.log
Oct 10 01:45:08 centos-8 configcpp[6042]: 08.335435 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 01:45:09 centos-8 configcpp[6042]: 09.337178 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 01:45:10 centos-8 configcpp[6042]: 10.343533 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
.
.
.
Oct 10 01:50:01 centos-8 configcpp[6042]: 01.794403 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 01:50:01 centos-8 writeengine[6179]: 01.929063 |0|0|0| E 19 CAL0001: SplitterReadThread::operator: Broken Pipe
Oct 10 01:50:02 centos-8 configcpp[6042]: 02.820562 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 01:50:03 centos-8 configcpp[6042]: 03.826773 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'

Comment by Patrick LeBlanc (Inactive) [ 2020-10-12 ]

To me it looks like SM is down. That's really the only reason SocketPool will return a 'connection refused' error.
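
A quick way to check that reading against an err.log like the ones above is to count the 'Connection refused' lines per process and pull out whatever StorageManager itself logged. The following Python sketch does that under the assumption that lines follow the syslog layout shown in this ticket; the regular expression and the summarize() helper are illustrative, not part of ColumnStore.

```python
import re
from collections import Counter

# Matches syslog-style lines such as:
#   Oct 10 00:21:57 centos-8 configcpp[8218]: ... got 'Connection refused'
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+(?P<proc>[\w.]+)\[\d+\]:\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Count 'Connection refused' lines per process and collect StorageManager messages."""
    refused = Counter()
    sm_messages = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip elision dots, blank lines, anything not in syslog form
        if "Connection refused" in m["msg"]:
            refused[m["proc"]] += 1
        if m["proc"] == "StorageManager":
            sm_messages.append(m["msg"])
    return refused, sm_messages


# Three lines copied from the err.log excerpt in this ticket.
SAMPLE_LOG = """\
Oct 10 00:21:55 centos-8 StorageManager[5845]: S3Storage::getConnection(): ms3_error: server says 'Couldn't resolve host name' role name = S3-test
Oct 10 00:21:57 centos-8 configcpp[8218]: 57.292114 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Oct 10 00:21:58 centos-8 configcpp[8218]: 58.296044 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
"""

refused, sm_messages = summarize(SAMPLE_LOG.splitlines())
print(dict(refused))
print(sm_messages)
```

If StorageManager messages appear shortly before a run of refused connections, as in Daniel's first log, that is consistent with SM having gone down at that point; the sketch does not prove it.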

I tried for 3 hours to reproduce this problem with the specified engine/cmapi builds (engine 907, cmapi 283) using a tarballed playbook from Jose. I'm told this is how to try to reproduce it. Anyway, I couldn't get far enough in the playbook to be able to reproduce this. There were just too many other problems in those two builds.

I did successfully run everything using the current builds (engine 922 & cmapi 315) twice. I believe I reproduced the problem by causing ms3_init_assume_role() to fail, and logged it as MCOL-4347. I saw SM crash with an assertion failure. The only missing bit of info in the logs above is the output of journalctl for the SM unit. That would include the above, plus info about the assertion failure.

In general, this feature seems to be working. It's not great that a failure to assume a role (via misconfiguration in my case or because of a DNS failure in Daniel's case) causes an SM crash, but I thought about it, and the user's experience will be the same as if it didn't crash. The user gets a mess of errors both ways. A little investigation (like getting the journal output or running testS3Connection) tells them exactly what the problem was. The main difference is that the assertion failure can cause a core-dump, which may fill up the disk if SM keeps getting restarted.
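
For the DNS case specifically, one extra check is whether the configured STS endpoint resolves from the PM at all. Below is a minimal Python sketch of that check, independent of testS3Connection and of SM; can_resolve() is a hypothetical helper, and the resolver parameter exists only so the function can be exercised without network access.

```python
import socket


def can_resolve(host: str, port: int = 443, resolver=socket.getaddrinfo) -> bool:
    """Return True if the host name resolves, False on a DNS lookup failure."""
    try:
        return len(resolver(host, port)) > 0
    except socket.gaierror:
        # Raised by getaddrinfo when the name cannot be resolved.
        return False


# Example call, using the endpoint from the QA run (performs a real lookup):
#   can_resolve("sts.us-west-2.amazonaws.com")
```

A False result here would match the 'Couldn't resolve host name' message in the StorageManager log; a True result only shows that name resolution works at the moment of the check.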

IMO this isn't a show-stopper under the circumstances, and I confirmed that with Todd. We'll follow up on it via the ticket I logged above.
