[MDEV-23740] ssl connection fails when server and client certs signed by same CA Created: 2020-09-16 Updated: 2021-03-24 Resolved: 2021-03-15 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | SSL |
| Affects Version/s: | 10.2.33 |
| Fix Version/s: | 10.2.38, 10.3.29, 10.4.19, 10.5.10 |
| Type: | Bug | Priority: | Major |
| Reporter: | Charles Lane | Assignee: | Vladislav Vaintroub |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | contribution | ||
| Environment: |
RHEL/SL 7.8, openssl 1.0.2k |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
"vanilla" ssl connections work, but assigning a cert to the server and a client cert causes "ssl protocol failures", sometimes obscure version mismatch errors. On investigation it turns out the that problem is in viossl.c and how ssl errors are handled. (attaching segment of mysqld.trace of the working, patched version...note how X509_R_CERT_ALREADY_IN_HASH_TABLE It took a couple of iterations to clean up these errors, because there was some race conditions on when the errors came in vs. when they get handled in the code. Solution: add ERR_clear_error() before each SSL_(read|write|etc) calls. And check for whatever errors are in the "queue" in the ssl_should_retry routine. The mysqld.trace shows that sometimes several errors are queued up, so just dismissing the first one won't be enough. Patch attached; it has some extra whitespace changes, and extra DBUG_PRINT's for diagnostics NOTE: v10.6 viossl.c has the same code in the 10.2 viossl.c, with minor changes that do not alter the problem noted in this bugreport |
| Comments |
| Comment by Charles Lane [ 2020-09-16 ] | ||||||||||
|
possibly related? | ||||||||||
| Comment by Charles Lane [ 2020-09-18 ] | ||||||||||
|
Additional: one underlying cause of the X509_R_CERT_ALREADY_IN_HASH_TABLE error that triggers this issue may be from the openssl-1.0.2k library. A check in openssl-1.1.1g doesn't have the code that throws this particular error message. Unfortunately Scientific Linux 7.8 doesn't yet have updated package distributions for openssl or mariadb; the openssl upgrade would probably require that many other packages also be upgraded. SL7.8 is moving to Centos8.0, which is likely to be a big, disruptive upgrade. So that's the backstory on "why this fix is needed", but clearing out the ssl error stack is still what's need to conform to the openssl library documentation. | ||||||||||
| Comment by Vladislav Vaintroub [ 2020-10-07 ] | ||||||||||
|
I do not think that clearing out SSL error stack every time is what OpenSSL people envisioned. Clearing after error, yes, but not prior to each call of every openssl function, this would be unpractical. Nobody does that. | ||||||||||
| Comment by Vladislav Vaintroub [ 2020-10-07 ] | ||||||||||
|
https://github.com/curl/curl/pull/2182/commits/549a585ca112120ddf4e564f850e7bbed098adaf#diff-8386d5872bc6e847805a8b6d63f8bf1aR91 , this is how curl handles this error. It is a tiny local fix, which I would prefer to decorating all openssl calls with ERR_clear_error. | ||||||||||
| Comment by Vladislav Vaintroub [ 2020-10-08 ] | ||||||||||
|
celane , any comment on the above? Also is this really "server and client certs signed by same CA" problem as in the title? Our test suite runs with client and server certificate signed by the same CA, and it was never a problem. What we never tried , I guess, is using servers certificate and server's private key on client. I pushed a test change adding such case, to see if it makes problems on any box in our CI(we have a lot of old Linuxen, and old OpenSSLs in there) | ||||||||||
| Comment by Charles Lane [ 2020-10-09 ] | ||||||||||
|
Yes, the curl approach is good, wish I had seen that before! Regarding all the ERR_clear_error...do we know for sure that openssl is going to report all errors into the error stack before returning from an openssl call? So if you want ot remove the ERR_clear_error calls, okay, but it might result in hard-to-diagnose rare and intermittent ssl errors. Might just want to #ifdef them for a debugging mode. | ||||||||||
| Comment by Vladislav Vaintroub [ 2020-10-12 ] | ||||||||||
|
celane, thank you. While OpenSSL fuzzy on how exactly the error model should work, I do not think any OSS project that is using it would cautiously clear all errors prior to any OpenSSL call. This, to me, would be comparable to setting errno to 0 before calling anything POSIX, or SetLastError(ERROR_SUCCESS) for the Windows API. Nobody does it. Clearing error "after error" maybe alright. If you try out https://github.com/mariadb/server/commit/582e482d63876fef72af672776fffe7fec2bee4b , would that fix the X509_R_CERT_ALREADY_IN_HASH_TABLE ? a minimal fix is preferred, especially for the older versions. How to debug hard-to-find openssl bugs, we can think about it in 10.6 or so. | ||||||||||
| Comment by Charles Lane [ 2020-10-12 ] | ||||||||||
|
One of the problems that I ran into during the diagnosis of this problem was having the openssl connection fail intermittently. That is, same mariadb server, same mysql client, same certificates, etc....try it once, it works, try it again it fails with a non-helpful error message. About 60% failure rate. This was between nearby machines on the same net, running the same SL versions. That, to me, looked very much like a race condition somewhere inside openssl. Tried debugging, but you really have to recompile openssl without optimization, and then you're testing something with different timing characteristics. The patch I posted got it to work reliably. So yeah, "nobody does that" for clearling the errors...except for things like real-time programming, when error conditions can get arrive asynchronously. That is what I suspected was happening with openssl: the connection started okay, exchaging keys, etc, but then the next step was verifying the certs which mean certs were I'll generate some cert/key pairs (one for client, one for server) for you to test with, but ..I generally "fix" the ip and hostname for the server cert. What ip and hostname do you need for that, and where should I put the cert/key pairs? | ||||||||||
| Comment by Vladislav Vaintroub [ 2020-10-12 ] | ||||||||||
|
for our test suite, we generate certificate like that https://github.com/MariaDB/server/blob/10.6/mysql-test/lib/generate-ssl-certs.sh . It uses CN=localhost in subject of the server certificate is enough to make the validation pass, if client does validation at all (it is optional, non-default). There is one test cert with subjectAltName, and IP address and DNS name, though usually we just use server-cert.pem for the server and client-cert.pem for the client. Anyway, 127.0.0.1 and localhost are good enough for a test certificate. Thank you! | ||||||||||
| Comment by Charles Lane [ 2020-10-13 ] | ||||||||||
|
just added a tgz with user and client certs (pem format), keys, and my CA (and sub-CA) cert. The 'bundle' has CA+subCA appended. Please let me know when you're done with testing so I can revoke the user/client certs. | ||||||||||
| Comment by Vladislav Vaintroub [ 2020-10-19 ] | ||||||||||
|
my initial attempts did not lead to success. a) start server with the provided server certificate d) do c) in parallel using 4 scripts running in background
server is started with
None of that brought the error in description. | ||||||||||
| Comment by Vladislav Vaintroub [ 2020-10-19 ] | ||||||||||
|
[wlad@localhost mysql-test]$ cat /etc/*-release REDHAT_BUGZILLA_PRODUCT="Scientific Linux 7" | ||||||||||
| Comment by Vladislav Vaintroub [ 2020-10-19 ] | ||||||||||
|
celane this part of the patch makes me uncomfortable. If SSL_ERROR_SYSCALL happens to be persistent, it will run into endless loop. Maybe it is not harmless. | ||||||||||
| Comment by Vladislav Vaintroub [ 2020-10-19 ] | ||||||||||
|
I downgraded Prio from Critical to Major, due to lack of evidence something like this would happen often (and partly, due to my own inability to reproduce this, even with some effort) | ||||||||||
| Comment by Vladislav Vaintroub [ 2020-10-19 ] | ||||||||||
|
I'd still like to know if https://github.com/mariadb/server/commit/582e482d63876fef72af672776fffe7fec2bee4b fixes the error in the environment where it happens (as mentioned above, I got some trouble recreating the error). celane, could you check that patch https://github.com/mariadb/server/commit/582e482d63876fef72af672776fffe7fec2bee4b ? | ||||||||||
| Comment by Charles Lane [ 2020-10-20 ] | ||||||||||
|
Finally grabbed a clone of the git repository of 10.6 and did a "cherry pick" of your patch. Compiled okay, and put it on one of my hosts that were having the ssl problem (with 10.2) But since the problem is only with openssl < 1.1.0, and (it seems) only with whatever I expect that in the next few months I'll be upgrading all my SL7 hosts to Centos8, | ||||||||||
| Comment by Charles Lane [ 2020-10-26 ] | ||||||||||
|
OKAY! My 'bundle.crl' = 'subCA.crl' + 'rootCA.crl' for applications that only allow giving them a single CRL file. I don't recall seeing anything in the SSL docs saying 'don't do that', regarding the CRL files, but the real problem is So I strongly suggest putting a everywhere in viossl.c where a non-harmless ssl error is thrown, so that at least |