[MDEV-23740] ssl connection fails when server and client certs signed by same CA Created: 2020-09-16  Updated: 2021-03-24  Resolved: 2021-03-15

Status: Closed
Project: MariaDB Server
Component/s: SSL
Affects Version/s: 10.2.33
Fix Version/s: 10.2.38, 10.3.29, 10.4.19, 10.5.10

Type: Bug Priority: Major
Reporter: Charles Lane Assignee: Vladislav Vaintroub
Resolution: Fixed Votes: 0
Labels: contribution
Environment:

RHEL/SL 7.8, openssl 1.0.2k


Attachments: File MariaDB.patch     File mariaSSL.tgz     File mysqld.trace    
Issue Links:
Relates
relates to MDEV-20404 Validation of SSL server certificate ... Closed

 Description   

"vanilla" ssl connections work, but assigning a cert to the server and a client cert causes "ssl protocol failures", sometimes obscure version mismatch errors.

On investigation it turns out the that problem is in viossl.c and how ssl errors are handled.
ssl_should_retry calls SSL_get_error to check for the current error, but the documentation for SSL_get_error says:
"The current thread's
error queue must be empty before the TLS/SSL I/O operation is
attempted, or SSL_get_error() will not work reliably."

(attaching segment of mysqld.trace of the working, patched version...note how X509_R_CERT_ALREADY_IN_HASH_TABLE
errors get dropped by the patched code; otherwise they would cause a fatal error in the SSL connection, even though they are harmless, coming from the X509 certs being taken from both the server and the client)

It took a couple of iterations to clean up these errors, because there was some race conditions on when the errors came in vs. when they get handled in the code.

Solution: add ERR_clear_error() before each SSL_(read|write|etc) calls. And check for whatever errors are in the "queue" in the ssl_should_retry routine. The mysqld.trace shows that sometimes several errors are queued up, so just dismissing the first one won't be enough.

Patch attached; it has some extra whitespace changes, and extra DBUG_PRINT's for diagnostics

NOTE: v10.6 viossl.c has the same code in the 10.2 viossl.c, with minor changes that do not alter the problem noted in this bugreport



 Comments   
Comment by Charles Lane [ 2020-09-16 ]

possibly related?

Comment by Charles Lane [ 2020-09-18 ]

Additional: one underlying cause of the X509_R_CERT_ALREADY_IN_HASH_TABLE error that triggers this issue may be from the openssl-1.0.2k library. A check in openssl-1.1.1g doesn't have the code that throws this particular error message.

Unfortunately Scientific Linux 7.8 doesn't yet have updated package distributions for openssl or mariadb; the openssl upgrade would probably require that many other packages also be upgraded. SL7.8 is moving to Centos8.0, which is likely to be a big, disruptive upgrade.

So that's the backstory on "why this fix is needed", but clearing out the ssl error stack is still what's need to conform to the openssl library documentation.

Comment by Vladislav Vaintroub [ 2020-10-07 ]

I do not think that clearing out SSL error stack every time is what OpenSSL people envisioned. Clearing after error, yes, but not prior to each call of every openssl function, this would be unpractical. Nobody does that.

Comment by Vladislav Vaintroub [ 2020-10-07 ]

https://github.com/curl/curl/pull/2182/commits/549a585ca112120ddf4e564f850e7bbed098adaf#diff-8386d5872bc6e847805a8b6d63f8bf1aR91 , this is how curl handles this error. It is a tiny local fix, which I would prefer to decorating all openssl calls with ERR_clear_error.

Comment by Vladislav Vaintroub [ 2020-10-08 ]

celane , any comment on the above? Also is this really "server and client certs signed by same CA" problem as in the title?

Our test suite runs with client and server certificate signed by the same CA, and it was never a problem. What we never tried , I guess, is using servers certificate and server's private key on client. I pushed a test change adding such case, to see if it makes problems on any box in our CI(we have a lot of old Linuxen, and old OpenSSLs in there)

Comment by Charles Lane [ 2020-10-09 ]

Yes, the curl approach is good, wish I had seen that before!
One part of my patch does something similar, looking for the specific "already in hash table" error and ignoring it.
It also seems that openssl 1.1.1 removes this error, and that the change was backported to 1.1.0, so it's only 1.0.1 that is a problem.

Regarding all the ERR_clear_error...do we know for sure that openssl is going to report all errors into the error stack before returning from an openssl call?
I got the (possibly incorrect) impression that there would be 'prompt' synchronous errors, but could also be delayed errors that are added to the error stack
later, so that clearing the error stack immediately before the next call would be advisable. Probably not necessary, 99.9% of the time, but there could be edge
cases. Clearing the errors should be fast, not requiring any crypto or network activity, but it does clutter the code a bit.

So if you want ot remove the ERR_clear_error calls, okay, but it might result in hard-to-diagnose rare and intermittent ssl errors. Might just want to #ifdef them for a debugging mode.

Comment by Vladislav Vaintroub [ 2020-10-12 ]

celane, thank you. While OpenSSL fuzzy on how exactly the error model should work, I do not think any OSS project that is using it would cautiously clear all errors prior to any OpenSSL call. This, to me, would be comparable to setting errno to 0 before calling anything POSIX, or SetLastError(ERROR_SUCCESS) for the Windows API. Nobody does it.

Clearing error "after error" maybe alright.
I tried a couple of things, but failed to trigger the error, on a Scientific Linux 7.8 in a VM with OpenSSL 1.0.2k . There might be something special about your certificate (I think we'd have heard about this error much earlier, if it was common). I tried to pass the same ca certificates, and even the same public certificate and key for client and server, but there was no error. Maybe you can share self-signed certificate (and CA cert) with which you can trigger it, that would be great.

If you try out https://github.com/mariadb/server/commit/582e482d63876fef72af672776fffe7fec2bee4b , would that fix the X509_R_CERT_ALREADY_IN_HASH_TABLE ? a minimal fix is preferred, especially for the older versions. How to debug hard-to-find openssl bugs, we can think about it in 10.6 or so.

Comment by Charles Lane [ 2020-10-12 ]

One of the problems that I ran into during the diagnosis of this problem was having the openssl connection fail intermittently.

That is, same mariadb server, same mysql client, same certificates, etc....try it once, it works, try it again it fails with a non-helpful error message. About 60% failure rate. This was between nearby machines on the same net, running the same SL versions.

That, to me, looked very much like a race condition somewhere inside openssl. Tried debugging, but you really have to recompile openssl without optimization, and then you're testing something with different timing characteristics. The patch I posted got it to work reliably.

So yeah, "nobody does that" for clearling the errors...except for things like real-time programming, when error conditions can get arrive asynchronously. That is what I suspected was happening with openssl: the connection started okay, exchaging keys, etc, but then the next step was verifying the certs which mean certs were
requested and recieved which then triggered the "cert already in cache" error after the original openssl call returned. Again, this is a suspicion, because I couldn't
find a way to debug it directly.

I'll generate some cert/key pairs (one for client, one for server) for you to test with, but ..I generally "fix" the ip and hostname for the server cert. What ip and hostname do you need for that, and where should I put the cert/key pairs?

Comment by Vladislav Vaintroub [ 2020-10-12 ]

for our test suite, we generate certificate like that https://github.com/MariaDB/server/blob/10.6/mysql-test/lib/generate-ssl-certs.sh . It uses CN=localhost in subject of the server certificate is enough to make the validation pass, if client does validation at all (it is optional, non-default).

There is one test cert with subjectAltName, and IP address and DNS name, though usually we just use server-cert.pem for the server and client-cert.pem for the client.

Anyway, 127.0.0.1 and localhost are good enough for a test certificate.
You might be right about real-time programming, I do not know those specifics, but our server is not an example of the real-time programming.

Thank you!

Comment by Charles Lane [ 2020-10-13 ]

just added a tgz with user and client certs (pem format), keys, and my CA (and sub-CA) cert. The 'bundle' has CA+subCA appended.

Please let me know when you're done with testing so I can revoke the user/client certs.

Comment by Vladislav Vaintroub [ 2020-10-19 ]

my initial attempts did not lead to success.
What I did was

a) start server with the provided server certificate
b) connect to server with client using given client certificate with following variations:
b1) just user root
b2) user created with REQUIRE SSL, or REQUIRE X509, or REQUIRE ISSUER
c) for users in b) , do 1000 connections

d) do c) in parallel using 4 scripts running in background

[wlad@localhost mysql-test]$ cat many_connections.sh
#!/bin/bash
set -e
for i in {1..1000}
do
../client/mysql -uVV --ssl-verify-server-cert --ssl-ca=/home/wlad/lane5_bundle.crt --ssl-cert=/home/wlad/mariadb_client_VV.crt --ssl-key=/home/wlad/mariadb_client_VV.key -e "DO 1"
done

server is started with

ssl-ca=/home/wlad/lane5_bundle.crt
ssl-cert=/home/wlad/mariadb_server.crt
ssl-key=/home/wlad/mariadb_server.key

None of that brought the error in description.

Comment by Vladislav Vaintroub [ 2020-10-19 ]

[wlad@localhost mysql-test]$ cat /etc/*-release
NAME="Scientific Linux"
VERSION="7.8 (Nitrogen)"
ID="scientific"
ID_LIKE="rhel centos fedora"
VERSION_ID="7.8"
PRETTY_NAME="Scientific Linux 7.8 (Nitrogen)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:scientificlinux:scientificlinux:7.8:GA"
HOME_URL="http://www.scientificlinux.org//"
BUG_REPORT_URL="mailto:scientific-linux-devel@listserv.fnal.gov"

REDHAT_BUGZILLA_PRODUCT="Scientific Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.8
REDHAT_SUPPORT_PRODUCT="Scientific Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.8"
Scientific Linux release 7.8 (Nitrogen)
Scientific Linux release 7.8 (Nitrogen)
Scientific Linux release 7.8 (Nitrogen)

Comment by Vladislav Vaintroub [ 2020-10-19 ]

celane
+ case SSL_ERROR_SYSCALL:
+ DBUG_PRINT("info",("harmless? ssl_error=%d",ssl_error));
+ ERR_print_errors_fp(DBUG_FILE);
+ break;

this part of the patch makes me uncomfortable. If SSL_ERROR_SYSCALL happens to be persistent, it will run into endless loop. Maybe it is not harmless.

Comment by Vladislav Vaintroub [ 2020-10-19 ]

I downgraded Prio from Critical to Major, due to lack of evidence something like this would happen often (and partly, due to my own inability to reproduce this, even with some effort)

Comment by Vladislav Vaintroub [ 2020-10-19 ]

I'd still like to know if https://github.com/mariadb/server/commit/582e482d63876fef72af672776fffe7fec2bee4b fixes the error in the environment where it happens (as mentioned above, I got some trouble recreating the error).
If it does, maybe it makes sense just to push a small local fix for some obscure? alleged? OpenSSL bug . Ignoring SSL_ERROR_SYSCALL seems to dangerous for a fix, to me.

celane, could you check that patch https://github.com/mariadb/server/commit/582e482d63876fef72af672776fffe7fec2bee4b ?

Comment by Charles Lane [ 2020-10-20 ]

Finally grabbed a clone of the git repository of 10.6 and did a "cherry pick" of your patch. Compiled okay, and put it on one of my hosts that were having the ssl problem (with 10.2)
And your patch does seem to fix the problem I was seeing.

But since the problem is only with openssl < 1.1.0, and (it seems) only with whatever
openssl configuration that I'm running, a minimal patch is certainly the way to go.

I expect that in the next few months I'll be upgrading all my SL7 hosts to Centos8,
(with mariadb 10.6 and openssl 1.1.1) which I've tried and had zero problems.

Comment by Charles Lane [ 2020-10-26 ]

OKAY!
I think finally I can give a more complete diagnosis.
The 'cert with duplicate hash' error is deceptive: it wasn't actually a duplicate certificate, it was a duplicate CRL.
(which is probably why the 'testing' didn't see the problem, you have to set up a CRL directory, and add to it both my
'subCA.crl' and 'bundle.crl'), and tell MariaDB to use a 'crl directory'.

My 'bundle.crl' = 'subCA.crl' + 'rootCA.crl' for applications that only allow giving them a single CRL file.
So moving that bundle.crl to a different directory, and the OPENSSL-1.0.2 error goes away, and MariaDB
is happy with my SSL connections.

I don't recall seeing anything in the SSL docs saying 'don't do that', regarding the CRL files, but the real problem is
that SSL errors are obscure at best, and too often dropped on the floor.

So I strongly suggest putting a
DBUG_PRINT("error", ("ssl_error: %s", ERR_error_string(ssl_error,NULL)));

everywhere in viossl.c where a non-harmless ssl error is thrown, so that at least
a --debug enabled MariaDB will see the errors. Even better is to write the error to where it
will show up in the MariaDB error log file.

Generated at Thu Feb 08 09:24:41 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.