Debugging defects in production is a common task for many software engineers. I’m privileged that I have had the chance to solve some very intesting issues that have come up in a production like setting. Many defects are caught in a qa/staging environment but every once in a while we get those issues in production that we just couldn’t catch anywhere else. Until now I haven’t spent any time writing up any of the glorious details. I’m going to do my best to share some of the tools/techniques I use to figure out issues.

Today, I was working on an issue related to upgrading the certificates used for SAML based single sign on authentication. It was a bit fun to debug.

tldr;

A new certificate was generated with a serial number that was greater than 24 digits. This caused xsd validation to fail for SAMLResponses due to a defect in libxml2.

The long story

It started with a message in chat:

hmmm AD sso login keeps giving me ‘Invalid account. ‘ error. Will check tomo.

There’s quite a bit to unpack in that one innocent message. AD refers to active directory. sso refers to our SAML based authentication.

The first step is to try to reproduce the defect. So I used Remote Desktop to connect to our Active Directory host in QA. I then launched the ADFS tool to look at the configured Service Providers.

I double checked the metadata information was correct, and that the Claim rules were correct. This article was very helpful with figuring out how to configure Active Directory.

I realized that the identity provider (active directory) verifies each incoming SAMLRequest using the entity id described in the request. The assertion consumer service url is specified as the load balancer host. This means that the response would go to any number of hosts and it would be difficult to predict which host would receive the request.

If we had aggregated logs, this would be less of an issue because I would grep/tail the logs at the aggregated location. This wasn’t the case so I had to think of another way to debug this.

I needed to isolate the reqeust/response to a single host. This would make it much easier to tail logs and watch what’s happening. So I chose one of the hosts that are load balanced and found a way to get it evicted from the load balancer rotation.

In our HAProxy configuration we have the following options:

backend console
  ...
  option httpchk GET /
  http-check disable-on-404
  ...

The above tells haproxy to do a health check against each of the hosts that it is load balancing, and if they start to return a 404 then they get evicted from the rotation.

I can trigger an eviction by placing a file at a specific location on disk, so I did that for one of the hosts.

Now, that I have my host I configured that single host as a unique service provider in active directory and started to test the login. I logged in to my account, configured the Active Directory host as the identity provider, then began to test the login.

I found the following error in the logs:

["Invalid SAML Response. Not match the saml-schema-protocol-2.0.xsd"]

I needed to determine where this error was coming from.

$ grep 'Not match the saml-schema-protocol-2.0.xsd' -rn ruby/2.2.0/gems/
ruby/2.2.0/gems/ruby-saml-1.4.1/test/response_test.rb:77:        assert_includes @response.errors, "Invalid SAML Response. Not match the saml-schema-protocol-2.0.xsd"
ruby/2.2.0/gems/ruby-saml-1.4.1/test/response_test.rb:341:          error_msg = "Invalid SAML Response. Not match the saml-schema-protocol-2.0.xsd"
ruby/2.2.0/gems/ruby-saml-1.4.1/test/response_test.rb:512:        assert response_without_attributes.errors.include? "Invalid SAML Response. Not match the saml-schema-protocol-2.0.xsd"
ruby/2.2.0/gems/ruby-saml-1.4.1/test/slo_logoutrequest_test.rb:43:        assert_equal ['Invalid SAML Logout Request. Not match the saml-schema-protocol-2.0.xsd'], invalid_logout_request.errors
ruby/2.2.0/gems/ruby-saml-1.4.1/test/slo_logoutrequest_test.rb:45:        assert_equal ['Invalid SAML Logout Request. Not match the saml-schema-protocol-2.0.xsd'], invalid_logout_request.errors
ruby/2.2.0/gems/ruby-saml-1.4.1/test/slo_logoutrequest_test.rb:210:        assert_includes invalid_logout_request.errors, "Invalid SAML Logout Request. Not match the saml-schema-protocol-2.0.xsd"
ruby/2.2.0/gems/ruby-saml-1.4.1/test/logoutresponse_test.rb:39:          assert_includes logoutresponse.errors, "Invalid SAML Logout Response. Not match the saml-schema-protocol-2.0.xsd"
ruby/2.2.0/gems/ruby-saml-1.4.1/lib/onelogin/ruby-saml/logoutresponse.rb:151:          return append_error("Invalid SAML Logout Response. Not match the saml-schema-protocol-2.0.xsd")
ruby/2.2.0/gems/ruby-saml-1.4.1/lib/onelogin/ruby-saml/response.rb:390:        structure_error_msg = "Invalid SAML Response. Not match the saml-schema-protocol-2.0.xsd"
ruby/2.2.0/gems/ruby-saml-1.4.1/lib/onelogin/ruby-saml/slo_logoutrequest.rb:191:          return append_error("Invalid SAML Logout Request. Not match the saml-schema-protocol-2.0.xsd")

It looks like the error message was coming from this line.

I wanted to see if I could reproduce the defect in a test. I took a raw dump of a SAMLRequest and used the same keys to reproduce the defect as a test.

describe "encrypted response" do
    ...
    it { expect(subject.valid_response?(saml_response, request)).to be_truthy }
  end

The test failed, but I still didn’t know why. I opened up the gem to add some additional logging.

$ EDITOR=vim bundle open ruby-saml

I added some additional logging to here and was able to see the error.

.#<Nokogiri::XML::SyntaxError: Element '{http://www.w3.org/2000/09/xmldsig#}X509SerialNumber': '214988299817646988277320106009336023856415744126' is not a valid value of the atomic type 'xs:integer'.>

Strange? So the XSD validation was failing on X509SerialNumber with a serial number of 214988299817646988277320106009336023856415744126.

I found the xsd validation here. The XSD validation checks to make sure the X509SerialNumber is an xml integer. I compared this with the version from w3.org

λ wget https://raw.githubusercontent.com/onelogin/ruby-saml/ce52a317477996807d87539264cfe2c1779f52f1/lib/schemas/xmldsig-core-schema.xsd
--2016-11-29 14:45:19--  https://raw.githubusercontent.com/onelogin/ruby-saml/ce52a317477996807d87539264cfe2c1779f52f1/lib/schemas/xmldsig-core-schema.xsd
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10004 (9.8K) [text/plain]
Saving to: ‘xmldsig-core-schema.xsd’

xmldsig-core-schema.xsd                         100%[======================================================================================================>]   9.77K  --.-KB/s    in 0s

2016-11-29 14:45:19 (54.5 MB/s) - ‘xmldsig-core-schema.xsd’ saved [10004/10004]

λ mv xmldsig-core-schema.xsd xmldsig-core-schema.xsd.orig

λ wget https://www.w3.org/TR/xmldsig-core/xmldsig-core-schema.xsd
--2016-11-29 14:45:28--  https://www.w3.org/TR/xmldsig-core/xmldsig-core-schema.xsd
Resolving www.w3.org (www.w3.org)... 128.30.52.100
Connecting to www.w3.org (www.w3.org)|128.30.52.100|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 10293 (10K), 289 remaining [application/xml]
Saving to: ‘xmldsig-core-schema.xsd’

xmldsig-core-schema.xsd                         100%[++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++==>]  10.05K  --.-KB/s    in 0s

2016-11-29 14:45:44 (23.0 MB/s) - ‘xmldsig-core-schema.xsd’ saved [10293/10293]

The diff looked minimal:

λ diff xmldsig-core-schema.xsd xmldsig-core-schema.xsd.orig
2,10d1
< <!DOCTYPE schema
<   PUBLIC "-//W3C//DTD XMLSchema 200102//EN" "http://www.w3.org/2001/XMLSchema.dtd"
<  [
<    <!ATTLIST schema
<      xmlns:ds CDATA #FIXED "http://www.w3.org/2000/09/xmldsig#">
<    <!ENTITY dsig 'http://www.w3.org/2000/09/xmldsig#'>
<    <!ENTITY % p ''>
<    <!ENTITY % s ''>
<   ]>
318a310,322
> type="ds:RSAKeyValueType"/>
> <complexType name="RSAKeyValueType">
>   <sequence>
>     <element name="Modulus" type="ds:CryptoBinary"/>
>     <element name="Exponent" type="ds:CryptoBinary"/>
>   </sequence>
> </complexType>
>
> <!-- End KeyValue Element-types -->
>
> <!-- End Signature -->
>
> </schema>

Where was this serial number coming from? Which serial number is this? I took a peak at the certificate embedded in our SAML metadata.

$ openssl x509 -in <snip>.com_chained.crt -text -noout | head -n 5
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            25:a8:68:a5:d2:42:b9:a6:d9:5d:15:0d:e4:ef:5f:f5:1e:7b:a0:7e

I needed to convert that serial number to a numeric value. irb to the rescue.

λ irb
irb(main):001:0> '25:a8:68:a5:d2:42:b9:a6:d9:5d:15:0d:e4:ef:5f:f5:1e:7b:a0:7e'.gsub(/:/, '').to_i(16)
=> 214988299817646988277320106009336023856415744126

Bingo! We have a match. Our certificate was recently updated and seems to now be failing the xsd validation.

If I look at the serial number we can see that it is 20 bytes and 48 digits.

irb(main):002:0> '25:a8:68:a5:d2:42:b9:a6:d9:5d:15:0d:e4:ef:5f:f5:1e:7b:a0:7e'.gsub(/:/, '').to_i(16).size
=> 20
irb(main):003:0> '25:a8:68:a5:d2:42:b9:a6:d9:5d:15:0d:e4:ef:5f:f5:1e:7b:a0:7e'.gsub(/:/, '').to_i(16).to_s.size
=> 48

I took a look at the xml spec for the integer type. Which I found here.

The spec says:

The ·value space· of integer is the infinite set {…,-2,-1,0,1,2,…}.

I interpreted this as to mean that there is no upper bound for the size of serial number. (The first source I found was a link to w3schools, which actually had the incorrect information and lead me astray. Thankfully, a colleague helped me catch that.)

That still doesn’t explain the defect, until I came across this.

I tested this theory, by changing the datatype from “integer” to “string” and that worked. I came across this thread which refers to this bug.

Next, is to figure out how to fix this.

Happy hacking!

comments powered by Disqus