For those who didn't notice, I've recently rebuilt my entire website from the ground up and switched web hosts from Ghost(Pro) to Netlify, and I learned a lot from the experience. Not everything went smoothly, but most of the problems were ones I had expected. This post, however, is about those I had not, all of which occurred during the final deployment steps. Here's a technical breakdown of what happened.
Inability to edit CNAME record
The web host switch should have been completely seamless. The plan was simply to edit the CNAME record for this website with my DNS provider to point to the new host, then wait for DNS propagation. The whole operation was supposed to take one minute. It instead took more than a day, because my DNS provider, DNSimple, simply refused to complete the operation, giving me this error message:
CNAME must be the only record on a subdomain
Now, if you're familiar enough with DNS records, you'll know that this statement is, in fact, correct. What made absolutely no sense, however, was that I was trying to edit an existing CNAME record, not create one. Even after deleting the old CNAME record, I could not recreate it. Meanwhile, my secondary DNS provider, Cloudflare, had no issues whatsoever, so I used it as a temporary measure.
After talking with DNSimple's support, we realized that my old configuration had actually been violating the standard for more than a year, as I had a useless TXT record on the same subdomain as the CNAME record. DNSimple had implemented a safeguard to prevent such a bad configuration, but apparently only after I had created mine, and they never went back to notify their customers to fix their existing configurations. Cloudflare, meanwhile, just allowed it without a second thought, adding to the initial confusion.
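The safeguard in question boils down to one rule from RFC 1034: a CNAME must be the only record at a name. A minimal sketch of such a validation check (the function and record tuples are illustrative, not any provider's actual API):

```python
def cname_conflicts(records):
    """Return the record types that illegally coexist with a CNAME.

    `records` is a list of (name, type) tuples for a single subdomain.
    An empty result means the configuration is valid.
    """
    types = [rtype for _, rtype in records]
    if "CNAME" not in types:
        return []  # no CNAME, so the exclusivity rule doesn't apply
    return [t for t in types if t != "CNAME"]

# My old, invalid configuration: a leftover TXT record alongside the CNAME.
old_config = [("www.debigare.com", "CNAME"), ("www.debigare.com", "TXT")]
print(cname_conflicts(old_config))  # → ['TXT']
```

A provider enforcing this on every write, as DNSimple now does, would have rejected my TXT record back when I created it instead of blocking an unrelated edit a year later.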
Thankfully, having a backup DNS service available saved the migration, but it unfortunately led to the next problem.
Missing DNSSEC records
A few hours after fixing the CNAME issue, I tried pushing a minor update. The deployment operation usually takes less than 10 seconds and had always worked flawlessly during my testing phase. This time, however, the operation was not completing, and there were no logs to indicate that anything was going on. Worse, a quick check showed that my website was suddenly offline around the entire world, even though I had never had any downtime during previous deployments.
By a bad stroke of luck, I had hit two completely unrelated problems at the same time. The first was that my new host, Netlify, suddenly had an issue with their build pipeline affecting all of their customers; they acknowledged the issue a few minutes after I experienced the problem, so I naturally assumed it was the cause of the site being offline and that I had to wait for this issue outside of my control to be resolved before redeploying. Unfortunately, once that problem was fixed, my website stubbornly remained offline even after a new successful deployment.
The second problem, the one actually causing the downtime, was that my DNS configuration had suddenly become invalid, a fact that took me several hours to discover. When querying the com DNS zone, browsers were informed that the debigare.com DNS zone was supposed to be protected with DNSSEC, but they could not find the required records within that zone, even though I had not made any change in that regard.
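This failure mode can be illustrated with a toy classification of what a validating resolver does (a sketch with made-up data structures, not a real validator): when the parent zone publishes a DS record but the child zone serves no DNSKEY/RRSIG records, the zone is treated as bogus and resolution fails outright, rather than falling back to plain unsigned DNS.

```python
def dnssec_state(parent_has_ds, child_record_types):
    """Classify a zone's DNSSEC status the way a validating resolver would."""
    child_signed = {"DNSKEY", "RRSIG"} <= set(child_record_types)
    if parent_has_ds and child_signed:
        return "secure"
    if not parent_has_ds:
        return "insecure"  # unsigned delegation: still resolves normally
    return "bogus"         # DS present but signatures missing: SERVFAIL

# What happened to debigare.com: the com zone still advertised a DS,
# but the required DNSSEC records were gone from the child zone.
print(dnssec_state(True, ["SOA", "NS", "A"]))  # → bogus
```

The asymmetry is what makes this so painful: removing DNSSEC records without also removing the parent's DS record is strictly worse than never signing the zone at all.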
The cause turned out to be related to Cloudflare. Basically, I had completely forgotten that I had created a DS record with DNSimple (the primary DNS) using Cloudflare's public key (the secondary DNS) for the com zone delegation. When I shut down Cloudflare as my secondary DNS, it removed all of my other DNSSEC-related records, while DNSimple kept the old DNSSEC configuration active without warning. Deleting and regenerating the DNSSEC configuration exclusively with DNSimple immediately fixed the issue.
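For context, the DS record I had forgotten about is essentially a hash of the child zone's public key, published in the parent zone, which is why it pointed at a key that no longer existed once Cloudflare was gone. A minimal sketch of how that digest is computed (RFC 4034 §5.1.4, digest type 2 = SHA-256); the key bytes below are made up for illustration:

```python
import hashlib
import struct

def ds_digest_sha256(owner_name, dnskey_rdata):
    """Compute a DS record digest: SHA-256 over the owner name in
    canonical (lowercase, length-prefixed) wire format, followed by
    the DNSKEY RDATA."""
    wire = b""
    for label in owner_name.rstrip(".").lower().split("."):
        wire += struct.pack("B", len(label)) + label.encode("ascii")
    wire += b"\x00"  # zero-length root label terminates the name
    return hashlib.sha256(wire + dnskey_rdata).hexdigest().upper()

# Illustrative only: real DNSKEY RDATA is flags | protocol | algorithm | key.
fake_rdata = struct.pack("!HBB", 257, 3, 13) + b"\x01\x02\x03"
print(ds_digest_sha256("debigare.com", fake_rdata))
```

Because the digest is bound to one specific key, retiring the DNS provider that holds that key invalidates the DS record silently; nothing forces the parent zone to notice.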
As a final twist, one specific user in Asia complained that he was still unable to access the website after the DNSSEC fix. Fortunately, it turned out the issue was related to his own setup, but it did cost me a lot of investigation time to ensure everything was OK.
Certificate revocation delay violations
My entire website requires HTTPS to access, so users can have the best experience possible, and one of the protocol's requirements is a valid certificate deployed for the site, signed by a Certificate Authority (CA) trusted by web browsers.
With my old architecture, the organization responsible for generating certificates and controlling their private keys was Cloudflare. The reason was that my old host, Ghost(Pro), did not offer sufficient HTTPS protection, so I ended up using Cloudflare's Universal SSL feature as a front for it to partially mitigate the problem. Deploying this feature is also why I started using Cloudflare as a secondary DNS in the first place. As my new host, Netlify, has proper HTTPS support, Cloudflare was no longer needed and became a liability.
However, disabling Cloudflare's Universal SSL feature did not have the outcome I had expected. Instead of revoking the old certificates that had not yet expired, it only removed them from its system. The difference is that Cloudflare could still use the certificates and private keys it had already generated to impersonate my website until their expiration. This became an issue the moment I stopped redirecting requests for my domain to Cloudflare, as doing so only prevents them from generating new trusted certificates, not from using old ones. Looking at official Certificate Transparency logs, I realized that some of the affected certificates were still valid for more than 6 months, so I had to do something about it.
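Gauging the exposure from a CT log entry is just a comparison of its notAfter date against today. A sketch with hypothetical dates standing in for the actual log entries:

```python
from datetime import datetime, timezone

def months_remaining(not_after, now):
    """Approximate months of validity a certificate has left."""
    return max(0, (not_after - now).days) / 30.44  # mean month length

# Hypothetical leftover certificate, still valid well past the migration.
now = datetime(2018, 9, 6, tzinfo=timezone.utc)
not_after = datetime(2019, 4, 1, tzinfo=timezone.utc)
print(round(months_remaining(not_after, now), 1))  # → 6.8
```

Any certificate with a nonzero result here is a live impersonation window for whoever holds its private key, which is why revocation, not mere deletion, was the only acceptable outcome.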
I first attempted to contact Cloudflare's support on September 6 at 7:43 UTC to ask them to revoke the certificates themselves. This only led to irrelevant responses from confused customer support agents who had no idea what I was talking about, and it appeared to go nowhere. I eventually got a response on September 11 at 5:53 UTC saying that they would ask the CAs to perform the revocation, but that was after I had already done so myself, and I never got a status report back afterwards.
Indeed, before that last reply, I had decided to go to the source and request revocation directly from the CAs that signed the affected certificates. For those unfamiliar with the CA ecosystem, CAs are bound by the CA/Browser Forum Baseline Requirements, which dictate what they need to do to remain trusted by web browser vendors. A violation can cause a CA to lose that trust, rendering their signatures worthless and destroying their business model. Of particular importance is the following requirement (from version 1.6.0):
4.9.1.1. Reasons for Revoking a Subscriber Certificate
The CA SHALL revoke a Certificate within 24 hours if one or more of the following occurs:
- The CA is made aware of any circumstance indicating that use of a Fully Qualified Domain Name or IP address in the Certificate is no longer legally permitted (e.g. a court or arbitrator has revoked a Domain Name Registrant’s right to use the Domain Name, a relevant licensing or services agreement between the Domain Name Registrant and the Applicant has terminated, or the Domain Name Registrant has failed to renew the Domain Name)
There were two CAs affected by this issue. The vast majority of the certificates were signed by Comodo CA, and the rest by DigiCert. I did not run into any issues with DigiCert (they in fact proactively verified my claim with Cloudflare and revoked the certificates before I even had the chance to attempt their domain ownership challenge), but Comodo CA was another story entirely.
My first request to Comodo CA to revoke the certificates for debigare.com and all of its subdomains was on September 8 at 4:37 UTC. I did not get a reply until September 9 at 15:53 UTC, stating that the certificates had been revoked. Not only was this past the 24-hour requirement, it was also false, as only the most recent certificates had been revoked, not all of them. I pointed out their mistake on September 10 at 5:55 UTC with a full list of affected certificates, in case my initial request had been unclear, and never got a reply back. I did, however, notice that the certificates eventually got revoked on September 13 at 16:04 UTC according to their Certificate Transparency logs, a fact I only discovered on September 15. Assuming the log is correct, that would be a delay of more than 3 days after notifying them of the incomplete revocation, more than 5 days after my initial request, and more than a week until I finally noticed the problem was fixed. It's also worth noting that throughout this entire series of events, Comodo CA never requested proof of domain ownership beyond the evidence I initially provided, so that cannot explain the delays.
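Tallying the delays above is simple timestamp arithmetic against the Baseline Requirements' 24-hour window. Using the dates from my own case:

```python
from datetime import datetime, timedelta, timezone

def revocation_overdue(reported_at, revoked_at, deadline=timedelta(hours=24)):
    """Return how far past the Baseline Requirements' 24-hour window a
    revocation landed (zero or negative means the CA was compliant)."""
    return (revoked_at - reported_at) - deadline

# My initial report to Comodo CA vs. their first (partial) revocation.
reported = datetime(2018, 9, 8, 4, 37, tzinfo=timezone.utc)
revoked = datetime(2018, 9, 9, 15, 53, tzinfo=timezone.utc)
print(revocation_overdue(reported, revoked))  # → 11:16:00 past the deadline
```

And that is the generous reading: measuring against the September 13 date from the CT logs, when the revocation was actually complete, puts them more than four and a half days over.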
There are still parts of the story that I'm unsure about. The first is why I never got any feedback once revocation was complete. The second is why my initial evidence of domain ownership was apparently sufficient for Comodo CA but not for DigiCert. In this regard, the only evidence I provided to both of them was that the email address I used to contact them matched the contact information on my website. (My emails were protected with SPF, DKIM and DMARC for authenticity.) For some reason, DigiCert believed that this evidence did not meet the Baseline Requirements for my request, a claim that I am currently unable to verify, as I couldn't find the relevant section.
I've sent a violation report to Mozilla's dev-security-policy mailing list to notify them about the Comodo CA issue as I'm publishing this. Not sure what kind of effect it will have, but we'll see.
Update 2018-12-17: Sectigo (formerly Comodo CA) has published the incident report... more than 2 months past Mozilla's soft deadline to do so. Its content appears deceptive to me as well, but I'll let you judge for yourself.
Forged WHOIS data
During the certificate revocation process described above, I realized at some point that the WHOIS record for my domain had recently changed and no longer matched my personal information. Instead, it was filled with fake data that obviously didn't correspond to anything that could remotely be considered a real identity. The issue is that valid WHOIS data is an ICANN requirement: I had to make the record public in the first place to acquire my .com domain name, and I never used a privacy protection service as an intermediary to hide my identity, to avoid the possibility of it going rogue and taking over the domain. This issue prevented DigiCert from validating my domain ownership at a critical moment.
It turned out that my registrar, DNSimple, did not actually use its ICANN-accredited registrar status to register my domain name, but instead partnered with Enom to do so as a reseller. (No, I don't get it either.) Later, when the GDPR was about to come into effect, Enom decided to redact the WHOIS information of all of their clients, as they could not reliably determine which parts were in the scope of the legislation. This was done despite ICANN's continuing requirement to the contrary. The real issue, however, is not whether Enom made the right decision, but that neither Enom nor DNSimple ever made me aware of the change when it happened. Yep, I had been foiled by a lack of communication from months ago.
Fortunately, the correct solution here was to do nothing, but with all the issues occurring around the same time, it sure didn't help my blood pressure in the meantime.
I need a vacation from my vacation.