Monday, November 12, 2007

Launch Day

Launch at Oracle Open World

So, a 24-hour period to be forgotten. I had this troubling bug that I could not reproduce. Sometimes it would come up and sometimes it would not. It was Saturday night and I knew that I would be traveling all day Sunday. So I staged what I had, and all I needed to do was to trip the DNS switch and we would be live.

I started the journey relaxing, reading this month's Wired magazine. Then it came to me: a possible chain of events that could reproduce the bug. I fired up my laptop and, lo and behold, I could reproduce it. I knew that I had to be super careful, you know, with a last-minute change. I spent the rest of the flight looking at ways of fixing it without actually changing any code.

I got to San Francisco with a plan, but I had many, many things to do that night. Roughly, they were:

• Interview the street marketing team
• Pick up all the stuff (T-shirts, flyers, business cards) for the giveaways
• Fix the bug

I’ll talk about the marketing events in the next blog entry. So, I fixed the bug easily, but then I was faced with a dilemma. Do I stage it from my laptop, or do I push from the build server? Neither seemed an easy choice. Was my laptop 100% set up with all the config information to push the live image? Could I make the changes remotely and push from the build server?

After a couple of attempts (aka failures) I pushed the image from my laptop. It took 15 minutes or so and at last I thought I could take it easy. But I was wrong...

So what was the issue? By 1am (which my body was really saying was 4am) I could not find a good reason why the code, running on the hosted site at Amazon, was failing when I ran it from my laptop. I figured sleep would help.

4am, I wake up, can’t sleep any longer. I try running it from a remote machine and it appears to work. OK, then it's an issue with my laptop. I add a few extra debug statements and finally narrow it down to an extra Firefox plug-in on my laptop. It was failing when I uploaded a new archive. The archive got there OK and was processed OK, but an error was returned to the web client. Why was this different? Well, it is the only form in the application that was doing an HTTP PUT; since the rest of the application is Ajax, there were no other PUTs and GETs. The culprit? The Snap Shots plug-in was mucking with the return of the HTTP GET... grrrrrrrr.

So I thought that was the end. Then came my nightmare from DNS street. I switched the DNS entries as I had practiced. Not the best way to do this, but the easiest way... so I thought. As you know, it takes time for DNS to propagate, but it's pretty quick: 5-10 minutes. So I switched at 8.45am. By 9am the browser on my laptop was still pointing to the old static web page. By 9.15am it still was... panic!

I tried my build server, and thank god it was pointing to the correct new hosted page. I figured it was a DNS propagation delay. I continued to check all day, only to find that my laptop was still pointing at the old page. Holy shit, I thought. Then I connected to a VPN end-point that configured DNS and voilà! I could now see the hosted web page. The problem? Using the hotel’s ISP, there was no auto-config of DNS, so my laptop was using the cached entry pointing to the old IP. Once the laptop connected to a DNS server, it got the new IP and resolved correctly.
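For anyone who hits the same stale-DNS trap, here is a minimal sketch of a check that would have saved a day of panic: ask the machine's own resolver what it currently returns for the site and compare that with the new address. The function name, the hostname and the IP addresses are all hypothetical placeholders, not details from the launch. The check goes through the operating system's resolver, so it will not catch a cache held inside the browser itself.

```python
import socket


def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname):
    """Return True if this machine currently resolves hostname to expected_ip.

    resolver defaults to the operating system's lookup; it is a parameter
    so the check can be exercised without touching the network.
    """
    try:
        return resolver(hostname) == expected_ip
    except OSError:
        # socket.gaierror (no DNS server reachable, unknown host) is an OSError.
        return False
```

Running something like `resolves_to("www.example.com", "203.0.113.10")` on both the laptop and the build server after the switch would show straight away whether a mismatch is down to propagation or to one machine's local DNS setup.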

I think I aged 10 years in those 24 hours...
