2000 days ago we messed up – a diary

Written on 11 September 2014, 10:03 | Nordnet Tech


This is a transparent diary of a stressful month in the life of Nordnet’s Head of Core Development, and of what goes on behind the screens. IT is a department that works 24/7.

Not another framework success story

You’ll find a million blog posts about what frameworks someone managed to use successfully and new tech to fiddle with. So let’s not pour more generic nonsense into that category. Instead it’s probably much more interesting to know what’s really going on (or going wrong) in the real world, aka production.

Disclaimer

First of all, no-one spills these sorts of dark secrets without a disclaimer. Here’s mine: every mistake can be sugar-coated with all kinds of excuses or promises for the future, but sod it – we messed up, a lot. We’ve sort of stopped doing that, but not completely. The thing is that stuff goes wrong everywhere, and the only difference is the amount of makeup and hot air that follows to cover it up. I guess it’s proportional to company size.

Learning hell

You might have new code running for the first time or you might have reconfigured something, but it’s all the same: everybody dealing with critical systems will recognise the feeling of instant maximum angst, cold sweat and panic when you realise things blew up. It’s not pretty, and sometimes you see an IT veteran’s eyes glaze over into a thousand-yard stare, ending with a small shudder. That’s them remembering old mistakes.

Hopefully you learn and improve from such mistakes, and if you don’t, someone should fire you! In the name of transparency – here’s what we learned some 2000 days ago (thank god those days are over!).

The month

Day 1

One of the Oracle database nodes starts acting up and complaints are coming in about the site being slow. The node is restarted and all is well. Apparently a backup was running right in the middle of trading hours.

Day 3

One server fills up the disks and the site goes down. Everybody’s screaming.

Day 6

The database pools are tuned to handle more load, which blows up the Oracle cache, which in turn messes up the memory. Site down.

Day 7

Price feeds are delayed and it seems that the German price feed is silent. The full disk killed them four days ago. The feeds are manually restarted and all is well.

Day 8

Wintrade users are experiencing delays. US stock markets take a dive in the evening.

Day 9

Wintrade delays again. Wintrade is optimized in pure panic, but it doesn’t help. The disk gets full again and the site takes a dive.

Day 12

A lot of servers are reallocated in the server room and one database master node gets trashed. A new one is built.

Day 13

Trading in Germany is offline for some time due to an OS configuration error.

Day 14

An application server has a faulty clock which just keeps drifting.

Day 20

In the evening all application servers lose their database connections and the site is stuttering. Restarts are needed.

Day 21

Someone is upgrading the BIOS on some Solaris machines and by accident reboots the whole trading system – in the middle of trading. Not brilliant. A few moments later German trading breaks down. Apparently our connection provider keeled over because we sent too many orders at once.

Day 23

One Oracle database node blows up but reboots in a couple of minutes. During the night a timestamp conversion results in an overflow and a bunch of trades are not visible to customers.

Day 27

Just before the opening call one database cluster stops accepting new connections and the whole site stalls in seconds. An application server reboot revives the site. The funny thing, in a sad way, is that another cluster also had problems but the monitoring was configured to check the wrong cluster. Luckily both were having problems, so the alarms worked anyway. Jesus.

Day 28

Another database master node goes down and everything stops working. This time because someone accidentally shut it down in the server room. Later a bug in the trading systems halts order handling and a reset is needed.

Day 31

Wintrade delays again. This time severe. A desperate restart of price feeds is done but that only overloads the trading system and everything needs a restart. Downtime. People work through the night and find big performance hogs, which are repaired.


My god… All this in one month. Every incident is a gut-wrenching moment for the people involved and I can’t believe we got any development done with all that chaos, but we did! And quite a lot of it. You live and you learn!

//Tommi Lahdenperä, Head of Core Development


Comments



  3. Like yesterday for another company nowadays. :-p

    October 7, 2014 at 4:31 pm
  6. lol. sweet memories.

    September 11, 2014 at 11:40 pm
  7. What was the main policy conclusion (“lesson”) from all the problems that month? Some errors seem avoidable with more automation in deployment and maintenance (human factor), but others are clearly the result of external factors that will probably remain hard to control. Introduce more circuit breakers and decoupling?

    September 11, 2014 at 4:15 pm
    • “Introduce more circuit breakers and decoupling?”
      Amen. Decoupling is the word of god.

      September 11, 2014 at 11:44 pm
      • Very true. A structured workflow that includes quality control also helps, but there is a risk that you overcompensate for previous chaos and make releases way too complicated. Which of course we did.

        Automation is key. It can also help reduce human error.

        September 12, 2014 at 4:31 pm