Outage: Master LDAP (resolved)

Mon 20th May 15:43 - 19:55

Services Affected: All services

Description: The Master LDAP test and the DNS server checks are both reporting DOWN. Our operations team is working on it.

UPDATE: LDAP and services are now restored. We apologize for the inconvenience here.

UPDATE Post-mortem 5/23/2013:

Outage Duration: 15:28 UTC - 15:35 UTC, 16:11 UTC - 18:08 UTC, 18:50 UTC - 19:09 UTC [Total 2.13 hrs]
 
Problem statement: Access to most CloudForge services was affected because the LDAP server was unavailable during three separate periods between 15:28 and 19:09 UTC. The initial outage, caused by a chassis failure in our datacenter, had a ripple effect: our LDAP server accumulated too many connections and had to be rebooted. Our backup server, due to a misconfiguration, mirrored the problem faced by the primary server. Recovery took longer than expected because the database needed cleanup.
 
Remediation: CollabNet corrected the configuration to allow additional connections and avoid this problem in the future. CollabNet will also work with the hosting service provider to reconfigure the server architecture, with the aim of providing better failover performance.
 
Regards,
Mohan Achar
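
For readers who want to tally the downtime themselves, the outage windows listed in the post-mortem can be summed with a few lines of Python. This is only a minimal sketch: the window times are copied from the post-mortem above, while the helper names (window_minutes, total_minutes) are made up for illustration and are not part of any CollabNet tooling.

```python
from datetime import datetime

# Outage windows (UTC) exactly as listed in the post-mortem.
WINDOWS = [("15:28", "15:35"), ("16:11", "18:08"), ("18:50", "19:09")]


def window_minutes(start: str, end: str) -> int:
    """Length of one same-day HH:MM window, in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def total_minutes(windows) -> int:
    """Sum of all window lengths, in minutes."""
    return sum(window_minutes(start, end) for start, end in windows)


if __name__ == "__main__":
    minutes = total_minutes(WINDOWS)
    print(f"{minutes} min = {minutes // 60} h {minutes % 60} min")
```

The sketch assumes every window starts and ends on the same day, which holds for the three periods reported here.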

50 Comments

  • 0
    Jesse Yowell

    Looks like I spoke too soon... it was up, and now it is down. Sorry all... I've alerted the engineers.

  • 0
    Jesse Yowell

    Looks like the last SVN server is back up -- I think we might officially be back up

  • 0
    Shannon Massman

    Yes, we've been there; at an organization that doesn't have the availability they advertise, and those companies deservedly lose business. We aren't hacking on individuals, rather the organization that made claims that aren't met. We aren't saying people aren't doing the best with the resources they have currently, we are saying that after the organization already surpassed their downtime commitment in the first week of the year, they should have reset expectations or found other means to improve uptime.

  • 0
    Travis Romney

    SVN is still broken for me too

  • 0
    Andy Hsiung

    Svn works for me from the command line. However, if I want to browse the repository, it won't connect.

  • 0
    Jesse Yowell

    All, here is the update from our Director of Customer Service, Mohan:

    Outage Duration: 15:28 UTC - 15:35 UTC, 16:11 UTC - 18:08 UTC, 18:50 UTC - 19:09 UTC [Total 2.13 hrs]
     
    Problem statement: Access to most CloudForge services was affected because the LDAP server was unavailable during three separate periods between 15:28 and 19:09 UTC. The initial outage, caused by a chassis failure in our datacenter, had a ripple effect: our LDAP server accumulated too many connections and had to be rebooted. Our backup server, due to a misconfiguration, mirrored the problem faced by the primary server. Recovery took longer than expected because the database needed cleanup.
     
    Remediation: CollabNet corrected the configuration to allow additional connections and avoid this problem in the future. CollabNet will also work with the hosting service provider to reconfigure the server architecture, with the aim of providing better failover performance.
     
    Regards,
    Mohan Achar

  • 0
    Jesse Yowell

    Sorry all, obviously this has been a huge pain for everyone. We didn't expect the LDAP slave to fail when our main LDAP server was unreachable. Shannon is correct; there is no excuse. Unfortunately, a lot of the time was waiting for engineers to resolve this, and by the time I sent out an ETA update to the support staff the problem was solved roughly 15 minutes later. We apologize that it took so long to resolve something that could have been easily avoided.

  • 0
    Dmytro BAZULIN

    Same here, SVN is still down.

  • 0
    QMX Development

    Unfortunately, this is not the first time. I have been a customer since the "Dude" days and now I am looking elsewhere.

  • 0
    Gregg Cirielli

    Was up for 5 mins for me, then down again.

  • 0
    Akshay Jain

    Same here... SVN is still not working for me.

  • 0
    Edward St. Lawrence

    Git is also still down. 

  • 0
    Gregg Cirielli

    Just saw some signs of life again....  time will tell if steady or not.

  • 0
    Jesse Yowell

    Edward,

    Strange... can you try again real quick? I'm hoping the services just needed to resync.

  • 0
    Paul Bennett

    SVN is still not responding for me

  • 0
    Richard Cook

    svn at commandline worked for me before, and is working now ... fingers crossed ...

  • 0
    Jesse Yowell

    All,

    Our management team is in the process of creating a post-mortem, which I will post here once it is completed.

  • 0
    Gregg Cirielli

    Still down for me, and my company.

  • 0
    Mike Olshansky

    Still no dice on my end. Tried comparing local vs repo... no response from SVN.

  • 0
    Edward St. Lawrence

    It's working for me now, 12:30 PDT.

  • 0
    Brian Comer

    Have a build I need to sort in the next hour to ship out heh. Would also be interested in how progress goes :)

  • 0
    Anton Matosov

    Git is down for me, too!

  • 0
    Patrick Oleary

    Sorry for beating a dead horse, Jesse, but is there any sign of a postmortem coming out?

  • 0
    Gregg Cirielli

    3rd time -- signs of life observed.    First two only lasted ~5-10 mins each.

    I'll withhold my confidence until 20-60 mins steady up-time.

  • 0
    Jonathan Peterson

    Oh give 'em a break. They're working on it and when they're done, they'll let us know.

    Haven't you ever been there? I know I have.

  • 0
    Jesse Yowell

    We've updated the status once again... I'm sorry to have misled you. Apparently it was only up and healthy for a good 5-10 minutes. Our engineers are investigating as we type this.

    Jesse Yowell

    Support Engineer

  • 0
    Dmytro BAZULIN

    0.1% is about 45min, so everybody should be requesting 7 free days because of this. I doubt they will drop below 97% though, it's about a day-long outage. But at that point I will not be looking for 2 free weeks of service, I will be switching.

     

  • 0
    Mike Olshansky

    I have performed several sync and commit operations. It's running smoothly at the moment.

  • 0
    Shannon Massman

    Also, there's no excuse for not providing status updates when their core service is out. That's something that any person in the whole organization can do. Get the freaking owner on the web to come post here every half hour until it's fixed. Do it now.

  • 0
    Jesse Yowell

    Engineering said the main LDAP server was brought back up. We're hoping it stays steady... sorry for the eventful morning, everyone.

Article is closed for comments.