How one person provides high-quality support to 4 million application users

Brian Cervino on how he supports Fog Creek Software’s four-million-strong user base for Trello:

As we pass four million Trello members I thought it would be a good time to share with other small software development teams the fact that providing high quality support doesn’t have to be expensive or impossible.  This includes a one business day initial response window for all newly created cases and making sure to follow through on all open cases until resolution.  With just a few tools and some dedicated time, it is possible for even just one person like myself to support our entire member base.

Pretty damn impressive.

Are you making these backup power generator mistakes?

Personal lessons from managing a critical facility.

A very common problem for facilities with critical loads is that the power generator doesn’t start when it is needed. Fortunately,[1] this can be remedied in the vast majority of cases.

When I say critical loads I am talking about computer data centers, hospitals, schools, stadiums, police stations, 911 centers, office buildings, or whatever you’ve deemed important enough to attach a backup power generator to.

The Situation

You, or your organization, have a critical load and have gone to the trouble of investing in a backup generator power source for it.

The Objective

This is simple enough: keep the power on.

The Problem

A local newspaper reported on a sewage spill in the county I live in:

About 20,000 gallons of sewage spilled from the California Men’s Colony prison at 4:10 p.m. Sunday when power was lost and an emergency generator did not start. The sewage flowed into Chorro Creek, which flows into Morro Bay.

The fault was apparently that the generator did not start after a utility power failure:

“The power failed and then our backup generator failed, so it was kind of like a double power failure,” said Mike Minty, chief engineer at the prison’s waste water treatment plant. “It’s all fixed now.”

If you operate a data center or critical facility that has a power generator, there are some very easy proactive steps you can take to mitigate the most common problem I observe: the generator fails when the power goes out. For most, that’s not the hoped-for outcome of the capital they’ve invested in making their facility more resilient to utility power outages.

While there’s always a possibility that this can happen even if various preventative actions are taken, the chances are far lower if a handful of items are paid attention to. I am not privy to the maintenance procedures at the California Men’s Colony waste water treatment plant, so I’m just using their outage as the thought-provoker and not judging them.

The Cause

When a generator “simply does not start”, that is rarely the entire story. Rather than being the root cause of the outage, it’s a manifestation of the facility’s maintenance and monitoring practices.

In my experience it’s usually a symptom of a lack of a proactive culture surrounding the backup power system. Sadly, some organizations that invest large sums of capital into their backup power systems (and, presumably, into whatever critical load they are protecting) don’t factor in proper operational costs and fail to implement procedures that ensure the preventative work actually gets performed. This diminishes the return on the capital invested in the entire system.

The failures then flow through to two areas:

  1. Maintenance
  2. Monitoring

The usual failure scenarios are one or more of:

  1. Generator fails to start (common)
  2. Generator starts, but fails under load (common)
  3. Generator starts, but no power reaches the critical load (less common)

The end result is the same: the critical load loses power.

The Solution

In the case of a generator, here’s the practice I’ve learned to follow:

Weekly no-load automatic tests (usually this can be programmed into your automatic transfer switch, or ATS)

  • What this verifies:
    • basic generation functionality
    • control functionality from the ATS to the generator (a simple cabling problem between the ATS and genset, even if both are completely operational and test out fine, can ruin your entire day)
  • Labor involved should be to verify:
    • genset actually starts on its own (checklist item)
    • inspection of the gauges for anything unusual (temperature, voltage output, battery voltage, fuel levels, etc.)
    • physical inspection of the generator, listening for unusual sounds and checking for animals that have crawled inside (I’ve had cats inside…)
  • Costs:
    • Junior technician or facility maintenance person, approximately 15-30 minutes one day per week
  • Risks resulting from implementing this procedure:
    • Essentially nil. There won’t be a real load on it, and nothing will be disconnected during the test.
    • A risk is that this procedure isn’t completed every week. I suggest requiring a small checklist report to be filed each week with a colleague or supervisor (and making sure they know to expect it, so that if it doesn’t arrive they go looking for the reason why). That way nothing gets missed just because somebody “got busy”, was out sick, or went on vacation. To verify the test was really done and not just filled out on paper, listen for the generator running (duh!) and make the generator’s run-hour meter reading one of the checklist fields; it should go up every week. A minimal sketch of automating that check follows this list.
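
If the weekly checklists are kept electronically, the supervisor-side review described above boils down to two simple checks: a report was filed each week, and the run-hour meter reading keeps climbing. Here is a minimal sketch of that check in Python; the CSV file name and the column names (date, run_hours, started_ok) are assumptions for illustration, not part of any standard log format.

    # A minimal sketch (hypothetical file layout and column names) of the
    # supervisor-side check described above: confirm that a weekly no-load
    # test report was actually filed, and that the generator's run-hour
    # meter reading keeps increasing from week to week.
    import csv
    from datetime import datetime, timedelta

    def check_weekly_log(path):
        """Return a list of warnings found in the weekly test log at `path`."""
        warnings = []
        with open(path, newline="") as f:
            # Assumed columns: date (YYYY-MM-DD), run_hours, started_ok (yes/no)
            rows = sorted(csv.DictReader(f), key=lambda r: r["date"])

        previous = None  # (date, run_hours) from the last filed report
        for row in rows:
            date = datetime.strptime(row["date"], "%Y-%m-%d")
            run_hours = float(row["run_hours"])

            if row["started_ok"].strip().lower() != "yes":
                warnings.append(f"{row['date']}: generator did not start on its own")

            if previous is not None:
                prev_date, prev_hours = previous
                if date - prev_date > timedelta(days=8):
                    warnings.append(f"{row['date']}: more than a week since the last filed test")
                if run_hours <= prev_hours:
                    warnings.append(f"{row['date']}: run-hour meter did not increase (paperwork-only test?)")

            previous = (date, run_hours)

        return warnings

    if __name__ == "__main__":
        for warning in check_weekly_log("weekly_generator_tests.csv"):
            print("WARNING:", warning)

Even if the checklist stays on paper, these are the same two questions a supervisor should be asking every week.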

Monthly (or bi-weekly) manually triggered actual-load tests

  • Verifies:
    • Takes the place of the weekly no-load every fourth week
    • Mechanical functionality of the ATS
    • Electrical functionality of the ATS
    • Real functionality of the generator (many problems do not manifest themselves when the generator is running without a load or with only a very low load)
    • That the facility is not exceeding the capacity of the generator (doh!)
    • That there aren’t weird characteristics of the load or the backup power system interacting in a way that will lead to power loss
  • Risks
    • If the generator feeds computer systems or other equipment that can’t lose power for even a few seconds and those systems are not protected by UPSes, plan this type of test accordingly. For a data center with UPSes, a monitored test can be performed: the worst that should happen if something fails is a very brief loss of air conditioning while you switch back to utility power to isolate what went wrong, with the computers running from the UPS batteries in the meantime.

Yearly (or quarterly) dummy-load tests

  • What this verifies:
    • Often the generator will not be at 100% utilization, even during the monthly actual-load tests. Some generator problems will not manifest themselves except under heavy load.
  • How to do this:
    • Have a testing and maintenance contract with a company that specializes in this. They can also assist you with other maintenance activities. Look in the yellow pages or speak with other folks who have backup generators (ignoring, of course, the folks who say they don’t do anything special to make sure their generator works when they need it 🙂).

Things This Solution Should Catch

Some of the causes of a failure to start that I’ve observed, all of which would have been caught under the above system, are:

  • Generator battery fails
    • Causes: age, no smart charger installed, cabling disconnected during maintenance, or a loose connector shaken off during generator operation
  • Basic care and feeding of the generator overlooked (it’s like a car: think oil, spark plugs, etc.)
  • Empty fuel tank
    • 🙂
  • Bad fuel
    • There are different types of diesel, depending on season and location
  • Animal trapped inside
    • Not good for you or the animal
  • Power cabling from the generator to the load broken or poorly connected
  • The power consumption of the load exceeds the capacity of the generator
  • The generator is flaky under load
    • This is why you must test the generator under load when it is not actually needed, both with a dummy load and with a real load (see the dummy-load testing above, for example)
  • Transfer switch control wiring to the generator fails
    • Loose connections, low quality cabling, too much water in conduits, construction smashes underground conduit (unknowingly even sometimes)
  • Programming modified on the auto-transfer switch
    • No longer activates generator properly
    • No longer cuts over to generator under conditions desired
    • Turned off. 🙂
    • May have been an accident, or the little battery in the ATS may be running low (it should be checked quarterly with a voltmeter and replaced at least every year)

It’s good to be aware of these so that the staff implementing the system understand the specifics of the problems they’re looking for. They do require a checklist system: something for a junior tech or maintenance person to perform/verify on a regular schedule. These items are easy and even “cheap” in both absolute and ROI terms. I’m reminded of a couple of quotes I recently wrote in my journal (credit to Robert Rosenthal for these two sayings):

  • “If the cheap solution fails, it may be the most expensive option of all.”
  • “No one ever went to the board of directors and said, ‘The project failed, but I’m proud of the fact that we paid next to nothing to implement it.’”

  1. or unfortunately, depending on how you look at it 

Nintendo Still Doesn’t Get It

For The New York Times, Brian X. Chen writes:

Tablets are considered a threat to Nintendo because games can be downloaded for a few dollars, or even free. Nintendo’s strategy has been to make most of its money from sales of the games it produces exclusively for Nintendo devices. Therefore, it has refused to offer its games to makers of tablets and smartphones.

While Nintendo may eventually have to reverse the latter stance and offer its games on tablets and smartphones, that’s only a symptom of the massive opportunity it missed.

Our household has owned a Nintendo Wii for several years. To me, the most obvious missed opportunity for Nintendo over the last several years has been its lack of interest in creating a vibrant application ecosystem. Nintendo could have easily copied the iPhone/iPad App Store or Google Play; they didn’t even have to come up with the idea on their own or take a flyer on an unproven concept. They are uniquely positioned to have matched these app stores, unlike many of the wanna-be app store operators out there that you can find on every device and every site that has any semblance of a “platform”.

If only Nintendo had made it easier for developers to create apps for their platform, and easier for users (read: customers) to navigate a marketplace for trying, downloading, and purchasing third-party applications on their console (e.g. being able to do it all from a web interface on one’s PC or, if really ambitious, via iPhone and Android apps), they’d have a 30% commission gravy train derived from the value of their installed base, just like Apple and Google do on their platforms, for any apps purchased from their ecosystem.

We’re not talking small numbers here. Even if you are a game console aficionado, you can’t refute that the original Wii console has outsold both of the other leading consoles (the Xbox 360 and the PlayStation 3). The fact that the Wii is intentionally marketed to, and ends up appealing to, a broader demographic should have been a catalyst for even greater success in the app space!

Doh, Nintendo. Investments in the new Wii U and in 3D and, now, lower-end variations? Fine, but let’s not continue to miss the forest for the trees. There is still time. A brief window of opportunity does remain before Apple, Google, Samsung, and whoever else eat your lunch (or, at the very least, force you into doing things you don’t want to do, like releasing your software onto other platforms in order to remain comfortably, albeit less so, profitable from your old standby “franchises” like Mario Bros. et al.).

To gain respect, IT departments must mature

Many IT departments are fond of proclaiming a lack of resources. This has only been heightened, in the last couple of years, by economic turmoil. I’ve also observed that we are quite ignorant of our role in creating this situation.

I’m referring to our need to better justify our investments, to discuss benefits more eloquently (while being more in tune with our audience), to hold ourselves more accountable, and to seek out better ways to connect the output of our activities to the engine that drives our organizations.

With this in mind, here are some things we might all consider doing more of this year…

  1. Digging deeper when attempting to justify new investments: In any business, there are always ways to bring the impact of an initiative closer to the real motivators for investment. What may seem immediately obvious to you may not be to a non-technical (or even technical) executive, especially one with many other folks asking for resources at the same time. It’s also helpful to remind yourself that even good ideas don’t get funded for all the right reasons. Successful businesses don’t invest in all their good ideas, but seek to invest in as many of their best ideas as possible. Constraints of manpower, focus, and capital are a legitimate fact of life. Furthermore, we don’t have the gift of omnipresence and we can only use hindsight after the fact!
  2.  Having a greater openness about our failures and limitations: There will always be investments that fail to achieve their aims. These should not be hidden away, but instead used to foster learning and build trust. Similarly, there will always be investments where success will seem hard to judge. With a frank set of business objectives at the start of a project (based upon desirable outcomes and not tasks) there will be fewer disappointments, more flexibility, and greater trust. This will increase support for future initiatives and the business will be more likely to achieve its aims.
  3. Raising our standards, particularly in the area of firefighting: Some IT groups seem to wear their expert (and superb) reaction abilities as a badge of honor. That’s okay, as long as it’s used to provide an excellent response in the case of an unexpected event. Emergencies are no longer unexpected if they have become the norm. Elevate your standards by making firefighting the exception rather than the rule (while still retaining your ability to respond rapidly and effectively in an emergency). Seek out patterns in recurring fires. Treat the symptom quickly and then move on to solving the underlying cause. Better yet, ignore the symptom and knock out a solution once and for all. Fires are either the result of unexpected events, or they are the result of setting your standards too low. You choose which you prefer.
  4. Making better use of the resources we’ve already got so we’ll be able to get more: Limited resources are a factor for everyone. It’s what you do with what you’ve got that ultimately decides whether you’ll get more. Complaining is unlikely to move you forward. You may not love the results that you get from your very limited resources, but if you can’t demonstrate that you used those resources wisely, how can you expect others to trust you to efficiently and effectively use even greater resources?
  5. Putting business objectives first: IT isn’t about technology, it’s about business. Ironically, this means that IT departments who get the most capital for investment in new toys, tools, training, and technical staff are those who pay more attention to the business aspects of their activities rather than the technology aspects. Prudent technology investments are not more “correct” because they’re more technically elegant. It is of greater importance that they produce results that are beneficial to the business.
  6. Seeking success not perfection: Like so much in business, and in life, the name of the game is success not perfection. Don’t make the mistake of seeking perfection over getting results. Technical people are often poor at hitting tight deadlines by triaging features and requirements, and by applying the “80/20” principle. It is important to recognize when positive results have been achieved, even if the path there was not paved with perfect steps 🙂
  7. Seeking out, and pushing for, investments that generate money over those that only save it: While saving money is a worthwhile goal, the only reason there is any money (revenue) to attempt to save (by lowering costs) is because sales were generated in the first place. Creating new opportunities for increased sales – to both new and existing clients/customers – is not just the job of the sales and marketing groups. IT is involved in many different aspects of operations, both internal and external, direct and indirect. Stay alert to opportunities that boost the bottom line not only through savings, but also through growth.
  8. Outsourcing important, but non-strategic and non-differentiating, IT functions while finding new ways to add value by innovating and being strategic: Outsourcing is not an all-or-nothing strategy. The IT folks that fear for their jobs when the topic of outsourcing comes up force me to wonder whether their problem is just low self-esteem or whether they are, in fact, adding less value to the business than they should be. Despite the harshness of this statement, in many cases, the problem is the former more so than the latter. Even when it is the latter, it is often correctable with a shift in perspective.

It is already commonplace to outsource activities such as telecommunications services (dial tone and Internet) and software development (e.g. off-the-shelf software). An increasing number of IT departments should consider outsourcing their entire in-house phone solution (no more PBX), their off-site storage, and the deployment, upkeep, and troubleshooting of their corporate network and employees’ personal computers. The freed up time and energy (and money!) should be shifted towards strategic activities (such as seeking out more things to outsource for greater efficiency and, in particular, the seeking out of investments that contribute to business growth).

If you’re concerned about outsourcing replacing your position, shift your efforts to increasing your value to the business (instead of fighting against smart outsourcing initiatives). As a side note, I believe that one is better off cannibalizing his or her own position – if it is probable or inevitable – rather than waiting for someone else to force the change upon him/her. With this perspective, everyone in IT has an opportunity to become a trusted advisor to the business. The alternative is to be viewed as defensive, as out of touch with the business, and as having an agenda. That’s not a position you want to be in if you want your advice to be taken seriously.