Book Study: The Practice of Cloud System Administration
阿新 • Published: 2018-12-28
- All software must be created so that it can be monitored and logged. Not just for security (my obvious bias) but for all forms of performance.
- Measure everything. Use this information to find problems while they are small, before they cause outages or incidents. Even measure your countermeasures, to understand when it’s time to automate.
- People who are on call are alerted via automation. This can be the Ops team or the Incident Response (IR) team.
- Being on call should not be hell. It should be planned in advance, with a realistic number of alerts, a backup person, and help in case the person on call can’t handle what is thrown at them.
- Playbooks (automated if possible) should exist for everything. If an alert or incident happens, your team should know exactly what to do, and what is expected of them. Hmmmm, who has said that before?
- Test your countermeasures by causing failures. That’s right, cause problems on purpose! Paging the
- Implement auto-scaling. Up and down. Why pay for what you are not using?
- Release new features to a few users at a time, to allow for A/B testing. This means you can figure out whether users like or dislike a new feature before the big announcement that it’s been released.
- Always have good hygiene. Regularly update documentation, tune your alerting, and review postmortem findings to create improvements.
- Dev and Ops are not two teams but one team that performs a variety of functions. *Everyone* participates in on-call duties.
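
To make the auto-scaling point a bit more concrete, here is a minimal sketch of one possible scaling rule. It is not taken from the book. The function name `desired_replicas`, its parameters, and the default limits are all hypothetical, and the proportional rule is just one simple choice: scale the replica count by the ratio of observed to target utilization, then clamp it to a floor and a ceiling.

```python
def desired_replicas(current: int, observed_pct: int, target_pct: int,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Suggest a replica count that moves average utilization toward the target.

    All names and defaults here are illustrative assumptions.
    Utilization values are whole percentages (e.g. 90 means 90%).
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_pct <= 0 or observed_pct < 0:
        raise ValueError("target_pct must be positive and observed_pct non-negative")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")

    # Integer ceiling division avoids floating-point rounding surprises.
    raw = (current * observed_pct + target_pct - 1) // target_pct

    # Clamp so we never scale to zero or past the budgeted ceiling.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(4, 90, 60))  # busy: scale up
    print(desired_replicas(4, 30, 60))  # idle: scale down
```

The same function handles both directions, which is the point of the bullet: capacity goes down when load drops, so you stop paying for what you are not using. A real system would normally add a cool-down period so that it does not flap between sizes.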
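
The gradual-release item can be sketched in a similar way. The example below is an assumption about how such a rollout might be wired up, not the book's method: `rollout_bucket` and `is_enabled` are hypothetical helpers that hash a user ID into one of 100 stable buckets and turn the feature on for the buckets below the current rollout percentage.

```python
import hashlib


def rollout_bucket(feature: str, user_id: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0..99.

    Hashing the feature name together with the user ID means the same user
    lands in different buckets for different features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(feature: str, user_id: str, percent: int) -> bool:
    """Return True if the feature is on for this user at the given rollout percent."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(feature, user_id) < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    exposed = sum(is_enabled("new-checkout", u, 10) for u in users)
    print(f"{exposed} of {len(users)} users see the feature at 10%")
```

Because the bucket depends only on the feature name and the user ID, a user who sees the feature at 10% keeps seeing it when the rollout widens to 50%. That keeps the A/B comparison stable while you decide whether to go to 100%.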