System Administration Automation at High Scale

This tech talk covers the five main areas of interest:

  • Automatic System Provisioning: This topic covers the deployment, repurposing and decommissioning of servers. These tasks are essential in high-scale environments that serve large amounts of traffic. Technologies involved: PXE, Kickstart, DHCP.

  • Configuration Management: Keeps the configuration files on multiple servers consistent and standardized. Pre-defined rules prevent unnecessary or accidental changes to systems. Tools: Cfengine, Puppet, Chef.

  • Monitoring: Monitoring is essential for understanding what is going on in the environment at any given time. The main principle for monitoring a highly scalable system is "count whatever you can count, measure whatever you can measure". Operations engineers can only identify problems, bottlenecks or potential failures by using these metrics. Tools: Nagios, Zenoss.

  • Log Management: This topic covers the management of the huge volume of log files produced by a large number of servers. All the files need to be collected, examined and reduced to manageable sizes. Tools: Syslog, Scribe.

  • Distributed Execution: This topic covers running a system command on multiple servers in parallel and aggregating the results. Tools: PDSH.
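
To make the distributed execution idea concrete, here is a minimal Python sketch of the fan-out and aggregate pattern. It is not how PDSH itself is implemented. The function names run_on_host and run_parallel, the host names, and the assumption that passwordless ssh is already configured are all hypothetical and chosen only for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_on_host(host, command, timeout=30):
    """Run `command` on `host` over ssh; return (host, exit_code, output)."""
    # Assumption: key-based (passwordless) ssh access to the host exists.
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        output = proc.stdout.strip() or proc.stderr.strip()
        return host, proc.returncode, output
    except (subprocess.TimeoutExpired, OSError) as exc:
        # Report unreachable or slow hosts instead of aborting the whole run.
        return host, -1, str(exc)


def run_parallel(hosts, command, runner=run_on_host, max_workers=32):
    """Run `command` on every host concurrently and group hosts by result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as `hosts`.
        results = list(pool.map(lambda h: runner(h, command), hosts))

    # Aggregate: hosts that returned the same (exit code, output) share a key.
    grouped = {}
    for host, code, output in results:
        grouped.setdefault((code, output), []).append(host)
    return grouped
```

Calling run_parallel(["web1", "web2", "web3"], "uptime") would return a dictionary mapping each distinct (exit code, output) pair to the list of hosts that produced it, so outliers in a large fleet stand out quickly. The runner parameter exists so that the transport (ssh here) can be swapped out or stubbed in tests.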