Category Archives: Operations

DevOps: Managing internal and external DNS with Amazon Route53

So one fun project I’ve been working on at work is developing the integration to make managing our internal (server-facing) and external (customer-facing) DNS a simpler process.  This has meant taking a few different things (BIND, Amazon Route53, DHCP, and various tools from various places) and tying them together.

So my system consists of basically four elements: Route53, for serving internet-facing DNS requests; internal DNS servers that get zone data from Route53; route53d servers for pushing updates to Route53 via its API; and Route53’s Web UI.

I use Route53 as the single point of truth.  Updates go there via the API or the Web UI.  Then the data is pulled out of there and published to my internal DNS servers.

Due to the stateless nature of how these servers work, this system is very scalable out of the box.  To handle more load, I just spin up more instances of the internal DNS servers.  With a load balancer in front of them, I could provide plenty of service to basically an arbitrary number of machines.

A couple cool programs made this project very possible:  dnscurl.py, route53d, and route53tobind.  These tools made it easy (as in, just writing integration code and deployment code) to tie together all these resources.

One goal I had in this process was to make the whole system scalable.  I didn’t ever want to have to log into the boxes when they’re in production, or indeed, even when deploying them.  So this meant that I developed deployment scripts and kickstart files that specified the whole structure of the internally facing DNS servers and the route53d servers.  Basically, I just kickstart it, and it’s done.  This is done by using a kickstart script whose postinstall section pulls over a script and runs it.  That script configures all the stuff and sets it to start on boot.  This could (and should) be done with Chef or some other configuration management suite.

I periodically poll Route53 for new zone data, and when I get it, I push it to the internal servers.  Then I call “rndc reload” to incorporate the updated zones into the running server.
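Here is a minimal Python sketch of that poll-and-reload step, not the actual script I run.  The names sync_zone, zone_changed, and fetch_zone_text are made up for illustration; fetch_zone_text stands in for whatever produces BIND-format zone text from Route53 (route53tobind, in my case).  It assumes rndc is on the PATH and that the zone file lives wherever named expects it.

```python
import hashlib
import subprocess
from pathlib import Path


def zone_changed(new_text, zone_path):
    """Return True when new_text differs from the zone file on disk (or no file exists yet)."""
    if not zone_path.exists():
        return True
    old_digest = hashlib.sha256(zone_path.read_bytes()).hexdigest()
    new_digest = hashlib.sha256(new_text.encode("utf-8")).hexdigest()
    return old_digest != new_digest


def sync_zone(fetch_zone_text, zone_path, reload_cmd=("rndc", "reload"),
              runner=subprocess.check_call):
    """Fetch zone text, and only if it changed, replace the file and reload BIND.

    fetch_zone_text is a hypothetical callable returning the zone as a string.
    runner is injectable so the reload can be stubbed out when testing.
    """
    text = fetch_zone_text()
    if not zone_changed(text, zone_path):
        return False
    # Write to a temp file next to the target, then rename over it, so named
    # never reads a half-written zone file.
    tmp_path = zone_path.parent / (zone_path.name + ".tmp")
    tmp_path.write_bytes(text.encode("utf-8"))
    tmp_path.replace(zone_path)
    runner(list(reload_cmd))
    return True
```

Run from cron every few minutes, something like this only bothers BIND when the zone data actually changed.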

And presto, consistent internal and external DNS with scalability and availability.

Tools used: dnscurl.py, route53d, and route53tobind.

 

DevOps: Using Fabric to fix the Leap Second bug

So today I had a fun problem:  How am I to correct the leap second bug on 700 systems?  Gosh, that’s a thing.  So I went looking in my systems administrator toolbox and found Fabric.  Turns out that you can automate things in pretty awesome ways with it.

Fabric is a Python tool that lets you specify a list of things to do, and go execute them across a pile of hosts.  For me, this meant:

$ cat fabfile.py
from fabric.api import run, env

def fix_date():
    # Print the time, set the clock to its own current value, print the time again
    run('date; date `date +"%m%d%H%M%C%y.%S"`; date')

$ fab --skip-bad-hosts -t 3 -H \
      `perl -e 'chomp(@l=<>); print join ",", @l;' < hosts.txt `\
      fix_date

And bam, 30 minutes later, the bug is fixed.
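If Perl one-liners aren’t your thing, the host list for -H can be built in plain Python too.  This is just a sketch: load_hosts and hosts_arg are names I made up, and unlike the Perl version it also skips blank lines and # comments in hosts.txt, which is my own addition.

```python
def load_hosts(path):
    """Read one hostname per line, skipping blank lines and # comments."""
    hosts = []
    with open(path) as handle:
        for line in handle:
            name = line.strip()
            if name and not name.startswith("#"):
                hosts.append(name)
    return hosts


def hosts_arg(hosts):
    """Join hostnames into the comma-separated form that fab -H expects."""
    return ",".join(hosts)
```

In Fabric 1.x the same list could instead be assigned to env.hosts at the top of the fabfile, which saves building the -H argument at all.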

 

Web Operations Performance – The Handout

So one of the issues that I deal with a lot is tuning web applications to not suck.  This comes down to a few things: monitoring, profiling, caching, caching (CACHING!), and tuning.  The process for making a web application more awesome basically boils down to this list of steps:

  1. Monitor your application performance (http threads, cpu, memory, thread response time, etc)
  2. Profile your code
  3. Fix slow requests/implement caching
  4. Tune your web-server
  5. Goto 2.

Monitoring

Monitoring the response time of your application is useful and awesome for making positive changes to your environment.  This means paying attention to your application response time, cpu, memory, network traffic, disk IO, disk capacity, etc.: all those metrics that say whether your application is healthy or not.  There are a few different tools available for this that all work pretty well; here’s an incomplete list:

  • Cacti – http://www.cacti.net
  • Munin – http://www.munin-monitoring.org
  • Cricket – http://cricket.sourceforge.net

They all work well and solve slightly different problems.  Pick the one you like most.  I’m a fan of Cacti.

Profiling

Profiling means being able to see how long each call that an application makes takes to execute.  It’s invaluable for getting a feel for what parts, what components, of your application perform badly or perform well.
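To make that concrete, here’s a small sketch using Python’s built-in cProfile.  The functions slow_lookup and render_page are stand-ins I invented to play the part of a slow backend call and a page that uses it; your stack will have its own profiler, but the report looks much the same.

```python
import cProfile
import io
import pstats
import time


def slow_lookup():
    """Pretend backend call (hypothetical); the sleep stands in for a slow query."""
    time.sleep(0.05)
    return "row"


def render_page():
    """Pretend page render that makes three backend calls."""
    return [slow_lookup() for _ in range(3)]


def profile_call(func):
    """Run func under cProfile and return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(10)
    return result, report.getvalue()


if __name__ == "__main__":
    _, text = profile_call(render_page)
    print(text)
```

The report lists each function with its call count and cumulative time, which is usually enough to spot the one call eating the whole request.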

Caching

Whenever an application fetches data from a resource, that’s an opportunity to improve performance.  Every time something is fetched, there’s the ability to take that result set, and keep it.  Caching the results of database calls, of external API lookups, of intermediate steps, all these things leave lots of room for improving performance. Memcached is the de facto standard for a caching engine.

Cache early, cache often!
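The basic pattern (check the cache, fall back to the real fetch, keep the result) can be sketched in a few lines of Python.  This toy TTLCache class is my own illustration and keeps everything in a local dict; in production the dict would be replaced by Memcached or similar, which this sketch does not talk to.

```python
import time


class TTLCache:
    """Tiny in-process cache with per-entry expiry (a stand-in for a real cache server)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() only on a miss or expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value
```

Usage would look like cache.get_or_compute("front_page_posts", run_the_query), where run_the_query is whatever expensive call you’re trying to avoid repeating.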

(Apache) Tuning

A well configured web-server is crucial to a happy environment.  This means not running into swap, not wasting time with connections that are dead, and other such things that waste time.  In short, don’t look up what you don’t need, don’t save what you don’t need, and be efficient.  Here are some basic things that apply to Apache:

   KeepAlive          Off (Or On, see below, it depends on workload)
   Timeout            5
   HostnameLookups    Off
   MaxClients         (RAMinMB * 0.8) / (AverageThreadSizeInMB)
   MinSpareServers    (0.25 * MaxClients)
   MaxSpareServers    (0.75 * MaxClients)

About these parameters:

  • KeepAlive – this controls whether, once one request from a client has completed, that thread stays connected to the client for subsequent requests.  In high-scale applications, this can lead to contention for available resources.  Some workloads, however, benefit from keeping this on.  If you are serving lots of different content types on a page to a client, leaving this on can be a good thing.  Test it out, YMMV.
  • Timeout – how long, in seconds, before we assume that the client has gone away and won’t be requesting further data.  The default is 300.  The value of 5 suggested above is aggressive.
  • HostnameLookups – this is for logging, and if it is on, each client will cause a DNS request to be made.  This slows down the request.
  • MaxClients – the total number of threads that the server will allow to run at once.  Each thread consumes memory.  This model assumes that 20% overhead for other system tasks is appropriate and will keep us out of swap.  On machines above 16GB of RAM, use 0.9 instead of 0.8.
  • MinSpareServers – the fewest spare threads that Apache will leave running.  Setting this too low will result in load spikes when traffic increases.
  • MaxSpareServers – the most spare, unutilized threads that Apache will leave running.  Setting this too low will result in lots of process thrashing as threads are used and then terminated.  The tradeoff is RAM usage.
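Here’s the arithmetic from the table above as a small Python sketch.  The function name apache_prefork_settings is mine, and the 30MB average thread size in the example is a made-up number; measure your own with ps or top before trusting the output.

```python
def apache_prefork_settings(ram_mb, avg_thread_mb):
    """Apply the rules of thumb above: 80% of RAM (90% above 16GB) divided by thread size."""
    usable_fraction = 0.9 if ram_mb > 16 * 1024 else 0.8
    max_clients = int(ram_mb * usable_fraction / avg_thread_mb)
    return {
        "MaxClients": max_clients,
        "MinSpareServers": int(0.25 * max_clients),
        "MaxSpareServers": int(0.75 * max_clients),
    }


if __name__ == "__main__":
    # A 4GB box with (hypothetical) 30MB Apache children.
    for name, value in apache_prefork_settings(4096, 30).items():
        print(name, value)
```

For that 4GB box with 30MB children, this works out to MaxClients 109, MinSpareServers 27, and MaxSpareServers 81.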

There are a lot of other things that can be done as well, so don’t take this as a complete set…

These are my handout style tips on performance tuning.  There are whole volumes of books dedicated to this topic.  Some great resources include:

-Gabriel

Of Tuning WordPress on Cherokee

So at work, I had a blog.  Not my blog, of course.  One I support.  So this blog was thrown together quickly to facilitate business goals.  Like you do.  And that’s great.  We met the deadline.  We got the product functional.  But performance kinda sucked.  A little backstory.  Here’s how it got set up to start.

We love virtualization here.  Everyone should.  It’s a fantastic way to take advantage of hardware that you’ve got lying around idle.  Idle hardware is lame.  So we run KVM.  This means we get to manage machines more effectively, and can provision things faster.  That’s awesome.

So this blog.  It’s a cluster of boxes, made up of a pair of webservers, and a pair of DB backends.  The DB backends are a master, and a backup.  This is a small project, so these got set up with just 4GB of RAM.  So that’s fine.  The web servers are each a VM with 1 vcpu, 4GB of RAM, and some disk space.

So we set that all up, and it did fine.  But not great.  Here’s what it did.

Incidentally, if you’ve got data you’ve collected for performance with Pylot, and need to CSV the averages to make a pretty graph, here’s my way of extracting the averages from the results HTML to feed into gnuplot.

for file in load-dir/*/results.html; do
    egrep 'agents|avg' "$file" | head -3 |
        perl -e 'while(<>){s/<\/?t[dr]>//g; s/(\d+\.?\d*)/\t$1/g; chomp; @p=split /\t/,$_; push @r,$p[1]; }; printf "%s\n", join ",",@r;'
done
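For anyone who’d rather read Python than Perl, here’s a simplified take on the same idea: take the first three lines mentioning agents or avg, grab the first number on each, and join them with commas.  The function name summary_row is mine, and it is not a byte-for-byte port of the one-liner; check it against your own results.html before relying on it.

```python
import re

NUMBER = re.compile(r"\d+\.?\d*")
WANTED = re.compile(r"agents|avg")


def summary_row(html_text, limit=3):
    """Return a CSV row with the first number from each of the first `limit` matching lines."""
    values = []
    for line in html_text.splitlines():
        if not WANTED.search(line):
            continue
        match = NUMBER.search(line)
        # A matching line with no number contributes an empty field.
        values.append(match.group(0) if match else "")
        if len(values) == limit:
            break
    return ",".join(values)
```

Feed it the contents of each results.html in turn and you get one CSV line per test run, ready for gnuplot.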

So what we can see then from these graphs is that a pair of single-core boxes, running the default Cherokee+PHP installation, don’t do all that great.  They can handle 1-2 simultaneous requests, but past that, the response time gets pretty bad.  That’s where the project got left for a while, until the other day I got the request to make it handle 500 requests per second.  “Wow, shit.” I thought.  So I dove in to see what I could do to improve performance on the blog, and it turned out that there was a lot I could do.  Here’s what I found:

  1. Single-core boxes don’t have a lot of performance available to them.
  2. Cherokee uses PHP in FastCGI mode, and does a terrible job of tuning it out of the box.  It defaults to a single PHP thread.
  3. WordPress is very hungry for CPU time.  It chews up a lot of resources doing its DB queries and wrangling them.

To address these points, I did the following.

The VMs themselves were restarted with more CPU cores: four cores per box.  This is what let me discover that Cherokee wasn’t tuned well.  Under a single core, I saw 80% CPU utilization on PHP, with high system wait time.  That sucked.  But after I bumped up the core count, it still looked bad.  Still only one CPU @ 80%.  WTF.

So then I turned to Cherokee, and I tuned PHP so that it would invoke 16 children, enough to handle the request volume.  This helped a lot, but there was still room to do better.  So I added APC, the Alternative PHP Cache, to the configuration.  This helped out a lot.  I then looked at WordPress-specific caching engines, and settled on using W3 Total Cache to help out.  This brought performance into the range of fulfilling the customer’s request.  I felt great about it.

I used Pylot to graph performance at various points through this project so I could figure out whether I was doing a better or worse job of tuning the boxes.
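If you just want a rough number without setting up a full load-testing tool, the core measurement is simple enough to sketch in Python.  This is not how Pylot works internally; measure and its arguments are names I picked, and call is whatever function makes one request to your site.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure(call, total_requests, agents):
    """Run call() total_requests times across `agents` worker threads and summarize timings."""
    def timed(_):
        start = time.perf_counter()
        call()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=agents) as pool:
        latencies = list(pool.map(timed, range(total_requests)))
    wall_time = time.perf_counter() - wall_start

    return {
        "requests_per_sec": total_requests / wall_time,
        "avg_latency": sum(latencies) / len(latencies),
        "max_latency": max(latencies),
    }
```

Run it at a few different agent counts before and after each tuning change, and you get the same kind of requests-per-second and response-time curves to compare.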

Here’s the performance data showing the improvements that caching at various layers added to the party:

So it turns out that these are some great ways to make sure your blog performs:

  • Make sure that there’s enough hardware supporting it
  • Cache in depth and breadth, early and often
  • Tune your PHP/WebServer to handle the project
  • Employ performance testing to measure your real improvements

With some hard work and performance tuning, you can turn a blog that peaks at an abysmal 7 requests/sec into one that can sustain 350 requests per second, and do it in a tiny fraction of the time.

-Gabriel

 

Building Cassandra-Cacti-M6 on CentOS 5.5

Turns out this is hard to do.  I’m writing this here for my benefit, and everyone else’s, too.  I’ve got a client using Cassandra, and of course you monitor that stuff.  So I figure that using the cassandra-cacti-m6 stuff is a good plan.  That’s cool, it works with Cacti.  Cacti’s pretty snazzy.  My customer is using CentOS 5.5 on the monitoring box.  Turns out there are a lot of hoops to jump through for that to work.  So here’s what I did.

  1. Install jpackage-utils
  2. Install the jpackage.repo into /etc/yum.repos.d, and enable the rhel5 targets
  3. Install JDK6 Update 3 (because of the following step)
  4. Install java-1.6.0-sun-compat from ftp://jpackage.hmdc.harvard.edu/JPackage/5.0/generic/RPMS.non-free/

Then I go to install Ant, but I discover something: trying to install ant.noarch complains about “Missing Dependency: /usr/bin/rebuild-security-providers is needed by package”.  This sucks.  So I do some googling, and figure out that someone solved this issue on CentOS 5.x.  I do what they did (documented at http://plone.lucidsolutions.co.nz/linux/centos/jpackage-jpackage-utils-compatibility-for-centos-5.x ).  This works great.  The package is built, and I install it out of the local build dir.

Then I install ant.noarch and ant-nodeps.noarch.  These do the trick.

Then I build cassandra-cacti-m6 as documented in its source tree.  Woo.

It took me a while to dig up and get all this working, but for you, I hope it’s fast and easy.  Enjoy!

-Gabriel