While we aren’t entirely on the
chatops bandwagon
at CTL (yet), I do believe that visibility is important to
operations. It’s often extremely helpful to know what’s going on with
our systems at a glance. We have a #monitoring
Slack channel that GitHub
and Travis CI as well as our internal
Jenkins server publish into so we can quickly
see when pull requests come in or are merged, when tests fail, and
when deployments are running. If you’re working on an application, you
may see a pull request go by that looks like it might conflict with
what you’re working on and you can back out some changes before you
get too far. Or if you’re thinking of deploying some code, but you see
that there are already a bunch of deployments happening, you might
decide to hold off for a bit.
One area where we lacked visibility and occasionally ran into conflicts was configuration management runs. We use Salt for all of our configuration management and orchestration. With Salt, like most other CM systems, you edit files to define the overall configuration of your infrastructure, then you run a command that basically says “update the servers to match the config”. In Salt’s case, that is done by running a “highstate” command on the Salt master server. That works fine as long as only one person is making changes and running highstates. If two people are trying to do it at the same time without coordinating, it gets confusing fast. It’s also the sort of thing that you’d like to know about if you’re deploying applications. In the course of a highstate, Salt may install packages and restart services left and right. This can cause deployments to fail in strange ways, and you can spend a lot of time debugging if you didn’t know that a highstate happened while you were deploying.
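For concreteness, the highstate command mentioned above looks something like this when run on the master (the '*' target is just an example):
salt '*' state.highstate
which tells every minion matching the glob to apply its configured states.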
Running a highstate also sometimes has unintended consequences. A typo in a config file might break something for a different application in a non-obvious way. Our monitoring systems should detect that, but there can be a delay, and we still have large gaps in our monitoring. The Salt state files are tracked in git, so we have a nice history and audit trail there, but you still don’t know exactly when the changes you see in git were applied via a highstate, so post-facto debugging can still be a chore. Salt also lets you run arbitrary commands across machines, which is handy for, e.g., restarting a service that’s acting up. Those commands can change the state of the servers and further complicate debugging.
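Those ad-hoc commands look something like this (the grain target and service here are only illustrative):
salt -G roles:nginx cmd.run 'nginx restart'
which runs the given shell command on every minion whose grains match.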
So I really wanted to make Salt highstates and commands visible in our Slack channel to provide a basic audit log and improve coordination.
To implement this, I made use of Salt’s Reactor System. The reactor system lets you trigger arbitrary commands from Salt’s internal events.
You configure the Salt Reactor by putting an /etc/salt/master.d/reactor.conf file on the salt master that tells it to map classes of events to handlers. We use reactors for a few other things as well, but the part that’s relevant here looks like:
reactor:
  - 'salt/job/*/new':
    - /srv/reactor/slack.sls
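If you’re curious what these job events actually contain, you can watch the event bus on the master with the state.event runner (assuming a reasonably recent Salt release) while running a command from another terminal:
salt-run state.event pretty=True
Every manually-run command should show up as a salt/job/<jid>/new event with its data printed underneath.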
That reactor.conf entry tells Salt to pass every new job event (basically anything that is run manually) to the slack.sls reactor. The reactor itself is a bit trickier, with an ugly mix of YAML and Jinja syntax:
{% if data['fun'] == 'state.highstate' %}
slack-highstate:
  local.cmd.run:
    - tgt: saltmaster
    - arg:
      - /usr/local/bin/salt_slack {{data['tgt']}} {{data['tgt_type']}} {{data['fun']}} {{data['arg']}}
{% endif %}

{% if data['fun'] == 'state.sls' %}
slack-state:
  local.cmd.run:
    - tgt: saltmaster
    - arg:
      - /usr/local/bin/salt_slack {{data['tgt']}} {{data['tgt_type']}} {{data['fun']}} {{data['arg']}}
{% endif %}

{% if data['fun'] == 'cmd.run' and data['tgt'] != 'saltmaster' %}
slack-cmd:
  local.cmd.run:
    - tgt: saltmaster
    - arg:
      - /usr/local/bin/salt_slack {{data['tgt']}} {{data['tgt_type']}} {{data['fun']}} {{data['arg']}}
{% endif %}
The three event subtypes that we want to handle are state.highstate, state.sls, and cmd.run. Whenever one of those is seen, it pulls out a few fields of data from the event and runs a salt_slack command on the salt master with those fields as arguments. The cmd.run stanza has a very important conditional on it (data['tgt'] != 'saltmaster'). That tells it to ignore cmd.run events targeted at the salt master itself. That’s important because the salt_slack command is itself run via cmd.run. Triggering another cmd.run every time it sees a cmd.run would cause a nice infinite loop. I can tell you from experience, an infinite loop on your salt master is not a fun time.
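For reference, the data dict that those reactor templates see for a new job event looks roughly like this (an illustrative sketch; the exact set of fields can vary between Salt versions, but these are the ones the reactor above relies on):
{
    'fun': 'cmd.run',              # the function that was invoked
    'tgt': 'roles:nginx',          # the target expression from the CLI
    'tgt_type': 'grain',           # how the target is interpreted (glob, grain, ...)
    'arg': ['nginx restart'],      # positional arguments passed to the function
    # plus jid, user, and other bookkeeping fields
}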
Finally, the salt_slack
command is a little Python script that turns
the data from the events back into something intelligible and sends it
to Slack’s webhook:
#!/usr/bin/env python
import requests
import json
import sys

ENDPOINT = "https://hooks.slack.com/services/<slack token goes here>"
channel = "#monitoring"
emoji = ":computer:"

# positional arguments passed in by the reactor
target = sys.argv[1]
target_type = sys.argv[2]
fun = sys.argv[3]
args = ""
if len(sys.argv) > 4:
    # the arg list comes through rendered as a string, e.g. "['nginx restart']"
    args = sys.argv[4]
    args = args.strip("[]")


def deformat(target, target_type):
    # reconstruct the target expression as it would appear on the command line
    if target_type == "glob":
        return target
    if target_type == "grain":
        return "-G " + target
    return target


command = "salt %s %s %s" % (deformat(target, target_type), fun, args)

payload = dict(
    channel=channel,
    mrkdwn=True,
    username="salt-bot",
    icon_emoji=emoji,
    attachments=[
        {
            "mrkdwn_in": ["text", "fallback"],
            "fallback": command,
            "text": "`" + command + "`",
            "color": "#F35A00"
        }
    ]
)
data = dict(
    payload=json.dumps(payload)
)
r = requests.post(ENDPOINT, data=data)
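To sanity-check the script outside of Salt, you can run it by hand with the same positional arguments the reactor passes (the values here are only an example, and the webhook URL in ENDPOINT needs to be filled in first):
/usr/local/bin/salt_slack '*' glob state.highstate '[]'
If everything is wired up, a formatted salt '*' state.highstate line should appear in the #monitoring channel.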
Now, when someone runs salt '*' state.highstate, salt -G roles:postgresql state.sls, or salt -G roles:nginx cmd.run 'nginx restart', those commands are immediately posted to our Slack channel for everyone to see.