With Nagios you can monitor almost anything, and the philosophy is simple.
Nagios runs plug-ins (for example, Perl or shell scripts), checks each plug-in's return value and uses it to determine the host or service state. Nagios neither knows nor cares what a plug-in is actually monitoring.
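To make that contract concrete, here is a minimal sketch of a plug-in in plain POSIX shell. The check_dir_exists function is hypothetical and only illustrates the convention: print one status line, then return 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```sh
# Hypothetical plug-in body illustrating the Nagios contract:
# one line of output, and the state encoded in the return/exit code.
check_dir_exists() {
    if [ -z "$1" ]; then
        echo "UNKNOWN - no directory given"
        return 3
    elif [ -d "$1" ]; then
        echo "OK - $1 exists"
        return 0
    else
        echo "CRITICAL - $1 is missing"
        return 2
    fi
}

# A real plug-in would finish with: check_dir_exists "$1"; exit $?
check_dir_exists /
echo "exit code: $?"
```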
Here is a plug-in that monitors the ambient temperature around a machine. It supports the following servers: Sun Enterprise T5240 and SunFire X4200/X4500.
The script uses the ipmitool utility to connect to the ILOM of the supported systems. In my environment the ILOM interface is named either hostname.alom or hostname-alom, so the script also checks which of the two names resolves. The file .passwd.alom contains the ILOM password, which ipmitool reads through its -f option.
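A minimal sketch for creating that password file so that only its owner can read it. make_passwd_file is a hypothetical helper, and it assumes ipmitool -f takes the password from the first line of the file. Reading the password from a file also keeps it off the ipmitool command line, where other users could see it in ps output.

```sh
# Hypothetical helper: write an ILOM password file with mode 0600.
# $1 = file path, $2 = password (assumed to be read from the first line).
make_passwd_file() {
    # Remove any old copy so the file is created fresh under the umask below.
    rm -f "$1" || return 1
    # The subshell keeps umask 077 local; new files get rw------- permissions.
    ( umask 077 && printf '%s\n' "$2" > "$1" )
}

# Usage (run as the user NRPE runs as):
# make_passwd_file /opt/csw/libexec/nagios-plugins/ipmitool/.passwd.alom 'ILOM-password'
```

The file has to be readable by the user that NRPE runs the plug-in as, so create it as that user or change its owner afterwards.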
#!/usr/bin/sh
#set -x
# Nagios plugin : determine ambient temperature around a server
# by zdudic
# -- supported systems
# Sun Enterprise T5240 and SunFire X4200/X4500
# Nagios plugin return values
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
STATE_DEPENDENT=4
# variables
HOST=$1
WARNTEMP=$2
CRITTEMP=$3
ILOMUSER=admin
PASSWDFILE=/opt/csw/libexec/nagios-plugins/ipmitool/.passwd.alom
# Function : print a one-line error and exit UNKNOWN (exit 1 would mean WARNING)
err() {
    echo "ERROR: $*"
    exit ${STATE_UNKNOWN}
}
# check if arguments are provided (hostname, warning, critical temperature)
if [ $# -ne 3 ]
then
    echo "USAGE : `basename $0` hostname warn_tmp(C) crit_tmp(C)"
    exit ${STATE_UNKNOWN}
fi
# check if critical temp is higher than warning
if [ "${WARNTEMP}" -ge "${CRITTEMP}" ]
then
    echo "NOTE : Critical temperature must be higher than warning temperature."
    exit ${STATE_UNKNOWN}
fi
# Function: end script with output, with performance data for NagiosGrapher
endscript () {
    echo "${RESULT} | PerfData=${TEMP};${WARNTEMP};${CRITTEMP}"
    exit ${EXIT_STATUS}
}
# Function: run ipmitool against the ILOM with the stored credentials
ipmi () {
    ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} "$@"
}
# Function: print the reading (5th "|" field) of the named temperature sensor
sensor_temp () {
    ipmi sdr type temperature | grep "$1" \
        | nawk -F"|" '{print $5}' | nawk '{print $1}'
}
# Function: compare TEMP with the thresholds, set RESULT and EXIT_STATUS
check_temp () {
    # an empty or non-integer reading would break the numeric tests below
    case "${TEMP}" in
        ''|*[!0-9]*) err "Cannot read ambient temperature from ${ILOMNAME}" ;;
    esac
    if [ ${TEMP} -le ${WARNTEMP} ]
    then
        RESULT="Host: ${HOST} : Ambient Temp(C): ${TEMP} : OK"
        EXIT_STATUS=${STATE_OK}
    elif [ ${TEMP} -le ${CRITTEMP} ]
    then
        RESULT="Host: ${HOST} : Ambient Temp(C): ${TEMP} : WARNING"
        EXIT_STATUS=${STATE_WARNING}
    else
        RESULT="Host: ${HOST} : Ambient Temp(C): ${TEMP} : CRITICAL"
        EXIT_STATUS=${STATE_CRITICAL}
    fi
}
# find if ilom name has -alom or .alom (hostname-alom or hostname.alom)
if host ${HOST}.alom > /dev/null 2>&1
then
    ILOMNAME=${HOST}.alom
else
    ILOMNAME=${HOST}-alom
fi
# the pipeline's exit status is nawk's, so test the result instead of using ||
PNAME=`ipmi fru | head | grep "Product Name" \
    | nawk -F":" '{print $2}' | nawk '{print $1}'`
[ -n "${PNAME}" ] || err "Cannot find what system type ${HOST} is"
case ${PNAME} in
    T5240)
        TEMP=`sensor_temp T_AMB`
        ;;
    ILOM)
        # can be X4500 or X4200
        BOARD=`ipmi fru | head | grep "Board Product" \
            | nawk -F"ASSY,SERV PROCESSOR," '{print $2}' | nawk '{print $1}'`
        case ${BOARD} in
            G1/2)  TEMP=`sensor_temp fp.t_amb` ;;   # X4200
            X4500) TEMP=`sensor_temp dbp.t_amb` ;;  # X4500
            *)     err "Cannot find what Board Product is (got '${BOARD}')" ;;
        esac
        ;;
    *)
        err "Unsupported system type: ${PNAME}"
        ;;
esac
# compare with thresholds, then provide output and nagios return value
check_temp
endscript
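The temperature is taken from the fifth "|"-separated column of ipmitool sdr type temperature. The sketch below isolates that parsing so it can be tried without an ILOM. The sample text is made up and only mimics the five-column layout the script expects, and it uses plain awk, which behaves like nawk for this job. reading_of is a hypothetical name.

```sh
# Print the reading of one sensor from "ipmitool sdr type temperature" output
# on stdin: take the line with the sensor name, keep the 5th "|" field
# (e.g. " 24 degrees C") and print its first word.
reading_of() {
    grep "$1" | awk -F'|' '{print $5}' | awk '{print $1}'
}

# Illustrative sample only; real sensor names, IDs and values differ.
SAMPLE='T_AMB            | 0Ah | ok  | 12.0 | 24 degrees C
T_CPU0           | 0Bh | ok  |  3.0 | 41 degrees C'

printf '%s\n' "$SAMPLE" | reading_of T_AMB   # prints 24
```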
On the machine to be monitored, this executable shell script is installed as /opt/csw/libexec/nagios-plugins/ipmitool/amb_temp.sh.
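This article is not about NRPE, but a short note is needed here: NRPE runs on the monitored machine, executes the plug-in when the Nagios server asks for it, and passes the plug-in's output and exit code back.
That is how Nagios learns the state of the resource, such as OK or CRITICAL, without caring what the resource is.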
So, on the monitored machine, add this line to nrpe.cfg (the configuration file of the cswnrpe service). Because the command takes an argument ($ARG1$), nrpe.cfg must also allow command arguments (dont_blame_nrpe=1).
# plugin for ambient temperature
command[check_amb_temp]=/opt/csw/libexec/nagios-plugins/ipmitool/amb_temp.sh $ARG1$
On the Nagios server, define the service, for example:
define servicegroup{
servicegroup_name amb_temp_mvo
alias MVO Ambient Temperature
}
define service{
use gen-service ; Name of service template to use
host_name srv-1,srv-2
servicegroups amb_temp_mvo
service_description MVO Ambient Temperature
# The quoted "$HOSTNAME$ 25 27" is one argument ($ARG1$), but the plug-in receives it as 3: hostname, warning and critical temperature
check_command check-nrpe!check_amb_temp!"$HOSTNAME$ 25 27" -t 60
}
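The comment in the service definition relies on shell word splitting. Assuming NRPE pastes $ARG1$ into the command line from nrpe.cfg and runs it through /bin/sh, the single value "srv-1 25 27" reaches the plug-in as three arguments. show_args is a hypothetical stand-in for amb_temp.sh:

```sh
# Hypothetical stand-in for amb_temp.sh: report the arguments it received.
show_args() {
    echo "argc=$#"
    for a in "$@"; do
        echo "arg=$a"
    done
}

ARG1='srv-1 25 27'   # the single value that ends up in $ARG1$

show_args "$ARG1"    # quoted: one argument (argc=1)
show_args $ARG1      # unquoted, as in the nrpe.cfg command line: argc=3
```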
There are many solutions for graphing Nagios data; one of them is NagiosGrapher from Netways. I won't cover how to set it up, but here, in short, is how to configure a graph for this plug-in.
The script's endscript function, which returns the result to Nagios, also prints performance data after the "|" separator. That is what NagiosGrapher needs.
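Nagios splits plug-in output at the first "|": the text before it is the status shown to users, and the text after it is performance data in the form label=value;warn;crit. A small sketch (parse_perfdata is a hypothetical helper) that pulls those fields out of a line in the format endscript prints:

```sh
# Split the performance-data part of a plug-in output line into its fields.
parse_perfdata() {
    # everything after the first "|", without the leading space
    perf=$(printf '%s\n' "$1" | cut -d'|' -f2- | sed 's/^ *//')
    label=$(printf '%s\n' "$perf" | cut -d= -f1)
    fields=$(printf '%s\n' "$perf" | cut -d= -f2)      # e.g. 24;25;27
    value=$(printf '%s\n' "$fields" | cut -d';' -f1)
    warn=$(printf '%s\n' "$fields" | cut -d';' -f2)
    crit=$(printf '%s\n' "$fields" | cut -d';' -f3)
    echo "label=$label value=$value warn=$warn crit=$crit"
}

parse_perfdata 'Host: srv-1 : Ambient Temp(C): 24 : OK | PerfData=24;25;27'
# prints: label=PerfData value=24 warn=25 crit=27
```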
After installing NagiosGrapher, look in the ngraph.d directory. Say I monitor the ambient temperature of 2 servers in the Mountain View (MVO) server room; the NagiosGrapher configuration file is:
#NagiosGrapherTemplate for check_amb_temp
# ---------- Help ------------------------------------
# service_name =
# regular expression used to identify the service
#
# graph_perf_regex =
# regular expression used to find the value in the performance data;
# the value must be captured in round brackets ()
#
# graph_value = variable name in the RRD database, no spaces
#
# graph_units = units on the Y axis (the X axis is time)
#
# graph_legend = legend for the variable, shown under the graph
#
# page = optional
#
# rrd_plottype = LINE1 is a thin line (LINE2 is thicker), AREA is a filled area
#
# -----------------------------------------------
# Amb Temp in MVO
define ngraph{
service_name MVO Ambient Temperature
graph_perf_regex PerfData=([0-9]*)
graph_value amb_temp
graph_units C
graph_legend MVO Ambient Temperature
graph_upper_limit 30
graph_lower_limit 15
rrd_plottype LINE2
rrd_color FF9900 # orange
}
# AVERAGE of ambient temperature
define ngraph{
service_name MVO Ambient Temperature
type VDEF
graph_value vdef_amb_temp_average
graph_legend Amb temp Average
graph_calc amb_temp,AVERAGE
rrd_plottype LINE1
rrd_color 0000ff
hide no
}
define ngraph{
service_name MVO Ambient Temperature
# HRULE draws horizontal line
type HRULE
hrule_value 25
rrd_color FF0000:Warning level # red
}
define ngraph{
service_name MVO Ambient Temperature
type HRULE
hrule_value 27
rrd_color 000000:Critical level # black
}
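The first definition's graph_perf_regex captures the number right after PerfData=. The sketch below emulates that capture with sed, assuming NagiosGrapher applies the pattern as an unanchored search (this basic regex behaves the same for such a simple pattern). It also shows that [0-9]* stops at a decimal point, which only matters if a plug-in ever reports fractional readings. capture is a hypothetical name.

```sh
# Emulate graph_perf_regex "PerfData=([0-9]*)" with a POSIX basic regex:
# print whatever the bracketed group would capture from an output line.
capture() {
    printf '%s\n' "$1" | sed -n 's/.*PerfData=\([0-9]*\).*/\1/p'
}

capture 'Host: srv-1 : Ambient Temp(C): 24 : OK | PerfData=24;25;27'   # 24
capture 'PerfData=24.5;25;27'   # 24: [0-9]* stops at the decimal point
```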
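The two HRULE definitions repeat the 25 and 27 degrees that check_command passes as warning and critical thresholds, so changing the thresholds means editing the graph file as well. A hypothetical helper, hrule_def, that prints HRULE blocks like the ones above from the same values:

```sh
# Hypothetical helper: print one HRULE ngraph definition.
# $1 = service name, $2 = value, $3 = color, $4 = legend text
hrule_def() {
    printf 'define ngraph{\n    service_name %s\n    type HRULE\n    hrule_value %s\n    rrd_color %s:%s\n}\n' \
        "$1" "$2" "$3" "$4"
}

# Warning and critical lines from the same values used in check_command.
WARN=25
CRIT=27
hrule_def 'MVO Ambient Temperature' "$WARN" FF0000 'Warning level'
hrule_def 'MVO Ambient Temperature' "$CRIT" 000000 'Critical level'
```

The output can be pasted into the ngraph.d configuration file, or generated into it whenever the thresholds change.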
Here is the weekly graph. Besides this one, you also get current, daily, monthly and yearly graphs.
There is also a multigraph if you want to compare a service across several systems. For example, I compare the ambient temperature of 6 systems:
# NOTE : it is nmgraph, not ngraph
# -------------------------------
define nmgraph{
host_name Multigraph
service_name .* DCO.* Ambient Temperature
# RegEX
hosts [a-zA-Z]+
# RegEX
services .* DCO.* Ambient Temperature
# This matches 'graph_value' from the ngraph definition
graph_values amb_temp
# line or stack or area
graph_type LINE2
colors f0e68c,fff000,cd5c5c,ffa500,ff0000,ff1493
}
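Both service_name and services are regular expressions, and ".* DCO.* Ambient Temperature" requires a space before DCO. A quick way to check which service names a pattern selects is grep -E; the service names below are hypothetical, and it is an assumption that NagiosGrapher treats the pattern as an unanchored search.

```sh
# Succeed if the service name matches the nmgraph "services" pattern.
matches_services() {
    printf '%s\n' "$1" | grep -Eq '.* DCO.* Ambient Temperature'
}

for s in 'Lab DCO Ambient Temperature' 'DCO Ambient Temperature' 'MVO Ambient Temperature'; do
    if matches_services "$s"; then
        echo "match:    $s"
    else
        echo "no match: $s"
    fi
done
```

Here only the first name matches: the second has no space before DCO, and the third has no DCO at all.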
And the graph is:
