Nagios

nagios-plugins

http://www.linux.com/learn/tutorials/316105:setting-up-email-alerts-for-network-monitoring-with-nagios - printed
http://www.thegeekstuff.com/2009/06/4-steps-to-define-nagios-contacts-with-email-and-pager-notification/ - printed
http://www.kartar.net/2013/01/monitoring-sucks---a-rant/ - printed
http://www.stackdriver.com/why-monitoring-doesnt-have-to-suck/ - printed

http://www.linuxfunda.com/2013/04/02/steps-to-configure-nagiosgraph-with-nagios-core/
http://docs.pnp4nagios.org/pnp-0.6/start
https://serverfault.com/questions/115911/pnp4nagios-nagiosgraph-separate-cacti-or-something-else-for-nagios-trending

Misc
Nagios
http://www.codewalkers.com/c/a/Server-Administration/Monitoring-Temperatures-with-Cacti/
http://www.enterprisenetworkingplanet.com/netsysm/article.php/3605536/Cacti-SNMP-Monitoring-Without-All-the-Prickles.htm
http://www.cacti.net/
http://lists.mysql.com/mysql/210982
http://nagios.frank4dd.com/howto/apache-session-monitoring-nagios.htm
http://www.cyberciti.biz/tips/top-linux-monitoring-tools.html

nagios-writing-plugin
NRPE

statsd
Graphios

How to verify your configuration files?

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Where is the installation directory?

/usr/lib64/nagios/plugins/

find / -name nagios

How to set up a check that only runs during a certain time frame or period?

First, configure the time period in timeperiods.cfg:

define timeperiod{
        timeperiod_name operational-hours
        alias           Operational hours when machines are not subjected to log rotation and backup
        sunday          07:00-24:00
        monday          07:00-24:00
        tuesday         07:00-24:00
        wednesday       07:00-24:00
        thursday        07:00-24:00
        friday          07:00-24:00
        saturday        07:00-24:00
        }

and then use the configured time period in your services.cfg:

define service{
        use                             production_clinicalcafe
        name                            check-mysql-backup
        service_description             MySQL Backup
        host_name                       pgsql1
        contact_groups                  admins
        check_command                   check_nrpe!check_mysql_backup
        check_period                    operational-hours
        notification_period             operational-hours
}

How to set up a check that only runs once every day or once every hour?

In my nagios.cfg file, I have:

interval_length=60

This means that each time interval is 60 seconds. To set up a check that runs only once every day, make sure that the service definition for the check contains:

normal_check_interval    1440

To set up a check that runs once every hour, make sure that the service definition for the check contains:

normal_check_interval    60
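The arithmetic behind these two values can be checked with a short Python sketch. The function name check_period_seconds is made up for illustration; only interval_length and the check interval come from the configuration above.

```python
def check_period_seconds(interval_length: int, check_interval: int) -> int:
    """Return the real time between checks, in seconds.

    interval_length is the nagios.cfg setting (seconds per time unit);
    check_interval is the number of time units between checks.
    """
    return interval_length * check_interval


# With interval_length=60, one time unit is one minute.
print(check_period_seconds(60, 1440))  # 86400 seconds = 24 hours
print(check_period_seconds(60, 60))    # 3600 seconds = 1 hour
```

If interval_length were changed to something other than 60, the interval values above would have to be rescaled to keep the same real-world schedule.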

What are the rules for defining variables?

  1. Variable names must begin at the start of the line - no white space is allowed before the name
  2. Variable names are case-sensitive

How to start nagios?

/etc/rc.d/init.d/nagios start
/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

How to stop nagios?

/etc/rc.d/init.d/nagios stop
kill <nagios_pid>

How to reload nagios?

/etc/rc.d/init.d/nagios reload
kill -HUP <nagios_pid>

Objects are all the elements that are involved in the monitoring and notification logic. Types of objects include:

  • Services
  • Service Groups
  • Hosts
  • Host Groups
  • Contacts
  • Contact Groups
  • Commands
  • Time Periods
  • Notification Escalations
  • Notification and Execution Dependencies

Hosts are one of the central objects in the monitoring logic. Important attributes of hosts are as follows:

  • Hosts are usually physical devices on your network (servers, workstations, routers, switches, printers, etc).
  • Hosts have an address of some kind (e.g. an IP or MAC address).
  • Hosts have one or more services associated with them.
  • Hosts can have parent/child relationships with other hosts, often representing real-world network connections.

When Are Host Checks Performed? Hosts are checked by the Nagios daemon:

  • At regular intervals, as defined by the check_interval and retry_interval options in your host definitions.
  • On-demand when a service associated with the host changes state.
  • On-demand as needed as part of the host reachability logic.
  • On-demand as needed for predictive host dependency checks.

Regularly scheduled host checks are optional. If you set the check_interval option in your host definition to zero (0), Nagios will not perform checks of the hosts on a regular basis. It will, however, still perform on-demand checks of the host as needed for other parts of the monitoring logic.

On-demand checks are made when a service associated with the host changes state because Nagios needs to know whether the host has also changed state. Services that change state are often an indicator that the host may have also changed state. For example, if Nagios detects that the HTTP service associated with a host just changed from a CRITICAL to an OK state, it may indicate that the host just recovered from a reboot and is now back up and running.

On-demand checks of hosts are also made as part of the host reachability logic. Nagios is designed to detect network outages as quickly as possible, and distinguish between DOWN and UNREACHABLE host states. These are very different states and can help an admin quickly locate the cause of a network outage.

On-demand checks are also performed as part of the predictive host dependency check logic. These checks help ensure that the dependency logic is as accurate as possible.

The performance of on-demand host checks can be significantly improved by implementing the use of cached checks, which allow Nagios to forgo executing a host check if it determines a relatively recent check result will do instead. See the Nagios documentation on cached checks for details.

You can define host execution dependencies that prevent Nagios from checking the status of a host depending on the state of one or more other hosts. See the Nagios documentation on host and service dependencies for details.

Scheduled host checks are run in parallel. When Nagios needs to run a scheduled host check, it will initiate the host check and then return to doing other work (running service checks, etc). The host check runs in a child process that was fork()ed from the main Nagios daemon. When the host check has completed, the child process will inform the main Nagios process (its parent) of the check results. The main Nagios process then handles the check results and takes appropriate action (running event handlers, sending notifications, etc.).

On-demand host checks are also run in parallel if needed. As mentioned earlier, Nagios can forgo the actual execution of an on-demand host check if it can use the cached results from a relatively recent host check.

When Nagios processes the results of scheduled and on-demand host checks, it may initiate (secondary) checks of other hosts. These checks can be initiated for two reasons: predictive dependency checks and determining the status of the host using the network reachability logic. The secondary checks that are initiated are usually run in parallel. However, there is one big exception that you should be aware of, as it can have a negative effect on performance…

Hosts which have their max_check_attempts value set to 1 can cause serious performance problems. The reason? If Nagios needs to determine their true state using the network reachability logic (to see if they're DOWN or UNREACHABLE), it will have to launch serial checks of all of the host's immediate parents. Just to reiterate, those checks are run serially, rather than in parallel, so it can cause a big performance hit. For this reason, I would recommend that you always use a value greater than 1 for the max_check_attempts directives in your host definitions.

Hosts that are checked can be in one of three different states:

  • UP
  • DOWN
  • UNREACHABLE

If the preliminary host state is DOWN, Nagios will attempt to see if the host is really DOWN or if it is UNREACHABLE. The distinction between DOWN and UNREACHABLE host states is important, as it allows admins to determine the root cause of network outages faster. Nagios makes the final state determination based on the state of the host's parent(s): if at least one parent is UP, the host is DOWN; if all parents are DOWN or UNREACHABLE, the host is UNREACHABLE. A host's parents are defined in the parents directive in the host definition.
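The DOWN versus UNREACHABLE decision can be sketched in Python as follows. This is only an illustration of the rule, not the Nagios implementation; the function name final_host_state is hypothetical, and treating a host with no parents as DOWN is an assumption made here (such a host has nothing between it and the monitoring server that could be blamed instead).

```python
from typing import Sequence


def final_host_state(preliminary: str, parent_states: Sequence[str]) -> str:
    """Decide the final host state from the preliminary state and parent states."""
    if preliminary != "DOWN":
        # Only a preliminary DOWN result needs the reachability logic.
        return preliminary
    if not parent_states:
        # Assumption: no parents defined, so the host itself is the problem.
        return "DOWN"
    if any(state == "UP" for state in parent_states):
        # At least one path to the host works, so the host itself is DOWN.
        return "DOWN"
    # Every parent is DOWN or UNREACHABLE: the host cannot be reached at all.
    return "UNREACHABLE"


print(final_host_state("DOWN", ["UP", "DOWN"]))           # DOWN
print(final_host_state("DOWN", ["DOWN", "UNREACHABLE"]))  # UNREACHABLE
```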

As you are probably well aware, hosts don't always stay in one state. Things break, patches get applied, and servers need to be rebooted. When Nagios checks the status of hosts, it will be able to detect when a host changes between UP, DOWN, and UNREACHABLE states and take appropriate action. These state changes result in different state types (HARD or SOFT), which can trigger event handlers to be run and notifications to be sent out. Detecting and dealing with state changes is what Nagios is all about.

When hosts change state too frequently they are considered to be "flapping". A good example of a flapping host would be a server that keeps spontaneously rebooting as soon as the operating system loads. That's always a fun scenario to have to deal with. Nagios can detect when hosts start flapping, and can suppress notifications until flapping stops and the host's state stabilizes. See the Nagios documentation on flap detection for details.

When Are Service Checks Performed? Services are checked by the Nagios daemon:

  • At regular intervals, as defined by the check_interval and retry_interval options in your service definitions.
  • On-demand as needed for predictive service dependency checks.

On-demand checks are performed as part of the predictive service dependency check logic. These checks help ensure that the dependency logic is as accurate as possible. If you don't make use of service dependencies, Nagios won't perform any on-demand service checks.

The performance of on-demand service checks can be significantly improved by implementing the use of cached checks, which allow Nagios to forgo executing a service check if it determines a relatively recent check result will do instead. Cached checks will only provide a performance increase if you are making use of service dependencies. See the Nagios documentation on cached checks for details.

You can define service execution dependencies that prevent Nagios from checking the status of a service depending on the state of one or more other services. See the Nagios documentation on host and service dependencies for details.

Scheduled service checks are run in parallel. When Nagios needs to run a scheduled service check, it will initiate the service check and then return to doing other work (running host checks, etc). The service check runs in a child process that was fork()ed from the main Nagios daemon. When the service check has completed, the child process will inform the main Nagios process (its parent) of the check results. The main Nagios process then handles the check results and takes appropriate action (running event handlers, sending notifications, etc.).

On-demand service checks are also run in parallel if needed. As mentioned earlier, Nagios can forgo the actual execution of an on-demand service check if it can use the cached results from a relatively recent service check.

Services that are checked can be in one of four different states:

  • OK
  • WARNING
  • UNKNOWN
  • CRITICAL

Service checks are performed by plugins, which can return a state of OK, WARNING, UNKNOWN, or CRITICAL. These plugin states directly translate to service states. For example, a plugin which returns a WARNING state will cause a service to have a WARNING state.
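A plugin reports its state through its exit code. The mapping below (0, 1, 2, 3) is the standard plugin convention; the Python helper around it is just a sketch with a made-up name, and it deliberately does not model how Nagios treats exit codes outside that range.

```python
# Standard plugin exit codes and the service states they translate to.
PLUGIN_STATES = {
    0: "OK",
    1: "WARNING",
    2: "CRITICAL",
    3: "UNKNOWN",
}


def service_state_from_exit_code(exit_code: int) -> str:
    """Translate a plugin exit code into a service state name."""
    if exit_code not in PLUGIN_STATES:
        # Out-of-range codes are not modelled in this sketch.
        raise ValueError(f"unexpected plugin exit code: {exit_code}")
    return PLUGIN_STATES[exit_code]


print(service_state_from_exit_code(1))  # WARNING
```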

Active check vs Passive check:

Nagios is capable of monitoring hosts and services in two ways: actively and passively.

Active checks are the most common method for monitoring hosts and services. The main features of active checks are as follows:

  • Active checks are initiated by the Nagios process
  • Active checks are run on a regularly scheduled basis

Active checks are initiated by the check logic in the Nagios daemon. When Nagios needs to check the status of a host or service it will execute a plugin and pass it information about what needs to be checked. The plugin will then check the operational state of the host or service and report the results back to the Nagios daemon. Nagios will process the results of the host or service check and take appropriate action as necessary (e.g. send notifications, run event handlers, etc).

Active checks are executed:

  • At regular intervals, as defined by the check_interval and retry_interval options in your host and service definitions
  • On-demand as needed

Regularly scheduled checks occur at intervals equaling either the check_interval or the retry_interval in your host or service definitions, depending on what type of state the host or service is in. If a host or service is in a HARD state, it will be actively checked at intervals equal to the check_interval option. If it is in a SOFT state, it will be checked at intervals equal to the retry_interval option.

On-demand checks are performed whenever Nagios sees a need to obtain the latest status information about a particular host or service. For example, when Nagios is determining the reachability of a host, it will often perform on-demand checks of parent and child hosts to accurately determine the status of a particular network segment. On-demand checks also occur in the predictive dependency check logic in order to ensure Nagios has the most accurate status information.

The key features of passive checks are as follows:

  • Passive checks are initiated and performed by external applications/processes
  • Passive check results are submitted to Nagios for processing

Passive checks are useful for monitoring services that are:

  • Asynchronous in nature and cannot be monitored effectively by polling their status on a regularly scheduled basis
  • Located behind a firewall and cannot be checked actively from the monitoring host

Examples of asynchronous services that lend themselves to being monitored passively include SNMP traps and security alerts. You never know how many (if any) traps or alerts you'll receive in a given time frame, so it's not feasible to just monitor their status every few minutes.

Passive checks are also used when configuring distributed or redundant monitoring installations.

More information: Passive Checks

State Types:

In order to prevent false alarms from transient problems, Nagios allows you to define how many times a service or host should be (re)checked before it is considered to have a "real" problem. This is controlled by the max_check_attempts option in the host and service definitions.

Soft states occur when a service or host check results in a non-OK or non-UP state and the check has not yet been (re)checked the number of times specified by the max_check_attempts directive in the service or host definition. This is called a soft error. The following things occur when hosts or services experience SOFT state changes:

  • The SOFT state is logged.
  • Event handlers are executed to handle the SOFT state.

SOFT states are only logged if you enabled the log_service_retries or log_host_retries options in your main configuration file.

The only important thing that really happens during a soft state is the execution of event handlers. Using event handlers can be particularly useful if you want to try and proactively fix a problem before it turns into a HARD state. The $HOSTSTATETYPE$ or $SERVICESTATETYPE$ macros will have a value of "SOFT" when event handlers are executed, which allows your event handler scripts to know when they should take corrective action. Event handlers are covered in more detail in the Event Handlers section below.

Hard states occur for hosts and services in the following situations:

  • When a host or service check results in a non-UP or non-OK state and it has been (re)checked the number of times specified by the max_check_attempts option in the host or service definition. This is a hard error state.
  • When a host or service transitions from one hard error state to another error state (e.g. WARNING to CRITICAL).
  • When a service check results in a non-OK state and its corresponding host is either DOWN or UNREACHABLE.
  • When a passive host check is received. Passive host checks are treated as HARD unless the passive_host_checks_are_soft option is enabled.

The following things occur when hosts or services experience HARD state changes:

  • The HARD state is logged.
  • Event handlers are executed to handle the HARD state.
  • Contacts are notified of the host or service problem or recovery.
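The SOFT/HARD rules for problem states can be summarised in a small Python sketch. The function names are hypothetical, recoveries and passive host checks are left out, and this is a simplified model rather than the actual Nagios logic.

```python
from typing import List


def problem_state_type(current_attempt: int, max_check_attempts: int,
                       host_is_up: bool = True) -> str:
    """Classify a non-OK / non-UP check result as SOFT or HARD."""
    if not host_is_up:
        # A non-OK service on a DOWN or UNREACHABLE host is HARD immediately.
        return "HARD"
    if current_attempt >= max_check_attempts:
        return "HARD"
    return "SOFT"


def actions_for(state_type: str, log_retries: bool = False) -> List[str]:
    """List what happens on a state change of the given type."""
    if state_type == "SOFT":
        # SOFT states are logged only if log_service_retries / log_host_retries is on.
        logged = ["log state"] if log_retries else []
        return logged + ["run event handlers"]
    return ["log state", "run event handlers", "notify contacts"]


print(problem_state_type(1, 3))  # SOFT
print(problem_state_type(3, 3))  # HARD
print(actions_for("SOFT"))       # ['run event handlers']
```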

Notifications:

The decision to send out notifications is made in the service check and host check logic. Host and service notifications occur in the following instances:

  • When a hard state change occurs.
  • When a host or service remains in a hard non-OK state and the time specified by the <notification_interval> option in the host or service definition has passed since the last notification was sent out (for that specified host or service).

Each host and service definition has a <contact_groups> option that specifies what contact groups receive notifications for that particular host or service. Contact groups can contain one or more individual contacts.

When Nagios sends out a host or service notification, it will notify each contact that is a member of any contact groups specified in the <contact_groups> option of the host or service definition. Nagios realizes that a contact may be a member of more than one contact group, so it removes duplicate contact notifications before it does anything.

Just because there is a need to send out a host or service notification doesn't mean that any contacts are going to get notified. There are several filters that potential notifications must pass before they are deemed worthy enough to be sent out. Even then, specific contacts may not be notified if their notification filters do not allow for the notification to be sent to them.

The first filter that notifications must pass is a test of whether or not notifications are enabled on a program-wide basis. This is initially determined by the enable_notifications directive in the main config file, but may be changed during runtime from the web interface. If notifications are disabled on a program-wide basis, no host or service notifications can be sent out - period. If they are enabled on a program-wide basis, there are still other tests that must be passed.

The first filter for host or service notifications is a check to see if the host or service is in a period of scheduled downtime. If it is in a scheduled downtime, no one gets notified. If it isn't in a period of downtime, it gets passed on to the next filter. As a side note, notifications for services are suppressed if the host they're associated with is in a period of scheduled downtime.

The second filter for host or service notification is a check to see if the host or service is flapping (if you enabled flap detection). If the service or host is currently flapping, no one gets notified. Otherwise it gets passed to the next filter.

The third host or service filter that must be passed is the host- or service-specific notification options. Each service definition contains options that determine whether or not notifications can be sent out for warning states, critical states, and recoveries. Similarly, each host definition contains options that determine whether or not notifications can be sent out when the host goes down, becomes unreachable, or recovers. If the host or service notification does not pass these options, no one gets notified. If it does pass these options, the notification gets passed to the next filter… Note: Notifications about host or service recoveries are only sent out if a notification was sent out for the original problem. It doesn't make sense to get a recovery notification for something you never knew was a problem.

The fourth host or service filter that must be passed is the time period test. Each host and service definition has a <notification_period> option that specifies which time period contains valid notification times for the host or service. If the time that the notification is being made does not fall within a valid time range in the specified time period, no one gets contacted. If it falls within a valid time range, the notification gets passed to the next filter… Note: If the time period filter is not passed, Nagios will reschedule the next notification for the host or service (if it's in a non-OK state) for the next valid time present in the time period. This helps ensure that contacts are notified of problems as soon as possible when the next valid time in the time period arrives.

The last set of host or service filters is conditional upon two things: (1) a notification was already sent out about a problem with the host or service at some point in the past and (2) the host or service has remained in the same non-OK state that it was when the last notification went out. If these two criteria are met, then Nagios will check and make sure the time that has passed since the last notification went out either meets or exceeds the value specified by the <notification_interval> option in the host or service definition. If not enough time has passed since the last notification, no one gets contacted. If either enough time has passed since the last notification or the two criteria for this filter were not met, the notification will be sent out! Whether or not it actually is sent to individual contacts is up to another set of filters.

At this point the notification has passed the program mode filter and all host or service filters and Nagios starts to notify all the people it should. Does this mean that each contact is going to receive the notification? No! Each contact has their own set of filters that the notification must pass before they receive it. Note: Contact filters are specific to each contact and do not affect whether or not other contacts receive notifications.

The first filter that must be passed for each contact is the notification options. Each contact definition contains options that determine whether or not service notifications can be sent out for warning states, critical states, and recoveries. Each contact definition also contains options that determine whether or not host notifications can be sent out when the host goes down, becomes unreachable, or recovers. If the host or service notification does not pass these options, the contact will not be notified. If it does pass these options, the notification gets passed to the next filter… Note: Notifications about host or service recoveries are only sent out if a notification was sent out for the original problem. It doesn't make sense to get a recovery notification for something you never knew was a problem…

The last filter that must be passed for each contact is the time period test. Each contact definition has a <notification_period> option that specifies which time period contains valid notification times for the contact. If the time that the notification is being made does not fall within a valid time range in the specified time period, the contact will not be notified. If it falls within a valid time range, the contact gets notified!
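The filter chain described above can be sketched in Python. This is a simplified model for reading along, not Nagios code: all function and parameter names are hypothetical, and each filter is reduced to a boolean that the caller supplies.

```python
from typing import Optional


def first_blocking_filter(
    notifications_enabled: bool = True,
    in_scheduled_downtime: bool = False,
    is_flapping: bool = False,
    state_allowed_by_options: bool = True,
    in_notification_period: bool = True,
    already_notified_same_state: bool = False,
    minutes_since_last: float = 0.0,
    notification_interval: float = 60.0,
) -> Optional[str]:
    """Return the name of the first filter that blocks, or None if all pass."""
    if not notifications_enabled:
        return "program-wide"
    if in_scheduled_downtime:
        return "scheduled downtime"
    if is_flapping:
        return "flapping"
    if not state_allowed_by_options:
        return "notification options"
    if not in_notification_period:
        return "notification period"
    # The interval test applies only to repeat notifications for the same state.
    if already_notified_same_state and minutes_since_last < notification_interval:
        return "notification interval"
    return None


def contact_is_notified(contact_allows_state: bool, in_contact_period: bool) -> bool:
    """Per-contact filters: notification options, then the contact's time period."""
    return contact_allows_state and in_contact_period


print(first_blocking_filter())                  # None
print(first_blocking_filter(is_flapping=True))  # flapping
```

A notification reaches a given contact only when first_blocking_filter returns None and contact_is_notified is true for that contact; one contact failing its own filters does not affect the others.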

You can have Nagios notify you of problems and recoveries pretty much any way you want: pager, cellphone, email, instant message, audio alert, electric shocker, etc. How notifications are sent depends on the notification commands that are defined in your object definition files.

More information: Notifications

Event Handlers

Event handlers are optional system commands (scripts or executables) that are run whenever a host or service state change occurs. An obvious use for event handlers is the ability for Nagios to proactively fix problems before anyone is notified. Some other uses for event handlers include:

  • Restarting a failed service
  • Entering a trouble ticket into a helpdesk system
  • Logging event information to a database
  • Cycling power on a host. Cycling power on a host that is experiencing problems with an automated script should not be implemented lightly. Consider the consequences of this carefully before implementing automatic reboots.

Event handlers are executed when a service or host:

  • Is in a SOFT problem state
  • Initially goes into a HARD problem state
  • Initially recovers from a SOFT or HARD problem state

There are different types of optional event handlers that you can define to handle host and service state changes:

  • Global host event handler
  • Global service event handler
  • Host-specific event handlers
  • Service-specific event handlers

Global host and service event handlers are run for every host or service state change that occurs, immediately prior to any host- or service-specific event handler that may be run. You can specify global event handler commands by using the global_host_event_handler and global_service_event_handler options in your main configuration file.

Individual hosts and services can have their own event handler command that should be run to handle state changes. You can specify an event handler that should be run by using the event_handler directive in your host and service definitions. These host- and service-specific event handlers are executed immediately after the (optional) global host or service event handler is executed.

Event handlers can be enabled or disabled on a program-wide basis by using the enable_event_handlers option in your main configuration file.

Host- and service-specific event handlers can be enabled or disabled by using the event_handler_enabled directive in your host and service definitions. Host- and service-specific event handlers will not be executed if the global enable_event_handlers option is disabled.

Global host and service event handlers are executed immediately before host- or service-specific event handlers. Event handlers are executed for HARD problem and recovery states immediately after notifications are sent out.

Event handler commands will likely be shell or perl scripts, but they can be any type of executable that can run from a command prompt. At a minimum, the scripts should take the following macros as arguments:

For Services: $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$
For Hosts: $HOSTSTATE$, $HOSTSTATETYPE$, $HOSTATTEMPT$

Event handler commands will normally execute with the same permissions as the user under which Nagios is running on your machine. This can present a problem if you want to write an event handler that restarts system services, as root privileges are generally required to do these sorts of tasks. Ideally you should evaluate the types of event handlers you will be implementing and grant just enough permissions to the Nagios user for executing the necessary system commands. You might want to try using sudo to accomplish this.

It should be noted that the event handler will only be executed the first time that the service falls into a HARD problem state. This prevents Nagios from continuously executing the handler script (for example, one that restarts a web server) if the service remains in a HARD problem state.

Event handlers are commands, so they need to be defined using 'define command', and you can use macros with event handlers.
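As an illustration of how a handler script might use the three service macros, here is a Python sketch of the decision logic only. The policy (act on the third SOFT attempt of a CRITICAL state, or when the state goes HARD) and the name should_restart are assumptions made for this example; the actual restart command is left out on purpose.

```python
def should_restart(state: str, state_type: str, attempt: int,
                   soft_attempt_to_act: int = 3) -> bool:
    """Decide whether a handler should try to restart the failed service.

    state, state_type and attempt correspond to $SERVICESTATE$,
    $SERVICESTATETYPE$ and $SERVICEATTEMPT$ passed in as arguments.
    """
    if state != "CRITICAL":
        # OK, WARNING and UNKNOWN do not trigger a restart in this policy.
        return False
    if state_type == "SOFT":
        # Wait a few attempts so a one-off blip does not cause a restart.
        return attempt == soft_attempt_to_act
    # A HARD CRITICAL state: the handler runs once when the state is first reached.
    return state_type == "HARD"


print(should_restart("CRITICAL", "SOFT", 1))  # False
print(should_restart("CRITICAL", "SOFT", 3))  # True
print(should_restart("CRITICAL", "HARD", 4))  # True
```

Acting on a late SOFT attempt gives the handler a chance to fix the problem before the state turns HARD and contacts are notified.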

Freshness Checks: (useful together with "passive checks")

The purpose of freshness checking is to ensure that host and service checks are being provided passively by external applications on a regular basis. Freshness checking is useful when you want to ensure that passive checks are being received as frequently as you want.

Nagios periodically checks the freshness of the results for all hosts and services that have freshness checking enabled.

  • A freshness threshold is calculated for each host or service.
  • For each host/service, the age of its last check result is compared with the freshness threshold.
  • If the age of the last check result is greater than the freshness threshold, the check result is considered "stale".
  • If the check result is found to be stale, Nagios will force an active check of the host or service by executing the command specified by the check_command option in the host or service definition. An active check is executed even if active checks are disabled on a program-wide or host- or service-specific basis.
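The staleness test in the steps above comes down to one comparison. A minimal Python sketch, with a hypothetical function name and all times expressed in seconds:

```python
def is_stale(last_check_time: float, now: float, freshness_threshold: float) -> bool:
    """Return True if the last check result is older than the threshold."""
    age = now - last_check_time
    return age > freshness_threshold


# A nightly backup result with a 26-hour threshold (hypothetical numbers):
threshold = 26 * 3600
print(is_stale(last_check_time=0, now=25 * 3600, freshness_threshold=threshold))  # False
print(is_stale(last_check_time=0, now=27 * 3600, freshness_threshold=threshold))  # True
```

When the result is stale, Nagios forces an active check, so the command configured for that check decides what state the stale service ends up in.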

Here's what you need to do to enable freshness checking…

  • Enable freshness checking on a program-wide basis with the check_service_freshness and check_host_freshness directives.
  • Use service_freshness_check_interval and host_freshness_check_interval options to tell Nagios how often it should check the freshness of service and host results.
  • Enable freshness checking on a host- and service-specific basis by setting the check_freshness option in your host and service definitions to a value of 1.
  • Configure freshness thresholds by setting the freshness_threshold option in your host and service definitions.
  • Configure the check_command option in your host or service definitions to reflect a valid command that should be used to actively check the host or service when it is detected as stale.
  • The check_period option in your host and service definitions is used when Nagios determines when a host or service can be checked for freshness, so make sure it is set to a valid timeperiod.

If you do not specify a host- or service-specific freshness_threshold value (or you set it to zero), Nagios will automatically calculate a threshold based on how often you monitor that particular host or service. I would recommend that you explicitly specify a freshness threshold, rather than let Nagios pick one for you.

An example of a service that might require freshness checking is one that reports the status of your nightly backup jobs. Perhaps you have an external script that submits the results of the backup job to Nagios once the backup is completed. In this case, all of the checks/results for the service are provided by an external application using passive checks. In order to ensure that the status of the backup job gets reported every day, you may want to enable freshness checking for the service. If the external script doesn't submit the results of the backup job, you can have Nagios fake a critical result.

Escalation

Escalation of host and service notifications is accomplished by defining host escalations and service escalations in your object configuration file(s).

Notifications are escalated if and only if one or more escalation definitions match the current notification that is being sent out. If a host or service notification does not have any valid escalation definitions that apply to it, the contact group(s) specified in either the host group or service definition will be used for the notification.

If, after three problem notifications, a recovery notification is sent out for the service, who gets notified? The recovery is actually the fourth notification that gets sent out. However, the escalation code is smart enough to realize that only those people who were notified about the problem on the third notification should be notified about the recovery.

You can change the frequency at which escalated notifications are sent out for a particular host or service by using the notification_interval option of the hostgroup or service escalation definition.

In this example we see that the default notification interval for the services is 240 minutes (this is the value in the service definition). When the service notification is escalated on the 3rd, 4th, and 5th notifications, an interval of 45 minutes will be used between notifications. On the 6th and subsequent notifications, the notification interval will be 60 minutes, as specified in the second escalation definition.

Since it is possible to have overlapping escalation definitions for a particular hostgroup or service, and the fact that a host can be a member of multiple hostgroups, Nagios has to make a decision on what to do as far as the notification interval is concerned when escalation definitions overlap. In any case where there are multiple valid escalation definitions for a particular notification, Nagios will choose the smallest notification interval.

For example, suppose two escalation definitions overlap on the 4th and 5th notifications, one with an interval of 45 minutes and the other with a longer interval. For these notifications, Nagios will use a notification interval of 45 minutes, since it is the smallest interval present in any valid escalation definition for those notifications.
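
The "smallest interval wins" rule can be modelled with a few lines of Python. This is only a sketch of the selection logic described above, not Nagios code. The notification ranges and the 60-minute interval of the second definition are illustrative assumptions, and the special meaning of an interval of 0 is not modelled.

```python
def notification_interval(notification_number, default_interval, escalations):
    """Return the interval (in minutes) to use for the given notification.

    escalations is a list of (first_notification, last_notification,
    interval) tuples; last_notification == 0 means "no upper bound".
    """
    matching = [
        interval
        for first, last, interval in escalations
        if notification_number >= first
        and (last == 0 or notification_number <= last)
    ]
    # No matching escalation: fall back to the service's own interval.
    return min(matching) if matching else default_interval


# Assumed example: notifications 3-5 at 45 minutes, 4 and later at 60 minutes.
ESCALATIONS = [(3, 5, 45), (4, 0, 60)]

for number in range(1, 8):
    print(number, notification_interval(number, 240, ESCALATIONS))
```

With these assumed values, notifications 1 and 2 use the default of 240 minutes, notifications 3 to 5 use 45 minutes, and notification 6 onwards uses 60 minutes.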

An interval of 0 means that Nagios should send a notification only for the first valid notification during that escalation definition. All subsequent notifications for the hostgroup or service will be suppressed.

Under normal circumstances, escalations can be used at any time that a notification could normally be sent out for the host or service. This "notification time window" is determined by the notification_period directive in the host or service definition.

You can optionally restrict escalations so that they are only used during specific time periods by using the escalation_period directive in the host or service escalation definition. If you use the escalation_period directive to specify a timeperiod during which the escalation can be used, the escalation will only be used during that time. If you do not specify any escalation_period directive, the escalation can be used at any time within the "notification time window" for the host or service.

Escalated notifications are still subject to the normal time restrictions imposed by the notification_period directive in a host or service definition, so the timeperiod you specify in an escalation definition should be a subset of that larger "notification time window".

If you would like to restrict the escalation definition so that it is only used when the host or service is in a particular state, you can use the escalation_options directive in the host or service escalation definition. If you do not use the escalation_options directive, the escalation can be used when the host or service is in any state.

Plugin Architecture:

Unlike many other monitoring tools, Nagios does not include any internal mechanisms for checking the status of hosts and services on your network. Instead, Nagios relies on external programs (called plugins) to do all the dirty work. Plugins are compiled executables or scripts (Perl scripts, shell scripts, etc.) that can be run from a command line to check the status of a host or service. Nagios uses the results from plugins to determine the current status of hosts and services on your network. Nagios will execute a plugin whenever there is a need to check the status of a service or host. The plugin does something (notice the very general term) to perform the check and then simply returns the results to Nagios. Nagios will process the results that it receives from the plugin and take any necessary actions (running event handlers, sending out notifications, etc.).

Almost all plugins will display basic usage information when you execute them using '-h' or '--help' on the command line: ./check_http --help
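
To make the plugin contract concrete, here is a minimal sketch of a plugin written in Python. It relies on the standard plugin convention that the exit code carries the status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and that one line of text goes to standard output. The plugin name check_example, the EXAMPLE label and the "higher is worse" threshold logic are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Compare a measured value with thresholds (assumes higher is worse)."""
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"EXAMPLE {STATUS_NAMES[code]} - value is {value}"


def main(argv):
    """Parse VALUE WARN CRIT, print one status line, return the exit code."""
    try:
        value, warn, crit = (float(arg) for arg in argv[1:4])
    except ValueError:
        # Missing or non-numeric arguments: report UNKNOWN, not a traceback.
        print("EXAMPLE UNKNOWN - usage: check_example VALUE WARN CRIT")
        return UNKNOWN
    code, message = evaluate(value, warn, crit)
    print(message)
    return code

# A real plugin file would finish with:
#   import sys; sys.exit(main(sys.argv))
```

Nagios reads the exit code to set the host or service state and shows the printed line as the status information.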

Macros:

One of the main features that make Nagios so flexible is the ability to use macros in command definitions. Macros allow you to reference information from hosts, services, and other sources in your commands. Before Nagios executes a command, it will replace any macros it finds in the command definition with their corresponding values. This macro substitution occurs for all types of commands that Nagios executes: host and service checks, notifications, event handlers, etc. The beauty in this is that you can use a single command definition to check an unlimited number of hosts. Each host can be checked with the same command definition because each host's address is automatically substituted in the command line before execution. Certain macros may themselves contain other macros. These include the $HOSTNOTES$, $HOSTNOTESURL$, $HOSTACTIONURL$, $SERVICENOTES$, $SERVICENOTESURL$, and $SERVICEACTIONURL$ macros.

You can pass arguments to commands as well, which is quite handy if you'd like to keep your command definitions rather generic. Arguments are specified in the object (i.e. host or service) definition, by separating them from the command name with exclamation points (!):

define service{
  host_name        linuxbox
  service_description    PING
  check_command    check_ping!200.0,80%!400.0,40%
}

In the example above, the service check command has two arguments (which can be referenced with $ARGn$ macros). The $ARG1$ macro will be "200.0,80%" and $ARG2$ will be "400.0,40%" (both without quotes). Assuming the host linuxbox has the address 192.168.1.2 and the check_ping command is defined like this:

define command{
  command_name    check_ping
  command_line    /usr/local/nagios/libexec/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$
}

The expanded/final command line to be executed for the service's check command would look like this:

/usr/local/nagios/libexec/check_ping -H 192.168.1.2 -w 200.0,80% -c 400.0,40%

If you need to pass bang (!) characters in your command arguments, you can do so by escaping them with a backslash (\). If you need to include backslashes in your command arguments, they should also be escaped with a backslash.
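
The substitution and escaping rules above can be illustrated with a small Python model. This is a simplified sketch rather than how Nagios itself is implemented: it only knows $HOSTADDRESS$ and $ARGn$, and it leaves any other macro untouched. The function names are made up for this example.

```python
import re


def split_check_command(check_command):
    """Split 'name!arg1!arg2' on unescaped '!' and undo backslash escapes."""
    parts, current, i = [], [], 0
    while i < len(check_command):
        char = check_command[i]
        following = check_command[i + 1:i + 2]
        if char == "\\" and following in ("!", "\\"):
            current.append(following)  # escaped '!' or '\' is kept literally
            i += 2
        elif char == "!":
            parts.append("".join(current))  # unescaped '!' ends an argument
            current = []
            i += 1
        else:
            current.append(char)
            i += 1
    parts.append("".join(current))
    return parts[0], parts[1:]


def expand_macros(command_line, host_address, args):
    """Replace $HOSTADDRESS$ and $ARGn$ in a command_line string."""
    macros = {"HOSTADDRESS": host_address}
    for number, value in enumerate(args, start=1):
        macros[f"ARG{number}"] = value
    return re.sub(
        r"\$([A-Z0-9]+)\$",
        lambda match: macros.get(match.group(1), match.group(0)),
        command_line,
    )


name, args = split_check_command("check_ping!200.0,80%!400.0,40%")
print(expand_macros(
    "/usr/local/nagios/libexec/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$",
    "192.168.1.2",
    args,
))
```

Run against the check_ping example, this prints the same expanded command line shown above.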

By default, Nagios expects the CGI configuration file to be named cgi.cfg and located in the config file directory along with the main config file. If you need to change the name of the file or its location, you can configure Apache to pass an environment variable named NAGIOS_CGI_CONFIG (which points to the correct location) to the CGIs. See the Apache documentation for information on how to do this.

NRPE:

check_pid_file!/var/run/runAQMBatch-s1.pid!3600!5400
check_pid_file!/download/SRM/submitters/submitter4sf-s1/submitter.pid!10800!15800

/usr/lib64/nagios/plugins/check_pid_file -H CRMSuiteApp00 -c check_pid_file -f /var/run/update_abuse.pid -w 21600 -c 22000
/usr/local/nagios/libexec/check_nrpe -H localhost -n

useradd nagios
Compile and install nagios plugins on the remote box
Compile and install nrpe on the remote box

./configure --enable-command-args
make all
cp src/nrpe /usr/local/nagios/libexec
cp src/check_nrpe /usr/local/nagios/libexec
cp sample-config/nrpe.cfg /etc
/usr/local/nagios/libexec/nrpe -c /etc/nrpe.cfg -d

Commands are defined in nrpe.cfg as command[<command_name>]=<command_line>. When the daemon receives a request to return the results of <command_name>, it will execute the command specified by the <command_line> argument. Unlike Nagios, the command line cannot contain macros; it must be typed exactly as it should be executed.

Also note that you will have to modify the definitions below to match the argument format the plugins expect.

The following examples allow user-supplied arguments and can only be used if the NRPE daemon was compiled with support for command arguments *AND* the dont_blame_nrpe directive in this config file (nrpe.cfg) is set to '1'.

command[check_users]=/usr/local/nagios/libexec/check_users -w $ARG1$ -c $ARG2$
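
The relationship between dont_blame_nrpe, command[...] entries and $ARGn$ can be sketched as a small Python model. This is an illustration of the rules described above, not the NRPE daemon's actual code. The function names are invented, and NRPE's own checks on argument contents are not modelled.

```python
import re


def parse_nrpe_cfg(text):
    """Split nrpe.cfg-style text into plain settings and command definitions."""
    settings, commands = {}, {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        match = re.fullmatch(r"command\[(.+)\]", key.strip())
        if match:
            commands[match.group(1)] = value.strip()
        else:
            settings[key.strip()] = value.strip()
    return settings, commands


def build_command(settings, commands, name, args=()):
    """Return the command line for a request, filling in $ARGn$ if allowed."""
    line = commands[name]
    if args:
        if settings.get("dont_blame_nrpe") != "1":
            raise PermissionError("command arguments are disabled")
        for number, arg in enumerate(args, start=1):
            line = line.replace(f"$ARG{number}$", arg)
    return line


SAMPLE_CFG = """
# allow arguments from check_nrpe
dont_blame_nrpe=1
command[check_users]=/usr/local/nagios/libexec/check_users -w $ARG1$ -c $ARG2$
"""

settings, commands = parse_nrpe_cfg(SAMPLE_CFG)
print(build_command(settings, commands, "check_users", ["5", "10"]))
```

With dont_blame_nrpe set to anything other than 1, the same request with arguments raises an error in this model instead of running the command.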

In order to use the check_nrpe plugin, you'll have to define a few things in the host config file. (See the README file).

In any service definitions that use the nrpe plugin to get their results, you would set the service check command portion of the definition to:

check_command check_nrpe!yourcommand

where yourcommand is the name of a command that you define in your nrpe.cfg file on the remote host.

Contacts:

define contact{
        contact_name                    nagiosadmin             ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin            ; Full name of user

        email                           kdoan@quantros.com      ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        pager                           15555555555@tmomail.net
        service_notification_commands   notify-service-by-email,notify-service-by-sms
        host_notification_commands      notify-host-by-email,notify-host-by-sms
        }

define contact{
        contact_name                    quantros_application_admin
        use                             generic-contact
        alias                           Quantros Application Admin
        email                           kdoan@quantros.com
        pager                           15555555555@tmomail.net
        service_notification_commands   notify-service-by-email,notify-service-by-sms
        host_notification_commands      notify-host-by-email,notify-host-by-sms
}

check_mysql_health
What to monitor with Nagios
Nagios Core 3.x Documentation
Nagios Plugin API
Nagios Plugins
NagiosExchange.org
Object Configuration Overview
Object Definitions
Object Inheritance
Time-Saving Tricks For Object Definitions
Determining Status and Reachability of Network Hosts
Time Periods
Main Configuration File Options
CGI Configuration File Options
Authentication And Authorization In The CGIs
Determining Status and Reachability of Network Hosts (host reachability)
Predictive Dependency Checks (predictive host dependency checks)
Cached Checks
Host and Service Dependencies
Nagios plug-in development guidelines
Performance Data
Optimization
nagiosgraph

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License