Monitoring

https://www.oreilly.com/ideas/monitoring-distributed-systems
https://blog.profitbricks.com/top-47-cloud-server-monitoring-sysadmin-tools/
http://solutioncenter.apexsql.com/operating-system-os-performance-monitoring/
https://www.sevone.com/white-paper/6-steps-effective-performance-monitoring-strategy
http://logz.io/blog/elk-monitor-platform-performance/
http://searchitchannel.techtarget.com/feature/Windows-7-performance-monitoring-tools
https://msdn.microsoft.com/en-us/library/bb726968.aspx
https://technet.microsoft.com/en-us/library/dd744567%28v=ws.10%29.aspx
http://www.techrepublic.com/blog/data-center/secret-agents-make-snmp-work-for-you/
http://www.techrepublic.com/blog/the-enterprise-cloud/use-resource-monitor-to-monitor-cpu-performance/
http://www.techrepublic.com/blog/the-enterprise-cloud/use-resource-monitor-to-monitor-network-performance/
http://www.techrepublic.com/blog/the-enterprise-cloud/use-resource-monitor-to-monitor-storage-performance/
https://codeascraft.com/2011/02/15/measure-anything-measure-everything/ - done reading
https://codeascraft.com/2010/12/08/track-every-release/ - done reading
http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/ - done reading
http://graphite.wikidot.com/
http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
http://network-management.softwareinsider.com/compare/51-96/Telenium-vs-NetXMS - done reading

StatsD

Monitor everything. Even the number of deployments per day should be logged and graphed. We maintain a battery of thousands of tests that run against our application code before every single deploy, and we’re adding more every day. Combined with engineers pairing up for code reviews, we catch most issues before they get deployed. Tracking every deploy allows us to quickly detect any bugs that we missed.

Monitoring every aspect of your server and network architecture helps detect when something has gone awry. Correlating the times of each and every code deploy helps to quickly identify human-triggered problems and greatly cut down on your time to resolve them.
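A minimal sketch of the deploy-correlation idea above, assuming deploy times are already recorded somewhere as a sorted list. The function name, the 30-minute window and the sample timestamps are all made up for illustration; they do not come from any of the linked articles.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def deploys_before(deploy_times, incident_time, window=timedelta(minutes=30)):
    """Return the deploys that happened within `window` before `incident_time`.

    `deploy_times` must be sorted in ascending order.
    """
    lo = bisect_left(deploy_times, incident_time - window)
    hi = bisect_right(deploy_times, incident_time)
    return deploy_times[lo:hi]


# Hypothetical sample data: three deploys on one day.
deploys = [
    datetime(2011, 2, 15, 9, 5),
    datetime(2011, 2, 15, 11, 40),
    datetime(2011, 2, 15, 14, 20),
]

# Hypothetical incident: an error-rate alert fires at noon.
incident = datetime(2011, 2, 15, 12, 0)

# Only the 11:40 deploy falls inside the 30-minute window before the alert.
suspects = deploys_before(deploys, incident)
print(suspects)
```

In practice the same effect is usually achieved visually, by drawing deploy markers on the graphs, but a lookup like this can be handy in alert scripts.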

https://assets.nagios.com/downloads/nagioscore/docs/Installing_Nagios_Core_From_Source.pdf
http://jianmingli.com/wp/?p=1223
https://reddragdiva.dreamwidth.org/515416.html
http://permalink.gmane.org/gmane.network.nagios.user/27375

https://bigpanda.io/integrations/nagios-the-alternative-to-a-flood-of-alerts

Cacti - This is for graphing
Graphite - This is for graphing

netxms
Telenium - http://network-management.softwareinsider.com/l/51/Telenium, http://www.megasys.com/downloads/enter.asp
NetCrunch 9 - http://network-management.softwareinsider.com/l/56/NetCrunch-9, http://www.adremsoft.com/demo/

https://www.webnms.com/

DataDog - https://www.datadoghq.com/
ganglia
JStatsD
ManageEngine IT 360
https://mmonit.com/monit/
Monitis - http://www.monitis.com/
Munin
N-Able
N-Central - http://www.n-able.com/products/n-central/
Nagios for Windows - https://www.nagios.com/solutions/windows-monitoring/
Nagios for Windows - https://www.itefix.net/nagwin
Nagios
NeDi
PagerDuty - http://www.pagerduty.com
NewRelic - Analyze performance down to code level
Ruxit - https://ruxit.com/ - Analyze performance down to code level
Total Network Monitor - http://www.softinventive.com/total-network-monitor/

http://www.techrepublic.com/blog/five-apps/five-free-network-monitoring-tools/ - done reading
http://www.monitis.com/blog/2011/02/22/11-top-server-management-monitoring-software/ - done reading
http://www.gfi.com/blog/the-top-20-free-network-monitoring-and-analysis-tools-for-sys-admins/ - done reading
http://www.infoworld.com/article/2654082/networking/killer-open-source-monitoring-tools.html - done reading
http://www.networkworld.com/article/2825879/network-management/7-free-open-source-network-monitoring-tools.html - done reading
http://blog.unicsolution.com/2013/11/best-monitoring-solution-omd-nagios.html - done reading

http://www.techrepublic.com/blog/linux-and-open-source/nagios-xi-wizards-make-setup-a-snap-for-network-monitoring/

http://www.sitepoint.com/guide-monitoring-web-applications/

What should we monitor?

Monitor and graph everything:

  1. Applications:
    1. apache
    2. mysql
  2. network
    1. ICMP response time
    2. packet drop
    3. number of TCP connections
    4. bits per second on network interface
    5. error rates / packet drops
    6. retry limits
    7. collisions
    8. throughput
  3. IO activities
  4. swap
    1. swap in
    2. swap out
  5. memory
  6. CPU usage
  7. load average
  8. disk space on each drive / partition
  9. context switches
  10. interrupts
  11. major page faults
  12. number of logged in users
  13. number of login attempts
  14. temperature of physical devices
  15. relation rates
  16. and many more.
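As a rough illustration, a few of the host-level items in the list above (load average, disk space) can be read with nothing but the Python standard library. This is only a sketch: the metric names are invented here, and a real setup would normally rely on an agent or one of the tools listed on this page rather than a hand-rolled script.

```python
import os
import shutil


def collect_host_metrics(path="/"):
    """Collect disk usage for `path` and, where available, the load average."""
    metrics = {}

    # Disk space on the partition that holds `path`.
    usage = shutil.disk_usage(path)
    metrics["disk.total_bytes"] = usage.total
    metrics["disk.used_bytes"] = usage.used
    metrics["disk.used_percent"] = (
        100.0 * usage.used / usage.total if usage.total else 0.0
    )

    # Load average is only available on Unix-like systems.
    try:
        load1, load5, load15 = os.getloadavg()
    except (AttributeError, OSError):
        pass
    else:
        metrics["load.1min"] = load1
        metrics["load.5min"] = load5
        metrics["load.15min"] = load15

    return metrics


if __name__ == "__main__":
    for name, value in sorted(collect_host_metrics().items()):
        print(f"{name} = {value}")
```

Run from cron or a loop, the resulting values could then be forwarded to a graphing backend such as Graphite.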

CloudViewNMS

Commercial. Reportedly works relatively well for some users. Needs further examination.

Icinga 2:

Icinga is a Linux-based, fully open source monitoring application which checks the availability of network resources and immediately notifies users when something goes down. Icinga provides business intelligence data for in-depth analysis and a powerful command line interface. When you first launch the Icinga web UI, you are prompted for credentials. Once you’ve authenticated, use the navigation menu on the left-hand side to manage the configuration of hosts, view the dashboard, reports, see a history of events, and more.

ManageEngine OpManager:

Easy to install, with no special requirements. Offers incident management, problem management, and change management. Good conversation management. Process automation can be set up by adding business rules. Provides powerful SLA features through ManageEngine Service Desk.

Pros:

  • Great feature set
  • No client required, as it is completely web-browser based;
  • Monitoring devices using SNMP, WMI, SSH/Telnet
  • Notifies admins on alarms, or escalation thresholds.

Cons:

  • Lots of manual configuration needed
  • Errors in device classification
  • Unconventional UI is hard to navigate;
  • Configuration can be complex;
  • No multiple threshold alarms (e.g. Warning, Critical, etc.)

Monitance:

http://www.monitance.com/en/

OMD:

Recommended by another source as the best solution. http://blog.unicsolution.com/2013/11/best-monitoring-solution-omd-nagios.html

Tutorials:

Observium:

Observium is "an autodiscovering PHP/MySQL/SNMP-based network monitoring [tool]." It focuses on Linux, UNIX, Cisco, Juniper, Brocade, Foundry, HP, and more. With Observium, you'll find detailed graphs and an incredibly easy-to-use interface. It can monitor a huge number of processes and systems. The only downside is a lack of auto alerts. But to make up for that, you can set Observium up alongside a tool like Nagios for up/down alerts.

OpenNMS:

OpenNMS is an open source enterprise grade network management application that offers automated discovery, event and notification management, performance measurement, and service assurance features. OpenNMS includes a client app for the iPhone, iPad or iPod Touch for on-the-go access, giving you the ability to view outages, nodes, alarms and add an interface to monitor. Once you successfully log in to the OpenNMS web UI, use the dashboard to get a quick ‘snapshot view’ of any outages, alarms or notifications. You can drill down and get more information about any of these sections from the Status drop-down menu. The Reports section allows you to generate reports to send by e-mail or download as a PDF.

OpenNMS is designed for Linux but can support Windows and OSX as well. Easy installation process. Features ability to configure “Path Outages”. Offers Event and Notification Management – receiving both internal and external events. Features thresholding, which is the evaluation of polled latency data or collected performance data against configurable thresholds, creating events when these are exceeded or rearmed. Alarms and automation – reducing events according to a reduction key and scripting automated actions centered on alarms. Sends notifications regarding noteworthy events via e-mail, XMPP, or other means.
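The thresholding behaviour described above (an event when a value exceeds a threshold, another when it rearms) can be sketched as a small state machine. This is an illustration of the general technique only, not OpenNMS code or its configuration format; the class name, the event strings and the sample values are all assumptions.

```python
class Threshold:
    """High threshold with a separate, lower rearm level (hysteresis)."""

    def __init__(self, trigger, rearm):
        if rearm > trigger:
            raise ValueError("rearm level must not be above the trigger level")
        self.trigger = trigger
        self.rearm = rearm
        self.exceeded = False

    def evaluate(self, value):
        """Return "exceeded", "rearmed", or None if the state did not change."""
        if not self.exceeded and value >= self.trigger:
            self.exceeded = True
            return "exceeded"
        if self.exceeded and value <= self.rearm:
            self.exceeded = False
            return "rearmed"
        return None


# Hypothetical CPU usage samples in percent: trigger at 80, rearm at 70.
cpu = Threshold(trigger=80, rearm=70)
samples = [40, 85, 95, 82, 60, 90]
events = [cpu.evaluate(v) for v in samples]
print(events)
```

Keeping the rearm level below the trigger level means a value hovering around 80 does not generate a new event on every poll, which is one way to cut down on alert floods.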

Pros:

  • Free licensing
  • Offers good support and documentation through wikis and mailing lists;
  • Full featured and infinitely flexible
  • “Path outages” featuring “minimize excessive alerting”
  • Reasonable support costs via the OpenNMS Group.

Cons:

  • Steep learning curve
  • Interface not very intuitive;
  • Requires learning and modifying various config files for customization;
  • Money saved on licensing may have to be spent on development and maintenance.

Nagios:

Nagios (Figure D) is considered by many to be the king of open source network monitoring systems. Although not the easiest tool to set up and configure (you have to manually edit configuration files), Nagios is incredibly powerful. And even though the idea of manual configuration might turn some off, that setup actually makes Nagios one of the most flexible network monitors around. In the end, the vast number of features Nagios offers is simply unmatched. You can even set up email, SMS, and printed paper alerts!

Pandora FMS:

Recommended by another source: I have been using a different unified monitoring tool called Pandora FMS. It can monitor servers as well as networks, devices, and webpages, and obtain detailed information in real time. Thanks to its e-mail alerts, you can go to the cinema, for example, and not miss a critical memory-usage status on a server. If you'd like to give it a try, visit the following webpage: http://pandorafms.com/

Pandora FMS is a performance monitoring, network monitoring and availability management tool that keeps an eye on servers, applications and communications. It has an advanced event correlation system that allows you to create alerts based on events from different sources and notify administrators before an issue escalates. When you log in to the Pandora FMS Web UI, start by going to the ‘Agent detail’ and ‘Services’ node from the left-hand navigation pane. From here, you can configure monitoring agents and services.

SolarWinds:

SolarWinds’s ConnectNow Topology Mapping allows users’ environment to be mapped in real time automatically. This provides graphical visibility into users’ networks, requiring no additional work or tools. SolarWinds’s Integrated Wireless Poller monitors wireless devices for security and other issues and reduces the difficulty in managing these items, allowing more widespread use of wireless technologies.

Pros:

  • Excellent UI design
  • Customizable, automated network mapping
  • Great community support provided by Thwack
  • Mobile access
  • Native VMware support

Cons:

  • Unable to configure alerts from the web-console;
  • Clumsy “Group Dependency” configuration
  • Reporting module needs better ad-hoc reports;
  • No native support for Microsoft Hyper-V. Features SNMP only.

Spiceworks:

Spiceworks (Figure C) is becoming one of the industry standard free network/system monitoring tools. Although you have to put up with some ads, the features and Web-based interface can't be beat. Spiceworks monitors (and autodiscovers) your systems, alerts you if something is down, and offers outstanding topographical tools. It also allows you to get social with fellow IT pros via the Spiceworks community, which is built right in.

Spiceworks is a network management and monitoring, Help Desk, PC inventory and software reporting solution for handling IT in small and medium-sized businesses. Fast installation. Main dashboard completely configurable. Easy to use monitoring console. Active user community, with forums, ratings and reviews, how-tos and whitepapers.

Pros:

  • Free
  • Easy to install and configure for Windows environments
  • “All in one” solution for Inventory, Monitoring, and Help Desk.
  • Great starting point for IT management

Cons:

  • On larger networks, performance can be slow;
  • Limited scalability
  • Does not facilitate managing control of monitored devices;
  • Some initial device configuration is required to be recognized by Spiceworks;
  • VMware and *nix systems not discovered nearly as easily as Windows;
  • Does not provide the same depth of monitoring and control as enterprise-level products.

WhatsUp Gold – Gold Premium:

Processing loads are handled by remote sites, minimizing the overhead at the central location. Features real-time centralized network management across multiple sites using individualized dashboards. Monitoring is continuous and uninterrupted, and each site runs independently of the central site. Provides actionable intelligence, with over 200 reports to slice and dice consolidated data, including SLA levels. With monitoring localized at each remote site, there is minimal traffic overhead on the network. Air-tight security with 128-bit SSL encryption between each remote network and the central site; SSL over VPN can also be configured.

Pros:

  • Easy setup and network discovery
  • Great feature set
  • Many notification options, including via email and SMS.
  • Detailed, customizable reporting; supports custom date ranges.

Cons:

  • Non-intuitive
  • Clumsy interface
  • Configuration requires both Web and Windows consoles;
  • Unfriendly “Passive” SNMP reporting.

Xymon:

Xymon is a web-based system – designed to run on Unix-based systems – that allows you to dive deep into the configuration, performance and real-time statistics of your networking environment. It offers monitoring capabilities with historical data, reporting and performance graphs. Once you’ve installed Xymon, the first place you need to go is the hosts.cfg file to add the hosts that you are going to monitor. Here, you add information such as the host IP address, the network services to be monitored, what URLs to check, and so on. When you launch the Xymon Web UI, the main page lists the systems and services being monitored by Xymon. Clicking on each system or service allows you to bring up status information about a particular host and then drill down to view specific information such as CPU utilization, memory consumption, RAID status, etc.

Zabbix:

Zabbix (Figure E) is as powerful as any other network monitoring tool, and it also offers user-defined views, zooming, and mapping on its Web-based console. Zabbix offers agent-less monitoring, collects nearly ANY kind of data you want to monitor, does availability and SLA reporting, and can monitor up to 10,000 devices. You can even get commercial support for this outstanding open source product. One unique Zabbix feature is the option to set audible alerts. Should something go down, have Zabbix play a sound file (say, a Star Trek red alert klaxon?).

ZABBIX is fully configurable from its web front end and so it is easier to use ZABBIX than the popular Nagios — whose configuration requires several text files. Further, ZABBIX combines both monitoring and trending functionality, while Nagios focuses exclusively on monitoring. The Web monitoring function of ZABBIX allows users to monitor the availability and performance of web-based services over time. Moreover, this functionality allows ZABBIX to log into a web application periodically and run through a series of typical steps being performed by a customer.

Pros: It’s open-source and has a well-designed Web GUI and overall concept; ZABBIX offers good alerts, dedicated agents and an active user community.

Cons: ZABBIX is not suitable for large networks with 1,000+ nodes, due to PHP performance and Web GUI limitations, a lack of real-time tests, as well as complicated templates and alerting rules.

Zenoss Core:

Zenoss Core is a powerful open source IT monitoring platform that monitors applications, servers, storage, networking and virtualization to provide availability and performance statistics. It also has a high performance event handling system and an advanced notification system. Once you log in to the Zenoss Core Web UI for the first time, you are presented with a two-step wizard that asks you to create user accounts and add your first few devices / hosts to monitor. You are then taken directly to the Dashboard tab. Use the Dashboard, Events, Infrastructure, Reports and Advanced tabs to configure Zenoss Core and review reports and events that need attention.

Other information:

We like graphite for a number of reasons: it’s very easy to use, and has very powerful graphing and data manipulation capabilities. We can combine data from StatsD with data from our other metrics-gathering systems. Most importantly for StatsD, you can create new metrics in graphite just by sending it data for that metric. That means there’s no management overhead for engineers to start tracking something new: simply tell StatsD you want to track “grue.dinners” and it’ll automagically appear in graphite. (By the way, because we flush data to graphite every 10 seconds, our StatsD metrics are near-realtime.) Not only is it super easy to start capturing the rate or speed of something, but it’s very easy to view, share, and brag about them. https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
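To make "simply tell StatsD you want to track grue.dinners" concrete: StatsD accepts plain-text lines of the form name:value|type over UDP, where the type is c for counters, ms for timers and g for gauges. The sketch below assumes a StatsD daemon listening on its default port 8125 on localhost; the helper names and the deploy.duration metric are made up for illustration, and most projects would use an existing client library instead.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line, e.g. b"grue.dinners:1|c"."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: send one metric line to StatsD over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Count one occurrence of the "grue.dinners" metric from the quote above.
counter = format_metric("grue.dinners", 1, "c")

# Hypothetical timer: a deploy that took 320 milliseconds.
timer = format_metric("deploy.duration", 320, "ms")

if __name__ == "__main__":
    # Assumes a StatsD daemon on localhost:8125; UDP will not report
    # an error if nothing is listening.
    send_metric(counter)
    send_metric(timer)
```

Because the transport is UDP, instrumented application code does not block or fail when the metrics daemon is down, which is part of why adding a new metric carries so little overhead.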
