Icinga monitoring of Thecus NAS systems

Posted by Daniel Rauer on Sunday, March 10, 2019

At bytemine we use several NAS systems, mainly from Thecus and QNAP. They have proven to be very reliable, easy to set up and maintain, reasonably priced, and they offer more features than we need. Thecus support, however, does not make us very happy, so we switched to QNAP for our larger and more important storage systems. Some systems of low importance are still Thecus-based, although they are some years old. For example, the Bacula-based backups of our workstations are stored on an older 5-bay N5200XXX. This is of low importance because almost nothing at bytemine exists only on a single workstation: mails are stored on our Kopano cluster and in our mail archive, customer communication in our Request Tracker helpdesk, code and configs in GitLab, and contracts and invoices in a server-based software from our parent company otris software AG.

But let's get to the core of this post:

One day the monitoring of our Bacula backup alerted us that it was unable to write its data. After a moment it was clear that the N5200XXX had died. How could that happen despite having a monitoring system in place?

How to monitor a Thecus NAS?

Although the N5200XXX is a Linux-based system, it is quite a hassle to integrate it into our Icinga environment. Of course it talks SNMP, but only very basic information is provided that way. So years ago we decided to monitor it from the "outside" via ping and an HTTPS availability check, and to configure its built-in monitoring to send out alerts via email.

So: What went wrong?

To make a long story short: the NAS could not send emails because of changes we had made to the email server weeks before the crash. One day the fan stopped turning, the alerts could not be sent, and the system overheated; one hard disk after the other died until the system stopped responding. The Bacula alerts arrived at nearly the same time as the ping and HTTPS alerts. So we replaced the fan, replaced all disks and set up the system from scratch. Quite annoying, and entirely unnecessary.

Lessons learned:

It became obvious that we needed to improve the monitoring of our Thecus NAS systems a lot. On our wishlist: SMART states of the disks, disk temperatures, CPU temperatures, fan status, and of course the status of the RAID. All this information is available via the web interface, so could this be an option?

Solution:

We found a project on GitHub that provided exactly what we wanted, but not for the three different models we used. The implementation was very good, quite flexible, and robust. So I forked the project, played around with it, enhanced it, created pull requests, chatted with the author, and here we are now:

Icinga-compatible monitoring of health, CPU, memory and disk usage for various models:

Currently our project 'check_thecus_nas' supports these models:

  • Thecus N5500 (firmware V5.00.04)
  • Thecus N5200XXX (firmware V5.03.02)
  • Thecus N2520 (firmware OS6.build_341)
  • Thecus N8800PROv2 (firmware V2.05.08.v2)
  • NASBOX5G2
  • NAS models with similar web UIs

All information is pulled from the web interfaces, which luckily provide JSON endpoints for all relevant information. The structure is sometimes a bit weird and cluttered (depending on the model and firmware), but good enough to reduce string parsing to a minimum.
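
To illustrate what "string parsing reduced to a minimum" can look like, here is a small sketch in Python. It is not taken from check_thecus_nas: the payload, its field names ("disks", "tray", "dev", "temp") and the function name are all hypothetical, and the real structure differs per model and firmware.

```python
import json

# Hypothetical payload; real Thecus responses use different field names
# and differ between models and firmware versions.
SAMPLE_RESPONSE = """
{"disks": [
  {"tray": 1, "dev": "a", "temp": "39 C"},
  {"tray": 2, "dev": "b", "temp": "43 C"}
]}
"""


def disk_temperatures(raw_json):
    """Return a dict mapping a perfdata label to a temperature in Celsius."""
    data = json.loads(raw_json)
    temps = {}
    for disk in data.get("disks", []):
        label = "Disk_{}_{}_temp".format(disk["tray"], disk["dev"])
        # The only string parsing left: drop the unit from a value like "39 C".
        temps[label] = int(str(disk["temp"]).split()[0])
    return temps


temperatures = disk_temperatures(SAMPLE_RESPONSE)
print(temperatures)
```

Because the JSON is already structured, the check only has to pick out fields and convert a few values, instead of scraping HTML.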

Health check:

OK - Hardware working fine, RAID Healthy | Disk_1_a_temp=39;55;60 Disk_2_b_temp=43;55;60 Disk_3_c_temp=41;55;60 Disk_4_d_temp=40;55;60 Disk_5_e_temp=44;55;60
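
The part after the pipe is performance data in the usual Nagios/Icinga plugin form, label=value;warn;crit. As a rough sketch of how such a line could be assembled (the function name is made up for illustration, and the thresholds 55 and 60 are simply copied from the sample output, not from the project code):

```python
def format_perfdata(temps, warn=55, crit=60):
    """Join temperatures into 'label=value;warn;crit' perfdata items."""
    return " ".join(
        "{}={};{};{}".format(label, value, warn, crit)
        for label, value in temps.items()
    )


# Values copied from the sample output above.
temps = {
    "Disk_1_a_temp": 39,
    "Disk_2_b_temp": 43,
    "Disk_3_c_temp": 41,
    "Disk_4_d_temp": 40,
    "Disk_5_e_temp": 44,
}

line = "OK - Hardware working fine, RAID Healthy | " + format_perfdata(temps)
print(line)
```

Icinga stores everything after the pipe as metrics, so the disk temperatures can be graphed over time and not just alerted on.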

During a rebuild of the RAID (N5200XXX):

WARNING - Hardware working fine, RAID status: Build:35.0% ( 389.2min ) | Disk_1_a_temp=43;55;60 Disk_2_b_temp=47;55;60 Disk_3_c_temp=44;55;60 Disk_4_d_temp=44;55;60 Disk_5_e_temp=45;55;60
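
A check has to turn a status string like the one above into a plugin state. The following sketch shows one possible mapping. It assumes that "Healthy" means OK and that a "Build:" status means WARNING, as in the two sample outputs; treating every other status as CRITICAL is my assumption for this example, and the function and pattern names are hypothetical.

```python
import re

# Standard plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Matches a status such as "Build:35.0% ( 389.2min )".
BUILD_PATTERN = re.compile(r"Build:\s*([\d.]+)%\s*\(\s*([\d.]+)\s*min\s*\)")


def classify_raid(status):
    """Map a RAID status string to (exit_code, message)."""
    if status == "Healthy":
        return OK, "RAID Healthy"
    match = BUILD_PATTERN.search(status)
    if match:
        percent = float(match.group(1))
        minutes = float(match.group(2))
        message = "RAID status: rebuild at {:.1f}% ({:.1f} min)".format(percent, minutes)
        return WARNING, message
    # Assumption: anything that is neither healthy nor rebuilding is critical.
    return CRITICAL, "RAID status: {}".format(status)


print(classify_raid("Build:35.0% ( 389.2min )"))
```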

CPU check:

OK - CPU usage: 28% | CPU=28;90;95;0;100

Memory check:

OK - Memory usage: 48.7% | mem_usage=48.7%;90;95;0;100

Disk usage:

OK - Disk usage: RAID 38.58% (1998.4 GB/5180.3 GB) | RAID_usage=38.58;80;90;0;100
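
The percentage follows directly from the two sizes: 1998.4 GB / 5180.3 GB is roughly 38.58%. A minimal sketch of such a threshold check, reusing the 80/90 thresholds from the sample perfdata; the function name and the choice to alert only when a threshold is exceeded are assumptions, not the project's actual code:

```python
def check_disk_usage(used_gb, total_gb, warn=80.0, crit=90.0):
    """Return (exit_code, output_line) for a RAID usage check."""
    percent = used_gb / total_gb * 100
    if percent > crit:
        code, state = 2, "CRITICAL"
    elif percent > warn:
        code, state = 1, "WARNING"
    else:
        code, state = 0, "OK"
    line = "{} - Disk usage: RAID {:.2f}% ({} GB/{} GB) | RAID_usage={:.2f};{:g};{:g};0;100".format(
        state, percent, used_gb, total_gb, percent, warn, crit
    )
    return code, line


code, line = check_disk_usage(1998.4, 5180.3)
print(line)
```

In a real plugin the exit code would be passed to sys.exit() so that Icinga can pick up the state.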

Conclusion

In the end we not only improved our monitoring of these systems a lot (letting systems send out emails is NOT monitoring!), but also had a lot of fun working remotely with a stranger on some code, trying to improve things not only for us and him. Hopefully others will find this useful and will not have to experience a Thecus NAS that cooked itself.

Open source for the win!