How are server issues (faults etc) monitored?

We run automated heartbeat monitors that check the health of the servers and of a number of background tasks. In addition to checking the list of known tasks, each run performs database read and write actions and checks the available disk space. The heartbeat monitor runs every 15 minutes. If a run of the monitor itself takes more than 5 minutes, an alert is sent.
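The monitoring run described above can be sketched roughly as follows. This is an illustration only, not the actual implementation: the names `run_heartbeat`, `disk_space_ok`, `send_alert` and the free-space threshold are hypothetical. Sending an alert when an individual check fails is also an assumption, since the only alert described here is the one for a run that exceeds 5 minutes.

```python
import shutil
import time
from typing import Callable, Dict, List

MAX_RUN_SECONDS = 5 * 60        # a run longer than this triggers an alert
RUN_INTERVAL_SECONDS = 15 * 60  # how often a scheduler would start a run
MIN_FREE_BYTES = 1024 ** 3      # hypothetical threshold, not taken from the system


def disk_space_ok(path: str = ".", min_free: int = MIN_FREE_BYTES) -> bool:
    """Return True if the volume holding `path` has at least `min_free` bytes free."""
    return shutil.disk_usage(path).free >= min_free


def run_heartbeat(checks: Dict[str, Callable[[], bool]],
                  send_alert: Callable[[str], None],
                  clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Run every check once and return the names of the checks that failed."""
    started = clock()
    failures: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises is treated as a failed check.
            ok = False
        if not ok:
            failures.append(name)
            send_alert(f"check failed: {name}")  # assumed behaviour
    elapsed = clock() - started
    if elapsed > MAX_RUN_SECONDS:
        # The monitor itself was too slow, which is reported separately.
        send_alert(f"heartbeat took {elapsed:.0f}s, limit is {MAX_RUN_SECONDS}s")
    return failures
```

In a sketch like this, the database read and write check and each known task would be passed in as entries of `checks`, and an external scheduler would call `run_heartbeat` every `RUN_INTERVAL_SECONDS`.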

Heartbeat message

The list of tasks that the system monitors is defined in the class DBTask.

List of tasks

A task is defined by:

  • code which uniquely identifies this task