From end to end: a monitoring story

Date posted
13 August 2019
Reading time
29 Minutes
Andrew Thompson


My first project at Kainos as a WebOps Engineer was working as part of the EMR Cloud team to transition Kainos' Electronic Medical Records (EMR) software from on-premise architecture to the cloud. The main intention behind this was to take advantage of a range of cloud benefits: flexible capacity and costs, high levels of security, quicker deployments, scalability, and reduced capital expenditure and environmental impact.

Part of this solution is the EMR Cloud Managed Service, which removes the customer's responsibility for monitoring the environment, responding to any issues, and ensuring the solution remains highly available. As this was our first step into the cloud with EMR, we needed to have the utmost confidence in our monitoring solution, so that we could be aware of any potential issues and react to them before they turned into problems for our customers. Additionally, with a robust and trusted monitoring solution in place, we would have a firm foundation for scaling out the EMR Cloud Managed Service to other customers.

With on-premise versions of the EMR solutions, monitoring and alerting is the customer's responsibility. Kainos' Support and Infrastructure teams did not have access to these tools (SolarWinds, Nagios, Microsoft SCOM, etc.), which meant they would not be aware of any infrastructure issues occurring on the environment that could affect the application they were supporting. Additionally, the level of monitoring in place for each customer was entirely dependent on funding, which varied from customer to customer. With the move to the Managed Service, the Support, Infrastructure and Ops teams now have a full understanding of the current state of the environment, which allows any potential issues to be dealt with as and when they arise. It also ensures that we have a reliable and consistent solution to take to market.

Whilst we didn't reinvent the wheel with regard to how the platform is monitored, one of the successes was the fine-tuning that took place before go-live. Customarily across the industry, monitoring and alerting is fine-tuned post go-live, when the environment is facing expected levels of load and usage. This approach involves dealing with a lot of alert noise and false positives in the initial weeks after go-live. Having the monitoring tuned extensively beforehand allowed us to go live with an established solution, which meant that actual issues were recognised and resolved quickly, and that on-call engineers were not needlessly alerted out of hours, maintaining confidence in the monitoring stack across the operational support teams.

Now that the trumpet blowing is out of the way, let's look into what we did to configure the monitoring and alerting solution.

Monitoring

We used the following tools to monitor key areas of the environment:

  • Alert Logic: Alert Logic is used to monitor the environment from a security perspective. All VMs in the environment have an Alert Logic agent installed, which sends security event logs to Alert Logic for review by their team. They will alert on things like repeated failed account logins, changes to Active Directory groups, etc.
  • Azure Log Analytics: Log Analytics is set up differently in that it does not use an agent. Instead, the Log Analytics VM extension is installed on all hosts in the environment. The extension sends diagnostic logs and syslogs from the various hosts to the Log Analytics workspace. As these logs are fed into a central hub, we can configure alerts based on the log data being received: for example, if any changes are made to the firewall configuration, if malware is found on any VM, or if the Zabbix VM goes down (as Zabbix is unable to monitor itself). It should be noted that there is typically a delay of around 15 minutes between an event occurring and the alert being received in Log Analytics.
  • Zabbix: Like Alert Logic, all VMs have a Zabbix agent installed, which feeds monitoring metrics back to the Zabbix management host. We have thresholds set here for various metrics such as CPU load, memory utilisation and disk space utilisation. Once a threshold is met or exceeded, an alert is sent out. Whilst Zabbix provides similar functionality to Log Analytics alerting, there is no 15-minute delay, so we are alerted on production-critical events as and when they occur and resolve. The final benefits of Zabbix are that it is widely used and open source, so there are no additional costs beyond the running costs of the Zabbix management VM. (A short sketch of querying the Zabbix API for active problems follows this list.)
  • SQL Server Management Studio – SQL Agent: Each database server has a SQL Agent installed and configured to alert on SQL job failures. Additionally, we alert on the successes of specific jobs, like backups, which are then checked as part of our routine checks (more information on this later).
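To give a flavour of how the Zabbix side can be inspected programmatically, the sketch below queries the Zabbix JSON-RPC API for triggers currently in a problem state. It is purely illustrative: the API URL, credentials and severity threshold are placeholders, and the exact login parameter names vary slightly between Zabbix versions.

```python
import requests

ZABBIX_API = "https://zabbix.example.internal/api_jsonrpc.php"  # illustrative URL

def zabbix_call(method, params, auth=None, request_id=1):
    """Make a single JSON-RPC call to the Zabbix API and return its result."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params,
               "auth": auth, "id": request_id}
    response = requests.post(ZABBIX_API, json=payload, timeout=30)
    response.raise_for_status()
    body = response.json()
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]

# Log in (Zabbix 4.x style credentials; placeholders only), then fetch triggers
# that are currently in a PROBLEM state with severity "average" or higher.
token = zabbix_call("user.login", {"user": "api-reader", "password": "example-password"})
triggers = zabbix_call("trigger.get", {
    "output": ["description", "priority", "lastchange"],
    "selectHosts": ["host"],
    "filter": {"value": 1},   # 1 = trigger is in PROBLEM state
    "min_severity": 3,        # 3 = average and above (illustrative threshold)
    "monitored": True,        # ignore triggers on disabled hosts/items
    "sortfield": "priority",
    "sortorder": "DESC",
}, auth=token)

for trig in triggers:
    hosts = ", ".join(h["host"] for h in trig["hosts"])
    print(f"[severity {trig['priority']}] {hosts}: {trig['description']}")
```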

Alerting

With our alert sources defined, we decided on OpsGenie as our Incident Management Solution. OpsGenie had been used successfully elsewhere in Kainos, so it made sense to stick with it. OpsGenie was configured as follows:

Integrations

OpsGenie provides a number of integrations to third party applications and communication methods:

  • Client Email: This provides the unique email address to be used by the various monitoring tools (Alert Logic, Log Analytics, Zabbix). We can configure the alert format and apply tags to alerts received through this integration, which allows us to identify the source of each alert and search for specific alerts based on the client and monitoring tool (the sketch after this list shows the same fields being set via the OpsGenie API).
  • Microsoft Teams: The Teams integration is set to post P1 and P2 alerts to our team communication tool. This is handy as each user will get a notification in MS Teams when an alert which requires attention is received. The team have found this integration to be very useful in supporting the environment, as email alerts are not always the most obvious and prioritised method for alerting on issues.
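Although our alerts arrive in OpsGenie through the email integration, the same fields (message, tags and priority) can also be set directly via the OpsGenie Alert API. The sketch below is a minimal illustration of that; the API key, tags and alert content are placeholders rather than our actual configuration.

```python
import requests

OPSGENIE_API = "https://api.opsgenie.com/v2/alerts"
API_KEY = "REPLACE_WITH_INTEGRATION_API_KEY"  # placeholder, not a real key

def raise_alert(message, priority="P5", tags=None, description=""):
    """Create an OpsGenie alert carrying the same fields (tags, priority)
    that the email integration populates for us."""
    payload = {
        "message": message,
        "description": description,
        "priority": priority,            # P1 (critical) .. P5 (informational)
        "tags": tags or [],
        "source": "zabbix",              # illustrative source label
    }
    response = requests.post(
        OPSGENIE_API,
        json=payload,
        headers={"Authorization": f"GenieKey {API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()          # alert creation is asynchronous (202 Accepted)
    return response.json()

# Example: a production disk-space alert tagged with customer and tool.
raise_alert(
    "Disk space utilisation over 90% on SQL01",
    priority="P2",
    tags=["customer-a", "zabbix", "production"],
    description="Volume D: on SQL01 has exceeded the 90% threshold.",
)
```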

Policies

We use Policies to prioritise each alert received in OpsGenie and to include the alert priority in OpsGenie email subjects. Policies are evaluated in a top-down approach according to the policy logic: every alert starts at P5, then the P4 condition is tested and applied if it matches, then the P3 condition, and so on. A sketch modelling this evaluation order follows the list below. We use the following priorities in the EMR Cloud Managed Service:

  • P5: All alerts are initially categorised as P5. When the team checks the OpsGenie Portal and sees a P5 alert, they know that the alert should be categorised appropriately for any future occurrences.
  • P4: All alerts received from the test environment are assigned this categorisation. This filters out non-production alerts.
  • P3: Informational alerts with no follow up required.
  • P2: Alerts which require follow up in working hours. If received out of hours, these will be worked on during the following working day.
  • P1: Alerts which require immediate follow up both in/out of working hours.
  • P2 or P1 (time dependent): Depending on the time of occurrence, these alerts will be assigned either P1 or P2 priority. For example, 'A new Domain User has been created' is a P2 in working hours, but a P1 outside of working hours.
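The snippet below is a rough model of the top-down evaluation described above, not our actual OpsGenie policy configuration (which lives in the OpsGenie portal). The matching conditions, working-hours window and alert text are illustrative only.

```python
from datetime import datetime, time

WORKING_HOURS = (time(8, 0), time(18, 0))  # illustrative working-hours window

def in_working_hours(now=None):
    """True on weekdays within the working-hours window."""
    now = now or datetime.now()
    start, end = WORKING_HOURS
    return now.weekday() < 5 and start <= now.time() < end

def classify(alert_text, environment, now=None):
    """Model of the top-down policy evaluation: start at P5 and let each
    subsequent matching rule override the previous one. Conditions are illustrative."""
    text = alert_text.lower()
    priority = "P5"                                   # default: unrecognised alert
    if environment == "test":
        priority = "P4"                               # non-production alerts
    if "backup completed" in text:
        priority = "P3"                               # informational, no follow-up
    if "disk space" in text:
        priority = "P2"                               # follow up in working hours
    if "new domain user" in text:
        priority = "P2" if in_working_hours(now) else "P1"   # time-dependent rule
    if "service unavailable" in text:
        priority = "P1"                               # immediate follow-up, 24/7
    return priority

# Example: a security-relevant alert raised out of hours would classify as P1.
print(classify("ALERT: New Domain User has been created on DC01", "production"))
```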

OpsGenie Mobile App

The managed service contract ensures that the environment is supported 24/7, which means engineers require notifications of P1 issues out of hours. When a P1 alert is received out of hours, a text message is sent and an OpsGenie app notification is pushed to the on-call phone indicating that the alert has been received. If the text/notification is not opened or acknowledged within 2 minutes, a voice call is made to follow up on the original notifications. If these notifications are still not acknowledged, the same process is repeated with a separate team member's phone number to ensure that the issue is dealt with within the 45-minute SLA window. Engineers can use the OpsGenie app to get the full alert details instantly, without having to log into OpsGenie on their laptops.

SMTP Service

An SMTP service is required to send email alerts. Both Log Analytics and Alert Logic have their own SMTP services; however, Zabbix and SSMS Database Mail do not. To send email alerts from Zabbix and Database Mail, we use a third-party SMTP relay, SendGrid.
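SendGrid exposes a standard SMTP endpoint where the username is literally "apikey" and the password is a SendGrid API key. The sketch below shows those settings in use from Python; in practice, Zabbix and Database Mail are simply configured with the same host, port and credentials in their own mail settings. The addresses, subject and key are placeholders.

```python
import smtplib
from email.message import EmailMessage

# SendGrid SMTP relay settings; the username is literally "apikey" and the
# password is a SendGrid API key (placeholder below).
SMTP_HOST = "smtp.sendgrid.net"
SMTP_PORT = 587
SMTP_USER = "apikey"
SMTP_PASS = "REPLACE_WITH_SENDGRID_API_KEY"

msg = EmailMessage()
msg["From"] = "zabbix@example-managed-service.com"      # illustrative addresses
msg["To"] = "emr-alerts@example.opsgenie.net"
msg["Subject"] = "PROBLEM: High memory utilisation on APP01"
msg.set_content("Memory utilisation has exceeded the configured threshold.")

with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as smtp:
    smtp.starttls()                       # upgrade the connection to TLS
    smtp.login(SMTP_USER, SMTP_PASS)      # authenticate against the relay
    smtp.send_message(msg)
```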

The following diagram shows how each element of the solution works together:

[Diagram: how the monitoring and alerting elements of the solution work together]

Automation

Whilst OpsGenie and SSMS required manual configuration to get them to a production-ready state, we were able to automate some of the monitoring elements using Azure DevOps and the following tools:

  • Zabbix VM and Portal Configuration: ARM Template, Zabbix API calls and Zabbix Templates (see the template import sketch after this list).
  • Alert Logic VM: ARM Template.
  • Zabbix & Alert Logic Agents on target VMs: PowerShell DSC.
  • Azure Log Analytics Workspace, Alerts and VM Extension on target VMs: ARM Template.
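As an example of the kind of Zabbix API call used during provisioning, the sketch below imports an exported Zabbix template via the configuration.import method, the sort of step a pipeline can run once the Zabbix VM is deployed. The URL, credentials, file path and import rules are illustrative and will depend on the Zabbix version in use.

```python
import requests

ZABBIX_API = "https://zabbix.example.internal/api_jsonrpc.php"  # illustrative URL

def zabbix_call(method, params, auth=None):
    """Minimal JSON-RPC helper (same shape as the monitoring sketch earlier)."""
    resp = requests.post(ZABBIX_API, json={
        "jsonrpc": "2.0", "method": method, "params": params, "auth": auth, "id": 1,
    }, timeout=60)
    resp.raise_for_status()
    body = resp.json()
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]

token = zabbix_call("user.login", {"user": "pipeline", "password": "example-password"})

# Read an exported Zabbix template (XML) held in source control and import it,
# creating anything missing and updating anything that already exists.
with open("templates/emr-windows-baseline.xml", encoding="utf-8") as fh:
    template_xml = fh.read()

zabbix_call("configuration.import", {
    "format": "xml",
    "source": template_xml,
    "rules": {
        "templates": {"createMissing": True, "updateExisting": True},
        "items":     {"createMissing": True, "updateExisting": True},
        "triggers":  {"createMissing": True, "updateExisting": True},
        "valueMaps": {"createMissing": True, "updateExisting": True},
    },
}, auth=token)
```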

Proving it Works

With the monitoring and alerting solution configured, we planned and executed a day-long incident simulation to ensure the different elements of the Managed Service solution worked together effectively. This process assigned roles to staff to allow issues to be created, investigated, resolved, and escalated as necessary. We decided on a range of issues (including a red herring) that would cover all of the monitoring and alerting tools and require involvement from all three teams and their associated escalation points. The exercise enabled the teams to be confident in the process and technical solutions that had been developed. The following was monitored during the process:

  • Monitoring Sources: Each issue that was created needed to trigger an alert which could then be actioned by the team. Without an alert, the team would not be aware of issues as they occur. This check provided useful information on timing which allowed us to reduce the time period for heartbeat alerts being received from Log Analytics.
  • Alert Prioritisation: Each alert should be prioritised according to the policy in place in OpsGenie. As the prioritisation tree is based on alert content, it can be quite complex to ensure that alerts aren't reclassified as they pass through the lower levels of the tree. This check ensured that the logic allowed alerts to be prioritised accurately.
  • Communication Tools: Alerts are received by all teams via email and in Microsoft Teams. Additionally, the on-call phone is notified out of hours.
  • Communication between Managed Service Teams: Each issue may require individual teams, or a mix of the Support/Infrastructure/Ops teams. For the latter, we needed to ensure that the staff involved were able to identify the particular teams required to investigate/resolve each issue.
  • Escalation Path: A number of security events were triggered. In these cases, we needed to ensure that Security Incident policies were followed, and issues were escalated appropriately.

Due to the success of the simulation, we plan to repeat the process in future. This will allow us to ensure that as teams change, the members are familiar with the environment and any new functionality, as well as being trained appropriately to deal with issues.

Supporting the Monitoring Solution

With a consistent and proven monitoring solution in place, we can proactively aim to stop certain issues occurring in the first place. With this in mind, we have outlined daily, weekly, monthly, quarterly and yearly routine checks. These checks are performed on a regular, repeated basis, focusing on backups, security, patching and licensing, and prevent many potential issues from arising. The intention going forward is to automate as many of these checks as possible, to increase their accuracy and to reduce the time taken to complete them. This will also allow the service to scale as new customers come on board.

Additionally, we take part in a monthly service review meeting with Alert Logic, which gives a condensed insight into the cases and incidents that Alert Logic sees within our environment.

Recommendations

Our main recommendation from this project would be the incident simulation exercise. Not only is it an all-encompassing test of the individual monitoring and alerting tools and the communications between teams, but it also identifies areas for improvement, both technical and process-related. Additionally, it allows you to perform dry runs of major/security incidents to ensure processes are appropriate, and that the teams are aware of and understand them.

As there are many elements which make up our monitoring solution, knowledge sharing is key to ensuring that everyone who will make use of it has the understanding and awareness to do so effectively. Ideally, you aim to pass on the knowledge accumulated by the subject matter expert to the other members of the team. We did this through extensive documentation and knowledge sharing sessions, and it is now part of our on-boarding process for new team members.

OpsGenie has become a cornerstone element of our monitoring solution. We have found it to be highly cost effective whilst providing a great deal of functionality and useful features. As such, we would certainly recommend it for future projects. Additionally, with the recent OpsGenie acquisition by Atlassian, we will hopefully see greater levels of integration between OpsGenie and other Atlassian tools which we make use of in the project.

From a best practices point of view, the following should also be considered:

  • Aim to get monitoring implemented early in the development of the solution. This will allow for a greater time period for fine tuning out the noise before go-live.
  • Think about the audience for the monitoring outputs. Is there benefit in other teams being aware of ongoing issues? If there would be value in this, make the output visible to other teams as well.
  • Always consider what the solution might look like later: for example, how it will scale when adding more customers, how alerts will be classified, and what naming conventions you will use.

Next Steps

We will aim to integrate OpsGenie with Jira to enable automatic ticket creation for P1/P2 incidents as and when they occur. This will tighten the ticket creation process, ensuring that tickets are raised consistently. This will also help from an auditing perspective.

There are now a number of Azure integrations within OpsGenie which we could consider for implementation including Azure OMS, Azure Resource Health and Azure Service Health. We have Log Analytics Alerts set up for Service Health alerts already, but the integration seems to provide some additional functionality not included with email alerts. These integrations could prove useful in keeping the team better updated relating to potential issues within the Azure estate.

With the architectural monitoring firmly in place, the next step would be to invest in application-focused monitoring, utilising key performance indicators to help identify and diagnose application-level issues. We have this in place to a degree with Azure Application Insights; however, we have identified a range of KPIs which would help to round out the monitoring solution. An example of this would be monitoring the document capture/ingestion service using Zabbix. As this is a key part of the application, we need to ensure it is running smoothly and effectively so that physical documents can be ingested into the system. Application monitoring would be very beneficial to the Support team, who do not currently have this level of insight into application health.

According to the saying, 'a picture is worth a thousand words'. This is especially true when investigating memory/CPU/disk utilisation issues using graphs over time, as opposed to sifting through log files. We have dashboards in the Azure subscription and Zabbix to provide this visual information; however, they are separate from the alerting side of the solution. With a centralised dashboard (using a tool like Grafana within Azure), we would be able to tie resource monitoring from Zabbix and Azure, alert metrics from OpsGenie, and the previously mentioned application KPIs into one central location.


Summary

This work took place over roughly a 10-week period, resulting in a fully documented, fine-tuned, production-ready solution before the go-live date.

Since going live, the team has commented on the reliability of the solution, highlighting in particular that we receive minimal false positives and alert noise. This is a testament to the work that went into the configuration and calibration of the solution across the team. As a whole, this is an example of something that is critical to the success of a production service, done well.


 

About the author

Andrew Thompson