Some things I’ve learned about JIRA Service Desk (JSD)

The basic configuration of JIRA Service Desk (JSD) is pretty easy. As long as you know JIRA Query Language (JQL) reasonably well, or you can use the basic search in JIRA before switching to Advanced to copy and paste the JQL you need, you can set up pretty much everything you need with little or no prior experience. Not much can be customized, and the interface is clean and simple.

That said, I’ve been digging into the nuance of JSD and learning a bit more about its inner workings. Atlassian hasn’t released the code to JSD, so we don’t have much visibility into why or how it does the things it does. This leads to questions and confusion, and in some cases, things simply not working as expected.

SLA Calculation

I’ve recently been investigating how JIRA Service Desk (JSD) calculates time for its SLA met vs. breached statistics. I haven’t gotten too far into the investigation yet, but it has become apparent that JSD reporting is approximate at best. The algorithms aren’t exposed, so we can’t easily audit how it is doing its calculations, but it has become clear that the percent of issues that meet SLA is more of an estimate than an exact figure, and it doesn’t update very quickly.
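Since the built-in percentage can’t be audited, one option is to compute a baseline yourself from exported issue data and compare. The following is a minimal sketch, not JSD’s algorithm: the function name, the sample numbers, and the idea of comparing a plain elapsed time against a goal are all assumptions, and it ignores working calendars and pause conditions.

```python
def sla_met_percent(issues, goal_minutes):
    """Percent of issues whose elapsed time is within the goal.

    issues: list of elapsed times in minutes, one per completed issue.
    Returns None when there are no issues, to avoid dividing by zero.
    Assumption: plain elapsed minutes, no calendars or pauses.
    """
    if not issues:
        return None
    met = sum(1 for elapsed in issues if elapsed <= goal_minutes)
    return 100.0 * met / len(issues)


elapsed_minutes = [30, 45, 240, 600]  # made-up sample data
print(sla_met_percent(elapsed_minutes, goal_minutes=240))  # 75.0
```

If the number you calculate and the number JSD reports drift apart, that at least tells you how rough the report is for your data.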

Similarly, the portal (customer) display doesn’t update in real time, and the information it displays is relativized. I’m seeing this on a system where I have disabled relativized dates, so JSD isn’t respecting that setting. For instance, you might have made a change 1-2 minutes ago, and the portal may say, “About 4 minutes ago.”

In looking at the logs on a JSD system, this appears to be because most of JSD operates through listeners. Events aren’t being fired that trigger JSD to update, but rather JSD has jobs that run every few seconds to pull information and update various displays. This isn’t the case for everything, but I’m seeing it in a lot of unexpected places. Rather than writing new events and triggering/hooking into those, JSD has been written to scan the system in relevant places and update based on what it finds at the time of the scan. It then relativizes its statements so the dates/times reported aren’t false.
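To make the difference between event-driven updates and scanning concrete, here is a toy model of a display that only refreshes when a scan runs. It is a sketch of the general polling pattern, not of JSD’s code: the function names, the fixed scan interval, and the use of plain seconds are assumptions.

```python
def last_scan_time(now, scan_interval):
    """Time of the most recent scan, assuming scans run at multiples of scan_interval."""
    return (now // scan_interval) * scan_interval


def displayed_value(changes, now, scan_interval):
    """Value a polling display shows at time `now`.

    changes: list of (timestamp, value) pairs sorted by timestamp.
    The display only knows about changes made at or before the last scan.
    """
    scan_at = last_scan_time(now, scan_interval)
    visible = None
    for timestamp, value in changes:
        if timestamp <= scan_at:
            visible = value
    return visible


changes = [(0, "Open"), (12, "Closed")]  # made-up status changes, in seconds
print(displayed_value(changes, now=15, scan_interval=10))  # Open
print(displayed_value(changes, now=20, scan_interval=10))  # Closed
```

At second 15 the issue is already closed, but the display still shows the state from the scan at second 10. An event-driven design would have updated at second 12.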

All that to say, it doesn’t update in real time. We’re doing a bit of reverse engineering to try to figure it out, but either way the updates and dates/times will be relative.

Potentially related, I’ve seen that if your local computer’s clock is off from the JIRA server’s clock by even a few minutes, that can cause this to happen. Sync everything to NTP that can be synced to NTP.
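Here is a small sketch of how clock skew alone could turn a 2-minute-old change into “About 4 minutes ago.” It assumes the relative label is computed against the viewer’s clock rather than the server’s, which I haven’t confirmed for JSD; the function name, the label wording for other cases, and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def relative_label(changed_at, viewer_now):
    """Relative-time label based on the viewer's idea of the current time."""
    minutes = max(0, round((viewer_now - changed_at).total_seconds() / 60))
    if minutes == 0:
        return "Just now"
    unit = "minute" if minutes == 1 else "minutes"
    return f"About {minutes} {unit} ago"


server_now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary
changed_at = server_now - timedelta(minutes=2)
skew = timedelta(minutes=2)  # viewer's clock runs 2 minutes fast

print(relative_label(changed_at, server_now))         # About 2 minutes ago
print(relative_label(changed_at, server_now + skew))  # About 4 minutes ago
```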

Knowledge Base Permissions

Permissions for knowledge base search are provided by Confluence and JIRA Project access, and more granular permissions are unavailable. There is no setting in JIRA to have the project and its associated JSD portal available to users but hide the KB search from a subset of those users. You can restrict individual pages, groups of pages, or the entire space in Confluence, and a customer will not be able to reach pages for which they do not have permissions. But the search box itself will remain. If a KB is linked to a JSD project, the KB search box will be visible.

SLAs Can Be Corrupted

One of my clients has both JIRA Enterprise Mail Handler (JEMH) and JIRA Service Desk (JSD). They’re using JSD for external customer communications, but the built-in comment module in JSD wasn’t secure enough because you couldn’t have the comment box default to internal/restricted comments. Therefore, we disabled the comment module for JSD (service-desk-comment-field) and installed Comment Security Default.

And, because we weren’t using the comment module in JSD, we were struggling to design a good First Response SLA. We needed the SLA to stop counting when a comment was entered for a customer, but the Comment for Customer stop condition only works if you’re using that module. To get around this, we wrote some ScriptRunner code to transition to a status named Commented, then transition back to the previous status, and set the stop condition to the Commented status.

But we were still seeing weird problems. The ScriptRunner (SR) script worked, and the transition fired, and the SLA stopped. But when the issue was closed, the First Response SLA re-opened. It was set to start on Issue Creation, so why would it restart when the issue was closed?

JEMH was a common thread in the errors we saw in the logs, but we proved conclusively that JEMH was not a contributor to this bug, nor was it really involved. We had another script that emailed a survey to the customer in response to event 10100 (JEMH Send Survey on Close), and we were seeing that event pop up a lot, which wasn’t surprising since we had tied it to the close transition. When 10100 fires, it sends the survey email provided the Email custom field has an address in it. When the Email CF is blank, as it often is when testing SLAs (since we’re not testing email), errors are generated, and we wondered if the survey failing to send was causing something to hang and triggering the SLA to restart.
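One way to keep those blank-field errors out of the logs while testing is to check the Email field before firing the survey event at all. This is a sketch of that check only, written in Python for illustration rather than as ScriptRunner code; the helper name is hypothetical and the “looks like an address” test is deliberately crude.

```python
def should_send_survey(email):
    """Return True only when the Email field holds something address-like.

    Assumption: the field value arrives as a string or None.
    This is a crude shape check, not real address validation.
    """
    if email is None:
        return False
    email = email.strip()
    return "@" in email and not email.startswith("@") and not email.endswith("@")


print(should_send_survey(None))                    # False
print(should_send_survey("   "))                   # False
print(should_send_survey("customer@example.com"))  # True
```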

By filling in the field, then removing the listener, and also commenting the 10100 event out of the post-inquiry survey script, we concluded that JEMH listeners weren’t contributing to this bug. Having eliminated JEMH from the equation, we wondered if the problem was how the post-inquiry survey script was firing the event.

We noticed that when the event was fired, the change log notation was null, and in talking with a few other consultants, we hypothesized that JSD wasn’t updating the SLA correctly due to a null change log entry. We attempted to manually write to the change log through a variety of methods but were unsuccessful. In addition, we demonstrated that the bug wasn’t occurring in a separate test system despite it being very similar to the dev system where the problem was occurring, and that suggested that the change log issue wasn’t the cause of the bug.

We modified the script a few ways, changed the order of the post functions so the post-inquiry survey fired after the generic event, and generally tried to understand what did and did not cause the error. Everything pointed back to the post-inquiry script.

The SLA configuration had both Create to Close and First Response with starting conditions of Issue Created. Issue Created is an event as well, and we had hypothesized at one time that something about how the post-inquiry survey was firing was triggering Event 1, or Issue Created. That didn’t make sense and we couldn’t see evidence of it in the logs, but we also couldn’t see any other way that the SLA would restart without that happening. What was also odd was that sometimes the SLA restarted, but in the paused state despite there being no pause conditions.

I had already wondered if the JSD add-on had been corrupted at some point, but I was successfully able to disable and re-enable it, and we restarted JIRA. This suggested that the add-on was not corrupted. I had also tried modifying the stop conditions of the SLA multiple times with no positive impact.

I decided to try modifying the start condition by changing it from Issue Created to Entered Status: New. This worked, insofar as the First Response SLA halted when a public comment was added because that triggered the transition to the Commented status, and when the issue was closed, the First Response SLA did not restart. It was still possible that the Issue Created event was being fired somehow, but this no longer mattered because we weren’t using Issue Created for First Response start anymore. But if Issue Created was being fired, then the other SLA that used that event as a start condition, Create to Close, really should have started again.

The problem with this solution is that there are two transitions that go to the New status. The first is when you enter a public comment from New. In this instance, ScriptRunner transitions to Commented then immediately back to New, causing the First Response SLA to be satisfied and then immediately re-open and begin counting down again. The second is when an issue is re-opened, which sets the issue to the New status.
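The difference between the two start conditions is easier to see in a toy model of an SLA timer. This is a sketch of the behavior described above, not JSD’s actual logic: the function name, the history format, and the time units are assumptions.

```python
def sla_cycles(history, start_on, stop_on="Commented"):
    """Return [start_time, stop_time] cycles for a toy SLA timer.

    history: list of (time, status) entries; the first entry is issue creation.
    start_on: "created" to start once at creation, or a status name to start
              whenever the issue enters that status.
    stop_on: status whose entry stops a running cycle.
    A stop_time of None means the cycle is still running.
    """
    cycles = []
    running = False
    for index, (time, status) in enumerate(history):
        if running and status == stop_on:
            cycles[-1][1] = time
            running = False
        if start_on == "created":
            starts = index == 0
        else:
            starts = status == start_on
        if starts and not running:
            cycles.append([time, None])
            running = True
    return cycles


# Issue created in New, public comment at t=5 bounces it New -> Commented -> New.
history = [(0, "New"), (5, "Commented"), (5, "New")]
print(sla_cycles(history, start_on="created"))  # [[0, 5]]
print(sla_cycles(history, start_on="New"))      # [[0, 5], [5, None]]
```

With Issue Created as the start condition there is one completed cycle. With Entered Status: New, the bounce back to New opens a second cycle that starts counting down right after the first one is satisfied.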

But this did conclusively fix the First Response SLA problem. I decided to set the First Response SLA start condition back to Issue Created to see what effect that had. This would be analogous to turning the SLA off and on again in relation to start conditions, which is not something we had tried before.

And with the SLA start condition set back to Issue Created, everything still worked.

I could only conclude that the SLA was corrupted at some point, and changing the start condition so the SLA recalculated and reset was needed to fix the problem. Changing the stop conditions was insufficient to fix this. We also learned along the way that when you modify an SLA and it begins recalculating, it applies to issue creation last—it will first run reports on issues that are already created or closed, and only once it has finished updating its reports on old issues will it apply to an issue being created.

And that’s the story of how we spent several days troubleshooting JSD, JEMH, and ScriptRunner, only to discover that we needed to turn it off and turn it back on again.
