A New Hope

With the new academic year fast approaching, we were hoping to avoid a return of the delays in Grouper to Active Directory provisioning that we’ve suffered for the last two years. Salvation seemingly lay in Grouper’s next-generation provisioning technology but, following a saga longer than a pod race and more twisted than Darth Vader’s mind, we’ve concluded that PSP-NG is still not quite production-ready.

But was that our last hope? No, there is another.

I’ve recently begun working on something I’d been thinking about for a while. It’s not a replacement for the PSP technology but I believe it can complement it and significantly alleviate the impact of the inevitable provisioning backlog at the start of September.

Using Talend, the force behind much of the Institutional Data Feed Service, I plan to interrogate the Grouper change log to find out which AD-provisioned groups have had membership changes. Then, for each of those groups, I can query the Grouper database for its complete current membership list. After a bit of jiggery-pokery, I can then push that full list of members into the corresponding group in AD.
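
The approach above can be sketched roughly as follows. This is not the Talend job itself: it’s a minimal Python illustration that assumes the change log, the current Grouper memberships and the current AD memberships have already been fetched into memory as plain Python structures. The tuple shape of the change-log entries, the group names (`Applications:vpn-users` and so on), the member IDs and the function names are all hypothetical.

```python
def changed_groups(change_log, provisioned_groups):
    """Return the AD-provisioned groups with at least one membership change.

    change_log: iterable of (group_name, member_id, action) tuples. This
    shape is an assumption for illustration, not Grouper's real schema.
    """
    return {group for group, _member, _action in change_log
            if group in provisioned_groups}


def full_sync_plan(groups, grouper_members, ad_members):
    """Compare each group's complete Grouper membership with AD.

    Returns {group: (to_add, to_remove)}. Applying both sets makes the AD
    group match Grouper, however many change-log events led to it.
    """
    plan = {}
    for group in sorted(groups):
        wanted = grouper_members.get(group, set())
        current = ad_members.get(group, set())
        plan[group] = (wanted - current, current - wanted)
    return plan


# Hypothetical example data.
change_log = [
    ("Applications:vpn-users", "alice", "add"),
    ("Applications:vpn-users", "bob", "delete"),
    ("Applications:vpn-users", "carol", "add"),
    ("CorporateData:module-001", "dave", "add"),  # not provisioned to AD
]
provisioned = {"Applications:vpn-users", "Applications:wiki-editors"}
grouper_members = {
    "Applications:vpn-users": {"alice", "carol"},
    "Applications:wiki-editors": {"erin"},
}
ad_members = {
    "Applications:vpn-users": {"bob", "frank"},
    "Applications:wiki-editors": {"erin"},
}

plan = full_sync_plan(changed_groups(change_log, provisioned),
                      grouper_members, ad_members)
for group, (to_add, to_remove) in plan.items():
    print(group, "add:", sorted(to_add), "remove:", sorted(to_remove))
```

In this example the three change-log events for `Applications:vpn-users` collapse into a single push, and `frank`, who never appears in the change log, is still removed because the comparison uses the full membership list. `Applications:wiki-editors` is left alone because nothing in it changed.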

More testing is required but I’m confident that this will be a good addition to our resistance to the problem; perhaps the most powerful weapon in our arsenal of workarounds.

This is just a prequel; you can expect the next episode before the end of the month, where we will let you know whether or not we are in a position to make this new weapon fully operational.

New academic year (and the trouble it brings)

It’s now the middle of September so I feel I owe you an update on the issues we’re facing around provisioning Grouper group memberships to AD. Since 1st September we’ve not had any “real-time” provisioning to AD. We have, again, offered some workarounds to alleviate the impact of the most urgent cases. I’d like to thank the Operations team for their help with this.

This wasn’t unexpected, of course; it happened last year and we knew it would happen again. Michael sent a couple of warning emails out in advance and I tried to prepare everyone I spoke to.

I’ll try to explain why we see this delay in Grouper to AD provisioning: it’s purely down to the volume of membership changes that occur at the changeover of academic year (August to September).

Chart showing monthly updates to Grouper groups, by stem, from March 2016 to September 2017

You can see from the chart that there have been even more changes to group memberships this year than last year, meaning that the provisioning delay has been longer. We’re currently only halfway through September but there have already been over a million membership changes! Most of these are in the student Corporate Data groups, particularly the module enrolment groups.

There are two main reasons why there are substantially more changes this year than last: firstly, Grouper usage has increased considerably over the last year; and secondly, the SAgE reorganisation necessitated the addition of many new student Corporate Data groups.

The issue around the volume of changes is further exacerbated by a bug in Grouper’s provisioning technology which means that every change in Grouper must be processed even though it’s only changes in the Applications stem which are actually pushed through to AD.
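
To illustrate what the bug costs, here is a minimal Python sketch that separates the changes which actually need pushing to AD from those which get processed for nothing. It assumes change-log entries are `(group_name, member_id, action)` tuples and that AD-provisioned groups share the hypothetical name prefix `Applications:`; neither detail is taken from Grouper’s actual schema.

```python
AD_STEM_PREFIX = "Applications:"  # hypothetical name prefix of the AD-provisioned stem


def split_by_stem(change_log, stem_prefix=AD_STEM_PREFIX):
    """Split (group_name, member_id, action) entries by whether their
    group sits in the AD-provisioned stem."""
    relevant, irrelevant = [], []
    for entry in change_log:
        group_name = entry[0]
        if group_name.startswith(stem_prefix):
            relevant.append(entry)
        else:
            irrelevant.append(entry)
    return relevant, irrelevant


# Hypothetical sample: three module enrolment changes, one application group change.
change_log = [
    ("CorporateData:module-001", "alice", "add"),
    ("CorporateData:module-001", "bob", "add"),
    ("CorporateData:module-002", "carol", "delete"),
    ("Applications:vpn-users", "alice", "add"),
]
relevant, irrelevant = split_by_stem(change_log)
print(f"processed: {len(change_log)}, pushed to AD: {len(relevant)}, "
      f"wasted: {len(irrelevant)}")
```

When most of the changes are in student Corporate Data groups, as they are at the start of September, the wasted share is large, which is why the bug hurts most at exactly that time of year.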

We were hoping to be able to escape the delays this year by upgrading Grouper to take advantage of its next-generation provisioning technology, which is much faster and also doesn’t suffer from that bug. Unfortunately, after extensive testing and collaboration with the developers, we decided that it was not quite production-ready for deployment in an enterprise environment. I’m very hopeful that we will be able to upgrade before this time next year!

So, onto the backlog … Well, once we’d seen the number of changes to process, our estimates started off at well over 20 days. Over the last week, however, processing speeds have increased considerably, which we believe is due to the work of ISG in modernising the AD domain controller infrastructure. This means that my current estimate is that the backlog should be cleared sometime on Monday … just in time for registration!
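
The estimate itself is simple arithmetic: days to clear ≈ backlog ÷ (changes processed per day − new changes arriving per day). Here is a minimal Python sketch of that calculation; the figures in the example are hypothetical round numbers, not our measured throughput.

```python
import math


def days_to_clear(backlog, processed_per_day, arriving_per_day=0):
    """Estimate how many days a provisioning backlog will take to clear.

    The backlog only shrinks at the net rate (throughput minus new changes
    arriving); if that is zero or negative it never clears.
    """
    net_rate = processed_per_day - arriving_per_day
    if net_rate <= 0:
        return math.inf
    return backlog / net_rate


# Hypothetical figures for illustration only.
print(days_to_clear(1_000_000, 50_000))          # 20.0
print(days_to_clear(1_000_000, 100_000))         # 10.0 (throughput doubled)
print(days_to_clear(1_000_000, 50_000, 10_000))  # 25.0 (changes still arriving)
```

Doubling the processing speed halves the remaining time, which is why faster domain controllers can bring the finish forward so sharply.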

Grouper to AD provisioning backlog

If you’re reading this you’re probably already aware that we have been suffering from a delay in provisioning group memberships from Grouper to AD. Those of you with long memories may recognise that this kind of thing has happened before, at the academic year switchover, in 2016.

First, the good news: this backlog is not as severe as we experienced in September last year. In fact, I’ve updated my chart to prove it! At the time of writing this, I expect the backlog to be cleared completely over the weekend.

Chart showing monthly updates to Grouper groups, by stem, from March 2016 to August 2017

The spike in July 2017 that caused the backlog is due to an unusual level of activity in preparation for the reorganisation of the SAgE faculty on 31st July. It’s also worth remembering that there are only two days of data included in the August bar.

Now for the bad news: we are likely to experience another substantial delay in AD provisioning at the start of September.

We’re still hopeful of being able to upgrade Grouper to use its new provisioning technology and avoid any kind of backlog, but we fear that the new technology is not quite production-ready yet. This is something that Michael and I are working on at the moment, with some help from the Grouper developers.

If we can’t solve this problem with an upgrade, we’ll mitigate the impact as best we can. You’ll hear more soon.

Grouper performance update

On Monday, the Grouper UI was, at times, unusably slow. This, understandably, was a great inconvenience to several people. For that, I apologise. It was unforeseen but we now understand the reasons and have formulated a plan of action.

Firstly, the reason – why did this happen? Well, the eagle-eyed amongst you may have spotted that Monday was the first day of teaching for the majority of our students. OK, I hear you say, but what does that have to do with Grouper? Well, it’s actually down to the phenomenal popularity of the Newcastle University mobile app. The way the app is currently architected, there is a web service call to the Grouper API every time a student logs in to the app or refreshes their news feed.

Chart showing a large spike in logins to the mobile app at start of term

The chart above shows the spike on Monday, when new students downloaded the app for the first time and returning students logged in to see their timetables. The resulting spike in calls to the Grouper API was too much for the poor little server to handle, with the database process maxing out the CPU.

So, what are we doing about it? Well, we’re not just going to stand here and do nothing but apologise. Our approach is three-pronged (rather like a fancy dessert fork):

  1. We’re going to add more CPU to our Grouper server.
  2. We’re working with Mike, Mike, Andy and Marc to redesign the data architecture behind the app, using purpose-built RESTful web services from IDFS and removing Grouper from the equation.
  3. We’re continuing with our Grouper upgrade plans.

Grouper performance (issues)

One of the main benefits of upgrading Grouper last year was the introduction of “real-time” provisioning of groups and memberships into AD. Previously, AD syncing had occurred four times a day, which was good enough for most scenarios but not perfect for everyone.

Since upgrading, the Grouper PSP technology, which handles “real-time” AD provisioning, has coped nicely with everything that’s been thrown at it (averaging around 50,000 changes per month). This chart, showing monthly group membership changes by stem, gives an indication of what it’s handled from March to August 2016. You can see it was busy in August and there’s a peak in April.

Chart showing monthly updates to Grouper groups, by stem, from March to August 2016

It had coped nicely, that is, until the end of the academic year. Now, if we add September 2016 into the chart, it provides a nice visualisation of why PSP has been suffering for the last fortnight.

Chart showing monthly updates to Grouper groups, by stem, from March to September 2016

So, since 1st September we’ve not had any “real-time” provisioning and the situation has been far worse than it was prior to our upgrade last year, with some changes having to wait well over a week before being reflected in AD. We’ve offered some workarounds to alleviate the impact of the most urgent cases but this service failure still weighs heavily upon me.

As I write this, I’m hopeful that the provisioning service will finally catch up with itself overnight tonight and we’ll return to the happy state of “real-time” provisioning tomorrow.

So, whilst I’m quite content that PSP can cope with our needs for the majority of the year, the service since the start of September has not been satisfactory. We’ve now started making the necessary plans to replace PSP so that we won’t have to suffer like this again next year.

Payback time

We had a good day yesterday. We didn’t produce anything new but we paid off a huge chunk of technical debt.

We moved our data warehouse to a new database on a new server and completed the upgrade of our (several hundred) data feed jobs to Talend 6.

There were a few obstacles along the way but nothing that the crack team of experts working around me couldn’t handle.

Whilst we were confident that we had got everything working yesterday, it was still reassuring to see that all of the overnight jobs, including the warehouse load, ran successfully last night.

Grouper upgrade

In December 2015, we upgraded our Grouper installation to the latest stable version. (For the uninitiated, Grouper is the software at the heart of NUIT’s group management service.)

The upgrade had been a long time coming. It had been talked about many times but had never quite made it to the top of the priority list. This was mainly due to the rapid expansion in demand for the institutional data feed service, which has taken up a significant proportion of our time and effort over the last couple of years, but also because the group management service has been very stable and reliable, carrying out its function quietly and competently.

The new version of Grouper has a few significant differences to the version we were running: most obvious is the new dashboard-based user interface but there are also a few other nice new features.

Shortly prior to the upgrade, we held a couple of captivating and enlightening demo sessions to highlight these differences to existing Grouper users. If you were unable to attend (or if you did attend and would like to relive the joy of the invigorating presentation), it’s now available on ReCap.