A New Hope

With the new academic year fast approaching, we were hoping to be able to avoid a return of the delays in Grouper to Active Directory provisioning we’ve suffered for the last two years. Salvation seemingly lay in the hands of Grouper’s next generation provisioning technology but, following a saga longer than a pod race and more twisted than Darth Vader’s mind, we’ve concluded that PSP-NG is still not quite production-ready.

But was that our last hope? No, there is another.

I’ve recently begun working on something I’d been thinking about for a while. It’s not a replacement for the PSP technology but I believe it can complement it and significantly alleviate the impact of the inevitable provisioning backlog at the start of September.

Using Talend, the force behind much of the Institutional Data Feed Service, I plan to interrogate the Grouper change log to find out which groups that are provisioned to AD have had membership changes. Then, for each of those groups, I can query the Grouper database for its complete current membership list. After a bit of jiggery-pokery, I can then push the full list of members into the corresponding group in AD.
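
To make the idea a bit more concrete, here is a minimal Python sketch of the logic. It is not the Talend job itself: the data shapes (change log entries as dicts with a `group` key, memberships as dicts of sets), the function names and the example groups and users are all assumptions invented for illustration.

```python
def groups_needing_sync(change_log, provisioned_groups):
    """Return provisioned groups that appear in the change log, in first-seen order."""
    seen = []
    for entry in change_log:
        group = entry["group"]
        if group in provisioned_groups and group not in seen:
            seen.append(group)
    return seen


def full_sync(change_log, provisioned_groups, grouper_members, ad_members):
    """Overwrite each changed group's AD membership with Grouper's full current list."""
    for group in groups_needing_sync(change_log, provisioned_groups):
        # Push the complete member list rather than replaying individual changes.
        ad_members[group] = set(grouper_members.get(group, set()))
    return ad_members


# Hypothetical example data (group names and users are made up).
change_log = [
    {"group": "Applications:wifi:staff", "action": "addMembership"},
    {"group": "CorporateData:modules:example", "action": "addMembership"},
    {"group": "Applications:wifi:staff", "action": "deleteMembership"},
]
provisioned = {"Applications:wifi:staff", "Applications:print:students"}
grouper = {"Applications:wifi:staff": {"alice", "bob"}}
ad = {"Applications:wifi:staff": {"alice", "carol"}}

full_sync(change_log, provisioned, grouper, ad)
print(sorted(ad["Applications:wifi:staff"]))
```

The point of the design is that a group with many individual changes only needs one write to AD, because the full current list replaces whatever was there before.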

More testing is required but I’m confident that this will be a good addition to our resistance to the problem; perhaps the most powerful weapon in our arsenal of workarounds.

This is just a prequel; you can expect the next episode before the end of the month, where we will let you know whether or not we are in a position to make this new weapon fully operational.

A delay, you say?

It’s been pointed out to me that the impact of the Grouper to AD provisioning delays is actually not widely understood. I’ll try to sum it up here as explicitly as I can but please feel free to ask if I haven’t explained anything as well as you would’ve liked!

The delay is in provisioning new groups, changes to groups and changes to group memberships from Grouper to AD. There is no impact to existing groups and group memberships.

So, what could be affected by this?

Anything that uses AD groups in the ‘GrouperGroups’ OU for access control could be affected. These groups are used directly to control access to (among other things) shared filestore, mailing lists, wifi, calendars, printers, PCs, software and the new Rocket HPC system (currently in pilot, I believe).

Additionally, AD’s GrouperGroups groups show up as a Shibboleth attribute which can be used to restrict access to any resources protected by the Login Gateway to a specific group of people. Known uses for this include Microsoft Imagine (formerly Dreamspark), some internal websites and some holiday booking systems. There could, of course, be others.

And what is definitely not affected?

There is no delay internal to Grouper so group memberships within Grouper are up to date. Anything relying on Grouper groups directly or via data feeds, such as some features of the mobile app or Chubb access to some buildings, will be fine.

What does all this actually mean?

Firstly, I hope that anyone who has chosen to use (or inherited) Grouper as an access control component of their service knows and understands how Grouper fits into their particular picture; if it includes AD then access to the service could be affected for new users.

It’s something to consider if an end user reports an access issue. For example, if a new member of staff can’t connect to their team’s shared filestore or they’re not receiving emails to a mailing list they should be on then the chances are they are a victim of the delay. If, however, they find that their smartcard won’t let them into Merz Court then that will be caused by something else.

I hope this has helped to clear up the impact of the delay but if not please let us know!

New academic year (and the trouble it brings)

It’s now the middle of September so I feel I owe you an update on the issues we’re facing around provisioning Grouper group memberships to AD. Since 1st September we’ve not had any “real-time” provisioning to AD. We have, again, offered some workarounds to alleviate the impact of the most urgent cases. I’d like to thank the Operations team for their help with this.

This wasn’t unexpected, of course; it happened last year and we knew it would happen again. Michael sent a couple of warning emails out in advance and I tried to prepare everyone I spoke to.

I’ll try to explain why we see this delay in Grouper to AD provisioning: it’s purely to do with the volume of membership changes that occur at the changeover of academic year (August to September).

Chart showing monthly updates to Grouper groups, by stem, from March 2016 to September 2017

You can see from the chart that there have been even more changes to group memberships this year than last year, meaning that the provisioning delay has been longer. We’re currently only halfway through September but there have already been over a million membership changes! Most of these are in the student Corporate Data, particularly module enrolment groups.

There are two main reasons why there are substantially more changes this year than last: firstly, Grouper usage has increased considerably over the last year; and secondly, the SAgE reorganisation necessitated the addition of many new student Corporate Data groups.

The issue around the volume of changes is further exacerbated by a bug in Grouper’s provisioning technology which means that every change in Grouper must be processed even though it’s only changes in the Applications stem which are actually pushed through to AD.
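
To give a feel for how much of that processing is wasted effort, here is a small Python sketch that counts changes per stem and works out what fraction actually needs pushing to AD. It assumes group names are colon-separated paths with the stem first; the sample names are invented and the real proportions will differ.

```python
from collections import Counter


def changes_by_stem(group_names):
    """Count changes per top-level stem (the text before the first colon)."""
    return Counter(name.split(":", 1)[0] for name in group_names)


def relevant_fraction(group_names, stem="Applications"):
    """Fraction of changes that fall under the stem that is pushed to AD."""
    if not group_names:
        return 0.0
    return changes_by_stem(group_names)[stem] / len(group_names)


# Hypothetical changed groups, one entry per membership change.
changed = [
    "CorporateData:students:modules:A",
    "CorporateData:students:modules:B",
    "CorporateData:students:modules:C",
    "Applications:filestore:team_x",
]

print(changes_by_stem(changed))
print(relevant_fraction(changed))  # 0.25
```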

We were hoping to be able to escape the delays this year by upgrading Grouper to take advantage of their next generation provisioning technology, which is much faster and also doesn’t suffer from that bug. Unfortunately, after extensive testing and collaboration with the developers, we decided that it was not quite production-ready for deployment in an enterprise environment. I’m very hopeful that we will be able to upgrade before this time next year!

So, onto the backlog … Well, once we’d seen the number of changes to process, estimates started off at well over 20 days. Over the last week, however, we’ve seen processing speeds increase considerably. We believe this is due to the work of ISG in modernising the AD domain controller infrastructure. This means that my current estimate is that the backlog should be cleared sometime on Monday … just in time for registration!
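
For anyone curious how an estimate like this is arrived at, the arithmetic is simply the backlog divided by the net processing rate. The Python sketch below shows the calculation; the function name and the figures in the example calls are illustrative assumptions, not measurements from our system.

```python
def days_to_clear(backlog, processed_per_day, new_per_day=0):
    """Estimate days until the backlog is empty, or None if it is not shrinking."""
    net = processed_per_day - new_per_day
    if net <= 0:
        return None
    return backlog / net


# Illustrative figures only: a faster processing rate shortens the estimate.
print(days_to_clear(1_000_000, 50_000))   # 20.0
print(days_to_clear(1_000_000, 200_000))  # 5.0
```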

Grouper to AD provisioning backlog

If you’re reading this you’re probably already aware that we have been suffering from a delay in provisioning group memberships from Grouper to AD. Those of you with long memories may recognise that this kind of thing has happened before, at the academic year switchover in 2016.

First, the good news: this backlog is not as severe as we experienced in September last year. In fact, I’ve updated my chart to prove it! At the time of writing this, I expect the backlog to be cleared completely over the weekend.

Chart showing monthly updates to Grouper groups, by stem, from March 2016 to August 2017

The spike in July 2017 that caused the backlog is due to an unusual level of activity in preparation for the reorganisation of the SAgE faculty on 31st July. It’s also worth remembering that there are only two days of data included in the August bar.

Now for the bad news: we are likely to experience another substantial delay in AD provisioning at the start of September.

We’re still hopeful of being able to upgrade Grouper to use its new provisioning technology and avoid any kind of backlog but we are fearful that the new technology is not quite production-ready yet. This is something that Michael and I are working on at the moment, with some help from the Grouper developers.

If we can’t solve this problem with an upgrade, we’ll mitigate the impact as best we can. You’ll hear more soon.

Creating new groups

I thought I’d already written about this but it seems I haven’t! (I can’t find it at any rate, which is just as bad.)

This post is for people who create groups in the Applications stem of Grouper and want them to be provisioned to AD or to be available as Shibboleth attributes.

There is a known error in this version of Grouper whereby the automatic provisioning only works if at least one member is added to the group before the provisioning service makes its first attempt to provision to AD; empty groups will not be provisioned.

In practice, this means that you must add a member to a group within about 45 seconds of creation of that group or provisioning (and subsequent updates) will fail for that group.

My tip here is simply to add yourself immediately to any new group you create in the Applications stem. This will ensure that it is picked up by the provisioning process. You can then correct the membership at leisure.

If you do happen to fall foul of this trap, don’t fret; we can fix it for you. Just contact the Service Desk (or log your own ticket in NU Service) explaining what’s happened and please include the full ID path of the group.

I’d also like to take this opportunity to remind you not to use spaces and slashes in group and folder IDs; please replace them with underscores.
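
If you generate IDs from display names, that replacement is easy to automate. Here is a minimal Python sketch; the function name is made up, and treating backslashes the same as forward slashes is my assumption rather than a documented Grouper rule.

```python
import re


def sanitise_id(raw):
    """Replace each space, forward slash or backslash with an underscore."""
    return re.sub(r"[ /\\]", "_", raw)


print(sanitise_id("Team A/Shared Files"))  # Team_A_Shared_Files
```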

Grouper performance update

On Monday, the Grouper UI was, at times, unusably slow. This, understandably, was a great inconvenience to several people. For that, I apologise. It was unforeseen but we now understand the reasons and have formulated a plan of action.

Firstly, the reason – why did this happen? Well, the eagle-eyed amongst you may have spotted that Monday was the first day of teaching for the majority of our students. OK, I hear you say, but what does that have to do with Grouper? Well, it’s actually down to the phenomenal popularity of the Newcastle University mobile app. The way the app is currently architected, there is a web service call to the Grouper API every time a student logs in to the app or they refresh their news feed.

Chart showing a large spike in logins to the mobile app at start of term

The chart above shows the spike on Monday, when new students downloaded the app for the first time and returning students logged in to see their timetables. The resulting spike in calls to the Grouper API was too much for the poor little server to handle, with the database process maxing out the CPU.

So, what are we doing about it? Well, we’re not just going to stand here and do nothing but apologise. Our approach is three-pronged (rather like a fancy dessert fork):

  1. We’re going to add more CPU to our Grouper server.
  2. We’re working with Mike, Mike, Andy and Marc to redesign the data architecture behind the app, using purpose-built RESTful web services from IDFS and removing Grouper from the equation.
  3. We’re continuing with our Grouper upgrade plans.

Grouper performance (issues)

One of the main benefits of upgrading Grouper last year was the introduction of “real-time” provisioning of groups and memberships into AD. Previously, AD syncing had occurred four times a day which was good enough for most scenarios but not perfect for everyone.

Since upgrading, the Grouper PSP technology, which handles “real-time” AD provisioning, has coped nicely with everything that’s been thrown at it (averaging around 50,000 changes per month). This chart, showing monthly group membership changes by stem, gives an indication of what it’s handled from March to August 2016. You can see it was busy in August and there’s a peak in April.

Chart showing monthly updates to Grouper groups, by stem, from March to August 2016

It had coped nicely, that is, until the end of the academic year. Now, if we add September 2016 into the chart, it provides a nice visualisation of why PSP has been suffering for the last fortnight.

Chart showing monthly updates to Grouper groups, by stem, from March to September 2016

So, since 1st September we’ve not had any “real-time” provisioning and the situation has been far worse than it was prior to our upgrade last year, with some changes having to wait well over a week before being reflected in AD. We’ve offered some workarounds to alleviate the impact of the most urgent cases but this service failure still weighs heavily upon me.

As I write this, I’m hopeful that the provisioning service will finally catch up with itself overnight tonight and we’ll return to the happy state of “real-time” provisioning tomorrow.

So, whilst I’m quite content that PSP can cope with our needs for the majority of the year, the service since the start of September has not been satisfactory. We’ve now started making the necessary plans to replace PSP so that we won’t have to suffer like this again next year.