Why we use Git

Having all of our scripts and configurations in a single source code repository gives us a single source of truth that is available to all team members. As we move towards having more systems built using Infrastructure as Code, this removes knowledge silos and reliance on single domain experts. Because everything is version controlled, we have a full audit trail of changes: who made what change, when, and why. That also means we can roll backwards and forwards in time to use scripts and configurations in different states.

On the Windows side, we initially used Microsoft’s Team Foundation Version Control (TFVC), as part of Visual Studio Team Services, since this was already in use on a smaller project. We’d also used Subversion to manage configs on the Unix/Linux side of the estate. When expanding usage out to a larger team of people, and to more systems, we felt that it made sense to migrate to Git for a number of reasons:

  • Git has excellent cross-platform support. You can use it with whichever editor/IDE you want on Windows, Linux, or Mac.
  • Git supports branching, which offers more flexibility for a diverse team working on different areas and merging into a single source control repository. It also allows us to ensure that anything going into the Master (or production) branch has passed tests – see the branching sketch after this list.
  • Git is widely used in the community. We are increasingly finding community resources on GitHub, and we would aspire to contribute some of our work back to the community. It makes sense to be using the same tool.
  • The message we’re hearing from the DevOps community is “Use source control. Whatever source control you’ve got is fine, but if you don’t currently have any; use Git.”
  • Visual Studio Team Services is just as comfortable to use with Git as it is with TFVC, if not more so.
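
As a rough sketch of the branching workflow mentioned above (the branch name is just an example, and in practice the merge into Master happens via a pull request in VSTS once the automated tests have passed):

    # Create a feature branch and commit work to it
    git checkout -b feature/new-server-config
    git add .
    git commit -m "Add configuration for new web servers"
    git push origin feature/new-server-config

    # Once the tests have passed, merge into Master
    git checkout master
    git merge feature/new-server-config
    git push origin master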

Hosting our Git repository on Visual Studio Team Services offers a number of advantages:

  • When something is checked in to Git, we have VSTS set up to automatically trigger tests on everything. For example, as a bare minimum any PowerShell code needs to pass PowerShell Script Analyzer (PSScriptAnalyzer) checks, and we are writing unit tests for specific PowerShell functions using Pester (see the sketch after this list). If any of these fail, we don’t merge the changes into the Master code branch for production use.
  • Changes to code can be linked to Work Items on the VSTS Kanban board.
  • Microsoft’s Code Search extension in VSTS allows rich searching through everything in the repository.
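
To give a flavour of the checks described above, here’s a minimal sketch of what runs against a check-in – the paths and the function under test are hypothetical, but the cmdlets are the standard PSScriptAnalyzer and Pester ones:

    # Static analysis of everything in the repository
    Invoke-ScriptAnalyzer -Path .\Scripts -Recurse -Severity Warning, Error

    # A minimal Pester unit test for a hypothetical function
    Describe 'Get-ServerConfig' {
        It 'returns the configuration for the requested server' {
            $result = Get-ServerConfig -ComputerName 'localhost'
            $result.ComputerName | Should Be 'localhost'
        }
    }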

In addition to our scripts and configurations, there are advantages to using a version-controlled repository for certain documents. By checking documents in to Git, we can see the history of edits, which may be important when it helps to know how a document and a configuration both looked at some point in the past. Having documents in a cloned Git repository also means we have access to them when network conditions might otherwise not allow it.

DevOps on Windows events in Newcastle this month

If Sean’s post about the WinOps conference was interesting to you, there are a couple of events coming up in Newcastle which may be up your street. Both are free to attend and include free beverages and pizza. They are also both being held at Campus North on Carliol Square, so just a short walk from the campus.

On the evening of Wednesday 15th June, NEBytes is hosting Microsoft MVPs Richard Fennell and Rik Hepworth from Black Marble, talking about DevOps with Azure and Visual Studio Team Services, with a focus on environment provisioning and testing, much of which is relevant to on-premises deployments too. Registration is at https://www.eventbrite.co.uk/e/real-world-devops-with-azure-and-vsts-tickets-25901907302

The following Wednesday evening, the 22nd, DevOps North East has a session on Microsoft, Open Source and Azure, as well as an interesting sounding “Who Wants to be a (DevOps) Millionaire” game. Registration is open at http://www.meetup.com/DevOpsNorthEast/events/231268432/

WinOps 2016

Last week, Jonathan Noble and I attended the WinOps 2016 conference in London; this was a conference centred around the subject of using DevOps working practices with Windows Servers, which is something that Microsoft are focusing a lot of effort on, and something that ISG have taken a lot of interest in. I’ve been told that videos of the talks will soon be available on http://www.winops.org, and I would strongly recommend them for anyone who works with Windows Servers in any capacity. (Update: videos are now available at https://www.youtube.com/playlist?list=PLh-Ebab4Y6Lh09SnM63euerPW0-pauO7k).

The day started with a keynote speech by Jeffrey Snover, from Microsoft; I’m not sure of his current job title as it keeps changing, but he invented PowerShell and is basically in charge of Windows Server.

The speech covered the evolution of Windows Server from Windows NT, right through to Server 2016, explaining how the product was continuously changed to meet the needs of the time, which flowed nicely into an overview of Server 2016, designed to enable cloud workloads.

A big part of Server 2016 is the concept of ‘Just Enough Operating System’ and the new Nano Server installation option. For those not aware, Nano Server is the next logical step after Server Core; where Server Core removed the Desktop Experience in order to improve the security, reliability, and speed of your servers, Nano Server strips out absolutely everything unnecessary. It’s not possible to log in to a Nano Server in any way – they’re controlled entirely by remote management tools and PowerShell Remoting. This has enabled Microsoft to shrink the Operating System down to under 500MB. It takes up less space, runs faster, boots in seconds, and requires only a small fraction of the number of patches and reboots that Server with Desktop Experience requires. Jeffrey went as far as to say that Nano Server is “the future of Windows Server.”
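
As a rough illustration of what “no login” means in practice, a Nano Server is managed entirely over PowerShell Remoting along these lines (the server name is made up, and the TrustedHosts step is only needed for machines that aren’t domain-joined):

    # Trust the Nano Server for remoting if it isn't in the domain
    Set-Item WSMan:\localhost\Client\TrustedHosts -Value 'nano01' -Concatenate

    # Interactive remote session
    Enter-PSSession -ComputerName nano01 -Credential (Get-Credential)

    # Or run commands against it non-interactively
    Invoke-Command -ComputerName nano01 -ScriptBlock { Get-NetIPAddress }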

Also coming with Server 2016 is support for Docker-compatible containers. If you’re not familiar with these, it’s worth getting acquainted – one server can run multiple containers, and each functions as if it were its own server, completely isolated from the others, but sharing the underlying operating system and other resources from the host machine. The container itself is a single object, making it very simple to transfer between hosts, or to duplicate and spin up multiple copies of a containerised application.
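
For anyone who hasn’t seen containers before, the basic workflow with the Docker tooling looks roughly like this (using the Windows Server Core base image as an example; the container names are made up):

    # Pull a base image and start two identical, isolated containers from it
    docker pull microsoft/windowsservercore
    docker run -d --name app1 microsoft/windowsservercore
    docker run -d --name app2 microsoft/windowsservercore

    # Both containers share the host's kernel and resources but run in isolation
    docker ps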

A couple of other important technologies touched upon were Windows Server Apps (WSA) and Just Enough Administration (JEA). WSA is a new way of deploying applications based on AppX; server support for MSI will be deprecated in favour of WSA, largely because MSI is horrible and unsuitable for server environments. JEA is a new PowerShell feature which allows the creation of PowerShell endpoints that users can connect to in order to perform a specified subset of admin tasks, without needing to be administrators on the target server (even if the tasks would usually require it); this means that you don’t need to hand over the keys to your kingdom in order to let someone perform a few updates or run backups.
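
A minimal JEA sketch might look something like the following – the role, group, and allowed cmdlet are all invented for illustration (and in practice the role capability file needs to live in a module’s RoleCapabilities folder):

    # Role capability: expose only the ability to restart one service
    New-PSRoleCapabilityFile -Path .\SpoolerOperator.psrc `
        -VisibleCmdlets @{ Name = 'Restart-Service'; Parameters = @{ Name = 'Name'; ValidateSet = 'Spooler' } }

    # Session configuration mapping an AD group to that role, running as a virtual account
    New-PSSessionConfigurationFile -Path .\SpoolerOperator.pssc `
        -SessionType RestrictedRemoteServer -RunAsVirtualAccount `
        -RoleDefinitions @{ 'DOMAIN\SpoolerOperators' = @{ RoleCapabilities = 'SpoolerOperator' } }

    Register-PSSessionConfiguration -Name SpoolerOperator -Path .\SpoolerOperator.pssc

    # A member of DOMAIN\SpoolerOperators can then connect to just this constrained endpoint
    Enter-PSSession -ComputerName server01 -ConfigurationName SpoolerOperator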

The second talk of the day was by Iris Classon, a C# MVP who works for Konstrukt. Iris’s talk was entitled “To The Cloud” and discussed the journey that her company made while moving their services to Azure. Key points of the talk were discussions around the automation of manual processes, such as unit testing, integration testing, and operational validation testing, as well as deployment. She also advocated heavily for using JEA (mentioned above) to prevent system administrators from having access to sensitive data that they didn’t need to see.

The third talk of the day was by Ed Wilson, who works on Microsoft’s new Operations Management Suite (OMS), and is the author of the Hey, Scripting Guy! blog. The talk was primarily an overview of OMS, which is a suite of tools designed to offer Backup, Analytics, Automation, and Security Auditing for hybrid cloud/on-premises environments. OMS is constantly under active development with new features coming online all the time, so it’s definitely worth keeping an eye on. Highlights so far are:

  • OMS Automation (formerly Azure Automation), which has been described as PowerShell as a Service – it offers a repository where PowerShell runbooks can be stored and run on a schedule (a sketch follows this list).
  • Secure Credential Store – exactly what it sounds like – stores credentials securely so that you can use them from the rest of OMS.
  • Windows and Linux machines are supported for monitoring (as well as anything else that can output a text-based log file).
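
To give an idea of how the first two of those fit together, an OMS Automation runbook is essentially just a PowerShell script, and Get-AutomationPSCredential pulls a credential from the secure store at runtime. The credential asset and server names below are invented:

    # Hypothetical runbook: restart a service across a few servers using a stored credential
    $cred = Get-AutomationPSCredential -Name 'SvcAdmin'

    foreach ($server in 'web01', 'web02') {
        Invoke-Command -ComputerName $server -Credential $cred -ScriptBlock {
            Restart-Service -Name 'Spooler'
        }
    }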

Fun fact mentioned in this talk: PowerShell is now ten years old. Probably time to pick it up if you haven’t yet done so 😉

Next up was Michael Greene, who works on Microsoft’s Enterprise Cloud Customer Advisory Team, and who gave an excellent talk about using Visual Studio Team Services, PowerShell, and Pester to implement a release pipeline for applications and infrastructure. This was particularly interesting to me, as these are the tools that we’re using in ISG, and I’ve spent the last couple of months trying to do exactly this. Michael strongly advocated configuring infrastructure as code, which allows the use of proper source control, automated testing, and automated deployment (only if all of the automated tests pass); working in this way has been shown to greatly improve the reliability and agility of IT services.
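
To illustrate the kind of test that gates a release in this sort of pipeline, an operational validation check in Pester might look something like this (the service and port are placeholders):

    Describe 'Web server operational validation' {
        It 'has the IIS service running' {
            (Get-Service -Name 'W3SVC').Status | Should Be 'Running'
        }
        It 'accepts connections on port 443' {
            (Test-NetConnection -ComputerName 'localhost' -Port 443).TcpTestSucceeded | Should Be $true
        }
    }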

Some excellent further reading on this subject was offered in the form of Microsoft’s whitepaper: The Release Pipeline Model (http://aka.ms/thereleasepipelinemodel) and Steven Murawski’s DevOps Reading List (http://stevenmurawski.com/devops-reading-list/).

Soundbite: If you want to work with Windows Server, the most important technology to learn right now is Pester.

During lunch we had a wander round stalls set up by vendors trying to sell their various DevOps-related products. One that interested me was Squared Up, a configurable dashboard that presents SCOM data (among other things) in a nice, easy to understand manner. I signed up for a free trial, before we discovered that the University already pays for this product. I need to chase this up with our contacts to get myself access to it.

After lunch, the talks split into two streams, so we split up in order to cover more ground. I’ll let Jonathan describe the talks he went to here…

My first afternoon session was with Richard Siddaway, covering Nano Server and Containers. This was really a practical demo following on from Jeffrey’s keynote, stepping through the process of configuring both, with the caveat that all of this is pre-release at the moment. It was interesting to note that while Microsoft initially started out by building a PowerShell module to manage containers directly, as a result of feedback they’re re-engineering that to be just a layer on top of Docker, which is the tool that most people use to manage containers today. Another thing that I picked up was that, as things stand, there’s no way to patch containers, yet they need to be at the same patch level as the host. The solution is to just blow the container away and make a new one, but as was demonstrated, that’s quick and easy to do, so probably the most sensible approach anyway. We need to examine these two technologies carefully over the coming months. Richard also mentioned the need to consider version numbering on containers, and which workloads they are suitable for. That’s partly dictated by the workloads that Nano Server will support, which will be limited at launch, but will likely grow reasonably quickly.

Following that, I went to a panel session on technologies, which gave me a shopping list of things to skill up on! The panel agreed that the two most important aspects of the toolchain were Source Control and Build, where the specific tool isn’t important – for Build it just needs to be something that will run scripts, and while it was suggested that any Source Control would be OK, if you don’t already have something, you should choose Git. On the subject of the most significant tools from the community, Pester and Docker were highlighted. Other things that the panel suggested learning about were JavaScript/Node (although TypeScript is preferable to generic JavaScript), OMS, Linux, and Visual Studio Code. A couple of other interesting points I took from this were that containers don’t remove the problem of configuration management – they just move it – and that Azure Stack would work well for a hybrid model where you would usually host a workload on-premises, but could burst up to the cloud for particularly busy periods.

…and while he was doing that, I went to a talk by Gael Colas – a Cloud Automation Architect (if anybody is thinking of overhauling our job titles any time soon, I quite like this one) – about configuration management theory.

This was one of my favourite talks of the day – Gael was making the case for short-lived, immutable servers. The general concept is that a server should be built from configuration code or scripts (the exact method is unimportant; what matters is that it’s completely automated), and then never changed at all – no extra configuration, no quick fixes, no patches. When the server needs to be changed (for patches, for example), the source configuration/script should be updated instead, and a new server deployed from it. This method ensures that we always know the exact configuration of a server and are always able to rebuild it identically, every time – this has massive DR and service reliability benefits. This approach was referred to as Policy Driven Infrastructure. Gael did acknowledge that there are some applications for which this is unsuitable, but they’re rapidly shrinking in number.
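
In the Microsoft toolset we use, “built from configuration code” would typically mean something like a DSC configuration – a minimal sketch, with the role and feature names chosen purely as an example:

    Configuration WebServer {
        Import-DscResource -ModuleName PSDesiredStateConfiguration

        Node 'localhost' {
            WindowsFeature IIS {
                Name   = 'Web-Server'
                Ensure = 'Present'
            }
            Service W3SVC {
                Name      = 'W3SVC'
                State     = 'Running'
                DependsOn = '[WindowsFeature]IIS'
            }
        }
    }

    # Compile to a MOF and apply it; a replacement server is built the same way every time
    WebServer -OutputPath .\WebServer
    Start-DscConfiguration -Path .\WebServer -Wait -Verbose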

The next session I went to was a panel session called DevOps Culture in a Windows World, which mostly turned into people offering advice about how they’ve convinced their organisations to embrace DevOps working practices. You’ll probably see me attempt to use most of the ideas presented over the next few months – this blog post is the start 😉
Two things I will mention here are the suggestion that it’s important to improve visibility – something I think our department could benefit greatly from; everyone should be able to easily see what everyone else is doing, and should be encouraged to share and help each other (I think we are encouraged to share, but we currently lack the tools to do this easily; I have some ideas about that but need to work them through) – and the suggestion that we should look at our services as products, considering their full lifespan when we set them up, instead of treating the set-up of a service as a project which is completed once the service is up and running and then left to rot indefinitely.

The last proper talk of the day was given by Peter Mounce of Just Eat, who discussed how they run their performance testing. Performance is very important to Just Eat, and they work to keep their applications fast by testing their production environment twenty-four hours a day. The theory is that running performance tests in QA is meaningless, because it’s impossible to replicate the behaviour of millions of real people using the production application, so they simply apply fake load to their production servers. The fake load increases as real load increases, so they’re effectively doubling the load on their application all the time – this means that they know they can take that much load, and they’re able to disable the fake load in an emergency to handle massive amounts of real load. In general, I’m not sure that the performance testing elements are that applicable to us at this stage, but there was a lovely soundbite which is very applicable to us: embrace the fact that things are going to break; get better at fixing them quickly.

Finally, everybody came back together for a panel session and discussion, which was interesting, but nothing exceptional to report, then we went for drinks at the expense of Squared Up.

Office365 and SPAM filtering

We’ve started the process of migrating staff email to Office 365 (see http://www.ncl.ac.uk/itservice/email/staff/upgradingtomicrosoftoffice365/). We moved the IT Service this week – not totally without problems but that’s one reason we started on ourselves.

We’ve had some feedback that people are getting more spam since the move which surprises us. We’re using Office 365 in hybrid mode which means that all mail from outside the University comes through our on-site mail gateways (as it always has) before being delivered to the Office 365 servers.

The graph below shows stats for the last month – we’re still rejecting over 80% of the messages that arrive at the gateways. We know this isn’t catching everything, but there’s a dramatic difference between an unfiltered mailbox and a filtered one.

SPAM filtering stats for the last month

FLOSS UK DevOps Conference, Day 2 (26th March)

Stuart Teasdale “Beyond Blue Green – Migrating a legacy application to CI and the Cloud”

Stuart talked us through the story of joining a start-up organisation that was suffering from infrastructure and development issues around their data-logging product; problems such as back-end scaling, inconsistent development practices, and poorly specified hosted servers. He took us through the process of identifying each problem and how it was migrated to modern, consistent processes. Server provision was moved to AWS to take advantage of quick-to-deploy horizontal scaling, and development processes were moved to a continuous integration pipeline. Stuart ended with a good wrap-up of some of the lessons learned, including failing as early and loudly as possible in your development process and keeping all instances of the infrastructure as consistent as possible – special cases always cause problems later on.

Richard Melville “An introduction to Btrfs”

Richard gave us an overview of the current state of Btrfs. He took us through the basic Btrfs concepts such as pools and subvolumes and explained the differences between the Btrfs “RAID” levels. He also showed us how to apply quotas at a per-subvolume level and how to use snapshots for data protection and replication. Finally there was a run-through of how to safely replace a failed drive in a Btrfs RAID pool.

Andrew Beverley “Rexify”

Andrew introduced us to Rex, a configuration management tool. It is similar to Ansible in that you “push” changes to end nodes (using SSH, for example) rather than pulling changes from a master server using an agent. Rex is Perl-based, which means you can easily leverage existing Perl modules in your Rex configuration, which is held in “Rexfiles” (similar to Makefiles); installation is as easy as installing the “Rex” module from CPAN. He also took us through some of the other features such as grouping, transaction support (with rollbacks), and referencing external configuration management databases.

Kenneth MacDonald “Kerberos – Protocol and Practice”

Kenneth opened the talk with an overview of Kerberos and a glossary of common terms before giving us a quick run through about how they’re using Kerberos at Edinburgh University and some statistics on their current infrastructure. This was followed by an entertaining physical demonstration of a typical Kerberos session initiation that involved several volunteers passing around envelopes, padlocks and keys that helped to visualise the process.

Wrap-up

The conference was closed with raffles for prizes from the attending sponsors and a closing speech from the FLOSS UK chairman. I personally thought this year’s event was particularly well organised and in a city that’s always interesting to visit. I highly recommend the FLOSS Spring conferences to anyone who’s interested in the operational/infrastructural side of open source software and meeting folk with similar interests.

FLOSS UK DevOps Conference, Day 1 (25th March)

Clifford’s Tower, York.

I recently travelled to York to attend the yearly Spring DevOps conference run by FLOSS UK. Here’s a quick overview of the talks I attended on the first day.

Jon Leach “Docker: Please contain your excitement”

Jon gave us a crash-course introduction to Linux namespaces and an overview of the various types of namespace. He then went into Linux cgroups and how the combination of cgroups and namespaces enables lightweight containerisation in Linux. We got a quick introduction to LXC as an example of an early containerisation scheme before moving on to Docker. He then took us through the tools that Docker provides to enable building and sharing of container images, and how to create reproducible container builds using Dockerfiles.

David Profitt “Enhancing SSH for Security and Utility”

David told us about the various configuration files available to users of OpenSSH that configure the behaviour of both the client and server sides. He went through useful options for the client-side “.ssh/config” file and provided useful information on generating and distributing user-generated SSH keys, as well as an overview of the options that can restrict what SSH keys can do from the server side.

In the server config he gave us an overview of useful options for locking down configurations and how to target specific configuration options using the “Match” keyword. Finally, there was additional information on how to provide a more secure “chrootable” SFTP environment by changing the default sftp-server process in the server configuration.

Julien Pivotto “Shipping your product with Puppet code”

Julien took us through the problems that you can encounter shipping software code in this age of virtualisation, containers and cloud infrastructure. Challenges such as distribution, hardware and software dependencies, upgrades and ongoing maintenance all need to be addressed. By using a configuration management tool such as Puppet you can design a single distribution package that is flexible enough to adapt to any environment and provide a mechanism to support and maintain the software after installation. He then went through some recommendations on how the Puppet modules should be designed to support this function.

Nick Moriarty “Puppet as a legacy system”

Nick talked us through York University’s current project to migrate their Puppet 2.7-based infrastructure to Puppet 3. He talked through the challenges of maintaining their existing Puppet repository (~130 modules) for an infrastructure that included a range of Linux distributions and versions.

They also decided that they wanted to move to a more “common” Puppet infrastructure setup using tools such as Git for the module repository management and Apache+Passenger for the Puppet master. By moving to a more standard platform they increase the amount of community support and resources available to them.

Pieter Baele “Linux centralized identity and authentication interoperability with AD”

Pieter took us through the history of Unix directory services in his organisation and the process they went through for selecting a new directory service that could interoperate with their Active Directory. After evaluating several options they went with OpenDJ as it provided several advantages including easy configuration, native replication and a RESTful interface for making changes. He then took us through recommendations for a basic directory layout (as flat as possible!) and how to configure clients to use the new directory.

Lightning Talks

A typically frantic session covering everything from research into animal behaviour(!) and provisioning web hosting platforms on the fly with Jenkins & Ansible, to bash shortcuts you never knew you needed.

Day 2 write-up is here.

Infrastructure issues (part 2)

Back in March we had performance issues with our firewalls. One of the things that our vendor raised was what they saw as an unusually high number of DNS queries to external servers: we were seeing around 2,000-3,000 requests/second from our caching DNS servers to the outside world.

A bit more monitoring (encouraged by other sites reporting significantly lower rates than us) identified a couple of sources of unusual load:

1. The solution we use for filtering incoming mail sends DNS queries to all servers listed in resolv.conf in parallel. That doesn’t give any benefit in our environment, so we changed the configuration so that it only uses the caching DNS server on localhost.

2. We were seeing high rates of reverse lookups for IP addresses in ranges belonging to Google (and others) which were getting SERVFAIL responses. These are uncacheable, so they always result in queries to external servers. To test this theory I installed dummy empty reverse zones on the caching name servers and the queries immediately dried up. The fake empty zones meant that the local servers would return a cacheable NXDOMAIN rather than SERVFAIL.

An example of a query that results in SERVFAIL is www.google.my [it should be www.google.com.my]. That was being requested half a dozen times a second through one of our DNS servers. www.google.my just caught my eye – there are probably many others generating a similar rate.

Asking colleagues at other institutions via the ucisa-ig list and on ServerFault reinforced the hypothesis that (a) the main DNS servers were doing the right thing and (b) this was a local config problem (because no-one else was seeing this).

We turned on request logging on the BIND DNS servers and used the usual grep/awk/sort pipeline to summarise the results – that showed that most requests were coming from the Windows domain controllers.

Armed with this information we looked at the config on the Windows servers again and the cause was obvious. It was a very long-standing misconfiguration of the DNS server on the domain controllers – they were set to forward not only to a pair of caching servers running BIND (as I thought) but also to all the other domain controllers, which would in turn forward the request to the same set of servers. I’m surprised that this hadn’t been worse or shown up before, since as long as the domain returns SERVFAIL the requests just keep circulating round.
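
For anyone wanting to check for the same problem, the forwarders on a domain controller’s DNS server can be inspected and corrected with the DnsServer PowerShell module along these lines (the name and addresses are placeholders for a DC and our caching BIND servers):

    # Show the current forwarders on a domain controller
    Get-DnsServerForwarder -ComputerName dc01

    # Replace them so the DC forwards only to the caching BIND servers, not to the other DCs
    Set-DnsServerForwarder -ComputerName dc01 -IPAddress 192.0.2.10, 192.0.2.11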

The graph below shows the rate of requests that gave a SERVFAIL response – note the sharp decrease in March when we made the change to the DNS config on the AD servers. [in a fit of tidiness I deleted the original image file and now don’t have the stats to recreate it – the replacement doesn’t cover the same period]

Rate of DNS requests giving SERVFAIL responses, showing the decrease in March

I can see why this might have seemed like a sensible configuration at the time – at one level it looks similar to the idea of a set of squid proxies asking their peers if they already have a resource cached. Queries that didn’t result in SERVFAIL were fine (so the obvious tests wouldn’t show any problems).

Postscript: I realised this morning that we’d almost certainly seen symptoms of this problem early last July – the graph below shows the very sharp increase in requests followed by the sharp decrease when we installed some fake empty zones. This high level of requests was provoked by an unknown client on campus looking up random hosts in three domains which were all returning SERVFAIL. Sadly we didn’t identify the DC misconfiguration at the time.

DNS request rate in July 2014, showing the sharp increase and the drop after fake empty zones were installed

Windows PowerShell 4.0 quick reference guides

Microsoft have released a number of cheat sheets, offering useful shortcuts and info for PowerShell 4.0, as well as a few of its related technologies such as DSC, WMI, and WinRM.

You can download these in PDF format from http://www.microsoft.com/en-us/download/details.aspx?id=42554 and then print them out and stick them up next to your desk to impress people who walk by.


Recent infrastructure issues (part 1)

It’s not been a great few months for IT infrastructure here. We’ve had a run of problems which have had a significant impact on the services we deliver. The problems have all been in the foundation services, which means that their effects have been wide-ranging.

This informal post is aimed at a technical audience. It’s written from an IT systems point of view because that’s my background. We’ve done lengthy internal reviews of all of these incidents from technical and incident-handling viewpoints, and we’re working on (a) improving communications during major incidents and (b) making our IT infrastructure more robust.

Back in November and December we had a long-running problem with the performance and reliability of our main IT infrastructure. At a system/network level this appeared to be unreliable communication between servers and the network storage they use (the majority of our systems use iSCSI storage, so are heavily reliant on a reliable network). We (and our suppliers) spent weeks looking for the cause and going down several blind alleys which seemed very logical at the time.

The problems started after one of the network switches at our second data centre failed over from one controller to the stand-by controller. There were no indications of any problems with the new controller, so the theory was that something external had happened which caused the failover _and_ led to the performance problems. We kept considering the controller as a potential cause but discounted it since it reported as healthy.

After checking the obvious things (faulty network connections, “failing but not yet failed” disks) we sent a bundle of configs and stats to the vendor for them to investigate. They identified some issues with mismatched flow control on the network links. The theory was that this had been the case since installation but only had a significant impact as the systems got busier. We updated the config on both sides of the link and that seemed to give some improvement, but obviously didn’t fix the underlying problem. We went back to the vendor and continued investigations across all of the infrastructure, but nothing showed up as a root cause.

Shortly before the Christmas break we failed over from the (apparently working) controller card in the main network switch at our second data centre to the original one – this didn’t seem logical as it wasn’t reporting any errors but we were running out of other options. However (to our surprise and delight) this brought an immediate improvement in reliability and we scheduled replacement of the (assumed) faulty part. We all gave a heavy sigh of relief (this was the week before the University closed for the Christmas break) and mentally kicked ourselves for not trying this earlier (despite the fact that the controller had been reporting itself as perfectly healthy throughout).

At the end of January similar issues reappeared. Having learnt our lesson from last time, we failed over to the new controller very quickly – this didn’t have the hoped-for effect, but we convinced ourselves that things were recovering. In hindsight, the improvement was because it was late on Friday afternoon and the load was decreasing. On Saturday morning things were worse and the team reassembled to investigate. This time we identified one of a pair of network links which was reporting errors. The pair of links were bonded together to provide higher bandwidth and a degree of resilience. We disabled the faulty component, leaving the link working but with half the usual throughput (still able to handle normal usage), and this fixed things (we thought). Services were stable for the rest of the week, but on Monday morning it was clear that there was still a problem.

At this point we failed back to the original controller and things improved. Given that we were confident that the controller itself wasn’t faulty (it had been replaced at the start of the month), the implication was that there was a problem with the switch, which is a much bigger problem (this is what one of these switches looks like). We’re now working with our suppliers to investigate and fix this with minimal impact on service to the University.

In the last few weeks we’ve had problems with the campus network being overloaded by the outfall from an academic practical exercise, a denial of service attack on the main web server, and thousands of repeated requests to external DNS servers causing the firewall to run out of resources – but those are stories for another day.

Problem (bug?) deleting folders in Outlook 2013

I was doing some tidying up of a mailbox the other day and wanted to delete some empty folders from the Deleted Items (i.e. permanently delete them). For some reason Outlook gave me the following error:

Opening up Deleted Items and trying to delete each folder in turn gave me this slightly different, but equally frustrating, one:

Attempts to permanently delete the items (using SHIFT and Delete) drew a blank – same problem.

Playing with the folder permissions (via the properties pane) – no joy.

Clearly this is a bug in Outlook (and not anything I was doing wrong). The workaround was to log into the mailbox using Outlook Web Access (OWA) and try it from there – no problem.

It would seem that the synchronisation logic that the full client uses is buggy – it’s useful to remember that, if you run into trouble, you can sometimes do things in OWA that you cannot in Outlook.