Wednesday, 14 March 2018

How Team Foundation Server saves you from a potential mess with IIS and SSL

This is an example of how TFS is robust enough to prevent you from making silly and potentially costly (in terms of time) mistakes.
Let’s say you are configuring a new instance, and you have just had your SSL certificate installed on the machine. So you pick the HTTPS and HTTP option in the configuration settings and select your certificate. And you get an error:

Clicking on that link creates the correct bindings for that certificate. Fair enough.
But the Public URL is not what you like, so you change it to something else. And you go ahead. The result?

The readiness checks prevent you from doing this. They also check for SNI validity, among other things, which comes in handy when you deal with Chrome.
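If you are curious about what that validation roughly amounts to, here is a minimal sketch (plain Python, nothing to do with the TFS installer, and the host name is a placeholder) that performs a TLS handshake against the Public URL host: the handshake fails if the certificate does not cover that name, and because Python sends the name as the SNI value it exercises SNI as well.

import socket
import ssl

# Placeholder values - swap in your own Public URL host and HTTPS port.
PUBLIC_URL_HOST = "tfs.contoso.local"
PORT = 443

context = ssl.create_default_context()
# check_hostname is on by default, so the handshake fails if the certificate
# does not cover the host name we connect to - roughly the mismatch the
# TFS readiness check is protecting you from.
try:
    with socket.create_connection((PUBLIC_URL_HOST, PORT), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=PUBLIC_URL_HOST) as tls:
            cert = tls.getpeercert()
            print("Handshake OK, certificate subject:", cert.get("subject"))
except ssl.SSLCertVerificationError as err:
    print("Certificate does not match the Public URL host:", err)
except OSError as err:
    print("Could not connect (binding missing?):", err)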

Thursday, 1 March 2018

Not all is lost if your cube gets corrupted…

I ran into this very odd situation yesterday with the Reporting capability of my production TFS instance – I realised the Incremental Analysis Database Sync job and the Optimize Databases job were running for hours!


I stopped the Incremental Analysis, and the Optimize Databases job completed successfully. Fine.
But – for whatever reason – my SSAS cube got corrupted! I couldn’t even connect to the Analysis Engine with SSMS. I also found errors in the Event Viewer pointing at a corrupted cube:


Errors in the metadata manager. An error occurred when loading the 'Team System' cube, from the file, '\\?\<path>\Tfs_Analysis.0.db\Team System.3330.cub.xml'.
Errors in the metadata manager. An error occurred when loading the 'Test Configuration' dimension, from the file, '\\?\<path>\Tfs_Analysis.0.db\Configuration.254.dim.xml'.

Now, what to do? It looked like a full-blown rebuild was in order, and that is a costly operation: the rebuild drops both the data warehouse and the SSAS cube, rebuilds the warehouse with data from the TFS databases, and then rebuilds the cube.

It is not like being without source code or Work Items, but still… it is an outage, and it is painful to swallow.

Now, in this case the data warehouse was perfectly healthy – the report showed an update age of just a few minutes. So all the raw data was fine, and all you need to do is rebuild how you look at that data.

The SSAS cube is just a way of looking at the data warehouse. If your warehouse is fine, just wait for the next scheduled Incremental Analysis Database Sync job to run: it will recreate the cube (turning that run into a Full Analysis Database Sync rather than an Incremental one) without going through the full rebuild.


Why didn’t I process this myself using the WarehouseControlService? Simply because the less you mess with the scheduled jobs, the better it is 😊 Hiccups happen, but the system is robust enough to withstand such problems and pretty much self-heal once the stumbling block is removed.

Tuesday, 27 February 2018

Containers and DevOps: where are we? Some of my thoughts.

I spent most of February talking about Containers from the DevOps perspective – why, you might ask? Well, the reason is pretty straightforward: if you are a newbie trying to find resources on containerisation technologies (so not just Docker and Kubernetes!) you will mostly find developer-focused articles, closely followed by C-suite overviews.

Hey, it is totally understandable, don’t get me wrong. Containers are the hip and cool technology to be aware of in 2018, the market has moved and – admittedly – they are a brilliant idea.
The problem is that I believe (and this is a personal opinion) this drive to understand what containers are is skewing the market. Containers are not “like VMs, but better/cooler”; they exist to serve a business purpose.

This business purpose is two-sided. One side appeals to the technologist – you can run the same bits (meaning code, configuration, and toolset) everywhere with some resource tuning, and you are basically pushing Infrastructure as Code (another buzzword, but so 2017…) to the limit. And this is very cool.

But the other side of the business purpose – and the most important one IMHO – is that you can literally change how complex applications are deployed, maximising resource usage and enabling scenarios (blue-green deployments are the first that come to mind) that were previously exclusive to the OTT players.

This is what really matters.

And to be totally fair with you, containerising an application is not that hard – but the application won’t magically improve; it will remain a legacy application running in a container instead of on a VM or a physical host.

On the other hand, an application that actually embraces the philosophy behind containers has a much better chance of bringing a tangible benefit to the company, as it naturally adopts many DevOps best practices.

Yes, it is unavoidable – whenever you see Containers you cannot avoid DevOps.
There is only one mistake you should never make: Containers are not DevOps. DevOps brings together practices and concepts that fit in perfectly when using Containers, but it is not enabled by Containers. You can do DevOps with anything, including Containers.

The two together are a match made in heaven. Just don’t forget they are not the same thing.

Friday, 16 February 2018

A quick look at the new SonarQube tasks for TFS and VSTS

Last week SonarSource released a new version of their tasks for TFS and VSTS, with a couple of very welcome additions.

Up to v3, we basically had to do everything manually – especially passing parameters with the /d:… switch.

v4 introduces a context-aware switch where you can specify what you are using for your build:


The Use standalone scanner option is quite interesting, as it guides you towards providing a .properties file:


Also, gone are the days of using /d:… inline. There is a very handy Additional Properties textbox, parsed line by line, which makes overriding properties very easy:


The tasks are also now split into Prepare Analysis, Run Code Analysis and Publish Analysis Result, allowing a more streamlined design of your Build Definition.
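To give an idea of the line-by-line format, these are the kind of entries that go either in the Additional Properties textbox or in the .properties file fed to the standalone scanner – the keys are standard SonarQube analysis parameters, while the values here are just placeholders:

sonar.projectVersion=1.4.0
sonar.sourceEncoding=UTF-8
sonar.exclusions=**/bin/**,**/obj/**
sonar.verbose=true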

Thursday, 8 February 2018

Build Agents losing connection after switch to HTTPS

A quick one I am dealing with these days – if you switch the Public URL of your Team Foundation Server to HTTPS you might see your Build Agents losing connection with the server.

This usually happens because of a known bug in TFS: an OAuth token isn’t registered, so all the authentication tokens on the agents expire.

Of course YMMV, so always double check with Support before running a Stored Procedure on your production instance.

If you run into this problem, you can mitigate it by reverting the HTTPS switch and changing the Public URL back to the HTTP version. Doing that will re-establish the connection between the server and the agents.
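If you want a quick view of which agents actually dropped off after the change, the REST API exposes the agent pools. Here is a minimal sketch in Python using the requests package – the collection URL and PAT are placeholders, and the exact api-version string depends on your TFS release:

import requests

# Placeholders - point these at your own collection and a valid PAT.
COLLECTION_URL = "https://tfs.contoso.local/tfs/DefaultCollection"
PAT = "<personal-access-token>"
auth = ("", PAT)  # the PAT goes in as the password of a basic-auth pair

# Enumerate the agent pools, then list the agents in each pool with their status.
pools = requests.get(
    f"{COLLECTION_URL}/_apis/distributedtask/pools?api-version=3.0-preview.1",
    auth=auth,
).json()["value"]

for pool in pools:
    agents = requests.get(
        f"{COLLECTION_URL}/_apis/distributedtask/pools/{pool['id']}/agents"
        "?api-version=3.0-preview.1",
        auth=auth,
    ).json()["value"]
    for agent in agents:
        print(f"{pool['name']}/{agent['name']}: {agent['status']}")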

Monday, 22 January 2018

Tips on dealing with WinRM and remoting using the Test Agent

Despite the push we’ve seen in the last few years, the Hosted Build Service might not be the right product for you for whatever reason.

Then, if you are in a situation where your agents aren’t running in the same domain as Team Foundation Server and you want to use the Test Agent, you really risk opening Pandora’s box, courtesy of WinRM and PowerShell remoting.

And to be completely clear – I have nothing against them 😊 the only downside is that they need to be approached in the right way, otherwise the can-of-worms effect is just around the corner.

First and foremost, remember that whenever you target a machine for Test Agent deployment you only need to consider the Build Agent-Test Agent relationship. All the errors you will get are going to be from the Test box, not the build box.

So when you need to configure WinRM, the Test box is the machine that is going to be accepting the connections. While it sounds straightforward, sometimes things happen and one is tempted to look at the Build box first: don’t.

Also, if you really want to use HTTP and WinRM, remember that this is the trickiest combination – so think twice before going down that route!
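A quick sanity check of the Build box → Test box link, run from the Build box, can save a lot of guesswork before you start digging into deployment logs. Here is a minimal sketch using the third-party pywinrm package (not something the Test Agent itself uses); host name, port and credentials are placeholders:

import winrm  # third-party package: pip install pywinrm

# Placeholders - the Test box, its WinRM HTTPS port and a shadow account.
session = winrm.Session(
    "https://testbox01.contoso.local:5986/wsman",
    auth=("testbox01\\buildsvc", "<password>"),
    transport="ntlm",
    server_cert_validation="ignore",  # handy with self-signed certs in a lab
)

# If this round-trip works, the WinRM plumbing between the two boxes is fine.
result = session.run_cmd("hostname")
print("exit code:", result.status_code)
print(result.std_out.decode().strip())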

Then in terms of errors – you will likely face WinRM errors of all sorts. The most common is this:


If you are outside a domain then REMEMBER about Shadow Accounts – they are the only way to keep identity issues to a minimum. You’ll also need to set the TrustedHosts value on the machines pushing the agent.

Then this:


Remember that passwords need to match, and that mixing users at setup time isn’t really a good idea if you are going down the workgroup/non-trusted domain route.

Always triple-check passwords, and I recommend using the same account for both provisioning and execution, at least as a baseline. This will make sure you have a safety net in case things don’t pan out as expected.

Finally, there is this error, which really puzzles me:


This is actually an aggregate exception:


Look at UAC and the execution context for this – it always happens when you are not running something as Administrator that is supposed to be elevated. It drives me mad every time.

Wednesday, 10 January 2018

Don’t overlook the details during a TFS outage

Two weeks ago I dealt with a head-scratching outage. A few minutes of downtime, for a very stupid reason.

So, we start from a situation of total outage. All the services relying on the production AlwaysOn Availability Group cannot connect to the server. People start screaming, emails flow in at a rate of tens per second… well, it wasn’t that bad, but you get the idea 😊 outages are always annoying.

So off we go with the usual stuff – the TFS Management Console does not load any data from the Data Tier, so the first port of call is checking the database servers.

Which are humming along totally fine. What the hell?! The network stack works as expected, and I can ping all the machines involved!

When checking the database servers, I can see that the Availability Group is totally fine – everything is green, synchronised and with no issues. While this is very good on its own (no backups to restore, nothing to sweat too much about), it still does not explain why the Application Tier cannot talk to the Data Tier.

Then the awakening – whenever I try to connect to the AlwaysOn Listener I get a network error, while going directly to the database server works without problems. There it is!

Indeed, pinging the Listener does not work. But why? All the cluster resources were green and online, yet for some reason the affected resource failed to perform its duties.
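A quick way to see the difference from the Application Tier is to compare raw TCP connectivity to the Listener and to the node behind it – a minimal sketch, with placeholder host names:

import socket

# Placeholders - swap in your own AlwaysOn Listener and database node.
targets = {
    "listener": ("tfs-ag-listener.contoso.local", 1433),
    "primary node": ("sqlnode01.contoso.local", 1433),
}

for label, (host, port) in targets.items():
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{label}: TCP connection OK")
    except OSError as err:
        print(f"{label}: cannot connect - {err}")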

Given that all the other moving pieces were perfectly fine, a manual AlwaysOn failover solved the problem. The lesson learned here is that in a complex architecture there is always something easy to overlook yet critical – it’s like breaking a malleolus.