2022 update

Finally resolved the issue with the free download. Follow the link at the bottom of this article.

What is it?

A business glossary is a cornerstone of any successful data governance framework. It underpins much of the effort to assess, track and improve the data asset.

A properly formed glossary is the foundation for driving up the utility of data. It does this by generating trust in that data because we know what it means, and the quality at which it is held.

It is therefore an institution-wide, agreed business view of the most important data and where it is used. As such, it’s a key tool for data stewards and owners to move data out of silos and into a single, trusted source. It won’t do this on its own. But without it, it’s almost impossible to move the conversation on from whose data is right.

This glossary has been developed with Nicola Askham.

What else is it called?

Many things! The common ones are Data Glossary or Data Lexicon. I’ve also heard Business Catalogue. It is most definitely not a data dictionary.

What is it not?

This is not a data dictionary. While it contains information to help us understand where this data ‘lives’ in the real world, it does not include detailed schema or physical validation criteria. There are two reasons why:

  1. A Data Glossary is a business document. It needs to be maintained by the owner and steward community. It is focused on showing what the data means for the university, not how it is physically implemented. Attempting to combine the two would be both confusing and extremely difficult to manage. The Data Dictionary is owned – generally – by the IT group.
  2. A Data Glossary operates at a level of abstraction. So, while a term will be stable, its implementation may not be. A change of Student Record System – for example – will have a huge impact on how a term such as ‘Surname’ is implemented, but the business definition remains stable. This is extremely useful to ensure consistency of use even when the underlying physical landscape changes.
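To make the split concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the class names, the system, table and column names, and the wording of the definition are my assumptions, not part of the template. It shows one glossary term staying put while the data dictionary entries underneath it change with the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryTerm:
    """Business view: maintained by stewards, stable across system changes."""
    name: str
    definition: str
    steward: str


@dataclass(frozen=True)
class DictionaryEntry:
    """Physical view: generally owned by IT, changes when the system changes."""
    term_name: str  # link back to the glossary term
    system: str
    table: str
    column: str
    data_type: str


# Hypothetical business definition of the term.
surname = GlossaryTerm(
    name="Surname",
    definition="The family name a person is formally known by.",
    steward="Student Records",
)

# Hypothetical physical implementations before and after a system replacement.
old_system = DictionaryEntry("Surname", "Legacy SRS", "STU_MASTER", "SNAME", "CHAR(40)")
new_system = DictionaryEntry("Surname", "New SRS", "person", "family_name", "VARCHAR(100)")


def implementations(term, entries):
    """Return every physical entry that implements the given glossary term."""
    return [entry for entry in entries if entry.term_name == term.name]


for entry in implementations(surname, [old_system, new_system]):
    # The physical details differ; the business definition printed is the same.
    print(f"{entry.system}: {entry.table}.{entry.column} -> {surname.definition}")
```

The glossary only ever holds the first kind of record. The second kind belongs in the data dictionary, linked by term name.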

Nicola Askham wrote a great article explaining the difference. It can be found here.

What can I use it for?

A data glossary has many uses. Here are a few of the benefits:

  • A single source of terms for the entire university. This stops – eventually – discussions starting with ‘Well, I’ve got 887 students but your data says 900, who is right?’. The glossary should define what we mean by the terms that are used the most, and – critically – when we use them. So even if two people don’t agree on a definition of – say – an EU student, they can compromise when looking at specific reports or analysis.
  • The development of conceptual models. I’ll do a whole other post on why these are so important, and how they are used to create a single institutional view of what data does and how it travels about the university. An integral part of building these models is agreeing terms and definitions.
  • Showing what is the same and what is different. The skill in developing a usable model is to understand what really is a single term used across the university, and what is something different. It’s a slog to get there but it massively increases trust in data.
  • Fixing problems. A lack of good definitions is often at the root of data quality problems.
  • A communication tool. Just having a starting point for discussion is a brilliant way to bring people into a community trying to improve the data asset. You cannot do that with ten people, but you can when you start to get interest and engagement from all areas of the university.
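The ‘who is right?’ argument is usually two people counting with different definitions. A small sketch in Python shows the effect. The records, status values and both counting rules are invented for illustration and are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    person_id: int
    status: str  # hypothetical values: "enrolled", "interrupted", "withdrawn"


# Made-up records, purely to show how the two rules diverge.
records = [
    Record(1, "enrolled"),
    Record(2, "enrolled"),
    Record(3, "interrupted"),
    Record(4, "withdrawn"),
]


def count_enrolled(rows):
    """Definition A: a student is someone currently enrolled."""
    return sum(1 for row in rows if row.status == "enrolled")


def count_on_roll(rows):
    """Definition B: a student is anyone enrolled or temporarily interrupted."""
    return sum(1 for row in rows if row.status in {"enrolled", "interrupted"})


print(count_enrolled(records))  # 2
print(count_on_roll(records))   # 3
```

Neither number is wrong. The glossary’s job is to record both definitions and say which one applies to which report.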

What do all the columns mean?

Each column header has an explanation of what it’s for and how to complete it. I’ve included a number of examples to show what that looks like. It still may appear daunting, so a rule of thumb for filling it in is:

  • Start with the higher-level terms (e.g. Person, Applicant) and drill down as you get to specific use cases.
  • Complete the green columns for conceptual modelling, data visualisation and gaining agreement on key terms.
  • Complete the blue columns to show the ‘as-is’ state and find outliers, different data sources, etc.

How do I create, maintain and use it?

This is a ‘how long is a piece of string’ question! So, I’ve included some guidelines based on my experience of the right approach for HE, and the things you need in place before you start:

  • Domains, data owners and stewards need to be defined. Stewards will normally ‘own’ and maintain their terms in the glossary. Owners will sign off and enforce the data quality requirements.
  • It’s imperative to choose which terms to start working with. This can be done by reverse engineering parts of your warehouse (if you have one) or something project based (such as Student Recruitment or Student Record System replacement) or even for a problem you’re trying to solve.
  • The glossary must be implemented university-wide and populated by the subject matter experts. This is a long game but getting acceptance of ‘the agreed view’ is crucial.
  • You can’t put everything in it! Material data (i.e. data you really care about/has the most perceived value/regulatory importance etc) should form the bulk of the entries.
  • You can’t please all the people all the time. I try to create definitions that are scoped by use case. So, we will never agree what a student is (or how we count/categorise them) but for university-wide reporting (for example), we’ll agree a definition.
  • Scope creep is legion. Good business analysis skills are needed to really understand the difference between a synonym and something which is truly different. However, ‘Glossary fundamentalism’ will absolutely drive down adoption.
  • Tooling is pretty much mandatory. Spreadsheets do not scale. I’m not a tooling-first advocate but in this case having a way to search and – as importantly – query definitions is so important. Better still is some kind of workflow to help get the questions to the right people.
  • Early use cases tend to be downstream of the warehouse/reporting platform: removing uncertainty around terms in reports and outputs that are used by many people. I’ve used it often to stop copies of data being taken because ‘the source data has the wrong name’. It’s a great way to solve problems and build a community around data outside of individual silos.
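Two of the points above, definitions scoped by use case and the need to search them, can be sketched in a few lines of Python. This is a toy under stated assumptions, not a recommendation of any tool: the `Term` and `Glossary` classes, the field names, the stewards and the example definitions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    name: str
    domain: str
    steward: str
    # One definition per use case, rather than a single 'true' definition.
    definitions: Dict[str, str] = field(default_factory=dict)


class Glossary:
    def __init__(self) -> None:
        self._terms: Dict[str, Term] = {}

    def add(self, term: Term) -> None:
        self._terms[term.name.lower()] = term

    def define(self, name: str, use_case: str) -> Optional[str]:
        """Return the agreed definition for a use case, or None if not agreed yet."""
        term = self._terms.get(name.lower())
        return None if term is None else term.definitions.get(use_case)

    def search(self, text: str) -> List[str]:
        """Return names of terms whose name or any definition mentions the text."""
        needle = text.lower()
        return sorted(
            term.name
            for term in self._terms.values()
            if needle in term.name.lower()
            or any(needle in d.lower() for d in term.definitions.values())
        )


glossary = Glossary()
glossary.add(Term("Student", "Student", "Registry", {
    "university-wide reporting": "A person registered on a programme of study on the census date.",
    "library access": "A person with an active or pending enrolment.",
}))
glossary.add(Term("Applicant", "Admissions", "Admissions Office", {
    "student recruitment": "A person who has submitted an application for a programme of study.",
}))

print(glossary.define("Student", "university-wide reporting"))
print(glossary.search("enrolment"))  # ['Student']
```

A use case with no agreed definition returns nothing, which is exactly the gap you want a workflow to route to the right steward.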

There is so much more I could add, but this post is already too long. You can download the template here. It includes a few example entries. If that doesn’t make sense, I’m always happy to clarify and take questions around how to best use a business glossary. Get in touch via the contact page or leave a comment.