As I write this, I can almost hear you wail “No, no, we don’t have too much metadata, we don’t have nearly enough! We have several projects in flight to expand our use of metadata.”
Sorry, I’m going to have to disagree with you there. You are on a fool’s errand that will generate busywork and have no real impact on your firm’s ability to make use of the data it has.
Let me tell you what I have seen in the last half dozen or so very large firms I’ve been involved with, and you can tell me if this rings true for you. If you are in a mid-sized or even small firm you may want to scale these numbers down by an appropriate factor, but I think the end result will remain the same.
Most large firms have thousands of application systems. Each of these systems has a data model consisting of hundreds of tables and many thousands of columns. Complex applications, such as SAP, explode these numbers (a typical SAP install has populated 90,000 tables and half a million columns).
Even as we speak, every vice president with a credit card is unknowingly expanding their firm’s data footprint by implementing suites of SaaS (Software as a Service) applications. And let’s not even get started on your Data Scientists. They are rabidly vacuuming up every dataset they can get their hands on, in the pursuit of “insights.”
Naturally you are running out of space in your data centers, and especially out of system admin bandwidth, so you turn to the cloud. “Storage is cheap.”
This is where the Marie Kondo analogy kicks in. As you start your migration to the cloud (or to your Data Lake, which may or may not be in the cloud), you decide “this would be a good time to catalog all this stuff.” You launch into the project with the zeal of a Property and Evidence Technician at a crime scene: “Let’s carefully identify and tag every piece of evidence.” The advantage they have, and you don’t, is that their world is finite. You are faced with cataloging billions of pieces of metadata. You know you can’t do it alone, so you implore the people who are putting the data in the Data Swamp (er, Lake). You mandate that anything that goes into the lake must have a complete catalog. Pretty soon you notice that the people putting the data in don’t know what it is either. They know most of it is crap, but that there are a few good nuggets in there. If you require them to supply descriptions of each data element, they will copy the column heading and call it a description.
Let’s just say, hypothetically, you succeeded in getting a complete and decent catalog for all the datasets in use in your enterprise. Now what?