Wondering where to start with DataHub? Shirshanka’s Nov 2020 presentation to the DataHub Community explores the relationship between its two constituent parts: DataHub the app, and DataHub the Generalized Metadata Architecture (GMA). These work in tandem, but the latter is primary: it is the portal enabling the 4 business goals that originally drove DataHub’s creation within LinkedIn:
- Reproducibility
- Audit-ability
- Visibility
- Consistency of concepts
And the 2 principles:
- Integrated with development flow
- Data-centricity (metadata should live alongside the code)
The metadata we need to co-maintain include:
- problem statement of the analytics/AI model being supplied
- pipeline info describing the ETL elements of its lineage
- run info for each execution of those ETL pipelines
- associated projects depending on this model
- associated groups (e.g. project owners, stakeholders)
- analysis results (e.g. split into high-level executive summaries and fully fledged reports with granular, publication-level detail for analysts)
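
To make the list above concrete, here is a minimal sketch of how such a record could live alongside the code as a plain Python structure. This is not DataHub's schema or API: the class names, field names, and sample values are all hypothetical and chosen purely for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class PipelineRun:
    """Run info for one execution of an ETL pipeline (hypothetical shape)."""
    run_id: str
    started_at: str  # ISO-8601 timestamp, kept as a string for simplicity
    status: str


@dataclass
class ModelMetadata:
    """Metadata co-maintained with an analytics/AI model (hypothetical shape)."""
    problem_statement: str
    pipeline_steps: List[str]  # ETL elements of the model's lineage
    runs: List[PipelineRun] = field(default_factory=list)
    dependent_projects: List[str] = field(default_factory=list)
    groups: Dict[str, List[str]] = field(default_factory=dict)
    analysis_results: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses such as PipelineRun
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only; none of these come from a real project.
record = ModelMetadata(
    problem_statement="Predict monthly customer churn",
    pipeline_steps=["extract_crm", "clean_events", "build_features"],
    dependent_projects=["retention-dashboard"],
    groups={"owners": ["analytics-team"], "stakeholders": ["marketing"]},
    analysis_results={
        "exec_summary": "reports/summary.md",
        "full_report": "reports/full.md",
    },
)
record.runs.append(
    PipelineRun(run_id="run-001", started_at="2020-11-01T09:00:00Z", status="succeeded")
)

print(record.to_json())
```

Keeping the record as a serialisable structure in the same repository as the model means it can be versioned and reviewed in the normal development flow, which is the point of the two principles above.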
That all sounds fairly technical; indeed, one comment on that presentation asks whether DataHub can be extended to accommodate business metadata. It absolutely can (arguably this is out-of-the-box functionality and no extension is required): Saxo have documented and presented a fantastic example from their implementation [👈 link to be added here soon].