In July 2019, Gartner® issued its report on data management and placed “Data Lakes” in the “Trough of Disillusionment” of its Hype Cycle, expecting another 5 to 10 years to pass before they reach the “Plateau of Productivity”. In Gartner’s eyes, then, many experiments and first implementations must have failed, and a shakeout of technology providers must be underway.
Well, from our experience, we cannot fully subscribe to this assessment – we see first successful installations, and a readiness of services on the Google Cloud Platform, that tell a different story. For Switzerland in particular, we expect that within 2 to 3 years Data Lakes will be used by 20 to 30% of all companies. We believe this acceleration is strongly fuelled by the presence of Google Cloud in Switzerland and its landmark services for Analytics, Machine Learning and Artificial Intelligence.
We are aware that a “typical” Data Lake is but a “repository of data stored in its natural/raw format”. However, the borderlines between “Data Lakes”, “Data Warehouses”, “Data Marts” (and more!) are becoming as blurry as the business expectations placed on Data Lakes: we see a very broad interpretation of a Data Lake’s value proposition. Although some of the areas we cover in this article probably exceed the strict scientific definition, we use the term “Data Lake” to cover all of them. So please bear with us here…
Whatever the definition, we see a lot of businesses struggling to extract value from their data. Most of this struggle stems from the fact that their data is spread across multiple systems (e.g. ERP, CRM, etc.), is not available in the same format, and is typically structured, prepared and used for specific business processes. Every time someone wants to cross-analyze this data, they are stuck with system limitations: limited data breadth, missing functionality for advanced analytics, scarce computing power and memory, or simply a lack of the skills to use them. So there is a cry from many data scientists for access to data they can analyze more freely.
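To make that promise concrete: once ERP and CRM extracts sit side by side in a lake, a cross-system analysis collapses into a single query. The Python sketch below shows the idea against BigQuery; the project, dataset and table names are our own hypothetical examples, not a prescription.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

# Join CRM customers with ERP orders that now live side by side in the lake.
# All table names below are hypothetical.
query = """
    SELECT c.region, SUM(o.amount) AS revenue
    FROM `my-lake-project.lake.crm_customers` AS c
    JOIN `my-lake-project.lake.erp_orders` AS o
      ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY revenue DESC
"""

for row in client.query(query).result():
    print(row.region, row.revenue)
```

No export jobs, no system-specific reporting module – the limitation shifts from the source systems to a single, scalable query engine.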
The path to get there, however, poses a potentially unlimited number of challenges. Invariably, questions around technology, security and data governance come up – as do the human aspects that accompany all larger IT projects that aim at bridging differences between departments through the use of a new (?) platform.
There are some important ingredients to consider when brewing your own project.
Every CIO’s textbook states that the most important task is to align business and IT on any given IT project. No surprise – a similar challenge also applies to Data Lake initiatives. However, in this case you probably also need to pay attention to a third alignment dimension: the Data Scientist.
Sitting on the border between Business and IT, the data scientist always needs the “freshest” data, the easiest access and unlimited capacity to run analyses and to change data structures back and forth – the freedom that allows him or her to produce value from the data.
This is not exactly the approach of IT operations. The latter run their production systems mostly at maximum load (no money to waste on infrastructure) and with many dependencies on the business processes they fuel. So each and every change needs careful planning and alignment – and please, please! – any change must be secure, tested, acknowledged, agreed and fully compliant…
In the successful projects we accompanied and implemented, we actively moderated this dialogue between the communities. Wherever a good mutual understanding of the different points of view could be established during the project, we were able to drive quickly towards a Data Lake that is both secure and compliant, and at the same time meaningful and productive.
Where such mutual understanding was hard to come by, the project was in danger of producing a “frozen lake” – slippery to navigate, hard to make headway on and nearly unusable.
A Data Lake consists of several elements that all need to be taken into consideration during its life cycle.
A Data Lake doesn’t mean you have to boil the ocean. Start small and plan in stages, so you can launch and learn, iterate and grow your solution along with your increasing knowledge. A good – if not the best – starting point is a “Minimum Viable Product” (MVP) approach.
Choose a useful but not overly complex use case, start with manual ingestion, set up your architecture, install “good-enough” security – and then try out the toolset and check whether it delivers what you expect. Usually you would already have a concept of the final stage in mind and sketch it out, but probably not implement it yet. What “manual ingestion” could look like in practice is sketched below.
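As a minimal sketch of that first manual step on Google Cloud – assuming a raw CSV export already sits in a Cloud Storage bucket, and with all project, bucket and table names hypothetical – a short Python script is enough to load it into a raw zone in BigQuery:

```python
from google.cloud import bigquery

# Hypothetical project name for the MVP; replace with your own.
client = bigquery.Client(project="my-lake-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema – "good enough" for an MVP
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# A CRM export dropped manually into the raw bucket (hypothetical path).
uri = "gs://my-lake-raw/crm/customers_export.csv"

load_job = client.load_table_from_uri(
    uri, "raw_zone.crm_customers", job_config=job_config
)
load_job.result()  # block until the load finishes
print(f"Loaded {load_job.output_rows} rows into raw_zone.crm_customers")
```

Deliberately unglamorous: no pipeline framework, no orchestration – just enough to put real data in front of real users and validate the use case.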
Then – once this is successfully set up – you may want to extend the automation, refine and extend the access management, and make the Data Lake available to a broader community and to more use cases – you can always expand on the MVP foundation you built. One way to extend the automation is sketched below.
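For the automation step, one common pattern – again just a sketch under our own assumptions, with hypothetical function and table names – is to replace the manual load with a background Cloud Function that fires whenever a new file lands in the raw bucket:

```python
# main.py – a background Cloud Function triggered by the
# google.storage.object.finalize event on the raw bucket (names hypothetical).
from google.cloud import bigquery

client = bigquery.Client()

def load_new_object(event, context):
    """Load each newly arrived CSV from the raw bucket into BigQuery."""
    if not event["name"].endswith(".csv"):
        return  # ignore non-CSV objects
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_uri(
        uri, "raw_zone.landing", job_config=job_config
    ).result()
```

The MVP script and this trigger share the same load logic, so the step from manual to automated ingestion is an extension, not a rewrite.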
Start small but with significance – that’s the idea…
Here are some findings we would like to share:
…and – last but not least – get the help of a professional who has done this before.
Like us.