Spark – Efficient technology for the enterprise

The Data Lakehouse – Best of Data Lake and Data Warehouse

Almost every company today uses some kind of data warehouse or business intelligence solution for data analysis and reporting. Those solutions are primarily based on relational data, ETL jobs and reporting. Although powerful, they are limited when it comes to very large data sets or real-time processing.

Some years ago the paradigm of Data Lakes was born to process very large data sets. Data Lakes are based on the idea of raw data processing, streaming data, ELT and machine learning.

What about combining the strengths of both into something even more powerful? This is what is called the Data Lakehouse, a term coined by Databricks.

Evolution of data storage, from data warehouses to data lakes to data lakehouses. Source: https://databricks.com/de/glossary/data-lakehouse

As the name suggests, it combines the strengths of Data Warehouses with the power of Data Lakes. Although the term Data Lakehouse was not really in use in 2020, we had already built one for a logistics company back then.

One of the main datasets in this project comprised 16 years of freight offers plus live data. The historical data was transferred from Oracle databases to a new Data Lake. In addition, stream sources were set up to ingest live data directly from the source applications into the Data Lake. The result was a huge active archive of historical and live data, based on Hadoop, Spark, Kafka and HBase. The raw data was stored and continuously transformed into a normalized form, ready to be processed by reporting and machine learning jobs. A logical structure, metadata and governance were added using Apache Atlas and Avro schemas. Reporting and end-user security were implemented using Microsoft Power BI.
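To make the "raw to normalized" step a little more tangible, here is a minimal sketch in plain Python. It is not the code from the project: the record fields (`OFFER_ID`, `ORIGIN`, `DEST`, `CREATED`, `id`, `from`, `to`, `ts`) and the `FreightOffer` type are hypothetical and only illustrate how historical rows and live events could be mapped to one common shape. In a real platform this logic would typically run as a Spark job rather than as standalone functions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FreightOffer:
    """Normalized record. The fields are hypothetical, not the project's schema."""
    offer_id: str
    origin: str
    destination: str
    created_at: datetime
    source: str  # "historical" or "live"


def normalize_historical(row: dict) -> FreightOffer:
    """Map a row exported from the relational database (assumed shape)."""
    created = datetime.strptime(row["CREATED"], "%Y-%m-%d %H:%M:%S")
    return FreightOffer(
        offer_id=str(row["OFFER_ID"]),
        origin=row["ORIGIN"].strip().upper(),
        destination=row["DEST"].strip().upper(),
        created_at=created.replace(tzinfo=timezone.utc),
        source="historical",
    )


def normalize_live(event: dict) -> FreightOffer:
    """Map an event arriving from a stream source (assumed shape)."""
    return FreightOffer(
        offer_id=str(event["id"]),
        origin=event["from"].strip().upper(),
        destination=event["to"].strip().upper(),
        created_at=datetime.fromtimestamp(event["ts"], tz=timezone.utc),
        source="live",
    )
```

Once both sources produce the same record type, reporting and machine learning jobs can read a single normalized dataset without caring where a record came from.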

The result was something we would probably call a Data Lakehouse today. The combination of BI and Data Lake was very successful, so we created a success story to describe it.
To me it seems that the Data Lakehouse is a very useful concept. It is an evolutionary step towards an integrated solution for processing and analysing massive amounts of data while applying good practices in governance, security and reporting. Surely something BI teams should keep an eye on.

JAX 2020: Big Data and Agile Culture

This year JAX is taking place from 7 to 11 September in Mainz. W-JAX is taking place from 2 to 6 November in Munich. Due to the Corona situation it will be a special experience, as the conferences are going to be held in a hybrid manner (on-site and online). In my sessions I am going to talk about Big Data and Agile Culture.

In the Big Data session I am going to show you how to set up an Open Source Big Data platform from scratch. You will see how popular technologies such as Hadoop, Spark, Hive, Kafka and others work together. We are going to implement a typical end-to-end use case live together. You’ll get a solid understanding of what these technologies do and how they work together to form a platform.

The Agile session covers culture as a building block of agile organisation development. I am going to talk about what culture actually is, why it is an essential part of “being agile” and how to develop it. Moreover, I am going to share experiences and common pitfalls on the journey of agile culture development.

I am glad to be there and hope to meet you on-site or online.

Workshop: Big Data you can Touch

Today I released the brand new workshop: Big Data you can Touch.

If you start researching Big Data platforms, you will find an overwhelming number of possible technologies. But if you dig deeper, you’ll find that many platforms are based on the same proven Open Source products.

This workshop teaches how to set up your own Big Data platform using professional Open Source products. Together we’ll build an end-to-end use case using a Lambda Architecture and Machine Learning.

It is intended for everyone who is generally interested in Big Data platforms, e.g. developers, architects, analysts or decision makers, and who wants to know how those technologies work together.

The workshop takes 4 hours and can be booked as on-site training or as an online webinar. Hope to see you there…

Upcoming events:

7 May, 13:00 – 17:00 Webinar: Big Data zum Anfassen (Big Data you can Touch)

13 May, 13:00 – 17:00 Webinar: Big Data zum Anfassen (Big Data you can Touch)

21 May, 13:00 – 17:00 Webinar: Big Data zum Anfassen (Big Data you can Touch)

JAX 2019: Agile Team Architecture and Big Data

JAX is one of the best-known conferences for Java, architecture and software innovation in Germany. I am glad to be invited this year to give some sessions. JAX will take place from 6 to 10 May 2019 at Rheingold Halle in Mainz.

Agile product teams are becoming more and more mission critical. On the 6th I am going to give a presentation on how agile product teams can be built by applying software architecture principles such as resilience and performance to teams.

When people start learning Big Data technologies, many find them complex because of the sheer number of products in the Big Data ecosystem. On the 8th I am going to show a simple Big Data stack to get started with. I am going to set up a working stack from scratch and implement a working lambda architecture.
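For readers who have not come across the term, a lambda architecture combines a batch layer, a speed layer and a serving layer. The following is only a toy sketch in plain Python to show the idea, not the stack from the session: the names `batch_view`, `SpeedLayer` and `query` and the event shape with a `key` field are made up for this illustration. In a real setup the layers would be backed by products from the Big Data ecosystem instead of in-memory counters.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts from the complete, immutable dataset."""
    return Counter(event["key"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: counts events that no batch run has covered yet."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        # Update the real-time view incrementally, one event at a time.
        self.realtime_view[event["key"]] += 1

    def reset(self):
        # Called after a batch run has absorbed the recent events.
        self.realtime_view.clear()


def query(key, batch, realtime):
    """Serving layer: merge the batch view and the real-time view at query time."""
    return batch[key] + realtime[key]


# Usage: two historical events for "HAM" plus one that has just arrived.
master = [{"key": "HAM"}, {"key": "MUC"}, {"key": "HAM"}]
batch = batch_view(master)
speed = SpeedLayer()
speed.on_event({"key": "HAM"})
print(query("HAM", batch, speed.realtime_view))  # prints 3
```

The point of the split is that the batch view can be slow but complete, while the speed layer only has to bridge the gap until the next batch run.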

You can see the timeslots on the JAX website. I look forward to seeing you there.