Databricks makes bringing data into its ‘lakehouse’ easier
Databricks today announced the launch of its new Data Ingestion Network of partners and of its Databricks Ingest service. The idea here is to make it easier for businesses to combine the best of data warehouses and data lakes into a single platform - a concept Databricks likes to call 'lakehouse.'
At the core of the company's lakehouse is Delta Lake, Databricks' Linux Foundation-managed open-source project. It adds a new storage layer to data lakes that helps users manage the lifecycle of their data and ensures data quality through schema enforcement, log records and more. Databricks users can now work with the first five partners in the Ingestion Network - Fivetran, Qlik, Infoworks, StreamSets and Syncsort - to automatically load their data into Delta Lake. To ingest data from these partners, Databricks customers don't have to set up any triggers or schedules - instead, data automatically flows into Delta Lake.
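For readers unfamiliar with the idea, here is a minimal sketch of what schema enforcement paired with a write log can look like in principle. It is plain Python written for illustration only, not Delta Lake's or Databricks' actual API: the `ToyTable` class, the `SchemaMismatchError` exception and the sample columns are all hypothetical names, and the type check is a deliberate simplification.

```python
from dataclasses import dataclass, field


class SchemaMismatchError(ValueError):
    """Raised when an incoming batch does not match the table's schema."""


@dataclass
class ToyTable:
    # Hypothetical stand-in for a table; not part of any Databricks or Delta Lake API.
    schema: dict                                  # column name -> expected Python type
    rows: list = field(default_factory=list)
    log: list = field(default_factory=list)       # append-only record of committed writes

    def append(self, batch):
        # Validate the whole batch first so one bad row cannot leave a partial write.
        for row in batch:
            if set(row) != set(self.schema):
                raise SchemaMismatchError(f"columns {sorted(row)} do not match schema")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise SchemaMismatchError(f"{column!r} expected {expected.__name__}")
        # Only a fully valid batch is stored and recorded in the log.
        self.rows.extend(batch)
        self.log.append({"version": len(self.log), "operation": "append", "rows": len(batch)})


table = ToyTable(schema={"user_id": int, "event": str})
table.append([{"user_id": 1, "event": "login"}])

try:
    table.append([{"user_id": "2", "event": "login"}])  # wrong type for user_id
    rejected = False
except SchemaMismatchError:
    rejected = True
```

In this toy version, the malformed batch is rejected before anything is written, so the table and its log only ever reflect writes that matched the declared schema.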
"Until now, companies have been forced to split up their data into traditional structured data and big data, and use them separately for BI and ML use cases. This results in siloed data in data lakes and data warehouses, slow processing and partial results that are too delayed or too incomplete to be effectively utilized," says Ali Ghodsi, co-founder and CEO of Databricks. "This is one of the many drivers behind the shift to a Lakehouse paradigm, which aspires to combine the reliability of data warehouses with the scale of data lakes to support every kind of use case. In order for this architecture to work well, it needs to be easy for every type of data to be pulled in. Databricks Ingest is an important step in making that possible."
Databricks VP of Product Marketing Bharath Gowda also tells me that this will make it easier for businesses to perform analytics on their most recent data and hence be more responsive when new information comes in. He also noted that users will be able to make better use of their structured and unstructured data to build machine learning models, as well as to perform more traditional analytics on all of their data instead of just the small slice that's available in their data warehouse.