Databricks announces data sharing, ETL and governance solutions



Databricks’ stated goal is to build the world’s first enterprise AI platform. That is a lofty goal and a work in progress, but first the data has to be whipped into shape, and that requires some work. To that end, the company made several announcements today at its annual Data + AI Summit, including the launch of Delta Sharing, Delta Live Tables, and Unity Catalog.

“To be successful with AI, your world of data has to be tight and well organized,” said Joel Minnick, Databricks’ vice president of marketing. “With bad data, there is no amount of money you can invest that will make you successful with AI. That’s the core idea of the lakehouse, [that] you need to find a way to bring data, analytics, and AI together on one platform.”

It’s also the core idea behind the three announcements Databricks made at this morning’s show, starting with Delta Sharing.

Delta Sharing

Databricks claims Delta Sharing is the world’s first open protocol for sharing data securely. The software, which has been donated to the Linux Foundation, is based on Delta Lake (also open source) and allows two or more participants to send and receive any number of raw data files using Python and SQL.


Minnick explains how it works:

“With Delta Sharing, you have a Delta Sharing server that manages those connections,” he says. “Then you make a request: I want to get a dataset from you. You grant access to it and return a token that says we are going to be able to share data with each other over that protocol.”

Data owners retain “full control” over the data, including what they share and who has access to which parts of it, Minnick says. The data can be anything: structured and semi-structured data such as CSV and JSON files, or unstructured data such as raw text and JPEGs. Once the data lands in a partner’s environment, it is up to that partner to manage and protect it.
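For readers who want a more concrete picture of that request-and-token flow, the open-source project ships a Python connector for recipients. The sketch below is illustrative only, not code from the announcement: the profile file, share, schema, and table names are hypothetical, and the calls shown are those documented by the open-source delta-sharing connector.

```python
# Illustrative sketch using the open-source delta-sharing Python connector
# (pip install delta-sharing). The profile file and table names are
# hypothetical placeholders.
import delta_sharing

# The data provider sends a profile file containing the sharing server
# endpoint and the bearer token that authorizes this recipient.
profile_file = "config.share"

# Discover the tables the provider has shared with us.
client = delta_sharing.SharingClient(profile_file)
for table in client.list_all_tables():
    print(table)

# Load one shared table into a pandas DataFrame. The address format is
# "<profile-file>#<share>.<schema>.<table>".
table_url = profile_file + "#sales_share.transactions.daily_orders"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```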

Databricks has lined up a number of partners for Delta Sharing, including data providers such as S&P, Nasdaq, Pitney Bowes, and SafeGraph, as well as AWS Data Exchange. Query and BI tools such as Tableau, Microsoft Power BI, Google Cloud BigQuery, and Starburst are also queued up to support the new Delta Sharing protocol. On the governance front, Immuta, Privacera, and Collibra support the new protocol from day one.

“The vision is to be able to handle any data asset,” Minnick tells Datanami. “What we’re starting with are datasets and files. We already have on our roadmap the ability to share and govern things like machine learning models [and] things like dashboards. So this really is meant to be the solution for sharing any data asset between organizations.”

Unity Catalog

Alongside Delta Sharing, Databricks is announcing the new Unity Catalog, which it says will give users a unified view of all of their data assets, including data stored on its platform as well as data sitting in other cloud repositories they can access. Or, as Databricks puts it, it is “the world’s first multi-cloud data catalog for the lakehouse.”


Several observations drove Databricks to develop Unity Catalog, Minnick says.

“With a data lake, you can only reason at the file level to determine whether Alex can access this file,” Minnick says. “If he can, he gets everything in the file. The only way around that is to make a copy of the file that contains only the appropriate data and give him access to that. But that kind of file proliferation is one of the big reasons data lakes turn into data swamps.”

Rather than pushing access-control problems down to downstream applications, which leads to the data proliferation issues described above, Databricks decided to build that fine-grained access control into the data lake itself. Unity Catalog provides attribute-level controls on Delta Lake tables, giving administrators precise control over what users can and cannot access.
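The announcement did not spell out Unity Catalog’s exact syntax, but the general idea of pushing fine-grained control into the lake, rather than copying files, can be sketched with SQL-style grants issued from a Databricks notebook, where `spark` is the ambient SparkSession. The catalog, schema, table, and group names below are hypothetical, and the column-restricted view stands in for whatever attribute-level mechanism Unity Catalog exposes.

```python
# Hypothetical sketch of table- and column-level access control, expressed as
# SQL run from a Databricks notebook where `spark` is the ambient SparkSession.
# Catalog, schema, table, and group names are made up for illustration.

# A view exposing only the non-sensitive columns of a Delta table.
spark.sql("""
    CREATE VIEW IF NOT EXISTS main.sales.orders_public AS
    SELECT order_id, order_date, region, amount  -- no customer PII columns
    FROM main.sales.orders
""")

# Analysts may read the restricted view but not the underlying table,
# so no per-user copies of the file are needed.
spark.sql("GRANT SELECT ON VIEW main.sales.orders_public TO `analysts`")
spark.sql("REVOKE ALL PRIVILEGES ON TABLE main.sales.orders FROM `analysts`")
```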

“But it also solves some other problems,” Minnick adds. “One is that about 80% of our customers today use multiple clouds. That means they have multiple data lakes, so with Unity Catalog you can see all the datasets across all of those data lakes in one view, and you can bring those datasets together in new ways, recognizing that some of the data may sit in one cloud and some of the data may sit somewhere else, but it can be represented as a single dataset.”
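As a rough illustration of that single view, Unity Catalog addresses every table through a catalog.schema.table namespace, so datasets that physically sit in different clouds can be referenced, and joined, side by side. This is a sketch under assumptions: the catalog, table, and column names are hypothetical, and whether two catalogs actually map to two different clouds depends on how an organization sets things up.

```python
# Hypothetical sketch: two catalogs that happen to be backed by data lakes in
# different clouds are addressed through the same three-level namespace and
# joined as if they were one dataset. All names are made up for illustration.
orders = spark.table("aws_lake.sales.orders")         # data lake on one cloud
customers = spark.table("azure_lake.crm.customers")   # data lake on another

combined = orders.join(customers, on="customer_id", how="inner")
combined.select("customer_id", "order_id", "amount").show(5)
```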

Unity Catalog supports Delta Lake and Delta Sharing, both of which are open source, but Unity Catalog itself is not open source. It is part of the Databricks platform that customers pay for.

Users of data catalogs such as Collibra and Alation need not worry: Minnick says Unity Catalog will integrate with them.

Delta Live Tables

The final piece of today’s announcements is Delta Live Tables, which is intended to address the fragility of the extract, transform, and load (ETL) pipelines that big data customers have come to love (or loathe).

ETL pipelines are critical to downstream analytics and machine learning applications, but they are often hidden from users, according to Minnick. Delta Live Tables raises the visibility of those pipelines to ensure data is moving the way it should.

“The process of building data pipelines today is incredibly manual and extremely fragile,” he says. “When a data pipeline breaks, it’s annoying: the information downstream doesn’t get updated. But what’s worse is that the data flowing through the pipeline is so difficult to test that bad data is constantly allowed to flow through it.”

Databricks’ solution to that problem is Delta Live Tables (DLT). According to the company, DLT provides a declarative development environment that simplifies building data pipelines into Delta Lake, cutting engineering effort, improving data quality, and taking on most of the work of running pipelines in production at cloud scale.

“[We said] let’s change the experience of building data pipelines from the very manual, imperative process it is today to something much more declarative, and let’s focus on what these ETL transformations and pipelines are actually supposed to achieve,” Minnick says.

By using SQL, he says, users can tell DLT what a data transformation should be. “You can use just SQL to declare what these pipelines should do and what good data looks like. Then what DLT does is understand the graph of data behind those transformations and walk that [graph] to understand what all the dependencies are. So if something changes upstream, it automatically gets carried through downstream as well, so nothing breaks in terms of dependencies.”
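To make the declarative idea concrete, here is a minimal sketch using DLT’s Python API (DLT also accepts plain SQL, as Minnick notes). This is not code from the announcement: the table names, source path, and columns are hypothetical, and the code only runs inside a DLT pipeline on Databricks, where `spark` and the `dlt` module are provided. The point is that the dependency graph is inferred from the `dlt.read()` references rather than from hand-written orchestration.

```python
# Minimal sketch of a declarative DLT pipeline using the Python API.
# Table names, the source path, and column names are hypothetical placeholders.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders loaded from cloud storage.")
def raw_orders():
    return spark.read.format("json").load("/mnt/landing/orders/")

@dlt.table(comment="Cleaned orders; DLT infers the dependency on raw_orders.")
def clean_orders():
    return (
        dlt.read("raw_orders")
        .where(F.col("order_id").isNotNull())
        .withColumn("order_date", F.to_date("order_ts"))
    )
```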

DLT gives data engineers greater control over their data pipelines, Minnick says. If a DLT pipeline receives unexpected data, it can warn engineers of the error or shut the pipeline down automatically before the bad data lands in Delta Lake. Because most errors are minor and transient, DLT can also restart the pipeline automatically and scale the underlying server resources up or down as needed.
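The warn-or-shut-down behavior Minnick describes maps onto what DLT calls expectations. The sketch below continues the hypothetical pipeline above and shows the three policies the Python API exposes: log the violation and keep the row, drop the offending rows, or fail the update before bad data reaches Delta Lake.

```python
# Hypothetical sketch of DLT expectations applied to the clean_orders table
# defined in the previous sketch; names are placeholders.
import dlt

@dlt.table(comment="Orders validated before landing in Delta Lake.")
@dlt.expect("has_region", "region IS NOT NULL")                # warn: log metric, keep row
@dlt.expect_or_drop("positive_amount", "amount > 0")           # drop violating rows
@dlt.expect_or_fail("valid_order_id", "order_id IS NOT NULL")  # stop the pipeline
def validated_orders():
    return dlt.read("clean_orders")
```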

When a pipeline is restarted, DLT automatically backfills it so the job picks up exactly where it left off. Testing a DLT pipeline is also straightforward, Minnick says, because the same code an engineer writes on a laptop can run on the server.

“This is where we have gotten a huge amount of feedback from customers in early use,” Minnick says. “The process of building reliable pipelines is faster and easier, the amount of rework that happens downstream is reduced, and it’s very simple to manage.”

The Data + AI Summit (formerly Spark + AI Summit) will continue tomorrow.

Related Items:

Will Databricks Build the First Enterprise AI Platform?

Databricks on Google Cloud

Databricks approaches IPO in $1 billion round
