More than Meets the Eye: Three Months in at Databricks

Steven Muschler
5 min read · Apr 23, 2024


It is hard to believe that I have already been at Databricks for three months. Prior to joining Databricks, I worked in consulting, where in 2019/2020 I implemented large-scale ETL pipelines on Databricks that processed trillions of records. Spark on Databricks was an incredibly powerful combination for ETL, but until recently, that is really all that I knew Databricks for. During the course of interviewing and ultimately working at Databricks, I have been blown away by its capabilities in other areas, such as data warehousing with DBSQL, governance with Unity Catalog, and a whole plethora of AI/ML capabilities. Databricks is not just an ETL tool, but rather a comprehensive platform for an organization’s data and AI needs. I’ll dive into a few features that I’m particularly wowed by; there are far more than I have room for here, so make sure to check out the Databricks website and blog for a more comprehensive list.

Data Warehousing

One of the most impressive features is Databricks SQL (DBSQL), Databricks’ serverless data warehousing solution. The ability to open a browser-based SQL editor and execute SQL against any of the datasets you have access to, without having to provision or configure any compute, is incredibly powerful.

Databricks SQL Query Editor

The SQL Editor within Databricks is simple and easy to use. In many cases, the default compute will be a serverless endpoint that does not require any additional setup. Simply write a query in the window, execute it, and watch the results come back.
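Running a query against a real DBSQL warehouse requires a workspace, but programmatic access follows the standard Python DB-API pattern (the `databricks-sql-connector` package is PEP 249 compliant). As a purely local stand-in for that write-execute-fetch loop, here is the same workflow against an in-memory SQLite database; the table and data are made up for illustration:

```python
import sqlite3

# Local stand-in for a warehouse connection. With Databricks you would
# instead call databricks.sql.connect(server_hostname=..., http_path=...,
# access_token=...) and get back the same DB-API cursor interface.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sales table, purely for illustration.
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# The write-a-query-then-fetch loop that the SQL Editor wraps in a UI.
cur.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
)
rows = cur.fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```

The shape of the code is the same whether the cursor points at SQLite or a serverless SQL warehouse; only the `connect` call changes.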

Leverage the Databricks Assistant for quick error resolution.

Should the query have any errors in it, as mine frequently do, let the Databricks Assistant help resolve them. Simply click on the “Diagnose error” button and watch it provide suggestions on how to fix the code.

BI Tools available through Partner Connect

Beyond allowing users to execute analytics queries using a serverless data warehouse, DBSQL is also capable of serving data to external BI applications, such as Power BI and Tableau. Integrations are managed through Databricks Partner Connect and are quite easy to set up and configure.

Unified Governance

Unity Catalog is Databricks’ governance solution for any assets (e.g., Delta tables, volumes) within Databricks. This frees administrators from managing entitlements and access across disparate external governance solutions. From an end user perspective, such as that of a data engineer or data analyst, it provides a host of quality-of-life enhancements beyond being where access to a data asset is granted or revoked. In particular, assets that you have access to now surface a variety of useful associated information, such as lineage.

Lineage Tab in Unity Catalog

The Lineage Tab enables users to understand which assets interact with the dataset in question. This is super useful because it identifies dependencies and interactions across heterogeneous asset types. For example, I can easily see which Notebooks or Queries interact with a dataset, which makes it easier to trace root causes and to identify dependent assets that may be impacted should I make a change to the dataset.

Data Lineage for Lakehouse Monitoring Enabled Table

Also on the Lineage Tab is the Lineage Graph, which provides a graphical representation of how the dataset fits within its upstream and downstream dependencies.
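The impact analysis described above boils down to walking a dependency graph: starting from the asset you plan to change, follow every downstream edge. A toy sketch of that traversal (all asset names here are invented, not real Unity Catalog objects):

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets that read directly from it.
downstream = {
    "catalog.sales.raw_orders": ["catalog.sales.clean_orders"],
    "catalog.sales.clean_orders": ["dashboard.revenue", "notebook.churn_model"],
    "dashboard.revenue": [],
    "notebook.churn_model": [],
}

def impacted_assets(root: str) -> set:
    """Return every asset that transitively depends on `root` (BFS)."""
    seen, queue = set(), deque(downstream.get(root, []))
    while queue:
        asset = queue.popleft()
        if asset not in seen:
            seen.add(asset)
            queue.extend(downstream.get(asset, []))
    return seen

# Changing raw_orders ripples through the clean table, a dashboard, and a notebook.
print(sorted(impacted_assets("catalog.sales.raw_orders")))
# ['catalog.sales.clean_orders', 'dashboard.revenue', 'notebook.churn_model']
```

Unity Catalog maintains this graph for you automatically; the point of the sketch is just to show why having it in one place beats reconstructing dependencies by hand.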

AI/ML

The AI/ML space is vast and continually growing, especially with the increased emphasis on generative AI over the past year or more. Databricks has a wide array of tooling for all AI/ML needs, spanning both traditional ML and generative AI. From model serving, to fine-tuning, to AutoML, to vector search, and so on, Databricks offers all of these capabilities within a single cohesive platform.

The AI Playground is a great tool for testing out large language models (LLMs) and comparing them against one another. Databricks just released its own LLM, DBRX, which is excellent, but new and better models are being created every day. Having a simple way to evaluate and test them matters, as whatever works best today may well be surpassed by better models in the coming months or years.

Easily create a Vector Search Index from a Delta Table

Two years ago, I would have never thought that vector databases would become one of the hottest technologies in tech (guess that’s why I stick to index funds over individual stocks when investing), but here we are. They have become incredibly important due to their use within retrieval-augmented generation (RAG) applications, one of the lowest-hanging fruits in the generative AI space. Need one? Just select a Delta table with the data you want to put into a vector index and then complete a simple form. That’s it.
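Under the hood, the question a vector index answers is “which stored embeddings are closest to this query embedding?” A minimal from-scratch sketch of that nearest-neighbor lookup follows, with tiny made-up vectors; a production system like Databricks Vector Search uses high-dimensional embeddings and approximate-nearest-neighbor algorithms to make this fast at scale:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "index": doc id -> embedding (real embeddings have hundreds of dims,
# and the doc ids here are invented for illustration).
index = {
    "doc_spark": [0.9, 0.1, 0.0],
    "doc_sql": [0.2, 0.9, 0.1],
    "doc_ml": [0.1, 0.2, 0.95],
}

def search(query, k=2):
    """Brute-force top-k nearest neighbors by cosine similarity."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

# A query vector pointing roughly the same way as doc_spark ranks it first.
print(search([0.85, 0.15, 0.05]))  # ['doc_spark', 'doc_sql']
```

In a RAG application, the top-ranked documents are what get stuffed into the LLM prompt as context; the Delta-table-plus-form workflow in the UI builds and syncs this index for you.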

Conclusion

These are just a few of the capabilities that surprised me as I interviewed at Databricks and ultimately started working there in January 2024. Despite my initial impression of Databricks as an excellent data engineering tool back in 2019/2020, I’ve come to learn that it is so much more. It truly can serve all of an organization’s data and AI needs.

Keep an eye out for my next post in the coming weeks, in which I’ll revisit the project I did on Databricks in 2019/2020 and explore how I would architect it differently given the tremendous enhancements Databricks has made since then.
