The Five Best Things: Feb 6, 2021

If data is the new oil, data pipelines are the new oil pipelines

Happy Black History Month! Back to a data science/ML-focused roundup after an all-GameStop edition last week - which somehow became one of my most shared writeups!

The Five Best Things

  1. O’Reilly: Where Programming, Ops, AI, and the Cloud are Headed in 2021

    • AI Trends

    • Finally - cloud computing trends point heavily to multicloud (using more than one cloud provider)

      Cloud computing is hybrid by nature. Think about how companies “get into the cloud.” It’s often a chaotic grassroots process rather than a carefully planned strategy. An engineer can’t get the resources for some project, so they create an AWS account, billed to the company credit card. Then someone in another group runs into the same problem, but goes with Azure. Next there’s an acquisition, and the new company has built its infrastructure on Google Cloud. And there’s petabytes of data on-premises, and that data is subject to regulatory requirements that make it difficult to move. The result? Companies have hybrid clouds long before anyone at the C-level perceives the need for a coherent cloud strategy. By the time the C suite is building a master plan, there are already mission-critical apps in marketing, sales, and product development. And the one way to fail is to dictate that “we’ve decided to unify on cloud X.”

  2. Blog: What is a feature store?

    • An increasingly common trend in the industry is “embedding” a data scientist within a team, vs. staffing a separate data science team. Tools such as feature stores, which provide a “single pane of glass”, are critical in decentralized operations.

  3. Barr Moses: Incident Prevention for Data Teams: Introducing the 5 Pillars of Data Observability

    • Another trend in the MLOps space is Data Observability. Monte Carlo is a data observability company whose CEO, Barr Moses, describes the idea as follows:

      A Data Observability layer literally “observes” data assets from end to end, alerting data engineers and analysts when issues arise so they can be addressed before they affect the business.

    • The key pillars of observability, and the questions they help us answer -

      1. Freshness: Is my data up to date?

      2. Distribution: Are there abnormalities in my incoming data (e.g. unexpected values, null values)?

      3. Volume: Do I have fewer or more data records than expected?

      4. Schema: Did the database organization change in a catastrophic way?

      5. Lineage: Where did my pipeline break? Where did the data come from?
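      To make the pillars concrete, here is a minimal sketch of what four of them (freshness, distribution, volume, schema) could look like as checks over a batch of records. It is not Monte Carlo's product or API: the function names, the sample `rows`, and every threshold are hypothetical, chosen only for illustration. Lineage is left out, since it needs metadata about how tables are derived rather than a look at a single batch.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical checks; each assumes a non-empty batch of dict records.


def check_freshness(rows, now, max_age):
    """Freshness: was the newest record loaded recently enough?"""
    newest = max(row["loaded_at"] for row in rows)
    return now - newest <= max_age


def check_distribution(rows, column, max_null_rate):
    """Distribution: is the share of null values in a column acceptable?"""
    nulls = sum(1 for row in rows if row[column] is None)
    return nulls / len(rows) <= max_null_rate


def check_volume(rows, expected_rows, tolerance):
    """Volume: is the row count within a tolerance of the expected count?"""
    return abs(len(rows) - expected_rows) <= tolerance * expected_rows


def check_schema(rows, expected_columns):
    """Schema: does every record have exactly the expected columns?"""
    return all(set(row) == set(expected_columns) for row in rows)


def run_checks(rows, now):
    # All thresholds here are illustrative assumptions, not recommendations.
    return {
        "freshness": check_freshness(rows, now, max_age=timedelta(hours=6)),
        "distribution": check_distribution(rows, "amount", max_null_rate=0.1),
        "volume": check_volume(rows, expected_rows=3, tolerance=0.5),
        "schema": check_schema(rows, ["order_id", "amount", "loaded_at"]),
    }


# Made-up sample batch: one record has a null amount.
now = datetime(2021, 2, 6, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "amount": 20.0, "loaded_at": now - timedelta(hours=1)},
    {"order_id": 2, "amount": None, "loaded_at": now - timedelta(hours=2)},
    {"order_id": 3, "amount": 35.5, "loaded_at": now - timedelta(hours=3)},
]

report = run_checks(rows, now)
for pillar, passed in report.items():
    print(f"{pillar}: {'ok' if passed else 'ALERT'}")
```

      With this sample batch, only the distribution check raises an alert, because one of the three amounts is null and the assumed limit is 10%. A real observability layer would presumably learn such thresholds from history rather than hard-code them.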

  4. Jamin Ball: The Modern Data Cloud: Warehouse vs Lakehouse

    • Jamin Ball, a data platforms-focused VC at Altimeter Capital, presents an emerging trend in big data management and transformation: the data Lakehouse. This is in contrast to the data Warehouse approach of Snowflake.

      • A data lake stores ALL of the raw data of an organization.

      • A data warehouse stores data that has been Extracted from the data lake, Loaded into the warehouse, and then processed and Transformed into a form that can be readily analyzed. This process is called ELT, or Extract-Load-Transform.
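      The ELT flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Snowflake or Databricks implement it: an in-memory list stands in for the data lake, an in-memory SQLite database stands in for the warehouse, and the table and column names (`raw_orders`, `daily_revenue`) are made up.

```python
import sqlite3

# Stand-in for the data lake: raw records kept as-is (amounts are strings).
DATA_LAKE = [
    ("2021-02-01", "alice", "12.50"),
    ("2021-02-01", "bob", "7.25"),
    ("2021-02-02", "alice", "3.00"),
]


def extract():
    """Extract: read the raw records out of the (pretend) data lake."""
    return list(DATA_LAKE)


def load(warehouse, records):
    """Load: copy the raw records into the warehouse without reshaping them."""
    warehouse.execute(
        "CREATE TABLE raw_orders (order_date TEXT, customer TEXT, amount TEXT)"
    )
    warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", records)


def transform(warehouse):
    """Transform: build an analysis-ready table with SQL, inside the warehouse."""
    warehouse.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT order_date, SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_orders
        GROUP BY order_date
        """
    )


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(warehouse, extract())
transform(warehouse)

daily_revenue = warehouse.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

      The point of the ordering is that the raw table lands in the warehouse untouched, and the cleanup (casting amounts, aggregating by day) happens afterwards in SQL, so the raw data remains available for other transformations.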


    • Snowflake plays in the Business Intelligence (BI) and analysis space, where users rely on SQL (Structured Query Language), while Databricks (which recently raised $1B at a $28B valuation) plays in the ML realm, where users work with Spark / DataFrames.

    • Jamin predicts that Snowflake will start making lateral moves into the data science/ML space. Databricks, on the other hand, is moving to vertically integrate the warehouse and data lake through ‘Delta Lake’, a new open source standard for building data lakes. This preserves both the SQL and Spark / DataFrame access methods.

  5. Oldie but Goodie

Honorable Mentions

Some interesting pieces for Black History Month -

Other pieces -

Disclaimer: The views and opinions expressed in this post are my own and do not represent my employer.