I Built a Data Warehouse Data Model. Here’s What Actually Happened.

I’m Kayla. I plan data. I ship dashboards. I also break stuff and fix it fast. Last winter, I rebuilt our data warehouse model for our growth team and finance folks. (For the blow-by-blow on that rebuild, here’s the full story.) I thought it would take two weeks. It took six. You know what? I’d do it again.

I used Snowflake for compute, dbt for transforms, Fivetran for loads, and Looker for BI. My model was a simple star. Mostly. I also kept a few history tables like Data Vault hubs and satellites for the messy parts. If you're still comparing star, snowflake, and vault patterns, my notes on trying multiple data warehouse models might help. That mix kept both speed and truth, which sounds cute until refunds hit on a holiday sale. Then you need it. Still sorting out the nuances between star and snowflake designs? The detailed breakdown in this star-vs-snowflake primer lays out the pros and cons.

Let me explain what worked, what hurt, and the real stuff in the middle.

What I Picked and Why

  • Warehouse: Snowflake (medium warehouse most days; small at night)
  • Transforms: dbt (tests saved my butt more than once)
  • Loads: Fivetran for Shopify, Stripe, and Postgres
  • BI: Looker (semantic layer helped keep one version of “revenue”)

I built a star schema with one big fact per process: orders, sessions, and ledger lines. I used small dimension tables for people, products, dates, and devices. When fields changed over time (like a customer’s region), I used SCD Type 2. Fancy name, simple idea: keep history.

Real Example: Sales

I started with sales. Not because it’s easy, but because it’s loud.

  • Fact table: fact_orders

    • Grain: one row per order line
    • Keys: order_line_id (surrogate), order_id, customer_key, product_key, date_key
    • Measures: revenue_amount, tax_amount, discount_amount, cost_amount
    • Flags: is_refund, is_first_order, is_subscription
  • Dim tables:

    • dim_customer (SCD2): customer_key, customer_id, region, first_seen_at, last_seen_at, is_current
    • dim_product: product_key, product_id, category, sku
    • dim_date: date_key, day, week, month, quarter, year
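For concreteness, here’s a rough dbt-style sketch of that grain and those keys. The model and macro names are illustrative (and assume customer_key and product_key were resolved upstream), not the exact production code.

```sql
-- models/marts/fact_orders.sql (illustrative sketch, not the production model)
with order_lines as (
    -- assumes customer_key and product_key were resolved in staging/intermediate models
    select * from {{ ref('stg_order_lines') }}
)

select
    -- surrogate key at the stated grain: one row per order line
    {{ dbt_utils.generate_surrogate_key(['order_id', 'line_number']) }} as order_line_id,
    order_id,
    customer_key,
    product_key,
    to_char(ordered_at::date, 'YYYYMMDD')::int as date_key,
    revenue_amount,
    tax_amount,
    discount_amount,
    cost_amount,
    is_refund,
    is_first_order,
    is_subscription
from order_lines
```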

What it did for us:

  • A weekly revenue by region query dropped from 46 seconds to 3.2 seconds.
  • The finance team matched Stripe gross to within $82 on a $2.3M month. That was a good day.
  • We fixed “new vs repeat” by using customer_key + first_order_date. No more moving targets.

If you’re working in a larger org, here’s a candid look at what actually works for an enterprise data warehouse based on hands-on tests.

Pain I hit:

  • Late refunds. They landed two weeks after the sale and split across line items. I added a refunds table and a model that flips is_refund and reverses revenue_amount. Clean? Yes. Fun? No.
  • Tax rules. We sell in Canada and the US. I added a dim_tax_region map, then cached it. That killed a join that had been costing us about 15% more in credits than it needed to.
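Here’s roughly what that refund flip looks like as a dbt-style model. The refunds source, the allocation, and every name below are illustrative, not the exact code we run.

```sql
-- models/intermediate/int_order_lines_with_refunds.sql (illustrative)
with lines as (
    select * from {{ ref('fact_orders') }}
),

refunds as (
    -- one row per refunded order line; the split across line items happens upstream
    select order_line_id, refund_amount
    from {{ ref('stg_refund_lines') }}
)

select
    l.order_line_id,
    l.order_id,
    (r.order_line_id is not null) as is_refund,
    -- reverse revenue when a refund lands, even two weeks after the sale
    case
        when r.order_line_id is not null then l.revenue_amount - r.refund_amount
        else l.revenue_amount
    end as revenue_amount
from lines l
left join refunds r using (order_line_id)
```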

Real Example: Web Events

Marketing wanted “the full journey.” I built two facts.

  • fact_events: one row per raw event (page_view, add_to_cart, purchase)

    • device_key, customer_key (nullable), event_ts, event_name, url_path
  • fact_sessions: one row per session

    • session_id, customer_key, device_key, session_start, session_end, source, medium, campaign


I stitched sessions by sorting each device’s events and cutting a new session after any 30-minute gap. Simple rule, tight code. When a user logged in mid-session, I backfilled customer_key for that session. Small touch, big win.
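The stitching rule is easy to sketch in plain SQL: sort each device’s events and start a new session after any 30-minute gap. Names below are illustrative.

```sql
-- illustrative sessionization over fact_events-style rows
with events as (
    select device_key, customer_key, event_ts
    from analytics.fact_events
),

flagged as (
    select
        *,
        -- 1 when this event starts a new session (first event, or a 30+ minute gap)
        case
            when lag(event_ts) over (partition by device_key order by event_ts) is null
              or datediff(
                   'minute',
                   lag(event_ts) over (partition by device_key order by event_ts),
                   event_ts
                 ) > 30
            then 1 else 0
        end as is_new_session
    from events
)

select
    *,
    -- running count of session starts = session number per device;
    -- a later step backfills customer_key with max(customer_key) per session
    sum(is_new_session) over (
        partition by device_key
        order by event_ts
        rows between unbounded preceding and current row
    ) as session_number
from flagged
```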

What it gave us:

  • “Ad spend to checkout in 24 hours” worked with one Looker explore.
  • We saw mobile sessions run about 20% longer on weekends. So we moved push alerts to Sunday late morning. CTR went up 11%. Not magic. Just good timing.

What bit me:

  • Bots. I had to filter junk with a dim_device blocklist and a rule for 200+ page views in 5 minutes. Wild, but it cut fake traffic by a lot.
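The volume rule is just a grouped count. A minimal sketch, using fixed 5-minute buckets as an approximation of the sliding window (names and thresholds are mine):

```sql
-- illustrative: devices that rack up 200+ page views inside a 5-minute bucket
select
    device_key,
    time_slice(event_ts, 5, 'MINUTE') as bucket_start,
    count(*) as page_views
from analytics.fact_events
where event_name = 'page_view'
group by 1, 2
having count(*) >= 200
```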

Real Example: The Ledger

Finance is picky. They should be.

  • fact_gl_lines: one row per journal line from NetSuite
    • journal_id, line_number, account_key, cost_center_key, amount, currency, posted_at
  • dim_account, dim_cost_center (SCD2)

We mapped Shopify refunds to GL accounts with a mapping table. I kept it in seeds in dbt so changes were versioned. Monthly close went from 2.5 days to 1.5 days because the trial balance matched on the first run. Not perfect, but close.
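The mapping itself is just a versioned seed joined into a model. A minimal sketch with hypothetical seed and column names:

```sql
-- models/finance/int_refunds_to_gl.sql (illustrative)
with refunds as (
    select * from {{ ref('stg_shopify_refunds') }}
),

gl_map as (
    -- seeds/refund_gl_mapping.csv, versioned in git: refund_reason, gl_account
    select * from {{ ref('refund_gl_mapping') }}
)

select
    r.refund_id,
    r.refund_reason,
    m.gl_account,
    r.refund_amount_cents   -- money in cents, as integers
from refunds r
left join gl_map m using (refund_reason)
```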

What Rocked

  • The star model was easy to teach. New analysts shipped a revenue chart on day two.
  • dbt tests caught null customer_keys after a Fivetran sync hiccup. Red light, quick fix, no blame.
  • Looker’s measures and views kept revenue defined one way. No more four dashboards, five numbers.

What Hurt (And How I Fixed It)

  • Too many tiny dims. I “snowflaked” early. Then I merged dim_category back into dim_product. Fewer joins, faster queries.
  • SCD2 bloat. Customer history grew fast. I added monthly snapshots and kept only current rows in the main dim. History moved to a history view.
  • Time zones. We sell cross-border. I locked all facts to UTC and rolled out a dim_date_local per region for reports. Set it once, breathe easy.
  • Surrogate keys vs natural keys. I kept natural ids for sanity, but used surrogate keys for joins. That mix saved me on backfills.
  • I’ve also weighed when an ODS beats a warehouse; see my field notes on ODS vs Data Warehouse for the trade-offs.

Cost and Speed (Real Numbers)

  • Snowflake credits: average 620/month; spikes on backfills to ~900
  • 95th percentile query time on main explores: 2.1 seconds
  • dbt Cloud runtime: daily full run ~38 minutes; hourly incremental jobs 2–6 minutes
  • BigQuery test run (I tried it for a week): similar speed, cheaper for ad-hoc, pricier for our chatty BI. We stayed with Snowflake. Running Snowflake in a heavily regulated environment turned up similar pros and cons in this hospital case study.

A Few Rules I’d Tattoo on My Laptop

  • Name the grain. Put it in the table description. Repeat it.
  • Write the refund story before you write revenue.
  • Keep a date spine. It makes time math easy and clean (quick sketch after this list).
  • Store money in cents. Integers calm nerves.
  • Add is_current, valid_from, valid_to to SCD tables. Future you says thanks.
  • Document three sample queries per fact. Real ones your team runs.
  • Keep a small “business glossary” table. One row per metric, with a plain note.
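That date spine rule, for example, is nearly a one-liner with dbt_utils; the date range below is just an example.

```sql
-- models/utils/date_spine.sql (sketch; add week, month, and holiday columns on top)
{{
    dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2018-01-01' as date)",
        end_date="cast('2030-01-01' as date)"
    )
}}
```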

If you need hands-on examples of these rules in action, BaseNow curates open data-model templates you can copy and tweak.

Who This Fits (And Who It Doesn’t)

  • Good fit: small to mid teams (2–20 data folks) who need trust and speed.
  • Not great: pure product analytics with huge event volume and simple questions. A wide table in BigQuery might be enough there. Before you pick, here’s my hands-on take on data lakes vs data warehouses.
  • Heavy compliance or wild source changes? Add a Data Vault layer first, then serve a star to BI. It slows day one, but saves month six. For a head-to-head look at when a vault layer or a plain star serves you better, see my comparison of warehouse models.

My Go-To Rules for a Data Warehouse (Told From My Own Messy Wins)

Hi, I’m Kayla. I’ve run data at a scrappy startup and at a big retail brand. I’ve used Snowflake, BigQuery, and Redshift. I break things. I fix them. And I take notes.

If you’d like the longer, unfiltered story behind these messy wins, check out my full write-up on BaseNow: My Go-To Rules for a Data Warehouse (Told From My Own Messy Wins).

This is my review of what actually works for me when I build and care for a data warehouse. It’s not a lecture. It’s field notes with real bumps, a few “oh no” moments, and some sweet saves. If you're after a more structured checklist of warehouse best practices geared toward growing teams, the Integrate.io crew has a solid rundown right here.

Start Simple, Name Things Well

I used to get cute with names. Bad idea. Now I keep it boring:

  • stg_ for lightly cleaned source data (stg_orders)
  • int_ for mid steps (int_orders_with_discounts)
  • dim_ for lookup tables (dim_customer)
  • fact_ for event tables (fact_orders)

Want to see how the very first data model I ever built shook out—warts and all? Here’s the play-by-play: I Built a Data Warehouse Data Model—Here’s What Actually Happened.

In 2022, a Looker explore kept breaking because one teammate named a table orders_final_final. Funny name, sure. But it cost us two hours on a Monday. After we switched to the simple tags above, QA got faster. Fewer “where did that table go?” moments, too. For a formal deep dive on why disciplined, consistent prefixes matter (and what good ones look like), this naming-convention cheat sheet is gold: Data Warehouse Naming Conventions.

Star Shape Wins Most Days

When in doubt, I lean on a star model. One big fact. Little helpful dims. It’s not fancy. It just works.

Real example from our e-com shop:

  • fact_orders with one row per order
  • dim_customer with Type 2 history (so we keep old addresses and names)
  • dim_product (SKU, brand, category)
  • dim_date (day, week, month)

Curious how this stacks up against other modeling patterns I’ve tried? Here’s my candid comparison: I Tried Different Data Warehouse Models—Here’s My Take.

Before this, we had a chain of joins that looked like spaghetti. Dashboards took 40+ seconds. After the star setup, the main sales board ran in 5–8 seconds. Not magic. Just less chaos.

Partitions and Clusters Save Real Money

I learned this the hard way on BigQuery. A summer intern ran a full-table scan on a year of web logs. Boom—30 TB read. The cost alert hit my phone while I was in line for tacos.

We fixed it:

  • Partition by event_date
  • Cluster by user_id and path
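In BigQuery DDL terms the fix looks roughly like this; table names are ours, simplified, and event_date is assumed to be a DATE column.

```sql
-- prune scans by day, then cluster within each partition
create table analytics.web_events
partition by event_date
cluster by user_id, path
as
select * from raw.web_events;
```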

Next run: 200 GB. That’s still big, but not scary. Same idea in Snowflake? I lean on its automatic micro-partitioning and add a date-based clustering key where it helps. On Redshift, I set sort keys and run VACUUM on a schedule. Not cute, but it keeps things fast.

Fresh Beats Perfect (But Test Everything)

I run ELT on a schedule and sort by impact. Finance needs early numbers? Hourly. Product? Maybe every 3 hours. Ad spend? Near real time during promos.

The trick: tests. I use dbt for this:

  • unique and not_null on primary keys
  • relationships (orders.customer_id must exist in dim_customer)
  • freshness checks
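The first two usually live as schema tests in YAML, but the same checks can be written as a dbt singular test in plain SQL, which is handy for odd cases. A rough sketch (model names assumed):

```sql
-- tests/assert_orders_have_known_customers.sql (illustrative singular test)
-- dbt marks the test as failed if this query returns any rows
select o.order_id
from {{ ref('fact_orders') }} o
left join {{ ref('dim_customer') }} c
  on o.customer_key = c.customer_key
where o.customer_key is null   -- not_null check
   or c.customer_key is null   -- relationships check
```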

For the full breakdown of the testing playbook that’s saved my bacon more than once, see: I Tried a Data Warehouse Testing Strategy—Here’s What Actually Worked.

These tests saved me during Black Friday prep. A dbt test flagged a 13% drop in orders in staging. It wasn’t us. It was a checkout bug. We caught it before the rush. The team brought me donuts that week. Nice touch.

CDC When You Need It, Not When You Don’t

We used Debezium + Kafka for Postgres change data. Then Snowpipe pushed it into Snowflake. It felt heavy at first. But support chats and refunds need near real time. So yes, it earned its keep for that stream.
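The Snowpipe piece is a small amount of SQL once the stage exists. Stage, table, and format names below are placeholders, not our real objects.

```sql
-- continuously load CDC files landed from Kafka into a raw table
create pipe raw.orders_cdc_pipe
  auto_ingest = true
as
  copy into raw.orders_cdc
  from @raw.orders_cdc_stage
  file_format = (type = 'json');
```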

Not sure when to keep data in an operational data store versus a warehouse? Here’s my field guide: ODS vs. Data Warehouse—How I’ve Used Both and When Each One Shines.

But for Salesforce? We used Fivetran. I tried to build it myself once. Look, it worked, but it took way too much time to keep it alive. I’d rather write models than babysit API limits all day.

Clear SLAs Beat Vague Wishes

“Real time” sounds great. It also costs real money. I set simple rules with teams:

  • Finance: hourly until noon; daily after
  • Marketing: 15 minutes during promos; 1 hour off-peak
  • Product: daily is fine; hourly on launch days

We put these in Notion. People stopped asking “Is the data fresh?” They could check the SLA. And yes, we allow “break glass” runs. But we also measure the blast radius.

If you’re wrangling an enterprise-scale warehouse and want to know what actually works in the real world, you’ll like this deep dive: What Actually Works for an Enterprise Data Warehouse—My Hands-On Review.

Row-Level Security Keeps Me Sane

One time, an intern ran a query and saw salary data. I felt sick. We fixed it by Monday:

  • Use roles for each group (sales_analyst, finance_analyst)
  • Use row-level filters (BigQuery authorized views; Snowflake row access policies)
  • Keep write access tight. Like, tight tight.
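On the Snowflake side, the row-level piece is a row access policy driven by a small entitlements table. A minimal sketch with made-up names:

```sql
-- map roles to the regions they are allowed to see
create row access policy sales_region_policy
  as (region varchar) returns boolean ->
  exists (
      select 1
      from security.region_entitlements e
      where e.role_name = current_role()
        and e.region = region
  );

alter table analytics.fact_orders
  add row access policy sales_region_policy on (region);
```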

Now sales sees only their region. Finance sees totals. HR sees sensitive stuff. And I sleep better.

If you’re wondering how tight testing and alerting can translate into better sleep, here’s a story you’ll appreciate: I Test Data Warehouses—Here’s What Actually Helped Me Sleep.

Docs and Lineage: The Boring Hero

I keep docs in two places:

  • dbt docs for models and lineage
  • One-page team guide in Notion: table naming, keys, and joins you should never do

For an even deeper bench of templates and examples, the free resources at BaseNow have become one of my secret weapons for leveling up documentation without adding extra toil.

When someone new asks, “Where do I find churn by cohort?” I show the doc. If I get the same question three times, I write a tiny guide with a screenshot. It takes 10 minutes. It saves hours later.

Cost Controls That Help (Not Hurt)

Real things that worked:

  • Snowflake: auto-suspend after 5 minutes; auto-resume on query
  • BigQuery: per-project quotas and cost alerts at 50%, 80%, 100%
  • Redshift: right-size nodes; use concurrency scaling only for peaks
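The Snowflake items boil down to a couple of statements; the warehouse name and quota below are examples, not our real settings.

```sql
-- stop paying for idle compute
alter warehouse reporting_wh set
  auto_suspend = 300   -- seconds of idle time before suspending
  auto_resume = true;

-- warn early, then cut off runaway spend
create resource monitor monthly_credits with
  credit_quota = 900
  triggers
    on 80 percent do notify
    on 100 percent do suspend;

alter warehouse reporting_wh set resource_monitor = monthly_credits;
```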

We saved 18% in one quarter by shutting down a weekend dev warehouse that no one used. It wasn’t clever. It was just a toggle we forgot to flip.

Backups, Clones, and “Oops” Moments

I once dropped a table used by a morning dashboard. My phone blew up. We fixed it fast with a Snowflake zero-copy clone. We also used time travel to pull data from 3 hours earlier. Five minutes of panic. Then calm. After that, I set daily clone jobs and tested them. Not just on paper—tested for real.
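The recovery moves are plain SQL; object names here are placeholders.

```sql
-- bring back the table I dropped
undrop table analytics.fact_orders;

-- or clone the schema as it looked three hours ago (time travel)
create schema analytics_restore clone analytics
  at (offset => -10800);   -- offset in seconds

-- zero-copy clone of prod into a safe test area
create database analytics_dev clone analytics_prod;
```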

If you’re curious how Snowflake holds up in a high-stakes environment, here’s what happened when I ran a hospital’s warehouse on it: I Ran Our Hospital’s Data Warehouse on Snowflake—Here’s How It Really Went.

Analysts Need Joy, Too

I add small things that reduce pain:

  • Surrogate keys as integers (less join pain)
  • A shared dim_date with holidays and

I modernized our data warehouse. Here’s what actually happened.

I’m Kayla, and yes, I did this myself. I took our old warehouse and moved it to a new stack. It took months. It also saved my nights and my nerves.

If you’re after the full blow-by-blow, here’s the longer story of how I actually modernized our data warehouse from the ground up.

You know what pushed me? A 2 a.m. page. Our old SQL Server job died again. Finance needed numbers by 8. I stared at a red SSIS task for an hour. I said, “We can’t keep doing this.” So we changed. A lot.

Where I started (and why I was tired)

We had:

  • SQL Server on a loud rack in the closet
  • SSIS jobs that ran at night
  • CSV files on an old FTP box
  • Tableau on top, with angry filters

Loads took 6 to 8 hours. A bad CSV would break it all. I watched logs like a hawk. I felt like a plumber with a leaky pipe.

That messy starting point is exactly why I keep a laminated copy of my go-to rules for a data warehouse taped to my monitor today.

What I moved to (and what I actually used)

I picked tools I’ve used with my own hands:

  • Snowflake on AWS for the warehouse
  • Fivetran for connectors (Salesforce, NetSuite, Zendesk)
  • dbt for models and tests
  • Airflow for job runs
  • Looker for BI

Picking that stack sometimes felt like speed-dating: scrolling through feature profiles, testing chemistry in short bursts, and committing only when it clicked.

When the stack was in place, I sat down and built a data-warehouse data model that could grow without toppling over.

I set Snowflake to auto-suspend after 5 minutes. That one switch later saved us real money. I’ll explain.

First real win: Salesforce in 30 minutes, then, oops

Fivetran pulled Salesforce in about 30 minutes. That part felt like magic. But I hit API limits by noon. Sales yelled. So I moved the sync to the top of the hour and set “high-volume objects” to 4 times a day. No more limit errors. I learned to watch Fivetran logs like I watch coffee brew—steady is good.

Like any cautious engineer, I’d already told myself “I tried a data-warehouse testing strategy and it saved me more than once,” so the next step was obvious—tests everywhere.

dbt saved me from bad data (and my pride)

I wrote dbt tests for “not null” on state codes. Day one, the test failed. Why? Two states had blank codes in NetSuite. People were shipping orders with no state. We fixed the source. That tiny test kept a big mess out of Looker. I also built incremental models. One table dropped from 6 hours to 40 minutes. Later I used dbt snapshots for “who changed what and when” on customers. SCD Type 2, but plain words: it tracks history.
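The snapshot piece, in dbt terms, looks roughly like this; the source and column names are placeholders.

```sql
-- snapshots/customers_snapshot.sql (illustrative)
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- dbt adds dbt_valid_from / dbt_valid_to, which gives you the SCD Type 2 history
select * from {{ source('netsuite', 'customers') }}

{% endsnapshot %}
```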

I did mess up names once. I renamed a dbt model. Twelve Looker dashboards broke. I learned to use stable view names and point Looker there. New names live inside. Old names live on. Peace.

Since then, I’ve reminded every new hire that I test data warehouses so I can actually sleep at night.

Airflow: the flaky friend I still keep

Airflow ran my jobs in order. Good. But I pushed a big data frame into XCom. Bad. The task died. So I switched to writing a small file to S3 and passed the path. Simple. Stable. I also set SLAs so I got a ping if a job ran long. Not fun, but helpful.

Snowflake: fast, but watch the meter

Snowflake ran fast for us. I loved zero-copy clone. I cloned prod to a test area in seconds. I tested a risky change at 4 p.m., shipped by 5. Time Travel also saved me when I deleted a table by mistake. I rolled it back in a minute, and my heart rate went back down.

Now the part that stung: we once left a Large warehouse running all weekend. Credits burned like a bonfire. After that, I set auto-suspend to 5 minutes and picked Small by default. We turn on Medium only when a big report needs it. We also used resource monitors with alerts. The bill got sane again.

If you wonder how Snowflake fares in high-stakes environments, here’s how I ran our hospital’s data warehouse on Snowflake—spoiler: heartbeats mattered even more there.

A quick detour: Redshift, then not

Years back, I tried Redshift at another job. It worked fine for a while. But we fought with vacuum, WLM slots, and weird queue stuff when folks ran many ad hoc queries. Concurrency got tough. For this team, I picked Snowflake. If you live in AWS and love tight control, Redshift can still be fine. For us, Snowflake felt simple and fast.

I’ve also watched many teams debate the merits of ODS vs Data Warehouse like it’s a Friday-night sport. Pick what fits your latency and history needs, not the loudest opinion.

Real, everyday results

  • Finance close went from 5 days to 3. Less hair-pulling.
  • Marketing got near real-time cohorts. They ran campaigns the same day.
  • Data freshness moved from nightly to every 15 minutes for key tables.
  • Support saw a customer’s full history in one place. Fewer “let me get back to you” calls.

We shipped a simple “orders by hour” dashboard that used to be a weekly CSV. It updated every 15 minutes. Folks clapped. Not loud, but still.

Teams later asked why we landed on this design; the short answer is that I tried different data-warehouse models before betting on this one.

Governance: the part I wanted to skip but couldn’t

Roles in Snowflake confused me at first. I made a “BUSINESS_READ” role with a safe view. I masked emails and phone numbers with tags. Legal asked for 2-year retention on PII. I set a task to purge old rows. I also added row-level filters for EU data. Simple rules, less risk. Boring? Maybe. Needed? Yes.
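The masking part is a policy attached to the sensitive columns (tag-based masking works the same way, just attached to a tag). A minimal column-level sketch with made-up names:

```sql
-- show real emails only to roles that need them
create masking policy email_mask as (val string) returns string ->
  case
    when current_role() in ('FINANCE_ANALYST', 'DATA_ADMIN') then val
    else '*** masked ***'
  end;

alter table analytics.dim_customer
  modify column email
  set masking policy email_mask;
```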

Those guardrails might feel dull, but they’re exactly what actually works for an enterprise data warehouse when the auditors come knocking.

Stuff that annoyed me

  • Surprise costs from ad hoc queries. A giant SELECT can chew through credits. We now route heavy work to a separate warehouse with a quota.
  • Looker PDTs took forever one Tuesday at 9 a.m. I moved that build to 5 a.m. and cut it in half by pushing joins into dbt.
  • Fivetran hit a weird NetSuite schema change. A column type flipped. My model broke. I added a CAST in staging and set up a Slack alert for schema drift.

What I’d do again (and what I wouldn’t)

I’d do:

  • Start with one source, one model, one dashboard. Prove it.
  • Use dbt tests from day one. Even the simple ones.
  • Keep stable view names for BI. Change under the hood, not on the surface.
  • Turn on auto-suspend. Set Small as the default warehouse.
  • Tag PII and write it down. Future you will say thanks.

I wouldn’t:

  • Let folks query prod on the biggest warehouse “just for a minute.”
  • Rename core fields without a deprecation plan.
  • Pack huge objects into Airflow XCom. Keep it lean.

If your team looks like mine

We were 6 people: two analytics engineers, one data engineer, one analyst, one BI dev, me. If that sounds like you, this stack fits:

  • Fivetran + dbt + Snowflake + Airflow + Looker

For more practical guidance

I Hired 4 Data Lake Consulting Firms. Here’s What Actually Worked.

I run data projects for a mid-size retail brand. We sell boots, backpacks, and a lot of coffee mugs. Think back-to-school rush and Black Friday storms. We have stores in six states. Our team is small. We needed a data lake that didn’t break each time a new feed showed up. Everything I’d learned from building a data lake for big data told me that resilience mattered more than bells and whistles.

So I hired four different consulting teams over two years. Some work was great. Some was… fine. If you want the slide-by-slide decision record, I also put together a granular play-by-play. Here’s what I saw, what I liked, and what I’d do next time.


Quick picture: our setup and goals

  • Cloud: mostly AWS (S3, Glue, Athena, Lake Formation), then later Azure for HR and finance (ADLS, Purview, Synapse)
  • Engines and tools: Databricks, Kafka, Fivetran, Airflow, dbt, Great Expectations
  • Need: one place for sales, supply chain, and marketing data, with clean access rules and faster reporting
  • Budget range across all work: about $2.2M over two years

Before settling on consultants, I trial-ran six packaged platforms as well—here’s my honest take on each.

You know what? We didn’t need magic. We needed boring, steady pipes and clear names. And less drama on Monday mornings. If you’re still wrapping your head around what a clean, well-labeled data lake actually looks like, I recommend skimming the plain-English walkthrough on BaseNow before you start reviewing proposals.


Slalom: Fast wins on AWS, with a few gaps

We brought in Slalom first. (We leaned on Slalom's AWS Glue Services offering.) Goal: stand up an AWS data lake and show real value in one quarter.

  • Time: 12 weeks
  • Cost to us: about $350K
  • Stack: S3 + Glue + Athena + Lake Formation + Databricks (Delta Lake)

What went well:

  • They ran tight whiteboard sessions. The kind where the markers squeak and everyone nods. We left with a clear “bronze, silver, gold” flow.
  • They set up Delta tables that actually worked. Our weekly sales job dropped from 3 hours to 6 minutes. That one change made our merch team smile. Big win.
  • They built a “starter pack” in Git. We still use the repo layout.

What bugged me:

  • They spent two weeks on slides. The slides were pretty. My CIO loved them. My engineers rolled their eyes.
  • Data quality was thin. We had checks, but not enough guardrails. We caught bad SKUs late, which bit us during a promo weekend.
  • If you’re wondering how I later solved that QA gap, I tried a purpose-built lake testing playbook—full rundown here.

Real moment:

  • On week 10, we ran a price test. Athena queries that used to time out came back in under a minute. I texted our planner. She replied with three fire emojis. I’ll take it.

Best fit:

  • If you want visible wins on AWS, fast, and you can add your own QA later.

Databricks Professional Services: Deep fixes, less hand-holding

We used Databricks ProServe for a hard lift. (Officially, that's Databricks Professional Services.) We moved off EMR jobs that were flaky. Small files everywhere. Slow checkpoints. You name it.

  • Time: 8 weeks
  • Cost: about $180K
  • Stack: Databricks + Delta Lake + Auto Loader + Unity Catalog pilot

What went well:

  • They knew the platform cold. They fixed our small file mess with Auto Loader tweaks and better partitioning. Jobs ran 28% cheaper the next month. That hit our cloud bill right away.
  • They paired with our devs. Real code, real reviews. No fluff.
  • They set up a job failure playbook. Pager had fewer 2 a.m. pings. My on-call folks slept again.

What bugged me:

  • Less friendly for non-engineers. They talk fast. They use a lot of terms. My business partners got lost in calls.
  • Not cheap. Worth it for the hard stuff, but your wallet feels it.

Real moment:

  • We had a nasty merge bug in bronze-to-silver. Their lead hopped on at 7 a.m. We shipped a fix by lunch. No blame, just work. That won me over.

Best fit:

  • If your issue is deep platform pain, and you want engineers who live in notebooks and care about throughput.

Thoughtworks: Strong on data contracts and governance, slower pace

We saw our lake grow. Rules got messy. So we hired Thoughtworks to clean up the “how,” not just the “what.”

  • Time: 16 weeks
  • Cost: around $420K
  • Stack: Azure ADLS for HR/finance, plus Purview, Synapse, Databricks, Great Expectations, dbt tests, data contracts

What went well:

  • They brought product thinking. Each data set had an owner, a promise, and tests. We used Great Expectations to catch bad rows before they spread.
  • Purview got real tags. Not just “table_01.” We set row-level rules for HR data that kept salary safe but let us report headcount by store. Clean and calm.
  • The docs were actually good. Clear runbooks. Clear words. We still hand them to new hires.

What bugged me:

  • Slower pace. They will stop you and say, “let’s fix the shape first.” When a promo is live, that’s hard to hear.
  • They love refactors. They were right, mostly. But it stretched the timeline.

Real moment:

  • We rolled out data contracts for vendor feeds. A vendor sent a new column with a weird date. The test failed fast. The bad data never hit our gold layer. No fire drill. I wanted cupcakes.

Best fit:

  • If you need trust, rules, and steady habits. Less flash. More craft.

Accenture: Big program power, heavy change control

We used Accenture for a larger supply chain push across regions. Nightly feeds, near-real-time stock level updates, and vendor scorecards.

  • Time: 9 months
  • Cost: about $1.2M
  • Stack: Azure + Kafka + Fivetran + Databricks + Synapse + Power BI

What went well:

  • They handled a lot. PMO, status, offshore build, weekly risk logs. The train moved.
  • Their near-real-time stock stream worked. We cut out-of-stock “ghosts” by ~14%. Stores had better counts. Fewer weird calls from managers.

What bugged me:

  • Change requests took ages. A new vendor feed needed six weeks of paperwork. My buyers lost patience.
  • Layers on layers. Senior folks in pitch, then handoffs. The delivery team was solid by month two, but the early shuffle slowed us.

Real moment:

  • We had a weekend cutover with three war rooms on Slack. They brought pizza. We brought energy drinks. It was corny, but we shipped. Monday was quiet. Quiet is gold.

Best fit:

  • If you need a big, steady crew and heavy program control. Budget for change requests, and set clear gates up front.

Small notes on cost, people, and handoff

  • Don’t chase “one lake to rule them all.” We kept HR on Azure with tight rules. Sales lived on AWS. That split kept risk low.
  • For a broader view on when to use separate domains (data mesh) or centralized pipes (fabric), you can skim my field notes—I tried data lake, data mesh, and data fabric, here’s my real take.
  • Pay for a real handoff. Ask for runbooks, shadow weeks, and a “you break it, we fix it” period. We did this with two firms. Those are the stacks that still run smooth.
  • Watch data quality early. Add tests at bronze. It feels slow. It makes gold faster.

My scorecard (plain talk)

  • Slalom: A- for quick AWS wins. Could use stronger QA.
  • Databricks ProServe: A for deep platform fixes. Less shiny for non-tech folks.
  • Thoughtworks: A- for contracts and trust. Slower pace, worth it if you can wait.
  • Accenture: B+ for large programs. Strong engine, heavy on process.

What I’d do differently next time

  • Write the success yardsticks before kickoff: query speed, job cost, error budget, and user wait time. Simple numbers everyone can repeat.
  • Put data contracts in the SOW, not as a “maybe later.”
  • Ask for cost guardrails:
