I’m Kayla, and I’m a data person who cares a lot about tests. Not because I’m a robot. Because I hate bad numbers on a dashboard. They ruin trust fast.
Last year, I ran a full testing setup for a real data lake. It was for a mid-size retail group in the U.S. We used S3, Databricks with Delta tables, Glue catalog, Airflow, and Power BI. For checks, I used Great Expectations, PySpark unit tests with pytest, and a simple JSON schema contract. It was not fancy. But it worked. Most days.
So, did my strategy help? Yes. Did it catch messy stuff before it hit exec reports? Also yes. Did it break sometimes? Oh, you bet.
Let me explain.
What I Actually Built
- Zones: raw, clean, and serve (think: landing, logic, and ready-to-use)
- Tools: Great Expectations for data checks, pytest for Spark code, Airflow for runs, GitHub Actions for CI
- Formats: JSON and CSV in raw, Delta in clean and serve
- Contracts: JSON Schema in Git for each source table
- Alerts: Slack and email with short, plain messages
For teams still weighing which storage engine or managed service to adopt, my comparison of six leading providers in “I Tried 6 Data Lake Vendors—Here’s My Honest Take” might save you some evaluation cycles.
It sounds tidy. It wasn’t always tidy. But the map helped.
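Since the alerts line above is doing a lot of work, here’s a minimal sketch of the kind of message we sent, assuming a standard Slack incoming-webhook URL in an environment variable; the function name and wording are illustrative, not our exact code.

```python
# Minimal alert helper: post one short, plain line to Slack.
# Assumes SLACK_WEBHOOK_URL points at a standard Slack incoming webhook.
import json
import os
import urllib.request

def send_alert(check: str, table: str, detail: str, rerun_url: str) -> None:
    """Send a one-line failure message with a link to rerun the job."""
    text = f"{check} failed on {table}: {detail}. Rerun: {rerun_url}"
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example: send_alert("Nulls in customer_id", "clean.customers", "fill rate 52%", "https://airflow/...")
```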
The Core Strategy, Step by Step
1) Raw Zone: Guard the Gate
- Schema check: Does the column list match the contract?
- Row count check: Did we get anything at all?
- File check: Is the file type right? Is the gzip real gzip?
- Partition check: Did the date folder match the file header date?
Real example: Our loyalty feed sent 17 CSV files with the wrong date in the header. My check saw a date mismatch and stopped the load. We asked the vendor to resend. They did. No broken churn chart later. Small win.
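Here’s a minimal sketch of that raw-zone gate, assuming a local CSV or CSV.gz drop, a small contract dict, and a txn_date column that carries the feed’s header date; the column names are illustrative, not the real loyalty schema.

```python
# Raw-zone gate: schema, row count, file type, and partition-date checks.
import csv
import gzip
from pathlib import Path

LOYALTY_CONTRACT = {"columns": ["member_id", "store_id", "txn_date", "points"]}

def check_raw_file(path: str, partition_date: str) -> list:
    """Return failure messages; an empty list means the file may load."""
    errors = []
    p = Path(path)

    # File check: a .gz file must start with the gzip magic bytes 0x1f 0x8b.
    if p.suffix == ".gz" and p.read_bytes()[:2] != b"\x1f\x8b":
        return [f"{p.name}: claims gzip but is not gzip"]

    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, "rt", newline="") as f:
        reader = csv.DictReader(f)
        # Schema check: the column list must match the contract exactly.
        if reader.fieldnames != LOYALTY_CONTRACT["columns"]:
            errors.append(f"schema mismatch: {reader.fieldnames}")
        rows = list(reader)

    # Row count check: did we get anything at all?
    if not rows:
        errors.append("zero data rows")

    # Partition check: the date folder must match the date inside the file.
    if not errors and rows[0].get("txn_date") != partition_date:
        errors.append(f"date mismatch: folder={partition_date}, file={rows[0].get('txn_date')}")

    return errors
```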
2) Clean Zone: Fix and Prove It
- Null rules: No nulls in keys; set sane defaults
- Duplicates: Check for dup keys by store_id + date
- Join checks: After a join, row counts should make sense
- Business rules: Price >= 0; refund_date can’t be before sale_date
Real example: We hit a null spike in the product table. Fill rate for brand dropped from 87% to 52% in one run. Alert fired. We paused the model. Vendor had a code change. They patched it next day. We backfilled. The chart didn’t flutter.
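For flavor, here’s a minimal PySpark sketch of those clean-zone rules, assuming a Delta-enabled Spark session and a sales table with store_id, sale_date, price, and refund_date columns; the path and column names are illustrative.

```python
# Clean-zone rules: null keys, duplicate keys, and basic business rules.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clean_zone_checks").getOrCreate()
sales = spark.read.format("delta").load("/lake/clean/sales")  # illustrative path

failures = {}

# Null rules: keys must never be null.
failures["null_keys"] = sales.filter(
    F.col("store_id").isNull() | F.col("sale_date").isNull()
).count()

# Duplicates: exactly one row per store_id + sale_date.
failures["dup_keys"] = (
    sales.groupBy("store_id", "sale_date").count().filter(F.col("count") > 1).count()
)

# Business rules: price >= 0, and a refund can't land before the sale.
failures["negative_price"] = sales.filter(F.col("price") < 0).count()
failures["refund_before_sale"] = sales.filter(
    F.col("refund_date") < F.col("sale_date")
).count()

if any(failures.values()):
    raise ValueError(f"clean-zone checks failed: {failures}")
```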
3) Serve Zone: Trust but Verify
- Totals: Sales by day should match POS files within 0.5%
- Dimension drift: If store_count jumps by 20% in a day, flag it
- Freshness: Fact tables must be less than 24 hours old on weekdays
- Dashboard checks: Compare top-10 products to last week’s list
Real example: On a Monday, the weekend sales were light by 12%. Our watermark test saw late data. The recovery job backfilled Sunday night files. Reports self-healed by noon. No angry sales calls. I slept fine.
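Those serve-zone checks are mostly arithmetic. Here’s a minimal sketch, assuming the daily totals and dimension counts are already aggregated on both sides; the thresholds mirror the list above.

```python
# Serve-zone guards: reconciliation tolerance, dimension drift, and freshness.
from datetime import datetime, timedelta, timezone

def totals_match(lake_total: float, pos_total: float, pct: float = 0.5) -> bool:
    """Sales by day must match the POS files within pct percent."""
    if pos_total == 0:
        return lake_total == 0
    return abs(lake_total - pos_total) / abs(pos_total) * 100 <= pct

def dimension_drift(today_count: int, yesterday_count: int, pct: float = 20.0) -> bool:
    """True if a dimension's row count jumped by more than pct percent in a day."""
    if yesterday_count == 0:
        return today_count > 0
    return abs(today_count - yesterday_count) / yesterday_count * 100 > pct

def is_fresh(latest_event_time: datetime, max_age_hours: int = 24) -> bool:
    """Facts must be less than max_age_hours old (we only enforced this on weekdays)."""
    return datetime.now(timezone.utc) - latest_event_time <= timedelta(hours=max_age_hours)
```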
The Tests I Liked Most
- Schema version gate: Contracts lived in Git. If a source added a column, we bumped the version. The pipeline refused to run until we added a rule. It felt strict. It saved us.
- PII guard: We ran a regex scan for emails, phones, and SSN-like strings in clean tables. One day, a supplier sent a “customer_email” field hidden in a notes column. The job failed on purpose. We masked it, reloaded, and moved on. (There’s a sketch of the scan right after this list.)
- Small files alarm: If a partition had more than 500 files under 5 MB, we warned. We then auto-merged. This cut read time on Athena from 2.3 minutes to 28 seconds for a heavy SKU report.
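Here’s roughly what the PII guard looked like: a minimal sketch with illustrative regexes, not a complete detector.

```python
# PII guard: fail the job on purpose if clean tables contain PII-looking strings.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_pii(values) -> dict:
    """Count PII-looking hits per pattern across a column's string values."""
    hits = {name: 0 for name in PII_PATTERNS}
    for value in values:
        for name, pattern in PII_PATTERNS.items():
            if value and pattern.search(str(value)):
                hits[name] += 1
    return {name: n for name, n in hits.items() if n > 0}

# Example: scan_for_pii(["ok note", "reach me at jane@example.com"]) -> {"email": 1}
```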
What Broke (and how I patched it)
- Great Expectations on huge tables: It crawled on wide, hot data. Fix: sample 5% on row-heavy checks, 100% on key checks. Fast enough, still useful.
- Dates and time zones: Our Sydney store runs a day ahead of UTC for part of each day, so its files carried the wrong date from the pipeline’s point of view. Schedules slipped. Fix: use event_time, not load_time, for freshness checks.
- Late CDC events: Debezium sent update messages hours later. Our SCD2 tests thought we missed rows. Fix: widen the watermark window and add a daily backfill at 2 a.m.
- Flaky joins in tests: Dev data did not match prod keys. Fix: seed small, stable test data in a separate Delta path. Tests ran the same each time.
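The seeded-test-data fix is the one I’d copy first. Here’s a minimal pytest sketch, assuming a local Spark session; the rows and table shape are made up for illustration, and the real seed data lived in its own Delta path as noted above.

```python
# Stable seed data so Spark tests behave the same on every run.
import pytest
from pyspark.sql import SparkSession, functions as F

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("lake_tests").getOrCreate()

@pytest.fixture(scope="session")
def seeded_sales(spark):
    # Small, stable rows: same keys, same values, every run.
    rows = [
        ("S001", "2024-11-01", 19.99),
        ("S001", "2024-11-02", 0.00),
        ("S002", "2024-11-01", 4.50),
    ]
    return spark.createDataFrame(rows, ["store_id", "sale_date", "price"])

def test_no_negative_prices(seeded_sales):
    assert seeded_sales.filter("price < 0").count() == 0

def test_one_row_per_store_and_day(seeded_sales):
    dupes = seeded_sales.groupBy("store_id", "sale_date").count().filter(F.col("count") > 1)
    assert dupes.count() == 0
```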
A Few Real Numbers
- We blocked 14 bad loads in 6 months. Most were schema changes and null spikes.
- Alert noise dropped from 23 per week to 5 after we tuned thresholds and grouped messages.
- A broken discount rule would’ve cost us a 3% error on gross margin for two weeks. A simple “price >= cost when promo=false” test caught it.
The Part That Felt Like Magic (and wasn’t)
We added “data contracts” per source. Just a JSON file with:
- Column name, type, and nullable
- Allowed values for enums
- Sample rate for checks
- On-call contact for the source
When a source wanted a change, they opened a PR. The tests ran in CI on sample files. If all passed, we merged. No more surprise columns. It was boring. Boring is good.
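To make that concrete, here’s a minimal sketch of one contract and the CI check against sample rows. The real contracts were plain JSON files in Git; writing one as a Python dict here just keeps the example runnable with the jsonschema package. Field names and the contact address are illustrative.

```python
# One per-source contract plus the CI-style validation of sample rows.
from jsonschema import Draft7Validator

PRODUCT_CONTRACT = {
    "version": 2,                          # bumped whenever the source changes shape
    "owner": "vendor-oncall@example.com",  # on-call contact for the source
    "sample_rate": 1.0,                    # fraction of sample rows checked in CI
    "schema": {
        "type": "object",
        "required": ["sku", "brand", "price"],
        "properties": {
            "sku": {"type": "string"},
            "brand": {"type": ["string", "null"]},            # nullable
            "price": {"type": "number", "minimum": 0},
            "status": {"enum": ["active", "discontinued"]},   # allowed values
        },
        "additionalProperties": False,     # no more surprise columns
    },
}

def validate_sample(rows) -> list:
    """Return human-readable errors; CI fails the PR if any come back."""
    validator = Draft7Validator(PRODUCT_CONTRACT["schema"])
    return [
        f"row {i}, {list(err.path)}: {err.message}"
        for i, row in enumerate(rows)
        for err in validator.iter_errors(row)
    ]
```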
Things I’d Do Differently Next Time
- Write fewer, sharper rules. Key fields first. Facts next. Fancy later.
- Put check names in plain English. “Nulls in customer_id” beats “GE-Rule-004.”
- Add cost checks early. Big queries that hit wide tables should get a warning.
- Store one-page run books for each test. When it fails, show how to fix it.
Quick Starter Kit (what worked for me)
- Pick 10 checks only:
  - Schema match
  - Row count > 0
  - Freshness by event_time
  - No nulls in keys
  - Duplicates = 0 for keys
  - Price >= 0
  - Date logic valid
  - Totals within 0.5% vs source
  - PII scan off in raw, on in clean
  - Small file alarm
- Automate with Great Expectations and pytest
- Run smoke tests on every PR with GitHub Actions
- Alert to Slack with short, clear text and a link to rerun
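If you want a starting point in code, here’s a minimal sketch of a few of those checks using Great Expectations’ classic pandas API (ge.read_csv); newer GE releases use a different entry point, and the file path and key columns are illustrative.

```python
# A handful of the starter checks, expressed as Great Expectations expectations.
import great_expectations as ge

df = ge.read_csv("serve/daily_sales.csv")  # illustrative extract of a serve table

df.expect_table_row_count_to_be_between(min_value=1)                 # row count > 0
df.expect_column_values_to_not_be_null("store_id")                   # no nulls in keys
df.expect_column_values_to_be_between("price", min_value=0)          # price >= 0
df.expect_compound_columns_to_be_unique(["store_id", "sale_date"])   # duplicates = 0 for keys

results = df.validate()
if not results.success:  # exact result shape varies by GE version
    raise SystemExit("starter checks failed")
```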
And if you’re dealing with petabyte-scale streams and wondering how the foundations scale, my build log in “I Built a Data Lake for Big Data—Here’s My Honest Take” breaks down the design decisions.
A Small Holiday Story
Black Friday hit. Feeds were wild. We saw 3 late drops, 2 schema nudges, and one scary file that said “NULL” as text. The checks held. We backfilled twice. Reports stayed steady. Folks in stores kept selling. I ate leftover pie and watched the jobs. Felt good.
Who Should Use This
- Data teams with 2 to 10 engineers
- Shops on S3, ADLS, or GCS, with Spark or SQL
- Anyone who ships daily reports that can’t be wrong
If you’re still deciding between lake, mesh, or fabric patterns, you might like my field notes in “I Tried Data Lake, Data Mesh, and Data Fabric—Here’s My Real Take.”
If you run real-time, microsecond-level stuff, you’ll need more. But for daily and hourly loads, this works.
Verdict
This setup earned its keep. It blocked 14 bad loads in six months, cut alert noise from 23 a week to 5, and kept the exec dashboards steady through Black Friday. It broke sometimes, and the fixes were mostly boring: sample the big checks, trust event_time, seed stable test data. For daily and hourly loads on a lake like this, boring is exactly what you want.
