Storing semi-structured and unstructured data, such as text, images, video, and audio, is increasingly essential for digital advertising analytics as data sources diversify beyond classic relational databases. Data lakes and data lakehouses are generally the most practical options for capturing and using this variety of website data, while traditional data warehouses offer only limited support for unstructured content.
Why Store Unstructured Data?
Modern advertising relies on data from transaction logs, user session recordings, social media posts, website images, click-through videos, and customer feedback—all in diverse formats. Storing these efficiently enables broader analytics, like sentiment analysis, creative testing, and advanced attribution modeling, all crucial for optimizing campaign effectiveness.
Data Lakes: Unstructured and Semi-Structured Data
Data lakes store all data types in their native formats, providing maximum flexibility for advertising analytics. Key characteristics:
- Ingest raw website logs (semi-structured JSON), screenshots (images), product demo videos, and promotional audio files without an enforced schema.
- Enable deep analytics across web session data, ad impressions, or user-uploaded photos and comments.
- Offer cost-effective, highly scalable object storage, often on services such as Amazon S3 or Azure Blob Storage, for ad datasets.
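As a rough illustration of "native format, no enforced schema", the sketch below lands a JSON log event and an image file unchanged under date-partitioned keys. It uses a local temporary directory as a stand-in for an object store such as S3 or Azure Blob; the `raw/<source>/dt=<date>/` layout and the helper names `object_key` and `land_raw` are assumptions made for this example, not a required convention.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def object_key(source: str, day: date, filename: str) -> str:
    """Build a date-partitioned key such as raw/web_logs/dt=2024-05-01/events.json."""
    return f"raw/{source}/dt={day.isoformat()}/{filename}"


def land_raw(lake_root: Path, key: str, payload: bytes) -> Path:
    """Write the bytes unchanged under the lake root; no schema is checked."""
    target = lake_root / key
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    lake = Path(tempfile.mkdtemp())  # hypothetical local stand-in for a bucket
    day = date(2024, 5, 1)

    # Semi-structured: one raw click event, stored as-is.
    event = {"event": "ad_click", "campaign_id": "cmp-001", "ts": "2024-05-01T12:00:00Z"}
    land_raw(lake, object_key("web_logs", day, "events.json"), json.dumps(event).encode("utf-8"))

    # Unstructured: placeholder bytes standing in for a real screenshot.
    land_raw(lake, object_key("screenshots", day, "home.png"), b"\x89PNG\r\n\x1a\n")

    for path in sorted(lake.rglob("*")):
        if path.is_file():
            print(path.relative_to(lake).as_posix())
```

With a real object store, the same keys would typically be passed to the provider's upload call instead of being written to disk.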
Data Lakehouses: Enhanced Analytics and Governance
Data lakehouses combine the flexibility of data lakes with the governance, performance, and transactional consistency of warehouses. This hybrid model is ideal for ad tech:
- Handle structured (campaign spend), semi-structured (web tracking events, XML page metadata), and unstructured (ad video files) data in one platform.
- Provide robust metadata management. For example, each video creative served on a website can be cataloged with its campaign, targeting, and creative details, which improves governance and searchability.
- Support ACID transactions, beneficial when multiple teams iterate on ad data for modeling or reporting.
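To make the cataloging and ACID points more concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module purely as a stand-in for a lakehouse table. The table name `creative_catalog`, its columns, and the example URIs are hypothetical. The point it shows is atomicity: a batch of catalog entries either lands completely or not at all, so another team never sees a half-written update.

```python
import sqlite3


def open_catalog() -> sqlite3.Connection:
    """Create an in-memory catalog table (stand-in for a lakehouse table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE creative_catalog (
            object_uri TEXT PRIMARY KEY,  -- where the video file lives
            campaign   TEXT NOT NULL,
            targeting  TEXT NOT NULL,
            creative   TEXT NOT NULL
        )
        """
    )
    conn.commit()
    return conn


def register_batch(conn: sqlite3.Connection, rows: list) -> bool:
    """Insert all rows atomically; on any constraint error nothing is kept."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO creative_catalog VALUES (?, ?, ?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    catalog = open_catalog()
    register_batch(catalog, [("s3://ads/video/spring.mp4", "spring_sale", "US 18-34", "15s pre-roll")])
    # Second batch reuses an existing URI, so the whole batch is rejected.
    accepted = register_batch(
        catalog,
        [
            ("s3://ads/video/summer.mp4", "summer_sale", "US 25-44", "30s pre-roll"),
            ("s3://ads/video/spring.mp4", "spring_sale", "US 18-34", "duplicate"),
        ],
    )
    count = catalog.execute("SELECT COUNT(*) FROM creative_catalog").fetchone()[0]
    print(accepted, count)
```

Lakehouse table formats provide comparable guarantees over files in object storage; the SQLite database here only mimics that behavior on a single machine.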
Website Data Examples in Advertising
- Text: Website content, product descriptions, user comments—stored as raw text or JSON with metadata (e.g., timestamp, author).
- Images: Banner ads, screenshots of A/B tests, and customer-uploaded content—stored as binary files with associated tags.
- Video: Pre-roll ad creatives, web page session recordings—stored as files with campaign and creative descriptors logged in metadata.
- Audio: Podcast ad segments or audio banners, often referenced by URL and indexed using metadata for analytics.
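The "binary file plus metadata" pattern in the list above can be sketched as a small manifest builder. This is an illustrative example only: the field names (`key`, `media_type`, `size_bytes`, `sha256`) and the function names are assumptions, and real pipelines would usually add campaign and creative tags from an upstream system.

```python
import hashlib
import mimetypes
from pathlib import Path


def describe_file(path: Path, root: Path) -> dict:
    """Return a metadata record for one stored file."""
    data = path.read_bytes()
    media_type, _ = mimetypes.guess_type(path.name)
    return {
        "key": path.relative_to(root).as_posix(),
        "media_type": media_type or "application/octet-stream",
        "size_bytes": len(data),
        # The content hash lets later jobs detect changed or duplicate files.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def build_manifest(root: Path) -> list:
    """Describe every file under root, in a stable sorted order."""
    return [describe_file(p, root) for p in sorted(root.rglob("*")) if p.is_file()]
```

Calling `build_manifest(Path("some/lake/prefix"))` on a directory of banners, recordings, and audio clips yields one record per file, which can then be stored as JSON alongside the assets or loaded into a catalog table.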
Challenges and Best Practices
- Metadata Management: Use open table formats such as Delta Lake or Apache Iceberg (both typically store data as Apache Parquet files) to maintain schemas and metadata tables that describe unstructured web data, supporting discoverability and versioning.
- Scalability: Choose cloud object storage and decouple compute from storage, scaling analytics for growing ad impressions and rich media.
- Governance: Implement data quality monitoring and version control, crucial for regulated industries or ad campaign compliance audits.
- Analytical Flexibility: Use platforms that can run SQL over lake storage (for example Databricks, Snowflake, or BigQuery) to process semi-structured website data for campaign reporting.
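As a small-scale sketch of what such reporting does with semi-structured logs, the function below counts clicks per campaign from newline-delimited JSON. The event shape (`event`, `campaign_id`) and the function name are assumptions for illustration; a SQL engine over lake storage would express the same logic as a filtered `GROUP BY`.

```python
import json
from collections import Counter


def clicks_per_campaign(json_lines: str) -> Counter:
    """Count ad_click events per campaign, tolerating messy input."""
    counts = Counter()
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole job
        if not isinstance(event, dict):
            continue
        if event.get("event") == "ad_click":
            # Events without a campaign are grouped under "unknown".
            counts[event.get("campaign_id", "unknown")] += 1
    return counts
```

Because raw logs carry no enforced schema, the function treats missing fields and malformed lines as expected cases rather than errors, which is the usual trade-off of schema-on-read.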
Summary Table
| Data Type | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Text | Limited, must be structured | Supported | Supported with schema integration |
| Images | Difficult, rarely supported | Supported as raw files | Supported with metadata governance |
| Video | Not supported | Supported as raw files | Supported with indexing/versioning |
| Audio | Not supported | Supported as raw files | Supported with rich metadata |
| Website Logs | Requires ETL, structure enforced | Native support in raw or semi-structured | Native, robust processing |
Modern advertising workflows demand support for semi-structured and unstructured data. Data lakes and lakehouses provide the scalable, flexible, and governed environments required to thrive as formats diversify and analytics advance.

