Storing semi-structured and unstructured data

Storing semi-structured data (JSON logs, XML metadata) and unstructured data (text, images, video, audio) is increasingly essential for digital advertising analytics as data sources diversify beyond relational databases. Data lakes and data lakehouses are generally the best fit for capturing and using this variety of website data, while traditional data warehouses remain poorly suited to unstructured content.

Why Store Unstructured Data?

Modern advertising relies on data from transaction logs, user session recordings, social media posts, website images, click-through videos, and customer feedback, each arriving in a different format. Storing these efficiently enables broader analytics, such as sentiment analysis, creative testing, and advanced attribution modeling, which are crucial for optimizing campaign effectiveness.

Data Lakes: Unstructured and Semi-Structured Data

Data lakes store all data types in their native formats, providing maximum flexibility for advertising analytics. Key characteristics:

Ingest raw website logs (semi-structured JSON), screenshots (images), product demo videos, and promotional audio files without an enforced schema.
Enable analytics across web session data, ad impressions, and user-uploaded photos and comments.
Rely on cost-effective, highly scalable object storage for ad datasets, often through services such as Amazon S3 or Azure Blob Storage.
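
As a rough illustration of "native format, no enforced schema", the sketch below writes raw events unchanged into date-partitioned folders. It uses a local temporary directory as a stand-in for object storage, and the folder layout (raw/<source>/dt=<date>/), the function names, and the sample events are all assumptions made for the example, not a prescribed convention.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def lake_path(root: Path, source: str, event_time: datetime, filename: str) -> Path:
    """Build a date-partitioned path such as root/raw/web_logs/dt=2024-05-01/file."""
    return root / "raw" / source / f"dt={event_time:%Y-%m-%d}" / filename


def write_raw_events(root: Path, source: str, events: list[dict]) -> list[Path]:
    """Append each event, unmodified, to a JSON Lines file in its date partition."""
    written = set()
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        path = lake_path(root, source, ts, "events.jsonl")
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
        written.add(path)
    return sorted(written)


# Hypothetical events; note that the two records do not share the same fields.
sample_events = [
    {"timestamp": "2024-05-01T10:15:00+00:00", "type": "page_view", "url": "/pricing"},
    {"timestamp": "2024-05-02T08:00:00+00:00", "type": "ad_click",
     "campaign": "spring_sale", "position": 2},
]

lake_root = Path(tempfile.mkdtemp())  # local stand-in for a bucket
partitions = write_raw_events(lake_root, "web_logs", sample_events)
```

Because nothing validates the records on the way in, events with different fields land side by side. Any structure is applied later, when the data is read.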

Data Lakehouses: Enhanced Analytics and Governance

Data lakehouses combine the flexibility of data lakes with the governance, performance, and transactional consistency of warehouses. This hybrid model is ideal for ad tech:

Handle structured (campaign spend), semi-structured (web tracking events, XML page metadata), and unstructured (ad video files) data in one place.
Provide robust metadata management. For example, every video creative loaded into a website can be cataloged with its campaign, target audience, and creative details, which improves governance and search.
Support ACID transactions, beneficial when multiple teams iterate on ad data for modeling or reporting.
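
To make the cataloging and transaction points concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a lakehouse catalog. SQLite is not a lakehouse engine; it is used here only because it shows the all-or-nothing behavior of a transaction. The table name, columns, and example URIs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE creative_catalog (
        object_uri TEXT PRIMARY KEY,  -- where the video file lives
        campaign   TEXT NOT NULL,
        audience   TEXT NOT NULL,
        media_type TEXT NOT NULL
    )
    """
)


def register_creatives(conn, rows):
    """Insert a batch of catalog rows atomically: all succeed or none are kept."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO creative_catalog VALUES (?, ?, ?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = register_creatives(conn, [
    ("s3://example-bucket/video/a.mp4", "spring_sale", "returning_visitors", "video"),
    ("s3://example-bucket/video/b.mp4", "spring_sale", "new_visitors", "video"),
])

# This batch repeats an existing URI, so the whole batch is rolled back,
# including the otherwise valid first row.
failed = register_creatives(conn, [
    ("s3://example-bucket/video/c.mp4", "summer_launch", "new_visitors", "video"),
    ("s3://example-bucket/video/a.mp4", "summer_launch", "new_visitors", "video"),
])

row_count = conn.execute("SELECT COUNT(*) FROM creative_catalog").fetchone()[0]
```

The second batch leaves no partial rows behind, which is the property that matters when several teams write to the same ad dataset at once.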

Website Data Examples in Advertising

Text: Website content, product descriptions, and user comments, stored as raw text or as JSON with metadata (e.g., timestamp, author).
Images: Banner ads, screenshots of A/B tests, and customer-uploaded content, stored as binary files with associated tags.
Video: Pre-roll ad creatives and web page session recordings, stored as files with campaign and creative descriptors logged in metadata.
Audio: Podcast ad segments or audio banners, often referenced by URL and indexed using metadata for analytics.
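
One simple way to keep tags attached to a binary file is a JSON "sidecar" written next to it. The sketch below assumes that approach; the .meta.json suffix, the field names, and the placeholder bytes are illustrative choices rather than a standard.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def store_asset(root: Path, name: str, payload: bytes, tags: dict) -> Path:
    """Write the binary payload plus a JSON sidecar; return the sidecar path."""
    asset_path = root / name
    asset_path.write_bytes(payload)
    sidecar = {
        "file": name,
        "size_bytes": len(payload),
        # A content hash lets later jobs detect changed or duplicated files.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "tags": tags,
    }
    sidecar_path = root / (name + ".meta.json")
    sidecar_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
    return sidecar_path


asset_root = Path(tempfile.mkdtemp())
fake_banner = b"\x89PNG placeholder bytes"  # stand-in for a real image file
meta_path = store_asset(
    asset_root, "banner_v1.png", fake_banner,
    {"campaign": "spring_sale", "variant": "A"},
)
metadata = json.loads(meta_path.read_text(encoding="utf-8"))
```

The same pattern applies to video and audio: the media stays an opaque file, and analytics queries run against the small, structured metadata record.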

Challenges and Best Practices

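Metadata Management: Use a table format such as Delta Lake, which stores data as Apache Parquet files alongside a transaction log, to record schemas and versions for the metadata describing unstructured web data. Parquet on its own is a columnar file format and does not provide versioning.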
Scalability: Choose cloud object storage and decouple compute from storage so that analytics can scale independently as ad impression volumes and rich media grow.
Governance: Implement data quality monitoring and version control, crucial for regulated industries or ad campaign compliance audits.
Analytical Flexibility: Use query engines that can run SQL over data in the lake (for example Databricks, Snowflake, or BigQuery) to process semi-structured and other non-tabular website data for campaign reporting.
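
The data quality and schema-on-read ideas above can be sketched in a few lines of plain Python. The example parses JSON Lines, sets aside records that fail a minimal check, and counts ad clicks per campaign. The required fields, event types, and sample records are assumptions made for illustration; a real pipeline would more likely run this kind of check inside its query engine.

```python
import json
from collections import Counter

REQUIRED_FIELDS = {"timestamp", "type"}  # hypothetical minimum schema


def summarize_events(lines):
    """Parse JSON Lines, set aside invalid records, and count ad clicks per campaign."""
    clicks = Counter()
    rejected = []  # (line number, reason) pairs for quality monitoring
    for line_no, line in enumerate(lines, start=1):
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            rejected.append((line_no, "not valid JSON"))
            continue
        if not isinstance(event, dict):
            rejected.append((line_no, "not a JSON object"))
            continue
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            rejected.append((line_no, f"missing {sorted(missing)}"))
            continue
        if event["type"] == "ad_click":
            # Optional fields get a default instead of failing the record.
            clicks[event.get("campaign", "unknown")] += 1
    return clicks, rejected


raw_lines = [
    '{"timestamp": "2024-05-01T10:15:00Z", "type": "ad_click", "campaign": "spring_sale"}',
    '{"timestamp": "2024-05-01T10:16:00Z", "type": "page_view", "url": "/pricing"}',
    '{"type": "ad_click", "campaign": "spring_sale"}',
    '{"timestamp": "2024-05-01T10:18:00Z", "type": "ad_click"}',
    'not json at all',
]

clicks, rejected = summarize_events(raw_lines)
```

Tracking the rejected lines separately, rather than dropping them silently, gives a simple quality signal that can be monitored over time.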

Summary Table

Data Type	Data Warehouse	Data Lake	Data Lakehouse
Text	Limited, must be structured	Supported	Supported with schema integration
Images	Difficult, rarely supported	Supported as raw files	Supported with metadata governance
Video	Not supported	Supported as raw files	Supported with indexing/versioning
Audio	Not supported	Supported as raw files	Supported with rich metadata
Website Logs	Requires ETL, structure enforced	Native support in raw or semi-structured form	Native, robust processing

Modern advertising workflows demand support for semi-structured and unstructured data. Data lakes and lakehouses provide the scalable, flexible, and governed environments required to thrive as formats diversify and analytics advance.

Also check out Amplitude’s big guide to data governance: https://amplitude.com/explore/data/data-governance-guide

Adman Analytics
