Storing semi-structured and unstructured data

Storing semi-structured and unstructured data (text, images, video, and audio) is increasingly essential for digital advertising analytics as data sources diversify beyond relational databases. Data lakes and data lakehouse architectures are generally the best fit for capturing and using this variety of website data, while traditional data warehouses remain limited for unstructured content.

Why Store Unstructured Data?

Modern advertising relies on data from transaction logs, user session recordings, social media posts, website images, video ads, and customer feedback, all in diverse formats. Storing these efficiently enables broader analytics, such as sentiment analysis, creative testing, and advanced attribution modeling, all of which help optimize campaign effectiveness.

Data Lakes: Unstructured and Semi-Structured Data

Data lakes store all data types in their native formats, providing maximum flexibility for advertising analytics. Data lakes:

  • Ingest raw website logs (semi-structured JSON), screenshots (images), product demo videos, and promotional audio files without an enforced schema.
  • Enable deep analytics across web session data, ad impressions, and user-uploaded photos and comments.
  • Offer cost-effective, highly scalable object storage for ad datasets, often built on services such as Amazon S3 or Azure Blob Storage.
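As a rough illustration of how raw files might be laid out in a lake, the sketch below builds date-partitioned object keys for a JSON click event and a video file. The `raw/<source>/year=/month=/day=` layout, the `lake_key` helper, and the sample event fields are assumptions made for this example, not a convention required by any particular storage service.

```python
import json
from datetime import datetime


def lake_key(source: str, event_time: datetime, filename: str) -> str:
    """Build a date-partitioned object key (hypothetical layout)."""
    return (
        f"raw/{source}/"
        f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}/"
        f"{filename}"
    )


# A raw clickstream event kept in its native JSON form; no schema is enforced.
event = {
    "session_id": "s-123",
    "page": "/pricing",
    "ad_id": "banner-7",
    "ts": "2024-03-15T10:22:05+00:00",
}

ts = datetime.fromisoformat(event["ts"])
log_key = lake_key("web_logs", ts, "events.json")
video_key = lake_key("ad_videos", ts, "demo.mp4")

# These are the bytes an object-store client would upload under log_key.
payload = json.dumps(event).encode("utf-8")

print(log_key)
print(video_key)
```

Partitioning keys by source and date is one common way to keep later scans cheap, since a query for a single day only has to list one prefix.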

Data Lakehouses: Enhanced Analytics and Governance

Data lakehouses combine the flexibility of data lakes with the governance, performance, and transactional consistency of warehouses. This hybrid model is ideal for ad tech:

  • Handle structured (campaign spend), semi-structured (web tracking events, XML page metadata), and unstructured (ad video files) data in one platform.
  • Provide robust metadata management. For example, every video creative loaded into a website can be cataloged with campaign, targeting, and creative details, which improves governance and search.
  • Support ACID transactions, beneficial when multiple teams iterate on ad data for modeling or reporting.
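To make the cataloging and transaction points more concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for a lakehouse catalog. A real lakehouse would use its own table format and engine; the `creative_catalog` table, its columns, and the `s3://example-bucket/...` URIs are all hypothetical. The point is only the all-or-nothing behavior of a batch write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE creative_catalog (
        object_uri    TEXT PRIMARY KEY,  -- where the video file lives
        campaign      TEXT NOT NULL,
        audience      TEXT NOT NULL,
        creative_type TEXT NOT NULL
    )
    """
)


def register_creatives(conn, rows):
    """Insert a batch atomically. Returns True if committed, False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO creative_catalog VALUES (?, ?, ?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = register_creatives(conn, [
    ("s3://example-bucket/video/spring_a.mp4", "spring_sale", "returning_visitors", "pre_roll"),
    ("s3://example-bucket/video/spring_b.mp4", "spring_sale", "new_visitors", "pre_roll"),
])

# The second row duplicates an existing URI, so the whole batch is rolled back,
# including the valid first row.
failed = register_creatives(conn, [
    ("s3://example-bucket/video/summer_a.mp4", "summer", "all_visitors", "pre_roll"),
    ("s3://example-bucket/video/spring_a.mp4", "spring_sale", "all_visitors", "pre_roll"),
])

count = conn.execute("SELECT COUNT(*) FROM creative_catalog").fetchone()[0]
print(ok, failed, count)
```

When several teams write to the same ad dataset, this kind of atomic batch keeps readers from ever seeing a half-applied update.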

Website Data Examples in Advertising

  • Text: Website content, product descriptions, user comments—stored as raw text or JSON with metadata (e.g., timestamp, author).
  • Images: Banner ads, screenshots of A/B tests, and customer-uploaded content—stored as binary files with associated tags.
  • Video: Pre-roll ad creatives, web page session recordings—stored as files with campaign and creative descriptors logged in metadata.
  • Audio: Podcast ad segments or audio banners, often referenced by URL and indexed using metadata for analytics.
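One way to attach tags and descriptors to a binary file is a small JSON "sidecar" record stored next to it. The sketch below is a minimal version of that idea; the `AssetMetadata` fields, the `describe_asset` helper, and the placeholder bytes are assumptions for illustration rather than a standard schema.

```python
import hashlib
import json
import mimetypes
from dataclasses import asdict, dataclass


@dataclass
class AssetMetadata:
    filename: str
    media_type: str
    size_bytes: int
    sha256: str
    tags: list


def describe_asset(filename: str, content: bytes, tags) -> AssetMetadata:
    """Derive basic metadata for a binary asset (hypothetical helper)."""
    media_type, _ = mimetypes.guess_type(filename)
    return AssetMetadata(
        filename=filename,
        media_type=media_type or "application/octet-stream",
        size_bytes=len(content),
        sha256=hashlib.sha256(content).hexdigest(),  # lets audits detect changed files
        tags=list(tags),
    )


# Placeholder bytes stand in for a real banner image.
banner_bytes = b"\x89PNG placeholder bytes"
banner = describe_asset("banner_spring.png", banner_bytes, ["spring_sale", "variant_a"])

# The sidecar JSON would be written alongside the image in the lake.
sidecar_json = json.dumps(asdict(banner), indent=2)
print(sidecar_json)
```

The same record shape works for video and audio files, with the media type inferred from the file extension.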

Challenges and Best Practices

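  • Metadata Management: Use open table formats such as Delta Lake or Apache Iceberg (which typically store data as Apache Parquet files) to add schemas and versioning to semi-structured web data, and catalog unstructured files with metadata so they stay discoverable.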
  • Scalability: Choose cloud object storage and decouple compute from storage, scaling analytics for growing ad impressions and rich media.
  • Governance: Implement data quality monitoring and version control, which are crucial for regulated industries and ad campaign compliance audits.
  • Analytical Flexibility: Use query engines that can run SQL over lake data (for example Databricks, Snowflake, or BigQuery) to process semi-structured website data for campaign reporting.
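The analytical flexibility point can be sketched with a toy example: nested JSON log lines are flattened into rows and then queried with SQL. An in-memory `sqlite3` database stands in for a lake query engine here, and the log fields (`ts`, `event`, `ad.campaign`, `ad.creative`) are invented for the example.

```python
import json
import sqlite3

# Semi-structured log lines; the last one has no "ad" object at all.
raw_lines = [
    '{"ts": "2024-03-15T10:00:00Z", "event": "impression", "ad": {"campaign": "spring_sale", "creative": "a"}}',
    '{"ts": "2024-03-15T10:00:05Z", "event": "click", "ad": {"campaign": "spring_sale", "creative": "a"}}',
    '{"ts": "2024-03-15T10:01:00Z", "event": "impression", "ad": {"campaign": "summer", "creative": "b"}}',
    '{"ts": "2024-03-15T10:02:00Z", "event": "impression"}',
]


def flatten(line: str):
    """Turn one nested JSON record into a flat row, tolerating missing fields."""
    rec = json.loads(line)
    ad = rec.get("ad", {})
    return (rec["ts"], rec["event"], ad.get("campaign"), ad.get("creative"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, event TEXT, campaign TEXT, creative TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)", [flatten(line) for line in raw_lines]
)

# Impressions and clicks per campaign, skipping events with no campaign attached.
report = conn.execute(
    """
    SELECT campaign,
           SUM(event = 'impression') AS impressions,
           SUM(event = 'click')      AS clicks
    FROM events
    WHERE campaign IS NOT NULL
    GROUP BY campaign
    ORDER BY campaign
    """
).fetchall()

for row in report:
    print(row)
```

The schema is applied when the data is read rather than when it is stored, which is why records with missing fields do not block ingestion.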

Summary Table

Data Type    | Data Warehouse                   | Data Lake                                     | Data Lakehouse
Text         | Limited, must be structured      | Supported                                     | Supported with schema integration
Images       | Difficult, rarely supported      | Supported as raw files                        | Supported with metadata governance
Video        | Not supported                    | Supported as raw files                        | Supported with indexing/versioning
Audio        | Not supported                    | Supported as raw files                        | Supported with rich metadata
Website Logs | Requires ETL, structure enforced | Native support in raw or semi-structured form | Native, robust processing

Modern advertising workflows demand support for semi-structured and unstructured data. Data lakes and lakehouses provide the scalable, flexible, and governed environments needed as formats diversify and analytics advance.

Also check out Amplitude’s big guide to data governance: https://amplitude.com/explore/data/data-governance-guide
