Data Lake

Data Management

A data lake is a centralized storage repository that holds vast amounts of raw data in its native format—structured, semi-structured, and unstructured—until needed for analytics, machine learning, or other processing.



What Is a Data Lake?

A data lake is a storage system that holds large volumes of raw data in its original format until it’s needed. Unlike data warehouses, which require data to be structured before loading, data lakes accept data as-is: structured tables, semi-structured JSON, unstructured documents, images, and more.

Key characteristics:

Schema-on-read: Structure applied when data is accessed, not when stored
Raw data storage: Preserves original data without transformation
Scalable: Handles petabytes of data cost-effectively
Flexible: Accommodates any data type or format
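
As a rough illustration of schema-on-read, the sketch below keeps raw JSON lines untouched and applies a schema only when a consumer reads them. The sample records, the `read_with_schema` helper, and `BILLING_SCHEMA` are all hypothetical names invented for this example, not part of any particular product.

```python
import json

# Raw records stored exactly as received; fields vary from record to record.
RAW_LINES = [
    '{"id": "1", "amount": "19.99", "currency": "USD"}',
    '{"id": "2", "amount": "5", "note": "no currency field"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields and cast their types."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field)
            # Missing fields fall back to the consumer's chosen default.
            row[field] = cast(value) if value is not None else default
        rows.append(row)
    return rows

# Hypothetical schema chosen by one consumer; another consumer could read
# the same stored lines with a different schema.
BILLING_SCHEMA = {
    "id": (int, None),
    "amount": (float, 0.0),
    "currency": (str, "UNKNOWN"),
}

rows = read_with_schema(RAW_LINES, BILLING_SCHEMA)
```

The stored lines never change; only the reader decides which fields matter and how to type them, which is the opposite of a warehouse's schema-on-write.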

Data Lake vs. Data Warehouse

Aspect	Data Lake	Data Warehouse
Data format	Raw, any format	Processed, structured
Schema	Schema-on-read	Schema-on-write
Users	Data scientists, engineers	Business analysts
Processing	Flexible, exploratory	Predefined queries
Cost	Lower storage cost	Higher storage cost
Data quality	Variable	Curated
Use cases	ML, exploration	BI, reporting

Data Lake Architecture

Ingestion Layer

Bringing data into the lake:

Batch ingestion (scheduled loads)
Streaming ingestion (real-time)
File drops (manual uploads)

Storage Layer

Where raw data resides:

Cloud object storage (S3, Azure Blob, GCS)
Organized by source, date, or subject
Metadata catalogs track what’s stored
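
One common way to organize objects by source and date is a Hive-style partitioned key. The sketch below is a minimal example under assumed names: the `object_key` helper, the zone and source names, and the file name are illustrative, not a required layout.

```python
from datetime import date

def object_key(zone, source, dataset, day, filename):
    """Build a Hive-style partitioned key: zone/source/dataset/year=/month=/day=/file."""
    return (
        f"{zone}/{source}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )

# Hypothetical example: a CRM contacts extract received on 7 March 2024.
key = object_key("raw", "crm", "contacts", date(2024, 3, 7), "part-0001.json")
print(key)  # raw/crm/contacts/year=2024/month=03/day=07/part-0001.json
```

Zero-padded date segments keep keys sorted chronologically when listed, and the `name=value` convention is one that many query engines can interpret as partition columns.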

Processing Layer

Transforming data for use:

Batch processing (Spark, Hadoop MapReduce)
Stream processing (Flink, Kafka Streams)
SQL engines (Presto, Athena)

Consumption Layer

Accessing processed data:

BI tools
Data science notebooks
Machine learning pipelines
Applications

Data Lake Zones

Data lakes typically organize data into zones:

Raw/Bronze Zone

Data exactly as received:

No transformations applied
Complete history preserved
Source of truth for what arrived

Cleaned/Silver Zone

Data with basic quality improvements:

Duplicates removed
Obvious errors fixed
Standard formats applied

Curated/Gold Zone

Business-ready data:

Business logic applied
Aggregations calculated
Ready for analytics
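
The flow between zones can be sketched in plain Python. This is a toy example on made-up order records, with hypothetical `to_silver` and `to_gold` helpers; a real pipeline would typically run the same steps in Spark or SQL over files in object storage.

```python
from collections import defaultdict

# Bronze: records exactly as received (hypothetical sample data).
bronze = [
    {"order_id": "A1", "region": " east ", "amount": "100.50"},
    {"order_id": "A1", "region": " east ", "amount": "100.50"},  # duplicate
    {"order_id": "A2", "region": "WEST", "amount": "40"},
    {"order_id": "A3", "region": "west", "amount": "n/a"},       # unparseable amount
]

def to_silver(records):
    """Deduplicate by order_id, normalize formats, drop rows that fail parsing."""
    seen, silver = set(), []
    for r in records:
        if r["order_id"] in seen:
            continue
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # a real pipeline would likely quarantine this row instead
        seen.add(r["order_id"])
        silver.append({
            "order_id": r["order_id"],
            "region": r["region"].strip().lower(),
            "amount": amount,
        })
    return silver

def to_gold(silver_rows):
    """Apply business logic: total revenue per region, ready for reporting."""
    totals = defaultdict(float)
    for r in silver_rows:
        totals[r["region"]] += r["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
```

The bronze list is never modified, so any cleaning rule can be changed later and the silver and gold outputs rebuilt from the original records.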

Data Lake Benefits

Flexibility: Store any data without upfront schema design

Cost-effective: Cloud object storage is inexpensive

Future-proofing: Preserve raw data for unknown future uses

Data science: Raw data supports ML model training

Scalability: Handle massive data volumes

Data Lake Challenges

Data Swamp Risk

Without governance, data lakes become “data swamps”:

Nobody knows what data exists
Data quality is unknown
Finding useful data is difficult
Duplicate and conflicting data accumulates
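
A metadata catalog is the usual defense against a swamp. The sketch below is a deliberately small in-memory version, assuming hypothetical `DatasetEntry` and `Catalog` classes and made-up dataset names; production lakes generally rely on a dedicated catalog service instead.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetEntry:
    name: str
    location: str
    owner: str
    description: str
    last_updated: date
    tags: list = field(default_factory=list)

class Catalog:
    """Toy in-memory catalog: every dataset must have an owner and a description."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        if not entry.owner or not entry.description:
            raise ValueError(f"{entry.name}: owner and description are required")
        self._entries[entry.name] = entry

    def search(self, term):
        """Find dataset names by substring in name/description or exact tag match."""
        term = term.lower()
        return sorted(
            e.name for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term == t.lower() for t in e.tags)
        )

    def stale(self, today, max_age_days):
        """List datasets not updated within max_age_days."""
        return sorted(
            e.name for e in self._entries.values()
            if (today - e.last_updated).days > max_age_days
        )

catalog = Catalog()
catalog.register(DatasetEntry("ap_invoices", "raw/erp/ap_invoices/", "finance-ops",
                              "Accounts payable invoice extracts", date(2024, 5, 20),
                              ["finance"]))
catalog.register(DatasetEntry("crm_contacts", "raw/crm/contacts/", "sales-ops",
                              "Nightly CRM contact export", date(2024, 1, 10)))

found = catalog.search("invoice")
old = catalog.stale(date(2024, 6, 1), max_age_days=90)
```

Requiring an owner and description at registration addresses the "nobody knows what data exists" problem, and the staleness check gives a simple signal about data quality drift.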

Skills Required

Data lakes need technical expertise:

Data engineering for pipelines
Data science for analysis
DevOps for infrastructure

Query Performance

Raw data queries can be slow:

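No indexes, partitioning, or columnar layout tuned for common queries
May require preprocessing, such as converting files to a columnar format like Parquet, for acceptable performance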
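
One widely used form of preprocessing is partitioning, so that a query reads only the files it needs. The sketch below shows the idea of partition pruning over a hypothetical file listing; the `partition_values` and `prune` helpers are illustrative and stand in for what a query engine does internally.

```python
# Hypothetical object listing, using Hive-style date partitions in the key.
FILES = [
    "raw/sales/year=2024/month=01/part-0.parquet",
    "raw/sales/year=2024/month=02/part-0.parquet",
    "raw/sales/year=2024/month=03/part-0.parquet",
    "raw/sales/year=2023/month=12/part-0.parquet",
]

def partition_values(key):
    """Extract name=value segments from an object key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)

def prune(files, **wanted):
    """Keep only files whose partition values match every requested filter."""
    return [
        f for f in files
        if all(partition_values(f).get(k) == v for k, v in wanted.items())
    ]

# A query for February 2024 needs to scan one file instead of four.
to_scan = prune(FILES, year="2024", month="02")
```

Without the partition segments in the key, the only way to answer the same query would be to open and scan every file.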

Data Lakehouse

A modern hybrid approach combining lake and warehouse:

Store like a lake: Raw data in object storage

Query like a warehouse: SQL access with good performance

Govern like a warehouse: Schemas, quality, security

Platforms like Databricks and Snowflake support lakehouse patterns, typically through open table formats such as Delta Lake, Apache Iceberg, or Apache Hudi.
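
To make "govern like a warehouse" concrete, the sketch below shows schema enforcement on write in miniature. `GovernedTable` and `SchemaMismatch` are hypothetical names for this toy example; it does not reproduce the API of any real table format, and it assumes each batch is a list of dictionaries.

```python
class SchemaMismatch(Exception):
    """Raised when a batch does not conform to the table's declared schema."""

class GovernedTable:
    """Toy table that validates each batch against a declared schema before appending."""

    def __init__(self, schema):
        self.schema = schema  # column name -> Python type
        self.rows = []

    def append(self, batch):
        for row in batch:
            if set(row) != set(self.schema):
                raise SchemaMismatch(
                    f"columns {sorted(row)} != {sorted(self.schema)}"
                )
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise SchemaMismatch(f"{col}: expected {typ.__name__}")
        # All-or-nothing: rows are added only after the whole batch validates.
        self.rows.extend(batch)

table = GovernedTable({"id": int, "amount": float})
table.append([{"id": 1, "amount": 2.5}])

try:
    # The second row has a string id, so the entire batch is rejected.
    table.append([{"id": 2, "amount": 3.0}, {"id": "x", "amount": 1.0}])
    rejected = False
except SchemaMismatch:
    rejected = True
```

Rejecting the whole batch rather than the single bad row keeps the table consistent, which is the behavior that separates a governed lakehouse table from a folder of raw files.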

How Go Fig Relates to Data Lakes

Go Fig can work with data lakes in several ways:

Source integration: Connect to data stored in your lake

Lake output: Deliver processed data to your lake

Alternative approach: For finance teams, Go Fig may eliminate the need for a separate lake by providing:

Integrated data storage
Business-ready transformations
Excel and dashboard delivery
No data engineering required

Most finance teams don’t need a data lake—they need clean, accessible data in familiar tools. Go Fig provides that without lake complexity.

Put Data Lake Into Practice

Go Fig helps finance teams implement these concepts without massive IT projects. See how we can help.
