diff --git a/docs/data-platform-manual.md b/docs/data-platform-manual.md
index cc30352..4e2e094 100644
--- a/docs/data-platform-manual.md
+++ b/docs/data-platform-manual.md
@@ -11,7 +11,7 @@ This manual guides data engineers & data analysts (DA/DE) through using Airflow,
 - [Superset](#superset)
 - [Trino](#trino)
 - [Object Storage](#object-storage)
-- [Workflow](#workflow)
+- [Data Flow Diagram](#data-flow-diagram)
 - [Data Pipeline](#data-pipeline)
   - [1. Data Ingestion](#1-data-ingestion)
   - [2. Raw Data Storage](#2-raw-data-storage)
@@ -43,7 +43,7 @@ A distributed **SQL query engine** designed for fast analytics across large data
 
 An **S3-compatible storage provider** (e.g., MinIO) used to store and retrieve unstructured data such as files, logs, and backups. It provides a scalable, API-driven interface for applications to manage data objects, similar to AWS S3.
 
-## Workflow
+## Data Flow Diagram
 
 ```mermaid
 flowchart TB
@@ -179,6 +179,7 @@ This script demonstrates data ingestion & ETL phases.
 
 - `load_excel_to_csv()` - Ingesting data from source excel file, and store raw data as csv in object storage
 - `load_csv_to_trino()` - Extract raw data from csv file, transform it, then load it back to object storage using Trino. Processed data is stored in Trino iceberg format, 'default' schema & 'computer_parts_sales' table.
+- This assumes the sample data source `computer-parts-sales.xlsx` has already been uploaded to the `airflow/excel` folder of the target bucket in object storage (configured in the Data Platform service settings).
 
 ```python
 from airflow.sdk import DAG, task