Universal Mill List (UML) Scraper

Overview

Scraper for the Universal Mill List (UML) from Rainforest Alliance, a public database of palm oil mills worldwide that records each mill's RSPO certification status.

Authentication

None required. Public download.

Source

| Property | Value |
|----------|-------|
| URL | https://www.rainforest-alliance.org/business/certification/the-universal-mill-list/ |
| Format | Excel (.xlsx) or CSV |
| Update Frequency | Monthly |
| Data Provider | Rainforest Alliance / RSPO |

Workflow

1. Scrape page for download link
2. Download Excel/CSV
3. Save to GCS
4. Parse & clean
5. Upsert Supabase
6. Replace SQL Server
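Step 1 of the workflow can be sketched as follows. This is a minimal illustration using only the standard library, not the scraper's actual code: it assumes the download appears on the page as an ordinary `<a href>` ending in `.xlsx` or `.csv`. The names `DownloadLinkParser` and `find_download_link` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = "https://www.rainforest-alliance.org/business/certification/the-universal-mill-list/"


class DownloadLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at .xlsx or .csv files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore query strings and fragments when checking the extension.
        path = href.split("?", 1)[0].split("#", 1)[0].lower()
        if path.endswith((".xlsx", ".csv")):
            self.links.append(href)


def find_download_link(html, base_url=PAGE_URL):
    """Return the first spreadsheet link as an absolute URL, or None."""
    parser = DownloadLinkParser()
    parser.feed(html)
    return urljoin(base_url, parser.links[0]) if parser.links else None


# Hypothetical page fragment, for illustration only.
sample = '<p><a href="/about">About</a> <a href="/files/uml-sample.xlsx?v=2">Download</a></p>'
print(find_download_link(sample))
```

Returning `None` when no link is found lets the caller fail loudly (for example by sending the Discord alert) instead of downloading the wrong file.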

GCS Storage

gs://calee_data/raw/uml/
└── YYYYMMDD_HHMMSS.xlsx   # Timestamped raw files
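
One way to build the timestamped object name shown above is sketched here. The bucket and prefix come from the path above; the choice of UTC for the timestamp and the helper names `raw_object_name` and `raw_gcs_uri` are assumptions, since the source does not say which timezone the filenames use.

```python
from datetime import datetime, timezone

BUCKET = "calee_data"
PREFIX = "raw/uml"


def raw_object_name(now, extension="xlsx"):
    """Object name inside the bucket, e.g. raw/uml/20240131_130005.xlsx."""
    return f"{PREFIX}/{now.strftime('%Y%m%d_%H%M%S')}.{extension}"


def raw_gcs_uri(now, extension="xlsx"):
    """Full gs:// URI for the raw file."""
    return f"gs://{BUCKET}/{raw_object_name(now, extension)}"


# Assumption: timestamps are taken in UTC, matching the server clock.
stamp = datetime(2024, 1, 31, 13, 0, 5, tzinfo=timezone.utc)
print(raw_gcs_uri(stamp))  # gs://calee_data/raw/uml/20240131_130005.xlsx
```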

Database Tables

Supabase: traceability.uml_data

| Column | Type | Description |
|--------|------|-------------|
| uml_id | text | Unique mill identifier (PK) |
| group_name | text | RSPO member group name |
| parent_company | text | Parent company name |
| company_name | text | Operating company name |
| mill_name | text | Mill name |
| address | text | Mill address |
| rspo_status | text | RSPO certification status |
| rspo_type | text | Type of RSPO certification |
| date_rspo_certification_status | date | Certification date |
| latitude | float | GPS latitude (-90 to 90) |
| longitude | float | GPS longitude (-180 to 180) |
| gps_coordinates | geometry | PostGIS Point (SRID 4326) |
| iso | text | ISO country code |
| country | text | Country name |
| province | text | Province/state |
| district | text | District |
| state | text | State |
| confidence_level | text | GPS confidence level |
| alternative_name | text | Alternative mill names |
| updated_at | timestamptz | Last sync timestamp |
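
The coordinate columns imply a small cleaning step: check the latitude and longitude ranges, then derive the PostGIS point. The sketch below is one possible version, not the scraper's actual code. It emits the point as EWKT text (`SRID=4326;POINT(lon lat)`, longitude first), which PostGIS accepts as geometry input. The helper names and the choice to null out invalid coordinates rather than drop the row are assumptions.

```python
def to_float(value):
    """Convert a cell to float, returning None for blanks or junk."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean_coordinates(row):
    """Return a copy of row with validated latitude/longitude and an EWKT point."""
    lat = to_float(row.get("latitude"))
    lon = to_float(row.get("longitude"))
    valid = (
        lat is not None
        and lon is not None
        and -90 <= lat <= 90
        and -180 <= lon <= 180
    )
    cleaned = dict(row)
    if valid:
        cleaned["latitude"] = lat
        cleaned["longitude"] = lon
        # PostGIS points are ordered X Y, i.e. longitude then latitude.
        cleaned["gps_coordinates"] = f"SRID=4326;POINT({lon} {lat})"
    else:
        # Assumption: keep the mill but blank out unusable coordinates.
        cleaned["latitude"] = None
        cleaned["longitude"] = None
        cleaned["gps_coordinates"] = None
    return cleaned


# Hypothetical row, for illustration only.
print(clean_coordinates({"uml_id": "X", "latitude": "1.5", "longitude": "103.25"}))
```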

Unique Constraint: uml_id
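
The unique constraint is what makes the upsert possible. As a rough sketch, the plain PostgreSQL equivalent is an `INSERT ... ON CONFLICT (uml_id) DO UPDATE` statement. The helpers below are hypothetical and only build the SQL text. The `%(name)s` placeholder style assumes a psycopg-like driver, and the real scraper may go through the Supabase client instead. Rows are de-duplicated first because PostgreSQL rejects a single `ON CONFLICT DO UPDATE` statement that would touch the same key twice.

```python
def dedupe_by_key(rows, key="uml_id"):
    """Keep one row per key; when a key repeats, the last row wins."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())


def build_upsert_sql(columns, table="traceability.uml_data", key="uml_id"):
    """Build an INSERT ... ON CONFLICT statement with named placeholders."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    # Every non-key column is overwritten with the incoming value.
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )


print(build_upsert_sql(["uml_id", "mill_name", "updated_at"]))
```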

SQL Server: traceability.uml_data

Identical schema to the Supabase table. The table is fully replaced on each sync.
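
A full replacement is safest when the delete and the reload share one transaction, so a failed load leaves the previous data in place. The sketch below shows that pattern only. It uses the standard-library `sqlite3` module as a stand-in so it runs anywhere, with a two-column table and made-up rows. The actual SQL Server driver, table definition, and load method are not described in this document.

```python
import sqlite3


def replace_table(conn, rows):
    """Delete everything and reload in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM uml_data")
        conn.executemany(
            "INSERT INTO uml_data (uml_id, mill_name) VALUES (?, ?)", rows
        )


# In-memory stand-in database with illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uml_data (uml_id TEXT PRIMARY KEY, mill_name TEXT)")

replace_table(conn, [("A", "Old mill")])
replace_table(conn, [("B", "New mill"), ("C", "Another mill")])
print(conn.execute("SELECT uml_id FROM uml_data ORDER BY uml_id").fetchall())
```

If the insert fails partway (for example on a duplicate key), the rollback also undoes the delete, so readers never see an empty or half-loaded table.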

Quick Start

cd cron/scrapers/uml

# Run the sync
./run.sh

# Or run directly with uv
uv run python main.py

Environment Variables

| Variable | Description |
|----------|-------------|
| SUPABASE_URL | Supabase project URL |
| SUPABASE_KEY | Supabase service role key |
| SQLSERVER_HOST_JINLEE | SQL Server hostname |
| SQLSERVER_DATABASE_JINLEE | SQL Server database name |
| SQLSERVER_USER_JINLEE | SQL Server username |
| SQLSERVER_PASSWORD_JINLEE | SQL Server password |
| DISCORD_WEBHOOK_NOTIFICATIONS | Discord webhook for failure alerts |
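
A startup check along these lines can catch a missing variable before any download begins. It is a sketch, not part of the scraper: the helper name `missing_env_vars` is hypothetical, and treating all seven variables as required (including the Discord webhook) is an assumption.

```python
import os

REQUIRED_VARS = (
    "SUPABASE_URL",
    "SUPABASE_KEY",
    "SQLSERVER_HOST_JINLEE",
    "SQLSERVER_DATABASE_JINLEE",
    "SQLSERVER_USER_JINLEE",
    "SQLSERVER_PASSWORD_JINLEE",
    "DISCORD_WEBHOOK_NOTIFICATIONS",
)


def missing_env_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```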

Schedule

Daily at 21:00 SGT (9 PM Singapore time)

# Server is UTC, so 21:00 SGT = 13:00 UTC
0 13 * * * /home/leeca/workspace/cron/scrapers/uml/run.sh
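
The SGT to UTC conversion behind the cron entry can be checked with a few lines of Python. Singapore time is a fixed UTC+8 offset with no daylight saving, so a fixed-offset timezone is enough. The helper name `sgt_to_utc_cron` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

SGT = timezone(timedelta(hours=8), "SGT")  # Singapore is UTC+8 all year


def sgt_to_utc_cron(hour, minute=0):
    """Return a daily cron expression in UTC for a given SGT wall-clock time."""
    # The date is arbitrary; only the time of day matters for a daily job.
    local = datetime(2024, 1, 1, hour, minute, tzinfo=SGT)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"


print(sgt_to_utc_cron(21))  # 0 13 * * *
```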
