Planet Four: Panoptes Data Extraction Pipeline

Converting Zooniverse Panoptes classification exports to the Planet Four catalog production format

Author

K.-Michael Aye

Published

2026-04-09

1 Overview

The Planet Four citizen science project migrated from the original Zooniverse/Ouroboros platform to the newer Panoptes system. The Panoptes data export format differs significantly from the original: classifications are stored as nested JSON within a flat CSV, rather than the pre-flattened format the existing planet4 catalog production pipeline expects.

This notebook implements and documents the extraction pipeline that converts Panoptes exports into the flat per-marking format required by planet4.reduction. It also serves as a record of the data characteristics and caveats discovered during development.

1.1 Input

  • planet-four-classifications.csv (1.6 GB, 801,430 classifications)
  • planet-four-subjects.csv (subject metadata and image URLs)

1.2 Output

  • One parquet file per workflow with one row per marking (fan, blotch, or none), matching the analysis_cols schema in planet4.reduction.

2 Raw Data Structure

2.1 Workflows

The Panoptes export contains classifications from three workflows:

Workflow ID | Name | Classifications | Share
12978 | P4 Main Workflow | 798,313 | 99.6%
11388 | P4 Workflow Version 1 (Question + Drawing) | 1,806 | 0.2%
6321 | P4 Workflow Version 2 (Drawing Tasks) | 1,311 | 0.2%

Only workflow 12978 is processed. The older workflows (6321, 11388) lack sufficient metadata for reliable tile identification and together account for about 0.4% of the classifications (3,117 of 801,430).

2.2 Annotations JSON

Workflow 12978 classifications contain two tasks in their annotations JSON:

  1. T1 (yes/no question): "Are there seasonal fans and/or blotches visible?" — value is a string ("Yes" / "No")
  2. T0 (drawing task): "Please mark all the seasonal fans and blotches" — value is a list of marking objects

Each marking object contains:

  • Fan (tool=0): x, y, radius (fan length), rotation (direction), spread (opening angle)
  • Blotch (tool=1): x, y, rx, ry (ellipse radii), angle (orientation)

2.3 Subject Metadata

The subject_data JSON embedded in each classification contains tile identification metadata. Two schemas exist:

Rich metadata (21,941 unique tiles): Contains !old_planet_four_id (tile ID), !hirise_image_id (obsid), !x_tile, !y_tile, and !filename. Note that some keys have leading spaces (e.g., " !y_tile").

Sparse metadata (95 tiles from an early upload batch): Contains only !P4_subject_id (tile ID) and !filename. Missing fields are resolved via p4tools.io.get_tile_coords() from the published v3.1 catalog.

2.4 Column Mapping

The extraction maps Panoptes fields to the planet4 pipeline’s analysis_cols schema:

Output Column | Panoptes Source
classification_id | Direct from CSV
created_at | Direct from CSV
image_id | subject_data → !old_planet_four_id or !P4_subject_id
image_name | subject_data → !hirise_image_id (or via tile_coords fallback)
image_url | subjects.csv → locations["0"] (Panoptes CDN URL)
user_name | Direct from CSV
marking | "fan" / "blotch" / "none" from annotation tool field
x_tile, y_tile | subject_data → !x_tile, !y_tile (or via tile_coords)
acquisition_date | Joined via obsid from p4tools.io.get_meta_data().START_TIME
local_mars_time | NaN (requires SPICE computation, out of scope)
x, y | Annotation marking coordinates (tile scope, pixels)
image_x, image_y | Computed: x + 740*(x_tile-1), y + 548*(y_tile-1)
radius_1, radius_2 | Blotch: rx, ry. Fan: NaN
distance | Fan: radius. Blotch: NaN
angle | Fan: rotation. Blotch: angle
spread | Fan: spread. Blotch: 0
version | workflow_version from CSV

3 Implementation



3.1 parse_subject_data


def parse_subject_data(
    subject_data_str:str
)->dict:

Parse the nested subject_data JSON and extract metadata fields.

The subject_data JSON has the structure {"<subject_id>": {key: val, …}}: a single outer key (the subject ID) wrapping the metadata dict. Inner keys may carry leading spaces and a '!' prefix, both of which are stripped.

Returns dict with keys: subject_id, tile_id, obsid, x_tile, y_tile, filename.
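As a rough illustration of the key clean-up described above, the following sketch parses a made-up subject_data record. It is not the p4tools source: the function name parse_subject_data_sketch, the sample values, and the conversion of the outer key to int are assumptions made for this example.

```python
import json


def parse_subject_data_sketch(subject_data_str: str) -> dict:
    """Illustrative only; the real parse_subject_data may differ in detail."""
    raw = json.loads(subject_data_str)
    # The outer dict has a single entry whose key is the subject ID.
    subject_id, inner = next(iter(raw.items()))
    # Strip surrounding whitespace first, then the leading '!'.
    cleaned = {k.strip().lstrip("!"): v for k, v in inner.items()}
    return {
        "subject_id": int(subject_id),  # assumes the key is a numeric string
        # Rich schema uses old_planet_four_id, sparse schema uses P4_subject_id.
        "tile_id": cleaned.get("old_planet_four_id") or cleaned.get("P4_subject_id"),
        "obsid": cleaned.get("hirise_image_id"),
        "x_tile": cleaned.get("x_tile"),
        "y_tile": cleaned.get("y_tile"),
        "filename": cleaned.get("filename"),
    }


# Hypothetical record, invented for illustration (note the leading space on y_tile).
example = json.dumps({
    "12345": {
        "!old_planet_four_id": "APF0000abc",
        "!hirise_image_id": "ESP_012345_0950",
        "!x_tile": 2,
        " !y_tile": 3,
        "!filename": "tile.png",
    }
})
parsed = parse_subject_data_sketch(example)
print(parsed)
```

Fields that are absent in the sparse schema simply come back as None here, which leaves them available for a later fill from tile_coords.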



3.2 parse_annotations


def parse_annotations(
    annotations_str:str
)->list[dict]:

Parse the annotations JSON into a list of marking dicts.

Workflow 12978 has two tasks: T1 (yes/no question) and T0 (drawing). The drawing task is identified as the first task whose value is a list. Returns one dict per marking; an empty value yields [{"marking": "none"}].

  • Fan (tool=0): x, y, distance (from radius), angle (from rotation), spread
  • Blotch (tool=1): x, y, radius_1 (from rx), radius_2 (from ry), angle
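The field renaming above can be sketched as follows. This is a minimal, hypothetical re-implementation rather than the actual parse_annotations: the function name, the sample annotation values, and the assumption that each task is a dict with "task" and "value" keys are all illustrative.

```python
import json

FAN_TOOL, BLOTCH_TOOL = 0, 1


def parse_annotations_sketch(annotations_str: str) -> list[dict]:
    """Illustrative only; flattens one annotations JSON string into marking dicts."""
    tasks = json.loads(annotations_str)
    # The drawing task is taken to be the first task whose value is a list.
    drawing = next((t for t in tasks if isinstance(t.get("value"), list)), None)
    rows = []
    for m in (drawing["value"] if drawing else []):
        if m.get("tool") == FAN_TOOL:
            rows.append({
                "marking": "fan", "x": m.get("x"), "y": m.get("y"),
                "distance": m.get("radius"), "angle": m.get("rotation"),
                "spread": m.get("spread"),
            })
        elif m.get("tool") == BLOTCH_TOOL:
            rows.append({
                "marking": "blotch", "x": m.get("x"), "y": m.get("y"),
                "radius_1": m.get("rx"), "radius_2": m.get("ry"),
                "angle": m.get("angle"),
            })
    # No drawing task, or an empty drawing task, becomes a single "none" row.
    return rows or [{"marking": "none"}]


# Hypothetical annotations, invented for illustration.
sample = json.dumps([
    {"task": "T1", "value": "Yes"},
    {"task": "T0", "value": [
        {"tool": 0, "x": 100.0, "y": 200.0, "radius": 50.0, "rotation": 30.0, "spread": 20.0},
        {"tool": 1, "x": 10.0, "y": 20.0, "rx": 5.0, "ry": 8.0, "angle": 45.0},
    ]},
])
rows = parse_annotations_sketch(sample)
empty_rows = parse_annotations_sketch(
    json.dumps([{"task": "T1", "value": "No"}, {"task": "T0", "value": []}])
)
```

Returning a "none" row instead of an empty list keeps classifications without markings visible in the flattened output.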



3.3 process_chunk


def process_chunk(
    chunk:pd.DataFrame
)->pd.DataFrame:

Process one chunk of the classifications CSV into flat marking rows.

Parses annotations and subject_data, then explodes so each marking becomes its own row.
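The explode step might look roughly like the sketch below, which uses pandas' DataFrame.explode on a toy frame. The intermediate column name markings and the sample rows are assumptions for illustration; the actual process_chunk may organise this differently.

```python
import pandas as pd

# Toy input: one row per classification, with an already-parsed list of markings.
chunk = pd.DataFrame({
    "classification_id": [1, 2],
    "markings": [
        [{"marking": "fan", "x": 10.0, "y": 20.0},
         {"marking": "blotch", "x": 5.0, "y": 6.0}],
        [{"marking": "none"}],
    ],
})

# One row per list element; classification-level columns are repeated.
exploded = chunk.explode("markings", ignore_index=True)

# Expand each marking dict into columns; keys missing from a dict become NaN.
marking_cols = pd.DataFrame(exploded["markings"].tolist())
flat = pd.concat([exploded.drop(columns="markings"), marking_cols], axis=1)
print(flat)
```

Because ignore_index=True resets the index after exploding, the column-wise concat lines up row by row.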



3.4 compute_image_coordinates


def compute_image_coordinates(
    df:pd.DataFrame
)->pd.DataFrame:

Add image_x and image_y from tile-scope x, y and tile indices.

image_x = x + 740 * (x_tile - 1)
image_y = y + 548 * (y_tile - 1)
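A small worked example of this arithmetic, written on plain numbers (the same expressions apply element-wise to pandas Series). The helper name tile_to_image and the sample values are hypothetical; only the constants 740 and 548 come from the formula above.

```python
TILE_STEP_X = 740  # horizontal offset between adjacent tiles, in pixels
TILE_STEP_Y = 548  # vertical offset between adjacent tiles, in pixels


def tile_to_image(x, y, x_tile, y_tile):
    """Convert tile-scope pixel coordinates to full-image coordinates.

    Tile indices are 1-based, so tile (1, 1) leaves x and y unchanged.
    """
    image_x = x + TILE_STEP_X * (x_tile - 1)
    image_y = y + TILE_STEP_Y * (y_tile - 1)
    return image_x, image_y


# A marking at (100, 50) on tile (3, 2):
# image_x = 100 + 740 * 2 = 1580, image_y = 50 + 548 * 1 = 598
image_x, image_y = tile_to_image(100.0, 50.0, 3, 2)
print(image_x, image_y)
```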



3.5 load_subjects_urls


def load_subjects_urls(
    subjects_csv:str | Path
)->pd.DataFrame:

Load subjects CSV and extract image URLs from the locations JSON.

Returns DataFrame with columns: subject_id (int), image_url (str).
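One way such a loader could work is sketched below on an in-memory CSV. The function name, the example URLs, and the de-duplication step (in case a subject appears on several rows) are assumptions for this illustration; the column names subject_id and locations and the "0" key follow the mapping table in section 2.4.

```python
import io
import json

import pandas as pd


def load_subjects_urls_sketch(subjects_csv) -> pd.DataFrame:
    """Illustrative only; accepts a path or a file-like object."""
    subjects = pd.read_csv(subjects_csv, usecols=["subject_id", "locations"])
    # locations is a JSON object; the entry under "0" is taken as the image URL.
    subjects["image_url"] = subjects["locations"].map(
        lambda s: json.loads(s).get("0")
    )
    # Keep one row per subject in case the export lists a subject more than once.
    return (
        subjects[["subject_id", "image_url"]]
        .drop_duplicates("subject_id")
        .reset_index(drop=True)
    )


# Made-up CSV content; inner quotes are doubled as in standard CSV quoting.
csv_text = (
    'subject_id,locations\n'
    '101,"{""0"": ""https://example.org/a.png""}"\n'
    '101,"{""0"": ""https://example.org/a.png""}"\n'
    '102,"{""0"": ""https://example.org/b.png""}"\n'
)
urls = load_subjects_urls_sketch(io.StringIO(csv_text))
print(urls)
```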



3.6 enrich_with_metadata


def enrich_with_metadata(
    df:pd.DataFrame, metadata:pd.DataFrame | None=None, tile_coords:pd.DataFrame | None=None
)->pd.DataFrame:

Join external metadata: fill missing tile coords, obsid, and acquisition_date.

Some subjects in workflow 12978 have a tile_id but no x_tile, y_tile, or obsid; these are filled from tile_coords. acquisition_date is taken from START_TIME in metadata, joined on obsid. local_mars_time is set to NaN because it requires a SPICE computation.
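The tile-coordinate fill can be sketched as a lookup keyed by tile ID that only touches missing values. The helper name and the toy frames are hypothetical, and the tile_coords column names (tile_id, obsid, x_tile, y_tile) are assumed here; the acquisition_date join is left out of this sketch.

```python
import numpy as np
import pandas as pd


def fill_from_tile_coords_sketch(df: pd.DataFrame, tile_coords: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only; fills gaps without overwriting existing values."""
    out = df.copy()
    # Assumes tile_id is unique in tile_coords so it can serve as a lookup index.
    lookup = tile_coords.set_index("tile_id")
    for target, source in [("x_tile", "x_tile"), ("y_tile", "y_tile"), ("image_name", "obsid")]:
        # Map each row's tile ID to the catalog value, then fill only the NaN slots.
        out[target] = out[target].fillna(out["image_id"].map(lookup[source]))
    return out


# Toy data: the second row mimics a sparse-metadata tile.
df = pd.DataFrame({
    "image_id": ["APF_a", "APF_b"],
    "x_tile": [2.0, np.nan],
    "y_tile": [3.0, np.nan],
    "image_name": ["ESP_1", None],
})
tile_coords = pd.DataFrame({
    "tile_id": ["APF_b"], "obsid": ["ESP_2"], "x_tile": [4], "y_tile": [5],
})
filled = fill_from_tile_coords_sketch(df, tile_coords)
print(filled)
```

Using fillna rather than a plain merge keeps the values already present in subject_data authoritative and uses the catalog only as a fallback.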



3.7 extract_panoptes_classifications


def extract_panoptes_classifications(
    classifications_csv:str | Path, subjects_csv:str | Path | None=None, output_dir:str | Path | None=None,
    chunksize:int=50000
)->Path:

Extract Panoptes classifications CSV into a flat marking parquet file.

Only processes the main workflow (12978). Old workflows (6321, 11388) are skipped with a warning.
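The chunked read with workflow filtering might be organised along these lines. This is a sketch on a tiny in-memory CSV, not the actual function body; the variable names and the warning text are made up, while the workflow ID and the use of chunksize mirror the description above.

```python
import io
import warnings

import pandas as pd

MAIN_WORKFLOW = 12978

# Tiny stand-in for the classifications CSV.
csv_text = (
    "classification_id,workflow_id\n"
    "1,12978\n"
    "2,6321\n"
    "3,12978\n"
    "4,11388\n"
)

kept = []
skipped = 0
# chunksize makes read_csv yield DataFrames of at most that many rows,
# so the full file never has to fit in memory at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    is_main = chunk["workflow_id"] == MAIN_WORKFLOW
    skipped += int((~is_main).sum())
    kept.append(chunk[is_main])

main = pd.concat(kept, ignore_index=True)
if skipped:
    warnings.warn(f"Skipped {skipped} classifications from older workflows")
print(main)
```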

4 Dataset Comparison: Panoptes vs v3.1 Catalog

An important finding during development: the Panoptes data is not a subset of the published v3.1 catalog. The two datasets partially overlap but each contains unique observations.

import json
import pandas as pd
from p4tools import io
from pathlib import Path

DATA_DIR = Path('/Users/maye/Dropbox/Documents/01_projects/planet4/data/raw/p4_original_zooniverse')

# Scan all workflow 12978 classifications for obsids and tile_ids
reader = pd.read_csv(DATA_DIR / 'planet-four-classifications.csv', chunksize=50000)

panoptes_obsids = set()
panoptes_tiles = set()
# Collect unique tile IDs per metadata schema; a plain counter would
# count classifications rather than tiles
full_meta_tiles = set()
sparse_meta_tiles = set()

for chunk in reader:
    main = chunk[chunk.workflow_id == 12978]
    for sd_str in main.subject_data:
        raw = json.loads(sd_str)
        sid = list(raw.keys())[0]
        inner = raw[sid]
        # Normalise keys: strip whitespace, then the leading '!'
        cleaned = {k.strip().lstrip('!'): v for k, v in inner.items()}

        obsid = cleaned.get('hirise_image_id')
        tile_id = cleaned.get('old_planet_four_id') or cleaned.get('P4_subject_id')

        if obsid:
            panoptes_obsids.add(obsid)
        if tile_id:
            panoptes_tiles.add(tile_id)
        if cleaned.get('x_tile') is not None:
            full_meta_tiles.add(tile_id)
        else:
            sparse_meta_tiles.add(tile_id)

tiles_with_full_meta = len(full_meta_tiles)
tiles_with_sparse_meta = len(sparse_meta_tiles)

tc = io.get_tile_coords()
catalog_obsids = set(tc.obsid.unique())
catalog_tiles = set(tc.tile_id.unique())

shared_obsids = panoptes_obsids & catalog_obsids
new_obsids = panoptes_obsids - catalog_obsids
only_catalog_obsids = catalog_obsids - panoptes_obsids

shared_tiles = panoptes_tiles & catalog_tiles
new_tiles = panoptes_tiles - catalog_tiles

print("=== Observation-level (obsid) comparison ===")
print(f"  Panoptes obsids:          {len(panoptes_obsids):>5}")
print(f"  v3.1 catalog obsids:      {len(catalog_obsids):>5}")
print(f"  Shared:                   {len(shared_obsids):>5}")
print(f"  NEW in Panoptes only:     {len(new_obsids):>5}")
print(f"  Only in v3.1 catalog:     {len(only_catalog_obsids):>5}")
print()
print("=== Tile-level comparison ===")
print(f"  Panoptes tiles:           {len(panoptes_tiles):>5}")
print(f"  v3.1 catalog tiles:       {len(catalog_tiles):>5}")
print(f"  Shared:                   {len(shared_tiles):>5}")
print(f"  NEW in Panoptes only:     {len(new_tiles):>5}")
print()
print("=== Subject metadata completeness ===")
print(f"  Full metadata (x_tile, y_tile, obsid): {tiles_with_full_meta}")
print(f"  Sparse metadata (tile_id only):        {tiles_with_sparse_meta}")
print("=== NEW HiRISE observations in Panoptes (not in v3.1 catalog) ===")
print(f"These {len(new_obsids)} obsids represent newer HiRISE data acquired after the v3.1 catalog was built:\n")
for obsid in sorted(new_obsids):
    print(f"  {obsid}")

4.1 Interpretation

The Panoptes dataset contains 44 HiRISE observations not present in the v3.1 catalog. These are newer observations (ESP numbers in the 046xxx–067xxx range) acquired after catalog v3.1 was produced. They represent new science data that will be included in the next catalog release.

Conversely, 410 obsids from the v3.1 catalog are not in the Panoptes export — these were classified on the original Ouroboros platform and were never re-uploaded to Panoptes.

The 94 tiles with sparse metadata (tile_id only, no x_tile/y_tile/obsid in subject_data) all belong to the 59 shared obsids. They were from an early Panoptes upload batch before the richer metadata convention was adopted. All but one could be resolved via the v3.1 tile_coords lookup.

The 628 tiles unique to Panoptes (not in v3.1 tile_coords) all have full metadata in subject_data, so they are self-contained and require no external lookups.

4.2 Implications for Catalog Production

  • The Panoptes extraction output can be fed directly into the existing planet4 clustering pipeline
  • New obsids will need HiRISE metadata (ground projection, SPICE) before full catalog integration
  • The combined dataset (v3.1 + Panoptes new data) would span a wider temporal range than either alone

5 Validation

Quick verification of the parsing functions on a small sample.

# Load a sample with actual markings
reader = pd.read_csv(DATA_DIR / 'planet-four-classifications.csv', chunksize=50000)
chunks = []
for chunk in reader:
    main = chunk[chunk.workflow_id == 12978]
    if len(main) > 0:
        chunks.append(main.head(25))
    if len(chunks) >= 3:
        break
main_sample = pd.concat(chunks).reset_index(drop=True)
print(f'Sample size: {len(main_sample)} classifications')
# Test full pipeline on sample
result = process_chunk(main_sample)
print(f'Input: {len(main_sample)} classifications → Output: {len(result)} marking rows')
print(f'Marking types: {result.marking.value_counts().to_dict()}')

# Enrich metadata
metadata = io.get_meta_data()
tile_coords = io.get_tile_coords()
result = enrich_with_metadata(result, metadata, tile_coords)
print(f'\nAfter metadata enrichment:')
print(f'  Missing x_tile: {result.x_tile.isna().sum()}')
print(f'  Missing image_name: {result.image_name.isna().sum()}')
print(f'  Has acquisition_date: {result.acquisition_date.notna().sum()}/{len(result)}')

# Compute coordinates
result = compute_image_coordinates(result)
fans = result[result.marking == 'fan']
if len(fans) > 0:
    print(f'\nSample fan with coordinates:')
    print(fans[['image_id', 'image_name', 'x', 'y', 'x_tile', 'y_tile', 'image_x', 'image_y']].head(3).to_string())

# Verify tile_id format
valid = result[result.image_id.notna()].image_id.str.startswith('APF').all()
print(f'\nAll tile_ids valid APF format: {valid}')

6 Usage

To run the full extraction (processes 1.6 GB, takes a few minutes):

from p4tools.panoptes_extract import extract_panoptes_classifications

outpath = extract_panoptes_classifications(
    'data/raw/p4_original_zooniverse/planet-four-classifications.csv',
    subjects_csv='data/raw/p4_original_zooniverse/planet-four-subjects.csv',
)

The output parquet file can be read directly by planet4.io.DBManager for clustering and catalog production.