Converting Zooniverse Panoptes classification exports to the Planet Four catalog production format
Author
K.-Michael Aye
Published
2026-04-09
1 Overview
The Planet Four citizen science project migrated from the original Zooniverse/Ouroboros platform to the newer Panoptes system. The Panoptes data export format differs significantly from the original: classifications are stored as nested JSON within a flat CSV, rather than the pre-flattened format the existing planet4 catalog production pipeline expects.
This notebook implements and documents the extraction pipeline that converts Panoptes exports into the flat per-marking format required by planet4.reduction. It also serves as a record of the data characteristics and caveats discovered during development.
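1.1 Input
planet-four-classifications.csv (classification export with annotations and subject_data JSON)
planet-four-subjects.csv (subject metadata and image URLs)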
1.2 Output
One parquet file per workflow with one row per marking (fan, blotch, or none), matching the analysis_cols schema in planet4.reduction.
2 Raw Data Structure
2.1 Workflows
The Panoptes export contains classifications from three workflows:
| Workflow ID | Name | Classifications | Share |
|---|---|---|---|
| 12978 | P4 Main Workflow | 798,313 | 99.6% |
| 11388 | P4 Workflow Version 1 (Question + Drawing) | 1,806 | 0.2% |
| 6321 | P4 Workflow Version 2 (Drawing Tasks) | 1,311 | 0.2% |
Only workflow 12978 is processed. The older workflows (6321, 11388) lack sufficient metadata for reliable tile identification and represent negligible data volume.
2.2 Annotations JSON
Workflow 12978 classifications contain two tasks in their annotations JSON:
T1 (yes/no question): "Are there seasonal fans and/or blotches visible?" — value is a string ("Yes" / "No")
T0 (drawing task): "Please mark all the seasonal fans and blotches" — value is a list of marking objects
Each marking object contains:
Fan (tool=0): x, y, radius (fan length), rotation (direction), spread (opening angle)
Blotch (tool=1): x, y, rx, ry (ellipse radii), angle (orientation)
2.3 Subject Metadata
The subject_data JSON embedded in each classification contains tile identification metadata. Two schemas exist:
Rich metadata (21,941 unique tiles): Contains !old_planet_four_id (tile ID), !hirise_image_id (obsid), !x_tile, !y_tile, and !filename. Note that some keys have leading spaces (e.g., " !y_tile").
Sparse metadata (95 tiles from an early upload batch): Contains only !P4_subject_id (tile ID) and !filename. Missing fields are resolved via p4tools.io.get_tile_coords() from the published v3.1 catalog.
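The two schemas can be handled by one normalization step. The following is a minimal sketch, not the pipeline's actual code: the helper names `clean_subject_metadata` and `identify_tile` and the sample records are hypothetical, while the key handling (strip whitespace, then drop the leading `!`) mirrors the comparison code in section 4.

```python
import json


def clean_subject_metadata(subject_data_json):
    """Return (subject_id, metadata) with whitespace and leading '!' removed from keys."""
    raw = json.loads(subject_data_json)
    subject_id = next(iter(raw))  # subject_data has a single top-level key
    inner = raw[subject_id]
    cleaned = {key.strip().lstrip("!"): value for key, value in inner.items()}
    return subject_id, cleaned


def identify_tile(cleaned):
    """Pick the tile ID from whichever schema is present (hypothetical helper)."""
    tile_id = cleaned.get("old_planet_four_id") or cleaned.get("P4_subject_id")
    return {
        "tile_id": tile_id,
        "obsid": cleaned.get("hirise_image_id"),
        "x_tile": cleaned.get("x_tile"),
        "y_tile": cleaned.get("y_tile"),
        # Sparse records carry no tile coordinates and need the external lookup
        "is_sparse": cleaned.get("x_tile") is None,
    }


# Made-up records for illustration only; note the leading space in " !y_tile"
rich_record = json.dumps({"1001": {
    "!old_planet_four_id": "tile_a", "!hirise_image_id": "obsid_a",
    "!x_tile": 3, " !y_tile": 7, "!filename": "tile_a.png"}})
sparse_record = json.dumps({"1002": {
    "!P4_subject_id": "tile_b", "!filename": "tile_b.png"}})

rich_tile = identify_tile(clean_subject_metadata(rich_record)[1])
sparse_tile = identify_tile(clean_subject_metadata(sparse_record)[1])
```

Records flagged `is_sparse` are the ones that would be passed on to the tile_coords lookup.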
2.4 Column Mapping
The extraction maps Panoptes fields to the planet4 pipeline’s analysis_cols schema:
| Output Column | Panoptes Source |
|---|---|
| classification_id | Direct from CSV |
| created_at | Direct from CSV |
| image_id | subject_data → !old_planet_four_id or !P4_subject_id |
| image_name | subject_data → !hirise_image_id (or via tile_coords fallback) |
| image_url | subjects.csv → locations["0"] (Panoptes CDN URL) |
| user_name | Direct from CSV |
| marking | "fan" / "blotch" / "none" from annotation tool field |
| x_tile, y_tile | subject_data → !x_tile, !y_tile (or via tile_coords) |
| acquisition_date | Joined via obsid from p4tools.io.get_meta_data().START_TIME |
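As an illustration of the image_url mapping, the sketch below builds a subject-to-URL lookup from the locations field. It assumes that locations is a JSON string keyed by frame index and that the subjects export has a subject_id column; the function name `build_url_lookup`, the sample rows, and the URL are hypothetical.

```python
import json


def build_url_lookup(subject_rows):
    """Map subject_id -> image URL taken from locations["0"] (hypothetical helper)."""
    lookup = {}
    for row in subject_rows:
        locations = json.loads(row["locations"])
        url = locations.get("0")
        if url is not None:
            lookup[row["subject_id"]] = url
    return lookup


# Made-up rows standing in for records read from planet-four-subjects.csv
sample_rows = [
    {"subject_id": 1001, "locations": json.dumps({"0": "https://example.org/tile_a.png"})},
    {"subject_id": 1002, "locations": json.dumps({})},  # no image location recorded
]
url_lookup = build_url_lookup(sample_rows)
```

Subjects without a "0" entry are left out of the lookup, so a later join would yield a missing image_url for them rather than an error.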
Parse the annotations JSON into a list of marking dicts.
Workflow 12978 has two tasks: T1 (yes/no question) and T0 (drawing). We find the drawing task by looking for the first task whose value is a list. Returns one dict per marking. Empty value → [{"marking": "none"}].
Fan (tool=0): x, y, distance (from radius), angle (from rotation), spread. Blotch (tool=1): x, y, radius_1 (from rx), radius_2 (from ry), angle.
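The behaviour described above can be sketched as follows. This is an illustrative reimplementation, not the notebook's original function: it assumes the annotations JSON is a list of task objects with a value field, and it treats a missing drawing task or an unknown tool index the same way as an empty value, which is an assumption of this sketch.

```python
import json

FAN_TOOL, BLOTCH_TOOL = 0, 1


def parse_annotations(annotations_json):
    """Return one dict per marking, or [{"marking": "none"}] if nothing was drawn."""
    tasks = json.loads(annotations_json)
    # The drawing task is the first task whose value is a list
    drawing = next((t for t in tasks if isinstance(t.get("value"), list)), None)
    if drawing is None or not drawing["value"]:
        return [{"marking": "none"}]

    markings = []
    for item in drawing["value"]:
        tool = item.get("tool")
        if tool == FAN_TOOL:
            markings.append({
                "marking": "fan", "x": item.get("x"), "y": item.get("y"),
                "distance": item.get("radius"),  # fan length
                "angle": item.get("rotation"),   # fan direction
                "spread": item.get("spread"),    # opening angle
            })
        elif tool == BLOTCH_TOOL:
            markings.append({
                "marking": "blotch", "x": item.get("x"), "y": item.get("y"),
                "radius_1": item.get("rx"), "radius_2": item.get("ry"),
                "angle": item.get("angle"),
            })
        # Other tool values are skipped (assumption of this sketch)
    return markings or [{"marking": "none"}]


# Made-up annotation for illustration
sample_annotations = json.dumps([
    {"task": "T1", "value": "Yes"},
    {"task": "T0", "value": [
        {"tool": 0, "x": 10, "y": 20, "radius": 30, "rotation": 45, "spread": 15},
        {"tool": 1, "x": 5, "y": 6, "rx": 7, "ry": 8, "angle": 90},
    ]},
])
parsed = parse_annotations(sample_annotations)
empty = parse_annotations(json.dumps([{"task": "T1", "value": "No"}, {"task": "T0", "value": []}]))
```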
Join external metadata: fill missing tile coords, obsid, and acquisition_date.
Some subjects in workflow 12978 have tile_id but no x_tile/y_tile/obsid. These are filled from tile_coords. acquisition_date comes from START_TIME. local_mars_time is set to NaN (requires SPICE computation).
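A minimal sketch of this join on toy data is shown below. The column names of the lookup tables (tile_id, obsid, x_tile, y_tile for the tile coordinates and obsid, START_TIME for the metadata) are assumptions of this sketch, as are the function name `join_metadata` and all sample values; the real tables come from p4tools.

```python
import numpy as np
import pandas as pd


def join_metadata(markings, tile_coords, meta):
    """Fill missing tile coords and obsid, then attach acquisition_date."""
    out = markings.copy()
    coords = tile_coords.drop_duplicates("tile_id").set_index("tile_id")
    # Fill only the missing values, keyed on the tile ID stored in image_id
    for out_col, src_col in [("x_tile", "x_tile"), ("y_tile", "y_tile"), ("image_name", "obsid")]:
        out[out_col] = out[out_col].fillna(out["image_id"].map(coords[src_col]))
    start_times = meta.drop_duplicates("obsid").set_index("obsid")["START_TIME"]
    out["acquisition_date"] = out["image_name"].map(start_times)
    out["local_mars_time"] = np.nan  # would require a SPICE computation
    return out


# Toy inputs: the second marking has a tile ID but no coordinates or obsid
markings = pd.DataFrame({
    "image_id": ["tile_a", "tile_b"],
    "image_name": ["obsid_a", None],
    "x_tile": [2.0, None],
    "y_tile": [5.0, None],
})
tile_coords = pd.DataFrame({"tile_id": ["tile_b"], "obsid": ["obsid_b"], "x_tile": [4], "y_tile": [9]})
meta = pd.DataFrame({
    "obsid": ["obsid_a", "obsid_b"],
    "START_TIME": pd.to_datetime(["2009-01-01", "2011-02-03"]),
})
joined = join_metadata(markings, tile_coords, meta)
```

Using fillna rather than a plain merge keeps the values already present in subject_data and touches only the rows that lack them.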
Extract Panoptes classifications CSV into a flat marking parquet file.
Only processes the main workflow (12978). Old workflows (6321, 11388) are skipped with a warning.
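The workflow filter with its warning could look roughly like the sketch below. It covers only the chunked read and the filtering step (parsing and the parquet write are omitted); the function name `select_main_workflow` and the inline sample CSV are hypothetical.

```python
import io
import warnings

import pandas as pd

MAIN_WORKFLOW = 12978


def select_main_workflow(csv_source, chunksize=50000):
    """Read the export in chunks, keep workflow 12978, warn about skipped rows."""
    kept, skipped = [], {}
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        is_main = chunk["workflow_id"] == MAIN_WORKFLOW
        # Count the classifications of every other workflow for the warning
        for workflow, count in chunk.loc[~is_main, "workflow_id"].value_counts().items():
            skipped[int(workflow)] = skipped.get(int(workflow), 0) + int(count)
        if is_main.any():
            kept.append(chunk[is_main])
    if skipped:
        warnings.warn(f"Skipped classifications from old workflows: {skipped}")
    return pd.concat(kept, ignore_index=True) if kept else pd.DataFrame()


# Made-up miniature export for illustration
sample_csv = "classification_id,workflow_id\n1,12978\n2,6321\n3,12978\n4,11388\n"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    main_only = select_main_workflow(io.StringIO(sample_csv), chunksize=2)
```

Reading in chunks keeps memory use bounded for the full export, which is the same approach the comparison code in section 4 takes.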
4 Dataset Comparison: Panoptes vs v3.1 Catalog
An important finding during development: the Panoptes data is not a subset of the published v3.1 catalog. The two datasets partially overlap but each contains unique observations.
import json
from pathlib import Path

import pandas as pd
from p4tools import io

DATA_DIR = Path('/Users/maye/Dropbox/Documents/01_projects/planet4/data/raw/p4_original_zooniverse')

# Scan all workflow 12978 classifications for obsids and tile_ids
reader = pd.read_csv(DATA_DIR / 'planet-four-classifications.csv', chunksize=50000)
panoptes_obsids = set()
panoptes_tiles = set()
tiles_with_full_meta = 0
tiles_with_sparse_meta = 0
for chunk in reader:
    main = chunk[chunk.workflow_id == 12978]
    for sd_str in main.subject_data:
        raw = json.loads(sd_str)
        sid = list(raw.keys())[0]
        inner = raw[sid]
        # Normalize keys: strip whitespace, then the leading '!'
        cleaned = {k.strip().lstrip('!'): v for k, v in inner.items()}
        obsid = cleaned.get('hirise_image_id')
        tile_id = cleaned.get('old_planet_four_id') or cleaned.get('P4_subject_id')
        if obsid:
            panoptes_obsids.add(obsid)
        if tile_id:
            panoptes_tiles.add(tile_id)
            if cleaned.get('x_tile') is not None:
                tiles_with_full_meta += 1
            else:
                tiles_with_sparse_meta += 1

# Compare against the tile coordinates of the published v3.1 catalog
tc = io.get_tile_coords()
catalog_obsids = set(tc.obsid.unique())
catalog_tiles = set(tc.tile_id.unique())
shared_obsids = panoptes_obsids & catalog_obsids
new_obsids = panoptes_obsids - catalog_obsids
only_catalog_obsids = catalog_obsids - panoptes_obsids
shared_tiles = panoptes_tiles & catalog_tiles
new_tiles = panoptes_tiles - catalog_tiles

print("=== Observation-level (obsid) comparison ===")
print(f" Panoptes obsids: {len(panoptes_obsids):>5}")
print(f" v3.1 catalog obsids: {len(catalog_obsids):>5}")
print(f" Shared: {len(shared_obsids):>5}")
print(f" NEW in Panoptes only: {len(new_obsids):>5}")
print(f" Only in v3.1 catalog: {len(only_catalog_obsids):>5}")
print()
print("=== Tile-level comparison ===")
print(f" Panoptes tiles: {len(panoptes_tiles):>5}")
print(f" v3.1 catalog tiles: {len(catalog_tiles):>5}")
print(f" Shared: {len(shared_tiles):>5}")
print(f" NEW in Panoptes only: {len(new_tiles):>5}")
print()
print("=== Subject metadata completeness ===")
print(f" Full metadata (x_tile, y_tile, obsid): {tiles_with_full_meta}")
print(f" Sparse metadata (tile_id only): {tiles_with_sparse_meta}")
print("=== NEW HiRISE observations in Panoptes (not in v3.1 catalog) ===")
print(f"These {len(new_obsids)} obsids represent newer HiRISE data acquired after the v3.1 catalog was built:\n")
for obsid in sorted(new_obsids):
    print(f" {obsid}")
4.1 Interpretation
The Panoptes dataset contains 44 HiRISE observations not present in the v3.1 catalog. These are newer observations (ESP numbers in the 046xxx–067xxx range) acquired after catalog v3.1 was produced. They represent new science data that will be included in the next catalog release.
Conversely, 410 obsids from the v3.1 catalog are not in the Panoptes export — these were classified on the original Ouroboros platform and were never re-uploaded to Panoptes.
The 94 tiles with sparse metadata (tile_id only, no x_tile/y_tile/obsid in subject_data) all belong to the 59 shared obsids. They were from an early Panoptes upload batch before the richer metadata convention was adopted. All but one could be resolved via the v3.1 tile_coords lookup.
The 628 tiles unique to Panoptes (not in v3.1 tile_coords) all have full metadata in subject_data, so they are self-contained and require no external lookups.
4.2 Implications for Catalog Production
The Panoptes extraction output can be fed directly into the existing planet4 clustering pipeline
New obsids will need HiRISE metadata (ground projection, SPICE) before full catalog integration
The combined dataset (v3.1 + Panoptes new data) would span a wider temporal range than either alone
5 Validation
Quick verification of the parsing functions on a small sample.
# Load a small sample of workflow 12978 classifications:
# up to 25 rows from each of the first three chunks that contain any
reader = pd.read_csv(DATA_DIR / 'planet-four-classifications.csv', chunksize=50000)
chunks = []
for chunk in reader:
    main = chunk[chunk.workflow_id == 12978]
    if len(main) > 0:
        chunks.append(main.head(25))
    if len(chunks) >= 3:
        break
main_sample = pd.concat(chunks).reset_index(drop=True)
print(f'Sample size: {len(main_sample)} classifications')