Pre-clustering classification cleaning

Canonicalisation of raw Zooniverse / Panoptes classification data: drops dirty rows, normalises angle ranges, fixes ellipse axis ordering, and derives angular components for clustering. This is not the catalog reduction (clustering) pipeline; it runs upstream of L0→L1A.

NaN sweep — drop incomplete markings


filter_nan_required


def filter_nan_required(
    df:DataFrame
)->DataFrame:

Drop fan and blotch rows that have NaN in any required column.

Required columns: fans need x, y, distance, angle, spread; blotches need x, y, radius_1, radius_2. Rows whose marking is neither fan nor blotch pass through untouched.
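The rule above can be sketched in a few lines of pandas. This is an illustrative re-implementation under stated assumptions, not the library's code: the helper name `nan_sweep_sketch`, the `REQUIRED` table, and the demo frame are hypothetical, and the marking type is assumed to live in a column called `marking`.

```python
import numpy as np
import pandas as pd

# Required columns per marking type, as listed in the docstring above.
REQUIRED = {
    "fan": ["x", "y", "distance", "angle", "spread"],
    "blotch": ["x", "y", "radius_1", "radius_2"],
}

def nan_sweep_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: drop fan/blotch rows with NaN in a required column."""
    keep = pd.Series(True, index=df.index)
    for marking, cols in REQUIRED.items():
        is_kind = df["marking"] == marking
        incomplete = df[cols].isna().any(axis=1)
        # Only rows of this marking type are judged against these columns.
        keep &= ~(is_kind & incomplete)
    return df[keep]

demo = pd.DataFrame({
    "marking": ["fan", "fan", "blotch", "blotch", "none"],
    "x": [10.0, 20.0, 30.0, 40.0, np.nan],
    "y": [10.0, 20.0, 30.0, 40.0, np.nan],
    "distance": [50.0, np.nan, np.nan, np.nan, np.nan],
    "angle": [45.0, 90.0, 10.0, 20.0, np.nan],
    "spread": [5.0, 5.0, np.nan, np.nan, np.nan],
    "radius_1": [np.nan, np.nan, 12.0, np.nan, np.nan],
    "radius_2": [np.nan, np.nan, 8.0, 6.0, np.nan],
})
cleaned = nan_sweep_sketch(demo)
```

In the demo, the fan without a distance and the blotch without radius_1 are dropped, while the all-NaN none row is kept because it is neither a fan nor a blotch.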

Default-marking filter (Zooniverse v1 only)


filter_default_markings


def filter_default_markings(
    df:DataFrame
)->DataFrame:

Drop the legacy Zooniverse-v1 auto-spawned default markings.

Verbatim port of legacy planet4.reduction.filter_data steps 2-5:

  • Origin-pinned default fan: |x|<eps & |y|<eps & |angle|<eps & distance~10.
  • Second-default fan: |angle|~90 & spread~2.017450 & distance~10.
  • Origin-pinned 10x10 default ellipse blotch: |x|<eps & |y|<eps & r1~10 & r2~10.
  • Origin-pinned none row.

Empirically a no-op for Panoptes-12978 data (the new UI does not auto-spawn these defaults on click-without-drag); kept available for legacy reprocessing.
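The four default patterns listed above could be expressed as boolean masks along these lines. This is a minimal sketch, not the legacy code: the tolerance `eps=1e-3`, the reuse of that tolerance for the `~` comparisons, the restriction of the second-default pattern to fan rows, and the helper names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def near(series: pd.Series, value: float, tol: float) -> pd.Series:
    """Element-wise |series - value| <= tol; NaN never matches."""
    return (series - value).abs() <= tol

def drop_default_markings_sketch(df: pd.DataFrame, eps: float = 1e-3) -> pd.DataFrame:
    """Hypothetical sketch of the four default-marking patterns."""
    kind = df["marking"]
    at_origin = (df["x"].abs() < eps) & (df["y"].abs() < eps)
    first_fan = ((kind == "fan") & at_origin
                 & (df["angle"].abs() < eps) & near(df["distance"], 10.0, eps))
    second_fan = ((kind == "fan") & near(df["angle"].abs(), 90.0, eps)
                  & near(df["spread"], 2.017450, eps) & near(df["distance"], 10.0, eps))
    default_blotch = ((kind == "blotch") & at_origin
                      & near(df["radius_1"], 10.0, eps) & near(df["radius_2"], 10.0, eps))
    default_none = (kind == "none") & at_origin
    return df[~(first_fan | second_fan | default_blotch | default_none)]

nan = np.nan
demo = pd.DataFrame({
    "marking":  ["fan", "fan", "fan", "blotch", "blotch", "none"],
    "x":        [0.0, 100.0, 100.0, 0.0, 300.0, 0.0],
    "y":        [0.0, 100.0, 200.0, 0.0, 300.0, 0.0],
    "angle":    [0.0, 90.0, 30.0, 0.0, 0.0, nan],
    "distance": [10.0, 10.0, 80.0, nan, nan, nan],
    "spread":   [5.0, 2.017450, 20.0, nan, nan, nan],
    "radius_1": [nan, nan, nan, 10.0, 10.0, nan],
    "radius_2": [nan, nan, nan, 10.0, 10.0, nan],
})
kept = drop_default_markings_sketch(demo)
```

Only the genuine fan and the 10x10 blotch that is not pinned to the origin survive in this demo, which shows why the origin test matters for the blotch pattern.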

Out-of-frame filter — drop markings far outside tile


filter_out_of_frame


def filter_out_of_frame(
    df:DataFrame, tolerance_px:int=25
)->DataFrame:

Drop markings whose centre falls more than tolerance_px outside the 840x648 tile. marking == "none" rows are exempted (they record ‘volunteer saw nothing’ and have meaningless coords).
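A minimal sketch of this bounds check follows. It assumes the tile is 840 px wide (x) and 648 px high (y), that the marking centre is stored in columns `x` and `y`, and that a centre exactly `tolerance_px` outside the edge is still kept; the function name and demo data are hypothetical.

```python
import numpy as np
import pandas as pd

TILE_WIDTH, TILE_HEIGHT = 840, 648  # assumed orientation: x spans 840, y spans 648

def out_of_frame_sketch(df: pd.DataFrame, tolerance_px: int = 25) -> pd.DataFrame:
    """Hypothetical sketch: keep markings centred within the padded tile."""
    in_x = df["x"].between(-tolerance_px, TILE_WIDTH + tolerance_px)
    in_y = df["y"].between(-tolerance_px, TILE_HEIGHT + tolerance_px)
    is_none = df["marking"] == "none"  # exempt: coordinates carry no meaning
    return df[(in_x & in_y) | is_none]

demo = pd.DataFrame({
    "marking": ["fan", "fan", "blotch", "blotch", "none"],
    "x": [100.0, -30.0, 860.0, 400.0, np.nan],
    "y": [100.0, 100.0, 300.0, 700.0, np.nan],
})
kept = out_of_frame_sketch(demo)
```

With the default tolerance the accepted window is x in [-25, 865] and y in [-25, 673], so the blotch at x = 860 stays while the fan at x = -30 and the blotch at y = 700 are dropped.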

Blotch geometry canonicalisation


canonicalize_blotch_geometry


def canonicalize_blotch_geometry(
    df:DataFrame
)->DataFrame:

Make blotch ellipses canonical: radius_1 >= radius_2, angle in [0, 180).

Verbatim port of legacy planet4.reduction.convert_ellipse_angles: where radius_1 < radius_2 we swap the radii and add 90 deg to the angle, then take angle % 180 (ellipse symmetry).

Modifies the dataframe in place and returns it.
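The swap-and-rotate rule can be sketched as below. This is an illustration only: unlike the documented function it works on a copy rather than in place, it assumes the rule is applied to rows with marking == "blotch" only, and the function name and demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def canonical_blotches_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: radius_1 >= radius_2 and angle in [0, 180) for blotches."""
    out = df.copy()  # the documented function modifies in place instead
    is_blotch = out["marking"] == "blotch"
    swap = is_blotch & (out["radius_1"] < out["radius_2"])
    # Swap the two radii; to_numpy() avoids pandas re-aligning by column label.
    out.loc[swap, ["radius_1", "radius_2"]] = (
        out.loc[swap, ["radius_2", "radius_1"]].to_numpy()
    )
    # Swapping the axes rotates the ellipse's reference axis by 90 degrees.
    out.loc[swap, "angle"] = out.loc[swap, "angle"] + 90
    # An ellipse is unchanged by a 180 degree rotation, so fold into [0, 180).
    out.loc[is_blotch, "angle"] = out.loc[is_blotch, "angle"] % 180
    return out

demo = pd.DataFrame({
    "marking": ["blotch", "blotch", "fan"],
    "radius_1": [5.0, 10.0, np.nan],
    "radius_2": [10.0, 5.0, np.nan],
    "angle": [100.0, -30.0, 200.0],
})
canon = canonical_blotches_sketch(demo)
```

The first blotch becomes (radius_1, radius_2, angle) = (10, 5, 10) because 100 + 90 = 190 folds to 10; the second keeps its radii and its angle of -30 folds to 150; the fan row is left alone.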

Fan angle canonicalisation


canonicalize_fan_angles


def canonicalize_fan_angles(
    df:DataFrame
)->DataFrame:

Fold fan angles into [0, 360).

Verbatim port of legacy planet4.reduction.normalize_fan_angles. Empirically a no-op for Panoptes-12978 (the UI already produces fan angles in [0, 360)); kept for legacy reprocessing.
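Folding into [0, 360) amounts to a modulo operation. The sketch below shows that idea on fan rows only; it is not the legacy normalize_fan_angles code, and the function name and demo frame are hypothetical.

```python
import pandas as pd

def fold_fan_angles_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: fold fan angles into [0, 360) on a copy."""
    out = df.copy()
    is_fan = out["marking"] == "fan"
    # The modulo result takes the sign of the divisor, so negatives wrap upward.
    out.loc[is_fan, "angle"] = out.loc[is_fan, "angle"] % 360
    return out

demo = pd.DataFrame({
    "marking": ["fan", "fan", "fan", "blotch"],
    "angle": [-90.0, 370.0, 45.0, -10.0],
})
folded = fold_fan_angles_sketch(demo)
```

Here -90 becomes 270, 370 becomes 10, and 45 is unchanged; the blotch angle is not touched by this step.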

Angular components for clustering


compute_angle_components


def compute_angle_components(
    df:DataFrame
)->DataFrame:

Add x_angle = cos(deg2rad(angle)) and y_angle = sin(deg2rad(angle)).

These are the angular features the catalog reduction (clustering) reads directly: production.dbscan clusters fans on (x_angle, y_angle) and blotches on y_angle. Must run after the blotch and fan angle canonicalisations so the components reflect the canonical-quadrant angle.
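The two formulas translate directly into NumPy. The sketch below adds the columns to a copy and also measures how far apart 359 deg and 1 deg end up in the (x_angle, y_angle) plane, which illustrates one benefit of clustering on components rather than on raw degrees. The function name and demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def angle_components_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: add unit-circle components of the angle column."""
    out = df.copy()
    radians = np.deg2rad(out["angle"])
    out["x_angle"] = np.cos(radians)
    out["y_angle"] = np.sin(radians)
    return out

demo = pd.DataFrame({"angle": [0.0, 90.0, 180.0, 359.0, 1.0]})
comp = angle_components_sketch(demo)

# Raw degrees put 359 and 1 a distance of 358 apart; on the unit circle they are neighbours.
a, b = comp.iloc[3], comp.iloc[4]
gap = float(np.hypot(a["x_angle"] - b["x_angle"], a["y_angle"] - b["y_angle"]))
```

The gap works out to 2·sin(1 deg), roughly 0.035, so a distance-based clusterer would treat the two directions as close.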

Orchestrator — dispatch per raw source


clean_classifications


def clean_classifications(
    df:DataFrame, source:Literal='panoptes', out_of_frame_tolerance_px:int=25
)->DataFrame:

Top-level orchestrator. Dispatches the right cleanup steps per raw source.

Steps run, in order:

  1. [filter_nan_required](https://michaelaye.github.io/p4tools/production.cleaning.html#filter_nan_required) (always)
  2. [filter_default_markings](https://michaelaye.github.io/p4tools/production.cleaning.html#filter_default_markings) (zooniverse_v1 only)
  3. [filter_out_of_frame](https://michaelaye.github.io/p4tools/production.cleaning.html#filter_out_of_frame) (always)
  4. [canonicalize_blotch_geometry](https://michaelaye.github.io/p4tools/production.cleaning.html#canonicalize_blotch_geometry) (always)
  5. [canonicalize_fan_angles](https://michaelaye.github.io/p4tools/production.cleaning.html#canonicalize_fan_angles) (zooniverse_v1 only — Panoptes already canonical)
  6. [compute_angle_components](https://michaelaye.github.io/p4tools/production.cleaning.html#compute_angle_components) (always)
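The per-source dispatch in the list above can be pictured as a small lookup table. The sketch below only resolves which step names would run for a given source; it does not call the real functions. The `PLAN` table, the `steps_for` helper, and the assumption that 'panoptes' and 'zooniverse_v1' are the only accepted source strings are hypothetical, inferred from the signature default and the step annotations.

```python
from typing import List, Tuple

# (step name, legacy_only) in execution order, mirroring the numbered list above.
PLAN: List[Tuple[str, bool]] = [
    ("filter_nan_required", False),
    ("filter_default_markings", True),
    ("filter_out_of_frame", False),
    ("canonicalize_blotch_geometry", False),
    ("canonicalize_fan_angles", True),
    ("compute_angle_components", False),
]

def steps_for(source: str) -> List[str]:
    """Hypothetical helper: names of the steps that would run for `source`."""
    if source not in ("panoptes", "zooniverse_v1"):  # assumed set of valid sources
        raise ValueError(f"unknown source: {source!r}")
    legacy = source == "zooniverse_v1"
    return [name for name, legacy_only in PLAN if legacy or not legacy_only]

panoptes_steps = steps_for("panoptes")
legacy_steps = steps_for("zooniverse_v1")
```

Under these assumptions a Panoptes run resolves to four steps and a zooniverse_v1 run to all six, with the relative order unchanged so that compute_angle_components always comes last.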