Galaxies

Author

Peter Ralph

Published

March 9, 2026

The Galaxy Zoo

galaxyzoo.org

For more than 15 years, we’ve asked volunteers to help us explore galaxies near and far, sampling a fraction of the roughly one hundred billion that are scattered throughout the observable Universe. Each one of the systems, containing billions of stars, has had a unique life, interacting with its surroundings and with other galaxies in many different ways; the aim of the Galaxy Zoo team is to try and understand these processes, and to work out what galaxies can tell us about the past, present and future of the Universe as a whole.

The Galaxy Zoo 1

https://data.galaxyzoo.org/

The original Galaxy Zoo project ran from July 2007 until February 2009. It was replaced by Galaxy Zoo 2, Galaxy Zoo: Hubble, and Galaxy Zoo: CANDELS. In the original Galaxy Zoo project, volunteers classified images of Sloan Digital Sky Survey galaxies as belonging to one of six categories - elliptical, clockwise spiral, anticlockwise spiral, edge-on , star/don’t know, or merger.

Data

  • nearly 900,000 images, each centered on a galaxy
  • each 424 x 424 (color) JPEG
  • I’m looking at only 5,000 of those images
  • … which is 2,696,640,000 values

Goals:

  • describe the color distribution among galaxies
  • see what PCA gives us

montage of 16 galaxies

Color distribution

We’ll use HSV space:

  • Hue is circular (think rainbow)
  • Saturation is white -> colorful
  • Value is black -> colorful

Set-up

from PIL import Image
import matplotlib
import matplotlib.pyplot as plt
import os, glob
import numpy as np
import pandas as pd
import plotnine as p9
import sklearn.decomposition
import colorsys

filenames = glob.glob("data/galaxies/*.jpg")

By pixel

Let’s compare color values in the center to values near the edge:

def get_pixel(filenames, i, j):
    out = pd.DataFrame(
            np.array([Image.open(fn).convert("HSV").getpixel((i, j)) for fn in filenames]),
            columns=("H", "S", "V"),
        )
    return out

pix0 = get_pixel(filenames, 212, 212)
pix1 = get_pixel(filenames, 0, 212)

Value

p9.ggplot(pix0, p9.aes(x='V')) + p9.geom_histogram(bins=32) |  p9.ggplot(pix1, p9.aes(x="V")) + p9.geom_histogram(bins=32)

Saturation

p9.ggplot(pix0, p9.aes(x='S')) + p9.geom_histogram(bins=32) |  p9.ggplot(pix1, p9.aes(x="S")) + p9.geom_histogram(bins=32)

Hue

p9.ggplot(pix0, p9.aes(x='H')) + p9.geom_histogram(bins=32) |  p9.ggplot(pix1, p9.aes(x="H")) + p9.geom_histogram(bins=32)

Image-wide averages

Next, the average value across the image:

RGB = pd.DataFrame(
    np.array([np.asarray(Image.open(fn)).mean(axis=(0,1)) for fn in filenames]),
    columns=("R", "G", "B"),
)
HSV = pd.DataFrame(
    np.array([np.asarray(Image.open(fn).convert("HSV")).mean(axis=(0,1)) for fn in filenames]),
    columns=("H", "S", "V"),
)
colors = pd.concat([RGB, HSV], axis=1)
p9.ggplot(colors, p9.aes(x="R", y="G", color='B')) + p9.geom_point() + p9.scale_color_continuous('Blues')

Extremes

def plot_extremes(filenames, x, n=3):
    fig, axes = plt.subplots(2, n)
    ix = np.argsort(x)
    for ax, k in zip(axes[0], ix[:n]):
        ax.imshow(Image.open(filenames[k]))
        ax.set_title(f"{k}: {x[k]:.2f}")
        ax.set_axis_off()
    for ax, k in zip(axes[1], ix[-n:]):
        ax.imshow(Image.open(filenames[k]))
        ax.set_title(f"{k}: {x[k]:.2f}")
        ax.set_axis_off();

Most blue:

plot_extremes(filenames, colors['B'])

Most red:

plot_extremes(filenames, colors['R'])

Image-wide average, in HSV

p9.ggplot(colors, p9.aes(x="H", y="V", color='S')) + p9.geom_point() + p9.scale_color_continuous('gray')

Most V

plot_extremes(filenames, colors['V'])

Most H

plot_extremes(filenames, colors['H'])

PCA, in RGB space

def im_to_array(fn, size):
    im = Image.open(fn).resize(size=size)
    return np.array([np.asarray(x) for x in im.split()]).flatten()
    
def images_to_array(filenames, size):
    return np.array([im_to_array(fn, size=size) for fn in filenames])

size = (100, 110)
X = images_to_array(filenames, size=size).astype("float")
X -= X.mean(axis=1)[:,np.newaxis]
X.shape
(5000, 33000)

pca = sklearn.decomposition.PCA(n_components=12).fit(X)
pcs = pd.DataFrame(
    pca.transform(X), columns=[f"PC{k+1}" for k in range(pca.n_components_)]
)
ij = pd.Series([f"{x}.{i}.{j}" for x in ('r', 'g', 'b') for i in range(size[0]) for j in range(size[1])])
loadings = pd.concat([
    pd.DataFrame(pca.components_.T, index=ij, columns=[f"PC{k+1}" for k in range(pca.n_components_)])
        .reset_index(names='variable'),
    pd.DataFrame.from_records(ij.str.split("."), columns=('rgb', 'row', 'col')),
], axis=1)
loadings['row'] = loadings['row'].astype('int')
loadings['col'] = loadings['col'].astype('int')
loadings
variable PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 rgb row col
0 r.0.0 -0.002556 -0.000460 0.000462 -0.006387 -0.002465 0.001673 0.007242 -0.000868 0.001055 0.005432 0.010187 -0.001327 r 0 0
1 r.0.1 -0.002568 -0.000415 0.000409 -0.006656 -0.002117 0.001825 0.007844 -0.001226 0.001306 0.006359 0.010463 -0.002005 r 0 1
2 r.0.2 -0.002496 -0.000585 0.000376 -0.006825 -0.001908 0.001726 0.008319 -0.000801 0.001367 0.006715 0.011059 -0.002117 r 0 2
3 r.0.3 -0.002542 -0.000692 0.000187 -0.007098 -0.002292 0.001729 0.008218 -0.000497 0.000891 0.006882 0.011823 -0.002481 r 0 3
4 r.0.4 -0.002634 -0.000789 -0.000240 -0.007402 -0.002236 0.001496 0.008715 0.000083 0.000841 0.007138 0.012801 -0.002597 r 0 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32995 b.99.105 -0.002404 -0.000214 0.000206 0.004532 -0.003090 -0.004303 0.001218 0.004700 0.003450 0.000932 -0.003604 0.002391 b 99 105
32996 b.99.106 -0.002451 -0.000150 0.000306 0.004506 -0.003063 -0.004476 0.001253 0.004087 0.003394 0.001102 -0.003609 0.001687 b 99 106
32997 b.99.107 -0.002275 0.000036 0.000373 0.004389 -0.003284 -0.004167 0.001216 0.004093 0.003440 0.001188 -0.003503 0.001381 b 99 107
32998 b.99.108 -0.002260 0.000057 0.000450 0.003915 -0.003304 -0.003941 0.000727 0.004434 0.003355 0.000867 -0.003392 0.001677 b 99 108
32999 b.99.109 -0.002334 0.000013 0.000459 0.003671 -0.003350 -0.003447 0.000463 0.003925 0.003687 0.000608 -0.003066 0.001869 b 99 109

33000 rows × 16 columns

What are the loadings?

(
    loadings
    >>
    p9.ggplot(p9.aes(x='row', y='PC1', color='factor(col)'))
    + p9.geom_line()
    + p9.facet_grid(rows='rgb')
    + p9.scale_color_discrete(guide=None)
) + p9.theme(figure_size=(12,12))

(
    loadings
    .query("col.isin((0, 24, 49, 74, 99))")
    >>
    p9.ggplot(p9.aes(x='row', y='PC1', color='factor(col)'))
    + p9.geom_line()
    + p9.facet_grid(rows='rgb')
    + p9.scale_color_discrete(guide=None)
) + p9.theme(figure_size=(12,12))

Another way to look at the loadings

def get_pc_image(loadings, k):
    pc_im = np.array([
        loadings.query(f"rgb == '{x}'")[f'PC{k}'].values.reshape((size[1], size[0]))
        for x in "rgb"
    ])
    pc_im -= pc_im.min()
    pc_im *= 255 / (pc_im.max() - pc_im.min())
    pc_im = pc_im.astype("int")
    return np.moveaxis(pc_im, [0,1,2], [2,1,0])

PCs 1-3

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for k, ax in zip(range(1,4), axes):
    ax.imshow(get_pc_image(loadings, k))
    ax.set_title(f"PC{k}");

PCs 4-6

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for k, ax in zip(range(4,7), axes):
    ax.imshow(get_pc_image(loadings, k))
    ax.set_title(f"PC{k}");

PC1

plot_extremes(filenames, pcs['PC1'])

PC2

plot_extremes(filenames, pcs['PC2'])

PC3

plot_extremes(filenames, pcs['PC3'])

PC4

plot_extremes(filenames, pcs['PC4'])