close
close

first Drop

Com TW NOw News 2024

Six Degrees of Kevin Bacon – Postgres Style
news

Six Degrees of Kevin Bacon – Postgres Style

Back in the 90s, when nothing was cool (or so my kids tell me), and at the dawn of the meme era, a few college students came up with a game they called the “Six Degrees of Kevin Bacon.”

The idea behind the Six Degrees of Kevin Bacon was that actor Kevin Bacon could be connected to any other actor through a chain of association of no more than six steps.

Why Kevin Bacon? More or less randomly, but the students had noticed that Bacon had said in an interview that “he’d worked with everybody in Hollywood or somebody who’d worked with them” and saw that statement as a challenge.

The number of steps it takes for an actor to go from being Kevin Bacon is their “Bacon Number.”

For example, comedy legend Steve Martin has a Bacon Number of 1, because Kevin Bacon starred with him in the 1987 road trip comedy Planes, Trains and Automobiles.

Bacon-Martin

Zendaya’s number 2 is Bacon. In 2017, she appeared with Marisa Tomei in Spider-Man: Homecoming, and in 2005, Tomei appeared with Bacon in Loverboy (which Bacon also directed).

Bacon-Zendaya

The Challenge of the Original 90s Six Degrees of Kevin Bacon was to pair two actors together with only the knowledge in your head. That seems incredibly difficult to me, but maybe people in the 90s were smarter.

In our modern age, we don’t need to be smart. We can tackle the Bacon number problem by combining data and algorithms.

The data part of the problem is relatively simple: with the Internet Movie Database (also known as IMDB), we can directly download the information we need.

In particular the

  • name.basics.tsv.gz file for actor names and functions
  • title.basics.tsv.gz file for movie names and dates
  • title.principals.tsv.gz file for actor-film relationships

The IMDB database files actually contain information about every job on a film (writers, directors, producers, casting, etc., etc.) and we are only interested in
actors for the match against Kevin Bacon.

ETL process for downloading and processing raw data

CREATE SCHEMA imdb;

------------------------------------------------------------------------
-- Load the names from the raw file
--

CREATE TABLE imdb.name_basics (
    nconst  text,
    primaryName  text,
    birthYear integer,
    deathYear integer,
    primaryProfession text,
    knownForTitles text
);

COPY imdb.name_basics
    FROM program 'curl https://datasets.imdbws.com/name.basics.tsv.gz | gunzip'
    WITH (
        FORMAT csv,
        DELIMITER E'\t',
        NULL '\N',
        HEADER true,
        QUOTE E'\x01'
    );

CREATE INDEX name_basics_pk ON imdb.name_basics (nconst);

------------------------------------------------------------------------
-- Strip down the raw to just actors and actresses in an 'actors' table
--

CREATE TABLE actors AS
    SELECT int4(substring(nconst, 3,15)) AS actor_id,
        nconst,
        primaryname AS name,
        int2(birthyear) AS birthyear,
        int2(deathyear) AS deathyear
    FROM imdb.name_basics
    WHERE (primaryProfession ~ '^act' OR primaryProfession ~ ',act')
    AND birthyear IS NOT NULL;

CREATE UNIQUE INDEX actors_pk ON actors (actor_id);
CREATE INDEX actors_nconst_x ON actors (nconst);

------------------------------------------------------------------------
-- Load the movie titles from the raw file
--

CREATE TABLE imdb.title_basics (
    tconst text,
    titleType text,
    primaryTitle text,
    originalTitle text,
    isAdult boolean,
    startYear integer,
    endYear integer,
    runtimeMinutes integer,
    genres text
    );

COPY imdb.title_basics
    FROM program 'curl https://datasets.imdbws.com/title.basics.tsv.gz | gunzip'
    WITH (
        FORMAT csv,
        DELIMITER E'\t',
        NULL '\N',
        HEADER true,
        QUOTE E'\x01'
    );

------------------------------------------------------------------------
-- Strip down the raw table to just movies, no tv shows, etc.
--

CREATE TABLE movies AS
    SELECT int4(substring(tconst, 3,15)) AS movie_id,
        tconst,
        primaryTitle AS title,
        int2(startyear) as startyear, int2(endyear) as endyear,
        runtimeminutes as runtimeminutes
    FROM imdb.title_basics
    WHERE titleType="movie"
      AND NOT isadult;

CREATE UNIQUE INDEX movies_pk ON movies (movie_id);
CREATE INDEX movies_tconst_x ON movies (tconst);

------------------------------------------------------------------------
-- Load the raw table of movie/job relationships
--

CREATE TABLE imdb.title_principals (
    tconst text,
    ordering integer,
    nconst text,
    category text,
    job text,
    characters text
);

COPY imdb.title_principals
    FROM program 'curl https://datasets.imdbws.com/title.principals.tsv.gz | gunzip'
    WITH (
        FORMAT csv,
        DELIMITER E'\t',
        NULL '\N',
        HEADER true,
        QUOTE E'\x01'
    );

CREATE INDEX title_principals_tx ON imdb.title_principals (tconst);
CREATE INDEX title_principals_nx ON imdb.title_principals (nconst);

------------------------------------------------------------------------
-- Strip down the raw table to just the ids defining the relationship
--

CREATE TABLE movie_actors AS
    SELECT m.movie_id,
        a.actor_id
    FROM imdb.title_principals i
    JOIN actors a ON a.nconst = i.nconst
    JOIN movies m ON m.tconst = i.tconst
    WHERE i.category ~ '^act';

CREATE INDEX movie_actors_ax ON movie_actors (actor_id);
CREATE INDEX movie_actors_mx ON movie_actors (movie_id);

To make the tables smaller and to speed up the operation, the raw files are stripped during the ETL process. The result is three smaller tables.

  • actors has 371,557 rows
  • movies has 678,204 rows, and
  • movie_actors has 1,866,533 rows.

ERD

So, what is the algorithm we need to calculate the Bacon Number for a random actor? Look at the example above, for Steve Martin and Zendaya. Actors are connected by movies. Together, the actors and movies form a graph! The actors are nodes of the graphics and the films are edges of the graph.

Graphic

And luckily, PostgreSQL already has a graph solver available, pgRouting!

(Wait, isn’t pgRouting meant for solving transportation optimization problems? Well, it is largely used for, but it is built as a fully generic graph solversuitable for all kinds of graph algorithms, including the Dijkstra shortest path algorithm. We need to calculate a Bacon number.)

Alternatively, we can tackle the problem directly within the PostgreSQL core by using a ‘recursive CTE’ to traverse the graph.

Let’s look at both approaches.

For both approaches we need to expand our table of movie/actor relationships to a table of graph borders where each edge represents a pair of actors in the same film.

CREATE TABLE actor_graph AS
SELECT
  row_number() OVER () AS edge_id,
  a.actor_id AS actor_id,
  a.movie_id AS movie_id,
  b.actor_id AS other_actor_id
FROM movie_actors a
JOIN movie_actors b ON a.movie_id = b.movie_id
WHERE a.actor_id != b.actor_id;

CREATE INDEX actor_graph_id_x ON actor_graph (actor_id);
CREATE INDEX actor_graph_other_id_x ON actor_graph (other_actor_id);
CREATE INDEX actor_graph_edge_id_x ON actor_graph (edge_id);

Self-connecting the movie_actors table gives us a table of 11M edges that the
actor_graph.

SELECT * FROM actor_graph LIMIT 5;
 edge_id | actor_id | movie_id | other_actor_id
---------+----------+----------+----------------
       1 |   951773 |  1861414 |         764895
       2 |   951773 |  1861414 |         618628
       3 |   951773 |  1861414 |         244428
       4 |   951773 |  1861414 |         258574
       5 |   951773 |  1861414 |         147923

pgRouting is a unique solver that allows you to dynamically create the graph you are solving for. This makes a lot of sense for a database-based solver, since data in a database is expected to be fairly dynamic.

Each algorithm in pgRouting uses a SQL query that generates the graph to work with, and parameters appropriate for the chosen algorithm.

We use the Dijksta algorithm, so the parameters are just the graph SQLthe starting point and the end junction.

Everything works with integer keys, so we start by finding the keys for two actors, Kevin Spek And Timothée Chalamet.

CREATE EXTENSION pgrouting

SELECT actor_id FROM actors WHERE name="Kevin Bacon";
SELECT actor_id FROM actors WHERE name="Timothée Chalamet";

SELECT seq, node, edge FROM pgr_dijkstra(
    'SELECT
        a.movie_id AS id,
        a.actor_id AS source,
        b.other_actor_id AS target,
        1.0 AS cost
        FROM actor_graph a',
    3154303, -- Timothée Chalamet
    102      -- Kevin Bacon
    );

What you get back (in about 5 seconds) is the list of edges that form the shortest path.

 seq |  node   |   edge
-----+---------+----------
   1 | 3154303 | 11286314
   2 | 2225369 |  1270798
   3 |     102 |       -1

This example is one that goes against the strengths of pgRouting. We are asking for a route over a static graph, not a dynamic graph, and we are routing over a very large graph (11M edges).

Each time the function is executed, the entire graph is fetched from the database, the graph is formed using the node keys, and finally the Dijkstra algorithm is executed.

For our 11M record table this takes about 5 seconds.

For this particular problem, recursive CTE turns out to be a great solution. The graph is static and the number of steps needed to form a shortest path is quite small — no more than 6, according to our rules.

The downside of recursive CTE should be clear from this example and the documentation. In a world of confusing SQL, recursive CTE SQL is the most confusing of all.

Here’s a simple query to perform the same search we performed in pgRouting:

WITH RECURSIVE bacon_numbers AS (
-- Starting nodes
SELECT
  ag.actor_id,
  ag.movie_id,
  ag.other_actor_id,
  1 AS bacon_number,
  ARRAY(ag.edge_id) AS path,
  false AS is_cycle
FROM actor_graph AS ag
WHERE actor_id = 102 -- Kevin Bacon

UNION ALL

-- Recursive set
SELECT
  bn.actor_id,
  ag.movie_id,
  ag.other_actor_id,
  bn.bacon_number + 1 AS bacon_number,
  path || ag.edge_id AS path,
  ag.edge_id = ANY(path) AS is_cycle
FROM actor_graph AS ag
JOIN bacon_numbers AS bn
  ON bn.other_actor_id = ag.actor_id
WHERE bn.bacon_number 

That’s a lot more complex! Since we write the traversal by hand, with a relatively blunt instrument, the result is a lot more complex than the pgRouting solution. On the other hand, this solution runs in only a few hundred milliseconds, so for the Bacon problem it is clearly better.

The output path array is an ordered list of edges that take us from the starting node (Bacon) to the ending node (Chalamet).

       path
-------------------
 {2016551,4962882}

Reconnect the edges and nodes to their human-readable movies and actors.

SELECT a.name, m.title, m.startyear, o.name
FROM unnest('{2016551,4962882}'::integer()) p
JOIN actor_graph ag ON ag.edge_id = p
JOIN actors a USING (actor_id)
JOIN actors o ON ag.other_actor_id = o.actor_id
JOIN movies m USING (movie_id)

And we see the second order path from Kevin Bacon to Timothée Chalamet.

      name       |        title         | startyear |       name
-----------------+----------------------+-----------+-------------------
 Kevin Bacon     | You Should Have Left |      2020 | Amanda Seyfried
 Amanda Seyfried | Love the Coopers     |      2015 | Timothée Chalamet

Bacon-Seyfried

If you want to experiment and find higher numbers, here is a complete PL/PgSQL function to simplify the process.

PL/PgSQL Function for RCTE Bacon Query

CREATE OR REPLACE FUNCTION bacon(query_name text)
RETURNS TABLE(name text, title text, year smallint, othername text, n bigint) AS $$
DECLARE
    bacon_id INTEGER := 102;
    query_id INTEGER;
    row_count INTEGER;
BEGIN
    SELECT actor_id INTO query_id
    FROM actors
    WHERE actors.name = query_name;
    GET DIAGNOSTICS row_count = ROW_COUNT;

    IF (row_count != 1) THEN
        RAISE EXCEPTION 'Found % entries for actor %', row_count, query_name;
    END IF;

    RETURN QUERY
    WITH RECURSIVE bacon_numbers AS (
    SELECT
      ag.actor_id,
      ag.movie_id,
      ag.other_actor_id,
      1 AS bacon_number,
      ARRAY(ag.edge_id) AS path,
      false AS is_cycle
    FROM actor_graph AS ag
    WHERE actor_id = 102 -- Kevin Bacon
    UNION ALL
    SELECT
      bn.actor_id,
      ag.movie_id,
      ag.other_actor_id,
      bn.bacon_number + 1 AS bacon_number,
      path || ag.edge_id AS path,
      ag.edge_id = ANY(path) AS is_cycle
    FROM actor_graph AS ag
    JOIN bacon_numbers AS bn
      ON bn.other_actor_id = ag.actor_id
    WHERE bn.bacon_number 

The highest number I’ve found so far is a 3 for Gates McFadden, who played
Doctor Crusher in Star Trek: The Next Generation.

SELECT * FROM bacon('Gates McFadden');
      name       |           title           | year |    othername    | n
-----------------+---------------------------+------+-----------------+---
 Kevin Bacon     | Beverly Hills Cop: Axel F | 2024 | Bronson Pinchot | 1
 Bronson Pinchot | Babes in Toyland          | 1997 | Jim Belushi     | 2
 Jim Belushi     | Taking Care of Business   | 1990 | Gates McFadden  | 3

Bacon McFadden

  • You don’t need any special external software to solve graph problems. You can tackle these problems in PostgreSQL using pgRouting or a custom recursive CTE.
  • pgRouting is suitable for smaller graphs and is especially powerful for graphs that are dynamically generated to reflect changes in the data or changes in the desired graph.
  • Recursive CTEs can handle much larger graphs, but not large traversals of those graphs, because they tend to accumulate very large intermediate results that get larger with each recursion.