Written 16 Feb 2024

In this article, I use the term “tweet”, but my ‘bot’ was banned from Twitter, so now it only publishes to Mastodon and to https://www.caltrain.live

A few years ago, I started taking Caltrain to work and was frustrated by the lack of communication about train status. At the time, statuses were tweeted by @CaltrainAlerts during business hours, by a human. I figured my robots could do better, so I started digging into the 511.org API, which is updated with train data once a minute.

The first step was to collect data. I started polling the API every minute and collecting:

  • Train Number
  • Timestamp stored in “minutes since midnight”
  • Direction
  • Latitude/Longitude (stored raw and as geohashes at two different precision levels)
  • Line reference
  • Origin
  • Destination
  • Day of the week
  • Distance from origin
  • Distance from destination

I stored this data in AWS DynamoDB and set a Time-to-Live (TTL) of 30 days for weekday trains and 90 days for weekend trains.
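
As a rough illustration of the record shape, here is a minimal sketch of how a polled observation might be stamped with “minutes since midnight”, the day of the week, and an expiry time. DynamoDB's TTL feature expects a Number attribute holding a Unix epoch timestamp in seconds; the attribute name `ttl`, the helper names, the use of Pacific time, and the sample train numbers are all assumptions made for this example, not details from my actual script.

```javascript
const DAY_SECONDS = 24 * 60 * 60;

// Formatter for wall-clock time in Caltrain's time zone (assumed to be Pacific).
const pacific = new Intl.DateTimeFormat('en-US', {
  timeZone: 'America/Los_Angeles',
  hourCycle: 'h23',
  weekday: 'short',
  hour: '2-digit',
  minute: '2-digit',
});

// Returns the local day of the week and minutes since local midnight.
function localClock(date) {
  const parts = {};
  for (const { type, value } of pacific.formatToParts(date)) {
    parts[type] = value;
  }
  const hour = Number(parts.hour) % 24; // guard against a "24" at midnight
  return {
    dayOfWeek: parts.weekday,
    minutesSinceMidnight: hour * 60 + Number(parts.minute),
  };
}

// Adds time fields and a TTL (epoch seconds) to a polled observation.
// Weekday records expire after 30 days, weekend records after 90 days.
function buildRecord(observation, date) {
  const { dayOfWeek, minutesSinceMidnight } = localClock(date);
  const isWeekend = dayOfWeek === 'Sat' || dayOfWeek === 'Sun';
  const retentionDays = isWeekend ? 90 : 30;
  const nowSeconds = Math.floor(date.getTime() / 1000);
  return {
    ...observation,
    dayOfWeek,
    minutesSinceMidnight,
    ttl: nowSeconds + retentionDays * DAY_SECONDS, // hypothetical attribute name
  };
}

// Example with a made-up train number:
const example = buildRecord({ trainId: '123' }, new Date('2024-02-16T20:30:00Z'));
console.log(example.dayOfWeek, example.minutesSinceMidnight);
```

Weekend trains presumably get the longer retention because there are far fewer weekend days per month, so it takes longer to build up a comparable history.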

After accumulating 30 days of data, I could start analyzing it. Let’s explore the primary isTrainLate function, which is divided into two components:

  1. Assessing Train Position: This involves determining whether the train is behind its usual position for the given time. I do this by querying a database secondary index, “trainId-minutesSinceMidnight”, with the current time, taking the median of the resulting locations (miles from origin), and comparing that against the train’s current location.
  2. Determining Time Delays: For this, I consult a different secondary index, “trainId-geohash”, to fetch the minutesSinceMidnight of every historical record at the train’s current location. This tells me when the train is usually at its current approximate location. Subtracting the 75th percentile of those values from the current minutesSinceMidnight gives the delay in minutes, which is how late the train is.

A train is deemed late if it is more than 3 miles off its expected location and exceeds a 3-minute delay.
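
The two components and the threshold above can be sketched roughly as follows. This is a simplified illustration, not the real function: the two history arrays stand in for the results of the secondary-index queries, the parameter names are made up for the example, the percentile uses the nearest-rank method (an assumption, since other interpolation methods would work too), and “off its expected location” is interpreted as “behind”.

```javascript
// Nearest-rank percentile of a list of numbers (p between 0 and 100).
// The choice of nearest-rank is an assumption made for this sketch.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// milesAtThisTime: historical miles-from-origin at the current minute
//   (stand-in for the "trainId-minutesSinceMidnight" index query).
// minutesAtThisLocation: historical minutesSinceMidnight at the current geohash
//   (stand-in for the "trainId-geohash" index query).
function isTrainLate({ currentMiles, currentMinutes, milesAtThisTime, minutesAtThisLocation }) {
  const expectedMiles = percentile(milesAtThisTime, 50); // median position
  const usualMinutes = percentile(minutesAtThisLocation, 75);
  const milesDiff = expectedMiles - currentMiles; // positive means behind
  const minutesLate = currentMinutes - usualMinutes;
  // With no history, both values are NaN and the train is not reported late.
  const late = milesDiff > 3 && minutesLate > 3;
  return { late, milesDiff, minutesLate };
}

// Example with made-up numbers: 4 miles behind and 5 minutes late.
console.log(isTrainLate({
  currentMiles: 18,
  currentMinutes: 487,
  milesAtThisTime: [20, 21, 22, 23, 24],
  minutesAtThisLocation: [480, 481, 482, 483],
}));
```

Requiring both conditions helps avoid false alarms: a train that is slightly behind in distance but still arriving at each spot at its usual time is not flagged.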

But…there’s more!

  • I keep a list of stations and calculate what stops a train is at or between. This ends up being part of the tweet:
    sendTweet(`${record.trainId} (${record.lineRef}) is ${milesDiff} miles behind where it should be, running an estimated ${minutesLate} minutes late. Between ${stations[0].name} and ${stations[1].name}.`)
    
  • My code is stateless, so to tweet about a given train at most once every 5 minutes, it records in Redis that it tweeted about that trainId, with a “just under 5 minute” TTL.
  • I calculate how far, as a percentage, each train is through its journey.
  • I send myself a push notification if the train I ride is late (I may open this up as a mobile app or PWA at a later date).
  • I tweet whenever a train is stopped or holding. To do this, I store each train record in a Redis sorted set with a TTL. I then query the set as part of the same operation and determine how many consecutive records show the train at the same location. If it has been in the same spot for 5 minutes, I send out a tweet.
    sendTweet(`${now} Train ${record.trainId} is stopped/holding between stations ${currentStop[0].name} and ${currentStop[1].name} (has not moved in ${minutesHolding} minutes). Current location: ${getMapsLink(record.latitude, record.longitude)}`)
  • If I’ve already tweeted that a train is stopped and its location has since changed, I tweet that the train is now moving.
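
The stopped/holding check from the list above might look something like this. It is only a sketch: a plain array (oldest record first) stands in for the result of the Redis sorted-set query, geohash equality stands in for “same location”, and the function names, action strings, and sample geohashes are invented for the example.

```javascript
const HOLD_MINUTES = 5;

// records: oldest first, each { minutesSinceMidnight, geohash }.
// Returns how many minutes the train has sat at its latest location,
// based on the run of consecutive trailing records with the same geohash.
function minutesHolding(records) {
  if (records.length === 0) return 0;
  const latest = records[records.length - 1];
  let earliest = latest;
  for (let i = records.length - 2; i >= 0; i--) {
    if (records[i].geohash !== latest.geohash) break;
    earliest = records[i];
  }
  return latest.minutesSinceMidnight - earliest.minutesSinceMidnight;
}

// Decides what, if anything, to announce for this train.
function holdingAction(records, alreadyTweetedHolding) {
  const n = records.length;
  const moved = n >= 2 && records[n - 1].geohash !== records[n - 2].geohash;
  if (alreadyTweetedHolding) {
    return moved ? 'tweet-moving' : null;
  }
  return minutesHolding(records) >= HOLD_MINUTES ? 'tweet-holding' : null;
}

// Example with made-up geohashes: the train moves, then sits still for 5 minutes.
const history = [
  { minutesSinceMidnight: 600, geohash: '9q9j8a' },
  { minutesSinceMidnight: 601, geohash: '9q9j8b' },
  { minutesSinceMidnight: 603, geohash: '9q9j8b' },
  { minutesSinceMidnight: 606, geohash: '9q9j8b' },
];
console.log(minutesHolding(history), holdingAction(history, false));
```

Measuring elapsed minutes between the first and last record of the run, rather than counting records, keeps the result sensible when a poll is missed.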

Infrastructure

This script is written in JavaScript (Node.js). It runs, along with Redis, on a nano EC2 instance for about $2/mo. Everything else falls within AWS’s free tier:

  • DynamoDB NoSQL Database
  • S3 & CloudFront for hosting the website

If you’re interested in learning how to use Terraform to set up a similar web hosting infrastructure on AWS, I have a GitHub repo that makes it a breeze.

Final notes

  • On any given day, it seems that 10% of the trains don’t show up on the 511.org API. I call these “ghost trains.” One day I’ll get around to showing them on https://caltrain.live with a note that the status is not current. Ghost trains also don’t show up on Caltrain’s own website, which likely uses the same data source.
  • I was able to take some shortcuts calculating distance from origin knowing that Caltrain’s route is mostly a straight line.
  • My calculations do not rely on train schedules; instead, they are based solely on previous train data. However, there is one downside: during periods of electrification construction, train schedules have frequently changed, and trains have been regularly late. As a result, my system has suggested that trains are on time because it has learned from the consistently late trains during those periods :)
  • Shortly after release, Caltrain launched their own automated late train tweeting system. I suggested they should have hired me. They replied, stating that by the time they discovered me, they were already in contract with a third party to provide similar services.
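
To illustrate the straight-line shortcut for distance from origin mentioned in the notes above, here is a minimal sketch that uses the great-circle (haversine) distance between two coordinates in place of true distance along the track. Whether my script uses haversine specifically is not stated here, so treat the formula choice, the function name, and the approximate station coordinates as assumptions for the example.

```javascript
const EARTH_RADIUS_MILES = 3958.8; // mean Earth radius

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance in miles between two latitude/longitude points.
function haversineMiles(lat1, lon1, lat2, lon2) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
}

// Approximate coordinates, for illustration only.
const sanFrancisco = { lat: 37.7764, lon: -122.3943 };
const sanJoseDiridon = { lat: 37.3299, lon: -121.9025 };

const miles = haversineMiles(
  sanFrancisco.lat, sanFrancisco.lon,
  sanJoseDiridon.lat, sanJoseDiridon.lon
);
console.log(miles.toFixed(1));
```

Because the route bends only gently, a straight-line figure like this slightly underestimates the rail distance but stays consistent from day to day, which is all the comparison against historical data needs.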