Remix.run Logo
svat 6 days ago

Try it here (you may have to create a Google Cloud project, but you don't have to enable billing or start the free trial):

https://console.cloud.google.com/bigquery?p=bigquery-public-...

Click on the `+` (white over blue background) in the tab bar at the top that says "SQL query" on popup, and type the following (I use the GoogleSQL pipe syntax (https://cloud.google.com/bigquery/docs/reference/standard-sq... / https://news.ycombinator.com/item?id=41347188) below, but you can also use standard SQL if you prefer):

    FROM `bigquery-public-data.hacker_news.full` 
    |> WHERE type = 'comment' AND timestamp < '2022-11-30'
    |> AGGREGATE COUNT(*) AS total, COUNTIF(text LIKE '%—%') AS with_em GROUP BY `by`
    |> EXTEND with_em / total AS fraction_with_em
    |> ORDER BY fraction_with_em DESC
    |> WHERE total > 100 AND fraction_with_em > 0.1
(I'm in place 47 of the 516 results, with 0.29 of my comments (258 of 875) having an em dash in them.)

Edit: As you also asked about timestamps:

    FROM `bigquery-public-data.hacker_news.full`
    |> WHERE type = 'comment' AND timestamp < '2022-11-30'
    |> EXTEND text LIKE '%—%' AS has_em
    |> AGGREGATE
        COUNT(*) AS total,
        COUNTIF(has_em) AS with_em,
        MIN(timestamp) AS first_comment_timestamp,
        MIN(IF(has_em, timestamp, NULL)) AS first_em_timestamp,
        TIMESTAMP_SECONDS(CAST(AVG(time) AS INT64)) AS avg_comment_timestamp,
        TIMESTAMP_SECONDS(CAST(AVG(IF(has_em, time, NULL)) AS INT64)) AS avg_em_timestamp,
      GROUP BY `by`
    |> EXTEND with_em / total AS fraction_with_em
    |> ORDER BY fraction_with_em DESC
    |> WHERE total > 100 AND fraction_with_em > 0.1
for most people the average timestamp is just the midpoint of when they started posting (with em dashes) and the cutoff date of 2022-11-30, and the top-place user zmgsabst stands out for having started only in late January 2022.