Mass-importing games from Steam into Wikidata

With help from Facenapalm on Wikidata, I spent some time in early September importing a large set of games from Steam into Wikidata.

For those unfamiliar, Wikidata is a sister project of Wikipedia, and it’s a repository for structured data about essentially everything: books, movies, TV shows, people, video games, companies, laws, countries, cities, and more. That structured data - in a format like “Half-Life 2 => publication date => 16 November 2004” - is then used for things like the infoboxes on the side of Wikipedia articles. The data is freely licensed, and is also used by various other organizations and companies as one of the sources of information for knowledge graphs - like Google’s - and personal assistants - like Siri or Alexa.

Steam is an online video game store that sells games for Windows, macOS, and Linux. It’s by far the most popular store for PC games.

I’m a member of WikiProject Video games, a group of volunteers that edit Wikidata with a focus on video game-related items in particular. I also build and run vglist, which is a website for tracking your video game library. It’s populated using the video game information on Wikidata, so improving the video game data on Wikidata will - in turn - improve vglist as well. I’ve written previously about Wikidata imports I’ve done, and have done large imports of various external IDs over the years.

The idea for this import is relatively simple. There are a lot of games on Steam. We want Wikidata to catalogue everything (with some reservations), and that includes essentially every video game ever released. The number of PC games on Wikidata, while significant, is still a lot less than the number of games on Steam. So let’s import all the games on Steam that aren’t already in Wikidata!

But first, some math.

The exact number of games on Steam varies depending on the source and what they use to determine which records in Steam are games (as opposed to DLC, modding SDKs, software, soundtracks, etc). In February 2021, PCGamesN reported that there were 50,000. As of September 17, 2023, SteamSpy says there are 71,845 games on Steam. That seems to be the most accurate, up-to-date number I can find, so we’ll go with it.

There are also a lot of games on Wikidata. Before this import was done, around 58,000 video games were catalogued by Wikidata. Of those, 22,922 had Steam IDs. That’s a Far Cry from the more than 70,000 overall games available on Steam. We can obviously do better.

Getting a list of Steam IDs to import

I need a list of every game on Steam. Or at least a list of most games on Steam.

I didn’t want to just scrape the entirety of the Steam store, as that’d be rather complex and could result in me being IP-banned by Valve, which - given that I just moved into a new apartment building - probably wouldn’t be ideal for my neighbors. I also just dislike scraping HTML pages, because parsing HTML is miserable. I’ll do it if I have to, of course, but in this case I had other tools I could take advantage of.

Essentially, from scripts I’ve written to import Internet Game Database IDs into Wikidata, I have a script which will dump data on every game from the IGDB database into a JSON file on my laptop. And that dump includes Steam IDs for many games. This is also useful in that we can ensure that every game imported will also have an IGDB page associated with it. So we’ll always have at least two external identifiers, both of which are used very frequently on Wikidata.

This reduces the risk of creating duplicate items for games already on Wikidata, as the game would need to be missing both a Steam ID and IGDB ID to not be caught by the duplication checks I’ve built. There’s also a third, relatively basic duplication system built into Wikidata which will check for any existing items in Wikidata with the exact same name (“Half-Life 2”) and description (“2004 video game”), and will prevent you from creating the item if there are any matches.
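As a rough illustration, the three layers of duplicate checking described above could be sketched like this. The `ExistingItem` shape, the function name, and the placeholder item ID are all hypothetical; the real checks run against Wikidata itself rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExistingItem:
    """A minimal, hypothetical view of an item that is already on Wikidata."""
    qid: str
    label: str
    description: str
    steam_id: Optional[str] = None
    igdb_id: Optional[str] = None


def find_duplicate(existing, steam_id, igdb_id, label, description):
    """Return the ID of a likely duplicate, or None if no check matches."""
    for item in existing:
        # Checks 1 and 2: either external identifier is already in use.
        if item.steam_id is not None and item.steam_id == steam_id:
            return item.qid
        if item.igdb_id is not None and item.igdb_id == igdb_id:
            return item.qid
        # Check 3: same label and same description, which Wikidata rejects.
        if item.label == label and item.description == description:
            return item.qid
    return None


# "Q-placeholder" is an illustrative ID, not a real Wikidata item.
existing = [ExistingItem("Q-placeholder", "Half-Life 2", "2004 video game", steam_id="220")]
print(find_duplicate(existing, "220", None, "Half-Life 2", "2004 video game"))
```

A game only slips through if all three checks miss, which is why having both a Steam ID and an IGDB ID on every imported item matters.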

So we have our list of potential Steam IDs, now to determine which to import.

Filtering

There are a few heuristics I decided to use when choosing which games to import. We want to avoid doing a mass-import of games with data that later becomes incorrect, or which adds an overly large maintenance burden for the WikiProject.

First, obviously, is that it has an IGDB record, since that’s where we’re sourcing this list.

Second, it has to be released. There are a lot of games on Steam that have a release date of “TBD” or “Coming Soon”. We could create items for those, but game development is incredibly difficult, and many will never really come out or will change drastically during development. For games that have set release dates, we exclude those as well, because those release dates often change, or the game just never comes out.

Third, I chose to exclude Early Access games. Early Access games are perfectly fine, and are “released” in that they are playable by anyone with a Steam account willing to buy them. But they also change often, have shifting release dates, and felt like an additional maintenance burden that would need to be handled by editors whenever any of the games leave Early Access. I may choose to import those in a later batch, but for now I chose to skip them.

Fourth, the game needs to have an English translation. There are obvious problems with this decision, namely that it’s making the Wikidata video game corpus more Western-centric/Euro-centric by excluding games only released in non-English languages. My reasoning for this is fairly simple: I don’t speak any other languages besides English. I can’t perform due diligence on games without English titles or text or reference material. I can’t easily resolve duplicates if we find them later, or resolve other issues very easily, and the scripts we have aren’t set up to pull titles that are unique to specific languages. In many cases, this would result in lower-quality data that cannot be maintained by me, and that I have no good means of fixing. I don’t feel great about this decision, but it’s ultimately not one I have a great solution for. If anyone wants to do a mass-import of Steam games with titles in a non-English language that they can speak/write, I’d be more than happy to help! Please reach out if you’re interested in that.

Fifth and finally, the games must have only Latin characters in the title. This was a decision made for mainly the same reasons as the above: data quality, accuracy, and maintainability. I don’t speak or read any other languages, so figuring out and resolving duplicates if a game’s title is in Chinese or Japanese isn’t really feasible for me. For many of these, the game’s title includes both the English translation of the title and the native title. For these, we generally want to clean it up such that the English label is only the English title, and the label for the native language includes only the native title. I intend to go back and import many of these later, when I can focus specifically on games with these kinds of titles. But needing to handle all of these while importing potentially tens of thousands of other games just didn’t seem doable, so we’re skipping them for now.
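One way to express the Latin-characters rule is to look at the Unicode name of every letter in the title. This is only a sketch of the idea, not necessarily how the actual script does it, and `has_only_latin_letters` is a name made up for this example.

```python
import unicodedata


def has_only_latin_letters(title: str) -> bool:
    """Return True if every alphabetic character in the title is a Latin letter."""
    for char in title:
        if not char.isalpha():
            # Digits, spaces, and punctuation are fine in any title.
            continue
        # Unicode names of Latin letters start with "LATIN", for example
        # "LATIN SMALL LETTER E WITH ACUTE" for "é".
        if not unicodedata.name(char, "").startswith("LATIN"):
            return False
    return True


print(has_only_latin_letters("Half-Life 2"))
print(has_only_latin_letters("メトロ Metro"))
```

Checking by Unicode name rather than plain ASCII keeps accented titles in the import while still skipping titles that mix an English name with a native-script one.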

And with that, we have our heuristics. Some of these are filterable just based off the data from the IGDB dump; others need to be handled by the script itself when it pulls data from Steam.
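To make the release, Early Access, and language heuristics concrete, here is a minimal sketch of them as a single function that returns the reason a game is skipped. Every field name (`release_date`, `early_access`, `languages`) is an assumption made for this example, not the real shape of the IGDB dump or of Steam's data, and the title check is left out to keep it short.

```python
from datetime import date
from typing import Optional


def exclusion_reason(game: dict, today: date) -> Optional[str]:
    """Return why a game should be skipped, or None if it passes every check."""
    # Hypothetical convention: None stands in for "TBD" or "Coming Soon".
    release = game.get("release_date")
    if release is None:
        return "no release date"
    if release > today:
        return "not released yet"
    if game.get("early_access", False):
        return "early access"
    if "English" not in game.get("languages", []):
        return "no English support"
    return None


game = {"release_date": date(2004, 11, 16), "early_access": False, "languages": ["English"]}
print(exclusion_reason(game, date(2023, 9, 17)))
```

Returning a reason instead of a bare true/false makes it easier to review afterwards why a given game was left out.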

Writing and running the script

Now that we’ve decided on the heuristics, we need to filter the list of Steam IDs from IGDB based on them. I like scripting with Ruby, so I used Ruby. The final Ruby script can be found here on GitHub.

Basically, it takes the JSON file with the dump of games from IGDB and goes through every single game. It removes those already represented on Wikidata, removes games without Steam IDs or with more than one Steam ID, and then pulls data from Steam and applies the heuristics described previously. Then it prints the remaining Steam IDs to the console. It also logs a list of “exclusions”, which are games already evaluated and excluded by the heuristics, so that we can re-run the script and don’t end up redoing all those checks again on the same exact games for no reason. This is very useful if the script fails part of the way through due to an error or some other issue.
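The exclusions log is what makes the script safe to restart. The real script is written in Ruby; the sketch below shows the same idea in Python, with a hypothetical JSON file format and a `should_import` callback standing in for the Steam lookups and heuristics.

```python
import json
from pathlib import Path


def load_exclusions(path: Path) -> set:
    """Read the IDs rejected by earlier runs; a missing file means none yet."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def filter_with_exclusions(steam_ids, should_import, path: Path) -> list:
    """Return the IDs that pass `should_import`, recording every rejection."""
    excluded = load_exclusions(path)
    kept = []
    for steam_id in steam_ids:
        if steam_id in excluded:
            continue  # rejected by an earlier run, so skip the slow check
        if should_import(steam_id):
            kept.append(steam_id)
        else:
            excluded.add(steam_id)
            # Save after every rejection so a crash loses at most one result.
            path.write_text(json.dumps(sorted(excluded)))
    return kept
```

Writing the file after each rejection is slower than writing once at the end, but it means a dropped connection partway through a run does not throw away the work already done.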

I spent about three days running this script and occasionally babysitting it to make sure it continued running fine. We needed to use the Steam API to pull data for thousands of games, so it required a few seconds for every single game. Unfortunately, my internet at the new apartment was also having problems, which meant that I would leave the script running overnight, but it would lose internet and fail after only an hour or two while I was asleep. Not very efficient for me.

Importing the data

Technically, I did the imports in parallel with running the Ruby script, but talking through the process while switching back and forth would be rather confusing, so let’s pretend we only did the import after the Ruby script gave us our final, filtered list of all the Steam IDs.

Facenapalm has a Steam import script he wrote in Python last year, and has continued extending since. It takes a list of Steam IDs, and creates Wikidata items for each game in the list. It’s fairly straightforward, but since it was written for importing smaller batches of games, it made a separate edit for each statement. This meant each game could take anywhere from 15 seconds to multiple minutes to import (the longer ones were generally due to games which listed themselves as supporting all 106 languages on Steam, which necessitated 106 distinct edits), and would create potentially over a hundred edits just to create a single game item.

Thankfully, after I asked him about it, he kindly wrote me a script which reused the existing code for the Steam importer, but rather than creating the item via the Wikidata API, stores the items it wants to create in a local file as a set of QuickStatements commands. QuickStatements is a tool for batch-editing in Wikidata, and it’s the last step here for actually creating all of these items in Wikidata.
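For readers who haven't seen QuickStatements commands before, here is a minimal sketch of what generating them for one game might look like. It uses the tab-separated V1 syntax, where `CREATE` makes a new item and `LAST` refers back to it. The function name and the choice of statements are assumptions for illustration; the real script emits many more statements per game (game modes, supported languages, and so on) and needs to deal with things like quotes in titles, which this sketch ignores.

```python
def quickstatements_for_game(label: str, description: str, steam_id: str) -> list:
    """Build QuickStatements V1 commands that create one video game item."""
    return [
        "CREATE",
        f'LAST\tLen\t"{label}"',        # English label
        f'LAST\tDen\t"{description}"',  # English description
        "LAST\tP31\tQ7889",             # instance of: video game
        f'LAST\tP1733\t"{steam_id}"',   # Steam application ID
    ]


for line in quickstatements_for_game("Half-Life 2", "2004 video game", "220"):
    print(line)
```

Because every statement for a game hangs off the same `CREATE`, QuickStatements can build the whole item in one go instead of one edit per statement.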

Once all the QuickStatements commands were generated, I ran the batches in QuickStatements, which created all of the games in Wikidata, with all the information we derived from Steam (the title, Steam ID, game mode, supported languages, etc.) creating only one edit per game. Unfortunately, QuickStatements’ web interface didn’t really like me giving it 2 million lines worth of commands all in one go, so I had to manually split them up into sets of around 30,000 lines, or about 250 items per batch. This took… a while, but it ultimately worked fine, and wasn’t too bad considering the time savings of all these scripts we had written.
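I did the splitting by hand, but it could be scripted. The sketch below assumes each item starts with a `CREATE` line followed by its `LAST` lines, and keeps every item whole, since a batch that began with `LAST` lines would have no item for them to refer to. The function name and the line limit are hypothetical.

```python
def split_into_batches(lines, max_lines):
    """Group command lines into batches of up to `max_lines` lines.

    Items are never split; an item longer than `max_lines` gets a batch
    of its own, which is the only case where the limit is exceeded.
    """
    # Group the lines into items: each item starts at a CREATE line.
    items = []
    for line in lines:
        if line == "CREATE" or not items:
            items.append([])
        items[-1].append(line)

    batches = [[]]
    for item in items:
        # Start a new batch if this item would push the current one over the limit.
        if batches[-1] and len(batches[-1]) + len(item) > max_lines:
            batches.append([])
        batches[-1].extend(item)
    return [batch for batch in batches if batch]


commands = ["CREATE", "LAST a", "LAST b", "CREATE", "LAST c", "CREATE", "LAST d", "LAST e"]
for batch in split_into_batches(commands, max_lines=5):
    print(len(batch), batch)
```

With a limit of around 30,000 lines, each batch would hold a couple of hundred items, depending on how many statements each game has.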

Ultimately, this bulk import resulted in 20,800 new game items on Wikidata! It doesn’t quite get us to fully representing the entirety of Steam’s game catalogue, but given the previously mentioned heuristics, that was never really expected anyway. As of today, we have 43,944 Steam IDs on Wikidata. Out of the 71,845 games on Steam, that makes for 61% coverage of the games on the Steam store. We’ve still got a ways to go, but this import added more than 28% of the entire storefront to Wikidata, so that’s a pretty significant step forward.
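The percentages above come from simple division against the SteamSpy total. As a quick check, using only figures quoted in this post:

```python
STEAM_GAMES = 71_845       # SteamSpy count as of September 17, 2023
STEAM_IDS_BEFORE = 22_922  # Steam IDs on Wikidata before the import
STEAM_IDS_AFTER = 43_944   # Steam IDs on Wikidata after the import
NEW_ITEMS = 20_800         # game items created by this import


def percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


print(f"coverage before: {percent(STEAM_IDS_BEFORE, STEAM_GAMES):.1f}%")
print(f"coverage after:  {percent(STEAM_IDS_AFTER, STEAM_GAMES):.1f}%")
print(f"added by import: {percent(NEW_ITEMS, STEAM_GAMES):.1f}%")
```

That works out to roughly 32% coverage before the import and 61% after, with the new items alone accounting for a little under 29% of the storefront.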

Further data enrichment

While we get a lot of the base data about a game from Steam, that’s not the end of the story. There are hundreds of properties that can be applied to video game items on Wikidata, many of them being “external identifiers”, the ID of a given game in a third-party database.

So, for the next week or two, Facenapalm and I ran a number of scripts written by him (and one or two by me, although most of mine have been superseded at this point and were only used when initially populating the IDs back in 2019-2021 🙂) to import other IDs. Most of these IDs are automatically derived from the Steam ID: we query the API or scrape the third-party database to determine which record in their database includes the Steam ID we’re evaluating. The other IDs are not based on the Steam ID directly but are instead “daisy-chained” off another identifier. For example, Lutris IDs are matched using the IGDB IDs rather than the Steam IDs we originally imported.
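The daisy-chaining idea boils down to a lookup through an identifier the item already has. Here is a minimal sketch, assuming we already have a mapping from IGDB IDs to Lutris IDs built from the third-party database. The data shapes, the function name, and the example IDs are all hypothetical.

```python
def daisy_chain(items: dict, igdb_to_lutris: dict) -> dict:
    """Map item IDs to Lutris IDs by way of each item's IGDB ID.

    `items` maps an item ID to the identifiers it already has, and
    `igdb_to_lutris` is a lookup built from the third-party database.
    """
    matches = {}
    for qid, ids in items.items():
        if "lutris" in ids:
            continue  # already has a Lutris ID, nothing to add
        lutris_id = igdb_to_lutris.get(ids.get("igdb"))
        if lutris_id is not None:
            matches[qid] = lutris_id
    return matches


# Placeholder IDs for illustration only.
items = {"Q-one": {"steam": "220", "igdb": "half-life-2"}, "Q-two": {"steam": "999"}}
print(daisy_chain(items, {"half-life-2": "half-life-2"}))
```

Items without an IGDB ID simply produce no match, which is part of why requiring an IGDB record for every imported game pays off at this stage.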

This includes all of the following IDs, as well as a few more:

With all of those external IDs, we can then derive further information, entirely via automation. From Mod DB and PCGamingWiki we can derive the game engine used in a game. From the Nintendo eShop IDs, we can determine additional platforms the game is available on (probably Switch, given that the eShop for Wii U and 3DS shut down earlier this year). From the PlayStation Store ID we can potentially do the same for Sony platforms. And so on.

The result is more than 20,000 new, fleshed-out Wikidata items for video games. They all include data derived from Steam (the release date, supported PC platforms, whether the game is single-player or multiplayer, the Steam ID, the supported languages, etc.), as well as connections to potentially more than a dozen other video game databases. And this is all highly accurate, and all done via automation and scripting.

In the future, we’ll also enrich these items - either manually or via automation - with data like developers, publishers, genres, series information, comprehensive platform data, and more.

Final numbers

We’re still doing work to improve these items further, and will likely iterate upon this project to import further Steam games in the near future, but I wanted to share some before-and-after numbers for the video game items and external identifiers in Wikidata. Not all of these are from this import, as other editors have been working on video game items in the past 2 weeks as well, but our scripts make up the vast majority of the change.

Video games on Wikidata: from around 58,000 to 79,826 (+21,826)
Steam IDs: from 22,922 to 43,944 (+21,022)
IGDB IDs: from 40,457 to 60,718 (+20,261)
Lutris game IDs: from 35,618 to 56,265 (+20,647)
PCGamingWiki IDs: from 16,158 to 24,557 (+8,399)
MobyGames game IDs: from 42,191 to 46,444 (+4,253)
RAWG IDs: from 18,435 to 36,800 (+18,365)
HowLongToBeat IDs: from 18,803 to 24,023 (+5,220)

In total, that’s 98,167 new external IDs on Wikidata! And that doesn’t cover all of the IDs that were imported, as there were some more niche IDs imported as well.
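The total is just the sum of the per-identifier increases listed above (the overall video game count is not an external ID, so it is left out). A quick way to double-check it:

```python
# (before, after) counts for each external identifier, taken from the list above.
counts = {
    "Steam": (22_922, 43_944),
    "IGDB": (40_457, 60_718),
    "Lutris": (35_618, 56_265),
    "PCGamingWiki": (16_158, 24_557),
    "MobyGames": (42_191, 46_444),
    "RAWG": (18_435, 36_800),
    "HowLongToBeat": (18_803, 24_023),
}

new_ids = {name: after - before for name, (before, after) in counts.items()}
total = sum(new_ids.values())

for name, added in new_ids.items():
    print(f"{name}: +{added:,}")
print(f"Total: +{total:,}")
```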

A huge thank you again to Facenapalm for all his help with the scripting and data enrichment here. He saved me literally days’ worth of time with the QuickStatements script he put together for me. Months, if I compare this to having done all of this work manually. And his computer has spent a lot of time over the last week running scripts to enrich all the newly imported items.

Thank you JeanFred for encouraging me to write this blog post and helping us gut-check some of the decisions we were making with this import. And thank you to the rest of Wikidata’s WikiProject Video games for the help with everything video game-related on Wikidata.