Using GNIS data to find potential additions and corrections

Posted by Kai Johnson on 28 January 2023 in English.

I’ve started a new project working with watmildon. While we were working together on applying the USGS Sq___ name changes to OSM, we noticed that features in OSM were often out of sync with official name changes that happened years ago.

That got us thinking about walking through the USGS GNIS data set to find places where names had changed and OSM could be updated. After all, there are many features in OSM that have gnis:feature_id (and similar) tags that can be directly matched back to the GNIS data set.

After kicking the idea around for a while, we recently started writing some code. I’ve been working on a matching engine in C# that matches records from GNIS to OSM by Feature ID. The code also looks for likely matches where the feature name, primary tags, and geometry are close to the information from GNIS. So far the results are pretty good, but we’re still working on improving the matching.
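As a rough illustration of the two-pass idea described above, here is a minimal Python sketch. It is not the actual C# engine: the record types, the `CLASS_TO_TAG` table, the exact-name comparison, and the 1 km distance threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class GnisRecord:
    feature_id: str
    name: str
    feature_class: str
    lat: float
    lon: float


@dataclass
class OsmFeature:
    osm_id: str
    tags: dict
    lat: float
    lon: float


# Hypothetical mapping from a GNIS feature class to an expected OSM primary tag.
CLASS_TO_TAG = {
    "Stream": ("waterway", "stream"),
    "Spring": ("natural", "spring"),
    "Lake": ("natural", "water"),
}


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def match(record, features, max_km=1.0):
    """Return (feature, kind) where kind is 'id', 'likely', or 'none'."""
    # Pass 1: direct match on the GNIS Feature ID tag.
    for f in features:
        if f.tags.get("gnis:feature_id") == record.feature_id:
            return f, "id"
    # Pass 2: likely match when name, primary tag, and location all agree.
    expected = CLASS_TO_TAG.get(record.feature_class)
    for f in features:
        same_name = f.tags.get("name", "").casefold() == record.name.casefold()
        same_kind = expected is not None and f.tags.get(expected[0]) == expected[1]
        near = distance_km(record.lat, record.lon, f.lat, f.lon) <= max_km
        if same_name and same_kind and near:
            return f, "likely"
    return None, "none"
```

A real matcher would need looser name comparison and geometry handling for ways and relations; the sketch only compares single points.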

Meanwhile, watmildon did some large scale statistical analysis on a local PBF file to look at the scale and scope of the problem. The results were very interesting!

Of the 2.3 million features in GNIS, there are only 1 million corresponding features with GNIS IDs in OSM. Some portion of these are surely existing features that just don’t have the gnis:feature_id (or similar) tags. But given our manual review of results from the matching code, there are a lot of GNIS features that are not present in OSM at all.

That’s not too much of a surprise. Some of the most common types of missing features are Streams, Valleys, Lakes, Springs, and Ridges – all things that are not widely mapped in the US.

GNIS recently archived the feature classes for civil names and man-made features. About half of the 1.3 million GNIS records that don’t have corresponding features in OSM are for those archived features. You might reasonably wonder whether it’s worth tagging the archived features in OSM. But that leaves about 600,000 current GNIS features that aren’t fully tagged in OSM. And a large portion of those are likely not mapped at all.

At this point, we’re still working on improving our tools and on collecting and analyzing the data. There do seem to be some opportunities for automated tag cleanup, and if that makes sense we’ll follow community practices for anything like that.

But fixing the untagged/missing features is going to require manual review, and there are too many features for us to do that alone. We’ll have to keep working to find ways to enlist the rest of the community to help!

Discussion

Comment from watmildon on 29 January 2023 at 07:48

As we work through this, I’m getting more and more interested in a general deduplication/simplification/deletion of common US import tags. Many objects have more than one synonym (or rough synonym with the same value), which leads to maintenance issues (ex: merged nodes with different GNIS IDs, but stored in different synonymous keys).

Some cleanup that would immediately help reduce confusion/headaches: the 6 ways to tag GNIS ID; the many ways to encode name data (NHD:GNIS_Name, gnis:name); the GNIS tags that are just is_in:* tags in a different format; etc.

I wonder if there’s precedent for merging out synonyms etc. Something to ask around about I suppose.
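The merged-node problem mentioned above (different GNIS IDs stored under synonymous keys) could be detected with something like the following Python sketch. The key list is illustrative only and is not the full set of six, and stripping leading zeros before comparing is an assumption about how some imports padded their IDs.

```python
# Illustrative subset of keys that may hold a GNIS ID (an assumption, not the full list).
GNIS_ID_KEYS = ("gnis:feature_id", "gnis:id", "NHD:GNIS_ID")


def normalize(value):
    """Trim whitespace and leading zeros so '00123' and '123' compare equal (assumed padding)."""
    return value.strip().lstrip("0")


def distinct_gnis_ids(tags):
    """Collect the distinct normalized IDs found under any of the synonymous keys."""
    return {normalize(tags[key]) for key in GNIS_ID_KEYS if key in tags}


def has_conflicting_ids(tags):
    """True when the synonymous keys on one object disagree about the GNIS ID."""
    return len(distinct_gnis_ids(tags)) > 1
```

Objects flagged this way would still need a human look, since the sketch cannot tell which of the two IDs is the right one.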

Comment from Kai Johnson on 29 January 2023 at 16:30

I think there’s a lot to be said for cleaning up tags from old imports. That might be the first way this project could contribute something to the map.

Since there were only a few examples globally, I went ahead and manually fixed all the malformed gnis:feature_id tags, which were generally all duplicates of existing data (mostly duplicates of name). Turns out, many of those features actually had real GNIS IDs that I could fill in.

And there was one exception where a mapper in Germany put some apparently useful data in that field. I didn’t know what to do with it, so I just left a comment on the changeset.
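For anyone wanting to spot similar cases, here is a small Python sketch of the kind of check involved. It assumes a well-formed `gnis:feature_id` is a plain number, or several numbers separated by semicolons; the function names are made up for the example and this is not the process actually used for the manual fixes.

```python
import re

# Assumption: a valid value is digits only, optionally several IDs joined by ';'.
ID_LIST = re.compile(r"\d+(;\d+)*")


def is_malformed(tags):
    """True when gnis:feature_id is present but does not look like a numeric ID list."""
    value = tags.get("gnis:feature_id")
    return value is not None and ID_LIST.fullmatch(value.strip()) is None


def duplicates_name(tags):
    """True when gnis:feature_id merely repeats the name tag."""
    value = tags.get("gnis:feature_id")
    return value is not None and value == tags.get("name")
```

A value that fails the check is only a candidate for review; as the German example shows, some unexpected values may be deliberate.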
