Here is my excuse
On the road
When I was a child in the 1950s and 60s, our family visited my grandfather’s beach house every summer. When we drove west on US80 from Arizona we often stopped for a late lunch at the Major’s Coffeeshop in Pine Valley, as that was the first cool spot after crossing the desert.
I recently moved to a Southern California beach town and had to make a number of trips to Southern Arizona, so I was retracing this old route, now on I-8 rather than US80. On one trip I wondered if the Major’s Coffeeshop still existed and if the food there was still decent. Looking up “Major’s Coffeeshop”, “Major” and even “Pine Valley” on OsmAnd (offline mode) turned up nothing. It was not too surprising that the coffee shop did not show up, but a little curious that “Pine Valley” did not.
But there was a freeway exit sign pointing to Pine Valley so we took it to see what we’d find. Sure enough, about 1/4 mile from the freeway on the old road was the 50+ year old sign for the Major’s Coffeeshop with a small addition on the bottom for the current name, Major’s Diner. We had lunch (adequate and average) and went on our way secure in knowing OsmTracker would show where we stopped and having a receipt that showed the address so I could add it to OSM.
When I pulled the area up in JOSM the building was missing so I added it and added the tags for the address and that it was a restaurant, etc.
We have a problem here
But I noticed that there were a lot of nodes with addresses sprinkled through the area. Curious as to why “Pine Valley” did not come up in my search on OsmAnd if there was address information in the area, I looked at them. They all were tagged with addr:city=”San Diego”.
Huh? San Diego is a long way from Pine Valley and on any direct route you would need to go through other incorporated cities like El Cajon and Alpine. Curiously, the nodes also had values set for addr:postcode and when I looked up the ZIP/postcode I found the U.S. Postal Service thinks that code is for Pine Valley. And when in the area I noticed the post office near the restaurant with “Pine Valley” clearly posted on the front. It seemed the addr:city values for all addresses in that area were wrong or at least very suspect.
I also noted that the addr:street values contained abbreviations, which is not standard OSM practice. Further, all these address nodes seemed to be from an old import, and the mappers involved were no longer active and did not respond to my query about the import.
So I thought I could do some “arm chair” mapping and clean up the obvious errors. But first a check on the mailing lists to see what the group opinion might be. Typical responses from people far, far from the area who apparently didn’t even look at the place in question were pretty negative about trying to fix this without on the ground surveying. But a response from a mapper who was raised in the area confirmed that the locals there view it as Pine Valley and not some oddly displaced portion of the city of San Diego.
About that time another email showed up on the list from someone who noticed that all the addr:street values throughout San Diego county had abbreviations. I took a look a bit farther afield than just Pine Valley and discovered they were right: all of San Diego county was covered with address nodes with bogus street names, and it seemed that any node that was not in an incorporated city was tagged as being in the city of San Diego. Basically a mess with lots of wrong data.
Making things better
Despite the stigma of “arm chair” mapping and especially the stigma of automated or semi-automated edits, I thought I could improve the situation. I decided to start with the sparsely populated mountainous and desert area of eastern San Diego County. My first attempts were simply using JOSM. I’d do a grep search for an addr:street value that ended in, say, “Ave”, select one of the results and search for all matching key/values. Once I had a selection I’d zoom in, see what the underlying street name was and correct the values. Wash, rinse and repeat. Once the addr:street values were taken care of, I’d repeat for addr:postcode values to check that the addr:city names seemed reasonable. Very slow, very tedious, very error prone. And that was in an area with almost no address nodes. This was not going to work for the densely populated western coastal area at all.
Time to semi-automate the process. I wrote a script that would read a .osm file and expand out the values found in addr:street tags on nodes only (I thought that any polygons would have been entered manually by an individual mapper rather than through the flawed import). It would also flag any node where there was a mismatch between the addr:postcode and addr:city values. The workflow would be to select an area in JOSM, download the data for it, export the data, run the script, load the changed data, verify all the changes and upload. Repeat for another area, etc.
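The script itself is not shown here, so the following is only a minimal Python sketch of what such a pass might look like. The lookup tables, the function name `process` and the sample node are all made up for illustration, and the single ZIP entry should be checked against USPS data before use. Setting `action="modify"` on a changed node is an assumption about how JOSM recognises edited objects in a loaded file.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables; the real script's tables are not shown.
SUFFIXES = {"Ave": "Avenue", "Rd": "Road", "Dr": "Drive"}
POSTAL_CITY = {"91962": "Pine Valley"}  # illustrative entry, verify against USPS data


def process(root):
    """Expand addr:street suffixes on nodes; return (changes, mismatches)."""
    changes, mismatches = [], []
    for node in root.iter("node"):  # nodes only, ways and relations are left alone
        tags = {t.get("k"): t for t in node.findall("tag")}
        street = tags.get("addr:street")
        if street is not None:
            words = street.get("v").split()
            if len(words) > 1 and words[-1].rstrip(".") in SUFFIXES:
                words[-1] = SUFFIXES[words[-1].rstrip(".")]
                new = " ".join(words)
                changes.append((node.get("id"), street.get("v"), new))
                street.set("v", new)
                node.set("action", "modify")  # assumed JOSM marker for an edited object
        code, city = tags.get("addr:postcode"), tags.get("addr:city")
        if code is not None and city is not None:
            expected = POSTAL_CITY.get(code.get("v"))
            if expected and expected != city.get("v"):
                mismatches.append((node.get("id"), city.get("v"), expected))
    return changes, mismatches


# Made-up sample standing in for a file exported from JOSM.
sample = ET.fromstring(
    '<osm version="0.6"><node id="1" lat="0" lon="0">'
    '<tag k="addr:street" v="Pine Valley Rd"/>'
    '<tag k="addr:city" v="San Diego"/>'
    '<tag k="addr:postcode" v="91962"/></node></osm>'
)
changes, mismatches = process(sample)
for entry in changes + mismatches:
    print(entry)  # a log for a human to review before anything is uploaded
```

For a real export, `ET.parse(path).getroot()` would replace the inline sample, and the modified tree would be written back out for review in JOSM. Note that the sketch only flags city mismatches; it does not change them.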
I found you have to be very careful with expanding abbreviations. For example “St” at the beginning of a street name is probably “Saint” while at the end is probably “Street”. A “Tr” suffix is probably “Trail” in the eastern mountain or desert areas but likely “Terrace” in the western coastal areas. And some of the coastal cities have alphabetically named streets, so you want to expand “E Ave” to “E Avenue” not “East Avenue”.
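The position-dependent cases above can be sketched roughly as follows. The tables and the `expand` function are hypothetical and far from complete; “Tr” is deliberately left out because its meaning depends on the area, so it passes through unchanged for a human to decide.

```python
# Hypothetical tables; "Tr" is absent on purpose because its meaning varies by area.
LEADING = {"St": "Saint", "N": "North", "S": "South", "E": "East", "W": "West"}
TRAILING = {"St": "Street", "Ave": "Avenue", "Rd": "Road", "Dr": "Drive"}


def expand(name):
    """Expand abbreviations according to their position in the street name."""
    words = name.split()
    if len(words) < 2:
        return name
    last = words[-1].rstrip(".")
    if last in TRAILING:
        words[-1] = TRAILING[last]
    first = words[0].rstrip(".")
    # A single letter followed only by the street type is treated as a
    # lettered street ("E Avenue"), not as a compass direction.
    lettered = len(words) == 2 and len(first) == 1
    if first in LEADING and not lettered:
        words[0] = LEADING[first]
    return " ".join(words)


for example in ["St James St", "E Ave", "E Main St", "Oak Tr"]:
    print(example, "->", expand(example))
```

With these tables “St James St” becomes “Saint James Street”, “E Ave” becomes “E Avenue”, “E Main St” becomes “East Main Street” and “Oak Tr” is left alone. None of the example names are taken from the actual data.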
There appear to be no hard and fast rules that work universally for expanding address abbreviations, even for a limited area in the United States. Any expansion the script made had to be checked by a human. So I had the script output a log of the changes it made to help with verifying they made sense. This was slow going even with semi-automated find/replace logic, but at least it was many times faster than doing the same thing in JOSM alone.
This helped a lot in reducing the errors reported by OSM Inspector but there were still a bunch of “Street not found” issues. Taking another look, it seemed that there were a bunch of capitalization issues and spelling errors. So the next version of the script built a list of highway names and then printed a list of suspect addr:street values, those being ones that did not match a name value for a nearby highway. Looking at these, some could be resolved by reviewing the highway name (usually from the original TIGER import), the highway name shown in the latest TIGER overlay and the preponderance of spelling on addr:street values.
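A rough sketch of that check in Python might look like this. It simplifies in one important way: instead of looking only at nearby highways, it compares against every named highway in the downloaded extract. The close-match hints from `difflib` are an addition for illustration, not something the original script is known to have done, and the function name and sample data are made up.

```python
import difflib
import xml.etree.ElementTree as ET


def suspect_streets(root):
    """Map each unmatched addr:street value to similar highway names, if any."""
    highway_names = set()
    for way in root.iter("way"):
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if "highway" in tags and "name" in tags:
            highway_names.add(tags["name"])
    # Compare in lower case so capitalization differences still produce a hint.
    by_lower = {}
    for name in highway_names:
        by_lower.setdefault(name.lower(), name)
    suspects = {}
    for node in root.iter("node"):
        for tag in node.findall("tag"):
            value = tag.get("v")
            if tag.get("k") == "addr:street" and value not in highway_names:
                close = difflib.get_close_matches(
                    value.lower(), list(by_lower), n=3, cutoff=0.8
                )
                suspects[value] = [by_lower[c] for c in close]
    return suspects


# Made-up extract: one named highway and four address nodes.
extract = ET.fromstring(
    '<osm version="0.6">'
    '<node id="1" lat="0" lon="0"><tag k="addr:street" v="Old Highway 80"/></node>'
    '<node id="2" lat="0" lon="0"><tag k="addr:street" v="Old highway 80"/></node>'
    '<node id="3" lat="0" lon="0"><tag k="addr:street" v="Old Higway 80"/></node>'
    '<node id="4" lat="0" lon="0"><tag k="addr:street" v="Laguna Road"/></node>'
    '<way id="9"><tag k="highway" v="secondary"/><tag k="name" v="Old Highway 80"/></way>'
    '</osm>'
)
suspects = suspect_streets(extract)
for value, hints in suspects.items():
    print(value, "->", hints)
```

Values with a hint are likely capitalization or spelling slips; values with an empty hint list need a closer look at the imagery, the TIGER overlay or the ground.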
But in many other cases, especially along state highways, it was not possible to guess the correct street name. If the addr:street values along a road are a mix of “Highway 79”, “SR 79” and “CA 79” and the highway does not have a name value only a ref tag, which to choose? It depends on the actual signs along the road so it can’t be resolved remotely.
The next issue discovered was that many addresses were imported more than once, often in widely different positions. So the script was modified again to output a list of duplicate addresses. Using Bing satellite imagery and the highway locations shown in JOSM sometimes these can be resolved. But in many cases, it just isn’t clear so a field survey is needed.
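Duplicate detection of this kind can be sketched as a grouping step. The version below is an assumption about how it might be done: it keys on house number, street and postcode, and reports how far apart the copies are using a haversine distance so that widely separated duplicates stand out. The function names, the address and the coordinates are invented for the example.

```python
import math
import xml.etree.ElementTree as ET
from collections import defaultdict


def distance_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs (haversine)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def duplicate_addresses(root):
    """Return (address key, node ids, largest separation in metres) per duplicate."""
    groups = defaultdict(list)
    for node in root.iter("node"):
        tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
        key = (tags.get("addr:housenumber"), tags.get("addr:street"),
               tags.get("addr:postcode"))
        if key[0] and key[1]:  # skip nodes without a usable address
            position = (float(node.get("lat")), float(node.get("lon")))
            groups[key].append((node.get("id"), position))
    report = []
    for key, members in groups.items():
        if len(members) > 1:
            spread = max(distance_m(p, q) for _, p in members for _, q in members)
            report.append((key, [node_id for node_id, _ in members], round(spread)))
    return report


# Made-up data: nodes 1 and 2 carry the same address about a kilometre apart.
data = ET.fromstring(
    '<osm version="0.6">'
    '<node id="1" lat="32.0" lon="-116.0"><tag k="addr:housenumber" v="100"/>'
    '<tag k="addr:street" v="Example Road"/><tag k="addr:postcode" v="00000"/></node>'
    '<node id="2" lat="32.0" lon="-116.01"><tag k="addr:housenumber" v="100"/>'
    '<tag k="addr:street" v="Example Road"/><tag k="addr:postcode" v="00000"/></node>'
    '<node id="3" lat="32.0" lon="-116.0"><tag k="addr:housenumber" v="102"/>'
    '<tag k="addr:street" v="Example Road"/><tag k="addr:postcode" v="00000"/></node>'
    '</osm>'
)
report = duplicate_addresses(data)
for entry in report:
    print(entry)
```

A list like this only says where to look. Which copy is right still has to be settled from imagery or a field survey.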
Result
The current state of addresses in San Diego county, after many, many thousands of address nodes were cleaned up:
- The addr:city tag should match the postal city given by the ZIP code.
- There should be no abbreviated values in the addr:street tags.
- Most, but far from all, of the addr:street values match the name value for a nearby street.
- Some of the duplicate addresses have been resolved.
But I am now listed as the most recent editor of a lot of address nodes that are duplicates and/or don’t have a matching highway name.
Discussion
Comment from Warin61 on 7 March 2016 at 01:40
Those who think a survey is required to identify a city .. well they need to do some more mapping. (Not napping.)
It should be possible to make a semi automated replacement for trailing abbreviations - those at the end of the text that have a leading space character.
They should match a string e.g. St St. st st. Ave etc. and be replaced with the appropriate full string. Each replacement should be confirmed or rejected by a single key/mouse press - thus not a full automation, but removes typo to a great extent.
This process can be used not only for nodes but also for ways that are tagged highway=* and for areas that are tagged with addresses (things like buildings, parks may all have street addresses).
I would be a little more cautious of leading abbreviations like Saint.
Comment from n76 on 7 March 2016 at 03:52
“I would be a little more cautious of leading abbreviations like Saint.” Definitely! My impression from this exercise is that you should be cautious expanding any abbreviation. In this case I decided to check each expansion against the full names of nearby streets to see if any matched. If I had a match then I assumed the expansion was okay. Those that did not match nearby streets were logged for separate examination.
Comment from Piskvor on 7 March 2016 at 09:47
Great insights - and a great story to boot! :)
Comment from stevea on 23 August 2020 at 18:10
Very nice Diary entry! I also “grew up in the area” (and know Major’s first-hand, family had a place in nearby Guatay, did a lot of hiking out there as a kid…). The San Diego address import is indeed messy and I salute you for your cleanup. Simply because you are the “most recent editor” doesn’t mean you haven’t done a lot of good, hard work: you have. The histories show that.