villasv's Diary

Recent diary entries

Vancouver Cafe Scene Project Day 6

Posted by villasv on 12 July 2022 in English.

Today I was able to swap out the pairing of OSM nodes and VSI stores. I’m now going over each VSI store and picking the closest OSM node with no threshold. Well, I am using a soft cap of 1 km just to guarantee that the cross product of these two tables won’t explode, but no VSI store stays unmatched within that radius.
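For illustration, the "closest OSM node per VSI store, with a soft cap" pairing can be sketched in plain Python. This is only a sketch under assumptions, not the actual query used in the project: the tuple layout `(id, lat, lon)` and the function names are hypothetical, and the haversine formula stands in for whatever distance function the real pipeline uses.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_osm_node(stores, nodes, cap_m=1000.0):
    """Map each store id to (node id, distance in meters) of its closest node.

    Stores with no node within cap_m map to None. Inputs are iterables of
    (id, lat, lon) tuples; that layout is an assumption for this sketch.
    """
    result = {}
    for sid, slat, slon in stores:
        best = None
        for nid, nlat, nlon in nodes:
            d = haversine_m(slat, slon, nlat, nlon)
            # The cap only bounds the search; within it, the closest node wins
            if d <= cap_m and (best is None or d < best[1]):
                best = (nid, d)
        result[sid] = best
    return result
```

Every store gets at most one node this way, and lowering `cap_m` turns the soft cap into a real threshold.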

The analysis can be seen in this Data Studio dashboard.

As of now there are 140 good looking matches between OSM and VSI, basically half the dataset - much more than the initial count of 10%! I’ll go through these manually anyway just to confirm, because a few of them have a bit of a location discrepancy (>50 meters) and many have noticeable naming variations.

There are 138 non-matching pairs. There are a few of these that are just being missed by the name matching fuzziness logic, but most do seem to have significantly different names and locations. These will require a bit more survey work to sort out.

Vancouver Cafe Scene Project Day 5

Posted by villasv on 11 July 2022 in English.

Continued working on matching OSM nodes with VSI stores. There were 6 near-perfect matches (exact same name, points less than 10 m apart), but after introducing some fuzziness to the name comparison, that number jumps to 23!
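As a rough illustration of what "some fuzziness" in the name comparison can look like, here is a minimal Python sketch using the standard library's `difflib`. It is not the logic used in the project (which runs in SQL); the normalization steps, the function names and the 0.7 threshold are all assumptions picked for the example.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, strip accents (café -> cafe), drop punctuation, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())

def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized business names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_fuzzy_match(a, b, threshold=0.7):
    """True when the names are similar enough; the threshold is an arbitrary choice."""
    return name_similarity(a, b) >= threshold
```

With this sketch, "Starbucks" and "Starbucks Coffee" score 0.72 and pass the 0.7 threshold, while names that only differ in case or accents score 1.0.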

I manually inspected those 23 pairs and indeed they seem to refer to the same business. A remaining mismatch from the initial 10 m matching attempt is “49th parallel cafe & lucky’s doughnuts” in OSM vs “49th Parallel Coffee Roasters” in VSI. This one requires further inquiry to see if the OSM node should really be updated or not.

I’m not totally satisfied with this 10 m threshold though. It’s arbitrary and not really what I’m looking for. I think I’ll redo the analysis using “nearest VSI business within 100 m”, so that each OSM node will always match at most one VSI store and it won’t be so strict on the distance.

Vancouver Cafe Scene Project Day 4

Posted by villasv on 9 July 2022 in English.

Received some valuable feedback from the imports mailing list on data quality and on the level of OSM experience expected from someone before they execute large scale automated data imports. I was pretty much all set in terms of data quality concerns, but it looks like I would need a bit more hand-holding from more experienced mappers and importers to properly execute a big import.

This is not a problem, though. It’s very reasonable and thankfully not a deal breaker to me because I chose a scope small enough that it’s feasible to execute this import manually instead of automatically. In fact, my previous estimate that the VSI had about 570 coffee/café related businesses was an overestimation because - rookie mistake - I forgot to deduplicate by survey period.

The new numbers are:

OSM Nodes Matching Coffee/Cafe: 574
VSI Stores Matching Coffee/Cafe: 278
OSM Nodes within 10 m of a VSI Store: 28
OSM Nodes within 25 m of a VSI Store: 195
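The deduplication by survey period mentioned above could look roughly like the Python sketch below. The field names `store_id` and `year` are hypothetical (the real VSI columns may differ), and keeping the most recent survey per store is an assumption about what "deduplicate" means here.

```python
def latest_per_store(rows):
    """Keep one record per store: the one from the most recent survey period.

    rows is an iterable of dicts with hypothetical keys "store_id" and "year".
    """
    latest = {}
    for row in rows:
        key = row["store_id"]
        # Replace the kept record only when this one is from a later survey
        if key not in latest or row["year"] > latest[key]["year"]:
            latest[key] = row
    return list(latest.values())
```

Counting the rows before and after this step shows how much the repeated surveys inflate the total.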

So yeah, lots of nearby matches to investigate. Now is the time to start fuzzy matching the business names, and SK53 provided me with some good reading material on that. It will be a bit challenging to do that with pure SQL (I’m trying to use dbt + BigQuery only for now), but I think it’s worth a try.

Vancouver Cafe Scene Project Day 3

Posted by villasv on 8 July 2022 in English.

I was finally able to start reconciling the Vancouver Storefronts Inventory (VSI from now on) and the OSM nodes. VSI has 578 coffee/café matches, OSM has 574. These numbers are so close, it gives me hope.

When searching for nodes in OSM that have a nearby (<10 m) store in VSI, 54 results come out. Of those, 51 are perfect matches (business name in OSM is the same as in VSI, except for things like “Starbucks” in OSM vs “Starbucks Coffee” in VSI). This isn’t too thrilling, but honestly a near 10% perfect match from the get go is pretty sweet.

Using 10 meters is pretty bold, so I’ll experiment a bit to find a healthy threshold that gives me more matches but doesn’t yield too many false ones. A 25 m radius already jumps to 391 matches and a 50 m radius gives 705, which is obviously too many.
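To make the threshold experiment concrete, here is a small Python sketch that counts node/store pairs within a given radius. It is an illustration under assumptions, not the project's SQL: points are plain `(lat, lon)` tuples, the function names are made up, and an equirectangular approximation replaces a proper geodesic distance, which should be adequate at these short ranges.

```python
import math

def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of the distance in meters between two points."""
    r = 6371000.0  # mean Earth radius in meters
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return r * math.hypot(x, y)

def count_pairs_within(nodes, stores, radius_m):
    """Count (node, store) pairs that are at most radius_m apart."""
    return sum(
        1
        for nlat, nlon in nodes
        for slat, slon in stores
        if approx_distance_m(nlat, nlon, slat, slon) <= radius_m
    )
```

Because this counts pairs, one node can be counted several times as the radius grows, which is how the total can end up larger than the number of nodes.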

If I have the time, I should also probably start getting fancy with fuzzy matching business names to get the obvious non-identical matches out of the way so I can investigate proper mismatches.

Vancouver Cafe Scene Project Day 2

Posted by villasv on 7 July 2022 in English.

Getting familiar with the Vancouver Storefront Inventory dataset. Apparently there are around 578 businesses with names that include Coffee or Cafe/Café, which looks pretty good. Judging by name only is pretty unreliable, but at least the vast majority of the matches are in the Food & Beverages category which is reassuring.

By chance, one of these businesses is already permanently closed (according to other sources): Café Logos. It was literally the first one I randomly selected to investigate, and it’s already data that should not be imported to OSM. Oh well. This is going to be tough.

On the OSM front, today I learned how to use the BigQuery dataset that has a handy table containing the OSM areas/ways as GDAL objects, which I can use to further narrow down the nodes to those inside the region of interest (Vancouver) using ST_DWithin.
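Outside BigQuery, the same idea of keeping only the nodes inside a region can be sketched in Python with a ray casting point-in-polygon test. This is a different technique from the ST_DWithin call mentioned above, shown only to illustrate the filtering step; the function name and the `(lat, lon)` vertex layout are assumptions, and the test treats coordinates as planar, which is a simplification.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray casting test; polygon is a list of (lat, lon) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed by the ray
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses the point's latitude
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside
```

Filtering a list of nodes is then a matter of keeping those for which `point_in_polygon` returns True against the boundary of the region of interest.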

Vancouver Cafe Scene Project Day 1

Posted by villasv on 6 July 2022 in English.

I have an endgame in mind, which is having a complete walkability study of a bunch of major cities in the world, one that competes in quality with walkscore.com but is built entirely on open data.

That’s too grand for me to accomplish in a year of less-than-part-time effort. So I’ve decided to scope it down to a single city I care about, which is Vancouver. How hard can it be to analyze walkability in a single city? Well, pretty damn hard actually.

The very first thing I’d like to consider when analyzing walkability is proximity to amenities like cafés, markets and pharmacies. Turns out OSM seems to have pretty good coverage of shops in Vancouver already, but for me to be very confident in my analysis the study starts with an evaluation of data coverage.

If I’m going to compare OSM with an official source, say Vancouver’s Storefront Inventory, whatever the coverage might be… I might as well import what’s missing? I think I owe OSM this much, and it will be nice to say that the whole data used in the walkability study is from OSM instead of from multiple sources.

The thing is, my past experience with data imports is limited to a single one I made for Wikidata of Higher Education Institutions in Brazil, and it took a whole month to finish it. I had more tooling available (OpenRefine is very integrated with Wikidata), more time available (5 years ago I had more energy) and the data was much more straightforward (no mapping involved, just categorization)…

Considering all that, I’d estimate it would take me about 3 months to import the whole thing following proper procedures. So I’ve decided to scope it down once again, to a subset of the data that I can reasonably scan through manually. I’ll start with coffee shops / cafés only. That I think should bring the estimate back down to a single month or so. Hopefully.