Hi everyone, this is the update on the final phase of my project: adding transliteration support to Nominatim’s search results! A quick refresher: this project focused on offering transliteration as an option for users who cannot read the local language of a name and for whom no name tag in a language they understand is available.
Background
For background, you can check the overview of the project and the midterm report down below:
The bulk of the work can be found in these pull requests:
- Locales and result class refactorization
- Locales documentation update (for refactored code)
- Transliteration integration
Detailed Report of the Project
The detailed version of the report can be read here (version pending GitHub commit).
What I did
- Integrated transliteration into Nominatim so search results in unfamiliar scripts (e.g. 北京市) can be displayed in a user-readable form (e.g. Beijing).
- Built a pluggable transliteration framework supporting Latin script via unidecode, with prototypes for Cantonese, Simplified Chinese, and Traditional Chinese (a rough sketch of the idea follows this list).
- Refactored the `Locales` class and results pipeline for clearer responsibilities, modularity, and maintainability.
- Introduced a `languages.yaml` configuration file for language normalization and country-language mapping.
- Implemented new logic for parsing browser language headers, including handling of ambiguous codes like `zh` (also sketched after this list).
- Wrote extensive unit tests and updated GitHub workflows for optional dependencies.
- Added documentation to explain the new localization and transliteration system.
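To make the “pluggable” idea concrete, here is a minimal sketch of what such a framework could look like, assuming only the unidecode package. The class and function names (`Transliterator`, `LatinTransliterator`, `display_name`) are illustrative, not the actual Nominatim API.

```python
# Minimal sketch of a pluggable transliteration interface.
# Names here are illustrative, not the actual Nominatim classes;
# only the unidecode dependency is real.
from abc import ABC, abstractmethod
from typing import Optional

from unidecode import unidecode


class Transliterator(ABC):
    """Turns a name written in a local script into a user-readable form."""

    @abstractmethod
    def transliterate(self, name: str) -> str:
        ...


class LatinTransliterator(Transliterator):
    """Generic fallback: approximate any script with Latin characters."""

    def transliterate(self, name: str) -> str:
        return unidecode(name)


def display_name(name: str, translit: Optional[Transliterator]) -> str:
    """Use the transliterator when one is configured, otherwise keep the name."""
    return translit.transliterate(name).strip() if translit else name


print(display_name('北京市', LatinTransliterator()))  # roughly "Bei Jing Shi"
```

Language-specific implementations (e.g. for Cantonese or Traditional Chinese) would then subclass the same interface and be selected per language code.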
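The header-parsing bullet also lends itself to a small example. The snippet below is not the code that was merged; it is a simplified sketch of how an Accept-Language header could be ordered by quality value and how an ambiguous code such as `zh` might be expanded, with the expansion table standing in for what `languages.yaml` provides.

```python
# Simplified Accept-Language parsing, for illustration only.
# AMBIGUOUS_CODES is a stand-in for configuration such as languages.yaml;
# the zh expansion below is an assumed example, not the project's actual mapping.
import re
from typing import List

AMBIGUOUS_CODES = {'zh': ['zh-Hans', 'zh-Hant']}


def parse_accept_language(header: str) -> List[str]:
    """Return language codes sorted by quality value, expanding ambiguous ones."""
    entries = []
    for part in header.split(','):
        match = re.match(r'\s*([A-Za-z-]+)(?:\s*;\s*q=([0-9.]+))?', part)
        if match:
            lang, q = match.group(1), float(match.group(2) or 1.0)
            entries.append((q, lang))

    ordered = []
    for _, lang in sorted(entries, key=lambda entry: -entry[0]):
        ordered.extend(AMBIGUOUS_CODES.get(lang, [lang]))
    return ordered


print(parse_accept_language('zh;q=0.9, en-GB, en;q=0.8'))
# ['en-GB', 'zh-Hans', 'zh-Hant', 'en'] under the assumed expansion
```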
Possible Next Steps
A few possible next steps are summarized below:
- Improve regionalization (e.g. Hong Kong and Macau, which Nominatim does not yet recognize as independent from China).
- Refine fallback logic when multiple languages are present.
- Extend the non-Latin transliteration framework with more language-specific implementations.
- Expand testing for robustness and reliability.
I have started a draft to address some of these issues in a new pull request! Work on this is far from done, though, so I would appreciate any comments or concerns!
What I Have Learned
This project ran from June 2025 to August 2025. I started by exploring Nominatim’s internals and designing the transliteration pipeline, then moved on to implementation and refactoring, and finished with integration and testing. Along the way, I became much more comfortable with Git (rebasing, resolving conflicts, managing branches) and debugging large systems by tracing code paths. I also gained experience with static analysis tools like mypy and linters, which were invaluable for enforcing type safety and consistency. Overall, this project sharpened my technical debugging and collaborative development skills, and gave me confidence working in a large, evolving codebase.
Acknowledgments
I’m super grateful to my mentors Sarah Hoffman (@lonvia) and Marc Tobias (@mtmail) for such an amazing summer and for passing on to me even a small fraction of their wealth of knowledge! It was incredible working with people who have such a deep understanding of open source and geocoding; they taught me so much this summer and put endless hours into helping me develop my own skills! I am also thankful to Google Summer of Code and the OpenStreetMap Foundation for making this opportunity possible!
Discussion
Comment from mboeringa on 31 August 2025 at 08:59
Seems like a really nice and useful project you have been working on, that will enhance the user experience a lot.
One question though: will this work also make searching a bit more flexible, in the sense of allowing users to find places even if they enter a minor typo? E.g. if you currently enter a place name on the OpenStreetMap website with even a single character mismatch, no results or suggestions will pop up. It would be nice to have suggestions in such cases. I realize this is likely a whole different project, but maybe you have something to say about this based on your experience working on this one.
Comment from Sucheta_India on 6 December 2025 at 13:39
Hey! This looks like a really cool and genuinely useful project. I was curious — is it currently focused only on the China-specific dataset/language base, or are you planning to expand it further? If you’re open to it, I’d love to contribute additional language support, especially for Indian languages, to help improve search quality and make the tool more widely accessible. Would contributions in that direction be welcome?