A deep dive into Unicode and string matching - variation selectors
Earlier this week I learned about token bombing attacks[^1], in which a user prompts an LLM with an arbitrarily long byte stream that renders as a single character or emoji. The attack exploits "variation selectors" - special Unicode code points that modify a character's appearance - and makes a good follow-up to my earlier posts about the Unicode standard[^2][^3].
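
To make the size mismatch concrete, here is a minimal Python sketch, not a reproduction of any particular attack. It appends many copies of VS16 (U+FE0F) to a single emoji and compares the code point and byte counts. The helper names `pad_with_selectors` and `is_variation_selector` are my own, made up for illustration, and whether the result actually displays as one glyph depends on the renderer.

```python
# Variation selectors occupy two ranges:
#   U+FE00..U+FE0F   (VS1-VS16)
#   U+E0100..U+E01EF (VS17-VS256, the Variation Selectors Supplement)

def is_variation_selector(ch: str) -> bool:
    """Hypothetical helper: True if `ch` is a variation selector code point."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def pad_with_selectors(base: str, count: int) -> str:
    """Hypothetical helper: append `count` copies of VS16 (U+FE0F) to `base`."""
    return base + "\uFE0F" * count


# One emoji (U+1F600) followed by 1000 variation selectors.
padded = pad_with_selectors("\U0001F600", 1000)

print(len(padded))                  # 1001 code points
print(len(padded.encode("utf-8")))  # 3004 bytes: 4 for the emoji + 3 per selector
```

The string carries about three kilobytes of UTF-8, yet all but the first code point are selectors that add nothing visible on their own.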