A deep dive into unicode and string matching - variation selectors

Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

Earlier this week I learned about token bombing attacks[1], where users prompt LLMs with arbitrarily long byte streams that render as a single character or emoji. This is done by exploiting "variation selectors" - special Unicode code points that modify character appearance - and it makes a good follow-up to my earlier posts about the Unicode standard[2][3].

Understanding variation selectors

Variation selectors are special unicode code points that modify how the preceding character is displayed[4]. While primarily designed to control character appearance in different contexts, these selectors can be used - or misused - in interesting ways. Before exploring those implications, let's look at how they work.

Consider these examples, where a variation selector switches the same character between a text (typographical) presentation and an emoji (graphic) presentation:

☎︎ vs ☎️

✔︎ vs ✔️

⭐︎ vs ⭐️

⚡︎ vs ⚡️

Despite appearances, each glyph comprises two code points (remember that a code point is really a number, typically written as U+ followed by a hexadecimal number):

U+260E U+FE0E vs U+260E U+FE0F

U+2714 U+FE0E vs U+2714 U+FE0F

U+2B50 U+FE0E vs U+2B50 U+FE0F

U+26A1 U+FE0E vs U+26A1 U+FE0F

In reality any number of variation selectors can be chained together. For example this single visible glyph: ☎︀︁︂︃︄︅︆︇︈︉︊︋︌︍︎️ contains 17 codepoints.

U+260E U+FE00 U+FE01 U+FE02 U+FE03 U+FE04 U+FE05 U+FE06 U+FE07 U+FE08 U+FE09 U+FE0A U+FE0B U+FE0C U+FE0D U+FE0E U+FE0F
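Counts like this are easy to verify programmatically. The following is a minimal Python sketch; the helper name `code_points` is purely illustrative, and the string is built from escape sequences so that the invisible selectors are explicit:

```python
from typing import List


def code_points(text: str) -> List[str]:
    """Return every code point of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The telephone sign followed by all 16 selectors of the U+FE00..U+FE0F block.
chained = "\u260E" + "".join(chr(cp) for cp in range(0xFE00, 0xFE10))

print(len(chained))          # 17 code points behind one visible glyph
print(code_points(chained))  # starts with U+260E, ends with U+FE0F
```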

Variation selectors can be applied to any character, for example, this word "H️E️L️L️O️" actually has 10 codepoints instead of the expected 5 (one for each letter).

U+0048 U+FE0F U+0045 U+FE0F U+004C U+FE0F U+004C U+FE0F U+004F U+FE0F

The unicode standard specifies two blocks[5] for variation selectors:

  • The variation selectors block (16 code points: U+FE00 to U+FE0F)
  • The variation selectors supplement block (240 code points: U+E0100 to U+E01EF)

Together, these blocks amount to 256 code points.
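As a quick sanity check, the two ranges can be written down and counted in Python. This is only a sketch; the constant and function names are my own and do not come from any library:

```python
# Variation Selectors block: U+FE00..U+FE0F (range end is exclusive)
VS_BLOCK = range(0xFE00, 0xFE10)
# Variation Selectors Supplement block: U+E0100..U+E01EF
VS_SUPPLEMENT_BLOCK = range(0xE0100, 0xE01F0)


def is_variation_selector(ch: str) -> bool:
    """True if the single character `ch` falls in either block."""
    cp = ord(ch)
    return cp in VS_BLOCK or cp in VS_SUPPLEMENT_BLOCK


print(len(VS_BLOCK), len(VS_SUPPLEMENT_BLOCK))   # 16 240
print(len(VS_BLOCK) + len(VS_SUPPLEMENT_BLOCK))  # 256
print(is_variation_selector("\uFE0F"))           # True
print(is_variation_selector("A"))                # False
```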

How can this be (mis)used?

This is the basis for Paul Butler's very interesting post "Smuggling arbitrary data through an emoji", where he demonstrates that, with 256 variation selectors available, each selector can stand for one byte value (0 to 255), so any byte sequence - plain ASCII text, for instance - can be hidden after a single visible character.

This is clearly an abuse of the unicode standard; however, it has interesting implications not only for user input validation, but also for steganography and watermarking.
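To make the idea concrete, here is a minimal Python sketch of such an encoding. The function names are hypothetical, and the mapping (byte values 0 to 15 go to the first block, 16 to 255 to the supplement block) is simply one straightforward choice, not something the unicode standard defines:

```python
from typing import Optional


def byte_to_selector(b: int) -> str:
    """Map a byte value (0..255) to one variation selector."""
    if b < 16:
        return chr(0xFE00 + b)
    return chr(0xE0100 + (b - 16))


def selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for anything that is not a selector."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def hide(carrier: str, payload: bytes) -> str:
    """Append `payload` to `carrier` as invisible variation selectors."""
    return carrier + "".join(byte_to_selector(b) for b in payload)


def reveal(text: str) -> bytes:
    """Collect the bytes encoded by every variation selector in `text`."""
    decoded = (selector_to_byte(ch) for ch in text)
    return bytes(b for b in decoded if b is not None)


smuggled = hide("\u260E", b"hello")
print(len(smuggled))     # 6 code points, still one visible glyph
print(reveal(smuggled))  # b'hello'
```

Note that `reveal` treats every selector in the text as payload, so a carrier that already contains a legitimate selector would add a stray byte to the result.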

When it comes to LLMs specifically, input tokens cost money, and despite increasingly generous token windows there are limits - so imagine a scenario where a user inputs a single emoji that expands to thousands of codepoints!
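A rough feel for the size mismatch, as a Python sketch. How many tokens such an input becomes depends on the tokenizer, which is not modelled here, and the limit and helper name below are hypothetical:

```python
# One visible telephone sign followed by 5000 copies of U+FE0F.
payload = "\u260E" + "\uFE0F" * 5000

print(len(payload))                  # 5001 code points
print(len(payload.encode("utf-8")))  # 15003 bytes (3 bytes per code point here)

# Hypothetical guard: cap the input by code points, not by what is visible.
MAX_CODE_POINTS = 1000


def within_budget(text: str, limit: int = MAX_CODE_POINTS) -> bool:
    return len(text) <= limit


print(within_budget(payload))  # False
```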

Database implications

How databases handle these unicode quirks matters for any application that stores user-supplied text. Let's explore this through practical experiments: we run PostgreSQL 17 in Docker, create a table, insert some test values and observe how variation selectors affect constraints, string comparison and normalization.

Setting up the environment

Let's create a test environment using Docker and PostgreSQL (instead of psql you may also use pgAdmin[6], for instance):

# Start PostgreSQL 17
docker run -p 5432:5432 -e POSTGRES_PASSWORD=password -d postgres:17.2-bookworm

# Connect to the database
psql -h localhost -U postgres

Create a test table and insert some initial data:

CREATE TABLE variation_selector_test(name VARCHAR(32) NOT NULL, UNIQUE(name));

INSERT INTO variation_selector_test (name) VALUES ('John Doe');
INSERT INTO variation_selector_test (name) VALUES ('Jane Doe');

Finally let's run a quick check to make sure everything is as expected:

SELECT name, LENGTH(name) FROM variation_selector_test;

This checks the length of each name including spaces, and indeed everything looks normal:

   name   | length
----------+--------
 John Doe |      8
 Jane Doe |      8
(2 rows)

Playing with variation selectors

Now let's add variation selectors to the mix. We are going to insert a name with variation selectors after each character: "J️a️n️e️ ️D️o️e️". Note that if you copy this, the variation selectors will be carried over.

-- Using the actual characters (copy this exactly, as the name contains variation selectors)
INSERT INTO variation_selector_test (name) VALUES ('J️a️n️e️ ️D️o️e️');

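-- Or using escape sequences (the E prefix enables escape sequence interpretation).
-- Run only one of the two statements: both insert the same value, so the second would violate the UNIQUE constraint.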
INSERT INTO variation_selector_test (name) VALUES (E'J\uFE0Fa\uFE0Fn\uFE0Fe\uFE0F \uFE0FD\uFE0Fo\uFE0Fe\uFE0F');

The name was successfully inserted into the database, passing the uniqueness constraint! And if we check the name lengths again:

SELECT name, LENGTH(name) FROM variation_selector_test;

We see that the variation selectors count towards the character limit.

   name   | length
----------+--------
 John Doe |      8
 Jane Doe |      8
 J️a️n️e️ ️D️o️e️ |     16
(3 rows)

The flip side is that a name like "John Doe️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️" looks eight characters long but actually contains more than 32 code points, exceeding the maximum length for this field.

INSERT INTO variation_selector_test (name) VALUES ('John Doe️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️️');

The query will fail with the following error: ERROR: value too long for type character varying(32)

Effects on string comparison

Standard string comparison operations do not treat these visually identical strings as the same value, so none of the following queries matches the row that contains variation selectors:

-- Basic equality comparison
SELECT name FROM variation_selector_test WHERE name = 'Jane Doe';

-- Pattern matching
SELECT name FROM variation_selector_test WHERE name LIKE 'Ja%';

-- Full text search (using PostgreSQL's text search capabilities)
SELECT name FROM variation_selector_test WHERE to_tsvector(name) @@ to_tsquery('Jane');
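The same mismatch shows up outside the database. A small Python sketch of why equality and prefix matching fail (the variable names are illustrative):

```python
plain = "Jane Doe"
# Same visible text, with U+FE0F inserted after every character.
decorated = "".join(ch + "\uFE0F" for ch in plain)

print(len(plain), len(decorated))  # 8 16
print(plain == decorated)          # False: the code point sequences differ
# The prefix "Ja" does not match because the second code point is U+FE0F, not "a".
print(decorated.startswith("Ja"))  # False
print(decorated.startswith("J"))   # True
```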

Can normalization help?

I touched on the topic of unicode normalization forms previously[2]. In a nutshell, comparing strings with combining characters (e.g. diacritics) is not straightforward, and therefore the unicode standard defines four normalization forms (NFC, NFD, NFKC and NFKD) that either break apart composite characters (decomposition) or combine them into composite characters (composition).

PostgreSQL provides unicode normalization functions, but they don't help with variation selectors since these are intentionally preserved for semantic meaning:

SELECT name,
       LENGTH(NORMALIZE(name, NFC))  AS nfc_normalization,
       LENGTH(NORMALIZE(name, NFD))  AS nfd_normalization,
       LENGTH(NORMALIZE(name, NFKC)) AS nfkc_normalization,
       LENGTH(NORMALIZE(name, NFKD)) AS nfkd_normalization
FROM variation_selector_test;

   name   | nfc_normalization | nfd_normalization | nfkc_normalization | nfkd_normalization
----------+-------------------+-------------------+--------------------+--------------------
 John Doe |                 8 |                 8 |                  8 |                  8
 Jane Doe |                 8 |                 8 |                  8 |                  8
 J️a️n️e️ ️D️o️e️ |                16 |                16 |                 16 |                 16
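The behaviour is not specific to PostgreSQL. As a cross-check, here is a short sketch using Python's standard `unicodedata` module, with a combining accent added for contrast to show that normalization is otherwise doing its job:

```python
import unicodedata

# "Jane Doe" with U+FE0F after every character: 16 code points.
decorated = "".join(ch + "\uFE0F" for ch in "Jane Doe")

lengths = {
    form: len(unicodedata.normalize(form, decorated))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(lengths)  # every form still reports 16

# Contrast: "e" + combining acute accent does collapse under NFC.
accented = "e\u0301"
print(len(accented), len(unicodedata.normalize("NFC", accented)))  # 2 1
```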

Where does that leave us?

The purpose and context in which your code is deployed matter a great deal, but in general it's good to keep in mind that:

  • Visually identical strings may have different internal representations;
  • Visual length does not necessarily equal code point length;
  • For mission-critical applications, consider stripping variation selectors entirely unless they serve a legitimate purpose;
  • Test your application with edge cases involving unicode modifiers.
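If stripping is the right call for your application, it can be done with a single regular expression over the two blocks. This is a hedged Python sketch with a hypothetical helper name; keep in mind that it also removes legitimate selectors, so an emoji such as ☎️ falls back to its default presentation:

```python
import re

# Both variation selector blocks: U+FE00..U+FE0F and U+E0100..U+E01EF.
_VARIATION_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")


def strip_variation_selectors(text: str) -> str:
    """Remove every variation selector from `text`."""
    return _VARIATION_SELECTORS.sub("", text)


decorated = "".join(ch + "\uFE0F" for ch in "Jane Doe")
cleaned = strip_variation_selectors(decorated)

print(len(decorated), len(cleaned))  # 16 8
print(cleaned == "Jane Doe")         # True
```

Applying it before uniqueness checks and comparisons, rather than only at display time, is what closes the gap shown in the experiments above.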

Remember that processing text, especially from random strangers on the internet, is hard and, as this post hopefully illustrates, a malicious or mischievous user may create all sorts of "interesting" headaches with plain old unicode strings.


Footnotes

  1. Twitter. Also credit where credit is due, this is a great complementary explanation of what we are going to cover: https://paulbutler.org/2025/smuggling-arbitrary-data-through-an-emoji/

  2. A deep dive into unicode string matching - I

  3. A deep dive into unicode string matching - II

  4. Note that emoji variants work on a similar principle, where the skin tone is essentially a modifier sequence, but in a different unicode block (see https://emojipedia.org/emoji-modifier-sequence).

  5. Quick recap on unicode blocks: a block is a contiguous range of code points that are typically used in a particular language or domain (e.g. mathematics).