Hard Negation

A small Rust and spaCy pressure gauge for catching structural collapse in generated text.

I did not set out to build a negation dataset because models could not understand the word not.

For the most part, frontier LLMs had solved the obvious version of that problem. Give them a clean sentence, ask whether the negated form changes the meaning, and they usually behave. The public datasets around negation reflected that world: useful historically, but brittle, narrow, and a little too easy to memorize.

The failure I cared about was harder.

Embedding models and retrieval systems still collapse under pairs that share almost all their surface material while disagreeing in the one place that matters. The sentence looks close. The tokens overlap. The entities match. Cosine similarity smiles and nods. But the meaning has turned.

That is not a spelling problem. It is not a keyword problem. It is a structural problem.

So I stopped looking for more negation data and started making harder negation.

The Easy Data Was Too Easy

Bad negation data teaches the wrong lesson. If every example is just:

The policy applies.
The policy does not apply.

then the model learns a cheap alarm around not. That helps on toy tests and fails in the wild. Real negation is not always direct. Sometimes it lives in scope, conditionals, absence, failed expectations, exceptions, temporal shifts, comparatives, or a sentence that never says not but still reverses the proposition.

The goal was not to teach a model to recognize a negation token.

The goal was to make it stop confusing overlap with meaning.

That distinction changed the whole pipeline. The examples needed to be semantically sharp, not merely grammatical. The positives needed to preserve the intended meaning. The hard negatives needed to be close enough to tempt the embedding space, but different enough that a human would reject the match immediately.

In other words, the data had to be mean.

The Gauntlet

LLM-generated data has its own failure mode. Leave a generator running long enough and it starts to find grooves. The wording changes, but the move repeats. Same sentence shape. Same contrast. Same trick in a different jacket.

That is semantic collapse.

The gauntlet was built to catch it. Every candidate had to survive multiple checks: semantic separation, lexical overlap, polarity, near-duplicate filters, hard-negative overlap, and syntactic diversity. I did not want a pile of fluent examples. I wanted examples that remained hard after the cheap shortcuts were stripped away.
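To make that shape concrete, here is a minimal sketch of a gauntlet pass over one candidate pair. Everything in it is illustrative rather than the project's actual code: the Candidate struct, run_gauntlet, and the overlap thresholds are placeholders, and only the cheap string-level checks are spelled out; the semantic-separation and polarity checks would sit behind model calls.

```rust
use std::collections::HashSet;

struct Candidate {
    positive: String,
    hard_negative: String,
}

/// Jaccard overlap over lowercase whitespace tokens.
fn lexical_overlap(a: &str, b: &str) -> f64 {
    let ta: HashSet<String> = a.split_whitespace().map(|t| t.to_lowercase()).collect();
    let tb: HashSet<String> = b.split_whitespace().map(|t| t.to_lowercase()).collect();
    let union = ta.union(&tb).count() as f64;
    if union == 0.0 {
        return 0.0;
    }
    ta.intersection(&tb).count() as f64 / union
}

/// Returns Ok(()) if the pair survives, or the reason it was rejected.
fn run_gauntlet(c: &Candidate, seen: &mut HashSet<String>) -> Result<(), &'static str> {
    // Hard negatives should share most of their surface material, but not all of it.
    let overlap = lexical_overlap(&c.positive, &c.hard_negative);
    if overlap < 0.5 {
        return Err("not tempting enough: too little lexical overlap");
    }
    if overlap > 0.95 {
        return Err("near-identical surface: likely a trivial edit");
    }
    // Near-duplicate filter against everything already accepted in this batch.
    if !seen.insert(c.hard_negative.to_lowercase()) {
        return Err("duplicate hard negative");
    }
    // Semantic separation, polarity, and syntactic-diversity checks would run here.
    Ok(())
}

fn main() {
    let mut seen = HashSet::new();
    let pair = Candidate {
        positive: "The policy applies to contractors.".into(),
        hard_negative: "The policy applies to everyone except contractors.".into(),
    };
    println!("{:?}", run_gauntlet(&pair, &mut seen));
}
```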

A speed nerd met spaCy and said… what if we did it this way?

That was one part of the answer: a Rust crate, now split out as warp_pos_spacy, that accelerates the POS-window scoring path around spaCy. spaCy still does the linguistic tagging. Rust does the hot scoring math: hash the part-of-speech window around an anchor, count the repeated shapes, and compute whether the batch is turning into a template monoculture.
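As a rough sketch of that hot path (illustrative only, not the warp_pos_spacy interface), the scoring side reduces to a window hash and a counter once spaCy has already produced the tags. The function names, the window radius, and the "top shape share" statistic below are my own placeholders.

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash the POS tags in a window of `radius` tokens around `anchor`.
fn window_hash(pos_tags: &[&str], anchor: usize, radius: usize) -> u64 {
    let start = anchor.saturating_sub(radius);
    let end = (anchor + radius + 1).min(pos_tags.len());
    let mut h = DefaultHasher::new();
    pos_tags[start..end].hash(&mut h);
    h.finish()
}

/// Share of the batch taken by the single most common window shape.
/// A value creeping toward 1.0 means the generator is settling into one template.
fn top_shape_share(batch: &[(Vec<&str>, usize)], radius: usize) -> f64 {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for (tags, anchor) in batch {
        *counts.entry(window_hash(tags, *anchor, radius)).or_insert(0) += 1;
    }
    let max = counts.values().copied().max().unwrap_or(0);
    max as f64 / batch.len().max(1) as f64
}

fn main() {
    // Each entry: POS tags already produced by spaCy, plus the index of the negation anchor.
    let batch = vec![
        (vec!["PRON", "AUX", "PART", "VERB", "DET", "NOUN"], 2),
        (vec!["PRON", "AUX", "PART", "VERB", "DET", "NOUN"], 2),
        (vec!["DET", "NOUN", "VERB", "SCONJ", "PRON", "AUX", "PART", "VERB"], 6),
    ];
    println!("top shape share: {:.2}", top_shape_share(&batch, 2));
}
```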

On that scoring job, the Rust path was roughly 1,500x faster than the Python loop it replaced. That is not because Rust is magically better at language. It is because the language work had already been reduced to arrays, anchors, hashes, and counts. Once the problem became that shape, Rust could run it at memory speed.

The Part That Generalizes

Negation was the wedge, not the whole tool.

The more useful idea is smaller and broader: generated text needs a structural diversity meter.

Anyone generating text at scale eventually runs into the same quiet failure. The examples look different enough at the surface, but underneath they are repeating the same skeleton. Synthetic training data does it. Evaluation questions do it. RAG hard negatives do it. Prompt batches do it. Agent traces do it. The model finds a comfortable move and wears different clothes over it.

Catching that failure does not always require a bigger model, a heavier pipeline, or a perfect theory of meaning. Sometimes it just needs a cheap pressure gauge that says:

This pattern is taking over.

That is what the crate is becoming. Still small. Still boring. More useful.

The original path scored windows around negation anchors. The broader path can score any anchor you care about: a negation token, an answer span, an entity, a number, a citation, a verb, or a domain-specific hinge. If you do not know the anchor, it can scan every token position and show the most common structural patterns in the batch.

The important output is no longer just a hash. A hash is fine for a machine. A person needs to see the repeated move:

PRON AUX PART VERB DET NOUN

That is the difference between a metric and a diagnostic.
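Here is a sketch of what that anchor-free, human-readable mode could look like, again illustrative rather than the crate's real interface: slide a fixed-width window across every token position, keep the joined tag string as the key instead of a hash, and rank the shapes by how often they recur.

```rust
use std::collections::HashMap;

/// Count every `width`-token POS pattern in the batch and return them
/// sorted from most to least frequent, as human-readable strings.
fn pattern_census(batch: &[Vec<&str>], width: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for tags in batch {
        for win in tags.windows(width) {
            *counts.entry(win.join(" ")).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<_> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1));
    ranked
}

fn main() {
    let batch = vec![
        vec!["PRON", "AUX", "PART", "VERB", "DET", "NOUN"],
        vec!["PRON", "AUX", "PART", "VERB", "DET", "NOUN", "ADV"],
        vec!["DET", "NOUN", "AUX", "PART", "VERB", "ADP", "PROPN"],
    ];
    // Print the three most common structural shapes in the batch.
    for (pattern, count) in pattern_census(&batch, 4).into_iter().take(3) {
        println!("{count:>3}  {pattern}");
    }
}
```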

Why POS Still Matters

There is a lazy version of the current AI story that says structure is obsolete because vectors won.

I do not buy it.

Vectors are excellent at soft similarity. They are bad at caring about one small structural hinge when the rest of the sentence is screaming “same thing.” Negation is full of hinges. So are numbers, dates, roles, causality, modality, and scope.

Part-of-speech tags are not glamorous, but they are exactly the kind of boring instrument that keeps a generator honest. If a model produces 10,000 examples with the same hidden grammar, the POS windows reveal the repetition. If a batch spreads across many grammatical shapes, the diversity metrics move.
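One cheap way to make a diversity metric move, shown purely as an illustration (the crate may well compute something different): normalized entropy over the pattern counts, which sits near 1.0 when the shapes are evenly spread and sinks toward 0.0 when one hidden grammar takes over the batch.

```rust
/// Normalized Shannon entropy over POS-pattern counts.
/// 1.0 means shapes are evenly spread; values near 0.0 mean one shape dominates.
fn pattern_entropy(counts: &[usize]) -> f64 {
    let total: usize = counts.iter().sum();
    if total == 0 || counts.len() < 2 {
        return 0.0;
    }
    let h: f64 = counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.ln()
        })
        .sum();
    h / (counts.len() as f64).ln() // divide by the maximum possible entropy
}

fn main() {
    // 10,000 examples spread over four shapes vs. dominated by one shape.
    println!("{:.2}", pattern_entropy(&[2500, 2500, 2500, 2500])); // ~1.00
    println!("{:.2}", pattern_entropy(&[9700, 100, 100, 100]));    // ~0.12
}
```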

That signal is not the whole truth. It is a pressure gauge.

And in a generator, pressure gauges matter.

What I Wanted

I wanted a dataset that would not flatter the model.

I wanted hard negatives that were not just wrong, but wrong in ways embedding systems are tempted to accept. I wanted data that could push against the specific failure where “almost the same sentence” becomes “same meaning.” I wanted a generator that could be told, in effect: stop using that trick, I saw it already.

That is why the POS crate mattered. It made the syntactic feedback loop cheap enough to run continuously. If the generator drifted into a pattern, the gauntlet could see it quickly and push back.

The result was less like collecting examples and more like training a sparring partner.

Not every punch should look the same.

Not every negation should wear the word not on its sleeve.

Not every fluent pair deserves to survive.

That is the point of hard negation: not to prove the model can read a negation marker, but to force it to care when meaning turns on a hinge.

Something in this log entry sparking an argument, partnership idea, or infrastructure war story?

Open a channel