
March 28, 2026

MLIR as the Basis for a Search Engine

Search engines like Hoogle and Loogle are great for quick high-level queries. But there aren’t many (any?) options out there for deep semantic searches of type definitions and proof structures for either Lean or Rocq, let alone across both at once.

ProofSouq’s search engine aims to fill that gap, for both manual and automated use cases.

This post describes ProofSouq’s use of MLIR in building a search index underlying such a search engine. MLIR provides a uniform IR representation that can be Merkle hashed, indexed, and queried against. This IR currently represents Lean and Rocq constructs, and is easily expandable to others in the future, making searches across these languages possible.

The inner workings are described below: the way in which MLIR operations are converted into index entries, how those entries are identified across semantic scopes, how queries are matched against the index, and how results are ranked. The Lean/Rocq specifics are incidental: the same approach will apply to any MLIR-based dialect(s) which one might be interested in indexing.

Overview

Corpus documents and user queries are processed via the same pipeline. Both are compiled into MLIR, lowered via a first pass, and canonicalized via a second pass. The resulting MLIR is then walked, and protobufs are emitted, one per index entry to be added to the search index.

At a high level, the data flow looks like this:

Source document or user query
    ↓ (Lean/Rocq compiler)
MLIR (`lean`, `rocq`, `base`, etc. dialects)
    ↓ (lowering and canonicalization passes)
MLIR (`search`, `base`, etc. dialects)
    ↓ (MLIR walk)
C++ structs/classes
    ↓ (tree walk)
Protobuf
    ↓ (index builder)
SearchIndexStore

MLIR is built up via Java interop calls to a C++ shared library. Java then invokes the same library to run the MLIR passes and walk IR trees; it receives a protobuf in return, which it then processes for addition into the search index.

The search and base MLIR dialects

Language-specific constructs must either be expressible directly as operations/types from these dialects, or be lowerable to them (in other words, convertible). Without this, it isn’t possible to unify constructs with the same semantics across dialects.

Generally, we’re interested in preserving as much information as possible in order to serve a wide variety of query shapes, but we typically want to throw away syntax, or anything else that doesn’t contribute to an entity’s meaning or shape.

One thing to note about the search dialect: there must be some way to express the absence of something (an operation, a type, etc.), so that users can query for partial terms like `8 *` or `Inductive foo` and still match entries in the index.
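One plausible encoding, sketched below with hypothetical names (this is not ProofSouq’s actual implementation): the compiled query represents missing content as a dedicated “hole”, whose reserved hash acts as a wildcard at match time:

```cpp
#include <cstdint>

// Illustrative only: a query like "8 *" compiles to mul(8, hole), where
// the hole stands in for the absent operand. Its reserved hash value is
// treated as a wildcard when comparing query hashes against entry hashes.
constexpr uint64_t kHole = 0x9e3779b97f4a7c15ULL;

bool hashesMatch(uint64_t queryHash, uint64_t entryHash) {
  // A hole in the query matches any entry; otherwise require equality.
  return queryHash == kHole || queryHash == entryHash;
}
```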

Scope-Based Hashing

The core of the search index builder is a walk over the MLIR tree/graph which, for each operation, produces a set of scope-addressed hashes.

Think of a scope as being akin to an angle of observation: a “match” can mean different things depending upon the circumstances. Two theorems might share a proposition but have different names. Two expressions might have the same structural shape but use different concrete values. An index builder that only produces a single hash per entity cannot distinguish between these cases.

The solution is to compute separate hashes for each semantic dimension (scope) of an entity’s content:

  • Identity: identifiers, symbol names, declarations (the names given to things)
  • Value: concrete values, numerals, constants
  • Type: type signatures, sort annotations
  • Structural: operation shape, e.g. binary addition, unary negation, theorem vs. lemma, etc.

Each terminal in the MLIR tree/graph is classified into one of these scopes based on what it represents. When the walk visits a node, it groups its children by scope and computes a hash to identify that node within each particular scope.

At the end, we’re left with a collection of index entries, each identified by one or more scope-specific addresses, and with each address inclusive of the corresponding scoped addresses of that entry’s children. Our query engine can therefore know in O(1) time whether a search term matches any terms in the index, and if so, in what ways (structurally, nominally, etc.).
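As a rough sketch of the idea (names and hashing details are illustrative, not ProofSouq’s actual code), a per-scope hash for a node can be computed by folding in the children’s hashes for each scope, with the operation’s shape contributing only to the structural scope:

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// The four semantic dimensions described above.
enum Scope { kIdentity = 0, kValue, kType, kStructural, kNumScopes };

struct Node {
  std::array<uint64_t, kNumScopes> scopeHash{};  // one address per scope
};

// Standard boost-style hash-combining step.
uint64_t combine(uint64_t seed, uint64_t h) {
  return seed ^ (h + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// A terminal contributes its content hash only to the scope it belongs to.
Node makeLeaf(Scope scope, const std::string& content) {
  Node n;
  n.scopeHash[scope] = std::hash<std::string>{}(content);
  return n;
}

// An interior node folds in, per scope, its children's hashes for that
// scope; the operation's shape is counted only in the structural scope.
Node makeNode(const std::string& opName, const std::vector<Node>& children) {
  Node n;
  uint64_t opHash = std::hash<std::string>{}(opName);
  for (int s = 0; s < kNumScopes; ++s) {
    uint64_t h = (s == kStructural) ? opHash : 0;
    for (const Node& c : children) h = combine(h, c.scopeHash[s]);
    n.scopeHash[s] = h;
  }
  return n;
}
```

Two nodes that share a shape but differ in concrete values will then agree on their structural hash while disagreeing on their value hash.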

Canonicalization

For some constructs, order matters, and for others it does not. Plenty of binary operations are commutative (e.g., addition and multiplication on naturals and integers), type parameters can be specified in any order, and so on. If a user searches for a * b they’re most likely interested in finding b * a as well.

MLIR is great at supporting this. Traits like Commutative can be added to relevant operations:

```tablegen
def MulOp : Base_BinaryOp<"mul", [Commutative]>;
def AddOp : Base_BinaryOp<"add", [Commutative]>;
def EqOp : Base_BinaryOp<"eq", [Commutative]>;
def OrOp : Base_BinaryOp<"or", [Commutative]>;
```

And then used at canonicalization-time like so:

```cpp
auto matchAndRewrite(mlir::Operation* op,
                     mlir::PatternRewriter& rewriter) const -> mlir::LogicalResult override {
  // Only commutative operations are eligible for operand reordering.
  if (!op->hasTrait<mlir::OpTrait::IsCommutative>())
    return mlir::failure();
  // Reorder the operands into a canonical order...
  return mlir::success();
}
```

This is probably the simplest example; notions like type parameter commutativity are doable, but the specifics of handling them depend more on how the parent operations of those parameters are defined.

Regardless of the specifics, the goal is to reorder/rewrite relevant entities in the MLIR graph/tree into a canonical form such that all canonically “equivalent” entities receive the same hash.
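To illustrate the effect of such a rewrite (a sketch with made-up names, not the actual pass): sorting a commutative operation’s operand hashes before combining them makes `a * b` and `b * a` hash identically:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Standard boost-style hash-combining step (order-sensitive).
uint64_t combine(uint64_t seed, uint64_t h) {
  return seed ^ (h + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// For commutative operations, operand hashes are sorted first, mirroring
// the canonicalization rewrite; non-commutative operands keep their order.
uint64_t hashOp(const std::string& opName, std::vector<uint64_t> operandHashes,
                bool isCommutative) {
  if (isCommutative)
    std::sort(operandHashes.begin(), operandHashes.end());
  uint64_t h = std::hash<std::string>{}(opName);
  for (uint64_t oh : operandHashes) h = combine(h, oh);
  return h;
}
```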

Persisting the Index in Java

When the Java layer receives the new index entries in protobuf form from the C++ layer, it persists them to a SearchIndexStore containing a collection of maps:

```java
// ...
private final Multimap<IndexAddress, IndexEntry> index;
// ...
```

Some details have been elided, but with the main index expressed as a Multimap, the query engine is now able to run through a collection of hashes for a user’s search term and collect all of the content matching those hashes.
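A C++ analogue of the same idea (illustrative only; the real store lives in Java and holds protobuf-derived entries) is a hash multimap keyed by (scope, hash) addresses:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical address: a scope tag plus the hash computed for that scope.
struct IndexAddress {
  int scope;
  uint64_t hash;
  bool operator==(const IndexAddress& o) const {
    return scope == o.scope && hash == o.hash;
  }
};

struct AddressHasher {
  size_t operator()(const IndexAddress& a) const {
    return a.hash ^ (static_cast<uint64_t>(a.scope) * 0x9e3779b97f4a7c15ULL);
  }
};

class SearchIndexStore {
 public:
  void add(const IndexAddress& addr, const std::string& entry) {
    index_.emplace(addr, entry);
  }

  // All entries whose content hashed to this address, in O(1) average time.
  std::vector<std::string> lookup(const IndexAddress& addr) const {
    std::vector<std::string> out;
    auto [lo, hi] = index_.equal_range(addr);
    for (auto it = lo; it != hi; ++it) out.push_back(it->second);
    return out;
  }

 private:
  std::unordered_multimap<IndexAddress, std::string, AddressHasher> index_;
};
```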

Querying

A user’s query string is run through the same pipeline as a source document: parsed into MLIR, lowered and canonicalized, walked, and emitted as protobuf index entries. Instead of persisting those index entries, the query engine visits them in priority order, and builds up a collection of matches by repeatedly querying the search index store.

After every address has been visited, the aggregated matches are themselves ordered, both with respect to the scopes involved and with respect to the percentage of matching content within each scope. With this approach, hits that fail to match one scope of a user’s query but fully match all the others are prioritized over hits that match every scope, but each only in part.
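One way to realize that ordering (the actual ranking policy may differ; names here are hypothetical) is to sort hits by the number of fully-matched scopes first, breaking ties on overall match percentage:

```cpp
#include <algorithm>
#include <array>
#include <vector>

constexpr int kNumScopes = 4;  // identity, value, type, structural

struct Match {
  std::array<double, kNumScopes> scopePct{};  // match fraction per scope, 0..1
};

// Number of scopes this hit matches completely.
int fullScopes(const Match& m) {
  int n = 0;
  for (double p : m.scopePct) n += (p == 1.0);
  return n;
}

double overallPct(const Match& m) {
  double s = 0;
  for (double p : m.scopePct) s += p;
  return s / kNumScopes;
}

// Hits with more fully-matched scopes rank first; ties break on overall
// match fraction.
void rank(std::vector<Match>& matches) {
  std::sort(matches.begin(), matches.end(), [](const Match& a, const Match& b) {
    if (fullScopes(a) != fullScopes(b)) return fullScopes(a) > fullScopes(b);
    return overallPct(a) > overallPct(b);
  });
}
```

Under this policy, a hit that misses one scope entirely but fully matches the other three outranks one that matches all four scopes only partially.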

Result Presentation

Each match carries:

  • Locations: the locations (line/column offsets) involved
  • Location scopes: each location’s scope
  • Match percentages: per-scope and overall

Each location belongs to a terminal from the input and has a scope, which allows for highly detailed highlighting of every match. Rather than displaying a single yellow highlight across a broad swath of code, ProofSouq can highlight only the parts that match, in colors corresponding to the scopes in which they match.
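A minimal sketch of this presentation step (the palette and names are invented for illustration): each (location, scope) pair is mapped to a scope-specific highlight color:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Location { int line, startCol, endCol; };

struct Highlight { Location loc; std::string color; };

// Turns per-terminal (location, scope) pairs into colored highlight
// spans, rather than one blanket highlight over the whole match.
std::vector<Highlight> buildHighlights(
    const std::vector<std::pair<Location, std::string>>& scopedLocations) {
  // One color per scope; the palette here is hypothetical.
  static const std::map<std::string, std::string> palette = {
      {"identity", "blue"}, {"value", "green"},
      {"type", "purple"}, {"structural", "orange"}};
  std::vector<Highlight> out;
  for (const auto& [loc, scope] : scopedLocations) {
    auto it = palette.find(scope);
    if (it != palette.end()) out.push_back({loc, it->second});
  }
  return out;
}
```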

Future Path

The approach described above lends itself well to queries submitted by proof authors and developers, and may also turn out to be fruitful when incorporated into automation. The per-scope matches reported by the query engine should be granular enough to drive an initial foray into proof/term (re)writing: by submitting open goals to the query engine, and attempting to adopt (or adapt) each match in priority order, it may be possible to meaningfully narrow the search space for a given open goal, and thereby reduce the effort/time required to close it.