On filtering, not flooding, market intelligence
A useful research system is defined as much by what it suppresses as by what it produces.
That claim is easy to miss in the current AI cycle, because most systems are evaluated by fluency, breadth, and output volume. Yet from a scientific perspective, these are secondary properties. The harder and more interesting problem is selection: under conditions of abundant possible signals, which observations deserve to survive, and by what rule?
Macropulze is being developed as a research and teaching platform precisely around that question. The aim is not only to expose analytical tools, but to make research architectures inspectable, discussable, and pedagogically useful. A system is therefore valuable here not merely when it generates plausible output, but when its logic of escalation, retention, and suppression can be made legible.
The Autonomous Research Pipeline, or ARP, should be understood in that context. It draws architectural inspiration from Andrej Karpathy’s AutoResearch framework, not by copying its implementation, but by transferring its deeper discipline: candidate generation alone is insufficient; outputs must pass through a rule of evaluation before they deserve to remain visible.
This logic maps naturally onto market research. Not because the domain resembles neural-network training in any literal sense. It does not. But because both settings confront the same structural problem: many possible analytical paths can be generated, while only a few are worth preserving. What matters, then, is not merely the capacity to produce research, but the capacity to decide which research earns the right to be seen.
ARP is built around precisely that distinction.
From generation to selection
Many AI systems are built around prediction, signal generation, and automated summarization. This is technically impressive, but methodologically incomplete. Once output expands beyond the user’s cognitive budget, the burden of filtration simply returns at the human level.
ARP begins from the opposite premise. A serious research system should not maximize textual production. It should maximize informational discrimination.
In practice, this means three things.
First, the system should scan broadly. It should be able to inspect a watchlist or portfolio as a whole and identify where the state of evidence appears to have shifted.
Second, it should investigate selectively. Deeper analysis should be reserved for those names where the first-pass evidence suggests that something materially changed.
Third, it should report sparingly. A finding should not be surfaced merely because it can be generated. It should be surfaced only if it exceeds a threshold of significance.
That final point is decisive. ARP is not organized around the proposition that more research is always better. It is organized around the proposition that only some findings deserve survival.
For a research and teaching platform, this is not a minor design choice. It is part of the lesson. What matters is not only that a system can produce language on demand, but that it can make clear why one result is elevated, why another is suppressed, and which rule governs the distinction.
The architecture of a research loop
Operationally, ARP runs as a three-phase pipeline. It may be scheduled, triggered manually, or embedded in a broader analytical workflow. The essential logic remains the same.
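That logic can be written down in a few lines. The sketch below is illustrative only, not the Macropulze implementation; the stub bodies, the `top_k` cutoff, and the threshold value are all assumptions made for exposition.

```python
import random

def scan(ticker: str) -> float:
    """Phase-1 stub: estimate how much the evidence changed (placeholder logic)."""
    return random.random()

def deep_analyze(ticker: str) -> dict:
    """Phase-2 stub: produce a finding carrying a significance score (placeholder)."""
    return {"ticker": ticker, "significance": random.random()}

def run_arp(universe: list[str], top_k: int = 5,
            significance_threshold: float = 0.6) -> tuple[list, list]:
    # Phase 1: broad scan -- score every name for evidentiary change.
    change_scores = {t: scan(t) for t in universe}

    # Phase 2: selective escalation -- spend the deep analytical budget
    # only on the top-ranked names by change score.
    shortlist = sorted(universe, key=change_scores.get, reverse=True)[:top_k]
    findings = [deep_analyze(t) for t in shortlist]

    # Phase 3: filtration -- surface only what clears the significance bar;
    # the rest stays in the background record.
    surfaced = [f for f in findings if f["significance"] >= significance_threshold]
    background = [f for f in findings if f["significance"] < significance_threshold]
    return surfaced, background
```

Everything substantive happens inside the two stubs; the loop itself stays small, because its job is to route attention, not to produce text.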
In the first phase, the system performs a broad scan across the user’s watchlist and portfolio. A small set of core analytical agents evaluates each stock from several first-order perspectives, including fundamentals, technical structure, and sentiment. The aim is not yet to produce a definitive thesis. The aim is to estimate whether the evidentiary state of the stock has changed relative to prior runs.
This point deserves emphasis. In a research context, the important question is often not “What is this stock?” but “What changed since the last time it was examined?” A stock that remains steadily strong may be less urgent than a stock whose profile has shifted from ambiguous to fragile, or from neglected to newly interesting. ARP therefore computes a change score rather than merely assigning a static label. Directional reversals, confidence shifts, and first-time observations are treated as signals of differing weight.
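One way to read “signals of differing weight” is as a weighted aggregation over kinds of change. The sketch below is one plausible rendering, not the production scoring rule; the `Snapshot` fields and every numeric weight are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Snapshot:
    """The evidentiary state of one stock at one run (illustrative fields)."""
    direction: int     # -1 bearish, 0 neutral, +1 bullish
    confidence: float  # in [0, 1]

def change_score(current: Snapshot, previous: Optional[Snapshot]) -> float:
    """Weight different kinds of change differently (all weights invented)."""
    if previous is None:
        return 0.5  # first-time observation: notable, but not a reversal
    score = 0.0
    if current.direction != previous.direction:
        score += 0.7  # directional reversal carries the most weight
    score += 0.3 * abs(current.confidence - previous.confidence)  # confidence shift
    return min(score, 1.0)
```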
In the second phase, only the top-ranked stocks by change score advance to deeper analysis. This is where the system concentrates its analytical budget. The selected names are examined through a wider panel of hedge-fund-style agents and through deeper interpretive blocks, such as investment thesis, bull-versus-bear structure, moat analysis, margin development, and free-cash-flow quality.
This selective concentration is not an implementation detail. It is the core discipline of the architecture. A serious research system should not apply maximal reasoning everywhere. It should apply it where there is evidence that the return on additional analysis is likely to be highest.
For a teaching platform, this phase has a second function. It shows how layered analytical systems can be composed without collapsing into indiscriminate complexity. Broad scan, selective escalation, and explicit retention together form a structure that can be studied, critiqued, and improved. The architecture is therefore not only operational, but pedagogical. It makes the research process visible.
In the third phase, the findings are evaluated and filtered. Each stock receives a significance score constructed from several components: the degree of change, the strength of consensus across analytical perspectives, the richness of the returned evidence, and the novelty of the result relative to recently surfaced findings. Only those findings that exceed a threshold are surfaced to the user. The others remain in the background record, but do not claim attention.
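Written out, such a rule might look like the sketch below. The four components come directly from the description above; the specific weights and the threshold level are invented for illustration.

```python
SIGNIFICANCE_THRESHOLD = 0.6  # the design calls for a threshold; this level is invented

# Invented weights over the four components named above.
WEIGHTS = {"change": 0.35, "consensus": 0.25, "evidence": 0.20, "novelty": 0.20}

def significance(change: float, consensus: float,
                 evidence: float, novelty: float) -> float:
    """Collapse the four evidence dimensions into one score in [0, 1]."""
    return (WEIGHTS["change"] * change + WEIGHTS["consensus"] * consensus
            + WEIGHTS["evidence"] * evidence + WEIGHTS["novelty"] * novelty)

def should_surface(components: dict) -> bool:
    """A finding claims the user's attention only if it clears the bar."""
    return significance(**components) >= SIGNIFICANCE_THRESHOLD
```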
Here the affinity with Karpathy’s AutoResearch becomes conceptually clear. In both cases, the system is less interesting as a generator than as a selector. Its intelligence lies not simply in producing candidate outputs, but in deciding what survives.
Why significance matters
The significance score is the epistemic center of ARP.
This is not because a single number can dissolve uncertainty. It cannot. Markets are not laboratory systems, and significance in this setting is not identical with statistical significance in the narrow textbook sense. Rather, the score functions as a structured decision rule: it translates multiple dimensions of evidence into a disciplined criterion for surfacing or suppressing output.
That matters because many automated research tools are weak precisely at this boundary. They can summarize, synthesize, and elaborate, but they often lack a principled answer to the most practical question: why should attention be spent on this now?
ARP tries to answer that question explicitly.
A stock may exhibit strong analytical agreement but little novelty, in which case the result may be informationally stale. Another may be novel but weakly supported, in which case escalation may be premature. A third may show meaningful change, strong consensus, and rich supporting analysis, and thus merit immediate attention. The significance score does not abolish judgment. It organizes the conditions under which judgment should be invoked.
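Those three profiles can be run through the illustrative rule sketched above, reusing `significance` and `SIGNIFICANCE_THRESHOLD` from the previous block. Every number here is invented; the point is that the two suppressed cases fail for different reasons.

```python
# Continues the previous sketch: assumes significance() and
# SIGNIFICANCE_THRESHOLD are already defined. All component values invented.
cases = {
    "agreed but stale":      dict(change=0.2, consensus=0.8, evidence=0.6, novelty=0.2),
    "novel but thin":        dict(change=0.6, consensus=0.2, evidence=0.2, novelty=0.8),
    "change with consensus": dict(change=0.8, consensus=0.8, evidence=0.8, novelty=0.8),
}

for name, c in cases.items():
    score = significance(**c)
    verdict = "surfaced" if score >= SIGNIFICANCE_THRESHOLD else "suppressed"
    print(f"{name}: {score:.2f} -> {verdict}")
# agreed but stale:      0.43 -> suppressed
# novel but thin:        0.46 -> suppressed
# change with consensus: 0.80 -> surfaced
```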
This has an important consequence. A successful run of ARP may surface only a small number of findings, sometimes almost none. That should not be interpreted as inactivity. On the contrary, it is evidence that the system has done the harder work of withholding what did not meet the bar.
For a research and teaching platform, that restraint is itself instructive. Scientific seriousness often begins with the willingness not to report everything one has seen.
Research under user-specific priorities
Another important component of ARP is the research program.
Users can provide a free-text directive that shapes how the system interprets relevance during deep analysis. This is necessary because research is not objective in the simplistic sense that the same evidence is equally important to all observers. A long-horizon value investor, a growth investor, a dividend-focused investor, and a capital-preservation-oriented investor will all rank the same facts differently.
The research program allows this difference to be formalized without fragmenting the architecture. The data remains the same, but the evaluative context changes. One may describe this as a user-conditioned prior over what constitutes a material finding. In one case, the system may overweight balance-sheet fragility. In another, it may prioritize revenue acceleration, competitive moat erosion, or dividend sustainability.
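Concretely, one can picture the research program as producing weight overrides that are applied before relevance is assessed. In a real system a language model would interpret the free-text directive; the keyword matching below is a deliberately crude stand-in, and every mapping in it is invented.

```python
DEFAULT_WEIGHTS = {"balance_sheet": 1.0, "revenue_growth": 1.0,
                   "moat": 1.0, "dividend": 1.0}

STYLE_OVERRIDES = {  # all mappings invented for illustration
    "value":        {"balance_sheet": 1.6, "revenue_growth": 0.8},
    "growth":       {"revenue_growth": 1.7, "dividend": 0.5},
    "dividend":     {"dividend": 1.8, "balance_sheet": 1.3},
    "preservation": {"balance_sheet": 1.7, "moat": 1.3},
}

def weights_for(directive: str) -> dict:
    """Condition the evaluative weights on the user's research program."""
    weights = dict(DEFAULT_WEIGHTS)
    for keyword, overrides in STYLE_OVERRIDES.items():
        if keyword in directive.lower():
            weights.update(overrides)
    return weights

# The evidence is the same; only the evaluative context changes:
print(weights_for("long-horizon value investor, wary of leverage"))
print(weights_for("growth with improving margins"))
```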
This matters pedagogically as well. A research system should not create the illusion that there is only one neutral way to evaluate evidence. It should help make explicit where priorities, assumptions, and research styles enter the process. The research program turns that implicit layer into something visible and discussable.
Bounded autonomy
A final point is worth making, because the word “autonomous” has undergone an unfortunate rhetorical inflation.
ARP is autonomous, but only in a bounded sense. It runs without constant supervision, but within explicit limits. It does not rewrite its own logic. It does not place trades. It does not wander indefinitely through open-ended tasks. It operates under constraints of budget, scope, and threshold.
This boundedness is not a concession. It is part of the scientific character of the system.
Any research process worthy of trust must be able to say not only why it continued, but also why it stopped. In ARP, the answer is given in operational terms: the stock universe is bounded, the analytical budget is bounded, the escalation criteria are bounded, and the surfacing rule is bounded. This makes the system not only more controllable, but more intelligible.
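Those four bounds can be stated as explicit configuration rather than left as emergent behavior. The parameter names and values below are illustrative, but the shape is the point: every limit is written down, so a run can always say in operational terms why it stopped.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunBounds:
    """Explicit limits for one ARP-style run (illustrative names and values)."""
    max_universe_size: int = 50               # bounded stock universe
    max_deep_dives: int = 5                   # bounded analytical budget
    min_change_to_escalate: float = 0.4       # bounded escalation criterion
    min_significance_to_surface: float = 0.6  # bounded surfacing rule

def may_escalate(change: float, deep_dives_used: int, bounds: RunBounds) -> bool:
    """Escalation must clear the change bar AND fit within the remaining budget."""
    return (change >= bounds.min_change_to_escalate
            and deep_dives_used < bounds.max_deep_dives)
```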
For a research and teaching platform, this matters doubly. The point is not merely to present autonomous systems as impressive. The point is to show the conditions under which their autonomy becomes methodologically defensible.
Autonomy without limits is theatrical. Autonomy with explicit constraints can become analytical.
What remains after the run
The practical effect of ARP is simple to describe, though harder to engineer.
Once a run has completed, the system has already performed the first-order filtration that usually consumes the beginning of the research process. It has scanned the covered universe, estimated where the state of evidence changed, spent deeper computation only where change justified it, and retained only those findings that survived a significance filter. The user no longer begins with an undifferentiated field of possible signals. The user begins with a narrower field of candidates that have already passed a preliminary test of relevance.
This is, in a modest but important sense, a different conception of AI.
Not an oracle. Not a machine for endless commentary. Not a generator of synthetic certainty.
Something quieter, and perhaps more useful: a system that treats attention as scarce, evidence as graded, and reporting as a privilege rather than a reflex.
That is also why ARP belongs naturally on a research and teaching platform. It is not only meant to be used. It is meant to be understood. Its value lies partly in the outputs it produces, but also in the architecture it exposes: a bounded loop of scan, escalation, evaluation, and retention. That structure can be studied, adapted, and taught.
That is the guiding idea behind the Autonomous Research Pipeline.
It asks a limited but consequential question: among everything that could have been said, what has actually earned the right to be seen?
For the original AutoResearch framework, see Andrej Karpathy’s GitHub repository: karpathy/autoresearch.