Simulating Action Dynamics with Neural Process Networks

An overview of Neural Process Networks (NPN), a model that treats procedural text as a sequence of state transformations, learning to simulate how actions like 'bake' or 'slice' modify entity states (e.g., temperature, shape) even when those changes are implicit.
Script
If I ask you to bake a cake, you instinctively know the batter changes from liquid to solid and gets hot, even though I never explicitly said so. But for standard language models, these unstated physical changes are often invisible. This paper introduces Neural Process Networks to bridge that gap by teaching models to simulate the hidden dynamics of the world.
To understand the core challenge, consider that most procedural text, like a recipe, is a sequence of instructions where the results are implied rather than stated. When a recipe says 'bake,' it implies a change in cookedness and temperature, but traditional models struggle to represent these unstated state changes. Without tracking these invisible variables, a model cannot truly comprehend the narrative or generate coherent next steps.
To solve this, the researchers shift the perspective from reading text to simulating a process. While traditional approaches might just track linguistic patterns, the Neural Process Network treats the document as a simulation where verbs function as learned operators. In this framework, an action isn't just a word; it is a function that actively transforms the state of an entity, such as modifying its temperature or shape.
The underlying mechanism works by processing sentences sequentially to update a persistent memory. For each step, the model identifies the relevant action and entity, applies the action as a tensor transformation, and then writes the new state back to the entity's memory slot. Crucially, the system learns what these actions actually do, such as identifying that 'boil' changes temperature, using weak supervision from the text itself.
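To make that read, transform, and write loop concrete, here is a minimal sketch of a single step in plain Python. It is an illustration under stated assumptions, not the paper's implementation. The bilinear form with a ReLU, the attention-weighted write-back, the embedding size, and every name and value (apply_action, step, the 'bake' vector, the random tensor) are hypothetical stand-ins for parameters the real model learns.

```python
import random

DIM = 4  # hypothetical embedding size, chosen only for illustration


def apply_action(action_vec, tensor, entity_vec, bias):
    """Assumed bilinear applicator:
    out[k] = relu(sum_i sum_j action[i] * tensor[i][k][j] * entity[j] + bias[k])."""
    out = []
    for k in range(len(bias)):
        total = bias[k]
        for i, a in enumerate(action_vec):
            for j, e in enumerate(entity_vec):
                total += a * tensor[i][k][j] * e
        out.append(max(0.0, total))  # ReLU keeps each state value non-negative
    return out


def step(memory, attention, action_vec, tensor, bias):
    """Simulate one sentence: read the attended entities, transform, write back."""
    names = list(memory)
    # Read: attention-weighted mix of the entity memory slots.
    read = [sum(attention[n] * memory[n][d] for n in names) for d in range(DIM)]
    # Transform: the action acts as an operator on the entity state.
    new_state = apply_action(action_vec, tensor, read, bias)
    # Write: move each slot toward the new state in proportion to its attention.
    for n in names:
        w = attention[n]
        memory[n] = [w * s + (1.0 - w) * old for s, old in zip(new_state, memory[n])]
    return new_state


# Hypothetical values; in the real model these would be learned.
rng = random.Random(0)
tensor = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
          for _ in range(DIM)]
bias = [0.0] * DIM
bake = [0.3, -0.1, 0.8, 0.5]  # stand-in embedding for the action 'bake'
memory = {"batter": [1.0, 0.0, 0.5, 0.2], "pan": [0.0, 1.0, 0.0, 0.0]}
attention = {"batter": 1.0, "pan": 0.0}  # the sentence is about the batter

pan_before = list(memory["pan"])
new_state = step(memory, attention, bake, tensor, bias)
```

With all of the attention on the batter, only the batter's slot is overwritten and the pan's slot is left untouched. That is the sense in which the memory persists across sentences while the selected entity's state changes.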
The validity of this approach is backed by strong empirical results on recipe datasets. The model achieves an F1 score of 55.39 on entity selection, significantly outperforming recurrent baselines, while the learned action embeddings naturally cluster by function, grouping similar actions like 'cut' and 'slice' together. Furthermore, when used to generate new recipe steps, the model produces more coherent instructions because it conditions its output on the simulated state of the ingredients.
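For reference, an F1 score is the harmonic mean of precision and recall. The sketch below computes it for one hypothetical sentence by treating the selected entities as a set. The function name and the example entities are assumptions, and the paper's exact evaluation protocol (for instance, how scores are aggregated across sentences) may differ.

```python
def entity_f1(predicted, gold):
    """F1 between a predicted entity set and a gold entity set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the model selects two entities, one of them correct,
# while the annotation lists three.
score = entity_f1({"batter", "pan"}, {"batter", "eggs", "flour"})
```

Here precision is 1/2 and recall is 1/3, so the score works out to 0.4.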
By explicitly modeling verbs as operators that transform the state of the world, this paper demonstrates that deep learning models can move beyond surface patterns to capture causal dynamics. To explore more about neural process networks and entity tracking, visit EmergentMind.com.