Solvers#
Solvers are responsible for action selection in ManiDreams. A solver generates candidate actions, evaluates them using the TSIP and Cage, and returns the best action. Solvers are organized into two families: Samplers (policy-based) and Optimizers (planning-based).
Abstract Interface#
```python
from abc import ABC, abstractmethod
from typing import Any, List, Tuple


class SolverBase(ABC):
    @abstractmethod
    def solve(
        self, action_space, cage, tsip, current_dris, verbose=False
    ) -> Tuple[Any, List[float], List[bool]]:
        """Select the best action. Returns (best_action, costs, validations)."""

    @abstractmethod
    def action_generator(self, action_space=None) -> List[Any]:
        """Generate candidate actions for evaluation."""

    def reset(self) -> None: ...
```
The solve() method is the primary entry point. When operating in CAGE mode, it internally calls action_generator() to produce candidates, tsip.next() to predict outcomes, cage.evaluate()/cage.validate() to score them, and returns the best action. The selection strategy prefers valid actions first, then selects the one with lowest cost.
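The selection strategy can be sketched in a few lines. `select_best` is a hypothetical helper for illustration only; `costs` and `validations` mirror the lists that `solve()` returns:

```python
# Minimal sketch of the selection step inside solve() (CAGE mode):
# prefer valid actions, then pick the lowest cost among them.
def select_best(actions, costs, validations):
    valid_idx = [i for i, ok in enumerate(validations) if ok]
    pool = valid_idx if valid_idx else range(len(actions))  # fall back to all
    best = min(pool, key=lambda i: costs[i])
    return actions[best]

# Action 2 is cheapest overall but invalid, so action 0 wins.
print(select_best(['a0', 'a1', 'a2'], [1.0, 2.0, 0.5], [True, True, False]))
```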
Samplers#
Samplers (src/manidreams/solvers/samplers/) provide policy-based action generation.
SamplerBase#
Abstract base class defining sample(num_samples) -> List[Any].
DiscreteSampler#
Enumerates all actions in a discrete action space as single-step trajectories [[0], [1], ..., [n-1]]. Used with GeometricOptimizer for tasks with discrete action spaces.
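The enumeration can be written out directly (a minimal sketch; `enumerate_discrete` is an illustrative name, not the library's API):

```python
# Each index of a size-n discrete action space becomes a
# single-step trajectory, exactly as described above.
def enumerate_discrete(n):
    return [[i] for i in range(n)]

print(enumerate_discrete(4))  # [[0], [1], [2], [3]]
```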
GaussianSampler#
Samples continuous action trajectories from a Gaussian distribution N(mean, std) with shape [horizon, action_dim]. Supports online distribution updates for CEM-style optimization. Used with MPPIOptimizer.
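A minimal sketch of this sampling scheme, assuming `mean` and `std` are arrays of shape `[horizon, action_dim]` as stated above (the function name and NumPy usage are illustrative, not the library's API):

```python
import numpy as np

# Draw num_samples trajectories from N(mean, std); broadcasting turns
# the [horizon, action_dim] parameters into a
# [num_samples, horizon, action_dim] batch.
def sample_trajectories(mean, std, num_samples, rng=None):
    rng = rng or np.random.default_rng(0)
    horizon, action_dim = mean.shape
    return mean + std * rng.standard_normal((num_samples, horizon, action_dim))

mean = np.zeros((5, 3))          # horizon=5, action_dim=3
std = np.full((5, 3), 0.1)
samples = sample_trajectories(mean, std, num_samples=16)
print(samples.shape)  # (16, 5, 3)
```

A CEM-style update then refits `mean` and `std` to the elite samples before the next draw.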
PolicySampler#
The primary solver for RL-based tasks. Extends SolverBase (not SamplerBase) because it implements the full solve loop. Operates in two modes:
Baseline mode (num_samples=1 or cage=None):
```
obs → policy.get_action(obs, deterministic=True) → single action → execute
```
No TSIP evaluation. No cage checking. The policy’s output is directly executed.
CAGE mode (num_samples > 1 with cage and TSIP):
```
obs → policy.sample_actions(N) → N candidate actions
    → tsip.next(dris, [a_0, ..., a_{N-1}]) → [DRIS_0, ..., DRIS_{N-1}]
    → cage.evaluate([DRIS_0, ..., DRIS_{N-1}]) → [cost_0, ..., cost_{N-1}]
    → cage.validate([DRIS_0, ..., DRIS_{N-1}]) → [valid_0, ..., valid_{N-1}]
    → select best: prefer valid, then lowest cost
    → execute best action
```
Constructor:
```python
PolicySampler(
    policy_model,        # RL policy with get_action() interface
    obs_encoder=None,    # Optional encoder (e.g., PointCloudAE)
    obs_mapper=None,     # Maps raw observations to model input
    action_mapper=None,  # Maps model output to control commands
    num_samples=1,       # Number of candidate actions (1 = baseline)
    deterministic=True,  # Use policy mean vs. stochastic sampling
    device='cpu',
)
```
Optional components:
- obs_encoder: Transforms raw observations before feeding to the policy (e.g., PointCloudAE encodes point clouds to a latent vector)
- obs_mapper: Maps raw environment observations to the format expected by the policy
- action_mapper: Maps policy output to the control command format expected by the backend
Optimizers#
Optimizers (src/manidreams/solvers/optimizers/) provide planning-based action selection via trajectory optimization.
MPCOptimizer#
Base MPC solver with configurable horizon, discount factor, and trajectory rollout. Implements _rollout() (serial or parallel), _evaluate() (discounted cumulative cost over trajectory), and _select_best(). Automatically detects whether the TSIP supports parallel rollout and falls back to serial execution otherwise.
For planning-based solvers, the pipeline extends to multi-step trajectory evaluation:
```
1. Generate K trajectories of horizon H:
     T_k = [a_k^0, a_k^1, ..., a_k^{H-1}]   for k = 0, ..., K-1
2. Rollout each trajectory through TSIP:
     DRIS_k^0 → (a_k^0) → DRIS_k^1 → (a_k^1) → ... → DRIS_k^{H-1}
3. Evaluate trajectory cost with discounting:
     cost_k  = sum_{h=0}^{H-1} discount^h * cage.evaluate(DRIS_k^h)
     valid_k = all(cage.validate(DRIS_k^h) for h in 0..H-1)
4. Select best trajectory: k* = argmin_{k: valid_k} cost_k
5. Execute first action of best trajectory: a_{k*}^0
```
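Step 3 above reduces to a one-line discounted sum. This sketch assumes `step_costs` holds the per-step `cage.evaluate()` values for a single trajectory (the helper name is illustrative):

```python
# Discounted cumulative cost over one trajectory's horizon.
def trajectory_cost(step_costs, discount):
    return sum(discount**h * c for h, c in enumerate(step_costs))

# H=3, discount=0.5: 1 + 0.5*2 + 0.25*4 = 3.0
print(trajectory_cost([1.0, 2.0, 4.0], 0.5))  # 3.0
```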
GeometricOptimizer#
Extends MPCOptimizer. Uses DiscreteSampler to enumerate all actions in a discrete action space and evaluates single-step trajectories. Used for the multi-object pushing task where the action space is 16 discrete directions.
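A 16-direction discrete action space like the one described above can be pictured as evenly spaced unit vectors in the plane; the exact parameterization here is an assumption for illustration, not the library's API:

```python
import math

# n evenly spaced planar push directions as unit vectors.
def push_directions(n=16):
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]

dirs = push_directions()
print(len(dirs))  # 16
```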
MPPIOptimizer#
Implements Model Predictive Path Integral (MPPI) control with Cross-Entropy Method (CEM) refinement for continuous offset actions. Operates in 6D offset space [dx, dy, dz, droll, dpitch, dyaw].
Key features:
- Multi-iteration CEM refinement (num_iterations, default 6)
- MPPI weighting: w_i ~ exp(-temperature * cost_i); invalid samples receive zero weight
- Warm-start: caches the offset distribution between timesteps for temporal coherence
- Configurable standard deviation bounds (min_std, max_std)
- Used for the picking task with multi-step horizon planning
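The MPPI weighting rule above can be sketched directly; `mppi_weights` is an illustrative helper, not the library's API:

```python
import math

# w_i ∝ exp(-temperature * cost_i), with invalid samples forced to
# zero weight before normalization.
def mppi_weights(costs, valids, temperature):
    raw = [math.exp(-temperature * c) if ok else 0.0
           for c, ok in zip(costs, valids)]
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else raw

w = mppi_weights([1.0, 2.0, 0.1], [True, True, False], temperature=0.5)
print(w)  # invalid third sample gets weight 0; the rest sum to 1
```

The weighted average of the sampled offsets then gives the refined mean for the next CEM iteration.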
NaiveOptimizer#
Geometric heuristic that selects actions based on the angle between the cage motion direction and the vector from cage center to DRIS center. No TSIP prediction — pure geometric reasoning.
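The heuristic amounts to picking the candidate motion direction with the smallest angle to the cage-center-to-DRIS-center vector. A minimal 2D sketch, with illustrative names throughout:

```python
import math

# Angle between two 2D vectors, clamped for numerical safety.
def angle_between(u, v):
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(*u) * math.hypot(*v)
    return math.acos(max(-1.0, min(1.0, dot / norm)))

# Pick the direction best aligned with the vector toward the DRIS center.
def pick_direction(directions, cage_center, dris_center):
    target = (dris_center[0] - cage_center[0], dris_center[1] - cage_center[1])
    return min(directions, key=lambda d: angle_between(d, target))

# DRIS sits to the +x side of the cage, so (1, 0) wins.
print(pick_direction([(1, 0), (0, 1), (-1, 0)], (0, 0), (2, 0.1)))
```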
PixelOptimizer#
Delegates action selection to the cage’s compute_direction() method. No TSIP prediction. Used with CircularPixelCage where the cage directly computes the optimal 16-directional push action from the PSS-to-cage-center vector.
Quick API Reference#
```
SolverBase
    .solve(action_space, cage, tsip, current_dris) -> (action, costs, validations)
    .action_generator(action_space) -> List[actions]
    .reset()

PolicySampler(policy_model, obs_encoder=None, obs_mapper=None, action_mapper=None,
              num_samples=1, deterministic=True, device='cpu')
    .initialize(env, initial_obs)
    .solve(action_space, cage, tsip, current_dris) -> (best_action, costs, validations)
    .sample_actions(num_samples) -> Tensor
    .update_after_step(new_obs)
    .reset()

MPPIOptimizer(config={'horizon': 5, 'num_samples': 16, 'num_elites': 8,
                      'num_iterations': 6, 'temperature': 0.5, 'discount': 0.95})
    .solve(action_space, cage, tsip, current_dris) -> (best_offset, costs, validations)
    .reset()
```