Vision-Guided Pick-and-Place with a 7DoF Robotic Arm

Franka Emika Panda 7DoF manipulator mounted on the workspace, stacking blocks from the static block table.

Together with Solomon Gonzalez, Mateusz Jaszczuk, and Andrik Puentes, we developed a robotic system that could autonomously detect, grasp, and stack blocks in both static and dynamic environments. The platform was built on the Franka Emika Panda, a seven-degree-of-freedom arm, and combined vision-based perception, kinematic planning, and grasp execution in a unified control pipeline. The final evaluation required robots to build tall, stable towers under time constraints, with performance measured on speed, reliability, and the ability to handle both stationary and moving objects.

Competition environment layout and key dimensions. Blocks for the static task were placed on a rectangular workspace with AprilTags for detection, while dynamic blocks moved on a rotating turntable with known angular velocity. A central scoring platform with limited surface area constrained stacking precision, and towers above a certain height risked collapse if placement was misaligned.

System Architecture

The control framework was organized around two primary modules. The ArmController mapped task-level objectives into joint trajectories using inverse kinematics with null-space projection, ensuring that the arm reached targets while maintaining stable joint configurations. The ObjectDetector processed AprilTag detections from a wrist-mounted camera, transformed block poses into the world frame, and passed them to the controller for grasp execution. Together, these modules enabled continuous block localization and stable motion planning.
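The write-up does not give the solver's internals, so the following is only a minimal sketch of inverse kinematics with null-space projection. It assumes a planar three-link arm rather than the 7-DoF Panda so that it stays short and dependency-free. The names fk, jacobian, ik_step, and solve_ik, the rest posture q_rest, and the gains are all hypothetical. The primary task drives the end effector to the target through a damped pseudo-inverse, while a secondary task pulls the joints toward q_rest inside the Jacobian's null space, so it does not disturb the end effector to first order.

```python
import math

def fk(q, links):
    """End-effector (x, y) of a planar serial arm with relative joint angles q."""
    x = y = angle = 0.0
    for qi, li in zip(q, links):
        angle += qi
        x += li * math.cos(angle)
        y += li * math.sin(angle)
    return x, y

def jacobian(q, links):
    """2 x n position Jacobian of the planar arm."""
    n = len(q)
    cum, angle = [], 0.0
    for qi in q:
        angle += qi
        cum.append(angle)
    J = [[0.0] * n, [0.0] * n]
    for j in range(n):
        for k in range(j, n):  # joint j moves every link from j outward
            J[0][j] -= links[k] * math.sin(cum[k])
            J[1][j] += links[k] * math.cos(cum[k])
    return J

def ik_step(q, target, links, q_rest, gain=0.5, k_null=0.1, damping=1e-8):
    """One update: task-space correction plus a null-space posture term."""
    n = len(q)
    J = jacobian(q, links)
    x, y = fk(q, links)
    ex, ey = target[0] - x, target[1] - y
    # A = J J^T + damping * I is 2 x 2, so invert it in closed form.
    a11 = sum(v * v for v in J[0]) + damping
    a22 = sum(v * v for v in J[1]) + damping
    a12 = sum(u * v for u, v in zip(J[0], J[1]))
    det = a11 * a22 - a12 * a12
    i11, i12, i22 = a22 / det, -a12 / det, a11 / det
    # Damped pseudo-inverse J^T A^-1, stored as n rows of 2 entries.
    pinv = [(J[0][i] * i11 + J[1][i] * i12,
             J[0][i] * i12 + J[1][i] * i22) for i in range(n)]
    # Secondary objective: move toward the rest posture.
    z = [k_null * (q_rest[i] - q[i]) for i in range(n)]
    jz0 = sum(J[0][i] * z[i] for i in range(n))
    jz1 = sum(J[1][i] * z[i] for i in range(n))
    q_next = []
    for i in range(n):
        dq_task = gain * (pinv[i][0] * ex + pinv[i][1] * ey)
        # (I - pinv J) z: the part of z that leaves the end effector in place.
        dq_null = z[i] - (pinv[i][0] * jz0 + pinv[i][1] * jz1)
        q_next.append(q[i] + dq_task + dq_null)
    return q_next

def solve_ik(target, q0, links, q_rest, iters=200, tol=1e-6):
    """Iterate ik_step until the position error drops below tol."""
    q = list(q0)
    for _ in range(iters):
        x, y = fk(q, links)
        if math.hypot(target[0] - x, target[1] - y) < tol:
            break
        q = ik_step(q, target, links, q_rest)
    return q
```

For example, solve_ik((1.3, 1.3), [0.2, 0.8, 0.8], [1.0, 1.0, 0.5], [0.0, 1.0, 1.0]) returns joint angles whose forward kinematics land on the target. On a real 7-DoF arm the same structure applies with a 6 x 7 Jacobian and a linear-algebra library in place of the closed-form 2 x 2 inverse.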

To reduce planning complexity, we relied on a small set of predefined task configurations: an initialization pose, scanning poses for both static and dynamic blocks, and a placement pose over the stacking platform. These anchor points improved predictability and reduced computational load during repeated tasks.

Static Block Manipulation

The static block pipeline was highly reliable. Once an AprilTag was detected, the block’s pose was computed in the world frame, and the ArmController generated an alignment trajectory. The gripper descended, closed, and lifted the block, then placed it onto the scoring platform with a z-offset proportional to the current tower height. Simulation results showed perfect success rates, while hardware introduced some errors due to calibration offsets, block irregularities, and inconsistent gripper force. Overall, the static task provided consistent points during competition.
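Computing a block's world-frame pose from a wrist-camera detection amounts to chaining homogeneous transforms. The sketch below is illustrative only: the function names, the frame names (world, end effector, camera, block), and the yaw-only rotations are assumptions made for brevity, not the project's actual code. A real setup would use full 3D rotations from the arm's forward kinematics and the camera calibration.

```python
import math

def make_transform(yaw, translation):
    """4 x 4 homogeneous transform: rotation about z by yaw, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul4(A, B):
    """Product of two 4 x 4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def block_pose_in_world(T_world_ee, T_ee_cam, T_cam_block):
    """Chain world<-end effector, end effector<-camera, camera<-block."""
    return matmul4(matmul4(T_world_ee, T_ee_cam), T_cam_block)

# Hypothetical numbers: end effector yawed 90 degrees, camera 5 cm ahead of it.
T_world_ee = make_transform(math.pi / 2, (0.5, 0.0, 0.4))
T_ee_cam = make_transform(0.0, (0.05, 0.0, 0.0))
T_cam_block = make_transform(0.0, (0.1, 0.0, -0.3))
T_world_block = block_pose_in_world(T_world_ee, T_ee_cam, T_cam_block)
block_xyz = [T_world_block[i][3] for i in range(3)]
```

With these made-up values the block ends up at roughly (0.5, 0.15, 0.1) in the world frame. Any error in the assumed camera mounting transform shifts every computed block pose, which is one way calibration offsets like those mentioned above can become systematic grasp errors.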

Panda manipulator pick-and-stack sequence for a static block during competition. From left to right, the sequence shows the arm moving into the detection configuration, descending to grasp the block, lifting it cleanly, and gently stacking it onto the existing tower. The workspace lighting, background calibration markers, and tight tolerance of the scoring platform highlight the importance of precision in each step.

Dynamic Block Manipulation

The dynamic task introduced greater complexity. Blocks moved along a rotating turntable, requiring prediction of future positions. By estimating the angular velocity of the platform, the system extrapolated the expected angular displacement at grasp time, then computed the target pose in the world frame. The ArmController generated interception trajectories aligned to these predictions.
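The prediction step can be sketched as a rotation of the detected block position about the turntable center. This is a simplified illustration, assuming a known turntable center, planar motion, and constant angular velocity. The function names and the lead_time parameter (time from detection to gripper closure) are hypothetical.

```python
import math

def estimate_omega(p0, t0, p1, t1, center):
    """Angular velocity (rad/s) from two timestamped detections of one block."""
    a0 = math.atan2(p0[1] - center[1], p0[0] - center[0])
    a1 = math.atan2(p1[1] - center[1], p1[0] - center[0])
    # Wrap the angle difference into [-pi, pi) before dividing by elapsed time.
    delta = (a1 - a0 + math.pi) % (2.0 * math.pi) - math.pi
    return delta / (t1 - t0)

def predict_block_position(p_now, center, omega, lead_time):
    """Rotate the current position about the center by omega * lead_time."""
    theta = omega * lead_time
    c, s = math.cos(theta), math.sin(theta)
    dx, dy = p_now[0] - center[0], p_now[1] - center[1]
    return (center[0] + c * dx - s * dy, center[1] + s * dx + c * dy)

# Example: a block 0.3 m from the center, a quarter turn per second.
omega = estimate_omega((0.3, 0.0), 0.0, (0.0, 0.3), 1.0, (0.0, 0.0))
target_xy = predict_block_position((0.3, 0.0), (0.0, 0.0), omega, 1.0)
```

Because the predicted position depends linearly on lead_time through the angle, any unmodeled latency in perception or motion shows up directly as an angular miss at the grasp point.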

The success rate exceeded 70% in simulation but was closer to 57% on hardware, where camera noise, delays in pose estimation, and calibration drift degraded performance. Mitigations included rejecting detections near the gripper and elevating approach trajectories to avoid collisions. These measures improved consistency, but the results underscored how sensitive dynamic grasping is to timing and noise.
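One plausible form of the detection-rejection safeguard is a simple distance threshold around the gripper. The sketch below is an assumption about how such a filter could look: the detection dictionary layout, the function name, and the 10 cm default threshold are all hypothetical.

```python
import math

def filter_detections(detections, gripper_xyz, min_distance=0.10):
    """Keep only detections at least min_distance (m) away from the gripper."""
    return [d for d in detections
            if math.dist(d["position"], gripper_xyz) >= min_distance]

# Example: the second detection sits 2 cm from the gripper and is dropped.
detections = [
    {"tag_id": 3, "position": (0.50, 0.20, 0.05)},
    {"tag_id": 7, "position": (0.30, 0.00, 0.42)},
]
kept = filter_detections(detections, gripper_xyz=(0.30, 0.00, 0.40))
```

A filter like this trades a few missed blocks for fewer spurious targets, for instance a tag on a block that is already held.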

Dynamic grasp sequence on the rotating turntable. The sequence illustrates block detection at the periphery, trajectory prediction based on angular velocity, alignment of the gripper, and interception during rotation. The limited field of view and reflective surface of the turntable contributed to noise in the AprilTag detections, while the proximity to the table edge constrained approach trajectories.

Block Stacking

Both static and dynamic blocks were stacked using the same placement routine. After each successful grasp, the system updated the tower height estimate and applied incremental offsets for precise placement. The inverse kinematics solver maintained alignment across multiple placements, producing stable towers under repeated trials. Placement accuracy was especially important as the scoring platform offered only a narrow surface, and any lateral drift accumulated over successive layers.
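The height bookkeeping can be illustrated with a small helper that returns the target z for the center of the next block. This is a sketch under assumptions: uniform cubic blocks, a known platform height, and a small release clearance. The function name, the 5 cm block size, and the 5 mm clearance are hypothetical values, not figures from the project.

```python
def placement_height(platform_z, block_size, blocks_placed, clearance=0.005):
    """Target z (m) for the center of the next block on the tower."""
    top_of_tower = platform_z + blocks_placed * block_size
    return top_of_tower + block_size / 2.0 + clearance

# Example: platform at 0.20 m, 5 cm cubes, two blocks already stacked.
z_next = placement_height(0.20, 0.05, 2)
```

Here z_next comes out at about 0.33 m. Only the vertical target grows with the block count, while the lateral target stays fixed, which is why small lateral errors compound across layers.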

Evaluation and Strategy

Our competitive strategy prioritized reliability by targeting static blocks first, then attempting dynamic blocks with remaining time. This ensured steady point accumulation while still showcasing the dynamic tracking pipeline. The system advanced to the quarterfinals and demonstrated both robust static stacking and functional, if less consistent, dynamic interception.
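The static-first ordering can be expressed as a simple time-budget plan. The sketch below is purely illustrative: the function name and the per-attempt durations are made-up assumptions, and the real system presumably decided at run time rather than from a fixed schedule.

```python
def plan_attempts(n_static, t_static, t_dynamic, time_limit):
    """Ordered attempt list: all affordable static picks, then dynamic picks."""
    plan = []
    remaining = time_limit
    for _ in range(n_static):
        if remaining < t_static:
            break
        plan.append("static")
        remaining -= t_static
    # Spend whatever time is left on the less reliable dynamic attempts.
    while remaining >= t_dynamic:
        plan.append("dynamic")
        remaining -= t_dynamic
    return plan

# Example with hypothetical timings: 4 static blocks at 20 s each,
# dynamic attempts at 30 s each, and a 150 s match.
plan = plan_attempts(4, 20, 30, 150)
```

With these numbers the plan contains four static attempts followed by two dynamic ones.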

Challenges and Lessons Learned

Hardware testing revealed discrepancies with simulation. A narrower camera field of view than in simulation and calibration offsets caused systematic errors, while variations in gripper force reduced grasp reliability. Real-time delays in perception and trajectory execution limited dynamic interception success, and frequent inverse kinematics calls imposed computational overhead. We addressed these issues with repeated calibration, offset tuning, optimized code paths, and safety thresholds in detection. The project emphasized the importance of robust perception and hardware-aware planning when transferring algorithms from simulation to real systems.