When AI Enters the Physical World: Controlling a Robotic Arm with Your Voice, Not Code

Y Jiang
5 min read · Nov 26, 2023


Note: The original article is available at https://yexijiang.substack.com/p/when-ai-enters-the-physical-world. Please subscribe to see more articles in the future.

In the article I wrote in March 2023, “Let’s Make a MOSS from The Wandering Earth,” I set forth a challenge: to transform MOSS from just a conversational AI into a central system that operates in the physical world. My vision for MOSS was to control various tools and devices, just as depicted in the movie.

The original words at the end of that article were:

The system discussed in this article is merely a prototype. MOSS is far more than a conversational robot. It should be a central system capable of mobilizing various tools and controlling devices in the physical world. For instance, in the movie, we see MOSS controlling planetary engines, various cameras, and numerous drones.

How to make MOSS have a presence in the physical world is what I want to do next. Implementing this functionality is challenging but also very interesting. There has been some research in this area, and I plan to study it in the coming period.

It was a hole I dug for myself, and one I had to fill. Fortunately, filling it now is much easier than it would have been back then.

Time flies, and eight months have passed. In that time, the field of artificial intelligence has advanced rapidly. Many components that would have required independent development back then can now be assembled from mature tools. For example, some functionality I had planned to build myself based on the Toolformer paper can now be handled directly through GPT's function calling.

Toolformer: Language Models Can Teach Themselves to Use Tools (paper: https://arxiv.org/abs/2302.04761)
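To make this concrete, here is a minimal sketch of what such a call could look like with OpenAI's function-calling API. The move_joint tool, its parameters, and the model name are illustrative placeholders, not the exact definitions I used in the project.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A hypothetical tool describing one intermediate instruction for the arm.
tools = [{
    "type": "function",
    "function": {
        "name": "move_joint",
        "description": "Rotate one joint of the six-axis arm to an absolute angle.",
        "parameters": {
            "type": "object",
            "properties": {
                "joint_id": {"type": "integer", "description": "Joint index, 1-6"},
                "angle_deg": {"type": "number", "description": "Target angle in degrees"},
            },
            "required": ["joint_id", "angle_deg"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Tilt the arm's head slightly."}],
    tools=tools,
)

# The model answers with structured tool calls instead of free text.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

The model replies with structured calls that the local system can map directly onto the arm's instruction set, which is exactly the kind of glue I had expected to build by hand.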

Standing on the shoulders of giants, I managed to create a prototype in just one day. Let’s start with a video.

As shown in the video, we can operate a six-axis robotic arm through simple voice commands. These operations are instantly translated into API instructions for the robotic arm to execute.

System Architecture

A high-level overview of the system

As we can also see from the video, the system is roughly divided into three parts:

  1. Local System: Responsible for orchestrating the various modules and handling voice interaction. It also holds the system-specific knowledge, such as the robotic arm's API and the intermediate instruction set used to translate natural language into robot API commands.
  2. Cloud System: Handles speech recognition and conversation, and is responsible for translating commands into intermediate instructions (a minimal speech-to-text sketch follows this list).
  3. Robotic Arm’s Embedded System: Receives the translated instructions from the local system and executes them. The hardware here is an Nvidia Jetson, and the software is built on the ROS2 framework.
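As an illustration of the voice-recognition leg, a cloud speech-to-text call can be as small as the sketch below. I use OpenAI's Whisper API here purely as an example; the actual recognition service and audio handling in my setup may differ.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Transcribe a short voice command recorded by the local system.
with open("command.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)  # e.g. "raise the arm and wave twice"
```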

The local system connects to the cloud via the internet, and it communicates with the robotic arm over a local network. This system spans public and private networks, crossing from the data space into the physical space.
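To give a feel for the last hop, here is a minimal sketch of how the local system could hand a translated instruction to the arm over the local network via a ROS2 topic. The topic name and the JSON message format are assumptions made for illustration; the real interface on my arm differs in the details.

```python
import json
import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class ArmCommandBridge(Node):
    """Publishes intermediate instructions to the arm's embedded ROS2 stack."""

    def __init__(self):
        super().__init__("arm_command_bridge")
        # Hypothetical topic the arm's embedded system subscribes to.
        self.publisher = self.create_publisher(String, "/arm/intermediate_cmd", 10)

    def send(self, instruction: dict):
        msg = String()
        # Serialize the intermediate instruction as JSON for transport.
        msg.data = json.dumps(instruction)
        self.publisher.publish(msg)
        self.get_logger().info(f"sent: {msg.data}")


def main():
    rclpy.init()
    bridge = ArmCommandBridge()
    bridge.send({"name": "move_joint", "joint_id": 3, "angle_deg": 45.0})
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```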

Areas for Improvement

At present, this system is full of holes and has many areas for improvement. Let’s list some of the obvious ones.

  1. Weak Intermediate Instruction Set: In this system, I established a simple set of intermediate instructions between natural language and the robotic arm. The instruction set serves two purposes: it lets information travel between the systems at a higher density, and it lets the AI focus on higher-order instructions when converting natural language, rather than on low-level details like driving individual servo motors. At the moment, the instruction set is still too weak (a rough sketch of what a richer set could look like follows this list).
  2. Lack of Vision: Currently, the actions that can be performed are purely servo-motor operations. Although the robotic arm carries a camera and the embedded system contains a GPU, I haven't made use of them. If I integrated a real-time object detection model like YOLO, much more interesting behaviors should be possible.
  3. Low Completion Rate of High-Level Instructions: Right now, I can only send very direct instructions, rather than more abstract ones. Sometimes when I ask it to nod, it turns its head instead. Higher-level instructions, like asking it to perform a set of calisthenics, are even less achievable.
  4. AI’s Limited Understanding of Its ‘Body’: Because I’m using cloud-based GPT, the model itself knows very little about my robotic arm. All of its knowledge comes from the local instruction set I provided. Imagine being told there’s a robotic arm a thousand kilometers away that you can control, but you don’t even know what it looks like. How well could you control it?
  5. Low Fault Tolerance: As a multi-machine and multi-network system, communication issues are likely. I estimate there’s about a 20% chance that something will go wrong in one step of the control process, causing the entire flow to get stuck.
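For the first item above, here is a rough sketch of what a slightly richer intermediate instruction set could look like. The primitives and fields are invented for illustration and are not the set I actually defined.

```python
from dataclasses import dataclass
from enum import Enum


class Primitive(Enum):
    """Hypothetical higher-order primitives the AI is allowed to emit."""
    MOVE_JOINT = "move_joint"   # rotate a single joint to an angle
    MOVE_POSE = "move_pose"     # move the end effector to an (x, y, z) pose
    GRIP = "grip"               # open or close the gripper
    NOD = "nod"                 # canned gesture composed of several joint moves
    HOME = "home"               # return to the resting pose


@dataclass
class Instruction:
    primitive: Primitive
    args: dict
    timeout_s: float = 5.0      # a per-step deadline would also help with fault tolerance


# A high-level request like "nod twice" would expand into a short program:
plan = [
    Instruction(Primitive.NOD, {"repetitions": 2}),
    Instruction(Primitive.HOME, {}),
]
```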

Possible Next Steps

This is an exciting project with much room for improvement. Additionally, I have a camera-equipped programmable drone and a programmable tank with LiDAR, dual cameras, a mechanical arm, and a GPU. When I have time, I might play around with these too.

Entry-level DJI programmable drone
A programmable robotic tank equipped with a robotic arm, a depth camera, LiDAR, and an Nvidia GPU

Of course, I’m doing this just for fun, to satisfy my own curiosity. Even if I devoted myself to it fully, the result would still be a bargain-bin version of Google’s PaLM; in terms of resources and capability, one person is no match for a full team. I miss the days at Uber ATG, working with a group of idealists on technology. Idealists may not be best suited for business, but it is always good to be striving for the same dream.

Google’s PaLM can fulfill high-level commands like fetching a snack for you, and is fault-tolerant

It feels like the scenes in “The Wandering Earth 2” are becoming less and less sci-fi. I guess the recent drama at OpenAI (see “On the ‘Split’ of OpenAI: The Real Opponents of Idealists Might Not Be Realists but Another Group of Idealists”) also hints at something.

Drone swarm in the movie
