GPT can complete complex tasks by controlling the browser

GPT controlling a browser, sped up.

Using the ReAct prompting pattern, GPT-4 can break complex browser-based tasks into actionable steps and navigate the web by evaluating browser-driving code that it generates itself. This approach was already possible with GPT-3, but the recent increase in context size makes it more practical.
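As a rough illustration of the loop described above, here is a minimal sketch in Python. It is not the project's actual code: the reply format (`Thought:` / `Action:` / `Final Answer:`), the function `react_loop`, and the `ask_model` and `run_code` callables are all hypothetical stand-ins for the real model call and the real browser driver.

```python
import re
from typing import Callable

# Assumed reply format (not from the original project): the model writes a
# "Thought:" line followed by either "Action:" plus code, or "Final Answer:".
ACTION_RE = re.compile(r"Action:\s*(.*)", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react_loop(task: str,
               ask_model: Callable[[str], str],
               run_code: Callable[[str], str],
               max_steps: int = 5) -> str:
    """Alternate model replies and code execution until a final answer appears."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = ask_model(transcript)
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(reply)
        if action is None:
            transcript += "Observation: no Action found in the reply.\n"
            continue
        # The result of running the generated code becomes the next observation.
        observation = run_code(action.group(1).strip())
        transcript += f"Observation: {observation}\n"
    raise RuntimeError("no final answer within max_steps")


if __name__ == "__main__":
    # Scripted replies stand in for real API calls.
    scripted = iter([
        "Thought: I need the page title.\nAction: page.title()",
        "Thought: I have it.\nFinal Answer: Example Domain",
    ])
    answer = react_loop(
        "Find the title of the page",
        ask_model=lambda transcript: next(scripted),
        run_code=lambda code: "Example Domain",  # stand-in for a real browser
    )
    print(answer)
```

Passing the model and the browser in as plain callables keeps the loop testable without network access; in a real setup `ask_model` would call the API and `run_code` would evaluate the generated code against a live browser session.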

Practical considerations:

  • It is still slow. Network requests to the API already take a considerable amount of time, and the GPT-4 endpoints are slower still.

  • The 8k token context window is still too small to fit everything in one request. HTML clean-up and/or recursive summarization are necessary.

  • GPT's lack of training in interactive programming sessions is quite noticeable. Rather than inspecting the available web context to make selections, it sometimes starts reciting old selectors it must have seen in its training data.

  • Self-healing. Any exceptions raised by the GPT-generated code are fed back into the model. Usually, this is enough for it to correct its mistakes on the next cycle of the loop.
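Two of the points above, HTML clean-up and self-healing, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the project's implementation: the names `clean_html` and `run_with_feedback`, the list of attributes worth keeping, and the character budget (a crude proxy for a token budget) are all hypothetical choices.

```python
import traceback
from html.parser import HTMLParser


class _Cleaner(HTMLParser):
    """Keep visible text and a few selector-relevant attributes; drop scripts and styles."""
    SKIP = {"script", "style", "svg", "noscript"}                # assumed noise tags
    KEEP_ATTRS = {"id", "name", "href", "type", "aria-label"}    # assumed useful attrs

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(f' {k}="{v}"' for k, v in attrs if k in self.KEEP_ATTRS and v)
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def clean_html(html: str, max_chars: int = 8000) -> str:
    """Strip noise from a page and truncate it to a rough size budget."""
    cleaner = _Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)[:max_chars]


def run_with_feedback(code: str, scope: dict) -> str:
    """Run generated code; return "OK" or an error line to feed back to the model."""
    try:
        # Caution: exec runs arbitrary code; a real setup should sandbox this.
        exec(code, scope)
        return "OK"
    except Exception:
        # The last traceback line names the exception type and message.
        return "Error: " + traceback.format_exc().strip().splitlines()[-1]
```

The string returned by `run_with_feedback` is what would be appended to the prompt as the next observation, so that the model sees its own error on the following cycle.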