Amazon Surpasses OpenAI's ChatGPT with Multimodal-CoT Generative AI

Recent advancements in AI technology have allowed large language models (LLMs) to perform well in complex reasoning tasks through chain-of-thought (CoT) prompting.

Junja Choudhary
Junja Choudhary Official | Verified Expert • 13 Apr, 2026 • Editorial Desk
February 13, 2023 • 2:24 PM

OpenAI's ChatGPT has been making waves in the AI community for the past two months, prompting discussion of its potential impact on fields such as business and education. Tech giants Google and Baidu have since entered the chatbot scene with their own generative AI technologies. Now Amazon has joined the race with a new language model that outperforms OpenAI's GPT-3.5 on the ScienceQA benchmark by 16 percentage points and even surpasses human performance on that benchmark.

Recent advancements in AI technology have allowed large language models (LLMs) to perform well in complex reasoning tasks through chain-of-thought (CoT) prompting, in which the model writes out intermediate reasoning steps before giving its answer. However, existing CoT research has focused mainly on the language modality. A multimodal-CoT paradigm instead applies CoT reasoning across multiple inputs, such as language and vision.

Most existing approaches to multimodal CoT convert the inputs into a single modality, for example by turning an image into text, before asking an LLM to perform CoT. This conversion can lose information and lead to hallucinated reasoning. To overcome these limitations, Amazon researchers have developed Multimodal-CoT, which incorporates visual features in a framework whose two stages are trained separately. The framework divides the reasoning process into two parts: generating a rationale and inferring the answer. By incorporating vision in both stages, the model is able to produce more convincing rationales and draw more accurate conclusions.
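The two-stage split can be illustrated with a minimal Python sketch. It is not Amazon's implementation: the names `multimodal_cot`, `toy_rationale_model`, and `toy_answer_model` are hypothetical, and the toy functions stand in for trained models only so that the example runs. What it shows is the data flow: vision is supplied to both stages, and the rationale from the first stage is appended to the language input of the second.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical types: a "model" is any callable that maps
# (text, vision_features) to generated text. Real models would be
# trained encoder-decoder networks; these names are illustrative only.
VisionFeatures = Sequence[float]
Model = Callable[[str, VisionFeatures], str]


@dataclass
class TwoStageResult:
    rationale: str
    answer: str


def multimodal_cot(question: str, vision: VisionFeatures,
                   rationale_model: Model, answer_model: Model) -> TwoStageResult:
    # Stage 1: generate a rationale from the language and vision inputs.
    rationale = rationale_model(question, vision)
    # Stage 2: infer the answer. The rationale is appended to the
    # language input, and the same vision features are supplied again.
    answer = answer_model(f"{question}\n{rationale}", vision)
    return TwoStageResult(rationale, answer)


# Toy stand-ins (assumptions, not real models) so the sketch runs end to end.
def toy_rationale_model(text: str, vision: VisionFeatures) -> str:
    return f"Rationale: the image has {len(vision)} features."


def toy_answer_model(text: str, vision: VisionFeatures) -> str:
    return "Answer: (A)" if "Rationale:" in text else "Answer: unknown"


result = multimodal_cot("Which object is magnetic?", [0.1, 0.7, 0.2],
                        toy_rationale_model, toy_answer_model)
print(result.rationale)
print(result.answer)
```

Keeping the two stages as separate callables mirrors the separation described above: each stage can be trained and swapped independently.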

The rationale generation and answer inference stages of Multimodal-CoT share the same model architecture but differ in their inputs and outputs. In the rationale generation stage, the model is fed data from both the visual and language domains. In the answer inference stage, the generated rationale is appended to the original language input, and the model again receives the visual input. In each stage, a Transformer encoder produces the textual representation, which is combined with the visual representation, and the fused result is fed into the Transformer decoder.
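The article does not say how the textual and visual representations are combined, so the following is only a sketch of one plausible scheme: each text token attends over the image patch vectors, and the attended vector is blended with the token through a gate. The function names `attend` and `fuse`, the fixed scalar `gate`, and the toy vectors are all assumptions made for illustration; a trained model would learn the gate and use far larger representations.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(text_tokens: List[Vector], vision_patches: List[Vector]) -> List[Vector]:
    """For each text token, return an attention-weighted average of vision patches."""
    d = len(text_tokens[0])
    attended = []
    for tok in text_tokens:
        weights = softmax([dot(tok, p) / math.sqrt(d) for p in vision_patches])
        attended.append([sum(w * p[i] for w, p in zip(weights, vision_patches))
                         for i in range(d)])
    return attended


def fuse(text_tokens: List[Vector], vision_patches: List[Vector],
         gate: float = 0.5) -> List[Vector]:
    """Blend each text token with its attended vision vector.

    `gate` is a fixed scalar here (an assumption); 0.0 keeps only the text,
    1.0 keeps only the attended vision.
    """
    attended = attend(text_tokens, vision_patches)
    return [[(1 - gate) * t + gate * v for t, v in zip(tok, att)]
            for tok, att in zip(text_tokens, attended)]


# Toy encoder outputs: two text tokens and three image patches, dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
vision = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused = fuse(text, vision)
print(len(fused), len(fused[0]))
```

The fused sequence has the same shape as the text representation, which is what allows it to be passed to a Transformer decoder in place of the text-only encoder output.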

Junja Choudhary

Junja Choudhary serves as the Editor-in-Chief of Sangri Today. A dynamic news personality and rigorous fact-checker, he brings more than 7 years of professional experience in print and digital media. His editorial leadership is defined by a strong commitment to journalistic ethics, truth-seeking, and delivering well-researched, balanced reporting on critical issues.