Microsoft Germany CTO Andreas Braun confirmed on March 9, 2023 that GPT-4 is coming within a week and that it will be multimodal. Multimodal AI means that it will be able to work with multiple kinds of input, like video, images and sound.
Multimodal Large Language Models
The big takeaway from the announcement is that GPT-4 is multimodal (SEJ predicted GPT-4 is multimodal in January 2023).
Modality refers to the type of input that (in this case) a large language model works with.
Multimodality can encompass text, speech, images and video.
GPT-3 and GPT-3.5 only operated in one modality, text.
According to the German news report, GPT-4 may be able to operate in at least four modalities: images, sound (auditory), text and video.
Dr. Andreas Braun, CTO of Microsoft Germany, is quoted:
“We will introduce GPT-4 next week, there we will have multimodal models that will offer completely different possibilities – for example videos…”
The reporting lacked specifics for GPT-4, so it’s unclear if what was shared about multimodality was specific to GPT-4 or just in general.
Microsoft Director of Business Strategy Holger Kenn explained multimodality, but the reporting was unclear about whether he was referencing GPT-4 multimodality or multimodality in general.
I believe his references to multimodality were specific to GPT-4.
The news report shared:
“Kenn explained what multimodal AI is about, which can translate text not only accordingly into images, but also into music and video.”
Another interesting fact is that Microsoft is working on “confidence metrics” in order to ground their AI with facts to make it more reliable.
Something that apparently was underreported in the United States is that Microsoft released a multimodal language model called Kosmos-1 at the beginning of March 2023.
According to reporting by the German news site Heise.de:
“…the team subjected the pre-trained model to various tests, with good results in classifying images, answering questions about image content, automated labeling of images, optical text recognition and speech generation tasks.
…Visual reasoning, i.e. drawing conclusions about images without using language as an intermediate step, seems to be a key here…”
Kosmos-1 is a multimodal model that integrates the modalities of text and images.
GPT-4 goes further than Kosmos-1 because it adds a third modality, video, and also appears to include the modality of sound.
Works Across Multiple Languages
GPT-4 appears to work across all languages. It’s described as being able to receive a question in German and answer in Italian.
That’s kind of a strange example, because who would ask a question in German and want to receive an answer in Italian?
This is what was confirmed:
“…the technology has come so far that it basically “works in all languages”: You can ask a question in German and get an answer in Italian.
With multimodality, Microsoft(-OpenAI) will ‘make the models comprehensive’.”
I believe the point of the breakthrough is that the model transcends language with its ability to pull knowledge across different languages. So if the answer is in Italian, it will know it and be able to provide the answer in the language in which the question was asked.
That would make it similar to the goal of Google’s multimodal AI called MUM. MUM is said to be able to provide answers in English for which the data only exists in another language, like Japanese.
There is no current announcement of where GPT-4 will show up. But Azure-OpenAI was specifically mentioned.
Google is struggling to catch up to Microsoft by integrating a competing technology into its own search engine. This development further exacerbates the perception that Google is falling behind and lacks leadership in consumer-facing AI.
Google already integrates AI in multiple products such as Google Lens, Google Maps and other areas where consumers interact with Google.
It’s just that the way Microsoft is implementing it is more visible.
Read the original German reporting here:
Featured image by Shutterstock/Master1305