The AI world was genuinely shaken by the arrival of DeepSeek models. The company has challenged the way models are trained, the costs of the process, and the tech giants themselves. OK, cool – but how good is the multimodal Janus Pro 7B actually when compared to the leading GPT-4o?
TL;DR – take me straight to the comparison
DeepSeek is a Chinese company and research lab founded by Liang Wenfeng, who serves as its CEO. The company is owned and funded by the Chinese hedge fund High-Flyer and aims to deliver new, better AI models as open source. The lab was founded to support High-Flyer's data- and machine-learning-heavy operations.
DeepSeek made international headlines on 10 January 2025 when the company released the DeepSeek-R1 chat model to the public. It is available in a web app comparable to ChatGPT, in mobile apps on iOS and Android, and on GitHub, where the code was released under the MIT license, making it one of the most powerful open-source language models available. Shortly afterwards, the multimodal DeepSeek Janus series was published.
What makes it unique – DeepSeek’s AI Model and Training Process
According to company statements, backed by benchmarks, the models released by DeepSeek either come close to or outperform today's state-of-the-art solutions, including OpenAI's GPT family.
What makes the DeepSeek LLM series truly different is their cost. According to the company (cited by TechTarget), the entire training process of the DeepSeek model cost $6 million, compared to $100 million for OpenAI's GPT-4. The model is also said to consume no more than one-tenth of the energy that GPT-class large language models of comparable size need, all thanks to an optimized training strategy.
The arrival of DeepSeek models – cheaper, open source, and equal in performance – caused turmoil in the stock markets. For example, Nvidia, one of the AI market's biggest winners and a manufacturer of specialized AI chips, saw its stock plummet by 17%, wiping out roughly $600 billion in market value. DeepSeek quickly became the most downloaded free app on both Google Play and the App Store. The Chinese dragon proved to be a black swan.
DeepSeek Janus – (not only) Image Generation Capabilities
Days after releasing the R1 model, DeepSeek published the DeepSeek Janus models, including Janus Pro-1B and Janus Pro-7B, with one billion and seven billion parameters, respectively. Both models, unlike R1, are multimodal, meaning they can work with many different types of data, primarily images and graphics.
This architecture suits them to many tasks, including image manipulation and generation, image analysis, and captioning, among others. The available benchmarks show the models clearly outperforming the competition: the best score among non-DeepSeek models sits at 40.7 points, while Janus Pro-7B reaches 50.
DeepSeek Janus Pro 7B vs GPT-4o – the comparison
Multiple benchmarks show the superiority of DeepSeek Janus Pro 7B over existing models. The details can be found in the table below:

But those are just dry numbers. Inspired by the original paper's qualitative analysis, we wanted to get a feel for the model on a few representative samples and compare it with the solution dominating the market now, namely GPT-4o. This is by no means a formal evaluation, but we hope it illustrates the capabilities and limitations of the newest DeepSeek Janus Pro model.
Before the test – setting up the hardware
As mentioned above, one key aspect of DeepSeek models is their reduced hunger for computing power and electricity, so we ran Janus locally and contrasted it with GPT-4o running in the OpenAI cloud.
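For reference, querying GPT-4o with an image through the OpenAI API looks roughly like this (a minimal sketch; the prompt and image URL are placeholders, not the exact ones we used):

```python
# Minimal sketch of a GPT-4o multimodal query via the OpenAI Python SDK.
# The prompt and image URL below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this picture."},
                {"type": "image_url", "image_url": {"url": "https://example.com/krakow.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```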
The system specification used to test DeepSeek Janus:
- RTX 4090, 20GB VRAM
- CPU i9-14900K
- 64GB RAM
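Running Janus locally boils down to loading the checkpoint from Hugging Face. Below is a minimal sketch based on the quickstart in the deepseek-ai/Janus repository; it assumes the repository's `janus` package is installed:

```python
# Sketch based on the deepseek-ai/Janus repository quickstart; assumes
# the repository's `janus` package is installed (pip install -e .).
import torch
from transformers import AutoModelForCausalLM
from janus.models import VLChatProcessor

model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

# bfloat16 keeps the 7B model within a single consumer GPU's VRAM
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
```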
Our test aimed to provide a practical run that would surface more real-life, grounded differences between the two models, which we chose because both are fully multimodal. The tests, run with the system prompts and images shown below, included:
Image captioning

The image above shows Kraków, one of Poland's major cities, its historical capital, and a popular tourist destination. Multimodal models are trained on data composed of image-text pairs, so describing pictures should be the most natural task for them. And indeed, while both models perform well in this task, GPT-4o is significantly more detailed. It was able not only to describe the picture based on the information directly present in the frame but also to recognize the exact place and monuments the picture depicts.
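Continuing from the setup above, a captioning query in Janus' conversation format looks roughly like this (method and field names follow the repository's inference example and should be treated as assumptions; the image path is a placeholder):

```python
# Captioning query in Janus' conversation format, following the inference
# example from the deepseek-ai/Janus repository. Image path is a placeholder.
from janus.utils.io import load_pil_images

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nDescribe this picture.",
        "images": ["krakow.jpg"],
    },
    {"role": "Assistant", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# Embed image and text tokens, then decode the answer with the language model
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```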

Transcription

How about recognizing text in pictures (optical character recognition)? At least in easy cases, with good-quality scans, both models should perform decently. But GPT-4o is the unquestionable winner here.

In the provided sample, we can see that Janus made some minor mistakes: it repeated a phrase twice, didn't reproduce the punctuation of the text consistently, and in the last sentence transcribed "pet" instead of "pot." GPT-4o, on the other hand, reproduced the text exactly as written.
Interpreting a bar chart

Starting with something simple, we asked the models to interpret data presented in a bar chart and in a table. Both models do great with basic tasks. Unfortunately, as we increase the complexity, Janus starts failing to extract correct data from the pictures, while GPT-4o remains completely accurate. It's worth noting that Janus' reasoning remains sound, even when it is based on incorrectly extracted data.

Tracking diagrams
Both DeepSeek Janus and GPT-4o can correctly interpret a semi-complex diagram without any further context. The diagram illustrates a self-adaptive tracker we built in the past.

Prompt: What does that diagram show? Could you explain it to me?
DeepSeek Janus:
The diagram illustrates a cyclical process involving three main components: Learning, Tracking, and Detection. Here’s a breakdown of each part and their interactions:
1. Learning: This is the initial stage where the system processes fragments of trajectory data to train itself. The goal is to develop a model that can recognize patterns and make predictions based on this data.
2. Detection: After learning, the system uses the trained data to detect objects or events in the environment. This involves identifying and classifying the detected fragments of trajectory.
3. Tracking: Once detection is complete, the system performs re-initialization to track the objects or events. This involves maintaining the state of the detected entities over time updating their positions and states based on new data.
4. Feedback Loop: The process is cyclic, as the output of detection becomes the input for learning, and the output of tracking becomes the input for detection. This continuous feedback loop ensures that the system improves over time as it processes more data.
This diagram is often used in applications like autonomous driving, robotics, or surveillance systems where continuous learning and adaptation are crucial for effective performance.
GPT-4o:
The diagram depicts a cyclical process that involves three interconnected stages: Learning, Detection, and Tracking. This is a common representation in tasks involving object tracking and machine learning. Here’s a detailed explanation of the flow:
- Learning:
- This stage involves using data to train a model or system.
- The model learns to recognize patterns or features in the data (e.g., objects in a video).
- Outputs: “Training data” generated from this stage is passed to the Detection stage to improve object identification.
- Detection:
- In this stage, objects of interest are detected in the input data (e.g., video frames).
- Outputs: “Detections” are identified objects, which are passed to the Tracking stage to follow the motion or trajectory of these objects.
- Tracking:
- Here, the detected objects are tracked across frames or time.
- Fragments of an object’s trajectory are collected and passed back to the Learning stage for refinement and to improve future detection accuracy.
- Additionally, if tracking fails or becomes uncertain, a process called re-initialization is triggered to restart tracking based on new detections.
Summary:
This diagram represents a feedback loop used in dynamic systems where:
- Learning continuously refines the detection system.
- Detection identifies objects in data.
- Tracking follows objects over time, contributing new information to learning.
Such a process is commonly found in applications like visual object tracking, autonomous vehicles, or real-time surveillance systems.

Company network

Analyzing a company graph is more complex and shows how well the models interpret graph connections. Here, we expected Martha and Anne to be the ones coordinating communication between the Operations and Customer Service departments. Both models included Thomas in their final answers, although GPT-4o mentioned that Martha and Anne link those two departments. As the question was open to interpretation, we gave the models a chance at something more precise. Unfortunately, even then, both models provided incorrect answers.

Prompt: Based on that diagram, what persons are required to coordinate a task between the Operations and Customer Service departments?
GPT-4o:
Based on the diagram, to coordinate a task between the Operations and Customer Service departments, the communication path must involve Thomas, Martha, and Anne:
The Operations Department does not have a direct link to Customer Service.
Thomas acts as a central connector, linking various teams.
Martha and Anne are positioned between Thomas and both Finance and Customer Service.
Since Finance is linked to Operations, Martha and Anne serve as intermediaries to transfer information from Operations to Customer Service.
Thus, Thomas, Martha, and Anne must be involved in coordinating a task between Operations and Customer Service.
DeepSeek Janus:
To coordinate a task between the Operations and Customer Service departments, the person who would need to be involved is likely Thomas, as he is the one who is connected by an arrow from the Operations department to the Customer Service department, indicating a role that bridges these two departments.
Prompt: What departments is Anne directly connected with?
GPT-4o: Anne is directly connected with the following departments:
Finance – She has a direct link to this department.
Customer Service – She is directly linked to this department.
Thomas (Project Team/Coordination Role) – While not a department, Thomas acts as a key connector in the diagram, linking multiple departments.
Anne is not directly connected to Operations, Suppliers, Sales & Marketing, or Customers, but she can communicate with them indirectly through Thomas or Martha.
DeepSeek Janus: Anne is directly connected with the following departments:
1. Operations
2. Finance
3. Customer Service
PC parts

We asked the models about computer parts in boxes stacked on a table. This task requires background knowledge about those parts, i.e., what a motherboard is, what a CPU is, etc., as well as the ability to parse the given image. Here, GPT-4o is far superior to Janus, which is often able to provide an acceptable general answer but hallucinates the details, as in the case of the Ryzen CPU, where it specified an exact model despite the lack of such information in the picture. And it was utterly wrong when it came to GPUs. It looks like Radeons are not mainstream enough for Janus.

Bike malfunction recognition

An interesting feature of multimodal models is their promising ability to reason about what casual pictures present. Unfortunately, neither of the models was capable of recognizing a flat tire, with GPT-4o trying to name missing parts even though they were present in the picture, just on the other side of the bike.

Prompt: What’s wrong with that bicycle?
GPT-4o:
The bicycle in the image has a noticeable issue: the chain is missing. Without the chain, the bike cannot transfer power from the pedals to the rear wheel, making it non-functional. Additionally, the rear derailleur appears absent, suggesting that the bike may need repairs or reassembly before it can be ridden properly.
DeepSeek Janus:
The bicycle in the image appears to be leaning against a wooden frame or structure. It seems to be in a garage or similar indoor space. There are no visible issues with the bicycle itself, such as broken parts or damage, but it’s not in its usual upright position. It’s possible that it’s being stored or temporarily placed there.

We tried a more obvious version as well. In that case, both models were able to identify that the bicycle was broken.

Prompt: What’s wrong with that bicycle?
GPT-4o:
The bicycle in the image has suffered severe structural damage, likely from a crash or impact. Here are the key issues:
- Broken Frame – The frame is visibly snapped in multiple places, making the bike completely unusable.
- Bent or Misaligned Components – The handlebars, rear wheel, and derailleur area appear twisted or damaged.
- Detached Parts – Some components, like the rear triangle, seem completely separated from the main frame.
- Scattered Accessories – A water bottle is lying on the grass, indicating the impact was forceful enough to dislodge items.
This level of damage suggests a high-speed crash or a significant impact, making the bike beyond repair for practical use. The frame will need replacement, and many components may also need inspection or replacement.
DeepSeek Janus:
The bicycle in the image appears to be damaged. The front wheel is bent out of shape, which indicates that it has likely been involved in an accident or collision. Additionally, the bike is lying on the grass, suggesting it may have been left there after the incident.
While both identified the bicycle's compromised state, only GPT-4o provided an accurate description of what exactly was wrong.
Spatial relationship identification

An interesting aspect is how these models perceive depth and spatial relationships between objects. It looks like both models can accurately identify the objects closest to and farthest from the viewer. Still, GPT-4o provides more detail when describing these objects, which makes it more reliable and versatile in less obvious cases.

Prompt: What item is the closest to the viewer?
GPT-4o:
The white wireless keyboard is the closest item to the viewer. It is placed at the front of the desk, occupying a central position in the image. The keys and structure of the keyboard are clearly visible, making it the most prominent and nearest object in the scene.
Prompt: What item is the farthest to the viewer?
GPT-4o:
The Dell monitor displaying the Darktable photo editing software is the farthest item from the viewer. It is positioned at the back of the desk, elevated on a stand, and serves as the central focus of the workspace. The monitor is also accompanied by a Tux (Linux penguin) plush toy sitting on a laptop, which is also near the farthest point in the scene.
DeepSeek Janus:
The farthest item to the viewer is the monitor displaying the desktop screen.
Generative capabilities
Janus introduces a novel approach to unifying image understanding and text-to-image generation, decoupling the two into separate visual encoding pathways that share a single transformer backbone. This means it is capable of text-to-image generation as well, although that wasn't the focus of our post. Still, here are two samples:
- Our idea: A T-rex holding its companion cat by the paw walks across a desert towards the sun, but instead of the sun, there is a bright Death Star from Star Wars.
- One of the samples prepared by the authors: A glass of red wine on a reflective surface.
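For the curious, image generation works differently from understanding: Janus emits 576 discrete image tokens (a 24×24 grid for a 384×384 image) and decodes them with its VQ vision decoder. Below is a heavily condensed sketch of the sampling loop, adapted from generation_inference.py in the deepseek-ai/Janus repository; the method names and constants are taken from that script and should be treated as assumptions, not a verified implementation:

```python
# Heavily condensed sketch of Janus' text-to-image sampling with
# classifier-free guidance (CFG), adapted from the repository's
# generation_inference.py; names and constants are assumptions.
import torch
import numpy as np
from PIL import Image

prompt = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=[
        {"role": "User", "content": "A glass of red wine on a reflective surface."},
        {"role": "Assistant", "content": ""},
    ],
    sft_format=vl_chat_processor.sft_format,
    system_prompt="",
) + vl_chat_processor.image_start_tag

input_ids = torch.LongTensor(tokenizer.encode(prompt))
# Two copies of the prompt: a conditional branch and a padded
# "unconditional" branch for classifier-free guidance
tokens = input_ids.repeat(2, 1).cuda()
tokens[1, 1:-1] = vl_chat_processor.pad_id
inputs_embeds = vl_gpt.language_model.get_input_embeddings()(tokens)

cfg_weight, generated, past = 5.0, [], None
with torch.inference_mode():
    for _ in range(576):  # one discrete image token per step
        out = vl_gpt.language_model.model(
            inputs_embeds=inputs_embeds, use_cache=True, past_key_values=past
        )
        past = out.past_key_values
        logits = vl_gpt.gen_head(out.last_hidden_state[:, -1, :])
        cond, uncond = logits[0:1], logits[1:2]
        logits = uncond + cfg_weight * (cond - uncond)  # CFG mixing
        next_token = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        generated.append(next_token)
        # Feed the sampled image token back in on both CFG branches
        img_embeds = vl_gpt.prepare_gen_img_embeds(next_token.repeat(2, 1).view(-1))
        inputs_embeds = img_embeds.unsqueeze(1)

    codes = torch.cat(generated, dim=1).to(dtype=torch.int)
    pixels = vl_gpt.gen_vision_model.decode_code(codes, shape=[1, 8, 24, 24])

# Map the decoder output from [-1, 1] to 8-bit RGB and save
arr = pixels.float().clamp(-1, 1).add(1).mul(127.5).cpu().numpy()
Image.fromarray(arr[0].transpose(1, 2, 0).astype(np.uint8)).save("sample.png")
```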
Conclusion
DeepSeek's AI model is a powerful tool for both multimodal understanding and image generation, especially considering that it runs on commodity hardware as a local instance – something impossible for a model of GPT-4o's size. This is a significant advancement. Yet even in casual testing, it's evident that DeepSeek Janus lags behind commercially available solutions.
The tests revealed that the model performs very well overall, yet its outputs are riddled with minor but significant errors that may make it unsuitable or unusable in many applications. On the other hand, the quality may be just "good enough" for the many applications that don't require superhuman precision or reasoning.
Our tests have also shown that the model cannot deliver image generation on the level of competing solutions; specialized models handle text-to-image generation tasks better.
Also, there are matters that remain beyond the reach of supercomputer-spawned inhuman intellects – like spotting a flat tire.