• Author(s): Lital Binyamin, Yoad Tewel, Hilit Segev, Eran Hirsch, Royi Rassin, Gal Chechik

“Make It Count: Text-to-Image Generation with an Accurate Number of Objects” introduces an approach to text-to-image generation that ensures the generated image contains the number of objects specified in the prompt. This addresses a persistent weakness of existing models, which often fail to render the requested object count accurately.

The proposed method, Make It Count, enhances a text-to-image model's ability to produce exactly the number of objects mentioned in the input text. The authors introduce a training objective that explicitly ties the textual count to the number of objects in the generated image: a counting loss that penalizes the model whenever the generated count deviates from the requested one. To make this correspondence learnable, Make It Count uses a two-stage generation process. In the first stage, a standard text-to-image architecture produces an initial image from the prompt. In the second stage, the model iteratively refines this image, adjusting the object count until it matches the number specified in the text; the counting loss supplies the feedback signal that guides each refinement step.
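The summary above describes the pipeline only at a high level. The sketch below shows one plausible way the counting loss and the count-guided refinement stage could be wired together in PyTorch; it is not the authors' implementation. The names `counting_loss`, `refine_image`, and `differentiable_counter` are hypothetical, and the gradient-based latent update is just one way to realize the "iterative adjustment" described above.

```python
# Hypothetical sketch of a counting loss and a count-guided refinement stage.
# `differentiable_counter` stands in for whatever instance-detection / counting
# module the method uses; it is assumed to map latents (or an image) to a soft,
# differentiable estimate of the number of objects present.

import torch
import torch.nn.functional as F


def counting_loss(predicted_count: torch.Tensor, target_count: int) -> torch.Tensor:
    """Penalize deviation of the soft predicted count from the requested count."""
    target = torch.full_like(predicted_count, float(target_count))
    return F.mse_loss(predicted_count, target)


def refine_image(latents: torch.Tensor,
                 differentiable_counter,
                 target_count: int,
                 steps: int = 10,
                 lr: float = 0.05) -> torch.Tensor:
    """Second stage: iteratively nudge the latents of the initial generation
    until the counter's estimate matches the count requested in the prompt."""
    latents = latents.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([latents], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        predicted_count = differentiable_counter(latents)   # soft count estimate
        loss = counting_loss(predicted_count, target_count)
        loss.backward()      # gradient of the counting loss w.r.t. the latents
        optimizer.step()     # adjust the latents toward the correct count

    return latents.detach()
```

For a gradient signal to reach the latents at all, the counter must produce a differentiable (soft) count, for example by summing per-instance detection confidences rather than returning a hard integer.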

The paper provides extensive experiments demonstrating the effectiveness of Make It Count. The authors evaluate the approach on multiple datasets and compare it against state-of-the-art text-to-image models, showing that it consistently outperforms them at producing the correct number of objects while maintaining high visual quality. Qualitative examples further illustrate accurate counting across a range of object categories and scene types, including complex prompts, where the generated images remain visually coherent and faithful to the requested counts.

“Make It Count: Text-to-Image Generation with an Accurate Number of Objects” presents a meaningful advance in text-to-image generation. By combining a counting-based training objective with a two-stage generation process, it addresses a critical limitation of existing models and gives users precise control over object counts. This capability matters for applications that demand exact visual content, such as design, advertising, and visual storytelling.